# Kubernetes Production Checklist: EKS Edition

*A comprehensive checklist for running production-grade Kubernetes clusters on AWS EKS, covering security, networking, monitoring, and reliability.*
Moving applications to Kubernetes is exciting, but running a production-grade EKS cluster requires careful planning. After managing multiple production EKS clusters, here’s my comprehensive checklist.
## 🔐 Security
### IAM Roles and Service Accounts

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
```
Checklist:
- ✅ IRSA (IAM Roles for Service Accounts) configured
- ✅ Least privilege IAM policies
- ✅ No hardcoded credentials in pods
- ✅ Secrets encrypted at rest using KMS
- ✅ Pod Security Standards enforced
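Pod Security Standards can be enforced per namespace using the built-in admission controller labels. A minimal sketch (the `production` namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    # Also surface warnings at apply time
    pod-security.kubernetes.io/warn: restricted
```

Starting with `warn` before switching to `enforce` is a common way to find violations without breaking existing workloads.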
### Network Security

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-except-allowed
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: allowed-namespace
```
Checklist:
- ✅ Network policies in place
- ✅ Private EKS endpoints (or restricted public access)
- ✅ Security groups properly configured
- ✅ VPC CNI custom networking for pod IPs
- ✅ AWS PrivateLink for AWS services
### Image Security
Checklist:
- ✅ Container image scanning (Trivy, Snyk)
- ✅ Private ECR repositories with scanning enabled
- ✅ Image pull policies configured (`IfNotPresent` or `Always`)
- ✅ Non-root containers
- ✅ Read-only root filesystems where possible
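The last two items map directly to a container-level `securityContext`. A minimal sketch of what that might look like (the UID is illustrative):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
```

If the app needs a writable scratch directory, mount an `emptyDir` volume rather than relaxing `readOnlyRootFilesystem`.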
## 🌐 Networking
### VPC and Subnet Design

```hcl
# Terraform example
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  cidr = "10.0.0.0/16"

  # Separate subnets for EKS
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # Required tags for EKS
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"           = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/role/elb"                    = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}
```
Checklist:
- ✅ Sufficient IP addresses for pod scaling
- ✅ Multi-AZ setup for high availability
- ✅ VPC CNI properly configured
- ✅ CoreDNS optimized for cluster size
- ✅ Service mesh evaluated (Istio, Linkerd, AWS App Mesh)
### Ingress and Load Balancing

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    alb.ingress.kubernetes.io/ssl-redirect: '443'
spec:
  ingressClassName: alb
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```
Checklist:
- ✅ AWS Load Balancer Controller installed
- ✅ SSL/TLS certificates configured
- ✅ Health checks properly configured
- ✅ WAF integrated for security
- ✅ Request routing optimized
## 📊 Monitoring and Observability
### Metrics

```yaml
# Prometheus ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
```
Checklist:
- ✅ Prometheus Operator installed
- ✅ Grafana dashboards for cluster and apps
- ✅ CloudWatch Container Insights enabled
- ✅ Custom metrics exported
- ✅ SLO/SLI defined and tracked
### Logging
Checklist:
- ✅ Centralized logging (Fluent Bit + CloudWatch/S3)
- ✅ Application logs structured (JSON)
- ✅ Log retention policies defined
- ✅ Log aggregation and searching capability
- ✅ Audit logging enabled
### Alerting

Example Prometheus alert:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: app
      interval: 30s
      rules:
        - alert: HighPodMemory
          expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
          for: 5m
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage > 90%"
```
Checklist:
- ✅ Critical alerts configured
- ✅ Alert routing to appropriate channels
- ✅ Runbooks for common alerts
- ✅ Alert fatigue minimized
- ✅ On-call rotation defined
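Alert routing is typically handled by Alertmanager. A hedged sketch of a routing config, with hypothetical `default` and `pagerduty` receivers (channel and key are placeholders):

```yaml
route:
  receiver: default
  group_by: [alertname, namespace]
  routes:
    # Page only on critical severity; everything else goes to chat
    - matchers:
        - severity="critical"
      receiver: pagerduty
receivers:
  - name: default
    slack_configs:
      - channel: "#alerts"
  - name: pagerduty
    pagerduty_configs:
      - service_key: "<your-integration-key>"
```

Routing only `critical` alerts to paging channels is one of the simplest ways to keep alert fatigue down.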
## 🔄 Reliability and High Availability
### Node Groups

```hcl
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "main-node-group"

  # Required: node IAM role and subnets (names are illustrative)
  node_role_arn = aws_iam_role.eks_node.arn
  subnet_ids    = module.vpc.private_subnets

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 3
  }

  update_config {
    max_unavailable_percentage = 33
  }

  instance_types = ["t3.large"]

  labels = {
    role = "general"
  }
}
```
Checklist:
- ✅ Multi-AZ node distribution
- ✅ Cluster Autoscaler or Karpenter configured
- ✅ Pod Disruption Budgets defined
- ✅ Resource requests and limits set
- ✅ Node lifecycle management automated
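Pod Disruption Budgets keep a minimum number of replicas running during node drains and upgrades. A minimal sketch for the `my-app` workload used in the examples here:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Never allow voluntary disruptions to drop below 2 ready pods
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```

Without a PDB, a cluster autoscaler or a node group rollout can evict every replica of a deployment at once.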
### Application Resilience

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: app
          image: my-app:v1.0.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
Checklist:
- ✅ Multiple replicas for critical workloads
- ✅ Liveness and readiness probes configured
- ✅ Pod anti-affinity rules for HA
- ✅ Graceful shutdown handling
- ✅ Circuit breakers implemented
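Graceful shutdown usually means giving load balancers time to deregister the pod and letting in-flight requests drain before the process exits. One common pattern is a short `preStop` sleep plus a generous grace period (the values here are illustrative):

```yaml
spec:
  # Total time the kubelet waits before sending SIGKILL
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Keep serving briefly so endpoints/ALB targets are removed first
            command: ["sleep", "10"]
```

The application itself should still handle SIGTERM by finishing active requests and closing connections cleanly.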
## 💾 Backup and Disaster Recovery
Checklist:
- ✅ Velero installed for cluster backups
- ✅ Persistent volume snapshots automated
- ✅ Backup retention policy defined
- ✅ Disaster recovery plan documented
- ✅ Recovery procedures tested regularly
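Velero backups can be automated with a `Schedule` resource. A sketch, assuming Velero is installed in the `velero` namespace:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  # Cron expression: every day at 02:00
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"
    # Keep backups for 30 days
    ttl: 720h0m0s
```

Remember that a backup you have never restored is not a backup; run the restore path in a test cluster regularly.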
## 🚀 CI/CD Integration

```yaml
# GitLab CI example for EKS deployment
deploy:
  stage: deploy
  image: alpine/k8s:1.28.0
  script:
    - aws eks update-kubeconfig --region us-east-1 --name production-cluster
    - kubectl apply -f k8s/
    - kubectl rollout status deployment/my-app
  only:
    - main
```
Checklist:
- ✅ Automated deployments via GitOps (ArgoCD/Flux)
- ✅ Staging environment for testing
- ✅ Automated rollback on failure
- ✅ Blue-green or canary deployments
- ✅ Deployment approval process for production
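With ArgoCD, a deployment is declared as an `Application` pointing at a Git repo, and the controller keeps the cluster in sync. A sketch with hypothetical repo and path values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app.git
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift in the cluster
```

This inverts the GitLab CI push model above: instead of the pipeline running `kubectl apply`, the cluster pulls its desired state from Git.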
## 🔧 Operational Excellence

### Cost Optimization
Checklist:
- ✅ Right-sized node instances
- ✅ Spot instances for non-critical workloads
- ✅ Resource quotas and limits enforced
- ✅ Cost monitoring dashboards
- ✅ Unused resources regularly cleaned up
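Resource quotas are enforced per namespace. A sketch with illustrative limits for a hypothetical `team-a` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```

Quotas also act as a cost guardrail: a runaway deployment can only consume what its namespace has been granted.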
### Updates and Maintenance
Checklist:
- ✅ EKS version upgrade strategy defined
- ✅ Add-on updates automated
- ✅ Node AMI updates scheduled
- ✅ Maintenance windows communicated
- ✅ Rollback plan for failed upgrades
## 📚 Documentation
Checklist:
- ✅ Architecture diagrams up to date
- ✅ Runbooks for common operations
- ✅ Access control documented
- ✅ Onboarding guide for new team members
- ✅ Incident response procedures defined
## 🎯 Next Steps
Use this checklist as you build or audit your EKS cluster. Not everything needs to be implemented on day one, but having a roadmap ensures you’re moving toward a production-ready state.
### Priority Order

1. **Security** - Get the basics right
2. **Monitoring** - Know what's happening
3. **Reliability** - Keep things running
4. **Cost** - Optimize as you scale
5. **Automation** - Reduce toil
## Conclusion
Running Kubernetes in production is a journey, not a destination. This checklist covers the essentials, but your specific needs may vary. Start with security and observability, then build from there.
Need help with your EKS setup? Reach out on Twitter or LinkedIn!
