# Kubernetes Production Checklist: EKS Edition

*A comprehensive checklist for running production-grade Kubernetes clusters on AWS EKS, covering security, networking, monitoring, and reliability.*
Moving applications to Kubernetes is exciting, but running a production-grade EKS cluster requires careful planning. After managing multiple production EKS clusters, here’s my comprehensive checklist.
## 🔐 Security
### IAM Roles and Service Accounts

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
```
Checklist:
- ✅ IRSA (IAM Roles for Service Accounts) configured
- ✅ Least privilege IAM policies
- ✅ No hardcoded credentials in pods
- ✅ Secrets encrypted at rest using KMS
- ✅ Pod Security Standards enforced
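Pod Security Standards can be enforced per namespace using the built-in admission controller labels. A minimal sketch (the `production` namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    # Also surface warnings at apply time
    pod-security.kubernetes.io/warn: restricted
```

Starting with `warn` before switching to `enforce` is a common way to find violations without breaking existing workloads.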
### Network Security

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-except-allowed
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: allowed-namespace
```
Checklist:
- ✅ Network policies in place
- ✅ Private EKS endpoints (or restricted public access)
- ✅ Security groups properly configured
- ✅ VPC CNI custom networking for pod IPs
- ✅ AWS PrivateLink for AWS services
### Image Security
Checklist:
- ✅ Container image scanning (Trivy, Snyk)
- ✅ Private ECR repositories with scanning enabled
- ✅ Image pull policies configured (`IfNotPresent` or `Always`)
- ✅ Non-root containers
- ✅ Read-only root filesystems where possible
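The last two items map directly to a container-level `securityContext`. A minimal sketch of what that might look like (the UID is illustrative):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
```

If the app needs a writable scratch directory, mount an `emptyDir` volume rather than relaxing `readOnlyRootFilesystem`.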
## 🌐 Networking
### VPC and Subnet Design

```hcl
# Terraform example
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  cidr = "10.0.0.0/16"

  # Separate subnets for EKS
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # Required tags for EKS
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"           = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/role/elb"                    = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}
```
Checklist:
- ✅ Sufficient IP addresses for pod scaling
- ✅ Multi-AZ setup for high availability
- ✅ VPC CNI properly configured
- ✅ CoreDNS optimized for cluster size
- ✅ Service mesh evaluated (Istio, Linkerd, AWS App Mesh)
### Ingress and Load Balancing

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    alb.ingress.kubernetes.io/ssl-redirect: '443'
spec:
  ingressClassName: alb
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```
Checklist:
- ✅ AWS Load Balancer Controller installed
- ✅ SSL/TLS certificates configured
- ✅ Health checks properly configured
- ✅ WAF integrated for security
- ✅ Request routing optimized
## 📊 Monitoring and Observability
### Metrics

```yaml
# Prometheus ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
```
Checklist:
- ✅ Prometheus Operator installed
- ✅ Grafana dashboards for cluster and apps
- ✅ CloudWatch Container Insights enabled
- ✅ Custom metrics exported
- ✅ SLO/SLI defined and tracked
### Logging
Checklist:
- ✅ Centralized logging (Fluent Bit + CloudWatch/S3)
- ✅ Application logs structured (JSON)
- ✅ Log retention policies defined
- ✅ Log aggregation and searching capability
- ✅ Audit logging enabled
### Alerting

Example Prometheus alert:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: app
      interval: 30s
      rules:
        - alert: HighPodMemory
          expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
          for: 5m
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage > 90%"
```
Checklist:
- ✅ Critical alerts configured
- ✅ Alert routing to appropriate channels
- ✅ Runbooks for common alerts
- ✅ Alert fatigue minimized
- ✅ On-call rotation defined
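Alert routing is typically handled by Alertmanager. A hedged sketch of a routing config, with hypothetical `default` and `pagerduty` receivers (channel and key are placeholders):

```yaml
route:
  receiver: default
  group_by: [alertname, namespace]
  routes:
    # Page only on critical severity; everything else goes to chat
    - matchers:
        - severity="critical"
      receiver: pagerduty
receivers:
  - name: default
    slack_configs:
      - channel: "#alerts"
  - name: pagerduty
    pagerduty_configs:
      - service_key: "<your-integration-key>"
```

Routing only `critical` alerts to paging channels is one of the simplest ways to keep alert fatigue down.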
## 🔄 Reliability and High Availability
### Node Groups

```hcl
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "main-node-group"

  # Required: node IAM role and subnets (names are illustrative)
  node_role_arn = aws_iam_role.eks_node.arn
  subnet_ids    = module.vpc.private_subnets

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 3
  }

  update_config {
    max_unavailable_percentage = 33
  }

  instance_types = ["t3.large"]

  labels = {
    role = "general"
  }
}
```
Checklist:
- ✅ Multi-AZ node distribution
- ✅ Cluster Autoscaler or Karpenter configured
- ✅ Pod Disruption Budgets defined
- ✅ Resource requests and limits set
- ✅ Node lifecycle management automated
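Pod Disruption Budgets keep a minimum number of replicas running during node drains and upgrades. A minimal sketch for the `my-app` workload used in the examples here:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Never allow voluntary disruptions to drop below 2 ready pods
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```

Without a PDB, a cluster autoscaler or a node group rollout can evict every replica of a deployment at once.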
### Application Resilience

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: app
          image: my-app:v1.0.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
Checklist:
- ✅ Multiple replicas for critical workloads
- ✅ Liveness and readiness probes configured
- ✅ Pod anti-affinity rules for HA
- ✅ Graceful shutdown handling
- ✅ Circuit breakers implemented
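Graceful shutdown usually means giving load balancers time to deregister the pod and letting in-flight requests drain before the process exits. One common pattern is a short `preStop` sleep plus a generous grace period (the values here are illustrative):

```yaml
spec:
  # Total time the kubelet waits before sending SIGKILL
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Keep serving briefly so endpoints/ALB targets are removed first
            command: ["sleep", "10"]
```

The application itself should still handle SIGTERM by finishing active requests and closing connections cleanly.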
## 💾 Backup and Disaster Recovery
Checklist:
- ✅ Velero installed for cluster backups
- ✅ Persistent volume snapshots automated
- ✅ Backup retention policy defined
- ✅ Disaster recovery plan documented
- ✅ Recovery procedures tested regularly
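Velero backups can be automated with a `Schedule` resource. A sketch, assuming Velero is installed in the `velero` namespace:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  # Cron expression: every day at 02:00
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"
    # Keep backups for 30 days
    ttl: 720h0m0s
```

Remember that a backup you have never restored is not a backup; run the restore path in a test cluster regularly.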
## 🚀 CI/CD Integration

```yaml
# GitLab CI example for EKS deployment
deploy:
  stage: deploy
  image: alpine/k8s:1.28.0
  script:
    - aws eks update-kubeconfig --region us-east-1 --name production-cluster
    - kubectl apply -f k8s/
    - kubectl rollout status deployment/my-app
  only:
    - main
```
Checklist:
- ✅ Automated deployments via GitOps (ArgoCD/Flux)
- ✅ Staging environment for testing
- ✅ Automated rollback on failure
- ✅ Blue-green or canary deployments
- ✅ Deployment approval process for production
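With ArgoCD, a deployment is declared as an `Application` pointing at a Git repo, and the controller keeps the cluster in sync. A sketch with hypothetical repo and path values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app.git
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift in the cluster
```

This inverts the GitLab CI push model above: instead of the pipeline running `kubectl apply`, the cluster pulls its desired state from Git.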
## 🔧 Operational Excellence

### Cost Optimization
Checklist:
- ✅ Right-sized node instances
- ✅ Spot instances for non-critical workloads
- ✅ Resource quotas and limits enforced
- ✅ Cost monitoring dashboards
- ✅ Unused resources regularly cleaned up
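Resource quotas are enforced per namespace. A sketch with illustrative limits for a hypothetical `team-a` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```

Quotas also act as a cost guardrail: a runaway deployment can only consume what its namespace has been granted.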
### Updates and Maintenance
Checklist:
- ✅ EKS version upgrade strategy defined
- ✅ Add-on updates automated
- ✅ Node AMI updates scheduled
- ✅ Maintenance windows communicated
- ✅ Rollback plan for failed upgrades
## 📚 Documentation
Checklist:
- ✅ Architecture diagrams up to date
- ✅ Runbooks for common operations
- ✅ Access control documented
- ✅ Onboarding guide for new team members
- ✅ Incident response procedures defined
## 🎯 Next Steps
Use this checklist as you build or audit your EKS cluster. Not everything needs to be implemented on day one, but having a roadmap ensures you’re moving toward a production-ready state.
### Priority Order

1. **Security** - Get the basics right
2. **Monitoring** - Know what's happening
3. **Reliability** - Keep things running
4. **Cost** - Optimize as you scale
5. **Automation** - Reduce toil
## Conclusion
Running Kubernetes in production is a journey, not a destination. This checklist covers the essentials, but your specific needs may vary. Start with security and observability, then build from there.
Need help with your EKS setup? Reach out on Twitter or LinkedIn!
