Kubernetes Migration from Monolith to Microservices on EKS

The story is by now familiar: a monolith ships fine for years, the team grows, deploys slow down, the blast radius of any change becomes the whole product, and someone draws a diagram of microservices on a whiteboard. Six months later, the diagram has been implemented, the team has Kubernetes, and the deployment frequency is somehow worse than before. The pattern is common enough that “you don’t need microservices, you need a better monolith” is now a respectable architectural position.

Microservices on Kubernetes are not always wrong. Done carefully, they enable independent scaling, organizational autonomy, fault isolation, and language flexibility. Done carelessly, they trade one set of problems (a single big codebase) for a much harder set (distributed systems, network failures, distributed transactions, observability gaps). This post is about the careful version: the migration patterns that work, the platform decisions that matter on EKS specifically, and what to avoid.

When Microservices Make Sense

Before talking about how, talk about whether. The signal that microservices are the right answer is rarely “the codebase is big.” It is typically a combination of:

Independent scaling needs. One workload (ML inference, image processing, real-time messaging) has dramatically different resource or latency requirements than the rest.
Independent deployment needs. Teams are blocked on each other’s release cadences; deploys to one part of the system gate deploys to another.
Independent fault domains. A bug in one component should not bring down the rest of the product.
Organizational scale. More than ~50 engineers contributing to one codebase, with clear sub-product ownership boundaries.

The fix for “merges are slow” is better CI, modular code, and feature flags — not microservices. The fix for “we can’t scale the search subsystem independently” sometimes is.

EKS Specifically

Amazon EKS is a managed Kubernetes control plane. AWS runs the API server, etcd, the scheduler, and the controller manager; you manage worker nodes (or use Fargate), networking, and workloads. A few EKS-specific points worth internalizing:

VPC-native networking. Each pod gets a routable VPC IP via the AWS VPC CNI. Simplifies network policy and service-to-service connectivity; consumes IP space rapidly. Plan IP allocation per subnet — the most common day-2 pain on EKS is exhausted subnets.
IAM Roles for Service Accounts (IRSA) and Pod Identity. Pods authenticate to AWS APIs via short-lived credentials tied to a Kubernetes service account. Use Pod Identity (newer, simpler) or IRSA for any pod that needs AWS access. Never bake long-lived credentials into images.
Managed node groups vs. Karpenter. Managed node groups are simple and acceptable for steady-state workloads. Karpenter has become the common choice for production EKS: faster node provisioning, finer-grained instance selection, and built-in consolidation.
EKS add-ons. AWS manages versions of CoreDNS, kube-proxy, the VPC CNI, EBS CSI driver, and others through the add-ons API. Use it — it removes a class of upgrade-coordination problems.
Control plane upgrades. Kubernetes ships a new minor version roughly every four months, and EKS follows that cadence. Build the operational muscle to upgrade quickly; lagging more than two versions creates compatibility headaches.
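
As a rough illustration of the IRSA point, a service account can be bound to an IAM role through an annotation. The account ID, role name, namespace, and service account name below are placeholders, and the sketch assumes the cluster's OIDC provider and the role's trust policy are already set up.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders              # hypothetical service account name
  namespace: default
  annotations:
    # IRSA: pods running under this service account get short-lived
    # credentials for the role below (placeholder ARN).
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/orders-service
```

With Pod Identity the annotation is not needed; the link between the service account and the role is created as an association on the EKS side instead.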

The Strangler Fig Pattern

A clean rewrite of a working monolith into microservices is one of the highest-risk software projects you can undertake. Almost every successful migration uses the strangler fig pattern instead:

Put the monolith behind a routing layer (API gateway, service mesh ingress, or just a reverse proxy).
Identify a bounded context with a clear interface.
Extract it as a new service running alongside the monolith.
Route relevant traffic to the new service.
Delete the corresponding code from the monolith.
Repeat.
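
The routing step can be sketched with a plain Kubernetes Ingress. This is a minimal example, assuming an NGINX ingress class and hypothetical Service names (search for the extracted service, monolith for everything else); the host, path, and ports are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storefront                  # hypothetical name
spec:
  ingressClassName: nginx           # assumes an NGINX ingress controller is installed
  rules:
    - host: shop.example.com        # placeholder host
      http:
        paths:
          # Traffic for the extracted bounded context goes to the new service.
          - path: /search
            pathType: Prefix
            backend:
              service:
                name: search
                port: { number: 8080 }
          # Everything else still lands on the monolith.
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monolith
                port: { number: 8080 }
```

Each later extraction adds one more path rule, and rolling an extraction back means deleting that rule again.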

Two rules keep this from going sideways:

Extract a service only when its data is separable. A “user service” that still reads users from the monolith’s database is not a service; it is a function call dressed up as HTTP. Data ownership has to move with the service.
Never have two services writing to the same table. This is the cardinal sin of premature decomposition. Once two services own the same data, schema changes require coordination and consistency is forever broken.

Order the extractions by leverage. Start with services that have clear interfaces and homogeneous workloads (image processing, search, notifications). Save the gnarly transactional core for last; sometimes it stays in the monolith forever, and that’s fine.

Service Decomposition Boundaries

The right service boundary is a bounded context, not a database table or a UI screen. A working heuristic:

One team can own it. A service that requires three teams to coordinate on every change is too big.
One database. Shared databases are not microservices.
One deploy cadence. Services that always deploy together should be one service.
One SLO. Services with very different availability or latency requirements should be split.

Anti-patterns:

Distributed monolith. Many small services, deployed together, talking via synchronous HTTP, with shared schemas — all the cost of microservices, none of the benefits.
Nano-services. A service per function. Operational overhead crushes any architectural benefit.
Entity services. One service per database table (UserService, OrderService, ProductService). Hides nothing, multiplies complexity, requires distributed transactions for any real workflow.

Workload Manifests and Helm

Every service ships at least a Deployment, a Service, possibly an Ingress, a ServiceAccount, a PodDisruptionBudget, a HorizontalPodAutoscaler, and a NetworkPolicy. Writing those by hand for 30 services is how you end up with bugs.

The standard answer is Helm: a single chart per service or a shared library chart that all services consume.

apiVersion: apps/v1
kind: Deployment
metadata: { name: {{ .Values.name }} }
spec:
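  replicas: {{ .Values.replicas }}
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels: { app: {{ .Values.name }} }
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 25% }
  template:
    metadata:
      labels: { app: {{ .Values.name }} }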
    spec:
      serviceAccountName: {{ .Values.serviceAccount }}
      containers:
        - name: app
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          resources:
            requests: { cpu: {{ .Values.resources.cpu }}, memory: {{ .Values.resources.mem }} }
            limits:   { cpu: {{ .Values.resources.cpuLimit }}, memory: {{ .Values.resources.mem }} }
          readinessProbe:
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 30
          lifecycle:
            preStop:
              exec: { command: ["/bin/sh","-c","sleep 5"] }
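
A values file for one service might look like the following. The keys mirror the ones the template reads; every value (service name, registry path, tag, sizes) is illustrative rather than a recommendation.

```yaml
# values.yaml for a hypothetical "orders" service
name: orders
replicas: 3
serviceAccount: orders
image:
  repository: 111122223333.dkr.ecr.us-east-1.amazonaws.com/orders   # placeholder registry path
  tag: "1.4.2"                                                      # placeholder tag
resources:
  cpu: 250m        # CPU request
  mem: 512Mi       # used for both the memory request and the memory limit
  cpuLimit: 500m   # CPU limit
```

Because the template uses the same value for the memory request and limit, the pod's memory allocation is fixed at scheduling time and cannot grow beyond what the scheduler reserved for it.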

A few production defaults worth keeping:

maxUnavailable: 0 on rolling updates. Never reduce capacity during a deploy.
PodDisruptionBudgets for every workload. Node draining and cluster autoscaler need this signal; without it, voluntary disruptions can take down whole services.
Resource requests and limits. Requests for scheduling, limits to prevent runaway pods. Set both. Set them based on observed usage plus headroom, not guesses.
Distinct readiness and liveness probes. Readiness gates traffic; liveness restarts. They are not the same and conflating them causes outages.
preStop sleep. During a rolling deploy, in-flight requests need to drain. A 5-second preStop sleep gives the Service endpoints time to remove the pod from rotation before the SIGTERM hits the process.
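
The PodDisruptionBudget and autoscaler from the list above could look like this. It is a minimal sketch, assuming the pods carry an app: orders label and the Deployment is named orders; the replica bounds and CPU target are placeholders to be tuned from observed usage.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
spec:
  maxUnavailable: 1                 # node drains may evict at most one pod at a time
  selector:
    matchLabels: { app: orders }    # assumed pod label
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        # Utilization is measured against the CPU request, so requests must be set.
        target: { type: Utilization, averageUtilization: 70 }
```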

For cluster-wide concerns (cert-manager, ingress controllers, monitoring, logging), Helmfile or Argo CD app-of-apps patterns let you version and roll out platform components alongside applications.
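
An app-of-apps root in Argo CD is itself just an Application that points at a directory of other Application manifests. The sketch below assumes Argo CD is installed in the argocd namespace; the repository URL and path are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git   # hypothetical repo
    targetRevision: main
    path: apps            # directory holding one Application manifest per component
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true         # remove components that were deleted from Git
      selfHeal: true      # revert manual drift in the cluster
```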

Networking and Service Mesh

Service-to-service communication on EKS has three viable patterns:

Direct service calls via Kubernetes DNS. http://orders.default.svc.cluster.local. Simple, no extra infrastructure, no observability without application-level instrumentation.
API gateway / ingress controller (NGINX, AWS Load Balancer Controller, Envoy Gateway, Traefik). For north-south traffic. Adds TLS termination, auth, rate limiting at the edge.
Service mesh (Istio, Linkerd, AWS App Mesh — though App Mesh is being deprecated). For east-west traffic. Adds mTLS, traffic shifting, retries, circuit breaking, and golden-signal observability.

The honest take: a service mesh is overkill until you have 20+ services or hard mTLS / compliance requirements. Linkerd is the right choice when you decide you need one: it is the simplest, lightest, and most operationally well-behaved. Istio is more powerful and carries an order of magnitude more operational complexity.

Network policies — Kubernetes-native or via Cilium — are not optional in any serious environment. Default-deny ingress, allow only the actual traffic patterns, and log denied connections. This is one of the few security mechanisms in Kubernetes that actually contains compromise.
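
A default-deny baseline plus one explicit allow rule can be sketched as follows. The namespace, labels, and port are hypothetical, and the policies only take effect if something in the cluster enforces them (the VPC CNI with network policy support enabled, Cilium, or Calico). Logging of denied connections is configured in that enforcement layer, not in the NetworkPolicy resource.

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop                   # hypothetical namespace
spec:
  podSelector: {}                   # empty selector = all pods in the namespace
  policyTypes: [Ingress]
---
# Allow only the traffic pattern that actually exists: checkout -> orders on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-to-orders
  namespace: shop
spec:
  podSelector:
    matchLabels: { app: orders }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: checkout }
      ports:
        - protocol: TCP
          port: 8080
```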

Observability: The Distributed Tax

The hidden cost of microservices is that debugging gets harder. A request that used to be one stack trace is now spread across five services, each with its own logs, metrics, and traces.

The minimum production-grade observability stack on EKS:

Metrics. Prometheus (kube-prometheus-stack) or Amazon Managed Service for Prometheus + Grafana / Amazon Managed Grafana. RED metrics per service, USE metrics per node, standard Kubernetes dashboards.
Logs. Fluent Bit → CloudWatch Logs or → OpenSearch / Loki. Structured JSON logs, every line tagged with service, version, pod, node, and trace_id.
Traces. OpenTelemetry SDKs in every service, OTel Collector aggregating, Tempo / Jaeger / AWS X-Ray as the backend. Sample wisely (head-based at 1–10%, tail-based for errors and slow requests).
Dashboards and alerts. Per-service RED dashboards, per-cluster capacity dashboards, alerts on SLO burn rate rather than raw error rate.
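
The tail-based part of the sampling advice can be expressed in the OpenTelemetry Collector configuration. This is a sketch under several assumptions: the contrib distribution of the Collector (which ships the tail_sampling processor), a hypothetical Tempo endpoint inside the cluster, and thresholds picked purely for illustration.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s              # how long to buffer spans before deciding on a trace
    policies:
      - name: keep-errors           # always keep traces that contain an error
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow             # always keep traces slower than 1s (illustrative)
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline              # keep a small share of everything else
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp:
    endpoint: tempo.observability.svc.cluster.local:4317   # hypothetical backend address
    tls: { insecure: true }                                # in-cluster plaintext, for the sketch only

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Tail sampling only works if all spans of a trace reach the same Collector instance, which is worth planning for once the Collector is scaled beyond one replica.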

Without traces, post-incident analysis in a microservices architecture is largely guesswork. Instrument from day one of the migration, not after the third outage.

Database Decomposition

The hardest part of microservice extraction is data. A few patterns that work:

Database-per-service. The textbook answer. Works when bounded contexts have clean data ownership. Often requires data duplication and eventual consistency between services.
Dual-write during migration. During the strangler period, the monolith writes to its DB and the new service’s DB simultaneously. Risky — partial failures create divergence. Prefer the next pattern when possible.
CDC (Change Data Capture) replication. Debezium or AWS DMS streams writes from the monolith’s DB to the new service’s DB. The new service owns the read model; the monolith remains the writer until cutover, then the direction reverses.
Saga pattern for transactions that span services. Replaces distributed transactions with a sequence of local transactions and compensating actions. Operationally complex; implement only when needed.

Avoid distributed transactions (XA, 2PC) on application data. They exist, they sort of work, and they will lock up your services in ways that are hard to debug.

Deployment Patterns

Rolling deployments are the Kubernetes default and fine for most services. Two patterns are worth knowing for higher-stakes services:

Blue/Green. Deploy the new version alongside the old; flip the Service or Ingress to point at the new version's pods; keep the old ones running for instant rollback. Higher resource footprint, very low rollback risk.
Canary. Send a small percentage of traffic to the new version, monitor metrics, ramp up gradually. Requires either a service mesh, Flagger, or careful ingress configuration; pays off for high-traffic services.

Argo Rollouts and Flagger both implement canary and blue/green automation on top of Kubernetes. For a multi-service migration, standardizing on one is far better than every team rolling their own.
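
For a sense of what the standardized version looks like, here is a minimal Argo Rollouts canary that reuses an existing Deployment's pod template. The Deployment name, replica count, weights, and pause durations are assumptions. Without a traffic router (mesh or ingress integration), Argo Rollouts approximates the weights by the ratio of canary to stable pods.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders
spec:
  replicas: 10
  # Reuse the pod template of an existing Deployment instead of duplicating it.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders                    # hypothetical Deployment
  strategy:
    canary:
      steps:
        - setWeight: 10             # roughly 1 of 10 pods on the new version
        - pause: { duration: 10m }  # watch metrics before ramping
        - setWeight: 50
        - pause: { duration: 10m }
        # After the last step the rollout promotes the new version to 100%.
```

When a Rollout takes over through workloadRef, the referenced Deployment is typically scaled down so its pods are not running twice.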

Migration Pitfalls

A short list of patterns I’ve seen kill migrations:

Underestimating the platform investment. EKS, Argo CD, observability, secrets management, CI/CD, service mesh — adopting microservices means standing up a platform. Plan and staff for that or pick a managed PaaS instead.
Inconsistent service templates. Every team’s service has a slightly different shape. Two years in, no one can move between services. Enforce templates from day one.
Skipping the platform team. “Every team owns its own infra” works at 5 teams and falls apart at 20. A dedicated platform group that owns the cluster, networking, observability stack, and shared tooling is necessary for scale.
Treating Kubernetes as a magic substrate. It will not fix bad service boundaries, bad data architecture, or bad observability. It will faithfully run whatever you build, including the bad parts.
Premature feature work on Kubernetes. Spend the first three months building the platform foundations (CI, observability, secrets, networking, deploy pipeline) before doing serious workload work. Skipping this creates technical debt that becomes load-bearing.

Closing

Microservices on EKS are a serious engineering investment. The pay-off is real for organizations that have hit the limits of a single deployable unit; the cost is real for organizations that haven’t. When the migration makes sense, the patterns are well-known: strangler-fig extraction, bounded-context service boundaries, database ownership per service, Helm-based templated deployments, EKS-native networking with disciplined IP planning, observability instrumented from day one, and a platform team that owns the shared substrate. The mechanics are the easy part. The hard part is the discipline: refusing to extract a service until its data is separable, refusing to share databases across services, refusing to deploy without observability, refusing to ship the first 30 services with 30 different shapes. Get those right and EKS becomes a force multiplier. Get them wrong and you end up with a distributed monolith that costs three times as much to operate as the one it replaced.