Designing Secure AWS VPC Architectures for Production Systems
A VPC looks deceptively simple in a diagram: a few subnets, a route table or two, an internet gateway. In production it is the substrate of every security control you have. Get the topology right and you have a defensible system with explicit blast radii. Get it wrong and you have a flat network with a thousand IAM policies trying to compensate for the fact that any compromised pod can reach any database in the account.
This post is about what a defensible AWS VPC architecture looks like, how the primitives compose, where the common misconfigurations live, and how to think about isolation in a real production environment.
The Core Primitives
Every VPC design works with the same handful of primitives:
- VPC. A logically isolated network in a region, defined by a CIDR block (e.g., 10.0.0.0/16).
- Subnets. A range within the VPC CIDR, tied to a single Availability Zone. Subnets are public or private based on routing, not on any intrinsic property.
- Route tables. Map destination CIDRs to targets (Internet Gateway, NAT Gateway, VPC peering, Transit Gateway, etc.). Each subnet associates with one route table.
- Internet Gateway (IGW). Enables routing to and from the public internet for resources with public IPs.
- NAT Gateway. Allows outbound internet from private subnets, with no inbound connectivity. Redundant within its own AZ but not across AZs, so deploy one per AZ.
- Security Groups. Stateful, instance/ENI-level allow-lists. The primary network control for workloads.
- Network ACLs (NACLs). Stateless, subnet-level allow-and-deny lists. Coarser than security groups; rarely the right primary control but useful for defense in depth.
- VPC Endpoints. Private connectivity to AWS services (S3, DynamoDB, ECR, etc.) without traversing the internet or NAT.
- VPC Flow Logs. Records of accepted and rejected traffic at the ENI/subnet/VPC level.
The trick is composing them so the network enforces what the architecture intends.
The Three-Tier Subnet Model
The canonical defensible topology in AWS has three subnet classes per AZ, with a fourth optional one:
- Public subnets. Hold only resources that must be reachable from the internet: typically load balancers, NAT gateways, and bastion hosts. Route table sends 0.0.0.0/0 to the IGW.
- Private subnets (application). Hold application workloads (EC2 instances, EKS pods, ECS tasks). No direct internet ingress; outbound via NAT Gateway. Route table sends 0.0.0.0/0 to the NAT.
- Private subnets (data). Hold databases, caches, internal data services. No internet routing at all, neither inbound nor outbound. The route table contains only the VPC local route plus VPC Endpoint routes.
- Optional: management subnets. For tooling, admin access, and operational services that need broader access patterns.
The data subnets having no internet routing at all is the most important property in this picture. A compromised database instance cannot exfiltrate data via outbound internet because it has no route. Route table mistakes are a common cause of “private” databases that turn out to have an internet path: a data subnet left on the VPC’s main route table, or associated with a table that carries a default route to a NAT Gateway or IGW.
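The three route tables described above can be sketched in Terraform roughly as follows. This is a minimal single-AZ sketch, not a complete module: the variable names are illustrative assumptions, the subnets are assumed to exist already, and `domain = "vpc"` on the Elastic IP assumes AWS provider v5 or later.

```hcl
# Assumed inputs (hypothetical names): an existing VPC and one subnet per tier.
variable "vpc_id" { type = string }
variable "public_subnet_id" { type = string }
variable "app_subnet_id" { type = string }
variable "data_subnet_id" { type = string }

resource "aws_internet_gateway" "main" {
  vpc_id = var.vpc_id
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "az_a" {
  allocation_id = aws_eip.nat.id
  subnet_id     = var.public_subnet_id
  depends_on    = [aws_internet_gateway.main]
}

# Public tier: default route to the Internet Gateway.
resource "aws_route_table" "public" {
  vpc_id = var.vpc_id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# Application tier: default route to the NAT Gateway (outbound only).
resource "aws_route_table" "app" {
  vpc_id = var.vpc_id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.az_a.id
  }
}

# Data tier: no routes declared, so only the implicit local route applies.
resource "aws_route_table" "data" {
  vpc_id = var.vpc_id
}

# Explicit associations, so no subnet falls back to the VPC's main route table.
resource "aws_route_table_association" "public" {
  subnet_id      = var.public_subnet_id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "app" {
  subnet_id      = var.app_subnet_id
  route_table_id = aws_route_table.app.id
}

resource "aws_route_table_association" "data" {
  subnet_id      = var.data_subnet_id
  route_table_id = aws_route_table.data.id
}
```

The explicit association for the data subnet is the part that matters most here: a subnet with no association silently inherits the main route table, whatever that table happens to contain.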
CIDR Planning
CIDR planning is one of those decisions that is technically reversible but operationally painful. A /16 VPC gives you 65,536 addresses; the EKS VPC CNI assigns IPs to pods one-to-one, which exhausts that surprisingly fast.
A defensible approach:
- Use /16 for production VPCs by default. /20 saves nothing meaningful and constrains you.
- Allocate subnets in standard sizes. /20 for application subnets (4,096 addresses); /24 for public subnets (256, mostly for ALBs and NATs); /24 for data subnets. AWS reserves five addresses in every subnet, so the usable counts are 4,091 and 251.
- Three AZs minimum in production. Two-AZ designs work but lose more capacity during an AZ failure (half instead of a third).
- Plan for non-overlapping ranges across VPCs. Once you peer VPCs or attach them to Transit Gateway, overlapping CIDRs are unroutable. A regional plan (10.0.0.0/16 for prod-us-east-1, 10.1.0.0/16 for prod-us-west-2, 10.10.0.0/16 for staging-us-east-1) avoids this.
- Reserve space for secondary CIDRs. A VPC can have additional CIDR blocks attached. Helpful when EKS exhausts the primary range.
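As one way to make the subnet sizing above reproducible, Terraform's built-in `cidrsubnet` function can carve the ranges from the VPC CIDR. The layout below (application subnets first, then public, then data) and the AZ names are assumptions chosen for illustration; the block uses no provider, so it can be evaluated on its own.

```hcl
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "azs" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

locals {
  # /20 application subnets (4 extra bits):
  # 10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20
  app_cidrs = [for i, az in var.azs : cidrsubnet(var.vpc_cidr, 4, i)]

  # /24 public subnets (8 extra bits), placed after the application block:
  # 10.0.48.0/24, 10.0.49.0/24, 10.0.50.0/24
  public_cidrs = [for i, az in var.azs : cidrsubnet(var.vpc_cidr, 8, 48 + i)]

  # /24 data subnets:
  # 10.0.56.0/24, 10.0.57.0/24, 10.0.58.0/24
  data_cidrs = [for i, az in var.azs : cidrsubnet(var.vpc_cidr, 8, 56 + i)]
}

output "subnet_plan" {
  value = {
    app    = local.app_cidrs
    public = local.public_cidrs
    data   = local.data_cidrs
  }
}
```

Everything from 10.0.64.0 upward stays unallocated in this layout, which leaves room for additional tiers or larger application subnets later.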
For EKS specifically, the IP pressure is real. Two mitigations:
- Use a secondary CIDR (100.64.0.0/16, from the Carrier-Grade NAT range) for pod IPs. AWS VPC CNI supports custom networking that places pods in a separate subnet from nodes. Conserves primary CIDR for nodes and services.
- Prefix delegation assigns /28 blocks to ENIs instead of individual IPs. Increases per-node pod density.
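The network side of the secondary-CIDR mitigation might look like the sketch below. The /18 split per AZ and the AZ names are assumptions; the CNI side (custom networking and prefix delegation are switched on through the `AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG` and `ENABLE_PREFIX_DELEGATION` settings of the VPC CNI) is cluster configuration and is not shown.

```hcl
# Assumed input (hypothetical name): the existing production VPC.
variable "vpc_id" { type = string }

# Attach 100.64.0.0/16 as an additional IPv4 block on the VPC.
resource "aws_vpc_ipv4_cidr_block_association" "pods" {
  vpc_id     = var.vpc_id
  cidr_block = "100.64.0.0/16"
}

# One pod subnet per AZ, carved from the secondary block.
resource "aws_subnet" "pods" {
  for_each = {
    "us-east-1a" = "100.64.0.0/18"
    "us-east-1b" = "100.64.64.0/18"
    "us-east-1c" = "100.64.128.0/18"
  }

  # Referencing the association (not var.vpc_id) makes Terraform wait
  # until the secondary CIDR is attached before creating the subnets.
  vpc_id            = aws_vpc_ipv4_cidr_block_association.pods.vpc_id
  availability_zone = each.key
  cidr_block        = each.value
}
```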
Security Groups: The Real Firewall
Security groups are the application-level network policy in AWS. The principles that hold up:
- Default-deny. A security group with no inbound rules denies all inbound traffic. Add rules specifically.
- Reference security groups, not CIDRs, for internal traffic. Allow from sg-app is robust to instance churn; allow from 10.0.10.0/24 is fragile and overly broad.
- Egress rules matter too. The default egress rule (allow all to 0.0.0.0/0) is convenient and wrong. Restrict outbound to known destinations.
- Group security groups by tier, not by service. sg-data-tier for RDS access, sg-app-tier for application instances. Service-specific exceptions can be additional groups.
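    resource "aws_security_group" "rds" {
      name        = "rds-postgres"
      description = "Postgres access for the application tier"
      vpc_id      = aws_vpc.main.id

      # Allow Postgres only from members of the application security group.
      ingress {
        from_port       = 5432
        to_port         = 5432
        protocol        = "tcp"
        security_groups = [aws_security_group.app.id]
      }

      # No egress block: Terraform removes the AWS default allow-all egress
      # rule when it creates the group, so this group has no outbound rules.
    }

(Note the absence of an egress block. AWS adds an allow-all egress rule to every new security group, but Terraform removes that rule when it creates the group and only restores it if you declare it. An egress block with an empty cidr_blocks list does not express a deny; leaving egress out does. RDS doesn’t need outbound; deny it.)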
The security group quota matters at scale: 60 inbound and 60 outbound rules per group, 5 groups per ENI by default (raisable). Beyond about 20 services in one VPC, you start hitting these limits. Plan to consolidate rules or restructure tiers.
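To illustrate the "egress rules matter too" principle on the application side, here is a hedged sketch of an application-tier group whose only outbound paths are Postgres to the data tier and HTTPS to S3. The variable names are hypothetical, the region is assumed to be us-east-1, and the standalone `aws_vpc_security_group_egress_rule` resource assumes a reasonably recent AWS provider.

```hcl
# Assumed inputs (hypothetical names).
variable "vpc_id" { type = string }
variable "rds_sg_id" { type = string } # ID of the RDS security group

# AWS-managed prefix list covering S3's address ranges in the region.
data "aws_ec2_managed_prefix_list" "s3" {
  name = "com.amazonaws.us-east-1.s3"
}

resource "aws_security_group" "app" {
  name        = "app-tier"
  description = "Application tier workloads"
  vpc_id      = var.vpc_id
  # No inline rules: rules are managed by the separate resources below.
}

# Outbound Postgres, allowed only toward members of the RDS group.
resource "aws_vpc_security_group_egress_rule" "app_to_rds" {
  security_group_id            = aws_security_group.app.id
  description                  = "Postgres to the data tier"
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
  referenced_security_group_id = var.rds_sg_id
}

# Outbound HTTPS, allowed only toward S3 via its prefix list.
resource "aws_vpc_security_group_egress_rule" "app_to_s3" {
  security_group_id = aws_security_group.app.id
  description       = "HTTPS to S3"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  prefix_list_id    = data.aws_ec2_managed_prefix_list.s3.id
}
```

A real application tier will need more destinations than these two (interface endpoints, internal services); the point of the sketch is that each one is named rather than covered by 0.0.0.0/0.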
NACLs: Defense in Depth, Not Primary Control
NACLs are stateless: you must explicitly allow both directions of every flow, including return traffic on ephemeral ports. They are coarser than security groups and have lower per-rule expressiveness.
The right use: defense in depth at the subnet boundary, with a small number of broad rules. Block all ingress from 0.0.0.0/0 on data subnets. Deny known-bad CIDR ranges (Spamhaus DROP list, etc.) at the public subnet boundary. Block egress to internal RFC 1918 ranges from public subnets that shouldn’t reach the internal network.
NACLs as primary controls are a maintenance nightmare. They are not the right place to encode “service A can call service B.” Security groups are.
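A small data-subnet NACL in that spirit could be sketched as follows. The variable names and the choice of Postgres are assumptions; the sketch mainly shows what statelessness costs you, namely an explicit outbound rule for return traffic on ephemeral ports.

```hcl
# Assumed inputs (hypothetical names).
variable "vpc_id" { type = string }
variable "data_subnet_ids" { type = list(string) }

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

resource "aws_network_acl" "data" {
  vpc_id     = var.vpc_id
  subnet_ids = var.data_subnet_ids

  # Inbound: Postgres from inside the VPC only.
  ingress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = var.vpc_cidr
    from_port  = 5432
    to_port    = 5432
  }

  # Outbound: replies to clients' ephemeral ports. NACLs are stateless,
  # so return traffic is not allowed unless a rule says so.
  egress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = var.vpc_cidr
    from_port  = 1024
    to_port    = 65535
  }

  # Anything not matched above falls through to the implicit deny.
}
```

Two broad rules are about the right size for a NACL. Which application may call which database still belongs in security groups.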
VPC Endpoints: Why You Want Them
A VPC Endpoint provides private connectivity to an AWS service without traversing the internet. Two flavors:
- Gateway endpoints. Free. Supported for S3 and DynamoDB. A route table entry directs traffic for the service to the endpoint.
- Interface endpoints (PrivateLink). $0.01/hour per endpoint per AZ + data charges. Supported for most AWS services. An ENI is provisioned in your subnet; DNS resolves the service’s hostname to the ENI’s IP.
Reasons to use them:
- No NAT egress charges. S3 traffic through a Gateway endpoint costs nothing; through NAT it costs $0.045/GB. For an S3-heavy workload this is the cheapest optimization you can make.
- Tighter security posture. Endpoints can enforce VPC-scoped policies, such as “only this VPC can write to this S3 bucket.” With an endpoint in place, traffic to s3.us-east-1.amazonaws.com stays on the AWS network instead of crossing the internet.
- Compliance. Many compliance frameworks require AWS service traffic to traverse private networks.
A production VPC with serious workloads should have, at minimum, endpoints for: S3, DynamoDB, ECR (API + dkr), STS, EC2, SSM, Logs, Secrets Manager, KMS, and SQS.
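A compact way to declare part of that endpoint set is one gateway endpoint plus a `for_each` over interface endpoint services. This is a sketch under assumptions: the variable names are hypothetical, the region defaults to us-east-1, the service list is a subset of the one above, and private DNS requires DNS support and DNS hostnames to be enabled on the VPC.

```hcl
# Assumed inputs (hypothetical names).
variable "vpc_id" { type = string }
variable "private_route_table_ids" { type = list(string) }
variable "app_subnet_ids" { type = list(string) }
variable "endpoint_sg_id" { type = string } # should allow 443 from the workloads

variable "region" {
  type    = string
  default = "us-east-1"
}

# Gateway endpoint: adds S3 routes to the listed route tables.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoints: one ENI per listed subnet, resolved via private DNS.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset([
    "ecr.api", "ecr.dkr", "sts", "ssm", "logs", "secretsmanager", "kms", "sqs",
  ])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.app_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}
```

Each interface endpoint is billed per AZ, so the subnet list is also a cost decision.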
Centralized Egress: The Transit Gateway Pattern
In multi-VPC environments, every VPC having its own NAT Gateway is wasteful (NAT Gateways cost $0.045/hour each, plus data processing) and operationally fragmented. The standard pattern at scale:
- Inspection VPC with NAT Gateways, AWS Network Firewall, and centralized logging.
- Transit Gateway routes traffic from all spoke VPCs to the inspection VPC for outbound traffic.
- Default route in spoke VPCs points at the Transit Gateway.
This consolidates egress in one auditable place, allows uniform security policy (intrusion detection, allow-list filtering, threat blocking), and reduces NAT cost across many VPCs. The trade-off is added complexity and a critical dependency on the inspection VPC’s availability.
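The routing skeleton of this pattern might be sketched as below. It is deliberately incomplete and rests on several assumptions: the variable names are hypothetical, the Transit Gateway keeps its default route table, and the inspection VPC's own NAT Gateways, Network Firewall, and return routes are not shown.

```hcl
# Assumed inputs (hypothetical names).
variable "spoke_vpc_id" { type = string }
variable "spoke_subnet_ids" { type = list(string) }
variable "spoke_route_table_id" { type = string }
variable "inspection_vpc_id" { type = string }
variable "inspection_subnet_ids" { type = list(string) }

resource "aws_ec2_transit_gateway" "hub" {
  description = "Hub for centralized egress"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "spoke" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.spoke_vpc_id
  subnet_ids         = var.spoke_subnet_ids
}

resource "aws_ec2_transit_gateway_vpc_attachment" "inspection" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.inspection_vpc_id
  subnet_ids         = var.inspection_subnet_ids
}

# Spoke VPC: the default route points at the Transit Gateway, not a local NAT.
resource "aws_route" "spoke_default" {
  route_table_id         = var.spoke_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = aws_ec2_transit_gateway.hub.id
  depends_on             = [aws_ec2_transit_gateway_vpc_attachment.spoke]
}

# Transit Gateway: internet-bound traffic goes to the inspection VPC.
resource "aws_ec2_transit_gateway_route" "default_to_inspection" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway.hub.association_default_route_table_id
}
```

A production layout would usually replace the default Transit Gateway route table with separate spoke and inspection tables so that spokes cannot reach each other except through inspection.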
Bastions and Private Access
Direct SSH/RDP to production instances over the internet is a 2010 pattern. The current options:
- AWS Systems Manager Session Manager. No SSH port open, no bastion host. Sessions are recorded and IAM-authenticated. The right default for almost all administrative access.
- AWS Client VPN for engineers who need broader network access (e.g., to RDP into Windows hosts, query a private RDS from local tooling).
- Tailscale or similar mesh VPN as an alternative for organizations preferring vendor-neutral tooling.
A traditional bastion with an SSH key is rarely the right answer in 2026. Session Manager removes the bastion entirely and replaces it with IAM-controlled, audited access.
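On the instance side, Session Manager mostly comes down to an instance profile that carries the AWS-managed `AmazonSSMManagedInstanceCore` policy. The role and profile names below are illustrative assumptions.

```hcl
# Role that EC2 instances assume; the name is a hypothetical example.
resource "aws_iam_role" "ssm_instance" {
  name = "ec2-ssm-managed"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS-managed policy that lets the SSM agent register and open sessions.
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Attach this profile to instances that should accept sessions.
resource "aws_iam_instance_profile" "ssm_instance" {
  name = "ec2-ssm-managed"
  role = aws_iam_role.ssm_instance.name
}
```

With the profile attached and the SSM agent running, an engineer with the right IAM permissions can connect using `aws ssm start-session --target <instance-id>`, with no inbound port open. Instances in subnets without NAT also need the ssm and ssmmessages interface endpoints to reach the service.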
Flow Logs and Visibility
VPC Flow Logs record metadata about IP traffic (5-tuple, bytes, packets, accept/reject). They are necessary for security investigations, compliance audits, and “is anything talking to this database that shouldn’t be?” diagnostics.
Configuration that earns its keep:
- Capture ALL traffic (accept + reject), not just accepted.
- Destination: S3 for cost; query with Athena.
- Aggregation interval: 1 minute for production-grade visibility.
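- Custom format to include fields like tcp-flags, pkt-srcaddr, pkt-dstaddr (the original packet-level addresses, which differ from srcaddr/dstaddr when traffic passes through an intermediate such as a NAT Gateway), and flow-direction.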
For higher-resolution visibility, consider VPC Traffic Mirroring (mirror packet flows to an inspection ENI) or GuardDuty (AWS’s managed threat detection, which consumes flow logs and DNS logs).
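The flow log configuration listed above could be expressed roughly like this. The bucket variable is a hypothetical input, and the Parquet output with hourly partitions is an added assumption that tends to suit Athena rather than something the settings above require.

```hcl
# Assumed inputs (hypothetical names).
variable "vpc_id" { type = string }
variable "flow_log_bucket_arn" { type = string }

resource "aws_flow_log" "vpc" {
  vpc_id                   = var.vpc_id
  traffic_type             = "ALL" # accepted and rejected traffic
  log_destination_type     = "s3"
  log_destination          = var.flow_log_bucket_arn
  max_aggregation_interval = 60 # seconds

  # Custom format: the default fields plus the extras named above.
  # "$$" escapes Terraform interpolation so AWS receives a literal "${...}".
  log_format = join(" ", [
    "$${version}", "$${account-id}", "$${interface-id}",
    "$${srcaddr}", "$${dstaddr}", "$${srcport}", "$${dstport}",
    "$${protocol}", "$${packets}", "$${bytes}", "$${start}", "$${end}",
    "$${action}", "$${log-status}",
    "$${tcp-flags}", "$${pkt-srcaddr}", "$${pkt-dstaddr}", "$${flow-direction}",
  ])

  destination_options {
    file_format        = "parquet"
    per_hour_partition = true
  }
}
```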
Multi-Account and Multi-Region
A real production environment lives across multiple accounts (separation of duties, blast radius, billing) and often multiple regions (DR, latency, sovereignty).
The patterns that work:
- One VPC per account per region. Avoid sharing a single VPC across many accounts via Resource Access Manager unless you really need it; account boundaries are easier to reason about than RAM-shared resources.
- Transit Gateway for inter-VPC routing. Across accounts, across regions (TGW peering), as the connectivity backbone.
- Distinct CIDRs per VPC so peering and Transit Gateway routes are unambiguous.
- Centralized DNS via Route 53 Resolver, with private hosted zones associated with the relevant VPCs. Forwarding rules let on-prem networks resolve AWS-private names and vice versa.
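For the DNS piece, a forwarding rule that sends queries for an on-prem domain to an on-prem resolver might be sketched as follows. Everything specific here is a placeholder assumption: the domain `corp.example.com`, the target address, and the variable names. The outbound endpoint needs at least two subnets.

```hcl
# Assumed inputs (hypothetical names).
variable "vpc_id" { type = string }
variable "resolver_subnet_ids" { type = list(string) } # two or more subnets
variable "resolver_sg_id" { type = string }            # must allow DNS outbound

resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "outbound"
  direction          = "OUTBOUND"
  security_group_ids = [var.resolver_sg_id]

  # One resolver IP per subnet.
  dynamic "ip_address" {
    for_each = var.resolver_subnet_ids
    content {
      subnet_id = ip_address.value
    }
  }
}

# Forward queries for the on-prem domain to the on-prem DNS server.
resource "aws_route53_resolver_rule" "corp" {
  name                 = "corp-forward"
  domain_name          = "corp.example.com" # placeholder domain
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = "192.0.2.53" # placeholder address for the on-prem resolver
  }
}

resource "aws_route53_resolver_rule_association" "vpc" {
  resolver_rule_id = aws_route53_resolver_rule.corp.id
  vpc_id           = var.vpc_id
}
```

The reverse direction (on-prem resolving AWS-private names) uses an inbound endpoint instead and is not shown.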
Common Misconfigurations
A list of issues I’ve seen repeatedly in production VPC audits:
- Default VPC still in use. Every AWS account has a default VPC in each region with public default subnets, 0.0.0.0/0 open egress, and a predictable CIDR (172.31.0.0/16). Delete it; use only custom VPCs.
- 0.0.0.0/0 ingress on common ports. SSH (22), RDP (3389), database ports (3306, 5432, 27017). Even briefly. Use Session Manager and security groups referencing other groups.
- Public subnets being treated as private. A “private” RDS in a public subnet with Publicly Accessible = false is still on a subnet that routes to the IGW. The RDS itself isn’t reachable but the topology is wrong; move it.
- Single-AZ deployments. A “production” workload in one AZ is one AZ failure away from being down. Multi-AZ is the floor, not a goal.
- Forgotten Internet Gateways or NAT Gateways. Unused IGWs are inert; unused NATs cost $32/month each and are easy to leave behind.
- Security groups referencing the VPC’s own CIDR. Often appears when someone wanted “anywhere internal.” Allow specific groups instead — this is one of the largest sources of unintended lateral access.
- No VPC Flow Logs. Investigating a security event without flow logs is investigating with one eye closed.
Closing
A defensible AWS VPC architecture is built from a small number of well-understood primitives composed with explicit intent. The three-tier subnet model isolates traffic by function; security groups enforce service-level access; VPC Endpoints keep AWS service traffic private and cheap; NAT consolidation via Transit Gateway centralizes egress; Session Manager replaces bastions; Flow Logs and GuardDuty provide visibility. None of this is novel — these patterns have been the AWS reference architecture for years. What separates secure VPCs from insecure ones is the discipline: deliberate CIDR planning, default-deny security groups, no internet routing in data subnets, endpoints for AWS services, multi-AZ everywhere, and audit-grade visibility from day one. The network is the foundation every other security control sits on; getting it right once means a thousand subsequent decisions don’t have to be heroic.