Terragrunt Multi-Account Setup: 7 Days → 1 Hour AWS DR
Learn terragrunt multi-account setup for AWS disaster recovery. Reduce onboarding from 7 days to 1 hour with DRY configs and pilot-light DR architecture.
I designed and implemented this architecture in about 2 weeks, leveraging AI-powered development tools to accelerate the process.
What We Achieved with Terragrunt
We re-architected a per-client AWS platform from manual provisioning → Terraform → Terragrunt modularization to solve two practical constraints:
- onboarding new client environments was slow and inconsistent
- multi-region disaster recovery (DR) needed to be repeatable and testable
Infrastructure became standardized across clients, while application delivery (CI/CD + in-cluster deployments) remained client-specific.
💡 Key Result: Infra-only onboarding dropped from ~7 days to ~1 hour - measured as "from an empty client account (with bootstrap wiring in place) to a working infrastructure environment."
The Manual Infrastructure Problem
The original model: manual infrastructure
Before infrastructure-as-code (IaC), onboarding meant creating resources by hand, validating by checklist, and relying on tribal knowledge.
That was workable until we hit the combination of:
- multiple clients
- multiple environments per client
- a DR requirement with real RTO/RPO targets
What the business needed
- repeatable onboarding for new clients/environments
- a consistent baseline across environments
- multi-region DR that could be exercised regularly (not "hope it works")
Constraints that shaped the solution
- Small platform team: optimize for clarity and operational simplicity
- Client isolation: each client runs in their own AWS account
- Environment model: Dev/UAT/Prod/DR live in the same client account
- Region pair standardization: primary us-east-1, DR us-west-2 (parameterized per client)
- Cost efficiency: avoid heavy licensing fees for orchestration tooling
Why We Chose Terragrunt Over Terraform Cloud
We chose Terragrunt over Terraform Cloud or other paid orchestrators because it enabled a "write once, deploy everywhere" approach while remaining open source — reducing licensing overhead for a small team.
🎯 Why Terragrunt? Open-source orchestration that keeps modules DRY without the licensing cost of Terraform Cloud or Spacelift.
Stage 1 - Terraform: stop the variance
The first priority was to convert runbooks into code and reduce configuration drift. Terraform gave us repeatable deployments and plan safety checks.
Stage 2 - Terragrunt: make it a platform
Once Terraform existed, our biggest cost was duplication. Terragrunt became the orchestration layer that provided:
- DRY configuration: backend + provider logic defined once at the root
- Hierarchical config: settings cascade from Client → Environment → Region
- Predictable layout: the folder structure becomes documentation and state key structure
📐 Design Principle: The folder structure IS the documentation. If you can navigate the filesystem, you can understand the infrastructure.
High-level architecture
We used a pilot-light DR strategy: data is replicated continuously, while compute in the DR region is kept minimal until failover.
🔥 Pilot-Light Strategy: Keep data warm (always replicating), keep compute cold (spin up on failover). This balances cost with recovery speed.
Terragrunt Repository Structure Best Practices
We standardized the layout so day-2 operations were predictable:
Infra/
  terragrunt.hcl                # Root configuration (DRY logic)
  <client>/
    client.hcl                  # Client-specific variables
    <env>/
      env.hcl                   # Environment variables (dev/uat/prod/dr)
      <region>/
        region.hcl              # Region-specific variables
        <resource-type>/        # Networking, Compute, Queues, Storage...
          <resource>/           # vpc, eks, mq, s3, ecr...
            terragrunt.hcl      # Leaf configuration
modules/
  <resource-template>/          # Reusable Terraform module(s)
Why this layout worked
- state keys follow paths (no "mystery states")
- onboarding becomes "compose known building blocks" rather than "rebuild"
- troubleshooting starts from the filesystem (predictable, consistent)
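As a rough sketch of how settings could cascade from Client → Environment → Region, a root terragrunt.hcl might read the three hierarchy files and merge them. The `aws_region` local and the merged-inputs approach are assumptions for illustration, not the exact configuration used here.

```hcl
# Infra/terragrunt.hcl (root), illustrative sketch only
locals {
  # Each file is located by walking up from the leaf directory being applied.
  client = read_terragrunt_config(find_in_parent_folders("client.hcl"))
  env    = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  region = read_terragrunt_config(find_in_parent_folders("region.hcl"))
}

# Generate the provider block once instead of repeating it in every leaf.
# Assumes region.hcl defines: locals { aws_region = "..." }
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region.locals.aws_region}"
}
EOF
}

# Later maps win on key collisions: region overrides env, env overrides client.
inputs = merge(
  local.client.locals,
  local.env.locals,
  local.region.locals,
)
```

With this shape, adding a client means adding a folder and a client.hcl; nothing in the root changes.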
How Root Inheritance Keeps Terragrunt Configs DRY
One of the biggest wins was stripping duplicate boilerplate from leaf modules. We used find_in_parent_folders() to inherit configurations from the root.
Example: leaf module stays minimal (sanitized)
1) Inherit provider/backend config from root
2) Inherit common variables (account id, tags, region, naming)
3) Define module source + version pin
4) Only pass inputs that differ for this instance
✨ Less is More: Each leaf module contains only what's unique to that instance. Everything else is inherited.
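The four steps above could look roughly like this in a leaf file. The module repository URL, the version tag, and the `cidr_block` input are placeholders, not values from the real platform.

```hcl
# <client>/<env>/<region>/Networking/vpc/terragrunt.hcl (illustrative sketch)

# 1) Inherit provider/backend config from the root terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

# 2) Common variables (account id, tags, region, naming) arrive through the
#    root's inputs, so nothing needs to be declared for them here.

# 3) Module source with a version pin (URL and tag are placeholders)
terraform {
  source = "git::https://example.com/org/infra-modules.git//vpc?ref=v1.0.0"
}

# 4) Only the inputs that differ for this instance
inputs = {
  cidr_block = "10.20.0.0/16" # hypothetical variable name and value
}
```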
Solving Cross-Region Dependency Cycles
We relied on Terragrunt dependency blocks for ordering. This worked well — until multi-region replication surfaced real-world dependency cycles.
The RDS challenge (Global DB)
Aurora Global Database requires the primary to exist before the replica can be created. The replica often depends on shared foundations (e.g., VPC/security groups), so the sequence matters.
Solution: enforce a strict apply order:
- networking foundations (global)
- primary database (primary region)
- replica database (DR region)
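One way to encode that order is to let the replica unit declare the primary as a dependency, so Terragrunt sorts them when the whole tree is applied (for example with run-all). This is a sketch under assumptions: the folder names, the `global_cluster_id` output, and the input names are hypothetical.

```hcl
# <client>/dr/us-west-2/Databases/aurora-replica/terragrunt.hcl (illustrative sketch)
include "root" {
  path = find_in_parent_folders()
}

# Networking foundations in the DR region.
dependency "network" {
  config_path = "../../Networking/vpc"
}

# The primary cluster lives in another env/region folder of the same client.
dependency "primary_db" {
  config_path = "../../../../prod/us-east-1/Databases/aurora-primary"

  # Placeholder outputs let validate/plan run before the primary is applied.
  mock_outputs = {
    global_cluster_id = "mock-global-cluster"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id                    = dependency.network.outputs.vpc_id
  global_cluster_identifier = dependency.primary_db.outputs.global_cluster_id
}
```

Because the edges only point from replica to primary and from replica to network, the graph stays acyclic.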
The S3 CRR challenge
S3 Cross-Region Replication (CRR) can create a classic cycle:
- source + destination buckets must exist
- IAM policies must exist and reference buckets
- replication rule is applied to the source bucket and references roles/policies
If attempted "in one pass," Terraform can detect a cycle.
Solution: split S3 into layers:
- Physical layer: buckets (create first)
- Logical layer: IAM + replication rules (apply second)
This kept the graph acyclic and predictable.
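A minimal sketch of the logical layer, assuming the two buckets are separate "physical" units that export their ID and ARN. The unit names, paths, and output names below are hypothetical.

```hcl
# <client>/prod/us-east-1/Storage/s3-replication/terragrunt.hcl
# "Logical" layer: IAM role/policy + replication rule (illustrative sketch)
include "root" {
  path = find_in_parent_folders()
}

# The bucket units know nothing about replication, so edges point one way only:
# replication -> buckets.
dependency "source_bucket" {
  config_path = "../s3-source"
}

dependency "destination_bucket" {
  config_path = "../../../../dr/us-west-2/Storage/s3-destination"
}

inputs = {
  source_bucket_id       = dependency.source_bucket.outputs.bucket_id
  source_bucket_arn      = dependency.source_bucket.outputs.bucket_arn
  destination_bucket_arn = dependency.destination_bucket.outputs.bucket_arn
}
```

S3 replication requires versioning on both buckets, so that setting belongs in the physical layer.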
⚠️ Watch Out: Cross-region replication (S3 CRR, Aurora Global DB) often creates dependency cycles. Split resources into "physical" and "logical" layers to break the cycle.
Example: dependency wiring (sanitized)
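dependency "network" {
  config_path = "../Networking/vpc"
}
inputs = {
  vpc_id = dependency.network.outputs.vpc_id # output name assumed; the original snippet is truncated here
}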
Inputs: environment-variable injection
Shared inputs (account identifiers, naming prefixes, and selectors) were passed via environment variables to keep the pipeline simple.
locals {
  client = get_env("TG_CLIENT")
  env    = get_env("TG_ENV")
  region = get_env("TG_REGION")
}
Remote state backend: S3 (per client account)
What we used
- Terraform remote state in S3
- one state bucket per client account
- key naming mirrored the live repo path
- S3 versioning enabled
- S3 SSE-KMS enabled
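A hedged sketch of what such a backend definition might look like in the root file. The bucket name and KMS key ARN are placeholders; bucket versioning is configured on the bucket itself and does not appear in this block.

```hcl
# Root terragrunt.hcl, remote state (illustrative sketch)
remote_state {
  backend = "s3"

  # Have Terragrunt write backend.tf into each module instead of hand-copying it.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket = "example-client-tfstate" # hypothetical: one bucket per client account
    # The state key mirrors the leaf's path relative to the root.
    key        = "${path_relative_to_include()}/terraform.tfstate"
    region     = "us-east-1"
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID" # placeholder
  }
}
```

A leaf at `<client>/prod/us-east-1/Networking/vpc` would then store its state under a key ending in `prod/us-east-1/Networking/vpc/terraform.tfstate` (prefixed by whatever folders sit between it and the root).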
Why we did not add DynamoDB locking
We intentionally did not add DynamoDB locking because concurrency was naturally low (2-person team) and we enforced "one set of hands on platform applies" via human agreement.
This is pragmatic. If you scale the team, revisit locking and pipeline guardrails.
🤝 Team Size Matters: For a 2-person team, human coordination beats complex locking mechanisms. Scale your tooling with your team.
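If the team grows, the change can be small. As a sketch, the S3 backend accepts a `dynamodb_table` setting that turns on state locking; the bucket and table names below are hypothetical, and newer Terraform releases also offer S3-native lock files, so check what your version supports.

```hcl
# Root terragrunt.hcl, remote state with locking enabled (illustrative sketch)
remote_state {
  backend = "s3"
  config = {
    bucket         = "example-client-tfstate" # hypothetical
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks" # hypothetical; enables state locking
  }
}
```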
Secrets management: resource vs data separation
We used AWS Secrets Manager with a deliberate separation:
- Infrastructure layer: creates the Secrets Manager resources; output ARNs can be passed as dependencies
- Application layer: internal app teams populate the actual secret values
This keeps state clean of sensitive values while ensuring the "plumbing" is consistent.
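A minimal Terraform sketch of that separation, assuming a small reusable module. The variable and output names are illustrative.

```hcl
# modules/secret/main.tf (illustrative sketch)
variable "name" {
  type        = string
  description = "Secret name, e.g. built from client/env naming prefixes"
}

variable "kms_key_id" {
  type        = string
  default     = null
  description = "Optional KMS key; null uses the AWS-managed default key"
}

# Only the secret "container" is managed here. There is deliberately no
# aws_secretsmanager_secret_version, so no secret value ever enters state.
resource "aws_secretsmanager_secret" "this" {
  name       = var.name
  kms_key_id = var.kms_key_id
}

# The ARN is not sensitive and can be consumed through a dependency block.
output "secret_arn" {
  value = aws_secretsmanager_secret.this.arn
}
```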
Onboarding note: secrets population (beyond platform-generated outputs) is one of the few steps that still takes longer than the ~1 hour infra metric.
The "80 GB cache" problem
Terragrunt is aggressive with caching. In local development, .terragrunt-cache grew to ~75–80 GB and caused failures due to disk exhaustion.
Fix: aggressive cache pruning in pipeline post-steps (after successful plan/apply), removing downloaded providers and module artifacts to keep agents stable.
🧹 Pro Tip: Always add a cache cleanup step in CI/CD pipelines. Terragrunt's .terragrunt-cache can balloon to 80+ GB and crash your build agents.
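Pruning is easier when the cache lives in one known place. As a sketch (not necessarily what was done here), Terragrunt's `download_dir` attribute can redirect every unit's downloads to a single directory. `TG_CACHE_DIR` is a hypothetical variable name.

```hcl
# Root terragrunt.hcl (illustrative sketch)
# Send every unit's downloads to one directory instead of a .terragrunt-cache
# folder beside each leaf. Falls back to /tmp/terragrunt-cache when unset.
download_dir = get_env("TG_CACHE_DIR", "/tmp/terragrunt-cache")
```

A pipeline post-step then only has to delete that one directory.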
Multi-Region Disaster Recovery Architecture
We standardized a primary/DR pairing across clients:
- Primary: us-east-1
- DR: us-west-2
These were parameters — clients could request different region pairs without changing the platform design.
What stays "always on" vs created during DR
Always provisioned (foundations + stateful data primitives):
- networking
- IAM/security boundaries
- state/data replication primitives (where applicable)
- registries/artifacts needed for recovery
Provisioned during DR activation (cost-optimized):
- stateless compute (e.g., EKS, Redis), recreated on demand
💰 Cost Optimization Principle: Replicate durable data (databases, S3). Recreate stateless compute (EKS, Redis) on-demand. This keeps DR costs minimal while meeting RTO targets.
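The source does not say how DR compute is toggled, so the following is only one possible approach: drive the compute sizing from a single switch. The `TG_DR_ACTIVE` variable and the module input names are hypothetical.

```hcl
# <client>/dr/us-west-2/Compute/eks/terragrunt.hcl (illustrative sketch)
include "root" {
  path = find_in_parent_folders()
}

locals {
  # Hypothetical switch: the pipeline sets TG_DR_ACTIVE=true during failover.
  dr_active = get_env("TG_DR_ACTIVE", "false") == "true"
}

inputs = {
  # Hypothetical module variables: no nodes while idle, full size once activated.
  node_desired_size = local.dr_active ? 3 : 0
  node_min_size     = local.dr_active ? 3 : 0
  node_max_size     = local.dr_active ? 6 : 1
}
```

An alternative with the same effect is to leave the compute units unapplied until activation; the switch approach just keeps the plan visible in the meantime.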
DR execution and testing
DR testing wasn't "flip a switch." We treated it as an operational event:
- verify data sync (target RPO < 10s)
- promote Aurora in DR region (managed failover)
- update Route 53 failover behavior (health checks target the ALB directly)
- recover applications (apps intentionally down until recovery; acceptable for a 1-hour RTO)
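For the Route 53 step, a failover record pair that relies on the ALB's own health might be sketched as follows in Terraform. Variable names are hypothetical, and this is an assumption about the shape of the setup rather than the actual configuration.

```hcl
# Illustrative Terraform sketch; all variable names are hypothetical.
variable "zone_id" { type = string }     # hosted zone for the app domain
variable "record_name" { type = string } # e.g. app.example.com
variable "primary_alb" { type = object({ dns_name = string, zone_id = string }) }
variable "dr_alb" { type = object({ dns_name = string, zone_id = string }) }

resource "aws_route53_record" "primary" {
  zone_id        = var.zone_id
  name           = var.record_name
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb.dns_name
    zone_id                = var.primary_alb.zone_id
    evaluate_target_health = true # health is derived from the ALB itself
  }
}

resource "aws_route53_record" "dr" {
  zone_id        = var.zone_id
  name           = var.record_name
  type           = "A"
  set_identifier = "dr"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.dr_alb.dns_name
    zone_id                = var.dr_alb.zone_id
    evaluate_target_health = true
  }
}
```

With `evaluate_target_health` enabled on the primary alias, Route 53 answers with the DR record once the primary ALB has no healthy targets.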
Targets
- RTO: 1 hour
- RPO: < 10 seconds
- Test cadence: every 6 months (aligned with internal ISO/SOC expectations)
🧪 Test Your DR: A disaster recovery plan that hasn't been tested is just a hypothesis. We ran DR drills every 6 months.
Measuring the outcome
What "~1 hour" includes
- infrastructure provisioning only
- starting from: client account bootstrapped (state bucket + repo wiring in place)
- ending at: a working environment's infrastructure is created
What still takes longer
- data backfill
- adding non-platform secrets into Secrets Manager
Lessons Learned from Terragrunt Multi-Account Setup
- Don't over-modularize too early. Terraform first to reduce variance, then Terragrunt to reduce duplication.
- Circular dependencies are real in DR scenarios (S3 CRR, cross-region DB). Design layers and apply order intentionally.
- Watch disk space in CI. Terragrunt isolation is great, but cache growth can be severe without cleanup.
- Design for your team size. A 2-person platform needs clarity and predictable workflows more than clever automation.
Read the fully interactive version at https://www.parilsanghvi.in/blog/terragrunt-aws-dr