Terragrunt Multi-Account Setup: 7 Days → 1 Hour AWS DR

Learn Terragrunt multi-account setup for AWS disaster recovery. Reduce onboarding from 7 days to 1 hour with DRY configs and pilot-light DR architecture.

I designed and implemented this architecture in about 2 weeks, leveraging AI-powered development tools to accelerate the process.

What We Achieved with Terragrunt

We re-architected a per-client AWS platform in three stages (manual provisioning → Terraform → Terragrunt modularization) to solve two practical problems:

onboarding new client environments was slow and inconsistent
multi-region disaster recovery (DR) needed to be repeatable and testable

Infrastructure became standardized across clients, while application delivery (CI/CD + in-cluster deployments) remained client-specific.

💡 Key Result: Infra-only onboarding dropped from ~7 days to ~1 hour - measured as "from an empty client account (with bootstrap wiring in place) to a working infrastructure environment."

The Manual Infrastructure Problem

The original model: manual infrastructure

Before infrastructure-as-code (IaC), onboarding meant creating resources by hand, validating by checklist, and relying on tribal knowledge.

That was workable until we hit the combination of:

multiple clients
multiple environments per client
a DR requirement with real RTO/RPO targets

What the business needed

repeatable onboarding for new clients/environments
a consistent baseline across environments
multi-region DR that could be exercised regularly (not "hope it works")

Constraints that shaped the solution

Small platform team: optimize for clarity and operational simplicity
Client isolation: each client runs in their own AWS account
Environment model: Dev/UAT/Prod/DR live in the same client account
Region pair standardization: primary us-east-1, DR us-west-2 (parameterized per client)
Cost efficiency: avoid heavy licensing fees for orchestration tooling

Why We Chose Terragrunt Over Terraform Cloud

We chose Terragrunt over Terraform Cloud or other paid orchestrators because it enabled a "write once, deploy everywhere" approach while remaining open source — reducing licensing overhead for a small team.

🎯 Why Terragrunt? Open-source orchestration that keeps modules DRY without the licensing cost of Terraform Cloud or Spacelift.

Stage 1 - Terraform: stop the variance

The first priority was to convert runbooks into code and reduce configuration drift. Terraform gave us repeatable deployments and plan safety checks.

Stage 2 - Terragrunt: make it a platform

Once Terraform existed, our biggest cost was duplication. Terragrunt became the orchestration layer that provided:

DRY configuration: backend + provider logic defined once at the root
Hierarchical config: settings cascade from Client → Environment → Region
Predictable layout: the folder structure becomes documentation and state key structure

📐 Design Principle: The folder structure IS the documentation. If you can navigate the filesystem, you can understand the infrastructure.
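
The Client → Environment → Region cascade can be sketched in the root terragrunt.hcl roughly as follows. This is a minimal sketch, assuming each of client.hcl, env.hcl, and region.hcl exposes its values in a `locals` block; the local names and the merge order are illustrative assumptions, not the original configuration.

```hcl
# Infra/terragrunt.hcl (root), illustrative sketch only

locals {
  # find_in_parent_folders() walks up from the leaf directory, so each leaf
  # picks up the client/env/region files that sit above it in the tree.
  client_vars = read_terragrunt_config(find_in_parent_folders("client.hcl"))
  env_vars    = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  region_vars = read_terragrunt_config(find_in_parent_folders("region.hcl"))
}

# Later arguments to merge() win: region overrides env, env overrides client.
inputs = merge(
  local.client_vars.locals,
  local.env_vars.locals,
  local.region_vars.locals,
)
```

With this shape, a value set once in client.hcl reaches every leaf under that client unless a more specific level overrides it.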

High-level architecture

We used a pilot-light DR strategy: data is replicated continuously, while compute in the DR region is kept minimal until failover.

🔥 Pilot-Light Strategy: Keep data warm (always replicating), keep compute cold (spin up on failover). This balances cost with recovery speed.

Terragrunt Repository Structure Best Practices

We standardized the layout so day-2 operations were predictable:

Infra/
  terragrunt.hcl                     # Root configuration (DRY logic)
  <client>/
    client.hcl                       # Client-specific variables
    <env>/
      env.hcl                        # Environment variables (dev/uat/prod/dr)
      <region>/
        region.hcl                   # Region-specific variables
        <resource-type>/             # Networking, Compute, Queues, Storage...
          <resource>/                # vpc, eks, mq, s3, ecr...
            terragrunt.hcl           # Leaf configuration
modules/
  <resource-template>/               # Reusable Terraform module(s)

Why this layout worked

state keys follow paths (no "mystery states")
onboarding becomes "compose known building blocks" rather than "rebuild"
troubleshooting starts from the filesystem (predictable, consistent)

How Root Inheritance Keeps Terragrunt Configs DRY

One of the biggest wins was stripping duplicate boilerplate from leaf modules. We used find_in_parent_folders() to inherit configuration from the root.

Example: leaf module stays minimal (sanitized)

1) Inherit provider/backend config from root

2) Inherit common variables (account id, tags, region, naming)

3) Define module source + version pin

4) Only pass inputs that differ for this instance
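
The four steps above might look like this in a leaf file. It is a hedged sketch, not the sanitized original: the module repository URL, the version tag, and the `cidr_block` input are placeholders chosen for illustration.

```hcl
# Infra/<client>/<env>/<region>/Networking/vpc/terragrunt.hcl (hypothetical leaf)

# 1) and 2) Inherit backend, provider, and common inputs from the root config.
include "root" {
  path = find_in_parent_folders()
}

# 3) Module source with a version pin (URL and tag are placeholders).
terraform {
  source = "git::https://example.com/org/modules.git//vpc?ref=v1.0.0"
}

# 4) Only the inputs that differ for this instance.
inputs = {
  cidr_block = "10.0.0.0/16"
}
```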

✨ Less is More: Each leaf module contains only what's unique to that instance. Everything else is inherited.

Solving Cross-Region Dependency Cycles

We relied on Terragrunt dependency blocks for ordering. This worked well — until multi-region replication surfaced real-world dependency cycles.

The RDS challenge (Global DB)

Aurora Global Database requires the primary to exist before the replica can be created. The replica often depends on shared foundations (e.g., VPC/security groups), so the sequence matters.

Solution: enforce a strict apply order:

networking foundations (global)
primary database (primary region)
replica database (DR region)
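
The apply order above can be encoded with dependency blocks so Terragrunt resolves it instead of a runbook. The sketch below assumes a hypothetical `Databases/aurora-replica` unit in the DR region and hypothetical output names (`vpc_id`, `global_cluster_id`); only the region pair comes from this article.

```hcl
# Infra/<client>/dr/us-west-2/Databases/aurora-replica/terragrunt.hcl (hypothetical)

include "root" {
  path = find_in_parent_folders()
}

# Networking foundations in the same region must exist first.
dependency "network" {
  config_path = "../../Networking/vpc"
}

# The primary cluster in the primary region must exist before the replica.
dependency "primary_db" {
  config_path = "../../../../prod/us-east-1/Databases/aurora-primary"
}

inputs = {
  vpc_id                    = dependency.network.outputs.vpc_id
  global_cluster_identifier = dependency.primary_db.outputs.global_cluster_id
}
```

Because the replica reads outputs from both units, a multi-unit run applies networking and the primary database before it reaches the replica.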

The S3 CRR challenge

S3 Cross-Region Replication (CRR) can create a classic cycle:

source + destination buckets must exist
IAM policies must exist and reference buckets
replication rule is applied to the source bucket and references roles/policies

If attempted "in one pass," Terraform can detect a cycle.

Solution: split S3 into layers:

Physical layer: buckets (create first)
Logical layer: IAM + replication rules (apply second)

This kept the graph acyclic and predictable.
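
A sketch of the logical layer as its own Terragrunt unit, assuming the two buckets live in separate hypothetical units (`s3-source`, `s3-destination`) that each export a `bucket_arn` output. The referenced module would hold the IAM role and the replication rule (for example an `aws_s3_bucket_replication_configuration` resource); none of these names come from the original repository.

```hcl
# Infra/<client>/prod/us-east-1/Storage/s3-replication/terragrunt.hcl (hypothetical)

include "root" {
  path = find_in_parent_folders()
}

# Physical layer: both buckets are created by their own units first.
dependency "source_bucket" {
  config_path = "../s3-source"
}

dependency "destination_bucket" {
  config_path = "../../../../dr/us-west-2/Storage/s3-destination"
}

# Logical layer: IAM and the replication rule only consume bucket ARNs,
# so edges point one way (buckets -> replication) and no cycle can form.
inputs = {
  source_bucket_arn      = dependency.source_bucket.outputs.bucket_arn
  destination_bucket_arn = dependency.destination_bucket.outputs.bucket_arn
}
```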

⚠️ Watch Out: Cross-region replication (S3 CRR, Aurora Global DB) often creates dependency cycles. Split resources into "physical" and "logical" layers to break the cycle.

Example: dependency wiring (sanitized)

dependency "network" {
  config_path = "../Networking/vpc"
}

inputs = {
  # Inputs elided in the sanitized original; values are read from
  # dependency.network.outputs.
}
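
One practical detail with dependency wiring is planning a unit before its dependency has been applied. A hedged sketch using Terragrunt's `mock_outputs`; the `vpc_id` output name and the placeholder value are assumptions.

```hcl
dependency "network" {
  config_path = "../Networking/vpc"

  # Placeholder values, used only when the VPC unit has no real outputs yet.
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }

  # Restrict mocks to read-only commands so an apply never uses fake values.
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id = dependency.network.outputs.vpc_id
}
```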

Inputs: environment-variable injection

Shared inputs (account identifiers, naming prefixes, and selectors) were passed via environment variables to keep the pipeline simple.

locals {
  client = get_env("TG_CLIENT")
  env    = get_env("TG_ENV")
  region = get_env("TG_REGION")
}
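
A slightly more defensive variant, as a sketch: `get_env` with a single argument makes Terragrunt fail when the variable is unset, while the two-argument form supplies a fallback. The defaults and the `name_prefix` local are illustrative assumptions.

```hcl
locals {
  # No default: Terragrunt stops with an error if TG_CLIENT is unset,
  # which prevents applying against the wrong client by accident.
  client = get_env("TG_CLIENT")

  # Two-argument form: fall back to a default when the variable is unset.
  env    = get_env("TG_ENV", "dev")
  region = get_env("TG_REGION", "us-east-1")

  # Hypothetical naming prefix derived from the three selectors.
  name_prefix = "${local.client}-${local.env}-${local.region}"
}
```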

Remote state backend: S3 (per client account)

What we used

Terraform remote state in S3
one state bucket per client account
key naming mirrored the live repo path
S3 versioning enabled
S3 SSE-KMS enabled
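
In Terragrunt terms, a backend like this can be declared once in the root config. The following is a sketch under assumptions: the bucket name, account ID, and KMS key ARN are placeholders, and versioning is treated as a property of the bucket itself rather than of this block.

```hcl
# Infra/terragrunt.hcl (root), illustrative sketch only

remote_state {
  backend = "s3"

  # Write a backend.tf into each unit so leaf modules need no backend code.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket = "example-client-tfstate" # placeholder: one bucket per client account
    # The state key mirrors the unit's path relative to the root config.
    key        = "${path_relative_to_include()}/terraform.tfstate"
    region     = "us-east-1"
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:111111111111:key/00000000-0000-0000-0000-000000000000" # placeholder
    # No dynamodb_table entry: locking is intentionally left out.
  }
}
```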

Why we did not add DynamoDB locking

We intentionally did not add DynamoDB locking because concurrency was naturally low (2-person team) and we enforced "one set of hands on platform applies" via human agreement.

This is pragmatic. If you scale the team, revisit locking and pipeline guardrails.

🤝 Team Size Matters: For a 2-person team, human coordination beats complex locking mechanisms. Scale your tooling with your team.

Secrets management: resource vs data separation

We used AWS Secrets Manager with a deliberate separation:

Infrastructure layer: creates the Secrets Manager resources; output ARNs can be passed as dependencies
Application layer: internal application teams populate the actual secret values

This keeps state clean of sensitive values while ensuring the "plumbing" is consistent.
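
A minimal Terraform sketch of the infrastructure-layer half of that split. The variable and resource names are hypothetical; the point is that only the secret container is declared, with no `aws_secretsmanager_secret_version` resource.

```hcl
variable "secret_name" {
  type        = string
  description = "Name of the secret container, e.g. <client>/<env>/app/db"
}

# Creates the secret container only. Because no secret version is managed
# here, the secret value is never written to Terraform state.
resource "aws_secretsmanager_secret" "app" {
  name = var.secret_name
}

# The ARN is safe to expose and can be consumed via Terragrunt dependencies.
output "secret_arn" {
  value = aws_secretsmanager_secret.app.arn
}
```

Application teams then write the value out of band (console, CLI, or their own pipeline), so rotating a secret never requires a platform apply.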

Onboarding note: secrets population (beyond platform-generated outputs) is one of the few steps that still takes longer than the ~1 hour infra metric.

The "80 GB cache" problem

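Terragrunt is aggressive with caching: by default each unit keeps its own .terragrunt-cache with a copy of the module source and its downloaded providers, so usage multiplies with the number of units. In local development, .terragrunt-cache grew to ~75–80 GB and caused failures due to disk exhaustion.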

Fix: aggressive cache pruning in pipeline post-steps (after successful plan/apply), removing downloaded providers and module artifacts to keep agents stable.

🧹 Pro Tip: Always add a cache cleanup step in CI/CD pipelines. Terragrunt's .terragrunt-cache can balloon to 80+ GB and crash your build agents.
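
A complementary option, not the fix described above: Terragrunt's `download_dir` attribute can point every unit at one cache location, so a cleanup step has a single directory to delete. This is a sketch; the `CI_TG_CACHE_DIR` variable name and the fallback path are hypothetical.

```hcl
# Infra/terragrunt.hcl (root), illustrative sketch only

# Send every unit's working copy to one directory instead of a
# .terragrunt-cache folder beside each leaf. CI_TG_CACHE_DIR is a
# hypothetical variable that the pipeline would set per agent.
download_dir = get_env("CI_TG_CACHE_DIR", "/tmp/terragrunt-cache")
```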

Multi-Region Disaster Recovery Architecture

We standardized a primary/DR pairing across clients:

Primary: us-east-1
DR: us-west-2

These were parameters — clients could request different region pairs without changing the platform design.
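
For illustration, the region pair could live in the client-level file as plain data. This is a sketch with hypothetical local names and a placeholder client name, not the original client.hcl.

```hcl
# Infra/<client>/client.hcl (hypothetical contents)

locals {
  client_name = "acme" # placeholder client

  # Defaults from the standard pairing; another client would change
  # only these two values.
  primary_region = "us-east-1"
  dr_region      = "us-west-2"
}
```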

What stays "always on" vs created during DR

Always provisioned (foundations + stateful data primitives):

networking
IAM/security boundaries
state/data replication primitives (where applicable)
registries/artifacts needed for recovery

Provisioned during DR activation (cost-optimized):

EKS
OpenSearch
Redis

💰 Cost Optimization Principle: Replicate durable data (databases, S3). Recreate stateless compute (EKS, Redis) on-demand. This keeps DR costs minimal while meeting RTO targets.

DR execution and testing

DR testing wasn't "flip a switch." We treated it as an operational event:

verify data sync (target RPO < 10s)
promote Aurora in DR region (managed failover)
update Route 53 failover behavior (health checks target the ALB directly)
recover applications (apps intentionally down until recovery; acceptable for a 1-hour RTO)
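
The Route 53 step can be expressed in Terraform as a PRIMARY/SECONDARY alias record pair. The sketch below assumes each region fronts the application with an ALB and relies on `evaluate_target_health` for the health signal; all variable names are hypothetical, and the original setup may differ.

```hcl
variable "zone_id" { type = string }
variable "record_name" { type = string }

variable "primary_alb" {
  type = object({
    dns_name = string
    zone_id  = string
  })
}

variable "dr_alb" {
  type = object({
    dns_name = string
    zone_id  = string
  })
}

# Primary record: served while the primary ALB is considered healthy.
resource "aws_route53_record" "primary" {
  zone_id        = var.zone_id
  name           = var.record_name
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb.dns_name
    zone_id                = var.primary_alb.zone_id
    evaluate_target_health = true
  }
}

# Secondary record: Route 53 answers with the DR ALB when the primary fails.
resource "aws_route53_record" "dr" {
  zone_id        = var.zone_id
  name           = var.record_name
  type           = "A"
  set_identifier = "dr"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.dr_alb.dns_name
    zone_id                = var.dr_alb.zone_id
    evaluate_target_health = true
  }
}
```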

Targets

RTO: 1 hour
RPO: < 10 seconds
Test cadence: every 6 months (aligned with internal ISO/SOC expectations)

🧪 Test Your DR: A disaster recovery plan that hasn't been tested is just a hypothesis. We ran DR drills every 6 months.

Measuring the outcome

What "~1 hour" includes

infrastructure provisioning only
starting from: client account bootstrapped (state bucket + repo wiring in place)
ending at: a working environment's infrastructure is created

What still takes longer

data backfill
adding non-platform secrets into Secrets Manager

Lessons Learned from Terragrunt Multi-Account Setup

Don't over-modularize too early. Terraform first to reduce variance, then Terragrunt to reduce duplication.
Circular dependencies are real in DR scenarios (S3 CRR, cross-region DB). Design layers and apply order intentionally.
Watch disk space in CI. Terragrunt isolation is great, but cache growth can be severe without cleanup.
Design for your team size. A 2-person platform needs clarity and predictable workflows more than clever automation.

Read the fully interactive version at https://www.parilsanghvi.in/blog/terragrunt-aws-dr