Automating per-tenant infrastructure with a serverless control plane

· aws, terraform, serverless, platform-engineering, iac

A pattern I’ve been working with lately: instead of one big shared environment, every tenant (a customer, a dataset, a “case” — pick your noun) gets its own isolated stack of infrastructure, provisioned on demand and torn down automatically when it’s no longer needed.

Doing that by hand doesn’t scale. So you build a control plane: a small service whose only job is to create, update, and delete other infrastructure on request. Here’s the shape of one I find works really well — fully serverless on AWS, orchestrating Terraform Cloud (TFC) under the hood.

The core idea

Each tenant maps to one Terraform Cloud workspace. The control plane never runs Terraform itself; it drives TFC’s API to create a workspace, kick off a run, watch it to completion, and record the result. The record of which tenant is in which lifecycle status lives in DynamoDB, keyed by a tenant code.

The whole thing is deployed once per environment, not once per tenant.
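To make the workspace-per-tenant mapping concrete, here is a minimal sketch of the two request bodies the control plane would send to TFC. The endpoint paths and JSON:API attribute names follow my understanding of the TFC v2 API; the `tenant-` naming scheme, the helper names, and the run message are assumptions for illustration only.

```python
TFC_API = "https://app.terraform.io/api/v2"


def workspace_name(tenant_code: str, prefix: str = "tenant") -> str:
    """Derive a workspace name from the tenant code (naming scheme is an assumption)."""
    return f"{prefix}-{tenant_code.lower()}"


def create_workspace_request(org: str, tenant_code: str) -> tuple:
    """Return (url, body) for creating the tenant's workspace."""
    url = f"{TFC_API}/organizations/{org}/workspaces"
    body = {
        "data": {
            "type": "workspaces",
            "attributes": {"name": workspace_name(tenant_code)},
        }
    }
    return url, body


def create_run_request(workspace_id: str, is_destroy: bool = False) -> tuple:
    """Return (url, body) for starting an auto-applied run in a workspace."""
    url = f"{TFC_API}/runs"
    body = {
        "data": {
            "type": "runs",
            "attributes": {
                "is-destroy": is_destroy,
                "auto-apply": True,
                "message": "Triggered by the tenant control plane",  # hypothetical message
            },
            "relationships": {
                "workspace": {"data": {"type": "workspaces", "id": workspace_id}}
            },
        }
    }
    return url, body
```

Building the payloads as plain functions keeps them testable without a network call; the HTTP client and the bearer token would be supplied by whatever calls them.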

High-level flow

   client ──► private API Gateway ──► API Lambdas
                                          │ PutEvents
                                          ▼
                              EventBridge buses (one per action)
                              create / update / delete
                                          │  rule ► target
                                          ▼
                              Step Functions (one per action)
                              create-sf / update-sf / delete-sf
                                          │
                          Trigger ► Wait ► Check ─loop─► Finalize
                                          │
                          ┌───────────────┼───────────────┐
                          ▼               ▼               ▼
                     Terraform Cloud   DynamoDB         SNS
                      (workspaces +    (tenant state)  (lifecycle
                        runs)                            events)

Decouple the API from the work

The API is deliberately thin. A request comes in, the Lambda validates it, writes the desired state to DynamoDB, and publishes an event — then returns immediately with a status like creating. It does not wait around for Terraform.
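A minimal sketch of such a handler, assuming a DynamoDB `Table`-like object and an EventBridge client are passed in (their `put_item` and `put_events` calls mirror boto3). The bus names, event source, and field names are hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical bus names, one per action.
BUS_BY_ACTION = {
    "create": "tenant-create",
    "update": "tenant-update",
    "delete": "tenant-delete",
}
IN_PROGRESS = {"create": "creating", "update": "updating", "delete": "deleting"}


def handle_request(action, body, table, events, now=None):
    """Validate, record desired state, publish an event, return immediately."""
    tenant_code = (body or {}).get("tenant_code")
    if action not in BUS_BY_ACTION or not tenant_code:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid request"})}

    now = now or datetime.now(timezone.utc)
    status = IN_PROGRESS[action]

    # Simplification: a real handler would likely use update_item with a
    # condition expression so existing attributes are not overwritten.
    table.put_item(Item={
        "tenant_code": tenant_code,
        "status": status,
        "variables": body.get("variables", {}),
        "updated_at": now.isoformat(),
    })

    # Hand the work off; nothing here waits for Terraform.
    events.put_events(Entries=[{
        "EventBusName": BUS_BY_ACTION[action],
        "Source": "tenant-api",
        "DetailType": f"tenant.{action}",
        "Detail": json.dumps({"tenant_code": tenant_code, "requested_by": "api"}),
    }])

    return {
        "statusCode": 202,
        "body": json.dumps({"tenant_code": tenant_code, "status": status}),
    }
```

Returning 202 rather than 200 is a design choice in this sketch: it signals that the request was accepted but the work is still in flight.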

The trick that keeps this clean: one EventBridge bus per action (create, update, delete). Each bus has a single rule that targets its own Step Function. There’s no if event_type == ... branching anywhere; the flow is implied by which state machine ran. Want to add an audit log or a notification consumer later? Subscribe to the bus; the API code never changes.

Each rule target also gets a dead-letter queue, so a failed hand-off to a Step Function is captured rather than lost.

One state machine per action

There are three Step Functions (create, update, delete), and they share the same definition. Per-flow constants (is this a destroy run? what’s the success status? which event do we emit?) are injected by the EventBridge rule’s input transformer. The delete flow is the only real variation in behavior: it triggers a destroy run and deletes the workspace on success.
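As a rough sketch of how one rule target could carry both the dead-letter queue and the per-flow constants, the function below builds a target entry in the shape boto3's `put_targets` accepts. The constant names, event names, and the `$.detail.tenant_code` path are assumptions, not the actual configuration.

```python
import json

# Hypothetical per-flow constants injected into each state machine's input.
FLOW_CONSTANTS = {
    "create": {"is_destroy": False, "success_status": "ready", "event_name": "tenant.created"},
    "update": {"is_destroy": False, "success_status": "ready", "event_name": "tenant.updated"},
    "delete": {"is_destroy": True, "success_status": "deleted", "event_name": "tenant.deleted"},
}


def build_target(action, state_machine_arn, role_arn, dlq_arn):
    """Build one EventBridge target: Step Function + DLQ + input transformer."""
    # Render the constants as JSON, then swap in an unquoted <tenant_code>
    # placeholder, which EventBridge replaces with the value from the event.
    template = json.dumps({"tenant_code": "__TENANT__", **FLOW_CONSTANTS[action]})
    template = template.replace('"__TENANT__"', "<tenant_code>")
    return {
        "Id": f"{action}-sf",
        "Arn": state_machine_arn,
        "RoleArn": role_arn,
        # Failed hand-offs to the state machine land in this queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
        "InputTransformer": {
            "InputPathsMap": {"tenant_code": "$.detail.tenant_code"},
            "InputTemplate": template,
        },
    }
```

The result would be passed as one element of `Targets` in `events.put_targets(Rule=..., EventBusName=..., Targets=[...])`. Because the constants arrive as input, the state machine definition itself never needs to know which flow it is serving.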

Every machine follows the same loop:

  1. Trigger — read the TFC token from Secrets Manager, ensure the workspace exists, upsert the tenant’s variables, and start a run (auto_apply).
  2. Wait — a native wait state (default 120s).
  3. Check — poll the run; classify it as success-terminal, failure-terminal, or not-yet.
  4. Choice — finished or out of poll budget? Go to Finalize. Otherwise loop back to Wait.
  5. Finalize — on success, read the workspace outputs and persist them to DynamoDB; on failure, record failed. Either way, publish an SNS event.
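The Check and Choice steps above can be sketched as two small functions. The status strings are TFC run statuses as I understand them; which ones count as success or failure, the `timed_out` outcome, and the default poll budget are assumptions.

```python
# Assumed grouping of TFC run statuses into terminal outcomes.
SUCCESS_STATUSES = {"applied", "planned_and_finished"}
FAILURE_STATUSES = {"errored", "discarded", "canceled", "force_canceled"}


def classify_run(tfc_status: str) -> str:
    """Map a TFC run status to success, failure, or pending."""
    if tfc_status in SUCCESS_STATUSES:
        return "success"
    if tfc_status in FAILURE_STATUSES:
        return "failure"
    return "pending"


def check(state: dict, tfc_status: str, max_polls: int = 30) -> dict:
    """One Check + Choice iteration: count the poll, pick the next state."""
    polls = state.get("polls", 0) + 1
    outcome = classify_run(tfc_status)
    if outcome == "pending" and polls >= max_polls:
        # Out of poll budget: stop looping and let Finalize decide what to record.
        outcome = "timed_out"
    next_state = "Wait" if outcome == "pending" else "Finalize"
    return {**state, "polls": polls, "outcome": outcome, "next": next_state}
```

With the 120s wait from step 2, a budget of 30 polls would cap a flow at roughly an hour. Keeping the counter in the state machine's own input means the loop needs no external storage.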

A few details that matter more than they look:

The lifecycle, as state

DynamoDB holds one row per tenant with a small, explicit status machine:

Status                          Meaning
creating / updating / deleting  A flow is in progress
ready                           Terminal success
failed                          Terminal failure (retryable via update)
deleted                         Terminal success of delete

Because the workspace outputs are persisted on success, a status API can return them directly — no need to call Terraform Cloud on every read.
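A small sketch of the status model and the read path, assuming each DynamoDB item carries `tenant_code`, `status`, and an `outputs` map. The transition rules in `can_start` are my assumptions, apart from failed being retryable via update, which comes from the table above.

```python
from enum import Enum


class Status(str, Enum):
    CREATING = "creating"
    UPDATING = "updating"
    DELETING = "deleting"
    READY = "ready"
    FAILED = "failed"
    DELETED = "deleted"


IN_PROGRESS = {Status.CREATING, Status.UPDATING, Status.DELETING}

# Assumed rules: which current statuses allow a new flow to start.
ALLOWED_FROM = {
    "create": {None, Status.DELETED},
    "update": {Status.READY, Status.FAILED},
    "delete": {Status.READY, Status.FAILED},
}


def can_start(action, current):
    """True if `action` may start from `current` (None means no record yet)."""
    if current in IN_PROGRESS:
        return False  # never start a second flow on top of a running one
    return current in ALLOWED_FROM[action]


def status_response(record: dict) -> dict:
    """Build the status API response from the stored item alone."""
    status = Status(record["status"])
    response = {"tenant_code": record["tenant_code"], "status": status.value}
    if status is Status.READY:
        # Outputs were persisted by Finalize, so no TFC call is needed here.
        response["outputs"] = record.get("outputs", {})
    return response
```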

Decommission without anyone asking

Each tenant carries a decommission_date. A scheduled Lambda runs daily, scans for anything past its date that isn’t already deleted, and publishes a delete event straight to the delete bus (as system, skipping the API). It skips anything mid-flow and retries the next day. Infrastructure cleans up after itself.
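A sketch of that daily sweep, assuming the scan results arrive as a list of dicts with an ISO-formatted `decommission_date`, and that `events` exposes a boto3-style `put_events`. The bus name, event source, and the strict "before today" comparison are assumptions.

```python
import json
from datetime import date

IN_PROGRESS = {"creating", "updating", "deleting"}


def due_for_decommission(items, today):
    """Tenant codes past their date, excluding deleted and mid-flow tenants."""
    due = []
    for item in items:
        if item["status"] == "deleted" or item["status"] in IN_PROGRESS:
            continue  # mid-flow tenants are picked up by a later daily run
        raw = item.get("decommission_date")
        # Assumption: "past its date" means strictly before today.
        if raw and date.fromisoformat(raw) < today:
            due.append(item["tenant_code"])
    return due


def sweep(items, events, today, bus_name="tenant-delete"):
    """Publish one delete event per due tenant, attributed to 'system'."""
    codes = due_for_decommission(items, today)
    entries = [
        {
            "EventBusName": bus_name,
            "Source": "decommission-sweeper",
            "DetailType": "tenant.delete",
            "Detail": json.dumps({"tenant_code": code, "requested_by": "system"}),
        }
        for code in codes
    ]
    # PutEvents accepts at most 10 entries per call, so send in batches.
    for start in range(0, len(entries), 10):
        events.put_events(Entries=entries[start:start + 10])
    return codes
```

Because the sweep publishes to the same bus the API uses, the delete state machine cannot tell a scheduled decommission from a requested one except by the `requested_by` field.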

Observability

Network as the trust boundary

The REST API is private — reachable only from inside the VPC via an execute-api interface endpoint, with a resource policy that allows invocation only from that endpoint and denies everything else. Method-level auth is NONE on purpose: the network is the boundary. For hands-on operations there’s a tiny EC2 admin host with no SSH key, no public IP, and no inbound ports — access is via SSM Session Manager only.
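For reference, a resource policy of that shape can be sketched as below. This follows the allow-then-deny pattern commonly used for private REST APIs; the function name and the endpoint ID passed in are placeholders, and the actual policy may differ.

```python
import json


def private_api_policy(vpce_id: str) -> dict:
    """Resource policy: allow invoke, but deny anything not from the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": "execute-api:/*",
            },
            {
                # An explicit Deny wins over the Allow for any other source.
                "Effect": "Deny",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": "execute-api:/*",
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            },
        ],
    }


if __name__ == "__main__":
    # "vpce-0123456789abcdef0" is a made-up endpoint ID for illustration.
    print(json.dumps(private_api_policy("vpce-0123456789abcdef0"), indent=2))
```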

Why I like this shape

Things I’d still like to add: method-level API auth, alarms on failure events, X-Ray tracing, and idempotency tokens for client retries.

If you’re provisioning the same shaped stack over and over for different tenants, a thin control plane in front of Terraform Cloud is a genuinely nice place to land.
