Automating per-tenant infrastructure with a serverless control plane

June 11, 2026 · aws, terraform, serverless, platform-engineering, iac

A pattern I’ve been working with lately: instead of one big shared environment, every tenant (a customer, a dataset, a “case” — pick your noun) gets its own isolated stack of infrastructure, provisioned on demand and torn down automatically when it’s no longer needed.

Doing that by hand doesn’t scale. So you build a control plane: a small service whose only job is to create, update, and delete other infrastructure on request. Here’s the shape of one I find works really well — fully serverless on AWS, orchestrating Terraform Cloud (TFC) under the hood.

The core idea

Each tenant maps to one Terraform Cloud workspace. The control plane never runs Terraform itself; it drives TFC’s API to create a workspace, kick off a run, watch it to completion, and record the result. The record of which tenant is in which lifecycle status lives in DynamoDB, keyed by a tenant code.

The whole thing is deployed once per environment, not once per tenant.

High-level flow

   client ──► private API Gateway ──► API Lambdas
                                          │ PutEvents
                                          ▼
                              EventBridge buses (one per action)
                              create / update / delete
                                          │  rule ► target
                                          ▼
                              Step Functions (one per action)
                              create-sf / update-sf / delete-sf
                                          │
                          Trigger ► Wait ► Check ─loop─► Finalize
                                          │
                          ┌───────────────┼───────────────┐
                          ▼               ▼               ▼
                     Terraform Cloud   DynamoDB         SNS
                      (workspaces +    (tenant state)  (lifecycle
                        runs)                            events)

Decouple the API from the work

The API is deliberately thin. A request comes in, the Lambda validates it, writes the desired state to DynamoDB, and publishes an event — then returns immediately with a status like creating. It does not wait around for Terraform.
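To make the "validate, write, publish, return" shape concrete, here is a minimal sketch of a create handler. It is not the actual implementation: the function name, the bus name, the event Source and DetailType, the item fields, and the choice of a 409 for a duplicate are all my assumptions. The table and events objects are passed in and are only assumed to behave like a boto3 DynamoDB Table resource and an EventBridge client, which also makes the handler easy to exercise with fakes.

```python
import json
from datetime import datetime, timezone


def _response(status_code, payload):
    return {"statusCode": status_code, "body": json.dumps(payload)}


def handle_create(body, table, events, bus_name="tenant-create"):
    """Validate, record desired state, publish an event, return at once.

    `table` is assumed to expose put_item(Item=..., ConditionExpression=...)
    and `events` to expose put_events(Entries=[...]), like their boto3
    counterparts. `bus_name` and the event fields are hypothetical.
    """
    tenant_code = (body or {}).get("tenant_code")
    if not isinstance(tenant_code, str) or not tenant_code.strip():
        return _response(400, {"error": "tenant_code is required"})

    item = {
        "tenant_code": tenant_code,
        "status": "creating",
        "variables": body.get("variables", {}),
        "requested_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        # Conditional write: only succeeds if no row exists for this tenant.
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(tenant_code)",
        )
    except Exception as exc:
        # boto3 surfaces the failure as an error whose response carries this code.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return _response(409, {"error": "tenant already exists"})
        raise

    events.put_events(Entries=[{
        "EventBusName": bus_name,
        "Source": "control-plane.api",          # hypothetical
        "DetailType": "TenantCreateRequested",  # hypothetical
        "Detail": json.dumps({"tenant_code": tenant_code}),
    }])

    # No waiting on Terraform: the caller gets the in-progress status.
    return _response(202, {"tenant_code": tenant_code, "status": "creating"})
```

In a real Lambda you would construct the two clients once at module load and pass them in from the entry point.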

The trick that keeps this clean: one EventBridge bus per action — create, update, delete. Each bus has a single rule that targets its own Step Function. There’s no if event_type == ... branching anywhere; the flow is implied by which state machine ran. Want to add an audit log or a notification consumer later? Subscribe to the bus — the API code never changes.

Each rule target also gets a dead-letter queue, so a failed hand-off to a Step Function is captured rather than lost.

One state machine per action

There are three Step Functions (create, update, delete) and they share the same definition. Per-flow constants (is this a destroy run? what’s the success status? which event do we emit?) are injected by the EventBridge rule’s input transformer. The delete flow is the only real variation in behaviour: it triggers a destroy run and deletes the workspace on success.
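One way to picture the per-flow constants is as a small lookup table merged into the execution input. The sketch below does in plain Python what the input transformer does declaratively. The key names, the event names, and the assumption that update finishes as ready are mine, not taken from a real definition.

```python
# Hypothetical per-flow constants; each rule would inject one of these rows.
FLOW_CONSTANTS = {
    "create": {"is_destroy": False, "success_status": "ready", "event_type": "tenant.created"},
    "update": {"is_destroy": False, "success_status": "ready", "event_type": "tenant.updated"},
    "delete": {"is_destroy": True, "success_status": "deleted", "event_type": "tenant.deleted"},
}


def build_execution_input(action, detail):
    """Combine the event detail with the constants for one flow.

    The shared state machine reads these fields instead of branching on
    an event type. `poll_count` and `error` start at neutral values so
    later states can always reference them.
    """
    constants = FLOW_CONSTANTS[action]  # KeyError for an unknown action
    return {
        "tenant_code": detail["tenant_code"],
        **constants,
        "poll_count": 0,
        "error": None,
    }
```

With this shape, adding a fourth flow means adding a row and a rule rather than editing the state machine.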

Every machine follows the same loop:

Trigger — read the TFC token from Secrets Manager, ensure the workspace exists, upsert the tenant’s variables, and start a run (auto_apply).
Wait — a native wait state (default 120s).
Check — poll the run; classify it as success-terminal, failure-terminal, or not-yet.
Choice — finished or out of poll budget? Go to Finalize. Otherwise loop back to Wait.
Finalize — on success, read the workspace outputs and persist them to DynamoDB; on failure, record failed. Either way, publish an SNS event.
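The Check and Choice steps above boil down to two small decisions, sketched here as pure functions plus a tiny simulation of the loop. The grouping of Terraform Cloud run statuses into success and failure is my assumption about a sensible classification, and the default poll budget of 30 is made up for illustration.

```python
# Assumed grouping of Terraform Cloud run statuses.
SUCCESS_STATUSES = frozenset({"applied", "planned_and_finished"})
FAILURE_STATUSES = frozenset({"errored", "canceled", "force_canceled", "discarded"})


def classify_run(run_status):
    """Check: map a run status to success, failure, or pending."""
    if run_status in SUCCESS_STATUSES:
        return "success"
    if run_status in FAILURE_STATUSES:
        return "failure"
    return "pending"


def next_state(outcome, poll_count, max_polls):
    """Choice: finalize when finished or out of poll budget, else wait."""
    if outcome != "pending" or poll_count >= max_polls:
        return "Finalize"
    return "Wait"


def run_flow(check, success_status, max_polls=30, wait=lambda: None):
    """Simulate Wait -> Check -> Choice -> Finalize and return the final status.

    `check` returns the current run status; `wait` stands in for the
    native wait state and does nothing by default.
    """
    poll_count = 0
    while True:
        wait()
        outcome = classify_run(check())
        poll_count += 1
        if next_state(outcome, poll_count, max_polls) == "Finalize":
            # A run that never finished within budget is recorded as failed.
            return success_status if outcome == "success" else "failed"
```

For example, a run that reports planning, applying, applied ends as the flow's success status, while a run stuck in planning ends as failed once the budget is spent.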

A few details that matter more than they look:

Idempotency: creates use a conditional write (attribute_not_exists) so a retried request can’t double-provision.
Retry-safety guard: if the latest run is still in flight, reuse its run ID instead of starting a duplicate. But terminal runs don’t short-circuit — that’s what lets “update” and “retry-after-failure” produce a genuinely fresh run.
Bounded polling: a max-poll count means a stuck run finalizes as failed rather than looping forever.
Always-safe Finalize: error is initialized to null and every task has a Catch that jumps straight to Finalize with full context, so the terminal step’s input always resolves — happy path or not.
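Two of the details above are small enough to sketch directly: the retry-safety guard and the catch-all route to Finalize. The function names and the set of statuses treated as terminal are my assumptions. The Catch fields themselves (ErrorEquals, ResultPath, Next) are standard Amazon States Language.

```python
# Assumed set of Terraform Cloud run statuses treated as terminal.
TERMINAL_STATUSES = frozenset({
    "applied", "planned_and_finished",
    "errored", "canceled", "force_canceled", "discarded",
})


def ensure_run(latest_run, start_run):
    """Return (run_id, started_new).

    An in-flight run is reused so a retried Trigger cannot start a
    duplicate. A terminal run does not short-circuit: a fresh run is
    started, which is what update and retry-after-failure rely on.
    """
    if latest_run is not None and latest_run["status"] not in TERMINAL_STATUSES:
        return latest_run["id"], False
    return start_run(), True


def with_finalize_catch(task_state):
    """Return a copy of a Task state with a catch-all that jumps to Finalize.

    ResultPath "$.error" keeps the original input and attaches the error
    under `error`, so Finalize sees the full context either way.
    """
    guarded = dict(task_state)
    guarded["Catch"] = [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "Finalize",
    }]
    return guarded
```

Because the execution input starts with error set to null, Finalize can read that field whether or not a Catch ever fired.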

The lifecycle, as state

DynamoDB holds one row per tenant with a small, explicit status machine:

Status	Meaning
`creating` / `updating` / `deleting`	A flow is in progress
`ready`	Terminal success
`failed`	Terminal failure — retryable via update
`deleted`	Terminal success of delete

Because the workspace outputs are persisted on success, a status API can return them directly — no need to call Terraform Cloud on every read.

Decommission without anyone asking

Each tenant carries a decommission_date. A scheduled Lambda runs daily, scans for anything past its date that isn’t already deleted, and publishes a delete event straight to the delete bus, attributed to a system actor and bypassing the API. Tenants that are mid-flow are skipped and picked up again on the next daily run. Infrastructure cleans up after itself.
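The selection logic of that daily sweep fits in one pure function. This is a sketch under assumptions: dates are stored as ISO strings (YYYY-MM-DD), "past its date" means strictly before today, and the bus name and event fields are hypothetical.

```python
import json
from datetime import date

IN_PROGRESS = frozenset({"creating", "updating", "deleting"})


def due_for_decommission(tenants, today):
    """Return tenant codes whose decommission date has passed.

    Rows already deleted are ignored, and rows that are mid-flow are
    skipped so the next daily run can pick them up.
    """
    due = []
    for tenant in tenants:
        raw = tenant.get("decommission_date")
        if not raw:
            continue
        if tenant["status"] == "deleted" or tenant["status"] in IN_PROGRESS:
            continue
        if date.fromisoformat(raw) < today:
            due.append(tenant["tenant_code"])
    return due


def delete_entries(tenant_codes, bus_name="tenant-delete"):
    """Build EventBridge PutEvents entries for the delete bus."""
    return [
        {
            "EventBusName": bus_name,                 # hypothetical
            "Source": "control-plane.decommission",   # hypothetical
            "DetailType": "TenantDeleteRequested",    # hypothetical
            "Detail": json.dumps({"tenant_code": code, "requested_by": "system"}),
        }
        for code in tenant_codes
    ]
```

Keeping the selection separate from the scan and the publish makes the skip rules easy to unit test.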

Observability

Structured logs on every Lambda, tagged with request and tenant IDs so they’re greppable in CloudWatch Logs Insights.
A metrics Lambda on a 1-minute schedule scans the table, groups by status, and emits a count per status — including zeros — so the dashboard stays accurate even when nothing is being written. (This replaced a DynamoDB-Streams approach, which only emitted on writes and let counts go stale.)
A dashboard with one widget per status.
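The "including zeros" part of the metrics Lambda is the piece worth sketching: every known status gets a data point on every tick, even when no tenant is in it. The metric name and dimension name below are hypothetical. The entry shape follows what CloudWatch's PutMetricData expects in its MetricData list.

```python
from collections import Counter

ALL_STATUSES = ("creating", "updating", "deleting", "ready", "failed", "deleted")


def status_counts(rows):
    """Count rows per status, with an explicit 0 for statuses not seen.

    Rows with a status outside ALL_STATUSES are ignored in this sketch.
    """
    seen = Counter(row["status"] for row in rows)
    return {status: seen.get(status, 0) for status in ALL_STATUSES}


def to_metric_data(counts, metric_name="TenantCount"):
    """Shape the counts as MetricData entries, one per status."""
    return [
        {
            "MetricName": metric_name,  # hypothetical
            "Dimensions": [{"Name": "Status", "Value": status}],
            "Value": count,
            "Unit": "Count",
        }
        for status, count in counts.items()
    ]
```

Emitting the zeros is what lets a widget fall back to 0 instead of holding its last non-zero value.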

Network as the trust boundary

The REST API is private — reachable only from inside the VPC via an execute-api interface endpoint, with a resource policy that allows invocation only from that endpoint and denies everything else. Method-level auth is NONE on purpose: the network is the boundary. For hands-on operations there’s a tiny EC2 admin host with no SSH key, no public IP, and no inbound ports — access is via SSM Session Manager only.
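For the API side of that boundary, the resource policy is usually an allow paired with a deny keyed on the source VPC endpoint. The helper below builds such a policy as a plain dictionary. It is a sketch of the common pattern rather than the exact policy in use, and the function name and the endpoint ID in the usage line are placeholders.

```python
import json


def private_api_policy(vpce_id):
    """Build a resource policy that only admits calls via one VPC endpoint.

    The Allow grants invoke on the whole API; the Deny rejects any call
    whose aws:SourceVpce does not match `vpce_id`.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": "execute-api:/*",
            },
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": "execute-api:/*",
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            },
        ],
    }


# Placeholder endpoint ID, for illustration only.
policy_json = json.dumps(private_api_policy("vpce-0123456789abcdef0"), indent=2)
```

With method-level auth set to NONE, this policy and the endpoint's own security group are the whole access story, so it is worth keeping under review.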

Why I like this shape

The API is boring and fast. It never blocks on Terraform.
The flows are isolated. Per-action buses and state machines mean create, update, and delete fail and retry independently.
It self-heals and self-cleans. Idempotent writes, bounded retries, and scheduled decommissioning mean less to babysit.
It’s almost all serverless. Apart from the tiny admin host, there are no servers to patch, and you pay for what runs.

Things I’d still like to add: method-level API auth, alarms on failure events, X-Ray tracing, and idempotency tokens for client retries.

If you’re provisioning the same shaped stack over and over for different tenants, a thin control plane in front of Terraform Cloud is a genuinely nice place to land.