Human-in-the-Loop by Design: Architectural Patterns for AI Agents That Don't Go Off the Rails

April 15, 2026 · 13 min read


In February 2024, Canada's Civil Resolution Tribunal ruled that Air Canada had to honor a bereavement-fare discount its customer-service chatbot invented out of thin air. The airline's defense was memorable: the chatbot was a "separate legal entity" and Air Canada couldn't be held responsible for what it said. The tribunal disagreed. The chatbot was the company.

That ruling isn't a legal curiosity. It's a preview of how every business will eventually be held accountable for its agents. And it's a reminder that "the agent works most of the time" is not a production standard.

AI agents fail differently than traditional code. Traditional code fails loudly — a null pointer, a stack trace, a 500. You see it. You fix it. The blast radius is usually one request.

Agents fail quietly. They generate a plausible-looking output that's wrong. They take an action that looks correct until someone traces the audit log three weeks later and realizes something's been off for a month. They drift — slowly, as the data they see shifts and the model's behavior moves with it.

We've built two production automation systems that illustrate the patterns that keep this from happening: a billing engine for a national 3PL that processes every invoice cycle without a human touching a single line item, and a vehicle inventory platform for a regional manufacturer where dealer orders, holds, and production state flow through the system end-to-end. Neither has rolled back. Neither has paged anyone in the middle of the night. Neither is magic. They're infrastructure.

Here are the architectural patterns that make the difference.

1. Bounded Action Scope

The most common failure mode is also the most avoidable: an agent that has permission to do things it should never do.

Define the action surface explicitly. Not "the agent can access the database" — "the agent can read from tables A, B, C and write to table D." Not "the agent can send emails" — "the agent can send emails from this specific address, to addresses on this allowlist, using these three templates."

On the 3PL billing engagement, this meant the billing engine was granted read access to the ERP tables that defined rate cards, customer records, and shipment data, and write access to exactly two tables: the invoice ledger and the exception queue. That was the entire action surface. Anything outside of it — adjusting a customer record, modifying a rate card, issuing a refund — was structurally impossible for the engine to do. A human had to do those, in the ERP, with a logged action.

An agent without a bounded action surface is an agent that will eventually do something you didn't anticipate. The question isn't whether. It's when.

Implementation: wrap every tool call in a permission check, and fail closed. If the tool isn't in the allowlist, the call never leaves the agent process.

Agent Intent → Action Validator → (approved | rejected) → Execution

The validator is dumb on purpose. It doesn't try to be smart about gray areas. If the action isn't explicitly permitted, it's rejected. A human adds new permissions deliberately. Anthropic's December 2024 "Building Effective Agents" post makes the same point from the model side: the fewer tools an agent has access to, the more predictable its behavior.
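A deliberately dumb validator can be sketched in a few lines. The tool names and the contents of the allowlist below are hypothetical stand-ins for whatever your agent's action surface actually is:

```python
# Minimal sketch of a fail-closed action validator. Tool names in the
# allowlist are hypothetical examples, not a real agent's action surface.
ALLOWLIST = {
    "read_rate_cards",
    "write_invoice_ledger",
    "write_exception_queue",
}

class ActionRejected(Exception):
    """Raised when the agent proposes a tool call outside the allowlist."""

def validate(tool_name: str) -> str:
    # No heuristics, no gray areas: if the tool is not explicitly
    # permitted, the call never leaves the agent process.
    if tool_name not in ALLOWLIST:
        raise ActionRejected(f"tool {tool_name!r} is not on the allowlist")
    return tool_name
```

Adding a new permission means a human editing the allowlist, not the validator getting smarter.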

2. Confidence-Gated Execution

Models produce outputs with implicit or explicit confidence. Use it.

Every action should pass through a confidence threshold. Above the threshold, execute. Below, route to a human. The threshold varies by action — sending an internal Slack notification tolerates low confidence; generating a wire transfer does not.

The mistake most teams make is treating confidence as binary. It isn't. The correct pattern is tiered:

  • High: execute immediately, log
  • Medium: execute, notify human async
  • Low: queue for human review before execution
  • Very low: block, flag for investigation

In the 3PL billing system, "confidence" isn't a model score — it's a rules-based confidence derived from how cleanly the inputs match known patterns. An invoice where every line item matches a published rate card, the customer record matches exactly, and the shipment data reconciles with carrier manifests: high confidence, sent automatically. An invoice where one line item exceeds a threshold or the customer record is ambiguous: exception queue. The logic is explicit, the tiers are configurable, and the distribution of where invoices land is instrumented.

The tiers matter because the cost of false positives differs dramatically across action types. A medium-confidence email draft is fine to send. A medium-confidence invoice correction is not.
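The tiered routing reduces to a simple dispatch. The threshold values and handler names below are hypothetical; in practice they vary per action type, which is the whole point:

```python
# Sketch of tiered confidence routing. Thresholds and handlers are
# hypothetical stand-ins; real values depend on the action's blast radius.
def execute(action):            return ("executed", action)
def notify_human_async(action): pass                      # e.g. Slack ping
def queue_for_review(action):   return ("queued", action)
def block_and_flag(action):     return ("blocked", action)

def route(action, confidence: float):
    if confidence >= 0.95:
        return execute(action)               # high: execute immediately, log
    if confidence >= 0.80:
        result = execute(action)
        notify_human_async(action)           # medium: execute, notify async
        return result
    if confidence >= 0.50:
        return queue_for_review(action)      # low: human review first
    return block_and_flag(action)            # very low: investigate
```

The thresholds live in config, not code, so the tiers stay adjustable as the instrumented distribution of outcomes accumulates.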

3. Approval Queues for High-Blast-Radius Actions

Some actions are expensive to undo. Refunds. Bulk database updates. External notifications to customers. Financial transactions.

These don't belong in the autonomous path. They belong in an approval queue.

The pattern is straightforward: the agent prepares the action, stages it with full context (inputs, reasoning, predicted outcome), and waits. A human approves, rejects, or edits. The agent executes the final version.
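Staged this way, a queue entry is just a record with full context and a status that only a human can advance. The field names below are a hypothetical sketch of that shape:

```python
import datetime
import uuid

# Sketch of staging an action into an approval queue. Field names are
# hypothetical; the key property is that the agent persists full context
# and executes nothing until a human decision comes back.
def stage_action(queue: list, action: str, inputs: dict,
                 reasoning: str, predicted_outcome: str) -> str:
    entry = {
        "id": str(uuid.uuid4()),
        "staged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        "inputs": inputs,
        "reasoning": reasoning,
        "predicted_outcome": predicted_outcome,
        "status": "pending",    # pending -> approved | rejected | edited
    }
    queue.append(entry)
    return entry["id"]

def resolve(queue: list, entry_id: str, decision: str, edited_action=None):
    entry = next(e for e in queue if e["id"] == entry_id)
    entry["status"] = decision
    if decision == "edited":
        entry["action"] = edited_action
    # Execution happens only here, on the human-approved final version.
    if entry["status"] in ("approved", "edited"):
        return ("execute", entry["action"])
    return ("skip", None)
```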

This is what the 3PL billing engine's exception queue actually is. When an invoice exceeds defined parameters — a charge over a threshold, a customer record without a clean match, a rate calculation that diverges from historical patterns — the engine doesn't refuse to work, and it doesn't page anyone. It completes its analysis, stages the proposed invoice with full context, and waits. A billing specialist reviews the exception, decides whether to release or edit or escalate, and the system executes the final decision. Every step is logged.

The same pattern shows up in the vehicle inventory platform: when dealer holds conflict — two dealers placing overlapping holds on the same unit — the system doesn't guess. It surfaces the conflict to a human with the full context and the proposed resolutions.

Approval queues are the single biggest safety lever for production agents. They also surface systematic patterns — if a human is approving the same type of action a thousand times in a row, that action probably shouldn't need approval. The queue itself becomes a dataset for deciding what's safe to automate further.

4. Idempotency and Reversibility

Every action an agent takes should be one of two things: idempotent or reversible. Preferably both.

Idempotent means running the same operation twice produces the same result as running it once. Creating invoice ID 12345 is idempotent if the system checks for an existing invoice ID before writing a duplicate. Posting a Slack message is not idempotent unless you deduplicate on message content.

Reversible means you can undo the action. Writing to a database with a soft-delete column is reversible. Sending a physical package is not.

The combination matters because agents retry. They retry because networks fail, because model calls time out, because the orchestration layer restarts mid-workflow. If your agent is not idempotent, retries create duplicates. If your agent is not reversible, you cannot recover from mistakes that pass the confidence gate.

The 3PL billing cutover is what this pattern looks like when executed seriously. Before the new engine touched a single live invoice, it ran in parallel with the legacy system for two weeks. Every invoice generated by the new engine was compared against the legacy output, line by line. That comparison is only possible because the engine was idempotent from day one — you can feed it the same billing cycle twice and get byte-identical output. If it weren't, you couldn't run the comparison. You'd just have two different invoices and no way to know which was correct.

Design rule: no agent action ships to production unless it satisfies one of these properties. Usually this means idempotency keys on every write and soft-delete or reversal flags on every mutation.
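Both properties fit in a few lines. The sketch below uses a dict in place of a database table with a unique constraint on the idempotency key; the field names are illustrative:

```python
# Sketch of an idempotency key guarding a write, plus a soft-reversal flag.
# The ledger dict stands in for a table with a unique key constraint.
def write_invoice(ledger: dict, idempotency_key: str, invoice: dict) -> dict:
    # A retry with the same key returns the original record instead of
    # writing a duplicate, so the operation is safe to run twice.
    if idempotency_key in ledger:
        return ledger[idempotency_key]
    ledger[idempotency_key] = {**invoice, "reversed": False}
    return ledger[idempotency_key]

def reverse_invoice(ledger: dict, idempotency_key: str) -> None:
    # Reversal flips a flag rather than deleting, so the mistake is
    # undoable and the history stays intact.
    ledger[idempotency_key]["reversed"] = True
```

A natural key like `cycle-2026-04:customer-17` makes retries collapse onto the same record no matter which layer of the stack issued them.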

5. Append-Only Audit Trail

Every decision the agent makes gets logged. Not a summary. The full inputs, the model outputs, the confidence scores, the action taken, and the resulting state change.

The log is append-only — you can never delete or modify an entry. This is non-negotiable for two reasons.

First, when something goes wrong, you need to reconstruct what happened. A summary log lies by omission. The full log tells you exactly what the agent saw and why it made the call it made. Before the 3PL rebuild, billing errors required someone to trace back through raw database records — a multi-day job that sometimes couldn't be completed at all, because the data needed to reconstruct the decision simply wasn't there. Now every billing event, adjustment, and exception is logged with timestamp, actor, and reason. Compliance reviews that used to raise the same flags every quarter close the first time.

Second, if you fine-tune or update the underlying model later, you need ground-truth data for evaluation. The audit log is your evaluation dataset.

Structure the log so it's queryable. Every entry should have:

  • Agent version
  • Model version (where applicable)
  • Input hash
  • Full input payload
  • Full model or rules output
  • Confidence score
  • Action executed (or queued, or rejected)
  • Downstream effect — what changed in the business system
  • Correlation ID linking to upstream triggers
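As a structure, an entry with those fields can be sketched as follows. The field names mirror the list above; append-only is enforced here simply by never exposing an update or delete path, and in a real system by database permissions as well:

```python
import hashlib
import json

# Sketch of a structured, append-only audit entry matching the field list
# above. Field names are illustrative, not a fixed schema.
def append_entry(log: list, *, agent_version: str, model_version: str,
                 payload: dict, output: dict, confidence: float,
                 action: str, effect: str, correlation_id: str) -> dict:
    entry = {
        "agent_version": agent_version,
        "model_version": model_version,
        # Hash of the canonicalized input for fast lookup and dedup.
        "input_hash": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest(),
        "input_payload": payload,
        "output": output,
        "confidence": confidence,
        "action": action,
        "effect": effect,
        "correlation_id": correlation_id,
    }
    log.append(entry)
    return entry
```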

When an investigation happens, the correlation ID is what lets you trace the full causal chain across systems. Without it, you're guessing. The vehicle inventory platform's audit trail — every order, every hold, every status change, timestamped and attributed — is what lets the manufacturer resolve dealer disputes definitively. Before the rebuild, there was no system of record. A dispute was a conversation.

6. Circuit Breakers

Rate limits. Error thresholds. Kill switches.

The pattern traces back to Michael Nygard's Release It! — the canonical text on production stability — and has been standard in distributed systems for two decades. It applies directly to agents.

Every agent deployment has three kill switches baked in:

  1. Rate limit. The agent cannot execute more than N actions in window W. If it tries, the next action blocks and alerts.
  2. Error threshold. If the agent's actions result in more than X% errors over the last N actions, the agent pauses itself.
  3. Manual kill. A single config change or API call stops the agent cold. No graceful shutdown, no confirmation dialog.

The reason for three is redundancy. The rate limit protects against runaway loops. The error threshold protects against silent drift. The kill switch protects against everything else.
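The three switches fit naturally into one guard that every action passes through. The limits below are hypothetical defaults, not recommendations:

```python
import collections
import time

# Sketch of the three kill switches in one guard. All limits are
# hypothetical; real values depend on the agent's action rate and risk.
class CircuitBreaker:
    def __init__(self, max_actions=100, window_s=60.0,
                 max_error_rate=0.05, min_sample=20):
        self.times = collections.deque()                      # action times
        self.results = collections.deque(maxlen=min_sample)   # recent ok/err
        self.max_actions, self.window_s = max_actions, window_s
        self.max_error_rate = max_error_rate
        self.killed = False                                   # manual kill

    def allow(self) -> bool:
        if self.killed:
            return False                      # 3: manual kill, no grace
        now = time.monotonic()
        while self.times and now - self.times[0] > self.window_s:
            self.times.popleft()
        if len(self.times) >= self.max_actions:
            return False                      # 1: rate limit tripped
        errs = sum(1 for ok in self.results if not ok)
        if (len(self.results) == self.results.maxlen
                and errs / len(self.results) > self.max_error_rate):
            return False                      # 2: error threshold tripped
        self.times.append(now)
        return True

    def record(self, ok: bool):
        self.results.append(ok)
```

Tripping each branch in staging is exactly the kill-switch test the deployment pipeline should run.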

All three have to be testable. The deployment pipeline should include an exercise that trips each one in a staging environment, verifies the agent stops, and verifies it alerts. If you haven't tested the kill switch, you don't have one.

7. Dry-Run Mode

Before any agent runs in production, it runs in dry-run mode. Dry-run means:

  • Full input pipeline
  • Full model or rules inference
  • Full decision logic
  • Full intended action generation
  • Zero side effects

The agent produces a report of what it would have done. A human reviews the report. Discrepancies between intended behavior and actual output surface here, before any customer is affected.
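Structurally, dry-run is a single flag at the execution boundary: the whole pipeline runs, and only the last step is swapped for a recorder. The `decide` and `execute` callables below are hypothetical stand-ins for your pipeline:

```python
# Sketch of a dry-run executor: full pipeline, zero side effects.
# decide() and execute() are hypothetical stand-ins for the real
# inference and action layers.
def run(inputs, decide, execute, dry_run=True):
    report = []
    for item in inputs:
        action = decide(item)        # full inference + decision logic
        if dry_run:
            report.append(("would_execute", action))   # record, don't act
        else:
            report.append(("executed", execute(action)))
    return report
```

Because the flag sits at the boundary rather than inside the decision logic, the dry-run report reflects exactly what production would have done.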

This is exactly the two-week parallel testing phase on the 3PL billing cutover. New engine runs. Legacy engine runs. Zero live side effects from the new engine — it's generating invoices into a staging ledger that nobody is collecting against. Every discrepancy is investigated. Every edge case is resolved. Only after the diff is small and explainable does the new engine take over. The result: zero rollbacks post-deployment.

Dry-run mode doesn't go away after launch. It's the default for every model change, every prompt change, every tool addition. You run dry-run against a week of historical inputs and compare outputs to what actually happened. If the diff is small and explainable, you promote. If it's large or weird, you investigate.

This is the single most underused pattern in the industry. Teams ship model updates directly to production and find out about regressions from customer complaints. That's backwards.

8. Drift Detection

Agents drift. The model doesn't change, but the world does. Input distributions shift. Edge cases that were rare become common. A prompt that worked last quarter produces subtly worse results this quarter because the data has moved.

Instrument for drift. The specific metrics depend on the agent, but typically include:

  • Distribution of confidence scores over time
  • Distribution of output categories over time
  • Rate of human overrides on approval queue items
  • Rate of reversal actions per action type

When a metric shifts beyond a configured bound, the system alerts. It doesn't wait for a human to notice.
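The simplest version of such a check compares a recent window of a metric against a baseline. The sketch below uses a mean shift on confidence scores with a hypothetical bound; production systems often use PSI or a KS test instead, but the alerting shape is the same:

```python
import statistics

# Sketch of a drift check on the confidence-score distribution: alert when
# a recent window's mean shifts beyond a configured bound from baseline.
# The bound is a hypothetical example; PSI or KS tests are common upgrades.
def drifted(baseline: list, recent: list, max_mean_shift: float = 0.1) -> bool:
    shift = abs(statistics.mean(recent) - statistics.mean(baseline))
    return shift > max_mean_shift
```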

The principle is the same one the Google SRE book formalized for traditional services: you don't measure whether the system is up, you measure whether it's doing what it's supposed to do, and you build alerts off the second definition. Drift detection applied to agents is a straight extension of that idea — and the one that separates agents that work at launch from agents that work a year later. Most teams miss this because drift is quiet. The agent still runs, still produces outputs, still passes checks. It just gets worse.

The Division of Labor

None of these patterns are sophisticated. They're unglamorous. Most of the work of building production-grade agents is infrastructure work — queues, logs, permission systems, instrumentation — not model work.

The reason is structural. The model is the cheapest part of the system to improve. Swap models, update prompts, fine-tune. The infrastructure is what determines whether a model change is safe to ship.

An agent with a great model and no infrastructure is a demo. An agent with a mediocre model and this infrastructure is a production system. Over time, the second one wins.

What This Looks Like When You Get It Right

A well-architected agent system looks boring from the outside. It runs. It ships actions. It doesn't page anyone. When something is wrong, someone knows within minutes — because a circuit breaker tripped, a drift alert fired, a confidence threshold routed something to a human queue. The anomaly is caught and handled before it compounds.

The team running it spends almost no time on the agent. They spend time on the work the agent can't do — the judgment calls, the edge cases, the decisions that require context the model doesn't have. On the 3PL side, that's dispute resolution and customer relationship work. On the manufacturing side, that's dealer relationships and production planning.

That's what "the lights are off" actually looks like in practice. Not a magic black box. A deeply instrumented, carefully bounded, continuously monitored production system that does its job and surfaces the right problems to the right humans at the right time.

The agents that make the news — the Air Canada chatbot, the Chevy dealership bot that was prompt-injected into offering a Tahoe for a dollar, the lawyers sanctioned for citing fabricated cases ChatGPT invented in Mata v. Avianca — aren't the ones that weren't smart enough. They're the ones that were trusted without being verified. Shipped without the infrastructure that catches quiet failures before they become loud ones.

Build the infrastructure first. Ship the model second. That order matters.


We architect and build production AI agent systems with these patterns baked in from day one. See the 3PL billing engagement or vehicle inventory platform for how the patterns show up in practice — or book a discovery call if you're building, or rebuilding, autonomous systems for your operations.

Find out where your business stands.

Take the free AI Maturity Assessment — 25 questions, your score, and a roadmap.

Get the Free Assessment