Most "AI governance frameworks" are written for organizations with a Chief AI Officer, a steering committee, and a 40-page policy library. If you're a 50-to-500-person company in a regulated industry trying to ship one or two production workflows, that framework is the wrong shape. You need something an operator can hold in one hand.
This is the checklist we use when we ship systems into financial services, legal, and professional services teams. Twelve items, grouped into the four things compliance and finance actually ask about: policy, audit, evaluation, and operations.
Policy
1. A written AI usage policy that names what the system is allowed to do.
Not "use AI responsibly". A specific scope statement: which workflows, which data classifications, which model providers, which decisions a human must make. One page. Reviewed by counsel. Reviewed annually.
2. A model-choice rubric that anyone on the team can apply.
When does a workflow warrant a frontier model versus a local one? When does it require a human reviewer on every output versus a sample? When is structured output required versus free text? If the answer to "which model" is "ask the AI lead", you don't have a rubric — you have a bottleneck.
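To make "anyone on the team can apply it" concrete, here is a minimal sketch of a rubric written down as code instead of tribal knowledge. The category names, thresholds, and model labels are illustrative assumptions, not a recommendation; the point is that the routing decision is mechanical and documented.

```python
# Sketch of a model-choice rubric anyone can run. All category names,
# thresholds, and model labels are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class WorkflowProfile:
    data_classification: str      # "public", "internal", or "restricted"
    decision_impact: str          # "low", "medium", or "high"
    needs_structured_output: bool

def choose(profile: WorkflowProfile) -> dict:
    # Restricted data stays inside the boundary: local model only.
    model = "local-model" if profile.data_classification == "restricted" else "frontier-model"
    # High-impact decisions get a reviewer on every output; the rest are sampled.
    review = "every_output" if profile.decision_impact == "high" else "sampled"
    return {
        "model": model,
        "review": review,
        "output": "json_schema" if profile.needs_structured_output else "free_text",
    }

# An internal, high-impact workflow that needs structured output:
print(choose(WorkflowProfile("internal", "high", True)))
# {'model': 'frontier-model', 'review': 'every_output', 'output': 'json_schema'}
```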
3. A list of the workflows you have chosen not to automate, and why.
The exclusions are the policy. Anyone can write what's allowed. The mature documents name what was considered, evaluated, and explicitly declined — usually because the failure mode was unacceptable, not because the model couldn't do it. Auditors trust this kind of document.
Audit
4. Full prompt-and-response logging, with at least seven years' retention.
Every model call captured: input, output, model identifier, timestamp, user, workflow. Stored where compliance can query it. Seven years matches the retention windows that most regulated industries already use for related records — match that, don't invent your own.
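As a sketch of what "captured" means in practice, here is the shape of one log record, written as a JSON line. The storage target, field values, and model identifier are placeholders; retention and queryability are properties of wherever compliance actually keeps the records, not of this code.

```python
# Sketch of the log record item 4 describes: one row per model call, with
# every field named in the text. A JSON-lines file is used for illustration.
import json
from datetime import datetime, timezone

def log_model_call(path, *, workflow, user, model, prompt, response):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "workflow": workflow,
        "user": user,
        "model": model,       # the exact model identifier, not just the family name
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_model_call(
    "model_calls.jsonl",
    workflow="invoice-triage",
    user="a.smith",
    model="example-model-2026-01-01",   # placeholder identifier
    prompt="Classify this invoice ...",
    response='{"category": "utilities"}',
)
```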
5. PII redaction at the boundary, not in the model.
Sensitive data should be scrubbed before it enters a third-party API call, not merely "asked" not to be retained. The model provider's promises are a contract; redaction at the boundary is a control. Auditors will care about the difference.
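A minimal sketch of that boundary, assuming a simple regex pass for illustration (production systems usually pair patterns with a dedicated PII-detection step) and a placeholder call_model standing in for whatever provider client you actually use:

```python
# Sketch of redaction at the boundary: scrub sensitive fields before the text
# ever reaches a third-party API. Patterns and call_model are placeholders.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def call_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for your provider client")

def safe_call(prompt: str) -> str:
    # The control is that redaction happens here, on your side of the
    # boundary, not a contractual promise that the provider won't retain it.
    return call_model(redact(prompt))

print(redact("Customer jane.doe@example.com, SSN 123-45-6789, called today."))
# Customer [EMAIL], SSN [SSN], called today.
```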
6. A versioned record of which model handled which output.
"This memo was generated by Claude Sonnet 4.6 on 2026-04-12 against eval v3.1." Six months later when someone asks why the system behaved differently, you need to be able to answer in one minute, not one week. Model versions drift; your record of which version did what shouldn't.
Evaluation
7. An evaluation set, owned by a domain expert, versioned alongside code.
Fifty to two hundred examples that a human expert has labeled with what the right answer should be. Stored in your repo. Run on every model change. This is the single artifact that distinguishes a system you trust from a demo you hope works.
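Here is a minimal sketch of the idea: labeled examples stored as a versioned file in the repo, and a runner that reports how many the current system gets right. The examples, the keyword stub standing in for the real system, and the exact-match scoring are all illustrative assumptions.

```python
# Sketch of an eval set and runner. The real file lives in the repo, is
# labeled by a domain expert, and is versioned like code.
EXAMPLES = [
    {"input": "Invoice from City Power, $412.19", "expected": "utilities"},
    {"input": "Retainer letter, Smith v. Jones",  "expected": "legal"},
]

def run_workflow(text: str) -> str:
    # Stand-in for the system under test; replace with your actual pipeline.
    return "utilities" if "Power" in text else "legal"

def run_eval(examples) -> float:
    # Exact-match scoring for simplicity; many workflows need a rubric or an
    # expert spot-check instead.
    passed = sum(1 for ex in examples if run_workflow(ex["input"]) == ex["expected"])
    return passed / len(examples)

# Run on every model or prompt change; block the change if the score drops.
print(f"pass rate: {run_eval(EXAMPLES):.0%}")   # pass rate: 100%
```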
8. A baseline measurement of the workflow before AI touches it.
How long it takes today, who does it, what the error rate is, what the cost per unit is. Without a baseline, you have no honest way to claim a pilot worked. Measure it before you start, even if the measurement is rough.
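One way to keep the baseline honest is to write it down as a record before the pilot starts. A rough sketch, with made-up figures:

```python
# Sketch of a baseline record for item 8: four numbers captured before the
# pilot starts, so the after-picture has something honest to compare against.
from dataclasses import dataclass, asdict

@dataclass
class Baseline:
    workflow: str
    minutes_per_unit: float
    handled_by: str
    error_rate: float       # fraction of units needing rework
    cost_per_unit: float    # fully loaded, in your currency

baseline = Baseline(
    workflow="contract-intake",
    minutes_per_unit=35.0,
    handled_by="paralegal team (3 FTE)",
    error_rate=0.06,
    cost_per_unit=42.0,
)
print(asdict(baseline))
```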
9. A named worst-case failure mode and the gate that catches it.
Every system makes mistakes. Mature ones name the mistake that would matter most — losing a customer, missing a regulatory deadline, recommending a wrong action — and design a specific human-in-the-loop check for that exact failure. Not for "errors in general". For the one error that would matter.
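A minimal sketch of such a gate, assuming for illustration that the named worst case is a missed regulatory deadline: any output that touches a filing date is held for a human rather than released. The detection rule and the review queue are placeholders; the real gate is built around whatever your one unacceptable error actually is.

```python
# Sketch of a gate built for one named failure mode, not "errors in general".
import re

DEADLINE_PATTERN = re.compile(r"\b(file|filing|deadline|due)\b", re.IGNORECASE)

def gate(output: str) -> str:
    # Hold anything that touches the named worst case for human review.
    if DEADLINE_PATTERN.search(output):
        return "hold_for_human_review"
    return "release"

print(gate("Recommend filing the amended return before the April deadline."))
# hold_for_human_review
```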
Operations
10. Per-workflow cost caps and alerts.
Every workflow has a monthly budget. The system stops or routes to a cheaper model when the budget is exhausted, and the owner gets paged. AI costs creep — not because of any single decision, but because nobody owns the aggregate. Set the cap before launch, not after the bill.
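A sketch of what that control can look like in code. The budget figure, model labels, and alerting hook are illustrative assumptions; the behavior that matters is degrade before the cap, block and page at the cap.

```python
# Sketch of a per-workflow cost cap: spend tracked against a monthly budget,
# calls routed to a cheaper model as the cap approaches, owner paged at the cap.
BUDGETS = {"invoice-triage": 500.00}   # monthly cap in your currency
spend = {"invoice-triage": 0.0}

def alert_owner(workflow: str, message: str) -> None:
    print(f"ALERT [{workflow}]: {message}")   # stand-in for paging the owner

def route(workflow: str, estimated_cost: float) -> str:
    budget = BUDGETS[workflow]
    if spend[workflow] + estimated_cost > budget:
        alert_owner(workflow, "monthly budget exhausted; call blocked")
        return "blocked"
    spend[workflow] += estimated_cost
    # Degrade to a cheaper model before the cap is actually hit.
    return "cheaper-model" if spend[workflow] > 0.8 * budget else "primary-model"

print(route("invoice-triage", 2.00))   # primary-model
```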
11. A named operator who runs the workflow after launch.
Not the consultant who built it. Not "the AI team". A specific person on your side who monitors the system, updates the eval set, reviews the quarterly report, and answers when something looks off. The handoff is the product. If there's no operator, you're buying a toy.
12. A quarterly review cadence, written for the board.
One page per workflow. What it does, what it cost, what its accuracy was, what was changed since last quarter, what's planned next quarter. The discipline of having to write this every ninety days is what keeps the program legible — and legible programs survive leadership changes, compliance reviews, and budget cycles.
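If it helps, the one-pager can be a fixed template so every quarter answers the same five questions in the same order. A sketch, with placeholder answers:

```python
# Sketch of the quarterly one-pager as a fixed template.
FIELDS = [
    "What it does",
    "What it cost this quarter",
    "What its accuracy was (against the eval set)",
    "What changed since last quarter",
    "What is planned next quarter",
]

def render_report(workflow: str, answers: list) -> str:
    lines = [f"Quarterly review: {workflow}"]
    lines += [f"- {field}: {answer}" for field, answer in zip(FIELDS, answers)]
    return "\n".join(lines)

print(render_report("contract-intake", [
    "Extracts key terms from inbound contracts for attorney review",
    "$1,240 in model spend",
    "94% on eval v3.1",
    "Moved date parsing to structured output",
    "Expand eval set to 150 examples",
]))
```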
How to use this
For a workflow already in production: rate yourself one to three on each item (1 = absent, 2 = exists but informal, 3 = documented and reviewed). The items at 1 are where you're exposed. Fix those first.
For a workflow in pilot: every item should be at a 2 or better before you go to production. Items at 1 are reasons to delay, not reasons to push through.
For a workflow you're scoping: if you can't see how three or more of these items would reach a 2 within the first four weeks, the workflow probably isn't ready for an AI pilot yet. Run a readiness assessment first.
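The whole self-assessment fits in a few lines. A sketch, with made-up ratings and abbreviated item names:

```python
# Sketch of the 1-3 self-assessment: items at 1 are the exposure, and a pilot
# needs every item at 2 or better before production. Ratings are an example.
RATINGS = {
    "usage_policy": 3, "model_rubric": 2, "exclusion_list": 1,
    "call_logging": 3, "boundary_redaction": 2, "model_provenance": 2,
    "eval_set": 2, "baseline": 1, "failure_gate": 2,
    "cost_caps": 3, "named_operator": 2, "quarterly_review": 1,
}

exposed = [item for item, score in RATINGS.items() if score == 1]
ready_for_production = all(score >= 2 for score in RATINGS.values())

print("fix first:", exposed)
print("ready for production:", ready_for_production)
```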
Twelve items. None of them are about which model you use. None are about how clever the prompt is. They're about whether the system is one a regulated business can defend — to its board, to its customers, and to whoever shows up with an audit checklist.