AI & Your Organization — Part 3 of 3

The Audit Trail

Why hosted AI can't prove what it did


Imagine being asked, a year from now, to produce evidence of exactly what your AI system did on a specific Tuesday in March. Which model version ran. What the system prompt was. What data the model saw. Whether your organization's inputs were retained or used to train the next version. Whether another customer's traffic, batched together with yours, changed the answer you got.

If you're running a hosted AI service, you cannot produce most of that evidence. Not because you haven't tried, but because the vendor's architecture doesn't generate it in a form you can attest to. And increasingly — under the EU AI Act, HIPAA, ISO/IEC 42001, and most professional audit standards — "we couldn't produce it" is not going to be an acceptable answer.

This is the third and final piece in a series that started with who deploys the model (ethics) and moved to what you build around it (architecture). This one is about what you can prove. It is the most practical of the three and probably the most consequential for anyone whose AI work will eventually meet a regulator, a court, a funder, or a board.

What an Audit Trail Actually Is

An audit trail isn't a log file. A log file is one ingredient. An audit trail is a complete, attestable chain of evidence that answers four questions about any given AI interaction:

1. What ran? Which exact model weights, which system prompt, which retrieval context, which tools were available.
2. What did it see? The full input, including anything injected from RAG, memory, or tool responses.
3. What did it output? The full response, not a summary.
4. Can you prove the chain? Hash the model weights. Hash the inputs. Hash the outputs. Store them somewhere tamper-evident. Produce them on demand, months or years later, exactly as they were.

For regulated work, the fourth requirement is the interesting one. Anyone can keep logs. The question is whether the logs, the model, and the environment are preserved in a way that a third party — an auditor, a regulator, opposing counsel — would accept as evidence of what actually happened. That's where hosted AI runs into structural problems.
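The four questions can be made concrete. Here is a minimal sketch of a hash-chained audit record — the function and field names are illustrative, not any real library's API — where each entry commits to the previous one, so a later edit to any record is detectable:

```python
import hashlib
import json
import time

def sha256_hex(data: bytes) -> str:
    """Content hash used throughout the chain."""
    return hashlib.sha256(data).hexdigest()

def make_record(model_hash: str, prompt: str, output: str,
                prev_record_hash: str) -> dict:
    """One audit-trail entry: what ran, what it saw, what it produced,
    chained to the previous entry."""
    body = {
        "timestamp": time.time(),
        "model_weights_sha256": model_hash,            # "what ran?"
        "input_sha256": sha256_hex(prompt.encode()),   # "what did it see?"
        "output_sha256": sha256_hex(output.encode()),  # "what did it output?"
        "prev_record_sha256": prev_record_hash,        # "can you prove the chain?"
    }
    body["record_sha256"] = sha256_hex(
        json.dumps(body, sort_keys=True).encode())
    return body

def verify_chain(records: list) -> bool:
    """Recompute every hash; any edited record breaks the chain."""
    prev = "0" * 64
    for rec in records:
        if rec["prev_record_sha256"] != prev:
            return False
        body = {k: v for k, v in rec.items() if k != "record_sha256"}
        if sha256_hex(json.dumps(body, sort_keys=True).encode()) != rec["record_sha256"]:
            return False
        prev = rec["record_sha256"]
    return True
```

A real deployment would also anchor the chain head in an external tamper-evident store (a write-once bucket, a transparency log), since a chain you fully control can be regenerated from scratch. The point of the sketch is the shape of the evidence, not the storage.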

What Hosted AI Cannot Prove

The model version you queried is not guaranteed to be the model version that ran. Stanford and Berkeley researchers documented this in their 2023 paper How Is ChatGPT's Behavior Changing Over Time? Comparing the March and June 2023 releases of GPT-4, they found the model's accuracy on prime number identification dropped from 84% to 51%. Code-generation output that was directly executable dropped from 52% to 10%. None of this was announced as a version change. The API endpoint name stayed the same. The authors noted that this makes it "challenging, if not impossible, to reproduce results from the 'same' LLM."

This isn't a historical artifact. A 2025 framework for tracking behavioral drift found systematic drift across GPT-4, Claude 3, and Mixtral — 23% variance in GPT-4 response length, 31% inconsistency in Mixtral instruction adherence. Most LLM providers do not offer full transparency into model updates. For an audit trail, that means the very first question — what ran? — has no verifiable answer.

The same prompt does not always produce the same output, even at temperature zero. Setting temperature=0 makes the token-selection rule deterministic, but it doesn't make the underlying computation deterministic. Floating-point arithmetic on parallel hardware is not associative — (a + b) + c and a + (b + c) can produce slightly different results depending on operation order — and those last-bit differences propagate through a neural network. More importantly, hosted AI services batch your request together with other customers' requests, and the composition of the batch changes which kernels run, which can change the output. OpenAI explicitly documents that their API can only be "mostly deterministic." Anthropic's documentation notes the same limitation.

You cannot verify what happens to your inputs. Most providers offer data-handling policies, and enterprise tiers often include contractual commitments that your prompts won't be used for training. Those commitments are real, but they are also unverifiable from your side. For PHI under HIPAA, for attorney-client privileged material, for regulated financial data, "we were told it wasn't logged" and "we can prove it wasn't logged" are two very different statements.

You cannot audit multi-tenancy isolation. Historical incidents have shown this isn't theoretical — a March 2023 ChatGPT bug briefly exposed other users' chat titles. Any shared infrastructure carries some non-zero risk of cross-tenant leakage, and you have no ability to independently verify the isolation controls.

The Regulatory Frame Is Closing

For most of the past three years, these problems could be written off as theoretical. That window is closing.

EU AI Act — Articles 12, 13, 19, 50

Takes full effect August 2026. High-risk AI systems must "automatically generate logs of their operation throughout their lifecycle." High-risk categories include healthcare, financial services, insurance, HR, and critical infrastructure. Deployers must be given instructions to collect, store, and interpret data logs. Automatically generated logs must be retained.

ISO/IEC 42001

The first international AI management system standard (published December 2023) requires documented evidence of AI system specifications, data provenance, operational constraints, and incident logs. Certification requires external audit against these requirements.

NIST AI Risk Management Framework 1.0

While voluntary, increasingly cited as the baseline for responsible AI practice in US federal procurement and corporate governance. GOVERN 1.7 and 6.1 require "processes and procedures for conducting regular assessments and reviews of AI system behavior" — which depends on operation logs you can actually produce.

HIPAA

Any vendor that creates, receives, maintains, or transmits PHI is a Business Associate requiring a BAA. If your AI runs entirely on your own infrastructure, the AI vendor isn't a Business Associate because it never touches PHI. The compliance math gets dramatically simpler — and smaller organizations benefit most, because the bureaucratic overhead of cloud compliance scales poorly downward.

The direction across all of these frameworks is the same: AI operation must be logged, logs must be retained, and the organization deploying the AI must be able to produce the evidence. The venue is shifting from "we promise we're doing this well" to "show us the receipts."

What Local Hosting Actually Gives You

Running your own AI doesn't automatically produce a good audit trail. You still have to build the logging, the retention, the hash chain, and the attestation. But it makes all of those things possible in a way that hosted AI doesn't.

  • Model provenance by hash. You can cryptographically hash the weights file. Every query can be tagged with that hash. If weights change, the hash changes, and the audit trail records it.
  • System prompts you control and version. No hidden instructions from a vendor. What you wrote is what ran, and you can prove it because the prompt is in your git repository with a commit hash.
  • Full logging with real retention guarantees. Every input, every output, every retrieval context, every tool call — logged in a store whose retention policy is your own and whose integrity you can attest to.
  • Reproducibility, if you need it. With batch-invariant kernels and controlled hardware, bit-for-bit reproducibility is becoming achievable for self-hosted deployments. On a hosted API, it isn't possible.
  • No multi-tenancy to audit. Your requests don't share infrastructure with anyone else's. There's no tenant-isolation question to answer because there are no other tenants.
  • A clean chain of custody. The model weights, the inference software, the system prompt, the logs, the retention policy — all under one organization's control, all auditable by one organization's internal audit function.
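Model provenance by hash can be as simple as streaming the weights file through SHA-256 at load time and stamping every logged query with the result. A sketch, with illustrative paths and field names:

```python
import hashlib
import json

def hash_weights(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the weights file through SHA-256 so multi-gigabyte
    files never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def tag_query(log_path: str, weights_hash: str, prompt_commit: str,
              user_input: str, model_output: str) -> None:
    """Append one provenance-tagged record to an append-only JSONL log."""
    record = {
        "model_weights_sha256": weights_hash,   # changes iff the weights change
        "system_prompt_commit": prompt_commit,  # git commit of the prompt repo
        "input": user_input,
        "output": model_output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Hash once at model load, not per query; the per-query cost is then a dictionary write and a file append, which is negligible next to inference.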

The Honest Limits

Local hosting is necessary but not sufficient. Plenty of self-hosted deployments have terrible audit trails because nobody built the logging, nobody rotated the keys, nobody preserved the weights when they upgraded. The infrastructure has to be paired with the discipline. Without the discipline, the trail isn't actually there.

Cloud providers are responding. Enterprise AI tiers increasingly offer BAAs, SOC 2 reports, ISO 27001 certifications, versioned model endpoints, and retention guarantees. Azure OpenAI Service, AWS Bedrock, and Anthropic's enterprise offerings have made real progress here. For many organizations, a well-configured enterprise cloud deployment is genuinely audit-ready — with the caveat that you are still trusting the attestations rather than verifying them yourself. Whether that trust is sufficient depends on what you're regulated against and who has to sign the compliance document at the end.

Determinism is hard everywhere. Even on self-hosted infrastructure, achieving true bit-for-bit reproducibility requires careful work — batch-invariant kernels, pinned CUDA versions, controlled hardware. A 2024 study on non-determinism in "deterministic" LLM settings found alarming variance even on local setups. Local gives you the ability to chase this problem; it doesn't solve it automatically.

Closing the Series

Three articles, one argument. The first was about values: who deployed the model and under what incentives matters as much as what the model does. The second was about architecture: what you build around the model matters as much as the model itself. This third one is about verifiability: what you can prove about either of those things, to someone who wasn't there and doesn't take your word for it.

All three are the same problem viewed from different angles. The AI system is not the model. It is the model plus the deployment plus the incentives plus the evidence. The organizations that treat it as all four are the ones that will still be standing when the compliance questions arrive — and for the mission-driven work that most deserves to be done carefully, those questions are already arriving.

The chatbot was the demo. What comes next is systems you can reason about, architectures you can defend, and a chain of evidence you can hand to someone else and have them believe it.
