A framework for mapping where AI agency is appropriate - and where it isn't - within enterprise financial reporting workflows.
Enterprise financial reporting. Systems built over decades, serving statutory and operational needs. Users range from accountants filing compliance reports to analysts building ad-hoc queries. The tooling was designed for technical users; the business now needs it accessible to everyone.
Market pressure to integrate AI arrived faster than the legacy architecture could adapt. Five years of incremental modernization had left the platform only partly compatible with emerging AI capabilities. The question wasn't whether to use AI - it was where AI agency is appropriate and where it creates unacceptable risk.
Two forces pulling in opposite directions. On one side: usability debt. Too technical, poorly connected, weak discoverability. Users needed progressive disclosure, simplified workflows, reduced cognitive load.
On the other side: pressure toward full AI autonomy. "Make it agentic" - but statutory financial reports carry legal and compliance weight. AI hallucinations in an Income Statement aren't a UX inconvenience; they're an audit finding.
The question became: how much autonomy is appropriate - and how do we design the boundary between what AI handles and what stays with the human?
This case study documents the design framework I developed to navigate that tension - a graduated model of AI autonomy calibrated to the stakes of each task.
AI can bridge the gap between users and systems, reducing cognitive load while keeping the user in control. The example shows three levels of the shift: from requiring users to speak the system's language, to the system understanding theirs.
Ethan Mollick (Co-Intelligence) calls this 'cyborg' mode, where AI augments human capability rather than replacing it.
AI can adapt how it explains. Traditional systems offer one explanation for everyone. In complex ERPs, there is no single user. The system should reframe insights and deliver them in each person's own language, so that each person can act on them.
"Community lunch expenses reclassified from Personnel to Discretionary" means compliance impact to an accountant, changed acquisition cost to a sales rep, and remapped data source to an analyst. LLMs make it possible to match the explanation to the expertise - in real time.
Matching language to expertise isn't enough. People rarely know what they don't know. An accountant can evaluate whether a reclassification is correct - but not whether the underlying dataset is. The system must frame the unknown within the known. When the system uses a LEFT JOIN on invoice_date, an accountant doesn't need to hear about joins. She needs to hear "we used the sale date" - and she'll tell you "no, use the tax point date." A technical decision, framed as a domain decision she can assess.
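The reframing in the join example can be sketched in a few lines. This is a minimal illustration, not the production design: the `TechnicalDecision` class, the `DOMAIN_TERMS` glossary, and the `tax_point_date` field name are all hypothetical, introduced only to show a technical choice being surfaced as a question a domain expert can answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechnicalDecision:
    """A choice the system made while building a report (hypothetical model)."""
    operation: str  # e.g. "LEFT JOIN"; kept for the audit trail, never shown to the user
    field: str      # e.g. "invoice_date"


# Hypothetical glossary mapping technical field names to the user's vocabulary.
DOMAIN_TERMS = {
    "invoice_date": "the sale date",
    "tax_point_date": "the tax point date",
}


def frame_for_accountant(decision: TechnicalDecision, alternatives: list[str]) -> str:
    """State the assumption in domain terms and offer the alternatives to choose from."""
    used = DOMAIN_TERMS.get(decision.field, decision.field)
    options = [DOMAIN_TERMS.get(name, name) for name in alternatives]
    message = f"We used {used} to match records."
    if options:
        message += " Should we use " + " or ".join(options) + " instead?"
    return message


decision = TechnicalDecision("LEFT JOIN", "invoice_date")
print(frame_for_accountant(decision, ["tax_point_date"]))
# We used the sale date to match records. Should we use the tax point date instead?
```

The join itself never reaches the user. Only the assumption she is qualified to challenge does.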
Operating between overreliance and algorithmic aversion requires calibrated transparency - showing enough for informed judgment, not so much that the user disengages.
Expertise isn't static. It shifts across domains and evolves over time. The same accountant who speaks fluent compliance jargon becomes a novice navigating an HR module. The system needs to recognize who someone is and where they are.
Ethan Mollick (Co-Intelligence) and Chris Noessel (Designing Agentive Technology) both argue that effective AI must model the person, not just the task.
Traditional systems display data the same way whether they have one data point or a thousand. A line chart is a line chart, whether it is empty or overloaded. AI can match the interface to the complexity of the answer - context sculpts the output.
Josh Clark (Sentient Design) calls this "Intent to Interface" - the form of the answer follows the nature of the question. Whether in chat or a dashboard, the system evaluates what it knows and chooses how to best communicate it.
The system evaluates data depth, density, and available benchmarks to determine what context the answer needs to be meaningful. $120,000 in isolation tells you nothing. $120,000 compared to last year's $95,000, positioned against a segment median of $140,000 - that's an insight.
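The arithmetic behind that example is small enough to sketch. The function below is an illustrative assumption, not the product's logic: `frame_value` and its parameter names are hypothetical, and it simply adds whichever comparisons are available to the bare number.

```python
from typing import Optional


def frame_value(value: float,
                prior: Optional[float] = None,
                benchmark: Optional[float] = None) -> str:
    """Attach whatever context is available to a bare number."""
    parts = [f"${value:,.0f}"]
    if prior is not None and prior != 0:
        change = (value - prior) / prior * 100
        parts.append(f"{change:+.0f}% vs last year's ${prior:,.0f}")
    if benchmark is not None and benchmark != 0:
        gap = (value - benchmark) / benchmark * 100
        parts.append(f"{gap:+.0f}% vs segment median of ${benchmark:,.0f}")
    return ", ".join(parts)


print(frame_value(120_000))
# $120,000
print(frame_value(120_000, prior=95_000, benchmark=140_000))
# $120,000, +26% vs last year's $95,000, -14% vs segment median of $140,000
```

With no history and no benchmark the answer stays a single figure. As more context becomes available, the same call produces a richer statement.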
AI is fast. That's the point - until something goes wrong. Then speed becomes a damage multiplier. No person would delete hundreds of emails in seconds or burn through $82,000 in API costs in 48 hours, when regular usage is $180. Human slowness is a natural safety net. AI doesn't have one.
In February 2026, Summer Yue - Meta's Director of AI Alignment - told an OpenClaw agent to review her inbox and confirm before acting. It bulk-deleted hundreds of emails and ignored her stop commands. She had to run to her Mac Mini to kill the process. Root cause: context window compaction erased her safety instruction from memory.
A grace period was missing in these failures - no draft before sending, no buffer after. The system should confirm high-impact actions - a 455x spike in API costs should never go unquestioned. And there was no emergency brake that worked mid-action, while the emails were being deleted.
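A grace period and an emergency brake can be sketched as a queue that holds each action for a short window before running it. This is a minimal sketch under stated assumptions: the `GraceQueue` class, its method names, and the 30-second window are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PendingAction:
    description: str
    run: Callable[[], None]
    execute_at: float  # earliest time at which the action may run


@dataclass
class GraceQueue:
    """Holds actions for a grace period so they can be cancelled before they run."""
    grace_seconds: float = 30.0  # hypothetical default
    pending: list[PendingAction] = field(default_factory=list)

    def submit(self, description: str, run: Callable[[], None], now: float) -> None:
        self.pending.append(PendingAction(description, run, now + self.grace_seconds))

    def cancel_all(self) -> int:
        """Emergency brake: drop everything that has not run yet."""
        count = len(self.pending)
        self.pending.clear()
        return count

    def tick(self, now: float) -> list[str]:
        """Run only the actions whose grace period has elapsed."""
        due = [a for a in self.pending if a.execute_at <= now]
        self.pending = [a for a in self.pending if a.execute_at > now]
        for action in due:
            action.run()
        return [a.description for a in due]


deleted: list[int] = []
queue = GraceQueue(grace_seconds=30)
for i in range(3):
    queue.submit(f"delete email {i}", lambda i=i: deleted.append(i), now=0)

print(queue.tick(now=10))   # [] - still inside the grace period
print(queue.cancel_all())   # 3 - the user hits the brake
print(queue.tick(now=60))   # [] - nothing left to run
print(deleted)              # []
```

Because each deletion is queued separately, the brake still stops the remaining ones after the first few have run.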
Decades ago, spam filters solved many of the problems agentic workflows face today. Messages are marked as spam but not deleted immediately - act, but don't delete. Show why the system flagged something - be transparent by design: show the ingredients - fields, lists, referenced reports - so the user can inspect them, drill down, or swap them out. For high-risk tasks, don't just ask 'Approve' - ask the user to verify a specific uncertain element. Let the user correct what the system has or hasn't done - move spam back to the inbox, or inbox mail to spam. A known scam goes to spam, a suspicious sender gets a warning - confidence drives autonomy.
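The spam-filter pattern can be sketched as a mapping from confidence to handling, with the reasons attached to every decision. The thresholds (0.95 and 0.60), the `Handling` names, and the `decide` function are illustrative assumptions rather than values from any real filter.

```python
from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    MOVE_TO_SPAM = "moved to spam (kept, reversible)"
    WARN = "delivered with a warning"
    DELIVER = "delivered normally"


@dataclass(frozen=True)
class Decision:
    handling: Handling
    reasons: tuple[str, ...]  # the "ingredients" shown to the user


def decide(confidence: float, reasons: list[str],
           auto_threshold: float = 0.95, warn_threshold: float = 0.60) -> Decision:
    """Higher confidence earns more autonomy; nothing is ever deleted outright."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= auto_threshold:
        handling = Handling.MOVE_TO_SPAM
    elif confidence >= warn_threshold:
        handling = Handling.WARN
    else:
        handling = Handling.DELIVER
    return Decision(handling, tuple(reasons))


known_scam = decide(0.99, ["sender on blocklist"])
suspicious = decide(0.70, ["first-time sender", "link text and target differ"])
print(known_scam.handling.value)
# moved to spam (kept, reversible)
print(suspicious.handling.value, "-", "; ".join(suspicious.reasons))
# delivered with a warning - first-time sender; link text and target differ
```

Every branch is reversible, and each decision carries its reasons so the user can inspect and correct it.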
Boeing's MCAS showed the cost at the highest stakes. A faulty sensor triggered automated nose-down commands 20+ times in 11 minutes. The pilots didn't know the system existed. Overriding it required a memorized checklist mid-emergency. 346 people died.
A system that acts autonomously must monitor its own deviation and escalate proportionally. Small deviation - inform. Significant anomaly - pause. Extreme spike - stop. Thresholds are relative to what is normal for that user, not absolute, so the user feels safe.
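Proportional escalation against a relative baseline can be sketched as a ratio check. The multipliers (1.5x, 5x, 50x) and the `escalate` function are hypothetical, chosen only for illustration. The example reuses the API-cost figures from earlier and assumes both are measured over the same period.

```python
def escalate(observed: float, baseline: float,
             inform_at: float = 1.5, pause_at: float = 5.0, stop_at: float = 50.0) -> str:
    """Compare observed activity to the user's own baseline and pick a response."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = observed / baseline
    if ratio >= stop_at:
        return "stop"
    if ratio >= pause_at:
        return "pause"
    if ratio >= inform_at:
        return "inform"
    return "continue"


print(escalate(200, 180))      # continue (about 1.1x normal)
print(escalate(300, 180))      # inform   (about 1.7x normal)
print(escalate(1_000, 180))    # pause    (about 5.6x normal)
print(escalate(82_000, 180))   # stop     (about 455x normal)
```

Because the check uses a ratio rather than a fixed dollar amount, the same rule would flag a small account and a large one at comparable levels of abnormality.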
A technically grounded vision for AI-assisted report building was adopted as the strategic direction for the reporting vertical. The first milestone - AI-assisted formula building - was delivered internally.
This case study captures the design thinking behind that work. Production interfaces are under NDA; the principles here are mine to share.
The principles - safety net, transparent by design, frame the unknown within the known, context sculpts the output - are transferable beyond this specific product. They apply wherever AI systems act on behalf of users in high-stakes domains.