07/02/2026
PRACTICE PROPOSAL NOTE FOR AUDITING THE USE OF ARTIFICIAL INTELLIGENCE (AI) IN PUBLIC FINANCE MANAGEMENT:
WHY AI STILL FLUNKS THE ACCOUNTING NUMBERS.
Artificial Intelligence (AI) is solving famously hard maths problems. So why can't it audit a simple spreadsheet? The answer reveals a deep truth about AI's struggle to master accounting.
At a glance
Generative AI struggles with the precise mathematics required for reliable accounting.
AI currently solves complex abstract maths but fails at simple, procedural business logic.
Hybrid systems combining AI with deterministic rules may offer a path to automation.
For the past two years, generative AI has been on a dream run, creating new text, code, images and video. Tech companies have promised that these cheeky chatbots would automate much of white-collar work, and accounting has been near the top of the list. But a basic question still hangs in the air: can generative AI handle the mathematics of accounting yet?
It shouldn’t be too hard. Computers have been crunching calculations for decades. But as businesses start testing generative AI models on real-world finance tasks such as reconciliation, forecasting and anomaly detection, they’re discovering the limits of today’s tools and are double-checking the vendors’ claims.
Some are running quiet experiments. Others, like Salesforce, are revising their strategy in public. What they’re finding could determine whether generative AI replaces the accountant or stays in its lane as a language model built for writing tasks.
The accountant who put AI to the test
Simon Thorne is a spreadsheet tragic. A senior lecturer in computer science at the UK’s Cardiff Metropolitan University, he was one of the first to test generative AI on real-world accounting logic.
Thorne had used various AI models before the current AI boom. He says he was “fairly astounded” by ChatGPT’s fluency once it was released to the public. He soon realised that people were using generative AI for more than just generating language and code. They were using it to fill spreadsheets.
“I noticed that there were some problems with what it would output. So I wanted to understand how well it can do certain things that are common in spreadsheets,” Thorne told Public Accountant.
In 2023, Thorne released a series of tests aimed at answering that question. They ranged from basic tasks – like error-spotting in a profit and loss statement – to abstract puzzles and multi-step calculations. Over the next two years he expanded the suite into five categories: auditing, spreadsheet logic, domain knowledge, deterministic logic and pure maths.
Each test mimicked something a real user might ask a chatbot – “audit this budget”, “build a rolling average”, “spot the error in this interest calculation”. Some came straight from financial workflows. Others were logic challenges reworded to avoid contamination from AI training data. They included the so-called “astronaut puzzle”, a classic constraint logic problem, and a punishing entropy test involving probabilities and logarithms.
The doctored profit and loss statement included typical errors: hardcoded formulas, inconsistent rounding, duplicated entries. Gemini 2.5 picked up all of them. Microsoft’s Copilot missed half. In the entropy test, models often started strong but unravelled midway. “Once you get beyond a series of steps, it seems to just break down,” Thorne says.
He also built a deliberately opaque spreadsheet using named ranges, seeded with subtle structural flaws. It was a realistic model, the kind that causes grief in every finance team.
To keep the tests clean, Thorne never published the prompts. “When you look through my paper, you won’t find the prompts that I used,” he says. “I’m trying to protect my own set of tests.” If he publishes them, he believes, they will get picked up by AI vendors who will train their algorithms specifically to defeat them.
In 2025, Thorne ran 21 models through his suite, including GPT-4, Claude, Gemini and Copilot. The results were “very fractured and inconsistent”. Some models hallucinated answers. Others guessed formulas that looked plausible but were logically wrong. Only Claude 2.5 and GPT-4 returned the correct bank interest formula, and only when prompted with exact phrasing.
“They work on probability. So whatever looks like the right answer based on the input is the most probable answer. When I give it exactly the same puzzle but it’s got a different theme … it’s utterly unable to do that,” Thorne says.
The maths paradox
So if generative AI can’t reliably calculate a rolling interest formula, why is it solving unsolved maths problems?
That’s the paradox confronting researchers watching AI progress on a different front. In recent months, AI models have been quietly knocking off problems from the Erdős list – a notorious collection of unsolved mathematical puzzles compiled by the late Hungarian prodigy Paul Erdős.
The list includes more than 1,000 problems, spanning number theory, combinatorics, graph theory and geometry. Many are deceptively simple to state, but fiendishly hard to solve.
Between Christmas 2025 and mid-January 2026, 15 problems have shifted from “open” to “solved” on the official Erdős website. Eleven of those credited AI models for their role in the solution. One of the more eye-catching results came from Neel Somani, a former quant and startup founder, who fed an Erdős problem into GPT-5.2 over the holiday break. Fifteen minutes later, ChatGPT returned a proof, reported TechCrunch. It cited Legendre’s formula, Bertrand’s postulate, and the Star of David theorem.
It also drew on a 2013 MathOverflow thread by Harvard mathematician Noam Elkies, but crucially, it didn’t copy. It built a different argument, producing a more complete solution to a variant of the original problem.
“I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle,” Somani told TechCrunch. The surprise was that the latest models are more successful at complex mathematical problems than previous algorithms.
The mathematician Terence Tao has tracked eight problems where AI models made “meaningful autonomous progress”, and six more where they rediscovered and extended existing work.
How can this paradox exist? In one context, AI is struggling to audit a spreadsheet. In another, it’s collaborating with mathematicians to extend the frontiers of human knowledge.
Why the split? Part of the answer lies in how these problems are structured.
Brilliant theory, broken practice
The breakthroughs in mathematics are impressive. But when it comes to practical business workflows, the results are less inspiring.
Business software company Salesforce was one of the earliest and loudest voices in the generative AI boom. CEO Marc Benioff even suggested renaming the company after its AI platform, Agentforce. But when Agentforce was tested in the real world, things didn’t go to plan.
One customer, home security firm Vivint, set up a basic instruction: send a customer satisfaction survey after every support call. No impressive acrobatics required, just a trigger, a task and an outcome. But in production, the surveys only went out some of the time. There was no logic to the failures. The task was simply skipped.
Salesforce’s CTO later explained the problem: LLMs struggle to follow more than eight steps in sequence. The system wasn’t broken; it just quietly dropped instructions without telling anyone.
As of early 2026, Agentforce no longer relies on language models alone. It runs on what Salesforce calls hybrid reasoning. LLMs still manage the conversation. But critical tasks are handed off to deterministic scripts, step-by-step rules that guarantee follow-through. An “Agent Script” ensures every required action happens in order, no matter how confidently the chatbot responds.
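The idea of handing critical tasks to a deterministic script can be sketched in a few lines. The following is a minimal illustration only, not Salesforce's Agent Script: the function names, the step list and the context dictionary are all hypothetical. It shows the one property the paragraph above describes, namely that every required step is either done, reported as failed, or reported as not run, so nothing is dropped silently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; both are hypothetical illustrations.
Step = Tuple[str, Callable[[Dict], None]]


@dataclass
class StepResult:
    name: str
    status: str  # "done", "failed" or "not_run"
    detail: str = ""


def run_script(steps: List[Step], context: Dict) -> List[StepResult]:
    """Run steps strictly in order and account for every one of them."""
    results: List[StepResult] = []
    halted = False
    for name, action in steps:
        if halted:
            # Steps after a failure are reported, never silently skipped.
            results.append(StepResult(name, "not_run"))
            continue
        try:
            action(context)
            results.append(StepResult(name, "done"))
        except Exception as exc:
            halted = True
            results.append(StepResult(name, "failed", str(exc)))
    return results


# Hypothetical steps for a "survey after every support call" rule.
def close_ticket(ctx: Dict) -> None:
    ctx["ticket_open"] = False


def send_survey(ctx: Dict) -> None:
    if ctx["ticket_open"]:
        raise RuntimeError("ticket still open")
    ctx["surveys_sent"] += 1


SUPPORT_CALL_SCRIPT: List[Step] = [
    ("close_ticket", close_ticket),
    ("send_survey", send_survey),
]

context = {"ticket_open": True, "surveys_sent": 0}
for result in run_script(SUPPORT_CALL_SCRIPT, context):
    print(result.name, result.status, result.detail)
```

In a design like this the language model would only decide which script to start; the order and completeness of the steps would not depend on its output.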
What makes finance different
The contrast between solving Erdős problems and failing survey triggers isn’t as contradictory as it first appears. In fact, it highlights the fundamental design trade-off in generative AI and suggests why it struggles in accounting.
Large language models are probabilistic engines. They’re trained to predict the most likely next word in a sequence based on vast amounts of text. That makes them surprisingly good at exploring abstract problems: spotting patterns, drawing analogies or suggesting proofs. Mathematics research is an example of an open-ended domain which thrives on variation and creative leaps. In such fields, large language models can be genuinely useful.
But accounting isn’t about abstraction. It’s procedural. Tasks like reconciliation, auditing and compliance require determinism. Every step must follow the last. Every figure must be accurate. There is no “roughly right”.
This is where LLMs fall down. They don’t run calculations; they simulate what a correct answer might look like. And they can be very convincing. They cite formulas, mimic logic, and format outputs perfectly. But it’s still pattern-matching. And the longer or more ambiguous the task, the more likely they are to drift or hallucinate.
Thorne saw this in his entropy test: models would get the first part right, then break down midway. Salesforce saw it with Agentforce: fluent conversations, broken execution. Digits, a US accounting startup, benchmarked LLMs on transaction classification, and found none exceeded 70 percent accuracy without tight constraints. (Like Agentforce, Digits achieves much higher accuracy by combining LLMs with deterministic models.)
A saying allegedly popular inside Microsoft’s Excel AI group encapsulates the problem: “Ninety-nine percent correct is 100 percent wrong.”
In accounting, near enough is just not good enough.
Not if, but how
The question isn’t whether generative AI can do accounting. We know the answer to that already. On its own, it can’t.
The better question is: Can generative AI, when paired with other models and rule-based systems, do accounting accurately enough to replace humans?
That’s still not a settled question. But the theory shows it’s not impossible. Hybrid systems like those used by Salesforce and Digits use language models for context and communication, and rely on deterministic logic for critical steps. Done well, this approach could deliver automation with guardrails, and accuracy that holds up under audit.
Personally, I haven’t seen a fully autonomous AI accounting system work at scale. But given enough time, it still seems possible.
PRACTICE PROPOSAL
AUDITING THE USE OF ARTIFICIAL INTELLIGENCE (AI) IN PUBLIC FINANCE
Presented as a proposal to the Office of the Auditor-General (OAG), Kenya.
1). Purpose of the Practice Note, in Contrast with Simon Thorne's AI Perspectives
This Practice Note provides guidance to auditors within the Office of the Auditor-General on the audit of Artificial Intelligence (AI) systems used in public finance management. It establishes minimum expectations, audit approaches, and risk considerations where AI, including generative AI and machine learning systems, is deployed by public entities.
The Practice Note supplements existing public sector auditing standards and does not replace statutory audit requirements under the Constitution of Kenya (2010) or the Public Finance Management Act (PFMA).
2). Scope of Application
This Practice Note applies to audits of:
• National and county government entities
• State corporations and public enterprises
• Constitutional commissions and independent offices
• Public pension funds, authorities, and agencies
• Any public entity using AI systems in financial management, reporting, controls, or decision support
It covers both internally developed systems and externally procured AI platforms.
3). Definition of AI for Audit Purposes
For purposes of audit, AI refers to computational systems that:
• Perform tasks normally requiring human judgment; and
• Generate outputs based on statistical inference, pattern recognition, or model-driven reasoning
This includes, but is not limited to:
• Generative AI and large language models (LLMs)
• Machine learning classification or prediction models
• Automated decision-support systems affecting financial outcomes
4). Fundamental Audit Principle
AI systems do not replace management responsibility or auditor judgment.
The use of AI by a public entity does not diminish:
• The accountability of accounting officers
• The responsibility of management for internal controls
• The obligation of auditors to obtain sufficient and appropriate audit evidence
AI outputs shall never be treated as audit evidence in isolation.
5). Key Audit Risks Associated with Artificial Intelligence (AI)
Auditors shall explicitly consider the following AI-related risks:
5.1 Determinism Risk
The risk that AI systems generate non-reproducible or probabilistic outputs for financial calculations that require exactness.
5.2 Opacity & Explainability Risk
The risk that management cannot explain how AI-generated outputs were produced, limiting auditability.
5.3 Data Integrity Risk
The risk that AI systems rely on incomplete, biased, or manipulated data, amplifying errors or irregularities.
5.4 Accountability Dilution Risk
The risk that responsibility for financial decisions is obscured between systems, vendors, and officers.
5.5 Silent Failure Risk
The risk that AI systems omit steps, override controls, or fail without alerting users.
6). Audit Planning Considerations
During audit planning, auditors shall:
• Identify all AI systems affecting financial information
• Understand the purpose, scope, and risk profile of each system
• Assess whether AI is used in advisory or execution roles
• Determine whether AI affects material balances or disclosures
Where AI affects material items, auditors shall elevate inherent risk assessments accordingly.
7). Minimum Audit Procedures for Artificial Intelligence (AI) Systems
Auditors shall, at a minimum:
7.1 Governance & Oversight Review
• Confirm existence of formal approval for AI use
• Identify responsible officers and oversight committees
• Review AI policies and risk assessments
7.2 Architecture & Controls Assessment
• Determine whether AI outputs are validated by deterministic controls
• Assess segregation of duties between AI systems and human approvers
• Confirm existence of override and escalation mechanisms
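As one illustration of what a deterministic control over AI outputs might look like, the sketch below checks an AI-proposed journal entry before it is allowed to post. It is an assumption-laden example, not a prescribed OAG procedure: the field names (`debit`, `credit`, `account`) and the function name `validate_journal` are hypothetical, and a real system would carry many more rules.

```python
from decimal import Decimal
from typing import Dict, List


def validate_journal(lines: List[Dict[str, str]]) -> List[str]:
    """Return a list of control failures; an empty list means the entry passes.

    Amounts are parsed as Decimal so the balance check is exact.
    """
    failures: List[str] = []
    total_debit = Decimal("0")
    total_credit = Decimal("0")
    for index, line in enumerate(lines, start=1):
        debit = Decimal(line["debit"])
        credit = Decimal(line["credit"])
        if debit < 0 or credit < 0:
            failures.append(f"line {index}: negative amount")
        # Each line must carry exactly one non-zero side.
        if (debit == 0) == (credit == 0):
            failures.append(f"line {index}: exactly one of debit/credit must be non-zero")
        total_debit += debit
        total_credit += credit
    if total_debit != total_credit:
        failures.append(
            f"entry unbalanced: debits {total_debit} vs credits {total_credit}"
        )
    return failures


# Hypothetical AI-proposed entry that is one cent out of balance.
proposed = [
    {"account": "5100", "debit": "1250.00", "credit": "0"},
    {"account": "1000", "debit": "0", "credit": "1249.99"},
]
for failure in validate_journal(proposed):
    print(failure)
```

An auditor reviewing the architecture would look for controls of this kind sitting between the AI component and the ledger, and for evidence that a failed check blocks posting and escalates to a human approver.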
7.3 Data Review
• Test completeness, accuracy, and legality of data inputs
• Assess data provenance and change controls
• Confirm compliance with data protection and records laws
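Part of the data review can be automated with simple deterministic tests. The sketch below covers only two of the points above, completeness and duplicate detection; legality, provenance and change controls need documentary review and are out of its scope. The required field names and the `voucher_no` key are hypothetical and would be replaced by the entity's actual data dictionary.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical mandatory fields for a payment voucher record.
REQUIRED_FIELDS = ("voucher_no", "date", "amount", "payee")


def review_inputs(records: List[Dict[str, str]]) -> Dict[str, list]:
    """Flag missing mandatory fields and repeated voucher numbers."""
    missing: List[Tuple[int, str]] = []
    for row, record in enumerate(records, start=1):
        for field in REQUIRED_FIELDS:
            if record.get(field) in (None, ""):
                missing.append((row, field))
    counts = Counter(r["voucher_no"] for r in records if r.get("voucher_no"))
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return {"missing": missing, "duplicates": duplicates}


# Hypothetical sample: row 2 lacks an amount, PV-001 appears twice.
sample = [
    {"voucher_no": "PV-001", "date": "2026-01-05", "amount": "1000.00", "payee": "A"},
    {"voucher_no": "PV-002", "date": "2026-01-06", "amount": "", "payee": "B"},
    {"voucher_no": "PV-001", "date": "2026-01-05", "amount": "1000.00", "payee": "A"},
]
print(review_inputs(sample))
```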
7.4 Reproducibility Testing
• Re-run AI-assisted processes using identical inputs
• Assess consistency of outputs
• Document any material variance
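The reproducibility procedure above can be sketched as a small test harness. This is an illustration under stated assumptions: `process` stands for whatever callable wraps the AI-assisted process under audit, and the two stand-in processes below are hypothetical (one is a plain deterministic sum, the other simulates drifting output with a fixed cycle so that the example itself is repeatable).

```python
import itertools
from collections import Counter
from decimal import Decimal
from typing import Any, Callable, Dict, List


def reproducibility_test(
    process: Callable[[Any], Any], inputs: Any, runs: int = 5
) -> Dict[str, Any]:
    """Re-run a process on identical inputs and tally the distinct outputs."""
    outputs = [process(inputs) for _ in range(runs)]
    tally = Counter(repr(output) for output in outputs)
    return {
        "runs": runs,
        "distinct_outputs": len(tally),
        "reproducible": len(tally) == 1,
        "tally": dict(tally),  # evidence for the working papers
    }


def deterministic_total(amounts: List[str]) -> str:
    """Stand-in for a rule-based calculation."""
    return str(sum(Decimal(a) for a in amounts))


def make_drifting_process() -> Callable[[Any], str]:
    """Stand-in for a process whose answer varies between runs."""
    answers = itertools.cycle(["1050.00", "1050.00", "1049.99"])
    return lambda inputs: next(answers)


amounts = ["1000.00", "50.00"]
print(reproducibility_test(deterministic_total, amounts))
print(reproducibility_test(make_drifting_process(), amounts))
```

The tally gives the auditor something concrete to document: how many runs were made, how many distinct answers came back, and what they were. Whether a given variance is material remains a matter of professional judgment.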
7.5 Audit Trail & Logging
• Verify existence of immutable logs
• Confirm traceability from input to output
• Test log integrity and retention
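One common way to make a log tamper-evident is to chain entries by hash, so that each entry commits to the one before it. The sketch below is a minimal illustration of testing such a chain, assuming the audited system stores a `prev` and `hash` value with each entry; those field names, the all-zero genesis value and the payload layout are hypothetical. It uses SHA-256 from Python's standard library.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: Dict) -> str:
    """Hash the previous hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict], payload: Dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify_log(log: List[Dict]) -> Optional[int]:
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(log):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return index
        prev = entry["hash"]
    return None


log: List[Dict] = []
append_entry(log, {"user": "clerk1", "action": "input", "amount": "1250.00"})
append_entry(log, {"user": "ai", "action": "classify", "code": "5100"})
append_entry(log, {"user": "officer2", "action": "approve"})
print(verify_log(log))  # intact chain

log[1]["payload"]["code"] = "5200"  # simulate an after-the-fact edit
print(verify_log(log))  # index of the altered entry
```

A chain like this makes edits detectable rather than impossible, and it does not by itself reveal entries deleted from the end of the log, so retention and access controls still need separate testing.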
8). Treatment of AI Outputs as Audit Evidence
AI-generated outputs may only be used as audit evidence when:
• Independently corroborated by deterministic calculations or external evidence; and
• The process generating the output is fully auditable and reproducible
Uncorroborated AI outputs shall be treated as management representations.
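Corroboration by deterministic calculation can be illustrated with an interest figure. The sketch below recomputes a compound amount with exact decimal arithmetic and compares it with a figure supplied by an AI system. The loan terms, the choice of the standard compound interest formula, the half-up rounding to two decimal places and the zero tolerance are all assumptions made for the example; in practice they would come from the underlying agreement and the audit methodology.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict

CENT = Decimal("0.01")


def compound_amount(
    principal: str, annual_rate: str, periods_per_year: int, years: int
) -> Decimal:
    """Compute P * (1 + r/n) ** (n * t), rounded to two decimal places."""
    p = Decimal(principal)
    r = Decimal(annual_rate)
    amount = p * (1 + r / periods_per_year) ** (periods_per_year * years)
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)  # assumed rounding rule


def corroborate(
    ai_value: str, recomputed: Decimal, tolerance: Decimal = Decimal("0.00")
) -> Dict[str, object]:
    """Compare an AI-supplied figure with an independent recomputation."""
    difference = abs(Decimal(ai_value) - recomputed)
    return {
        "recomputed": recomputed,
        "difference": difference,
        "corroborated": difference <= tolerance,
    }


# Hypothetical terms: 100,000 at 12% a year, compounded monthly for one year.
expected = compound_amount("100000", "0.12", 12, 1)
print(corroborate("112682.50", expected))
print(corroborate("112680.00", expected))
```

A figure that fails the comparison stays in the category of a management representation until the difference is explained.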
9). High-Risk Use Cases Requiring Enhanced Scrutiny
Auditors shall apply heightened scrutiny where AI is used in:
• Payroll and pension calculations
• Tax assessments and revenue enforcement
• Public debt servicing and guarantees
• Consolidated financial statements
• Procurement evaluation or supplier selection
• IFMIS and ICMS (Integrated Customs Management System)
In such cases, reliance on AI outputs without independent audit validation is not permitted.
10). Reporting & Disclosures
Audit reports shall:
• Disclose material use of AI systems affecting financial information
• Highlight deficiencies in AI governance or controls
• Report risks that may affect transparency, accountability, or reliability
Significant AI-related weaknesses shall be reported to Parliament or County Assemblies as appropriate.
11). Capacity Building and Continuous Review
The OAG shall:
• Build internal technical capacity on AI systems
• Periodically update this Practice Note Proposal
• Engage with regulators and standard setters
Auditors are encouraged to exercise professional skepticism and seek specialist IT support where necessary.
12). Proposed Effective Date
This proposed Practice Note shall remain open for public discourse and consideration. Once issued, it shall be effective for audits commencing on or after the date of issuance and implementation by the Auditor-General, and shall be applied in conjunction with applicable auditing standards.
Issued as a proposal to strengthen accountability, audit integrity, and public trust in the use of AI in public finance management.
Amos Ng'ongo CPA
Management Consultant
Nairobi, Kenya.