The Smarter the AI Agent, the More It Hallucinates — Here's What That Means for Professional Services Firms


Published: April 29, 2026 | By: The Crossing Report


There's a finding buried in an academic paper published at ICLR 2026 that every professional services firm owner deploying AI agents needs to read. The paper is called "The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination." The finding is this: the smarter the AI agent, the more it hallucinates on tool calls.

Not less. More.

Before you dismiss this as an academic footnote, put it next to a second data point, from Deloitte's 2026 State of AI in the Enterprise: 47% of enterprise AI users have already based at least one major business decision on hallucinated content. Nearly half. And that's among the organizations with the most resources to catch errors.

Your firm probably has fewer resources. Your AI tools are getting smarter. And the failure mode is getting harder to see.

Here's what tool hallucination actually is, why the ICLR finding matters to a 10-person accounting or law firm, and the three governance steps that can protect your workflows without requiring an IT department.


What Is Tool Hallucination? (Different from Text Hallucination)

Most discussions of AI hallucination focus on text: the AI produces a confident-sounding answer that's factually wrong. A case citation that doesn't exist. A tax regulation number that's slightly off. You've heard about these. They're embarrassing, occasionally costly, but usually catchable — because the error lives in the output, and a human reading the output can spot it.

Tool hallucination is different. The error doesn't live in the text. It lives in what the agent did.

An AI agent — as opposed to a basic AI assistant — doesn't just generate text. It takes actions. It invokes tools: database lookups, API calls, code execution, document retrievals. These tool calls are what make agents useful for professional services workflows. Instead of just describing what a contract says, an agentic system can actually retrieve the contract, compare it against a clause database, flag deviations, and produce a variance report.
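
To make "tool call" concrete: in most agent frameworks, a tool call is a small structured record naming the tool and the inputs the agent supplied, which the runtime then executes and fills in with a result. The sketch below is illustrative only; the tool name and fields are hypothetical, not any particular vendor's API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCall:
    tool: str                  # which tool the agent asked to run
    arguments: dict[str, Any]  # the inputs the agent supplied
    result: Any = None         # filled in only if the tool actually executed

# A legitimate step in a contract-review workflow might record:
lookup = ToolCall(
    tool="clause_library_lookup",  # hypothetical tool name
    arguments={"contract_id": "C-1042", "clause_type": "indemnification"},
)
lookup.result = {"deviations_found": 3}  # set by the runtime after the lookup ran
```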

When an agent hallucinates a tool call, it fabricates one of those actions. It reports that it ran a lookup — but it didn't. It generates output that looks like it came from a database query, complete with plausible-looking data, but the query never happened. The audit trail shows a tool call. The tool call was invented.

The output looks clean. The process looks correct. The error is invisible until someone digs into the underlying system logs — and most professional services firms are not digging into system logs.


The ICLR 2026 Finding: Reasoning and Hallucination Rise Together

The paper, published at ICLR 2026, tested multiple AI models on a benchmark called SimpleToolHalluBench. The benchmark was designed to measure a specific failure mode: does the agent correctly recognize when no available tool can satisfy a given request?

This is a simple test. If no tool exists to answer a question, the correct agent behavior is to say "I can't do that with the tools available." A model that fabricates a tool call instead — inventing an API or a database reference to produce an answer anyway — fails the benchmark.
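
The paper's benchmark code isn't reproduced here, but the pass/fail logic it describes is simple enough to sketch. Everything below, including the tool names and the scoring function, is a hypothetical illustration rather than the actual benchmark:

```python
# Simplified sketch of the pass/fail logic described above; not the paper's code.
AVAILABLE_TOOLS = {"clause_library_lookup", "prior_year_filings"}

def score_no_tool_case(attempted_tool_calls: list[str]) -> str:
    """Score one test case in which NO available tool can satisfy the request."""
    fabricated = [t for t in attempted_tool_calls if t not in AVAILABLE_TOOLS]
    if fabricated:
        return "FAIL: fabricated tool(s): " + ", ".join(fabricated)
    if attempted_tool_calls:
        return "FAIL: invoked a real tool that cannot answer this request"
    return "PASS: agent declined instead of inventing an action"

print(score_no_tool_case([]))                       # the correct behavior: decline
print(score_no_tool_case(["market_forecast_api"]))  # a hallucinated tool call
```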

The finding that should concern you: higher-reasoning models failed the benchmark more often than lower-reasoning models. Across the tested architectures, as reasoning capability improved, tool hallucination rate increased alongside it.

The mechanism is what you'd expect once it's explained. A model trained to reason deeply becomes more capable of constructing plausible-sounding justifications for its outputs. When it encounters a gap — a tool that doesn't exist, a lookup it can't actually perform — it reasons its way around the gap rather than flagging it. The very capability that makes it seem more intelligent also makes it more dangerous in agentic workflows.

This is the reasoning trap: better reasoning, worse behavior when the guardrails matter most.


Why This Matters in a Professional Services Workflow

Think about how AI agents are being deployed in your type of firm right now.

A law firm might run an AI agent that reviews incoming contracts, invokes a clause library to check deviations from standard terms, and produces a variance report for partner review. The value is real: what used to take two hours takes twenty minutes.

An accounting firm might use an agent that processes tax documents, references IRS code lookups and prior-year filings, and flags anomalies. The efficiency gain is genuine.

A consulting firm might build a research agent that retrieves client data, runs it against industry benchmarks, and surfaces strategic recommendations in a structured format.

In each case, the agent's value depends entirely on its tool calls being real. If the clause library lookup was fabricated, the variance report is fiction. If the IRS code reference was invented, the tax analysis is wrong in ways that may not be visible in the output. If the benchmark retrieval was hallucinated, the strategic recommendation rests on nothing.

The professional services context makes this particularly dangerous for one reason: your clients are trusting you with the output, not the agent. When the agent hallucinates, the liability flows to the firm, not to the AI vendor. Your engagement letter, your professional responsibility obligations, your malpractice exposure — these don't change because an AI agent was in the workflow.

And as we covered in our companion piece on agent washing, the tendency to overclaim what AI agents can do autonomously compounds this risk. The firm that markets its "AI-powered" workflow to clients, then deploys an agent that hallucinates tool calls without detection, is in a genuinely bad legal position.


The Silent Failure: When the Audit Trail Looks Clean

The most dangerous aspect of tool hallucination is that it doesn't announce itself.

Text hallucination at least produces wrong text that a reader might catch. A fabricated case citation can be Googled. A mangled statute number stands out to someone who knows the code.

Tool hallucination produces a workflow log that says the right things happened. The agent reports: "Retrieved contract clause database. Found 3 deviations from standard terms. Generating report." None of that may have occurred. The report exists. The citations look plausible. The process notation is correct.

This is the failure mode that the Sullivan & Cromwell/OpenAI case pointed toward — not obviously wrong outputs, but outputs that passed initial review. And it's the failure mode that makes purely output-level QA insufficient.

When you read the output, you're reading what the agent said it did. You're not reading what it actually did.


Three Governance Steps That Don't Require an IT Department

You don't need a data science team or a dedicated AI governance function to protect your workflows from this failure mode. These three steps are implementable by any firm owner this week.

1. Build human review checkpoints at every agent handoff — not just at the end.

Most firms run AI agents through a workflow and apply human review at the final output stage. That's too late. By the time a human sees the final report, any tool hallucination from step two of a six-step workflow has already propagated through four more steps.

Checkpoints need to exist at each handoff between agent actions. For a contract review workflow, this means a human reviews the clause retrieval result before the comparison runs, not just the final variance report. For a research workflow, it means the human sees the benchmark data before the recommendation is generated.

This adds time. It adds less time than catching a compounded error post-delivery.
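
A minimal sketch of the checkpoint pattern, assuming a simple two-step contract-review workflow. The agent steps are placeholder stubs; in practice they would be calls into whatever workflow platform you use.

```python
def retrieve_clauses(contract_id: str) -> list[str]:
    # Stand-in for an agent tool call against a clause library.
    return [f"indemnification clause for {contract_id}"]

def compare_to_standard_terms(clauses: list[str]) -> list[str]:
    # Stand-in for an agent comparison step.
    return [f"deviation found in: {c}" for c in clauses]

def human_checkpoint(step_name: str, intermediate) -> bool:
    # Show the intermediate result to a reviewer and require explicit sign-off.
    print(f"--- Checkpoint: {step_name} ---")
    print(intermediate)
    return input("Approve and continue? [y/N] ").strip().lower() == "y"

def contract_review_workflow(contract_id: str) -> None:
    clauses = retrieve_clauses(contract_id)
    if not human_checkpoint("clause retrieval", clauses):
        return  # stop here, before an error can propagate downstream
    variances = compare_to_standard_terms(clauses)
    if not human_checkpoint("variance comparison", variances):
        return
    print("Variance report ready for partner review:", variances)
```

The shape matters more than the code: every handoff produces something a human sees and approves before the next step consumes it.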

2. Ask your AI vendor for tool-call logs, not just output logs.

Most AI workflow platforms expose output logs — records of what the agent produced. Fewer expose tool-call logs — records of what the agent actually invoked during execution.

Ask your vendor directly: "Can you show me a log of which tools the agent called during this run, with the actual inputs and outputs of each call?" If the answer is no, you cannot audit your workflows for tool hallucination. You can only audit the outputs — which, as established above, look correct even when they aren't.

This question alone will tell you a great deal about whether your vendor has thought about this failure mode.
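
To make the distinction concrete, here are hypothetical examples of the two kinds of records. Field names will vary by platform; what matters is that a tool-call log captures the actual inputs and outputs of each invocation, which an output log does not.

```python
# Hypothetical log records; field names are illustrative, not any vendor's format.
output_log_entry = {
    "run_id": "2026-04-28-007",
    "deliverable": "Variance report: 3 deviations from standard terms found.",
}

tool_call_log_entry = {
    "run_id": "2026-04-28-007",
    "tool": "clause_library_lookup",     # which tool was actually invoked
    "input": {"contract_id": "C-1042"},  # the exact arguments supplied
    "output": {"deviations_found": 3},   # what the tool actually returned
    "timestamp": "2026-04-28T14:02:11Z",
}

# If your vendor can only produce the first kind of record, you cannot verify
# that the lookup behind the deliverable ever happened.
```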

3. Don't run fully autonomous agent chains on client-facing deliverables until one supervised cycle is documented.

If you're deploying AI agents in a workflow that produces deliverables that go to clients — contracts, reports, compliance documents, research memos — do not run those workflows autonomously until you have completed at least one full supervised cycle.

A supervised cycle means: a human observer tracks every tool call, every intermediate output, and every handoff in the workflow from start to finish. You document what actually happened versus what the agent reported. You compare the two.

This takes time. Do it once. Then you know what your agent actually does — not what it claims to do.
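
The heart of that documentation is one comparison: the tool calls the agent reported making versus the calls the runtime actually executed. A minimal sketch, assuming you have both lists for a given run; the call names here are invented for illustration.

```python
# "Reported" comes from the agent's own narrative of the run;
# "executed" comes from the vendor's tool-call log.
reported_calls = {"clause_library_lookup", "prior_year_filings", "benchmark_retrieval"}
executed_calls = {"clause_library_lookup", "prior_year_filings"}

fabricated = reported_calls - executed_calls
if fabricated:
    print("Agent reported calls that never ran:", ", ".join(sorted(fabricated)))
else:
    print("Every reported call matches an executed call for this run.")
```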


What to Ask Your AI Vendor Right Now

If you're currently using or evaluating AI agents in your firm's workflows, these are the questions to ask:

  • What benchmark do you use to measure tool-use accuracy? If they don't have an answer, that's an answer.
  • Do you provide tool-call logs with inputs and outputs, or only output logs? Audit capability requires the former.
  • How does your model handle tool unavailability? A vendor who can't explain what the agent does when no available tool can satisfy a request hasn't thought through the failure mode.
  • What's your model's performance on SimpleToolHalluBench or equivalent? Not all vendors will have this. But asking signals that you know what to look for.
  • What's your indemnification position if an agent-generated deliverable contains a hallucinated tool call? Consult your own counsel before relying on any vendor answer here. But the question surfaces how the vendor thinks about liability allocation.

The Bottom Line

The ICLR 2026 finding is not a reason to stop using AI agents. Agentic workflows are genuinely useful in professional services — the efficiency gains are real, and the competitive pressure to deploy them is only increasing.

But the reasoning trap means that the default assumption — smarter model, safer outputs — is wrong for tool-use contexts. The models your AI vendor is selling you on because of their improved reasoning capabilities may be more likely to fabricate tool calls than the models they replaced.

That's not a vendor problem to solve. It's a governance problem for you to solve.

Human review checkpoints, tool-call log access, and one supervised cycle before autonomous deployment: three steps that don't require a technology budget, an IT department, or a consultant. They require a principal — you — who understands that the audit trail showing everything went correctly isn't the same as everything actually going correctly.

The professional services firms that build this into their AI deployment process now will be ahead of the compliance requirements that are coming. The ones that don't will be caught by them.


Subscribe to The Crossing Report for weekly intelligence on the AI risks reshaping professional services.


Frequently Asked Questions

What is tool hallucination in AI agents?

Tool hallucination is when an AI agent fabricates a tool call — inventing an API reference, a data lookup, or a calculation result that doesn't actually exist. Unlike text hallucination (where AI produces incorrect prose), tool hallucination is embedded in the workflow itself. The agent's output looks properly sourced and procedurally correct, but the underlying action was never performed. In compliance workflows, document review chains, or multi-step research pipelines, this failure propagates downstream before anyone catches it.

What did the ICLR 2026 Reasoning Trap paper find?

The paper "The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination," published at ICLR 2026, found that training AI models to reason more deeply increases their rate of tool hallucination in lockstep. As the model's reasoning capability improves, it becomes more confident in fabricating plausible-looking tool calls. The paper introduced SimpleToolHalluBench, a benchmark designed specifically to measure this failure mode — and found that higher-reasoning models consistently scored worse on it than their lower-reasoning counterparts.

How does tool hallucination affect professional services firms specifically?

Professional services firms — law, accounting, consulting, staffing — are increasingly using AI agents in workflows where tool calls matter: contract review pipelines that invoke clause databases, tax workflows that reference IRS code lookups, compliance checklists that ping regulatory APIs. When an AI agent hallucinates a tool call in one of these workflows, the error looks clean in the audit trail. The output document appears properly sourced. The mistake doesn't surface until a human catches it — or doesn't.

What is SimpleToolHalluBench?

SimpleToolHalluBench is a benchmark introduced in the ICLR 2026 paper to measure AI agent tool hallucination specifically. It tests whether a model correctly identifies when no available tool can satisfy a given request — a failure mode where lower-capability models often correctly say "I can't do that" while higher-capability reasoning models confidently fabricate a tool call instead. The benchmark revealed the counterintuitive pattern: better reasoning, worse tool accuracy.

How can a small professional services firm protect against AI agent tool hallucination?

Three steps that don't require an IT department: First, build human review checkpoints at each agent handoff in a workflow — not just a final QA pass at the end. Second, ask your AI vendor whether they provide tool-call logs, not just output logs. If they can't show you what tools the agent invoked, you can't audit for hallucination. Third, don't run fully autonomous agent chains on client-facing deliverables until you've completed at least one full supervised cycle and documented what the agent actually did.

