Your AI Agent Didn't Crash. It Just Quietly Started Lying.
· By Hidai Bar-Mor, Creator of EvalView · 4 min read
The scariest agent bugs aren't the ones that throw errors. They're the ones where everything looks fine but the agent stopped calling its tools days ago and has been hallucinating ever since.
This keeps happening
Some version update quietly drops tool access. Or the model behind the API gets updated server-side and suddenly your agent stops using its tools. Or a checkpoint resumes with bad state. Or an entire sub-agent team just doesn't run, and the orchestrator fills the gap with whatever sounds right.
No crash. No error. The output reads fine. It's just wrong.
Why normal testing misses this completely
Think about how most people test agents. You check the final output against some expected answer. Or you run an LLM judge that asks "is this response good?" and it says yeah, looks good to me.
But the output isn't what's broken. The execution path is. The agent took a completely different route to get to something that sounds similar enough to pass. It skipped tools, hallucinated data, took shortcuts.
What I actually do now
I stopped testing outputs and started testing the execution path. When my agent is working correctly I record everything. Which tools got called, what order, what parameters. I save that as a baseline. Then after any change I run the same thing again and diff the two traces.
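The recording step can be as simple as serializing the ordered tool calls from a known-good run. A minimal sketch, assuming a trace is a list of `{"tool", "params"}` dicts — `record_baseline` is an illustrative helper, not an EvalView API:

```python
import json
from pathlib import Path

def record_baseline(name, tool_calls, path="baselines"):
    """Save the ordered tool calls from a known-good run as a JSON baseline."""
    Path(path).mkdir(exist_ok=True)
    trace = [{"tool": c["tool"], "params": c["params"]} for c in tool_calls]
    (Path(path) / f"{name}.json").write_text(json.dumps(trace, indent=2))
    return trace

# Example: the refund-request flow while it's still working correctly.
calls = [
    {"tool": "lookup_order", "params": {"order_id": "A1"}},
    {"tool": "check_policy", "params": {"order_id": "A1"}},
    {"tool": "process_refund", "params": {"order_id": "A1"}},
]
record_baseline("refund-request", calls)
```

After any change, you re-run the same scenario, capture a fresh trace, and diff it against the saved file.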
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → process_refund
✗ billing-dispute REGRESSION score 85 → 55
Tools disappeared? I see it immediately. Score tanked? Same. Before any user ever does.
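The diff itself is just a comparison of the ordered tool sequences. A sketch of that check, using the refund-request example above — this is illustrative, not EvalView's actual diff logic:

```python
def diff_trajectories(baseline, current):
    """Compare two runs by their ordered sequence of tool names."""
    base = [c["tool"] for c in baseline]
    curr = [c["tool"] for c in current]
    if base == curr:
        return {"status": "PASSED"}
    return {
        "status": "TOOLS_CHANGED",
        "missing": [t for t in base if t not in curr],  # tools that disappeared
        "added": [t for t in curr if t not in base],    # tools that appeared
    }

baseline = [{"tool": t} for t in ["lookup_order", "check_policy", "process_refund"]]
current = [{"tool": t} for t in ["lookup_order", "process_refund"]]
result = diff_trajectories(baseline, current)
# result["missing"] == ["check_policy"]  -> the policy check silently vanished
```

Exact sequence equality is deliberately strict: an agent that reorders or drops a tool call gets flagged even when its final answer still reads fine.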
If you take one thing from this
Just start recording your agent's tool calls. Save what the agent does when it's working, and compare against that after you make changes: prompt tweaks, model swaps, dependency bumps, all of it.
I ended up building EvalView around this. Snapshot behavior. Compare runs. Catch regressions before they hit production.
Start recording tool calls. Start diffing trajectories. Start treating agent behavior like something you can baseline, compare, and protect.