What Building AI Agents Taught Me
I've spent most of my career in deterministic systems. Rendering engines, plugin architectures, monorepo tooling. Problems where a function returns the same output for the same input, a test passes or fails, and correctness is binary. That mental model served me well for years.
Then I was asked to build an AI agent. Actually, two of them.
At Grammarly, we built the Citation Finder and Fact Checker: agents that scan a student's essay, identify claims that need evidence, find credible sources on the web, and generate properly formatted citations. One finds sources to support your arguments. The other flags when your claims contradict the evidence.
Building them broke something in my engineering intuition. The assumption I'd carried through every system I'd ever built — the same input produces the same output — didn't hold anymore.
The pipeline that taught me humility
My first day on the project, I sat down to build what I thought was straightforward: take an essay, find claims, search for sources, show results. I assumed the hard part would be the UI.
I was wrong about everything.
The hard part is that LLMs are unreliable in what they say. Not "sometimes they're slow" unreliable. The same prompt, the same essay, run twice: different claims detected, different sources found, different citation formatting. A comma missing between two author names in APA style. An ampersand where there should be an "and." A journal title italicized when it shouldn't be. The output looks right but isn't.
The agent works in four stages:
- Claim detection: The LLM reads the essay and identifies verifiable claims, statements that could be supported or contradicted by evidence.
- Source search: For each claim, a web search finds relevant sources. The LLM ranks them by relevance.
- Evidence classification: Each source is classified as supporting, contradicting, or debating the claim.
- Citation formatting: Source metadata is formatted into APA, MLA, or Chicago style.
Each stage can fail independently. And silently.
The claim detector misses a claim or flags a subjective opinion as factual. The web search returns irrelevant results. The relevance ranking is opaque: the LLM generates a score, but you can't inspect its reasoning. The citation formatter hallucinates author names and publication dates.
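The antidote to silent failure is making each stage's failure explicit in the result that flows downstream. Here's a minimal sketch of that orchestration shape, with stub functions standing in for the real LLM and web-search calls (every name here is hypothetical, not our actual code):

```python
from dataclasses import dataclass, field

# Stubs standing in for the real LLM and web-search calls (illustrative only).
def detect_claims(essay: str) -> list[str]:
    return [s.strip() for s in essay.split(".") if s.strip()]

def search_sources(claim: str) -> list[dict]:
    return [{"authors": ["Doe, J."], "year": 2021, "title": f"On {claim}"}]

def classify_evidence(claim: str, source: dict) -> str:
    return "supports"  # supports / contradicts / debates

def format_citation(source: dict) -> str:
    return f'{", ".join(source["authors"])} ({source["year"]}). {source["title"]}.'

@dataclass
class ClaimResult:
    claim: str
    sources: list = field(default_factory=list)
    verdicts: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    failed_stages: list = field(default_factory=list)  # make silent failures loud

def run_pipeline(essay: str) -> list[ClaimResult]:
    results = []
    for claim in detect_claims(essay):              # stage 1: claim detection
        r = ClaimResult(claim)
        try:
            r.sources = search_sources(claim)       # stage 2: source search
        except Exception:
            r.failed_stages.append("search")
        for src in r.sources:
            try:
                r.verdicts.append(classify_evidence(claim, src))  # stage 3
                r.citations.append(format_citation(src))          # stage 4
            except Exception:
                r.failed_stages.append("classify/format")
        results.append(r)
    return results
```

The point isn't the stubs; it's that a claim whose search stage fails still reaches the UI, marked degraded, instead of vanishing from the results without a trace.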
My instinct was to build abstractions. Define interfaces. Make things composable and testable. But you can't unit test an LLM's judgment. You can test that the pipeline runs without crashing. You can't test that it's right.
Testing becomes evaluation
What you can do is use LLM-as-a-judge: a second model evaluates the first model's output across hundreds of essays, scoring claim detection quality, source relevance, and citation accuracy. The judge has its own biases. But it catches regressions that would otherwise ship silently, and it scales in a way manual review doesn't.
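The shape of that evaluation loop is simple even though the judge itself is a model. A sketch, with the judge stubbed out (in production it would be a second LLM call; the dimension names and scores here are assumptions for illustration):

```python
import statistics

def judge(essay: str, agent_output: dict) -> dict:
    # Stub: a real judge prompts a second LLM to score each dimension 0-1.
    return {"claim_detection": 0.8, "source_relevance": 0.7,
            "citation_accuracy": 0.9}

def evaluate(corpus: list[str], run_agent) -> dict:
    """Mean judge score per dimension across a corpus of essays."""
    scores: dict[str, list[float]] = {}
    for essay in corpus:
        for dim, s in judge(essay, run_agent(essay)).items():
            scores.setdefault(dim, []).append(s)
    return {dim: statistics.mean(vals) for dim, vals in scores.items()}
```

Run this over the same corpus before and after every change, and you have a number to watch even when no single output is provably wrong.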
Pass/fail doesn't exist here. You build scoring frameworks and track quality as a distribution over time. A regression isn't a red test. It's a shift in the curve.
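One way to operationalize "a shift in the curve": compare score distributions between builds instead of asserting on single examples. A sketch, checking both the mean and the worst decile (the thresholds are illustrative, not the values we used):

```python
import statistics

def regressed(baseline: list[float], candidate: list[float],
              max_mean_drop: float = 0.02, max_p10_drop: float = 0.05) -> bool:
    """Flag a release when the score distribution shifts down:
    either the mean drops, or the worst decile gets meaningfully worse."""
    def p10(xs: list[float]) -> float:
        return statistics.quantiles(xs, n=10)[0]  # 10th percentile
    return (statistics.mean(baseline) - statistics.mean(candidate) > max_mean_drop
            or p10(baseline) - p10(candidate) > max_p10_drop)
```

Watching the bottom decile matters: a change can leave the mean untouched while making the worst cases noticeably worse.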
I had to learn to think in distributions rather than assertions, and to build systems that degrade gracefully when any stage produces garbage. The pipeline was running. The evaluation scores were okay. Not bad, not great.
It took us a while to realize the problem wasn't in any single stage. It was in what we were asking the pipeline to do.
Two goals that couldn't live together
We launched the first version as a single agent that handled both citation finding and fact checking. It seemed natural. Both tasks start with claim detection, both involve web search, both show sources to the user.
But the goals conflicted at the ML level.
Citation finding needs high recall: surface everything that might need a citation. You'd rather show an unnecessary suggestion than miss a claim that should be cited. Fact checking needs high precision: only flag things that are actually contradicted by evidence. A false positive, telling a student their correct claim is wrong, is worse than missing a disputed claim.
Same pipeline, opposite tuning needs. We tried to balance both and ended up with a system that was mediocre at both.
The fix was embarrassingly obvious in hindsight: if two features optimize for different objectives, they shouldn't share a model. We split them into two separate agents. Claim detection got tuned independently for each. The quality of both improved immediately.
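After the split, each agent could pick its own operating point on the same kind of confidence score. A toy illustration of the asymmetry (the threshold values are made up):

```python
# Citation finder: tuned for recall -- a low bar, so that anything
# plausibly needing a citation surfaces for the student to review.
def needs_citation(claim_score: float, threshold: float = 0.3) -> bool:
    return claim_score >= threshold

# Fact checker: tuned for precision -- only flag near-certain
# contradictions, because telling a student their correct claim is
# wrong costs more than missing a disputed one.
def flag_contradiction(contradiction_score: float, threshold: float = 0.9) -> bool:
    return contradiction_score >= threshold
```

A single shared threshold has to land somewhere between 0.3 and 0.9, which is exactly how you end up mediocre at both jobs.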
This felt like a product lesson. It was really an engineering lesson about coupling. I knew not to couple code with different change frequencies. I didn't realize the same principle applies to ML objectives.
Citations need grounding, not generation
Of everything I worked on, citation formatting taught me the most about where LLMs should and shouldn't be trusted.
APA, MLA, and Chicago each have hundreds of rules. The format depends on the source type, the number of authors, the presence of a DOI, the edition, whether it was accessed online. The right approach isn't to ask the LLM to generate citation metadata. It's to ground citation search in authoritative indices that already maintain structured data.
Academic databases, paper repositories, and publisher APIs have accurate author names, publication dates, DOIs, and journal metadata. When your search index provides structured data, formatting becomes a deterministic transformation rather than a generative task.
The LLM's role shifts from generating citation data to selecting the right source and mapping the user's claim to the relevant evidence. Generation where the model is strong: reasoning, relevance. Structured data where precision matters: metadata, formatting.
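To make "deterministic transformation" concrete: once an index hands you structured metadata, an APA-style journal reference is just string assembly. A simplified sketch (real APA covers many more source types and rules, and italics are omitted here):

```python
def apa_journal_reference(meta: dict) -> str:
    """APA-style reference for a journal article, built from structured
    metadata. Simplified: authors assumed already in 'Last, F. M.' form."""
    authors = meta["authors"]
    if len(authors) == 1:
        byline = authors[0]
    else:
        # APA joins the final author with ", &" -- the exact comma-and-
        # ampersand rule the LLM kept getting wrong.
        byline = ", ".join(authors[:-1]) + ", & " + authors[-1]
    ref = (f'{byline} ({meta["year"]}). {meta["title"]}. '
           f'{meta["journal"]}, {meta["volume"]}({meta["issue"]}), '
           f'{meta["pages"]}.')
    if meta.get("doi"):
        ref += f' https://doi.org/{meta["doi"]}'
    return ref
```

The formatting rule that a generative model gets wrong some unknowable fraction of the time becomes a few lines of code that are right every time.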
What this changed for me
Building these agents taught me things that didn't fit into my existing mental model:
Determinism is a luxury. Most of my career, correctness was binary. LLMs don't work that way. You build for variance, not just for edge cases.
Evaluation replaces testing. You can't assert that an LLM is right. You can track whether it's getting worse. That requires a fundamentally different relationship with quality.
Coupling applies to ML objectives, not just code. If two features optimize for different things, they shouldn't share a pipeline. We learned this by shipping a coupled version and watching both features underperform.
"Good enough" is a design decision, not a compromise. In deterministic systems, "good enough" means you haven't finished. In probabilistic systems, it means you've chosen where to spend your error budget. Citations need to be near-perfect. Source ranking can tolerate more noise.
I went in thinking I was building a feature. I came out thinking differently about what it means to build software when the core computation is nondeterministic.
That shift applies to a lot more than citation finding.