Treating an LLM as an Agent, Not an API

The hundred-documents-a-day problem

Every day, a hundred-plus tender eligibility documents land in a Drive folder, and the right sections have to be pulled out of each one reliably. The obvious build is four lines: file → API call → response → store. I built that first. It demoed beautifully and quietly corrupted the database within a week.

The failure mode wasn't dramatic. Gemini would extract "C. Qualification Criteria / Must have 5 years experience…" and truncate mid-sentence. The row still inserted. Nothing threw. The data was just subtly, permanently wrong.

An API endpoint has no judgement

The naive pipeline treats the model as a pure function: input in, text out, trust the text. But an LLM isn't a function — it's a probabilistic narrator that will hand you a hallucination or a half-finished answer with total confidence and never tell you which one you got. If the only thing between that output and your database is an INSERT, the model's worst day becomes your data's worst day.
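To make the failure concrete, here is a minimal sketch of the naive pipeline using an in-memory SQLite database. The `call_model` stub, the `sections` table, and the document ID are hypothetical stand-ins, not the original code; the stub simply returns a truncated extraction like the one described above.

```python
import sqlite3


def call_model(document: str) -> str:
    """Hypothetical stand-in for the real API call; returns a truncated answer."""
    return "C. Qualification Criteria / Must have 5 years experi"


def naive_pipeline(conn: sqlite3.Connection, doc_id: str, document: str) -> None:
    """file -> API call -> response -> store, with no checks in between."""
    text = call_model(document)
    conn.execute(
        "INSERT INTO sections (doc_id, body) VALUES (?, ?)", (doc_id, text)
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (doc_id TEXT PRIMARY KEY, body TEXT)")
naive_pipeline(conn, "tender-001", "full tender text goes here")

# The insert succeeds and nothing raises, even though the text is cut off.
stored = conn.execute("SELECT body FROM sections").fetchone()[0]
```

The database accepts any string, so the truncated row is indistinguishable from a good one unless something inspects the content first.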

Bad data in a database costs far more than slow processing. So I made the system slower on purpose.

Promoting the model to an agent

Instead of one call, each document now runs through a lifecycle: Context Analysis → Strategy Selection → Execution → Self-Validation → Quality Decision. Only then does the system persist, retry, or abandon. The model doesn't just answer; it has to earn the write.
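The lifecycle can be sketched as a small control loop. This is an illustration under assumptions, not the production code: the strategy names, the length threshold in `select_strategy`, the `max_attempts` default, and the `extract`/`validate` callables are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Decision(Enum):
    PERSIST = "persist"
    RETRY = "retry"
    ABANDON = "abandon"


def select_strategy(document: str) -> str:
    """Context analysis + strategy selection (assumed heuristic: by length)."""
    return "chunked" if len(document) > 20_000 else "single_pass"


def decide(valid: bool, attempt: int, max_attempts: int) -> Decision:
    """Quality decision: persist valid output, otherwise retry until the budget is spent."""
    if valid:
        return Decision.PERSIST
    return Decision.RETRY if attempt + 1 < max_attempts else Decision.ABANDON


def process(
    document: str,
    extract: Callable[[str, str], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[Decision, Optional[str]]:
    """Run one document through the lifecycle; text is returned only if it validated."""
    strategy = select_strategy(document)
    for attempt in range(max_attempts):
        text = extract(document, strategy)            # execution
        decision = decide(validate(text), attempt, max_attempts)
        if decision is Decision.PERSIST:
            return decision, text
        if decision is Decision.ABANDON:
            return decision, None
        strategy = "chunked"                          # assumed fallback for the retry
    return Decision.ABANDON, None
```

The caller writes to the database only when the returned decision is `Decision.PERSIST`, so an unvalidated answer never reaches storage.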

The gatekeeper is a five-layer validation engine that runs independently of the model's own confidence:

Heading validation — the start and end headings exist, in the right order.
Content validation — real, substantial text between them (over 500 characters), not an empty shell.
Truncation detection — natural endings and sane token counts; nothing cut off mid-thought.
Format validation — Markdown and HTML render consistently.
Encoding validation — valid UTF-8, no corruption, no mojibake.

Crucially, this validation is my code, not the API's self-reported score. A model that hallucinates is equally capable of being confidently wrong about its own confidence; independent domain rules catch the blind spots a confidence number never will.
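A rough sketch of how the five layers might look as plain code follows. Only the 500-character threshold comes from the text above. Everything else is an assumption made for illustration: that the extracted text contains both headings, the list of "natural" ending characters, the mojibake markers, the token-limit check, and the balanced-fence test (a crude stand-in for real render checks).

```python
from typing import List

MIN_CONTENT_CHARS = 500                                 # threshold quoted in the text
NATURAL_ENDINGS = (".", "!", "?", ":", ";", ")", '"')   # assumed heuristic
MOJIBAKE_MARKERS = ("\ufffd", "â€", "Ã©", "Ã¨")          # assumed heuristic
FENCE = "`" * 3                                         # Markdown code-fence marker


def validate(raw: bytes, start: str, end: str,
             output_tokens: int, max_output_tokens: int) -> List[str]:
    """Return the names of the layers that failed; an empty list means pass."""
    failures: List[str] = []

    # Encoding: strict UTF-8 decode, then look for common mojibake debris.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["encoding"]            # nothing else can be checked reliably
    if any(marker in text for marker in MOJIBAKE_MARKERS):
        failures.append("encoding")

    # Headings: both present, with the start heading before the end heading.
    s = text.find(start)
    e = text.find(end, s + len(start)) if s != -1 else -1
    if s == -1 or e == -1:
        failures.append("heading")
        return failures                # later layers need the section body
    body = text[s + len(start):e].strip()

    # Content: more than an empty shell between the headings.
    if len(body) <= MIN_CONTENT_CHARS:
        failures.append("content")

    # Truncation: natural ending, and the output did not hit the token limit.
    if not body.endswith(NATURAL_ENDINGS) or output_tokens >= max_output_tokens:
        failures.append("truncation")

    # Format: an odd number of code fences means one was left open.
    if body.count(FENCE) % 2 != 0:
        failures.append("format")

    return failures
```

Returning the list of failed layers, rather than a single boolean, lets the caller choose a recovery strategy per failure type.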

Designing for failure, not just success

Every document also moves through an explicit state machine (NEW → IN_PROGRESS → COMPLETED | ERROR | ABANDONED | RE_PROCESS), so a failure is never a dead end; it's a state with a recovery strategy. A structural error retries immediately. A transient API error backs off exponentially (1s, 2s, 4s, 8s, 16s) with jitter, so a fleet of workers never stampedes the rate limit. Once the retry budget runs out, the document is parked for a human rather than silently dropped.
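The state machine and retry policy could be sketched as below. The state names and the 1s to 16s schedule come from the text. The transitions out of ERROR, RE_PROCESS, and ABANDONED, the 25% additive jitter, the shared five-retry budget, and the `error_kind` labels are assumptions for illustration.

```python
import random
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    NEW = "NEW"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"
    ABANDONED = "ABANDONED"
    RE_PROCESS = "RE_PROCESS"


# Allowed transitions. The NEW and IN_PROGRESS rows follow the text;
# the remaining rows are assumed.
ALLOWED = {
    State.NEW: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.COMPLETED, State.ERROR,
                        State.ABANDONED, State.RE_PROCESS},
    State.ERROR: {State.RE_PROCESS, State.ABANDONED},
    State.RE_PROCESS: {State.IN_PROGRESS},
    State.ABANDONED: {State.RE_PROCESS},   # a human can requeue a parked document
    State.COMPLETED: set(),
}

MAX_RETRIES = 5  # matches the five delays listed: 1s, 2s, 4s, 8s, 16s


def transition(current: State, target: State) -> State:
    """Move to `target`, rejecting any transition not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.25) -> float:
    """Delay in seconds before retry `attempt` (0-based): base * 2**attempt plus jitter."""
    delay = base * (2 ** attempt)
    return delay + random.uniform(0, jitter * delay)


def next_step(error_kind: str, attempt: int) -> Tuple[State, Optional[float]]:
    """Pick the recovery state and wait time for a failed attempt."""
    if attempt >= MAX_RETRIES:
        return State.ABANDONED, None          # park for human review
    if error_kind == "structural":
        return State.RE_PROCESS, 0.0          # retry immediately
    return State.RE_PROCESS, backoff_delay(attempt)
```

Randomizing each delay spreads retries from parallel workers over time, so they do not all hit the API at the same instant after a shared failure.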

The lesson

The agentic version burns more tokens and more wall-clock time than the four-line version ever did. That's the point. The expensive part of an AI system isn't the inference; it's the wrong answer you didn't catch. Treating the model as an agent whose work must pass validation before anything is written turned an impressive demo into something I'd actually trust with a production database.