Why One AI Model Isn't Enough for Trustworthy Analytics

A 2026 benchmark across 37 LLMs found hallucination rates between 15% and 52%. Single-model analytics pipelines inherit that variance. Here's why a two-stage architecture - Classifier then Synthesizer - produces deterministic, trustworthy answers.

Mihir Sanchala · 1 month ago · 7 min read

A 2026 benchmark across 37 large language models (SQ Magazine, 2026) found hallucination rates between 15% and 52% depending on the model and task. The best-performing model in the benchmark, Grok-4, recorded 15%. The industry average sits closer to 22%.

For most use cases, that range is an inconvenience. For analytics - where the answer informs a financial decision, a risk assessment, or an operational call - it is an architecture problem.

Most AI analytics tools use a single model to do everything: read your question, infer your intent, determine which table and columns to query, generate the SQL, and return an answer. That is one model handling two fundamentally different jobs: working out what you mean, and generating the query that retrieves it. And the errors from the first job compound with the errors from the second.

TL;DR: LLM hallucination rates range 15%-52% (2026 benchmark, 37 models). Single-model pipelines inherit that variance on every query. A two-stage architecture - Classifier resolving intent, Synthesizer generating the query - removes the hallucination surface by separating two fundamentally different jobs.

Key Takeaways

LLM hallucination rates range 15%-52% across leading 2026 models - single-model analytics pipelines inherit that variance
Intent classification and query synthesis are different problems - one model optimised for both is optimised for neither
A two-stage pipeline catches intent errors before a query runs, not after a wrong answer is returned
Deterministic SQL means the same question always produces the same answer - phrasing variations cannot change the result
BYOK lets teams choose their own model provider without being locked to a single LLM

Why Does One Model Doing Two Jobs Produce Wrong Answers?

When a business user asks "what was our net revenue last month?", an AI analytics tool has to do two things in sequence.

First: understand what the user actually means. Net revenue, not gross. Last month, not last 30 days. Revenue attributed to completed transactions, not pipeline. This is intent classification - understanding the business meaning behind a natural language phrase.

Second: generate a query that retrieves exactly that. The right table, the right column, the right date filter, the right aggregation. This is query synthesis - translating a structured intent into executable logic.

These are different problems. Intent classification requires understanding business context - what terms mean in your specific organisation, what the user's role implies about their typical questions, what metric definitions your semantic layer has locked. Query synthesis requires translating structured intent into valid, deterministic SQL or aggregation logic against your specific schema.

When one model handles both, the errors from the first task flow directly into the second. A model that misclassifies "net revenue" as "gross revenue" generates a query against the wrong column - and returns a confidently wrong answer. The user sees a number. The number is wrong. The model did not signal uncertainty because it was not uncertain - it was certain about the wrong thing.

Probabilistic text generation and deterministic query synthesis are not the same problem. Using one model to solve both means accepting the error surface of the harder task on every single query.

What Does a Two-Stage Pipeline Actually Do?

The architecture decision we made at Edilitics was to separate these two tasks explicitly. Not because it is simpler - it is not. Because it is the only architecture that can produce reliable answers at scale.

Stage 1: The Classifier

The Classifier takes the user's question and produces structured intent. Not a query. Not an answer. A machine-readable specification of what the question means: the metric requested, the time dimension, the filter conditions, the aggregation logic, and the data source.
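As a minimal sketch, an intent specification of this kind could be modelled as an immutable record. The `Intent` class and its field names are hypothetical, chosen only to mirror the five elements listed above; they are not the Edilitics schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured intent; every field name here is illustrative."""
    metric: str                            # governed metric name
    time_range: str                        # resolved time dimension
    filters: tuple[tuple[str, str], ...]   # (column, required value) pairs
    aggregation: str                       # aggregation to apply
    source: str                            # table or collection to query


intent = Intent(
    metric="net_revenue",
    time_range="previous_calendar_month",
    filters=(("status", "completed"),),
    aggregation="sum",
    source="transactions",
)

# The specification is plain data, so it can be serialised, logged, and diffed.
intent_json = json.dumps(asdict(intent), sort_keys=True)
print(intent_json)
```

Freezing the record is a design choice for the sketch: once Stage 1 has produced the intent, nothing downstream can quietly alter it.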

This stage works from your governed semantic layer - the validated column descriptions and metric definitions that your team has confirmed. The Classifier is not guessing what "net revenue" means. It is looking up what your team has defined it to mean, and producing an intent specification that reflects that definition. How that semantic layer gets built - and why human validation is the only path to a trustworthy one - is covered in "why your AI analytics tool doesn't know your business".

Intent errors are caught here. If the Classifier cannot confidently resolve the user's question against the semantic layer - because the question references a term that has not been defined, or falls outside the scope of what the data supports - it returns OUT_OF_SCOPE with specific suggestions on what the user could ask instead. Returning a useful redirect is better than delivering a wrong answer.
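The control flow can be sketched as follows. This is a toy under stated assumptions: `SEMANTIC_LAYER`, `ClassifierResult`, and `classify` are hypothetical names, and a plain substring lookup stands in for the model-backed classification a real system would use. Only the OUT_OF_SCOPE status comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical governed semantic layer: term -> validated definition.
SEMANTIC_LAYER = {
    "net revenue": {"metric": "net_revenue", "source": "transactions"},
    "gross revenue": {"metric": "gross_revenue", "source": "transactions"},
}


@dataclass
class ClassifierResult:
    status: str                                   # "RESOLVED" or "OUT_OF_SCOPE"
    intent: Optional[dict] = None
    suggestions: list = field(default_factory=list)


def classify(question: str) -> ClassifierResult:
    """Resolve a question against governed terms, or decline with suggestions."""
    text = question.lower()
    # Check longer terms first so a more specific term wins over a shorter one.
    for term in sorted(SEMANTIC_LAYER, key=len, reverse=True):
        if term in text:
            return ClassifierResult("RESOLVED", intent=dict(SEMANTIC_LAYER[term]))
    # Nothing governed matched: do not guess, point at what is defined instead.
    suggestions = [f"What was our {term}?" for term in sorted(SEMANTIC_LAYER)]
    return ClassifierResult("OUT_OF_SCOPE", suggestions=suggestions)


print(classify("What was our net revenue last month?").status)
print(classify("What is our customer churn?").suggestions)
```

The point of the sketch is the second branch: an undefined term never reaches query generation.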

Stage 2: The Synthesizer

The Synthesizer takes the structured intent from Stage 1 and generates the query. Because the intent is already structured and validated, the Synthesizer does not need to interpret business language. It translates a precise specification into SQL, a MongoDB aggregation pipeline, or Polars expressions depending on your data source.

This is where determinism comes from. The same structured intent always produces the same query. Phrasing variations in how a user asks the question are resolved in Stage 1 - by the time Stage 2 runs, the ambiguity is gone.
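One way to picture this is a pure function from intent to query. The sketch below assumes a SQLite dialect and hypothetical lookup tables (`SOURCES`, `METRIC_COLUMNS`, `TIME_FILTERS`); the table and column names are invented for illustration and `synthesize_sql` is not the Edilitics Synthesizer.

```python
# Hypothetical allow-lists a Synthesizer might hold for one schema.
SOURCES = {"transactions"}
METRIC_COLUMNS = {"net_revenue": "net_amount", "gross_revenue": "gross_amount"}
TIME_FILTERS = {
    # SQLite date functions; other dialects would need their own expressions.
    "previous_calendar_month": (
        "created_at >= date('now', 'start of month', '-1 month') "
        "AND created_at < date('now', 'start of month')"
    ),
}


def synthesize_sql(intent: dict) -> tuple[str, list]:
    """Pure function: the same intent always yields the same SQL and parameters."""
    if intent["source"] not in SOURCES:
        raise ValueError(f"unknown source: {intent['source']}")
    column = METRIC_COLUMNS[intent["metric"]]        # KeyError on an unknown metric
    time_clause = TIME_FILTERS[intent["time_range"]]
    # Sort filters so dictionary ordering can never change the generated text.
    # A real system would also validate filter column names against the schema.
    filters = sorted(intent["filters"].items())
    where = [time_clause] + [f"{name} = ?" for name, _ in filters]
    sql = (
        f"SELECT SUM({column}) AS {intent['metric']} "
        f"FROM {intent['source']} WHERE " + " AND ".join(where)
    )
    return sql, [value for _, value in filters]


intent = {
    "metric": "net_revenue",
    "time_range": "previous_calendar_month",
    "filters": {"status": "completed"},
    "source": "transactions",
}
sql, params = synthesize_sql(intent)
print(sql)
print(params)
```

No model call and no randomness sit inside the function in this sketch, which is what makes repeated calls with the same intent produce identical output.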

The query that runs is logged and shown to the user alongside the answer. Anyone who wants to verify the result can read the query, check that it matches their intent, and act on the answer with confidence.
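A minimal sketch of that audit trail, using an in-memory SQLite table as a stand-in for a real data source. The `run_with_audit` helper, the logger name, and the toy rows are all assumptions made for illustration.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("synthesizer")


def run_with_audit(conn, sql, params):
    """Execute a synthesized query and return the answer with the query that produced it."""
    log.info("executing: %s params=%s", sql, params)
    value = conn.execute(sql, params).fetchone()[0]
    return {"answer": value, "query": sql, "params": list(params)}


# Toy data standing in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (net_amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(100.0, "completed"), (50.0, "completed"), (70.0, "pending")],
)

result = run_with_audit(
    conn,
    "SELECT SUM(net_amount) FROM transactions WHERE status = ?",
    ["completed"],
)
print(result["answer"])  # 150.0
print(result["query"])
```

Returning the query in the same payload as the answer means the user interface never has a number to show without the logic behind it.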

	Classifier	Synthesizer
Task	Intent classification	Query generation
Input	Natural language question	Structured intent specification
Output	Machine-readable intent	SQL / MongoDB pipeline / Polars
Error surface	Misclassified business intent	Malformed query logic
What stops errors	Governed semantic layer	Deterministic translation from structured intent

Why the Model Underneath Still Matters

The two-stage architecture removes the hallucination surface from the query generation process. But the model that powers each stage still matters - for speed, for cost, for the specific types of classification it handles best.

This is why model flexibility is an architectural requirement, not a deployment detail. Different data types, query complexity profiles, and regulatory constraints require different model characteristics. A team with strict data residency requirements needs a different model configuration than one running growth analytics against a cloud warehouse. Locking every deployment to a single provider forces every team to accept the same tradeoffs.

Edilitics supports BYOK - Bring Your Own Key. Teams on Individual Pinnacle, Team Scale, Team Pinnacle, and Enterprise tiers can supply their own LLM provider API keys, using their existing model agreements instead of the platform default. BYOK doubles effective analysis credits once active, because the platform is no longer consuming inference costs on the team's behalf.

The architectural point is that the two-stage pipeline decouples the model from the architecture. The Classifier and Synthesizer can use different models, or models from different providers, because their interfaces are structured - they pass machine-readable intent, not natural language. Swapping the underlying model does not change how the pipeline works.
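The decoupling can be sketched with two structural interfaces. Everything named here (`IntentModel`, `QueryModel`, `Pipeline`, and the two provider stand-ins) is hypothetical; real implementations would call an LLM provider's API with the team's own key rather than return canned values.

```python
from typing import Protocol


class IntentModel(Protocol):
    """Anything that turns a question into structured intent (hypothetical interface)."""
    def classify(self, question: str) -> dict: ...


class QueryModel(Protocol):
    """Anything that turns structured intent into a query string (hypothetical interface)."""
    def synthesize(self, intent: dict) -> str: ...


class Pipeline:
    def __init__(self, classifier: IntentModel, synthesizer: QueryModel):
        self.classifier = classifier
        self.synthesizer = synthesizer

    def build_query(self, question: str) -> str:
        intent = self.classifier.classify(question)  # only structured data crosses stages
        return self.synthesizer.synthesize(intent)


# Canned stand-ins for two different providers.
class ProviderAClassifier:
    def classify(self, question: str) -> dict:
        return {"metric": "net_revenue", "source": "transactions"}


class ProviderBSynthesizer:
    def synthesize(self, intent: dict) -> str:
        return f"SELECT SUM({intent['metric']}) FROM {intent['source']}"


pipeline = Pipeline(ProviderAClassifier(), ProviderBSynthesizer())
print(pipeline.build_query("What was our net revenue last month?"))
```

Because `Pipeline` depends only on the two method signatures, either stage can be replaced by another class with the same method without touching the pipeline code.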

What Deterministic Means in Practice

"Deterministic" is an overloaded word in AI discussions. It is worth being precise about what it means in the context of analytics.

A deterministic analytics answer is one where the same question, asked by the same user against the same data, always produces the same query and the same result. Phrasing the question differently - "net revenue last month" vs "what did we make in May net of refunds" - does not change the answer, because both phrases resolve to the same structured intent before the query is generated.
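To make the two-phrasings example concrete, here is a toy resolver in which both questions collapse to the same intent. The phrase table, the `resolve_intent` function, and the reference date are assumptions for illustration; a real Classifier would rely on a model and the semantic layer rather than keyword matching.

```python
from datetime import date

# Hypothetical phrase table mapping wording to a governed metric.
METRIC_PHRASES = {"net revenue": "net_revenue", "net of refunds": "net_revenue"}
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def previous_month(today: date) -> tuple[int, int]:
    """Return (year, month) of the calendar month before `today`."""
    return (today.year - 1, 12) if today.month == 1 else (today.year, today.month - 1)


def resolve_intent(question: str, today: date) -> dict:
    """Toy resolver; raises StopIteration if no known metric or month is found."""
    text = question.lower()
    words = text.replace("?", "").split()
    metric = next(m for phrase, m in METRIC_PHRASES.items() if phrase in text)
    if "last month" in text:
        year, month = previous_month(today)
    else:
        # A named month is taken to be its most recent occurrence before `today`.
        month = next(i for i, name in enumerate(MONTHS, start=1) if name in words)
        year = today.year if month < today.month else today.year - 1
    return {"metric": metric, "period": f"{year}-{month:02d}"}


today = date(2026, 6, 11)  # arbitrary reference date so that "last month" is May
a = resolve_intent("net revenue last month", today)
b = resolve_intent("what did we make in May net of refunds", today)
print(a == b)  # True
print(a)
```

Since query generation only ever sees the resolved intent, two questions that resolve identically cannot diverge afterwards.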

This matters because analytics is a repeatability problem. If two analysts ask the same question and get different answers, one of the answers is wrong. If the same analyst asks the same question on two different days and gets different answers, the system cannot be trusted for trend analysis.

Probabilistic text generation - the mechanism underlying single-model pipelines - produces answers where phrasing variation can change the result. Not because the data changed. Because the model made a different inference. The two-stage pipeline eliminates this by resolving intent before synthesis, not during it.

The result is an answer you can reproduce, explain, and defend. That is what trustworthy analytics actually means at the architecture level.

The Tradeoff We Made

Building a two-stage pipeline is harder than building a single-model pipeline. It requires a governed semantic layer for the Classifier to reason from. It requires a structured intent schema that both stages agree on. It requires the Synthesizer to handle multiple query languages - SQL for relational databases, aggregation pipelines for MongoDB, Polars for flat files - from the same structured intent.
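As a rough sketch of that last requirement, the same intent can be handed to one renderer per target. The intent keys, `to_sql`, `to_mongo`, and `RENDERERS` are hypothetical; a Polars renderer would follow the same pattern and is left out only to keep the example free of third-party imports.

```python
def to_sql(intent: dict) -> tuple[str, list]:
    """Render the intent as parameterised SQL."""
    sql = (
        f"SELECT SUM({intent['column']}) AS total FROM {intent['source']} "
        f"WHERE {intent['filter_field']} = ?"
    )
    return sql, [intent["filter_value"]]


def to_mongo(intent: dict) -> list:
    """Render the same intent as a MongoDB aggregation pipeline."""
    return [
        {"$match": {intent["filter_field"]: intent["filter_value"]}},
        {"$group": {"_id": None, "total": {"$sum": f"${intent['column']}"}}},
    ]


# One renderer per target; the intent itself never changes shape.
RENDERERS = {"sql": to_sql, "mongodb": to_mongo}

intent = {
    "source": "transactions",
    "column": "net_amount",
    "filter_field": "status",
    "filter_value": "completed",
}

for target, render in RENDERERS.items():
    print(target, render(intent))
```

Keeping the renderers separate from the intent schema is what lets a new data source be added without changing Stage 1.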

We made this choice at the start because the alternative - a faster single-model pipeline on ungoverned data - produces answers that are probably right. We needed answers that are verifiably right.

That distinction is the whole architecture. If the answer cannot be reproduced from a logged query, it is not a reliable answer. Everything else follows from that constraint.

The two-stage architecture is how that guarantee is maintained at scale.

AskEdi is the AI analytics layer that runs on this architecture - Classifier, then Synthesizer, every answer with the query visible. Integrate builds the governed semantic foundation it reasons from.

Sources

SQ Magazine, LLM Hallucination Statistics 2026: AI Gets Facts Wrong Up to 82% of the Time, 2026. Retrieved June 11, 2026.

Written by

Mihir Sanchala

Co-Founder, Edilitics. Engineers the systems that bring Edilitics to production. Writes about the technical reality of building governed AI analytics - the decisions, the tradeoffs, and what building for trust actually requires.
