Anthropic Published How They Built AI Analytics. Here's What We Found.

Anthropic's engineering team published the internal architecture they built to make AI analytics trustworthy. The four decisions they made are the same four decisions Edilitics is built on.

Edilitics·1 month ago·8 min read

On June 3, 2026, Anthropic's engineering team published a detailed account of how they built their internal self-service analytics system using Claude. They described the failure modes they encountered, the architecture they built to address them, and the accuracy results they achieved.

The post is worth reading in full: How Anthropic enables self-service data analytics with Claude.

We read it carefully. What we found was not a competitor's playbook or a benchmark to measure against. It was a description of the same problems we set out to solve - and the same architectural conclusions we reached in solving them.

This post documents that comparison directly. Not to claim credit. To show why the architecture Edilitics is built on is the right one.

TL;DR: Anthropic published the internal architecture they built to achieve 95%+ accuracy on self-service analytics. The four decisions they made - human-validated semantic layers, continuous data freshness, structured retrieval, separated intent and synthesis - are the same four foundations Edilitics is built on.

Key Takeaways

Without structured semantic context, Anthropic found Claude's accuracy on analytics didn't exceed 21%. With it: consistently above 95%.
Both Anthropic and Edilitics identified the same root cause: the AI has no way to know what your column names actually mean without a human-validated semantic layer.
Auto-generated semantic definitions - without human review - encode the same ambiguities they are meant to eliminate.
Data staleness is an architecture problem, not a cleanup problem. It requires continuous profiling, not a one-time fix.
The AIR score in Edilitics makes the quality of the semantic layer visible and improvable - users can ask questions, validate descriptions, ask again, and see the difference.

Failure Mode	Anthropic's Finding	Edilitics Implementation
The AI doesn't know what your columns mean	Human-validated semantic layer required as mandatory first lookup. Auto-generated definitions "encode the very ambiguities we were trying to eliminate."	AI Column Insights generates descriptions on connection. AIR Score grades human-validated versus AI-generated coverage. Grade A requires near-complete human validation.
Data changes and the AI doesn't know	CI enforcement: 90% of data-model pull requests include semantic skill updates in the same diff.	DQ Refresh profiles automatically on connection, on hostname/port/database change, and via daily background job (7+ days stale AND 5%+ data growth). Schema drift flagged between runs.
Raw SQL access doesn't solve the problem	Unstructured retrieval across historical queries improved accuracy by less than one point. Structured reference documents with explicit routing were the solution.	AskEdi's Classifier works exclusively from the governed semantic layer. Never performs unstructured retrieval. Returns `OUT_OF_SCOPE` with specific guidance rather than guessing.
One model doing two different jobs	Separated knowledge skills (intent routing) from execution skills (query generation). Two distinct jobs, handled distinctly.	Two-stage pipeline: Classifier resolves intent against the semantic layer. Synthesizer generates a deterministic query from that structured intent. Same intent always produces the same query.

Why Does a Semantic Layer Matter More Than Raw Data Access?

Anthropic's team ran an ablation test to understand where accuracy was actually coming from. They gave the AI direct access to thousands of historical SQL queries - every query ever run against their data - and measured whether that improved answers.

Accuracy moved less than one point. About 80% of incorrectly answered questions had relevant answers somewhere in that corpus. The model could see them. It just couldn't use them.

The bottleneck wasn't access. It was structure. The model couldn't reliably map a new question to the right entity in an unstructured corpus.

When they added structured semantic context - validated descriptions, metric definitions, explicit routing - accuracy went from below 21% to consistently above 95%.

That gap - 21% to 95% - is the precise gap between an AI analytics tool that connects to your database and one that understands your data. It is why Edilitics is built the way it is.

Failure Mode 1: The AI Doesn't Know What Your Columns Mean

Anthropic called this "concept-entity ambiguity." When hundreds of data model options exist, the AI struggles to map a user's question to the right field. Their example: "active users" requires knowing what actions constitute activity, whether to exclude fraudulent users, and what lookback window applies. None of that is in the column name.

Their solution: a human-validated semantic layer as the mandatory first lookup. And a specific warning about the shortcut:

"Bootstrapping the semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs produced plausible-looking definitions that encoded the very ambiguities we were trying to eliminate."

Auto-generation is a starting point. It is not a semantic layer.

How Edilitics addresses this:

AI Column Insights generates a description for every column in your schema on connection - grounded in column name, data type, DQ statistics, and your organisation's industry sector. That generation runs automatically. It is the starting point.

What happens next is what matters. Every column carries a validation status. An AI-generated description contributes a fractional score to the AIR (AI Readiness) Score. A human-validated description contributes fully. Grade A - the level at which AskEdi has the context it needs to produce answers grounded in your actual business logic - requires near-complete human validation.

Before every AskEdi session, an AI Advisory reflects the current AIR grade for the table. It never blocks access. A team that wants to explore their data at Grade C can do so - and many do, precisely because seeing where the AI gets it wrong on ambiguous columns is what motivates the validation work. Ask a question with unvalidated descriptions, see the answer, edit the description, ask again. The quality difference is visible and immediate. Governance is something you observe working, not a gate you have to pass through.

The distinction Anthropic drew - between auto-generated definitions and human-validated ones - is built into every AIR score Edilitics surfaces.
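
To make the weighting concrete, here is a minimal sketch of how a coverage score of this kind could be computed. It is an illustration, not the Edilitics implementation: the 0.5 weight for AI-generated descriptions, the grade cutoffs, and the names `Column`, `air_score`, and `air_grade` are all assumptions, since this post only says that AI-generated descriptions contribute fractionally and human-validated ones contribute fully.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the real weight and cutoffs are not published here.
AI_GENERATED_WEIGHT = 0.5                                 # "fractional" credit
GRADE_CUTOFFS = [(0.95, "A"), (0.80, "B"), (0.60, "C")]   # hypothetical thresholds


@dataclass
class Column:
    name: str
    status: str  # "human_validated", "ai_generated", or "missing"


def air_score(columns: list[Column]) -> float:
    """Average per-column credit: 1.0 if human-validated, partial if AI-generated."""
    if not columns:
        return 0.0
    credit = {"human_validated": 1.0, "ai_generated": AI_GENERATED_WEIGHT, "missing": 0.0}
    return sum(credit[c.status] for c in columns) / len(columns)


def air_grade(score: float) -> str:
    """Map a score to a letter grade using the assumed cutoffs above."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "D"


table = [
    Column("order_id", "human_validated"),
    Column("net_rev", "ai_generated"),
    Column("cust_seg", "ai_generated"),
    Column("created_at", "human_validated"),
]
before = air_score(table)             # two of four columns validated
table[1].status = "human_validated"   # a reviewer confirms the net_rev description
after = air_score(table)              # the score rises with each validated column
```

Under these assumed numbers, validating one more description moves the table from 0.75 to 0.875, which is the kind of visible improvement the ask, validate, ask-again loop relies on.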

What the LLM actually receives at query time:

When AskEdi generates a query, it does not pass a raw schema to the model. It passes a structured context block for every column: the validated description, data type, null count, cardinality, value range, and statistical summary (mean and standard deviation for numeric columns). This is what grounds the query in your data rather than in the model's probabilistic assumptions about what column names mean.

What is included depends on the privacy and context mode selected when the session starts. In all three modes, DQ statistics are included. In Balanced and Full modes, real column names are passed. In Private mode, columns are anonymised to col_1, col_2, and so on, but the statistical grounding remains intact. In Full mode, the most frequent values per column are also included, giving the model tighter constraints on valid filter values. The AI Column Insights and AIR score documentation covers exactly what each mode passes.
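
The three modes can be sketched as a small function that decides what goes into each column's context entry. This is a simplified illustration under stated assumptions: the `Mode` enum, the `build_context` function, and the dictionary keys are hypothetical, and the sketch models only column names, DQ statistics, and most frequent values. How descriptions and data types are handled per mode is left out because this post does not specify it.

```python
from enum import Enum


class Mode(Enum):
    PRIVATE = "private"
    BALANCED = "balanced"
    FULL = "full"


def build_context(columns: list[dict], mode: Mode) -> list[dict]:
    """Build the per-column context for one session (hypothetical structure)."""
    context = []
    for index, col in enumerate(columns, start=1):
        entry = {
            # Private mode replaces real names with col_1, col_2, ...
            "name": f"col_{index}" if mode is Mode.PRIVATE else col["name"],
            # DQ statistics are included in all three modes.
            "stats": col["stats"],
        }
        if mode is Mode.FULL:
            # Full mode also passes the most frequent values per column.
            entry["top_values"] = col["top_values"]
        context.append(entry)
    return context


columns = [
    {"name": "region", "stats": {"nulls": 0, "cardinality": 4}, "top_values": ["EMEA", "APAC"]},
    {"name": "net_rev", "stats": {"nulls": 12, "mean": 184.2, "std": 40.1}, "top_values": [199.0]},
]
private = build_context(columns, Mode.PRIVATE)
balanced = build_context(columns, Mode.BALANCED)
full = build_context(columns, Mode.FULL)
```

In this sketch the statistics travel with the column in every mode, so anonymising names in Private mode does not remove the statistical grounding.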

This is the architectural point Anthropic validated from a different angle: accuracy did not improve by giving the model more raw access. It improved when the model was given structured, validated context to reason from. The DQ statistics Edilitics passes at query time are part of that structure.

Auto-generated descriptions are a starting point. What the AI cannot know is what your column means in your business. That knowledge belongs to your team. Validation is where you put it into the system.

Failure Mode 2: Data Changes and the AI Doesn't Know

Anthropic described this plainly: "data sources, business definitions, and schemas change constantly; assets and agent knowledge go stale and return subtly incorrect answers."

Their solution was CI enforcement - requiring that any change to a data model triggers a corresponding update to the semantic skill files. Around 90% of their data-model pull requests now include skill changes in the same diff.
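
Anthropic's post does not include the CI check itself, but the rule is simple enough to sketch. The version below is an assumption-laden illustration: the function name `missing_skill_update` and the `models/` and `skills/` path prefixes are hypothetical, not Anthropic's repository layout.

```python
def missing_skill_update(changed_paths: list[str],
                         model_prefix: str = "models/",
                         skill_prefix: str = "skills/") -> bool:
    """Return True when a diff touches the data model but no semantic skill file.

    The prefixes are hypothetical; a real check would use the repository's own layout.
    """
    touches_model = any(path.startswith(model_prefix) for path in changed_paths)
    touches_skill = any(path.startswith(skill_prefix) for path in changed_paths)
    return touches_model and not touches_skill


# A pull request that changes a model without updating its skill would be flagged.
flagged = missing_skill_update(["models/orders.sql", "README.md"])
# A pull request that updates both in the same diff would pass.
passing = missing_skill_update(["models/orders.sql", "skills/orders.md"])
```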

The same problem exists for any AI analytics system. A column that was 95% complete last quarter may be 70% complete today. A column renamed in a migration still has the old description. An integration that hasn't been profiled in three months may be answering questions from stale metadata.

How Edilitics addresses this:

The DQ Refresh system profiles every integration automatically: on connection, on any change to hostname, port, or database name, and via a daily background job that targets integrations not profiled in seven or more days where the underlying data has grown by more than 5% - both conditions must be true. Schema drift is tracked between runs - columns added, removed, or changed since the last profile are flagged.
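
The daily-job rule and the drift check can be sketched in a few lines. This is an illustration rather than the production logic: the function names are hypothetical, and measuring growth by row count and comparing schemas as column-to-type mappings are both assumptions.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=7)   # "not profiled in seven or more days"
GROWTH_THRESHOLD = 0.05           # "grown by more than 5%"


def needs_daily_refresh(last_profiled: datetime, now: datetime,
                        rows_at_last_profile: int, rows_now: int) -> bool:
    """Both conditions must hold: the profile is stale AND the data has grown."""
    stale = now - last_profiled >= STALE_AFTER
    if rows_at_last_profile == 0:
        # Assumption: any rows appearing in a previously empty table count as growth.
        grown = rows_now > 0
    else:
        growth = (rows_now - rows_at_last_profile) / rows_at_last_profile
        grown = growth > GROWTH_THRESHOLD
    return stale and grown


def schema_drift(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots and list what changed between runs."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(col for col in shared if previous[col] != current[col]),
    }


now = datetime(2026, 6, 15)
stale_and_grown = needs_daily_refresh(now - timedelta(days=8), now, 1000, 1100)
stale_only = needs_daily_refresh(now - timedelta(days=8), now, 1000, 1020)
grown_only = needs_daily_refresh(now - timedelta(days=2), now, 1000, 2000)
drift = schema_drift(
    {"id": "int", "amt": "float", "region": "text"},
    {"id": "int", "amt": "decimal", "country": "text"},
)
```

Only the first case triggers a refresh here; staleness alone or growth alone does not, which matches the both-conditions rule described above.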

The DQ score is not a badge earned once at setup. It is a live measurement, updated continuously, visible on every integration card. Data quality is always current because the system treats it as an ongoing signal, not a one-time assessment. This is the same principle Anthropic enforced through CI: the metadata has to stay current with the data.

Failure Mode 3: Raw SQL Access Doesn't Solve the Problem

The ablation result Anthropic published runs counter to the intuition that more context is always better. Giving the AI access to thousands of historical SQL queries - the full query corpus - improved accuracy by less than one point. Eighty percent of questions that were answered incorrectly had the relevant information somewhere in the corpus. The model just couldn't find it reliably.

Their conclusion: "unstructured retrieval couldn't map a new question to the right precedent." The solution was to distill query history into structured reference documents - explicit routing logic, table grain descriptions, gotchas - rather than expose the raw corpus.

How Edilitics addresses this:

AskEdi's Classifier works exclusively from the governed semantic layer - the validated column descriptions and metric definitions that have been confirmed by your team. It never performs unstructured retrieval against raw schema or query history.

If the Classifier cannot confidently resolve a question against the semantic layer - because a term hasn't been defined, or the question references data outside the connected table - it returns OUT_OF_SCOPE with specific guidance on what to ask instead. A useful redirect is better than a confident wrong answer.
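
The refuse-rather-than-guess behaviour can be illustrated with a deliberately simple stand-in. The real Classifier resolves natural-language questions; the sketch below assumes a plain dictionary lookup, and the names `SEMANTIC_LAYER`, `Resolution`, and `resolve` are hypothetical. What it shows is the contract: resolution happens only against validated definitions, and anything else returns `OUT_OF_SCOPE` with guidance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical governed semantic layer: validated business term -> column.
SEMANTIC_LAYER = {
    "net revenue": "net_rev",
    "region": "region",
}


@dataclass
class Resolution:
    status: str                      # "RESOLVED" or "OUT_OF_SCOPE"
    column: Optional[str] = None
    guidance: Optional[str] = None


def resolve(term: str) -> Resolution:
    """Resolve a business term against validated definitions only; never guess."""
    column = SEMANTIC_LAYER.get(term.strip().lower())
    if column is None:
        known = ", ".join(sorted(SEMANTIC_LAYER))
        return Resolution(
            "OUT_OF_SCOPE",
            guidance=f"'{term}' is not defined. Defined terms: {known}.",
        )
    return Resolution("RESOLVED", column=column)


hit = resolve("Net Revenue")   # defined term, resolves to its validated column
miss = resolve("churn")        # undefined term, returns guidance instead of a guess
```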

This is the same architectural conclusion Anthropic reached: structure the knowledge the AI reasons from, rather than giving the model more unstructured access and hoping it finds the right thing. As covered in more detail in why your AI analytics tool doesn't know your business, the bottleneck has never been access - it has always been structure.

Failure Mode 4: One Model Doing Two Different Jobs

Anthropic's skills architecture separates the problem of knowing what a question means from the problem of generating a query to answer it. A thin knowledge skill narrows the search space - routing the agent to the right domain-specific context. An execution skill handles the actual analysis workflow. Two jobs, handled distinctly.

The underlying insight: intent classification and query synthesis are different problems requiring different optimisations. When one model handles both, errors from the first compound into the second. A model that misclassifies "net revenue" as "gross revenue" generates a query against the wrong column - and returns a confidently wrong answer.

How Edilitics addresses this:

AskEdi uses a two-stage pipeline. The Classifier takes the user's question and produces structured intent - the metric, time dimension, filter conditions, aggregation logic, and data source - against the governed semantic layer. It does not generate a query. It resolves meaning.

The Synthesizer takes that structured intent and generates a deterministic query. The same question, asked with the same intent, always produces the same query. Phrasing it differently - "net revenue last month" versus "what did we make in May net of refunds" - resolves to the same structured intent in Stage 1, and therefore the same query in Stage 2. Every query that runs is logged and shown to the user via the Analysis View - one click from any response - so anyone who needs to verify the result can read exactly what ran.
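
A minimal sketch of the second stage shows where the determinism comes from. It assumes a hypothetical `Intent` structure and `synthesize` function and covers only a single aggregate with filters. String interpolation keeps the example short; a real implementation would use parameter binding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Structured intent as Stage 1 might emit it (hypothetical shape)."""
    metric_column: str
    aggregation: str
    table: str
    filters: tuple[tuple[str, str, str], ...] = ()  # (column, operator, literal)


def synthesize(intent: Intent) -> str:
    """Pure function of the intent: no sampling, no model call, no hidden state."""
    sql = f"SELECT {intent.aggregation}({intent.metric_column}) FROM {intent.table}"
    if intent.filters:
        # Sort filters so their order inside the intent cannot change the output.
        clauses = [f"{col} {op} {lit}" for col, op, lit in sorted(intent.filters)]
        sql += " WHERE " + " AND ".join(clauses)
    return sql


# Two phrasings that Stage 1 resolves to the same intent, with filters in a different order.
a = Intent("net_rev", "SUM", "orders",
           (("order_date", ">=", "'2026-05-01'"), ("order_date", "<", "'2026-06-01'")))
b = Intent("net_rev", "SUM", "orders",
           (("order_date", "<", "'2026-06-01'"), ("order_date", ">=", "'2026-05-01'")))
query_a = synthesize(a)
query_b = synthesize(b)   # identical to query_a
```

Because `synthesize` depends only on its input, any variation in the final query has to come from Stage 1, which narrows where an error can originate.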

This architecture is described in detail in why one AI model isn't enough for trustworthy analytics. Anthropic reached the same structural conclusion from a different direction: the two jobs require different handling, and collapsing them into one model means accepting the error surface of the harder task on every query.

The same intent always produces the same query. That is what deterministic means in practice - not that the AI never makes mistakes, but that once a question resolves to a given intent, the query that runs is always the same.

Where Do the Two Architectures Differ?

Anthropic built an automated correction-harvesting loop: scheduled agents scan internal channels for correction language, draft one-line fixes to reference docs, and open pull requests to domain owners. The feedback cycle is closed automatically.

Edilitics collects response feedback - every AskEdi answer can be rated, and those ratings are logged - but the loop back to description updates is currently manual. A user who rates an answer as unhelpful surfaces that signal; acting on it requires a human to open the Metadata Viewer and update the description.

This is a gap we are aware of. It does not affect the accuracy of the semantic layer itself - human validation of descriptions is more reliable than automated correction inference. But it means the feedback cycle requires more deliberate attention from teams who want to continuously improve their AIR score over time.

What Convergence Means

Anthropic built an internal analytics system for one of the most technically sophisticated teams in AI. They documented the failure modes they encountered and the architecture they built in response.

The four conclusions they reached - human-validated semantic layers, continuous data freshness, structured retrieval over raw corpus access, and separated intent classification from query synthesis - are the same four foundations Edilitics is built on.

This is not a coincidence. These conclusions follow from the nature of the problem. Any team that takes AI analytics seriously enough to measure accuracy, diagnose failure modes, and build for reliability will arrive at the same architecture. The 2026 Semantic Layer Summit reached the same conclusion independently: business context is critical infrastructure for enterprise AI, not an optional enrichment layer. The question is not whether this is the right approach. Anthropic's findings confirm it is.

For mid-market teams who do not have Anthropic's engineering capacity to build this themselves, Edilitics is what that architecture looks like as a product: 24 live connectors, automated DQ profiling, an AIR-graded semantic layer your team validates, a two-stage pipeline that produces deterministic results, and every answer traceable to the exact query that ran. The full walkthrough of a verified answer shows what that looks like in a real session.

The reason self-serve analytics keeps failing is not that the tools are wrong. It is that the foundations are ungoverned. Anthropic confirmed this from the inside. Edilitics is built to make those foundations accessible without a team of engineers to construct them.

One more thing worth stating directly: Anthropic is one of the three AI providers you can run AskEdi on. Pinnacle plan users can select Anthropic as their provider natively, and teams on qualifying plans can bring their own Anthropic API key via BYOK. The company whose architecture validates this approach also provides one of the models that can power it.

Start a free 14-day evaluation.

Sources

Anthropic, How Anthropic enables self-service data analytics with Claude, June 2026. Retrieved June 15, 2026.
Semantic Layer Summit 2026, Business Context as Critical Infrastructure for Enterprise AI, May 2026. Retrieved June 15, 2026.

Written by

Edilitics

Edilitics is a governed AI analytics platform built for mid-market teams who need decision-ready answers without technical dependency. The team writes about data governance, AI analytics, and what it takes to make data accessible to the people who actually use it.
