§17.2

Talking to the Warehouse: Text-to-SQL and the Semantic Layer

The oldest dream in business intelligence is to ask the database a question in plain English and get the right answer back. For decades it failed, because translating “what was revenue by city last quarter?” into correct SQL requires knowing the schema, the joins, and what revenue even means at this company. Large language models cracked the easy version of this problem almost overnight — and then ran straight into the hard version. This chapter is about both: how good text-to-SQL has actually become, why it still breaks on real databases, and the one architectural idea — the semantic layer — that separates a demo from a system you can trust.

Querying production data in natural language is the most immediately useful thing a data agent does, and the most dangerous to get subtly wrong. A chart that is 90% right looks exactly like a chart that is 100% right. So the question is not “can the model write SQL?” — it obviously can — but “how often is it right on your database, and how would you know?”

What the benchmarks really say

The headline numbers are a tale of two worlds. On Spider 1.0, the sanitized academic benchmark, the best systems exceed 91% execution accuracy — the problem looks solved.2 On BIRD, which uses larger, messier real databases, the top system reaches about 82% against a human-expert baseline of 93% — close, but with a persistent gap.1 Then comes Spider 2.0, built deliberately from real enterprise workflows with thousand-column databases and vendor SQL dialects, and the floor drops out: the best agent solved only 21.3% of tasks, and a raw frontier model managed about 10%.2

Execution accuracy: from solved benchmark to open problem

0%25%50%75%100%Spider 1.0 (sanitized academic DBs)91.2%BIRD — best system81.95%Spider 2.0 — best agent (enterprise DBs)21.3%Spider 2.0 — GPT-4o baseline10.1%human expert 92.96%

On clean academic schemas the problem looks nearly solved. On real enterprise databases — thousands of columns, vendor dialects, ambiguous business terms — the best agents still solve only about a fifth of the tasks. The gap between these two bars is the whole story.

Figure 1. The same capability, measured three ways. Execution accuracy looks near-solved on clean academic schemas and collapses on realistic enterprise ones. The dashed line is the human-expert baseline.

Why it breaks: the schema, not the SQL

When text-to-SQL fails on a real database, it usually fails in one of two ways, and neither is about SQL syntax. First, hallucinated schema: faced with hundreds of cryptically named tables, the model invents a plausible-but-nonexistent column or joins on the wrong key. Second, ambiguous business definitions: ask for “revenue” and the model has no way to know whether you mean booked, recognized, gross, or net — so it guesses, and produces SQL that runs cleanly and answers the wrong question.13 The model generates from patterns, not from an understanding of your business. The error is invisible precisely because the query executes.

The fix that works: the semantic layer

The breakthrough idea is not a better model — it is giving the model a contract. A semantic layer is a governed set of definitions sitting between the question and the warehouse: certified metrics (net_revenue means exactly this), the join paths between tables, and the access rules. The agent no longer writes raw SQL against bare tables; it composes queries from definitions a human has already vetted.

The semantic layer — the contract between language and SQL

Who asks

AI data agentBI dashboardAnalyst in chat

Semantic layer

Certified metric definitions, join paths, and access rules. One place where “revenue” means one thing.

net_revenueactive_userchurn_rate

Where data lives

Warehouse tablesRaw columnsRow/column security
Routed through a semantic model, one vendor’s text-to-SQL jumped from 51% (a raw model on bare tables) to 90%+accuracy on real BI questions. The model didn’t change — the context did.
Figure 2. The semantic layer is the contract between natural language and SQL. It is where 'revenue' is pinned to one definition, where joins are pre-declared, and where row- and column-level security live.

The effect is large and well-documented. Snowflake reports that routing questions through a semantic model lifted its Cortex Analyst from the 51% a raw model scored on bare tables to over 90% accuracy on real BI questions — the model unchanged, the context transformed.45

51 → 90%
Text-to-SQL accuracy on real BI questions, raw model vs. the same model grounded in a governed semantic model (Snowflake Cortex Analyst).
+39 points from context alone

This is why the semantic layer has become the industry's center of gravity. In September 2025 a coalition including Snowflake, Salesforce, and dbt Labs launched the Open Semantic Interchange to make metric definitions portable across BI tools and AI agents — an attempt to standardize the contract itself, so the same definition of “churn” works everywhere.10

How an agent actually queries

Underneath a good text-to-SQL agent is a small pipeline, not a single prompt. It links the question to the relevant tables and metrics (schema linking), generates several candidate queries rather than one, runs them, reads the database's own error messages, and self-corrects — a frontier traceable to research like CHASE-SQL, whose multi-candidate approach reached 73% on BIRD, and to schema-linking-and-self-correction work like RSL-SQL.312

How a text-to-SQL agent actually answers a question

Question"revenue by city,last quarter"Schema +semantic layertables, joins,certified metricsGenerate candidatesdecompose · multipleSQL draftsExecute & checkrun · read errors ·validateAnswertable · chart ·explanationself-correct: “no such column” → revise and retry↑ enforces RBAC, masking, and the firm’s metric definitions

The accuracy gains of the last two years come less from a smarter model than from this scaffolding: grounding the question in a governed semantic layer, generating several candidate queries, and letting the database’s own error messages drive correction.

Figure 3. The pipeline behind a reliable answer: ground the question in the semantic layer, generate candidate SQL, execute, and let the database's errors drive correction.

Crucially, the agent reaches the warehouse through a governed path. Snowflake's managed MCP servers, in preview since October 2025, expose the text-to-SQL tool to agents while reusing the warehouse's existing role-based access control, masking, and policies — so an agent can answer questions without becoming a way around security.9

The shipping state of the art

This is no longer a research topic; it is a product category. Snowflake's Cortex Analyst, Databricks' AI/BI Genie (generally available since June 2025, after 4,000-plus customers used the preview), Google's Conversational Analytics grounded in the Looker semantic model, and Amazon Q generative SQL for Redshift all ship the same essential design: natural language in, governed SQL out, with an explanation of how the answer was derived.678 And the payoff in practice is real: Uber's internal QueryGPT cut typical query authoring from about ten minutes to three, with roughly 300 daily users across a platform handling more than a million interactive queries a month.11

Sources

Verified June 2026

  1. 1BIRD-bench: a Big Benchmark for Large-scale Database Grounded Text-to-SQL Evaluation (leaderboard) · BIRD-bench, 2025. bird-bench.github.io
  2. 2Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows · arXiv 2411.07763 (ICLR 2025), 2025. arxiv.org/abs/2411.07763
  3. 3CHASE-SQL: Multi-Path Reasoning and Preference-Optimized Candidate Selection in Text-to-SQL · Google Cloud / Stanford (MarkTechPost summary), 2024. www.marktechpost.com/2024/10/12/google-cloud-and-stanford-researchers-propose-chase-sql-an-ai-framework-for-multi-path-reasoning-and-preference-optimized-candidate-selection-in-text-to-sql
  4. 4Cortex Analyst: Evaluating Text-to-SQL Accuracy for Real-World BI · Snowflake Engineering Blog, 2024. www.snowflake.com/en/blog/engineering/cortex-analyst-text-to-sql-accuracy-bi
  5. 5Cortex Analyst — semantic model documentation · Snowflake, 2025. docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst
  6. 6AI/BI Genie is now Generally Available · Databricks, 2025. www.databricks.com/blog/aibi-genie-now-generally-available
  7. 7Conversational Analytics in Looker — overview · Google Cloud Documentation, 2025. docs.cloud.google.com/looker/docs/conversational-analytics-overview
  8. 8General availability of Amazon Q generative SQL for Amazon Redshift · AWS, 2024. aws.amazon.com/about-aws/whats-new/2024/09/amazon-q-generative-sql-amazon-redshift
  9. 9Introducing Snowflake Managed MCP Servers for Secure, Governed Data Agents · Snowflake, 2025. www.snowflake.com/en/blog/managed-mcp-servers-secure-data-agents
  10. 10Open Semantic Interchange Initiative · Snowflake / Salesforce / dbt Labs, 2025. www.snowflake.com/en/news/press-releases/snowflake-salesforce-dbt-labs-and-more-revolutionize-data-readiness-for-ai-with-open-semantic-interchange-initiative
  11. 11QueryGPT — Natural Language to SQL Using Generative AI · Uber Engineering Blog, 2024. www.uber.com/en-CA/blog/query-gpt
  12. 12RSL-SQL: Robust Schema Linking in Text-to-SQL Generation · arXiv 2411.00073, 2024. arxiv.org/abs/2411.00073
  13. 13Enterprise Text-to-SQL: What Accuracy Benchmarks Really Mean · Promethium (practitioner guide), 2025. promethium.ai/guides/enterprise-text-to-sql-accuracy-benchmarks-2