the data resident

Why I'm Building an Open-Source Agentic BI Tool

· #1 in series
ai agentic-bi mcp text-to-sql open-source semantic-layer

I have spent 12 years watching people not use dashboards.

At every company I worked at, the cycle was the same. Business team requests a dashboard. Data team builds it over weeks. Stakeholders look at it twice. Then it rots. Six months later someone asks a question the dashboard almost answers but not quite, and the cycle starts again.

Dashboards answer the questions you already know to ask. Real business users don’t know what to ask. They have a vague concern. Revenue feels off. Pipeline looks thin. Something changed but they can’t name it.

So they ping a data analyst. The analyst writes SQL. Maybe the answer comes back in a day, maybe a week. The feedback loop is too slow to drive decisions.

This is the status quo in 2026. Billions spent on BI tooling, and actual usage is still sporadic.

The promise of “just ask your data”

Every BI vendor now has a natural language feature. Databricks has AI/BI Genie. Snowflake has Intelligence. Tableau has Next. Power BI has Copilot. ThoughtSpot has been doing this since before the LLM wave.

The pitch is simple: type a question in English, get an answer from your database. No SQL required.

It sounds like the fix. It is not.

23% accuracy on real databases

The Spider 2.0 benchmark tests text-to-SQL on enterprise-realistic schemas. Thousand-column databases on BigQuery and Snowflake. The kind of messy, sprawling schemas that exist at every real company.

The best models score around 23%.

For context, Spider 1.0 (clean schemas, simple questions) is basically solved at 91%. BIRD (harder, real databases) tops out around 75%. But Spider 2.0, the one that actually resembles production, sits at 23%.

It gets worse. A VLDB 2026 paper found that Spider 2.0’s own annotation error rate is 66.1%. The benchmark questions and answers are wrong two-thirds of the time. We are measuring accuracy against a broken ruler.

The six ways text-to-SQL fails

Karl Weinmeister from Google Cloud published a taxonomy that matches everything I have seen:

Schema linking failure. The model picks the wrong tables or columns. With hundreds of irrelevant columns in context, this is the default outcome, not the exception.

Join hallucination. The model invents relationships between tables that don’t exist. Or it misses joins that are necessary.

Aggregation errors. Wrong GROUP BY. Wrong aggregate function. MAX when it should be AVG. The structure looks right, the numbers are wrong.

Filter omissions. The logic is correct but secondary conditions get dropped. WHERE clauses go missing.

Column selection errors. Right logic, wrong output columns.

Logic conflation. Complex queries cause cognitive overload. The model mixes up conditions, combines unrelated filters, or short-circuits multi-step reasoning.

These are not edge cases. On enterprise schemas, these are the common case.

Why no incumbent defends well

Every major BI vendor bolted NL features onto their existing product. That sentence contains the problem.

Databricks Genie generates SQL against Unity Catalog. Snowflake Intelligence runs NL queries over Snowflake tables. Tableau Next adds conversational queries to Tableau. These are NL interfaces grafted onto architectures designed around dashboards and manual configuration.

None of them are truly agentic. None do multi-step reasoning with self-correction. None generate a semantic layer from raw schema. None score their own confidence. None flag data quality issues before answering.

They can’t. Their product architecture constrains them. You can’t retrofit a fundamentally different interaction model onto a dashboard-first product and get a good result. The incentives are also wrong. These companies sell seats and dashboards. An agent that replaces the dashboard threatens the business model.

The semantic layer is the missing piece

Here is the most important number I found in my research: adding a semantic layer to text-to-SQL jumps accuracy from roughly 75% to 90%+. This comes from Cube.dev and dbt MetricFlow research.

A semantic layer is a YAML file (or similar) that defines what “revenue” means, what “active customer” means, how tables join, which columns matter. It sits between the LLM and the raw schema. Instead of drowning the model in 500 columns and hoping it picks the right one, you give it a curated map.
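To make that concrete, here is a sketch of what a semantic-layer fragment could look like. Every name and field below is illustrative, not taken from any specific tool:

```yaml
# Hypothetical semantic-layer fragment; names and structure are illustrative.
metrics:
  revenue:
    description: Recognized revenue, excluding cancelled orders
    sql: SUM(orders.amount)
    filters:
      - orders.status != 'cancelled'
dimensions:
  order_month:
    sql: DATE_TRUNC('month', orders.created_at)
joins:
  - left: orders.customer_id
    right: customers.id
    type: many_to_one
```

A dozen definitions like this can stand in for hundreds of raw columns, which is exactly why the accuracy jump is so large.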

Looker understood this years ago with LookML. But nobody in the open-source agentic BI space generates a semantic layer automatically. Vanna 2.0, the most mature open-source option, uses RAG over training data. QueryWeaver uses a knowledge graph. Neither auto-discovers the schema or generates metric definitions.

The open-source gap

I looked at every open-source alternative I could find. Here is what’s missing:

| Capability | Vanna 2.0 | QueryWeaver |
| --- | --- | --- |
| Auto-discovery | No | No |
| Semantic layer generation | No | No |
| Confidence scoring | No | No |
| Ambiguity detection | No | No |
| Quality agent | No | No |
| Eval harness | No | No |

Both are good SQL generation tools. Neither tries to understand your database before querying it. Neither tells you when it is uncertain. Neither ships a way to measure its own accuracy.

What I think could work

I have been designing a six-step architecture. I have not built most of it yet. But I have enough research and one proof-of-concept to believe the thesis holds.

Step 1: Connect. An MCP server that connects to any PostgreSQL or MySQL database. Read-only. Schema introspection tools. Nothing novel here.

Step 2: Discover. This is the core of the whole thing. An agent that runs column stats (nulls, cardinality, distribution, freshness). Infers which columns are actually used versus dead schema. Infers the business workflows generating the data. Flags traps: huge tables, missing foreign keys, soft deletes that will silently corrupt your aggregates.
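The per-column stats are the simplest part to picture. A minimal sketch of that profiling step, assuming column values have already been fetched (a real agent would push this into SQL rather than pull whole columns over the wire):

```python
from collections import Counter

def profile_column(values):
    """Compute the per-column stats a discovery agent might gather:
    null rate, cardinality, and the most common values.
    Illustrative sketch only; thresholds are assumptions."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    return {
        "null_rate": nulls / total if total else 0.0,
        "cardinality": distinct,
        "top_values": Counter(non_null).most_common(3),
        # Heuristic: a column that is mostly NULL or single-valued
        # is probably dead schema.
        "likely_dead": total > 0 and (nulls / total > 0.95 or distinct <= 1),
    }

print(profile_column(["a", "a", "b", None]))
```

Stats like these feed the "used versus dead schema" inference directly: a column that is 98% NULL should never make it into the semantic layer.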

Step 3: Generate. The discovery output feeds into a semantic layer. YAML definitions for metrics, dimensions, join paths. This is the curated map the LLM needs.

Step 4: Ask. A business agent that uses the schema, the stats, and the semantic layer in context. Confidence scoring on every answer. Ambiguity detection: “Did you mean revenue as bookings or as recognized ARR?” Self-correction with different fix strategies for different error types.
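One way to picture the ambiguity check: a question term that matches more than one metric definition should trigger a clarifying question, not a guess. A toy sketch, with a hypothetical in-memory semantic layer:

```python
# Hypothetical semantic layer loaded as a dict; names are illustrative.
SEMANTIC_LAYER = {
    "revenue_bookings": {"aliases": ["revenue", "bookings"]},
    "revenue_recognized": {"aliases": ["revenue", "arr"]},
    "active_customers": {"aliases": ["active customer"]},
}

def matching_metrics(term, layer=SEMANTIC_LAYER):
    term = term.lower()
    return [name for name, m in layer.items() if term in m["aliases"]]

def check_question(term):
    """Ambiguous terms get a clarifying question; unknown terms get
    flagged as low confidence instead of a hallucinated answer."""
    matches = matching_metrics(term)
    if len(matches) > 1:
        return f"Ambiguous: did you mean {' or '.join(matches)}?"
    if not matches:
        return "Unknown metric: low confidence, flag to the user."
    return f"Resolved to {matches[0]}."

print(check_question("revenue"))
```

A production version would use the LLM itself for matching, but the control flow stays the same: detect, ask, only then generate SQL.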

Step 5: Trust. A quality agent that knows common data quality problems per workflow type. CRM data has duplicate contacts and skipped pipeline stages. ERP data has currency conversion gaps. E-commerce data has cancelled orders still counted in revenue. This agent flags issues before answering, not after.
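Structurally, this agent is a catalogue of named checks keyed by workflow type. A sketch of that shape, with assumed table and column names in the SQL:

```python
# Hypothetical per-workflow data-quality checks.
# The SQL strings are illustrative; schema names are assumptions.
QUALITY_CHECKS = {
    "crm": [
        ("duplicate_contacts",
         "SELECT email, COUNT(*) FROM contacts "
         "GROUP BY email HAVING COUNT(*) > 1"),
    ],
    "ecommerce": [
        ("cancelled_in_revenue",
         "SELECT COUNT(*) FROM orders "
         "WHERE status = 'cancelled' AND counted_in_revenue"),
    ],
}

def checks_for(workflow):
    """Return the check names a quality agent would run before answering."""
    return [name for name, _sql in QUALITY_CHECKS.get(workflow, [])]

print(checks_for("ecommerce"))
```

The point of running these before answering is that "revenue is up 12%" and "revenue is up 12%, but cancelled orders are still being counted" are very different answers.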

Step 6: Visualize. Pick the best chart type based on the result shape and the question intent. Time series gets a line chart. Rankings get horizontal bars. A single KPI gets a big number with a delta arrow. No pie charts with 50 slices.
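The selection rules above reduce to a small decision function over the result shape. A minimal sketch (a real version would also weigh the question intent):

```python
def choose_chart(n_rows, n_numeric_cols, has_time_dim, is_ranking=False):
    """Heuristic chart selection from result shape. Sketch only;
    the rule order is an assumption."""
    if n_rows == 1 and n_numeric_cols == 1:
        return "big_number_with_delta"   # single KPI
    if has_time_dim:
        return "line"                    # time series
    if is_ranking:
        return "horizontal_bar"          # top-N rankings
    return "table"                       # safe fallback, never a 50-slice pie

print(choose_chart(12, 1, has_time_dim=True))
```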

The Discovery Agent proof

I have built and tested Step 2.

I pointed it at a test environment reproducing a midmarket internal management system. 244 tables. Django ORM schema. Cryptic column names. No documentation.

The agent found the spine. It identified the core business entities, mapped their relationships, inferred the pipeline stages. It flagged a 7.4M row trap table that would blow up any naive query. It built a data dictionary from scratch.

No custom code. No training data. Just Claude with read-only database access and search tools.

This is the differentiator. If discovery works, everything downstream gets dramatically easier. The semantic layer writes itself. The SQL generation has better context. The quality agent knows what to watch for.

If discovery doesn’t work, you are back to manual configuration, which is where every other tool starts and where adoption dies.

Why open source, why now

This is an Apache 2.0 project. Open source from day one.

Three reasons.

First, I believe the “honest AI” angle matters. BI tools should tell you when they are uncertain. They should flag data quality issues. They should say “I don’t know” instead of hallucinating an answer. That philosophy should be inspectable, not proprietary.

Second, I want this to be testable. An eval harness ships with it. Fifty question-SQL pairs against the demo database. Run python eval.py and see the accuracy scores. If the tool is bad, the numbers will show it.
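The core of such a harness is a loop over question-SQL pairs that compares result sets, not SQL strings, since equivalent queries rarely match textually. A sketch, with stubbed `ask_agent` and `run_sql` functions standing in for the real agent and database:

```python
def results_match(expected_rows, actual_rows):
    """Compare result sets order-insensitively: equivalent SQL
    often returns rows in a different order."""
    return sorted(map(tuple, expected_rows)) == sorted(map(tuple, actual_rows))

def run_eval(cases, ask_agent, run_sql):
    """Score the agent against gold SQL. `cases` is a list of
    {"question": ..., "gold_sql": ...} dicts (hypothetical format)."""
    passed = 0
    for case in cases:
        expected = run_sql(case["gold_sql"])
        actual = run_sql(ask_agent(case["question"]))
        passed += results_match(expected, actual)
    return passed / len(cases)

# Toy usage with stubs instead of a real database:
cases = [{"question": "total revenue?", "gold_sql": "SELECT 100"}]
score = run_eval(cases, ask_agent=lambda q: "SELECT 100",
                 run_sql=lambda sql: [[int(sql.split()[-1])]])
print(f"accuracy: {score:.0%}")
```

Result-set comparison also catches the failure taxonomy above: a dropped WHERE clause or wrong GROUP BY changes the rows even when the SQL looks plausible.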

Third, I am building this in public because I think the research is interesting regardless of whether the product succeeds. The discovery agent concept, the semantic layer generation, the confidence scoring framework. These are ideas worth sharing even if my specific implementation turns out to be wrong.

What comes next

I am in the design phase. The discovery agent works. The architecture is defined. The competitive research is done.

Next I need to build the demo database (a realistic GTM/sales schema, not a toy example), wire up the MCP server, implement the semantic layer generation, and build the ask/trust/visualize agents.

I will write about each step as I build it. What works, what breaks, what I learn.

If you work on agentic BI or have seen these problems firsthand, I’d like to hear from you. I am almost certainly wrong about some of this, and I would rather find out now.

The repo will be at github.com/lucassherpa/honest-analyst when there is code worth sharing.