Why LLMs Should Never Calculate Your Churn Rate
Intent is linguistic. Math is deterministic.
Swarnim Shrey
Founder, MindPalace
If you remember only one thing from this essay, remember the subtitle.
Every AI-native BI tool we have looked at is racing to let a large language model "talk to data." Demos look magical: ask a question in English, get a number, maybe even a chart. For a moment, it feels like we have finally killed the dashboard.
Under the hood, many of these systems are doing something irresponsible. They are letting a probabilistic language model calculate your business metrics. That is not innovation. That is a category error.
This post explains why LLMs should never compute churn, revenue, retention, or any metric you are about to make a decision on, and what a safer architecture looks like. We learned this the hard way while building MindPalace, so the examples are not hypothetical.
The seductive failure of chat-to-SQL
Most "chat with your data" tools follow the same pattern: the user asks a question in English, a model writes SQL, the warehouse runs it, and the model narrates the result.
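In code, the whole loop fits in a dozen lines. `llm_complete` and `run_query` below are illustrative stubs, not any real tool's API:

```python
# Single-loop "chat to SQL", reduced to its shape. llm_complete and
# run_query are stand-ins for a model API and a warehouse connection.

def llm_complete(prompt: str) -> str:
    # Stand-in for a probabilistic model call.
    if "Write SQL" in prompt:
        return "SELECT COUNT(*) FROM customers"
    return "You have 42 customers."

def run_query(sql: str) -> list:
    # Stand-in for the warehouse.
    return [(42,)]

def answer(question: str) -> str:
    sql = llm_complete(f"Write SQL for: {question}")          # probabilistic
    rows = run_query(sql)                                     # deterministic
    return llm_complete(f"Summarize {rows} for: {question}")  # probabilistic again

print(answer("How many customers do we have?"))
```

One model carries intent, translation, and the final number; nothing in the loop checks whether the SQL matches the business definition.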
This works surprisingly well, for demos.
The problem is that this single-loop architecture quietly assigns three incompatible responsibilities to one model:
- Understanding intent
- Translating intent into logic
- Performing or validating the math
LLMs are excellent at the first task. They are passable at the second. They are fundamentally unqualified for the third.
Why this fails in the real world
Churn looks simple until it is not.
Ask ten companies how they calculate churn and you will get twelve answers:
- Logo churn vs revenue churn
- Gross vs net
- Monthly vs cohort-based
- Voluntary vs involuntary
- Trial users included or excluded
Each definition is contextual. Each one encodes business judgment that a Finance team made and a CRO signed off on.
When you ask an LLM "what is our churn last quarter?" you are not asking a math question. You are asking a specification question. And LLMs do not ask clarifying questions unless explicitly forced to. They assume.
Assumptions are poison in analytics. When we test generic chat-to-SQL tools against realistic warehouse schemas, the same pattern shows up: the SQL parses, the number comes back, the executive nods, and the answer is wrong. Worse, it is plausibly wrong. Wrong by 8 percent, not by 800. The kind of wrong that survives a glance and dies under scrutiny.
Probability vs determinism
LLMs work by predicting the most likely next token.
Statistics work by applying deterministic functions to data.
These are not adjacent disciplines. They are orthogonal. When an LLM generates SQL or computes a metric inline, you get four failure modes that all look the same from the outside:
- Silent assumption drift
- Inconsistent results across runs
- Undetectable logical errors
- Confidence without correctness
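The orthogonality is easy to see in code. Below, a toy sampler stands in for a model's decoding step (purely illustrative), next to a deterministic statistic:

```python
import random

def sample_next_token(context: str) -> str:
    # Toy stand-in for LLM decoding: picks among plausible continuations.
    candidates = ["6.8%", "7.4%", "8.1%"]
    return random.choice(candidates)

def mean(xs: list[float]) -> float:
    # Deterministic: same input, same output, every run, on every machine.
    return sum(xs) / len(xs)

runs = {sample_next_token("Our churn is ") for _ in range(50)}
print(sorted(runs))      # can vary between runs
print(mean([6.8, 7.4]))  # 7.1, always
```

Every answer the sampler gives is plausible; only the deterministic function is accountable.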
The output sounds right. That is the danger. A wrong number with high linguistic confidence is more damaging than a dashboard nobody trusts. With a distrusted dashboard, people verify. With a confident answer, they act.
The AI-native BI architecture mistake
Here is what plausibly-wrong looks like in practice. We reproduced this on our reference B2B SaaS dataset, modeled after a typical billing-and-usage warehouse. We asked a popular chat-to-SQL tool for "monthly churn." The model produced this:
```sql
SELECT 1.0 - COUNT(DISTINCT customer_id) FILTER (
  WHERE last_active_at >= NOW() - INTERVAL '30 days'
)::float
  / COUNT(DISTINCT customer_id)
FROM customers
```

It returned 7.4 percent. The canonical definition for the same dataset, encoded in the semantic layer Finance uses, returns 6.8 percent. Off by 8 percent of the value. Looks fine, fails review.
The query has three quiet bugs, and no LLM caught them. It treats every row in customers as the denominator, but the table includes trial accounts the canonical definition excludes. It uses "active in the last 30 days" as the inverse of churn, while the canonical version uses "billing event in the current period," which is a different population. And it ignores the is_paying flag that gates the trusted definition. None of these bugs are visible in the SQL. All of them are visible in the semantic layer.
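For contrast, the canonical logic just described is a few lines when written deterministically. The field names (is_paying, is_trial, billed_periods) are assumed for illustration, not our actual schema:

```python
def monthly_churn(customers: list[dict], prev_period: str, curr_period: str) -> float:
    """Logo churn: paying, non-trial customers billed in the previous
    period who had no billing event in the current period."""
    base = [
        c for c in customers
        if c["is_paying"] and not c["is_trial"] and prev_period in c["billed_periods"]
    ]
    if not base:
        raise ValueError("empty denominator: no paying customers in previous period")
    churned = [c for c in base if curr_period not in c["billed_periods"]]
    return len(churned) / len(base)

customers = [
    {"is_paying": True,  "is_trial": False, "billed_periods": {"2024-01", "2024-02"}},
    {"is_paying": True,  "is_trial": False, "billed_periods": {"2024-01"}},  # churned
    {"is_paying": False, "is_trial": True,  "billed_periods": {"2024-01"}},  # excluded
]
print(monthly_churn(customers, "2024-01", "2024-02"))  # 0.5
```

The definitional choices (exclude trials, gate on is_paying, churn means no billing event) are explicit and reviewable, which is exactly what the generated SQL lacked.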
Many AI-native BI tools still collapse everything into one loop, though the better ones are starting to separate planning from execution. That separation is what we built MindPalace around from day one: LLMs are planners, not calculators. The principle forced us to build a real semantic layer first, before any AI features. Cartographer exists because of that decision.
A safer split: planning vs execution
We separate the system into two engines that have different jobs and different failure modes.
The planner (LLM)
Responsibilities:
- Interpret intent
- Clarify ambiguous questions
- Propose hypotheses
- Select analytical methods
Output is not a number. It is a structured plan: which metric definition applies, which population is in scope, which statistical test fits. This is language work. LLMs are great at it.
The analyzer (deterministic)
Responsibilities:
- Execute math
- Run statistical tests
- Validate assumptions
- Produce reproducible results
This engine is built on boring tools: Python, NumPy, SciPy, statsmodels. No creativity. No guessing. Just math.
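A sketch of that posture, in pure Python so the arithmetic stays visible. A production analyzer would lean on scipy.stats; the fail-loudly guards are the point, and the minimum-size threshold is illustrative:

```python
# Deterministic one-way ANOVA F-statistic with fail-loudly guards.
# Violated preconditions raise; nothing gets silently patched.

MIN_GROUP_SIZE = 3  # illustrative threshold

def anova_f(groups: list[list[float]]) -> float:
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    if any(len(g) < MIN_GROUP_SIZE for g in groups):
        raise ValueError("insufficient data: a group is below the minimum size")

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

print(anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

When a precondition fails, the function raises instead of guessing, and the failure surfaces to a human rather than to a model that would paper over it.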
Why we do not let LLMs "fix" the numbers
Some systems try to be clever:
- If SQL errors, ask the LLM to fix it
- If numbers look off, ask the LLM to adjust
- If results conflict, ask the LLM to reconcile
This creates a feedback loop where the model optimizes for plausibility, not truth.
Our analyzer is intentionally dumb:
- If assumptions are violated, it fails
- If data is insufficient, it stops
- If definitions conflict, it escalates to the metric owner via the review queue
Failure is a feature. Silent correction is not.
Below is a real shape of what we show. The plan, the assumption checks, and the test statistics all appear together so the audit trail is visible at the same time as the answer.
Deep Analysis
Is monthly churn higher in the SMB segment?

Plan
- Metric: Monthly Logo Churn
- Population: SMB vs Mid-Market vs Enterprise
- Window: Last 6 calendar months
- Test: One-way ANOVA

Assumption checks
- Sample size adequate per group
- Variance homogeneity (Levene p = 0.31)
- Independence of observations
- No confounding by tenure

Results
- F-statistic: 14.27
- p-value: 0.0008
- Effect size (η²): 0.41
- Sample size: 4,228
Illustrative output rendered from our reference SaaS dataset. In production, every result links to its full audit trail: the SQL that ran, the row counts at each step, and each of the assumption checks above with their pass conditions.
Why this matters for trust
Executives do not distrust data because they hate numbers. They distrust data because:
- Numbers change without explanation
- Metrics disagree across tools
- Nobody can trace why a value exists
Letting an LLM compute metrics accelerates all three problems. Separating intent from execution reverses them. Every result has a plan. Every plan has assumptions. Every assumption can be inspected. That is how trust gets built. Not with confidence, with traceability.
The real role of AI in analytics
AI should not replace your math. It should replace the coordination cost of analysis. Coordination cost is the time spent translating between business question, semantic definition, SQL, statistical method, and result. That is what AI compresses. The math itself was always the easy part.
In MindPalace specifically, the planner produces a small JSON document on every question: the metric being asked about, the population being scoped, the test being proposed (ANOVA, t-test, cohort comparison), and the assumptions that need to hold for the test to be valid. We use Claude as the planner because of its tool-use reliability and structured-output handling. The model choice matters less than keeping it out of the math entirely. Deep Analysis takes that JSON, runs the math in Python, and returns a result with the assumption checks attached.
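A sketch of that handoff. The field names follow the description above; the validation details are our illustration here, not the actual schema:

```python
import json

# Plan fields as described above; the planner emits structure, never numbers.
REQUIRED_FIELDS = {"metric", "population", "test", "assumptions"}
KNOWN_TESTS = {"one_way_anova", "t_test", "cohort_comparison"}

def validate_plan(raw: str) -> dict:
    # Gate between planner and analyzer: reject malformed plans loudly.
    plan = json.loads(raw)
    missing = REQUIRED_FIELDS - plan.keys()
    if missing:
        raise ValueError(f"plan missing fields: {sorted(missing)}")
    if plan["test"] not in KNOWN_TESTS:
        raise ValueError(f"unrecognized test: {plan['test']}")
    return plan

# What the planner might emit for the SMB churn question:
raw_plan = json.dumps({
    "metric": "monthly_logo_churn",
    "population": {"segment": ["SMB", "Mid-Market", "Enterprise"]},
    "test": "one_way_anova",
    "assumptions": ["variance_homogeneity", "independence", "adequate_sample_size"],
})
plan = validate_plan(raw_plan)
print(plan["test"])  # one_way_anova
```

Only a plan that passes the gate reaches the deterministic engine; a malformed one is rejected before any math runs.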
That split lets us do four things at once:
- Translate business questions into analytical steps
- Remember how each metric is defined in the company's semantic layer
- Pick the right statistical method for the shape of the data
- Show why a number exists, not just what it is
We have not seen an AI-native BI system work in production any other way. The ones that fail tend to fail the same way: a single LLM doing too many jobs at once, with nothing to catch it when it drifts.
A simple rule of thumb
If the output is a sentence, an LLM is appropriate.
If the output is a number you will make a decision on, an LLM should not touch it.
Intent is linguistic. Math is deterministic. Design your systems accordingly.
If you want to see how the planner-and-analyzer split actually runs in production, take a look at the product. If you want the long version of how we built the grounding layer underneath it, read "What is a Decision Context Graph?" below. If your team is currently the Human API between executives and the warehouse, that is the problem we are trying to dissolve.
Read this next
What is a Decision Context Graph? An Architectural Guide
A Decision Context Graph is the missing layer between your warehouse and your decisions. Here is what it is, how we build one in four hours, and why it matters now.
The Data-Driven Lie: Why Most Companies Fail at What They Claim to Do Best
Data-driven decision making has become table stakes language. In practice, most leadership meetings still run on whoever argues best. Here is what actually goes wrong, and what would fix it.