Why LLMs Should Never Calculate Your Churn Rate
Intent is linguistic. Math is deterministic.
Swarnim Shrey
Founder, MindPalace
If you remember only one thing from this essay, remember the subtitle.
Every AI-native BI tool we have looked at is racing to let a large language model "talk to data." Demos look magical: ask a question in English, get a number, maybe even a chart. For a moment, it feels like we have finally killed the dashboard.
Under the hood, many of these systems are doing something irresponsible. They are letting a probabilistic language model calculate your business metrics. That is not innovation. That is a category error.
This post explains why LLMs should never compute churn, revenue, retention, or any metric you are about to make a decision on, and what a safer architecture looks like. We learned this the hard way while building MindPalace, so the examples are not hypothetical.
The seductive failure of chat-to-SQL
Most "chat with your data" tools follow the same pattern: the user asks a question in English, a model writes SQL, the warehouse runs it, and the model narrates the result.
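In code, the whole loop fits in a dozen lines. `llm_complete` and `run_query` below are illustrative stubs, not any real tool's API:

```python
# Single-loop "chat to SQL", reduced to its shape. llm_complete and
# run_query are stand-ins for a model API and a warehouse connection.

def llm_complete(prompt: str) -> str:
    # Stand-in for a probabilistic model call.
    if "Write SQL" in prompt:
        return "SELECT COUNT(*) FROM customers"
    return "You have 42 customers."

def run_query(sql: str) -> list:
    # Stand-in for the warehouse.
    return [(42,)]

def answer(question: str) -> str:
    sql = llm_complete(f"Write SQL for: {question}")          # probabilistic
    rows = run_query(sql)                                     # deterministic
    return llm_complete(f"Summarize {rows} for: {question}")  # probabilistic again

print(answer("How many customers do we have?"))
```

One model carries intent, translation, and the final number; nothing in the loop checks whether the SQL matches the business definition.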
This works surprisingly well, for demos.
The problem is that this single-loop architecture quietly assigns three incompatible responsibilities to one model:
- Understanding intent
- Translating intent into logic
- Performing or validating the math
LLMs are excellent at the first task. They are passable at the second. They are fundamentally unqualified for the third.
Why this fails in the real world
Churn looks simple until it is not.
Ask ten companies how they calculate churn and you will get twelve answers:
- Logo churn vs revenue churn
- Gross vs net
- Monthly vs cohort-based
- Voluntary vs involuntary
- Trial users included or excluded
Each definition is contextual. Each one encodes business judgment that a Finance team made and a CRO signed off on.
When you ask an LLM "what is our churn last quarter?" you are not asking a math question. You are asking a specification question. And LLMs do not ask clarifying questions unless explicitly forced to. They assume.
Assumptions are poison in analytics. When we test generic chat-to-SQL tools against realistic warehouse schemas, the same pattern shows up: the SQL parses, the number comes back, the executive nods, and the answer is wrong. Worse, it is plausibly wrong. Wrong by 8 percent, not by 800. The kind of wrong that survives a glance and dies under scrutiny.
Probability vs determinism
LLMs work by predicting the most likely next token.
Statistics work by applying deterministic functions to data.
These are not adjacent disciplines. They are orthogonal. When an LLM generates SQL or computes a metric inline, you get four failure modes that all look the same from the outside:
- Silent assumption drift
- Inconsistent results across runs
- Undetectable logical errors
- Confidence without correctness
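The orthogonality is easy to see in code. Below, a toy sampler stands in for a model's decoding step (purely illustrative), next to a deterministic statistic:

```python
import random

def sample_next_token(context: str) -> str:
    # Toy stand-in for LLM decoding: picks among plausible continuations.
    candidates = ["6.8%", "7.4%", "8.1%"]
    return random.choice(candidates)

def mean(xs: list[float]) -> float:
    # Deterministic: same input, same output, every run, on every machine.
    return sum(xs) / len(xs)

runs = {sample_next_token("Our churn is ") for _ in range(50)}
print(sorted(runs))      # can vary between runs
print(mean([6.8, 7.4]))  # 7.1, always
```

Every answer the sampler gives is plausible; only the deterministic function is accountable.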
The output sounds right. That is the danger. A wrong number with high linguistic confidence is more damaging than a dashboard nobody trusts. With a distrusted dashboard, people verify. With a confident answer, they act.
The AI-native BI architecture mistake
Here is what plausibly-wrong looks like in practice. We reproduced this on our reference B2B SaaS dataset, modeled after a typical billing-and-usage warehouse. We asked a popular chat-to-SQL tool for "monthly churn." The model produced this:
```sql
SELECT 1.0 - COUNT(DISTINCT customer_id) FILTER (
  WHERE last_active_at >= NOW() - INTERVAL '30 days'
)::float
  / COUNT(DISTINCT customer_id)
FROM customers
```

It returned 7.4 percent. The canonical definition for the same dataset, encoded in the semantic layer Finance uses, returns 6.8 percent. Off by 8 percent of the value. Looks fine, fails review.
The query has three quiet bugs, and no LLM caught them. It treats every row in customers as the denominator, but the table includes trial accounts the canonical definition excludes. It uses "active in the last 30 days" as the inverse of churn, while the canonical version uses "billing event in the current period," which is a different population. And it ignores the is_paying flag that gates the trusted definition. None of these bugs are visible in the SQL. All of them are visible in the semantic layer.
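For contrast, the canonical logic just described is a few lines when written deterministically. The field names (is_paying, is_trial, billed_periods) are assumed for illustration, not our actual schema:

```python
def monthly_churn(customers: list[dict], prev_period: str, curr_period: str) -> float:
    """Logo churn: paying, non-trial customers billed in the previous
    period who had no billing event in the current period."""
    base = [
        c for c in customers
        if c["is_paying"] and not c["is_trial"] and prev_period in c["billed_periods"]
    ]
    if not base:
        raise ValueError("empty denominator: no paying customers in previous period")
    churned = [c for c in base if curr_period not in c["billed_periods"]]
    return len(churned) / len(base)

customers = [
    {"is_paying": True,  "is_trial": False, "billed_periods": {"2024-01", "2024-02"}},
    {"is_paying": True,  "is_trial": False, "billed_periods": {"2024-01"}},  # churned
    {"is_paying": False, "is_trial": True,  "billed_periods": {"2024-01"}},  # excluded
]
print(monthly_churn(customers, "2024-01", "2024-02"))  # 0.5
```

The definitional choices (exclude trials, gate on is_paying, churn means no billing event) are explicit and reviewable, which is exactly what the generated SQL lacked.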
Many AI-native BI tools still collapse everything into one loop, though the better ones are starting to separate planning from execution. That separation is what we built MindPalace around from day one: LLMs are planners, not calculators. The principle forced us to build a real semantic layer first, before any AI features. Cartographer exists because of that decision.
A safer split: planning vs execution
We separate the system into two engines that have different jobs and different failure modes.
The planner (LLM)
Responsibilities:
- Interpret intent
- Clarify ambiguous questions
- Propose hypotheses
- Select analytical methods
Output is not a number. It is a structured plan: which metric definition applies, which population is in scope, which statistical test fits. This is language work. LLMs are great at it.
The analyzer (deterministic)
Responsibilities:
- Execute math
- Run statistical tests
- Validate assumptions
- Produce reproducible results
This engine is built on boring tools: Python, NumPy, SciPy, statsmodels. No creativity. No guessing. Just math.
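A sketch of that posture, in pure Python so the arithmetic stays visible. A production analyzer would lean on scipy.stats; the fail-loudly guards are the point, and the minimum-size threshold is illustrative:

```python
# Deterministic one-way ANOVA F-statistic with fail-loudly guards.
# Violated preconditions raise; nothing gets silently patched.

MIN_GROUP_SIZE = 3  # illustrative threshold

def anova_f(groups: list[list[float]]) -> float:
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    if any(len(g) < MIN_GROUP_SIZE for g in groups):
        raise ValueError("insufficient data: a group is below the minimum size")

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

print(anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

When a precondition fails, the function raises instead of guessing, and the failure surfaces to a human rather than to a model that would paper over it.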
Why we do not let LLMs "fix" the numbers
Some systems try to be clever:
- If SQL errors, ask the LLM to fix it
- If numbers look off, ask the LLM to adjust
- If results conflict, ask the LLM to reconcile
This creates a feedback loop where the model optimizes for plausibility, not truth.
Our analyzer is intentionally dumb:
- If assumptions are violated, it fails
- If data is insufficient, it stops
- If definitions conflict, it escalates to the metric owner via the review queue
Failure is a feature. Silent correction is not.
Below is a real shape of what we show. The plan, the assumption checks, and the test statistics all appear together so the audit trail is visible at the same time as the answer.
Deep Analysis
Is monthly churn higher in the SMB segment?

Plan
- Metric: Monthly Logo Churn
- Population: SMB vs Mid-Market vs Enterprise
- Window: Last 6 calendar months
- Test: One-way ANOVA

Assumption checks
- Sample size adequate per group
- Variance homogeneity (Levene p = 0.31)
- Independence of observations
- No confounding by tenure

Results
- F-statistic: 14.27
- p-value: 0.0008
- Effect size (η²): 0.41
- Sample size: 4,228
Illustrative output rendered from our reference SaaS dataset. In production, every result links to its full audit trail: the SQL that ran, the row counts at each step, and each of the assumption checks above with their pass conditions.
Why this matters for trust
Executives do not distrust data because they hate numbers. They distrust data because:
- Numbers change without explanation
- Metrics disagree across tools
- Nobody can trace why a value exists
Letting an LLM compute metrics accelerates all three problems. Separating intent from execution reverses them. Every result has a plan. Every plan has assumptions. Every assumption can be inspected. That is how trust gets built. Not with confidence, with traceability.
The real role of AI in analytics
AI should not replace your math. It should replace the coordination cost of analysis. Coordination cost is the time spent translating between business question, semantic definition, SQL, statistical method, and result. That is what AI compresses. The math itself was always the easy part.
In MindPalace specifically, the planner produces a small JSON document on every question: the metric being asked about, the population being scoped, the test being proposed (ANOVA, t-test, cohort comparison), and the assumptions that need to hold for the test to be valid. We use Claude as the planner because of its tool-use reliability and structured-output handling. The model choice matters less than keeping it out of the math entirely. Deep Analysis takes that JSON, runs the math in Python, and returns a result with the assumption checks attached.
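A sketch of that handoff. The field names follow the description above; the validation details are our illustration here, not the actual schema:

```python
import json

# Plan fields as described above; the planner emits structure, never numbers.
REQUIRED_FIELDS = {"metric", "population", "test", "assumptions"}
KNOWN_TESTS = {"one_way_anova", "t_test", "cohort_comparison"}

def validate_plan(raw: str) -> dict:
    # Gate between planner and analyzer: reject malformed plans loudly.
    plan = json.loads(raw)
    missing = REQUIRED_FIELDS - plan.keys()
    if missing:
        raise ValueError(f"plan missing fields: {sorted(missing)}")
    if plan["test"] not in KNOWN_TESTS:
        raise ValueError(f"unrecognized test: {plan['test']}")
    return plan

# What the planner might emit for the SMB churn question:
raw_plan = json.dumps({
    "metric": "monthly_logo_churn",
    "population": {"segment": ["SMB", "Mid-Market", "Enterprise"]},
    "test": "one_way_anova",
    "assumptions": ["variance_homogeneity", "independence", "adequate_sample_size"],
})
plan = validate_plan(raw_plan)
print(plan["test"])  # one_way_anova
```

Only a plan that passes the gate reaches the deterministic engine; a malformed one is rejected before any math runs.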
That split lets us do four things at once:
- Translate business questions into analytical steps
- Remember how each metric is defined in the company's semantic layer
- Pick the right statistical method for the shape of the data
- Show why a number exists, not just what it is
We have not seen an AI-native BI system work in production any other way. The ones that fail tend to fail the same way: a single LLM doing too many jobs at once, with nothing to catch it when it drifts.
A simple rule of thumb
If the output is a sentence, an LLM is appropriate.
If the output is a number you will make a decision on, an LLM should not touch it.
Intent is linguistic. Math is deterministic. Design your systems accordingly.
If you want to see how the planner-and-analyzer split actually runs in production, take a look at the product. If you want the long version of how we built the grounding layer underneath it, read "What is a Decision Context Graph?" below. If your team is currently the Human API between executives and the warehouse, that is the problem we are trying to dissolve.
Read this next
What is a Decision Context Graph? An Architectural Guide
A Decision Context Graph is the missing layer between your warehouse and your decisions. Here is what it is, how we build one in four hours, and why it matters now.
The Data-Driven Lie: Why Most Companies Fail at What They Claim to Do Best
Data-driven decision making has become table stakes language. In practice, most leadership meetings still run on whoever argues best. Here is what actually goes wrong, and what would fix it.