SQL Tells You What. Comments Tell You Why.
The best SQL doesn’t need comments. Write meaningful CTE names, descriptive aliases, clear column labels — and a skilled reader will follow your logic without a single annotation. That’s the right instinct.
It’s also only half right.
SQL is a declarative language. You’re not writing how the database retrieves your data; you’re writing what you want. That’s a useful distinction, because “what” and “why” are very different questions, and SQL can answer exactly one of them.
A query can be perfectly, elegantly readable and still be completely opaque about the reason it exists. The name of your CTE, no matter how well chosen, cannot explain the business decision that gave birth to it.
The self-documenting ceiling
Consider this query:
SELECT
customer_id,
SUM(order_total) AS lifetime_value,
DATEDIFF(day, MIN(order_date), CURRENT_DATE) AS customer_age_days
FROM
orders
WHERE
order_date >= DATEADD(day, -90, CURRENT_DATE)
AND status_code NOT IN (1, 2, 9)
GROUP BY
customer_id
Clean SQL. No ambiguity about what it does. But try answering these questions from the code alone:
Why 90 days? Why not 60, or 180, or the full history? Is 90 the standard customer lifetime window, a finance reporting period, the SLA in a merchant agreement, or the number someone picked in a meeting four years ago and nobody has questioned since?
What are status codes 1, 2, and 9? You might guess from context — pending, draft, cancelled — but you’re guessing. More importantly: why are they excluded? Is this a business rule about what counts as “real” revenue, or a data quality workaround because those status codes appear when an upstream webhook fires twice?
These are not trivial questions. The 90-day cutoff defines what “active customer” means across your entire reporting layer. The excluded status codes determine what gets counted as revenue. Change either without understanding the original intent, and you’re not refactoring — you’re silently redefining business logic that stakeholders are relying on.
Better SQL helps, but only so far:
SELECT
customer_id,
SUM(order_total) AS lifetime_value,
DATEDIFF(day, MIN(order_date), CURRENT_DATE) AS customer_age_days
FROM
orders
WHERE
order_date >= DATEADD(day, -90, CURRENT_DATE)
AND status_code NOT IN (
1, -- pending
2, -- draft
9 -- cancelled
)
GROUP BY
customer_id
This is better. You’ve replaced the mystery with labels. But inline labels aren’t explanations — they’re just naming the what more explicitly. You still don’t know why.
What SQL literally cannot tell you
SQL cannot explain why the program was written the way it was. It cannot discuss the reasons certain alternative approaches were taken. It cannot tell you that the business definition changed, that the upstream system is broken, or that this logic exists specifically to handle a problem that was supposed to be fixed in Q2 and wasn’t.
Here’s the kind of context that belongs in a comment:
-- Customer lifetime window: 90-day lookback aligns with the SLA in the Merchant Agreement (v3.2).
-- Reviewed and confirmed with Finance, Dec 2024. Don't change without checking with that team first.
-- status_codes 1, 2, 9 excluded per the order lifecycle defined in the legacy Salesforce migration.
-- These codes appear as artefacts when the Salesforce sync fires before the order is confirmed.
-- The upstream fix was deprioritised (JIRA: DATA-3841). Removing this filter will cause double-counting.
SELECT
customer_id,
SUM(order_total) AS lifetime_value,
DATEDIFF(day, MIN(order_date), CURRENT_DATE) AS customer_age_days
FROM
orders
WHERE
order_date >= DATEADD(day, -90, CURRENT_DATE)
AND status_code NOT IN (1, 2, 9)
GROUP BY
customer_id
That comment block is five lines. It contains: the business rationale, the stakeholder who owns the decision, the date it was confirmed, the upstream system causing the problem, the known issue reference, and the consequence of removing the filter. None of that information is recoverable from the SQL. All of it is load-bearing.
The deduplication that nobody remembers adding
Have you seen this line before:
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id, order_id
ORDER BY ingested_at DESC
) = 1
The SQL is completely transparent about what it does: take the most recently ingested row for each customer/order combination. What it cannot tell you is why duplicates exist in the first place, whether the source of those duplicates has been fixed, whether this deduplication logic is still necessary, or whether “most recently ingested” is actually the right tiebreaker for your use case.
-- Deduplicating on customer_id + order_id to handle duplicate webhook events from Eftpos.
-- Eftpos retries failed webhooks, which can fire the same order_created event multiple times.
-- Taking the latest ingested_at gives us the most recent event state (status, amount, etc.)
-- Revisit if we adopt Eftpos's idempotency keys at the integration layer — this dedup may become redundant.
-- Background: this logic was added after the October 2024 incident (post-mortem in Confluence).
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id, order_id
ORDER BY ingested_at DESC
) = 1
The second version tells the reader everything they need to know to maintain this safely: the upstream cause, the logic behind the tiebreaker, when to reconsider it, and where to find the history. Without that comment, every future engineer who touches this code has to reverse-engineer context that no longer exists anywhere.
When two models define the same word differently
This one doesn’t generate an error. It generates confusion at the worst possible moment.
-- IMPORTANT: 'churned' here means >90 days since last order.
-- This differs from mkt_customer_segments, which uses >60 days for winback campaign targeting.
-- Both definitions are intentional. Customer Success uses 90 days because cohort analysis
-- showed that 30% of "60-day churned" customers reorder organically within the next 30 days
-- and shouldn't receive a discount. Marketing uses the stricter threshold to maximise winback opportunities.
-- The discrepancy is documented and known. Do not "fix" one to match the other.
CASE
WHEN days_since_last_order <= 30 THEN 'active'
WHEN days_since_last_order <= 90 THEN 'at_risk'
ELSE 'churned'
END AS customer_segment
Without that comment, someone eventually “cleans up” one of the two definitions because having different churn thresholds in two models looks like a bug. They’re wrong, and they’re about to break a winback campaign that’s been performing well for six months.
The SQL cannot warn them. The comment can.
dbt descriptions aren’t exempt
dbt gives you a proper home for the “why”: model descriptions and column-level descriptions in your YAML.
Bad:
models:
- name: fct_orders
description: "The orders fact table. Contains order data."
columns:
- name: customer_segment
description: "The customer segment based on days since last order."
That description restates what the model name already says. It is a comment-shaped void.
Good:
models:
- name: fct_orders
description: >
Order-level fact table used as the single source of truth for revenue reporting.
Revenue excludes test accounts (prefixed 'INTERNAL_') and gift card redemptions,
per the Finance revenue recognition policy agreed January 2025.
See Confluence: Revenue Recognition Standards v3.
Do not use this model for marketing attribution — use mkt_attributed_revenue instead,
which applies different channel logic.
columns:
- name: customer_segment
description: >
Lifecycle segment defined by the Customer Success team, Q1 2025.
Thresholds (30/90 days) were derived from cohort analysis showing a median
reorder window of 28 days. Intentionally differs from the Marketing segment
definition in mkt_customer_segments, which uses a 60-day churn threshold
for winback campaign purposes. Both are correct for their respective use cases.
That description does something a column name never can: it tells you who owns the definition, when it was set, what analysis underpins it, and — critically — where the definition intentionally diverges from a similar field elsewhere in the warehouse. That last part is the thing that saves someone three hours of confused Teams messages.
The archaeology problem
Data teams inherit codebases. It happens constantly, and it will keep happening. Engineers leave. Consultants finish their contracts. Reorgs shuffle ownership. What gets left behind is the SQL, and whatever context wasn’t written down is gone.
Good SQL survives the handover. It’s readable, consistent, and correct. But “readable” in the sense of mechanically parseable is not the same as “understandable” in the sense of knowing what business problem the code was solving, whose decision it was, and what you’d need to change if that decision changed.
Comments are how you write for the engineer who inherits this in two years. They’re how you write for yourself in six months. They’re how you ensure that a working pipeline stays working through the context collapse that happens every time anyone leaves a team.
SQL can only tell you what. Try not to shortchange the people reading it in either direction.
