<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Platform on Ghost in the data</title><link>https://ghostinthedata.info/tags/data-platform/</link><description>Ghost in the data</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Ghost in the data</copyright><lastBuildDate>Sat, 25 Apr 2026 09:00:00 +1100</lastBuildDate><atom:link href="https://ghostinthedata.info/tags/data-platform/index.xml" rel="self" type="application/rss+xml"/><item><title>Your Data Platform Costs More Than It Should</title><link>https://ghostinthedata.info/posts/2026/2026-04-25-cost-management/</link><pubDate>Sat, 25 Apr 2026 09:00:00 +1100</pubDate><guid>https://ghostinthedata.info/posts/2026/2026-04-25-cost-management/</guid><author>Chris Hillman</author><description>A practical guide to understanding, measuring, and reducing your Snowflake and AWS data platform costs — starting with the habits that actually move the needle.</description><content:encoded>&lt;p>Let me tell you about the moment I stopped treating cloud costs as someone else&amp;rsquo;s problem.&lt;/p>
&lt;p>We were three months into a Snowflake migration. Everything was humming. Pipelines were green, dashboards were fast, the analytics team was happier than I&amp;rsquo;d seen them before. I felt good about the work we&amp;rsquo;d done.&lt;/p>
&lt;p>Then finance forwarded me the invoice.&lt;/p>
&lt;p>The number wasn&amp;rsquo;t catastrophic. But it was significantly higher than what we&amp;rsquo;d budgeted, and when I started digging, I couldn&amp;rsquo;t explain where most of it was going. I knew we had warehouses running. I knew we had pipelines executing. But I couldn&amp;rsquo;t tell you which warehouse was responsible for what cost, which pipelines were the expensive ones, or whether the money was well spent. I had built a platform I was proud of — and I had no idea what it actually cost to operate.&lt;/p>
&lt;p>That&amp;rsquo;s the moment that changed how I think about data engineering. Not because of the dollar amount, but because of the realisation underneath it: &lt;strong>I had built something I couldn&amp;rsquo;t explain to the people paying for it.&lt;/strong> And if I couldn&amp;rsquo;t explain it, I couldn&amp;rsquo;t defend it. And if I couldn&amp;rsquo;t defend it, someone else would make the decisions for me — someone who didn&amp;rsquo;t understand why the platform mattered.&lt;/p>
&lt;br>
&lt;p>I&amp;rsquo;m telling you this because cost management is one of those things that sounds like a finance problem until you experience the consequences firsthand. It&amp;rsquo;s not about being cheap. It&amp;rsquo;s about being intentional. It&amp;rsquo;s about knowing that every credit you spend is buying something valuable — and being able to prove it when someone asks.&lt;/p>
&lt;p>The data engineers who understand their costs don&amp;rsquo;t just save money. They earn trust. They get budget for the projects that matter. They sleep better because they&amp;rsquo;ve eliminated the waste that eventually becomes someone else&amp;rsquo;s excuse to cut headcount or freeze hiring.&lt;/p>
&lt;p>This article is about building that understanding. Not with a vendor&amp;rsquo;s optimisation tool or a consultant&amp;rsquo;s audit — but with the habits, queries, and mental models that let you own your platform&amp;rsquo;s economics from the inside. Everything here is grounded in Snowflake and AWS, with specific code you can run today.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="you-cant-optimise-what-you-cant-see">You can&amp;rsquo;t optimise what you can&amp;rsquo;t see&lt;/h3>
&lt;br>
&lt;p>Before you touch a single warehouse configuration, you need to answer one question: &lt;strong>where is the money going?&lt;/strong>&lt;/p>
&lt;p>Most teams skip this step. They read a blog post about auto-suspend settings, change a few defaults, and call it optimisation. That&amp;rsquo;s like going on a diet by switching to diet soda while eating three pizzas a day. The soda wasn&amp;rsquo;t the problem.&lt;/p>
&lt;p>Here&amp;rsquo;s the query I run first on every Snowflake environment I touch. It tells you which warehouses are consuming the most credits over the last 30 days:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> total_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>.&lt;span style="color:#ae81ff">00&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> estimated_cost_usd, &lt;span style="color:#75715e">-- adjust your credit price
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> DATE_TRUNC(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, start_time)) &lt;span style="color:#66d9ef">AS&lt;/span> active_days,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND(&lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> DATE_TRUNC(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, start_time)), &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> credits_per_day
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> snowflake.account_usage.warehouse_metering_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">30&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> warehouse_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> total_credits &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Run that. Right now. I&amp;rsquo;ll wait.&lt;/p>
&lt;p>If you&amp;rsquo;re like most teams, 70–80% of your credits come from two or three warehouses. That&amp;rsquo;s your starting point. Not everything — just the expensive stuff.&lt;/p>
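&lt;p>To put a number on that concentration, compute each warehouse&amp;rsquo;s share of total credits. Same metering view as above, just with a window function over the grouped totals:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Share of the last 30 days of credits per warehouse
SELECT
    warehouse_name,
    SUM(credits_used) AS total_credits,
    ROUND(RATIO_TO_REPORT(SUM(credits_used)) OVER () * 100, 1) AS pct_of_total
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
&lt;/code>&lt;/pre>&lt;/div>&lt;p>If the top three rows add up to 70% or more of the total, that&amp;rsquo;s your optimisation surface for the next month.&lt;/p>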
&lt;br>
&lt;p>Now do the same thing for queries. This one finds your top 20 most expensive queries by bytes scanned:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> query_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> query_text,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> user_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_elapsed_time &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> elapsed_seconds,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> bytes_scanned &lt;span style="color:#f92672">/&lt;/span> POWER(&lt;span style="color:#ae81ff">1024&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> gb_scanned,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> partitions_scanned,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> partitions_total,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND(partitions_scanned &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">NULLIF&lt;/span>(partitions_total, &lt;span style="color:#ae81ff">0&lt;/span>) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> pct_partitions_scanned
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.query_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">7&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> bytes_scanned &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> bytes_scanned &lt;span style="color:#66d9ef">DESC&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LIMIT&lt;/span> &lt;span style="color:#ae81ff">20&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Pay attention to that &lt;code>pct_partitions_scanned&lt;/code> column. If you see queries scanning 90–100% of a table&amp;rsquo;s partitions, those queries aren&amp;rsquo;t benefiting from clustering or partition pruning. That&amp;rsquo;s where the big wins hide.&lt;/p>
&lt;p>This 15-minute exercise — top warehouses, top queries — tells you more about your cost profile than any dashboard. It&amp;rsquo;s the equivalent of checking your bank statement before creating a budget. Obvious in hindsight. Almost nobody does it.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="the-idle-tax-is-your-single-biggest-waste">The idle tax is your single biggest waste&lt;/h3>
&lt;br>
&lt;p>Here&amp;rsquo;s what nobody tells you about Snowflake billing: you pay for compute in 60-second minimums. Every time a warehouse resumes from suspension, you&amp;rsquo;re billed for at least one minute — regardless of whether the query takes 2 seconds or 58 seconds.&lt;/p>
&lt;p>This matters more than it sounds. Picture a BI tool like Metabase or Tableau hitting your warehouse with 20 small metadata queries over 15 minutes. If your warehouse auto-suspends after 5 minutes (Snowflake&amp;rsquo;s default for many setups), it might suspend and resume multiple times during that window. Each resume triggers another 60-second charge.&lt;/p>
&lt;p>Twenty queries that take 3 seconds each? That&amp;rsquo;s 60 seconds of actual compute. But if the warehouse suspends and resumes 4 times, you&amp;rsquo;re billed for 240 seconds. A 4x overhead.&lt;/p>
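&lt;p>You can see how often this churn is happening from the warehouse events view. One caveat: event names and logging vary by account, and a single resume can log more than one row, so run &lt;code>SELECT DISTINCT event_name&lt;/code> against the view first and treat the counts as relative rather than exact:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Resumes per warehouse per day over the last week; each resume
-- from suspension triggers another 60-second minimum charge
SELECT
    warehouse_name,
    DATE_TRUNC(&amp;#39;day&amp;#39;, timestamp) AS day,
    COUNT(*) AS resumes
FROM snowflake.account_usage.warehouse_events_history
WHERE event_name = &amp;#39;RESUME_WAREHOUSE&amp;#39;
  AND timestamp &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, day
ORDER BY resumes DESC;
&lt;/code>&lt;/pre>&lt;/div>&lt;p>A warehouse showing dozens of resumes a day on a short auto-suspend is a candidate for consolidating its jobs into fewer windows; batching the work is often cheaper than any setting.&lt;/p>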
&lt;p>The fix is straightforward but requires thought:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- For transformation warehouses (predictable, bursty workloads)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">ALTER&lt;/span> WAREHOUSE transform_wh
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SET&lt;/span> AUTO_SUSPEND &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">60&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AUTO_RESUME &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">TRUE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> WAREHOUSE_SIZE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;MEDIUM&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- For BI/dashboard warehouses (frequent small queries)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">ALTER&lt;/span> WAREHOUSE bi_wh
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SET&lt;/span> AUTO_SUSPEND &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">300&lt;/span> &lt;span style="color:#75715e">-- 5 min keeps it warm between dashboard interactions
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> AUTO_RESUME &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">TRUE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> WAREHOUSE_SIZE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;SMALL&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- For ad-hoc/analyst warehouses (unpredictable, intermittent)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">ALTER&lt;/span> WAREHOUSE adhoc_wh
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SET&lt;/span> AUTO_SUSPEND &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">60&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AUTO_RESUME &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">TRUE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> WAREHOUSE_SIZE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;XSMALL&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The principle: &lt;strong>transformation warehouses should suspend aggressively&lt;/strong> because they run in defined windows with gaps between runs. BI warehouses should stay warm a bit longer because dashboard users generate clusters of queries with short pauses in between. Ad-hoc warehouses should be small and aggressive — analysts can tolerate a 1–2 second resume delay.&lt;/p>
&lt;p>While you&amp;rsquo;re in the metering view, check one more ratio: how much of each warehouse&amp;rsquo;s bill is cloud services overhead rather than raw compute. Chatty BI tools and metadata-heavy clients show up here:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> total_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used_compute) &lt;span style="color:#66d9ef">AS&lt;/span> compute_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used_cloud_services) &lt;span style="color:#66d9ef">AS&lt;/span> cloud_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used_cloud_services)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">NULLIF&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used), &lt;span style="color:#ae81ff">0&lt;/span>) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ) &lt;span style="color:#66d9ef">AS&lt;/span> pct_cloud_services
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.warehouse_metering_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">30&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">HAVING&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pct_cloud_services &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_credits &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>pct_cloud_services&lt;/code> is the share of each warehouse&amp;rsquo;s bill that went to coordination rather than compute. If it sits above 10%, something is hammering the metadata layer, usually a BI tool or orchestrator firing thousands of tiny queries. Tightening auto-suspend won&amp;rsquo;t fix this one; find the chatty client instead.&lt;/p>
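&lt;p>One more angle on the idle tax: compare how much time queries actually spent executing with the compute credits the warehouse metered over the same window. This is a rough sketch only; elapsed time includes queueing, concurrent queries overlap, and credits per second depend on warehouse size. Treat a warehouse with big credits and tiny busy time as a prompt to investigate, not a verdict:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Rough busy-time vs metered-credit comparison per warehouse.
-- busy_hours double-counts concurrent queries; directional only.
WITH busy AS (
    SELECT
        warehouse_name,
        ROUND(SUM(total_elapsed_time) / 1000 / 3600, 1) AS busy_hours
    FROM snowflake.account_usage.query_history
    WHERE start_time &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
)
SELECT
    m.warehouse_name,
    ROUND(SUM(m.credits_used_compute), 1) AS compute_credits,
    MAX(b.busy_hours) AS busy_hours
FROM snowflake.account_usage.warehouse_metering_history m
LEFT JOIN busy b
    ON m.warehouse_name = b.warehouse_name
WHERE m.start_time &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -30, CURRENT_TIMESTAMP())
GROUP BY m.warehouse_name
ORDER BY compute_credits DESC;
&lt;/code>&lt;/pre>&lt;/div>&lt;p>An XSMALL warehouse burns one credit per hour, so when &lt;code>compute_credits&lt;/code> dwarfs &lt;code>busy_hours&lt;/code> times the credit rate for the warehouse&amp;rsquo;s size, it&amp;rsquo;s mostly metering air.&lt;/p>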
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="the-most-expensive-query-is-the-one-that-scans-everything">The most expensive query is the one that scans everything&lt;/h3>
&lt;br>
&lt;p>Snowflake charges you for the compute time your queries consume, and nothing drives compute time like full table scans. The two most common culprits are queries that wrap filter columns in functions, and queries that select more columns than they need.&lt;/p>
&lt;p>Here&amp;rsquo;s what I mean. This query looks innocent:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Expensive: function on the filter column disables pruning
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_id, customer_id, total_amount
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> analytics.fct_orders
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DATE(order_timestamp) &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;2026-03-01&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>But it&amp;rsquo;s quietly terrible. Wrapping &lt;code>order_timestamp&lt;/code> in &lt;code>DATE()&lt;/code> forces Snowflake to evaluate every row before filtering. The query planner can&amp;rsquo;t use micro-partition metadata to skip irrelevant partitions.&lt;/p>
&lt;p>This version does the same thing, but lets Snowflake prune:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Cheap: range filter on raw column enables pruning
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_id, customer_id, total_amount
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> analytics.fct_orders
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_timestamp &lt;span style="color:#f92672">&amp;gt;=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;2026-03-01&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> order_timestamp &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#e6db74">&amp;#39;2026-03-02&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The difference can be enormous on large tables — I&amp;rsquo;ve seen this single change reduce bytes scanned by 95% on billion-row fact tables.&lt;/p>
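&lt;p>Don&amp;rsquo;t take the 95% on faith; measure it on your own tables. Run both versions, then compare what each one scanned. The query IDs below are placeholders for your two runs, and note that &lt;code>account_usage&lt;/code> views can lag by up to 45 minutes; the &lt;code>INFORMATION_SCHEMA.QUERY_HISTORY&lt;/code> table function shows recent queries immediately:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Compare scan stats for the two runs; replace the placeholder IDs
SELECT
    query_id,
    ROUND(bytes_scanned / POWER(1024, 3), 2) AS gb_scanned,
    partitions_scanned,
    partitions_total
FROM snowflake.account_usage.query_history
WHERE query_id IN (&amp;#39;&amp;lt;date_function_run_id&amp;gt;&amp;#39;, &amp;#39;&amp;lt;range_filter_run_id&amp;gt;&amp;#39;);
&lt;/code>&lt;/pre>&lt;/div>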
&lt;p>The other silent killer is &lt;code>SELECT *&lt;/code> in intermediate transformations. In a dbt project, I often find staging models that look like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- stg_orders.sql (before)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;raw&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;orders&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> _loaded_at &lt;span style="color:#f92672">&amp;gt;&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">MAX&lt;/span>(_loaded_at) &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> this &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>That &lt;code>SELECT *&lt;/code> pulls every column from the source — including columns nobody downstream ever uses. In a columnar store like Snowflake, you only pay to scan the columns you reference. Trimming to the columns you actually need is free performance:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- stg_orders.sql (after)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> customer_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_timestamp,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_amount,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> status,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> _loaded_at
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;raw&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;orders&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> _loaded_at &lt;span style="color:#f92672">&amp;gt;&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">MAX&lt;/span>(_loaded_at) &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> this &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This isn&amp;rsquo;t premature optimisation. It&amp;rsquo;s hygiene. Every &lt;code>SELECT *&lt;/code> in your transformation layer is a small tax you pay on every single run.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="full-refreshes-are-the-most-expensive-default-in-data-engineering">Full refreshes are the most expensive default in data engineering&lt;/h3>
&lt;br>
&lt;p>If you&amp;rsquo;re using dbt — and most teams on Snowflake are — the default materialisation is &lt;code>view&lt;/code>, and the big models usually get promoted to &lt;code>table&lt;/code>. Both have the same problem at scale: they rebuild everything, every time.&lt;/p>
&lt;p>A table materialisation on a 500-million-row fact table means Snowflake reads, transforms, and writes 500 million rows every run. Even if only 50,000 rows changed since yesterday. That&amp;rsquo;s a 10,000x overhead.&lt;/p>
&lt;p>Switching to incremental materialisation is the single highest-impact cost change most teams can make:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- fct_orders.sql
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> config(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> materialized&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;incremental&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> unique_key&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;order_id&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> incremental_strategy&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;merge&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> on_schema_change&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;append_new_columns&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> customer_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> order_timestamp,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_amount,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> status,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> updated_at
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;stg_orders&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> is_incremental() &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span> updated_at &lt;span style="color:#f92672">&amp;gt;&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;hour&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#66d9ef">MAX&lt;/span>(updated_at))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> this &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> endif &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A few things worth noting here. The &lt;code>DATEADD('hour', -3, ...)&lt;/code> creates a 3-hour lookback window. This catches late-arriving data and handles clock skew between source systems. Without it, you&amp;rsquo;ll miss rows that arrive slightly out of order — and your numbers will silently drift.&lt;/p>
&lt;p>The &lt;code>unique_key&lt;/code> with &lt;code>incremental_strategy='merge'&lt;/code> means Snowflake will update existing rows and insert new ones. This is essential for tables where source records get modified after initial load (order status changes, for example).&lt;/p>
&lt;p>The rule of thumb: &lt;strong>if a table has more than a million rows and less than 20% changes per run, make it incremental.&lt;/strong> The cost savings are usually 80–95% on that model&amp;rsquo;s compute.&lt;/p>
&lt;p>But — and this is important — schedule a periodic full refresh to correct any drift. I typically set up a weekly &lt;code>dbt build --full-refresh --select fct_orders&lt;/code> via a separate Airflow DAG or GitHub Actions workflow. Belt and suspenders.&lt;/p>
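&lt;p>If you take the GitHub Actions route, the workflow is small. Here&amp;rsquo;s a sketch — the schedule, the install step, and the secret names are illustrative, so adapt them to your project:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># .github/workflows/weekly-full-refresh.yml (illustrative)
name: weekly-full-refresh
on:
  schedule:
    - cron: &amp;#34;0 18 * * 6&amp;#34;   # Saturday evening UTC, outside peak hours
jobs:
  full-refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: dbt build --full-refresh --select fct_orders
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
&lt;/code>&lt;/pre>&lt;/div>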
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="the-hidden-costs-your-snowflake-dashboard-wont-show-you">The hidden costs your Snowflake dashboard won&amp;rsquo;t show you&lt;/h3>
&lt;br>
&lt;p>Most teams monitor warehouse credits because that&amp;rsquo;s what the Snowflake UI makes visible. But there are cost vectors that don&amp;rsquo;t show up in the obvious places.&lt;/p>
&lt;p>&lt;strong>Cloud services credits&lt;/strong> accrue when queries use Snowflake&amp;rsquo;s coordination layer — query compilation, metadata operations, result set caching. Normally this is covered by a 10% &amp;ldquo;free&amp;rdquo; adjustment against your compute credits. But if you have BI tools making thousands of small metadata queries, cloud services can exceed that adjustment and start costing real money.&lt;/p>
&lt;p>Find them:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DATE_TRUNC(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, usage_date) &lt;span style="color:#66d9ef">AS&lt;/span> &lt;span style="color:#66d9ef">day&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> total_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_adjustment_cloud_services) &lt;span style="color:#66d9ef">AS&lt;/span> cloud_services_adjustment,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CASE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHEN&lt;/span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_adjustment_cloud_services) &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">THEN&lt;/span> &lt;span style="color:#66d9ef">ABS&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(credits_adjustment_cloud_services))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">ELSE&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">END&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> excess_cloud_services_cost
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.metering_daily_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> usage_date &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">30&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_DATE&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">day&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">day&lt;/span> &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Serverless feature credits&lt;/strong> — Snowpipe, automatic clustering, materialized view maintenance, search optimisation — all consume credits outside your warehouse billing. They don&amp;rsquo;t show up in &lt;code>warehouse_metering_history&lt;/code>. You need to check separately:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Snowpipe costs
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pipe_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> total_credits
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.pipe_usage_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">30&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pipe_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_credits &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Automatic clustering costs
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">table_name&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> total_credits
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.automatic_clustering_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">30&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">table_name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_credits &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>On the AWS side, the sneaky costs tend to be data transfer. Moving data between S3 regions, or from S3 out to the internet, adds up quietly. If your Snowflake account is in &lt;code>us-east-1&lt;/code> but your S3 landing zone is in &lt;code>ap-southeast-2&lt;/code>, every byte of ingestion carries a cross-region transfer charge. Check your AWS Cost Explorer with the &amp;ldquo;Data Transfer&amp;rdquo; service filter — the numbers are often surprising.&lt;/p>
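&lt;p>You can pull the same breakdown from the CLI. A sketch against the Cost Explorer API — the usage-type group names here are assumptions that vary by account, so list yours first with &lt;code>aws ce get-dimension-values --dimension USAGE_TYPE_GROUP&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash"># Daily data-transfer spend for the last month, grouped by usage type
aws ce get-cost-and-usage \
  --time-period Start=2026-03-25,End=2026-04-25 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --filter &amp;#39;{&amp;#34;Dimensions&amp;#34;: {&amp;#34;Key&amp;#34;: &amp;#34;USAGE_TYPE_GROUP&amp;#34;,
    &amp;#34;Values&amp;#34;: [&amp;#34;S3: Data Transfer - Region to Region (Out)&amp;#34;,
                &amp;#34;S3: Data Transfer - Internet (Out)&amp;#34;]}}&amp;#39;
&lt;/code>&lt;/pre>&lt;/div>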
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="tag-everything-or-youll-optimise-blind">Tag everything or you&amp;rsquo;ll optimise blind&lt;/h3>
&lt;br>
&lt;p>You cannot reduce costs you can&amp;rsquo;t attribute. The simplest and most underused tool in Snowflake is the query tag — a piece of metadata you attach to every query that tells you &lt;em>what&lt;/em> generated it.&lt;/p>
&lt;p>If you&amp;rsquo;re using dbt, this takes about five minutes to set up. Create a macro:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- macros/set_query_tag.sql
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> macro set_query_tag() &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#66d9ef">set&lt;/span> query_tag &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;dbt_model&amp;#34;&lt;/span>: model.name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;dbt_schema&amp;#34;&lt;/span>: model.&lt;span style="color:#66d9ef">schema&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;dbt_materialized&amp;#34;&lt;/span>: model.config.materialized,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;dbt_invocation_id&amp;#34;&lt;/span>: invocation_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;environment&amp;#34;&lt;/span>: target.name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">}&lt;/span> &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#66d9ef">do&lt;/span> run_query(&lt;span style="color:#e6db74">&amp;#34;ALTER SESSION SET QUERY_TAG = &amp;#39;{}&amp;#39;&amp;#34;&lt;/span>.format(query_tag &lt;span style="color:#f92672">|&lt;/span> tojson)) &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> endmacro &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then add it as a pre-hook in your &lt;code>dbt_project.yml&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># dbt_project.yml&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">models&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">your_project&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">+pre-hook&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;{{ set_query_tag() }}&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now every query dbt runs is tagged with the model name, materialisation type, and environment. You can query cost by model:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PARSE_JSON(query_tag):dbt_model::STRING &lt;span style="color:#66d9ef">AS&lt;/span> model_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PARSE_JSON(query_tag):dbt_materialized::STRING &lt;span style="color:#66d9ef">AS&lt;/span> materialization,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#f92672">*&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> query_count,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(total_elapsed_time) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> total_seconds,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(bytes_scanned) &lt;span style="color:#f92672">/&lt;/span> POWER(&lt;span style="color:#ae81ff">1024&lt;/span>, &lt;span style="color:#ae81ff">4&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> tb_scanned
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.query_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> query_tag &lt;span style="color:#66d9ef">IS&lt;/span> &lt;span style="color:#66d9ef">NOT&lt;/span> &lt;span style="color:#66d9ef">NULL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> TRY_PARSE_JSON(query_tag) &lt;span style="color:#66d9ef">IS&lt;/span> &lt;span style="color:#66d9ef">NOT&lt;/span> &lt;span style="color:#66d9ef">NULL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">7&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model_name, materialization
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tb_scanned &lt;span style="color:#66d9ef">DESC&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LIMIT&lt;/span> &lt;span style="color:#ae81ff">20&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is how you find the one dbt model that&amp;rsquo;s responsible for 40% of your bill. Without tags, it&amp;rsquo;s a guessing game.&lt;/p>
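&lt;p>Elapsed time and bytes scanned are proxies, not dollars. If you want something closer to spend, you can approximate credits from elapsed time and warehouse size. This is a rough sketch — it ignores idle time and queries sharing a warehouse concurrently, so use it to rank models, not to reconcile the invoice:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">SELECT
    PARSE_JSON(query_tag):dbt_model::STRING AS model_name,
    SUM(
        (total_elapsed_time / 1000 / 3600) *
        CASE warehouse_size      -- credits per hour by size
            WHEN &amp;#39;X-Small&amp;#39; THEN 1
            WHEN &amp;#39;Small&amp;#39;   THEN 2
            WHEN &amp;#39;Medium&amp;#39;  THEN 4
            WHEN &amp;#39;Large&amp;#39;   THEN 8
            WHEN &amp;#39;X-Large&amp;#39; THEN 16
            ELSE 1
        END
    ) AS approx_credits
FROM
    snowflake.account_usage.query_history
WHERE
    TRY_PARSE_JSON(query_tag) IS NOT NULL
    AND start_time &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -7, CURRENT_TIMESTAMP())
GROUP BY
    model_name
ORDER BY
    approx_credits DESC
LIMIT 20;
&lt;/code>&lt;/pre>&lt;/div>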
&lt;p>For non-dbt workloads — Airflow tasks, Lambda functions, BI tools — set query tags at the session level in your connection configuration. In Python:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> snowflake.connector
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> json
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conn &lt;span style="color:#f92672">=&lt;/span> snowflake&lt;span style="color:#f92672">.&lt;/span>connector&lt;span style="color:#f92672">.&lt;/span>connect(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> account&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your_account&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> user&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your_user&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> password&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your_password&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;transform_wh&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> session_parameters&lt;span style="color:#f92672">=&lt;/span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;QUERY_TAG&amp;#39;&lt;/span>: json&lt;span style="color:#f92672">.&lt;/span>dumps({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;pipeline&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;customer_ingestion&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;task&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;load_raw_customers&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;environment&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;production&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> })
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The goal is simple: &lt;strong>every query that runs on your platform should be attributable to a team, a pipeline, or a tool.&lt;/strong> Start with the big consumers and work outward.&lt;/p>
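&lt;p>To measure how close you are to that goal, rank the untagged spend directly — this tells you which connection to chase next:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Untagged activity by user over the last week
SELECT
    user_name,
    COUNT(*) AS query_count,
    SUM(total_elapsed_time) / 1000 AS total_seconds
FROM
    snowflake.account_usage.query_history
WHERE
    (query_tag IS NULL OR query_tag = &amp;#39;&amp;#39;)
    AND start_time &amp;gt;= DATEADD(&amp;#39;day&amp;#39;, -7, CURRENT_TIMESTAMP())
GROUP BY
    user_name
ORDER BY
    total_seconds DESC
LIMIT 20;
&lt;/code>&lt;/pre>&lt;/div>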
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="kill-the-zombies">Kill the zombies&lt;/h3>
&lt;br>
&lt;p>Every data platform accumulates dead weight. Tables nobody queries. Pipelines that run faithfully every morning, transforming data that no dashboard, no analyst, and no model has touched in months.&lt;/p>
&lt;p>These zombies cost you twice: once in compute (the pipeline that refreshes them) and once in storage (the data that sits there). Finding them is straightforward:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Tables with zero reads in the last 90 days
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.table_schema,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.&lt;span style="color:#66d9ef">table_name&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.&lt;span style="color:#66d9ef">row_count&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.bytes &lt;span style="color:#f92672">/&lt;/span> POWER(&lt;span style="color:#ae81ff">1024&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> size_gb,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.last_altered,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(ah.query_start_time) &lt;span style="color:#66d9ef">AS&lt;/span> last_queried
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.tables t
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">LEFT&lt;/span> &lt;span style="color:#66d9ef">JOIN&lt;/span> snowflake.account_usage.access_history ah
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">ON&lt;/span> ah.base_objects_accessed &lt;span style="color:#66d9ef">LIKE&lt;/span> &lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span> &lt;span style="color:#f92672">||&lt;/span> t.&lt;span style="color:#66d9ef">table_name&lt;/span> &lt;span style="color:#f92672">||&lt;/span> &lt;span style="color:#e6db74">&amp;#39;%&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> ah.query_start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">90&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.table_schema &lt;span style="color:#66d9ef">NOT&lt;/span> &lt;span style="color:#66d9ef">IN&lt;/span> (&lt;span style="color:#e6db74">&amp;#39;INFORMATION_SCHEMA&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> t.deleted &lt;span style="color:#66d9ef">IS&lt;/span> &lt;span style="color:#66d9ef">NULL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> t.&lt;span style="color:#66d9ef">row_count&lt;/span> &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> t.table_schema, t.&lt;span style="color:#66d9ef">table_name&lt;/span>, t.&lt;span style="color:#66d9ef">row_count&lt;/span>, t.bytes, t.last_altered
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">HAVING&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> last_queried &lt;span style="color:#66d9ef">IS&lt;/span> &lt;span style="color:#66d9ef">NULL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> size_gb &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>On the AWS side, check your S3 storage for orphaned data. Landing zones accumulate raw files that were loaded months ago and never cleaned up. A simple lifecycle policy handles this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-json" data-lang="json">&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Rules&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;ID&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Archive raw data after 90 days&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Enabled&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Filter&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Prefix&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;raw-landing/&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Transitions&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Days&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">90&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;StorageClass&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;GLACIER_INSTANT_RETRIEVAL&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Expiration&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;Days&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">365&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The cultural shift matters as much as the tooling. Make it a quarterly habit: pull the zombie table list, review it with the team, and deprecate what nobody uses. If someone screams, you can always restore from Time Travel. But in my experience, nobody screams. The data was already dead — you&amp;rsquo;re just acknowledging it.&lt;/p>
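&lt;p>The restore path is what makes the quarterly cull defensible. The table name below is hypothetical, but the pattern is always the same: drop, and if anyone does scream, undrop within your retention window:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Deprecate the zombie
DROP TABLE analytics.stale_dim_products;

-- Someone screamed? Bring it back from Time Travel
-- (default retention is 1 day; configurable up to 90 on Enterprise Edition)
UNDROP TABLE analytics.stale_dim_products;
&lt;/code>&lt;/pre>&lt;/div>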
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="cost-anomalies-will-find-you--or-you-can-find-them-first">Cost anomalies will find you — or you can find them first&lt;/h3>
&lt;br>
&lt;p>The scariest cost event isn&amp;rsquo;t the gradual creep. It&amp;rsquo;s the single bad query or misconfigured pipeline that doubles your weekly bill overnight. I&amp;rsquo;ve seen a single Snowflake query with a missing WHERE clause scan an entire 2TB table repeatedly inside a loop — burning through hundreds of credits in an hour.&lt;/p>
&lt;p>On Snowflake, set up resource monitors as a basic guardrail:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create a resource monitor with alerts and hard stop
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> RESOURCE MONITOR monthly_budget
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WITH&lt;/span> CREDIT_QUOTA &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">5000&lt;/span> &lt;span style="color:#75715e">-- adjust to your monthly budget
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> FREQUENCY &lt;span style="color:#f92672">=&lt;/span> MONTHLY
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> START_TIMESTAMP &lt;span style="color:#f92672">=&lt;/span> IMMEDIATELY
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TRIGGERS
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">ON&lt;/span> &lt;span style="color:#ae81ff">75&lt;/span> PERCENT &lt;span style="color:#66d9ef">DO&lt;/span> &lt;span style="color:#66d9ef">NOTIFY&lt;/span> &lt;span style="color:#75715e">-- email alert at 75%
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> &lt;span style="color:#66d9ef">ON&lt;/span> &lt;span style="color:#ae81ff">90&lt;/span> PERCENT &lt;span style="color:#66d9ef">DO&lt;/span> &lt;span style="color:#66d9ef">NOTIFY&lt;/span> &lt;span style="color:#75715e">-- email alert at 90%
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> &lt;span style="color:#66d9ef">ON&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span> PERCENT &lt;span style="color:#66d9ef">DO&lt;/span> SUSPEND; &lt;span style="color:#75715e">-- hard stop at 100%
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Apply it to a warehouse
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">ALTER&lt;/span> WAREHOUSE transform_wh
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SET&lt;/span> RESOURCE_MONITOR &lt;span style="color:#f92672">=&lt;/span> monthly_budget;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>On AWS, enable Cost Anomaly Detection in the AWS Cost Management console. It uses ML to detect unusual spending patterns and sends alerts via SNS. For Snowflake-specific monitoring, a lightweight approach is a scheduled task that checks daily credit consumption against a rolling average:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create a simple anomaly detection view
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">VIEW&lt;/span> monitoring.daily_cost_anomalies &lt;span style="color:#66d9ef">AS&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WITH&lt;/span> daily_usage &lt;span style="color:#66d9ef">AS&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DATE_TRUNC(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, start_time) &lt;span style="color:#66d9ef">AS&lt;/span> usage_date,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(credits_used) &lt;span style="color:#66d9ef">AS&lt;/span> daily_credits
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.warehouse_metering_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">60&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> usage_date, warehouse_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>averages &lt;span style="color:#66d9ef">AS&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AVG&lt;/span>(daily_credits) &lt;span style="color:#66d9ef">AS&lt;/span> avg_daily_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> STDDEV(daily_credits) &lt;span style="color:#66d9ef">AS&lt;/span> stddev_daily_credits
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> daily_usage
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> usage_date &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#66d9ef">CURRENT_DATE&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> du.usage_date,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> du.warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> du.daily_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> a.avg_daily_credits,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND((du.daily_credits &lt;span style="color:#f92672">-&lt;/span> a.avg_daily_credits) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">NULLIF&lt;/span>(a.stddev_daily_credits, &lt;span style="color:#ae81ff">0&lt;/span>), &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> z_score
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> daily_usage du
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">JOIN&lt;/span> averages a
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">ON&lt;/span> du.warehouse_name &lt;span style="color:#f92672">=&lt;/span> a.warehouse_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> du.usage_date &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">CURRENT_DATE&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> (du.daily_credits &lt;span style="color:#f92672">-&lt;/span> a.avg_daily_credits) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">NULLIF&lt;/span>(a.stddev_daily_credits, &lt;span style="color:#ae81ff">0&lt;/span>) &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> z_score &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A z-score above 2 means today&amp;rsquo;s spend is more than two standard deviations above the 60-day average. That&amp;rsquo;s worth investigating. Pipe this into a Teams or Slack alert via an AWS Lambda function and you&amp;rsquo;ve got same-day cost anomaly detection for the price of a few lines of SQL.&lt;/p>
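&lt;p>The Lambda alerting step can be sketched roughly as below. This is a sketch, not a production deployment: the webhook URL is a placeholder, and the anomaly rows are passed in via the event payload to keep the example self-contained — in practice you&amp;rsquo;d query &lt;code>monitoring.daily_cost_anomalies&lt;/code> from the handler, for example with the &lt;code>snowflake-connector-python&lt;/code> package.&lt;/p>

```python
import json
import urllib.request

# Placeholder -- substitute your own Slack or Teams incoming webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"


def format_alert(anomalies):
    """Build a Slack message payload from rows shaped like the output of
    monitoring.daily_cost_anomalies: warehouse_name, daily_credits,
    avg_daily_credits, z_score."""
    lines = ["Snowflake cost anomalies detected today:"]
    for row in anomalies:
        lines.append(
            "- {warehouse_name}: {daily_credits:.1f} credits "
            "(avg {avg_daily_credits:.1f}, z-score {z_score:.2f})".format(**row)
        )
    return {"text": "\n".join(lines)}


def lambda_handler(event, context):
    # In a real deployment, query the anomaly view here; in this sketch the
    # rows arrive in the event so the handler runs without a Snowflake session.
    anomalies = event.get("anomalies", [])
    if not anomalies:
        return {"statusCode": 200, "body": "no anomalies"}
    payload = json.dumps(format_alert(anomalies)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add retries in production
    return {"statusCode": 200, "body": f"alerted on {len(anomalies)} anomalies"}
```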
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="what-your-data-model-costs-you">What your data model costs you&lt;/h3>
&lt;br>
&lt;p>This one&amp;rsquo;s subtle but powerful. The way you model your data directly affects how much compute you burn on every query.&lt;/p>
&lt;p>In columnar warehouses like Snowflake, wide denormalised tables often query faster than star schemas with multiple JOINs: each JOIN has overhead, and Snowflake is optimised for scanning columns from flat structures. But wide tables cost more to &lt;em>maintain&lt;/em>, because updating a single dimension attribute means rewriting many rows in the fact table.&lt;/p>
&lt;p>The practical approach is a layered architecture:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Staging models&lt;/strong> (views or ephemeral): Minimal transformation, column selection, type casting. Cheap to run, no storage cost.&lt;/li>
&lt;li>&lt;strong>Intermediate models&lt;/strong> (tables or incremental): Business logic, deduplication, SCD handling. Materialised because they&amp;rsquo;re referenced by multiple downstream models.&lt;/li>
&lt;li>&lt;strong>Mart models&lt;/strong> (incremental or table): Wide, denormalised, optimised for consumer queries. These are what analysts and BI tools actually hit.&lt;/li>
&lt;/ul>
&lt;p>The cost trap is materialising too much too early. Every table materialisation means Snowflake stores and maintains that data. Every &lt;code>dbt run&lt;/code> rebuilds it. If an intermediate model is only referenced by one downstream model, make it ephemeral or a view — let Snowflake inline it at query time.&lt;/p>
&lt;p>Check your dbt DAG for this pattern: a staging model materialised as a table, referenced by a single intermediate model, which is also materialised as a table, referenced by a single mart. That&amp;rsquo;s three materialisations where one (the mart) would suffice. The staging and intermediate models can be views or ephemeral — the compute happens once when the mart builds, not three separate times.&lt;/p>
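&lt;p>In dbt, this layering can be declared once at the folder level rather than model by model. The snippet below is illustrative — the project and folder names are placeholders, so adjust them to your own layout:&lt;/p>

```yaml
# dbt_project.yml -- folder names are illustrative
models:
  my_project:
    staging:
      +materialized: view        # cheap: no storage, computed on demand
    intermediate:
      +materialized: ephemeral   # inlined as a CTE into downstream models
    marts:
      +materialized: incremental # consumer-facing; only new rows rebuilt
```

&lt;p>With this in place, the staging-to-mart chain compiles into a single query when the mart builds, and the compute happens once instead of three times.&lt;/p>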
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="coming-back-to-why-this-matters">Coming back to why this matters&lt;/h3>
&lt;br>
&lt;p>I started this article with a story about not being able to explain my own platform&amp;rsquo;s costs. That wasn&amp;rsquo;t a technical failure. It was a leadership failure. I&amp;rsquo;d built something valuable and then neglected the part that kept it funded.&lt;/p>
&lt;p>The data engineers I respect most aren&amp;rsquo;t the ones who build the most elegant pipelines. They&amp;rsquo;re the ones who can walk into a budget conversation and say: &amp;ldquo;Here&amp;rsquo;s what we spend. Here&amp;rsquo;s what we get for it. Here&amp;rsquo;s what I&amp;rsquo;d cut, and here&amp;rsquo;s what I&amp;rsquo;d invest more in.&amp;rdquo; That&amp;rsquo;s the kind of clarity that earns trust — and trust is what keeps data teams alive when the inevitable cost-cutting conversations happen.&lt;/p>
&lt;p>You don&amp;rsquo;t need a FinOps certification or an expensive monitoring tool to get there. You need the queries in this article, a quarterly habit of reviewing them, and the willingness to kill the zombies nobody wants to admit are dead.&lt;/p>
&lt;p>Start with the 15-minute audit. Tag your queries. Set up a resource monitor. Do those three things this week and you&amp;rsquo;ll know more about your platform&amp;rsquo;s economics than most data teams learn in a year.&lt;/p>
&lt;p>The cheapest query is the one you never need to run. But the most valuable skill is knowing which queries are worth paying for.&lt;/p>
&lt;p>&lt;br>&lt;br>&lt;/p></content:encoded><category>Data Engineering</category><category>Cloud Architecture</category><category>Snowflake</category><category>AWS</category><category>Cost Optimization</category><category>FinOps</category><category>dbt</category><category>Data Platform</category></item></channel></rss>