<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Freshness on Ghost in the data</title><link>https://ghostinthedata.info/tags/data-freshness/</link><description>Ghost in the data</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Ghost in the data</copyright><lastBuildDate>Sat, 18 Apr 2026 09:00:00 +1100</lastBuildDate><atom:link href="https://ghostinthedata.info/tags/data-freshness/index.xml" rel="self" type="application/rss+xml"/><item><title>Why Your Pipeline Finishes Later Every Month</title><link>https://ghostinthedata.info/posts/2026/2026-04-18-pipeline-optimization/</link><pubDate>Sat, 18 Apr 2026 09:00:00 +1100</pubDate><guid>https://ghostinthedata.info/posts/2026/2026-04-18-pipeline-optimization/</guid><author>Chris Hillman</author><description>A practical guide to diagnosing pipeline bottlenecks, fixing unnecessary dependencies, and getting data to consumers faster — with Snowflake and AWS patterns you can apply today.</description><content:encoded>&lt;p>Let me tell you about a graph that changed how I think about data engineering.&lt;/p>
&lt;p>A junior engineer on my team — let&amp;rsquo;s call her Priya — had been tracking something nobody asked her to track. Every morning for two months, she&amp;rsquo;d noted the timestamp when our main analytics pipeline completed. She wasn&amp;rsquo;t trying to make a point. She was just curious, because the finance team kept mentioning their dashboards weren&amp;rsquo;t ready when they arrived at 8 AM anymore.&lt;/p>
&lt;p>One afternoon she pulled me aside and showed me a scatter plot on her laptop. Pipeline completion time, plotted daily over those two months. The trend was unmistakable: a slow, steady drift to the right. What used to finish at 5:47 AM was now finishing at 7:23 AM. And the slope wasn&amp;rsquo;t flattening.&lt;/p>
&lt;p>&amp;ldquo;If this keeps going,&amp;rdquo; she said, &amp;ldquo;we&amp;rsquo;ll miss the 9 AM SLA in about six weeks.&amp;rdquo;&lt;/p>
&lt;p>She was right. And nobody else on the team — including me — had noticed. We were watching for failures. Green DAGs, clean logs, no alerts. But the pipeline wasn&amp;rsquo;t failing. It was &lt;em>slowing&lt;/em>. And slow is harder to see than broken, because slow doesn&amp;rsquo;t trigger an alert. Slow just quietly erodes trust until one day someone in finance builds their own spreadsheet and stops asking you for anything.&lt;/p>
&lt;br>
&lt;p>That&amp;rsquo;s the moment I understood that pipeline health isn&amp;rsquo;t about pass/fail. It&amp;rsquo;s about &lt;em>trajectory&lt;/em>. A pipeline that runs successfully but takes 5% longer every month is a ticking clock. And the people who notice first aren&amp;rsquo;t the engineers watching the DAG — they&amp;rsquo;re the consumers waiting for their data.&lt;/p>
&lt;p>I&amp;rsquo;m telling you this because pipeline optimisation sounds like a performance engineering problem, and it is. But underneath the technical work, it&amp;rsquo;s really about a commitment: the commitment to deliver data when you said you would, every single day. That&amp;rsquo;s what builds trust between data teams and the rest of the organisation. Not fancy architectures. Not real-time everything. Just showing up on time, reliably.&lt;/p>
&lt;p>This article is about diagnosing why your pipelines get slower, identifying the bottlenecks that actually matter, and fixing the patterns that cause data to arrive later than it should. Everything is grounded in Snowflake, AWS, Airflow, and dbt — with specific patterns you can apply immediately.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="measure-the-pipeline-in-stages-not-as-a-single-number">Measure the pipeline in stages, not as a single number&lt;/h3>
&lt;br>
&lt;p>When a pipeline is slow, the instinct is to look at the longest-running task. That&amp;rsquo;s often the wrong place to start.&lt;/p>
&lt;p>A pipeline is a chain of stages: extract, transform, load, test. The total runtime is the sum of all stages on the critical path — the longest chain of dependent tasks. But the &lt;em>bottleneck&lt;/em> might not be the longest task. It might be a 30-second task that blocks five parallel branches from starting.&lt;/p>
&lt;p>Before you optimise anything, instrument your pipeline to measure each stage independently. In Airflow, the task instance metadata already captures this.&lt;/p>
&lt;p>Two numbers matter here: &lt;code>duration_seconds&lt;/code> (how long the task actually ran) and &lt;code>queue_wait_seconds&lt;/code> (how long it waited before it could run). If queue wait is high, your problem isn&amp;rsquo;t the task — it&amp;rsquo;s resource contention. Too many tasks competing for too few Airflow worker slots, or too many queries competing for the same Snowflake warehouse.&lt;/p>
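&lt;p>To make that split concrete, here&amp;rsquo;s a minimal Python sketch. It assumes you can read Airflow&amp;rsquo;s metadata database directly; &lt;code>queued_dttm&lt;/code>, &lt;code>start_date&lt;/code> and &lt;code>end_date&lt;/code> are columns on the &lt;code>task_instance&lt;/code> table, and the example timestamps are made up:&lt;/p>

```python
from datetime import datetime

def task_timing(queued_dttm, start_date, end_date):
    """Split a task instance into queue wait vs. actual run time."""
    return {
        "queue_wait_seconds": (start_date - queued_dttm).total_seconds(),
        "duration_seconds": (end_date - start_date).total_seconds(),
    }

# Hypothetical task_instance row: queued at 4:00, started 4:07:30, done 4:09
row = task_timing(
    queued_dttm=datetime(2026, 4, 18, 4, 0, 0),
    start_date=datetime(2026, 4, 18, 4, 7, 30),
    end_date=datetime(2026, 4, 18, 4, 9, 0),
)
print(row)  # 450s queued vs 90s running: contention, not a slow task
```

&lt;p>A ratio like that, with the task queued several times longer than it actually ran, tells you to fix concurrency, not the query.&lt;/p>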
&lt;p>For dbt runs specifically, the &lt;code>run_results.json&lt;/code> that dbt generates after every invocation is a goldmine:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> json
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">with&lt;/span> open(&lt;span style="color:#e6db74">&amp;#39;target/run_results.json&amp;#39;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> results &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>load(f)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Find your slowest models&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>models &lt;span style="color:#f92672">=&lt;/span> sorted(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> [r &lt;span style="color:#66d9ef">for&lt;/span> r &lt;span style="color:#f92672">in&lt;/span> results[&lt;span style="color:#e6db74">&amp;#39;results&amp;#39;&lt;/span>] &lt;span style="color:#66d9ef">if&lt;/span> r[&lt;span style="color:#e6db74">&amp;#39;status&amp;#39;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#39;success&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> key&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">lambda&lt;/span> x: x[&lt;span style="color:#e6db74">&amp;#39;execution_time&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> reverse&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> m &lt;span style="color:#f92672">in&lt;/span> models[:&lt;span style="color:#ae81ff">10&lt;/span>]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>m[&lt;span style="color:#e6db74">&amp;#39;unique_id&amp;#39;&lt;/span>]&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">60s&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>m[&lt;span style="color:#e6db74">&amp;#39;execution_time&amp;#39;&lt;/span>]&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">8.1f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">s&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Run that after your next &lt;code>dbt build&lt;/code>. You&amp;rsquo;ll immediately see which models dominate your pipeline runtime. In my experience, 3–5 models account for 60–80% of total execution time. Those are the only models worth optimising.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="your-dag-probably-has-dependencies-it-doesnt-need">Your DAG probably has dependencies it doesn&amp;rsquo;t need&lt;/h3>
&lt;br>
&lt;p>This is the most common structural problem in data pipelines, and it comes in two forms. One is obvious once you know to look for it. The other is sneaky, because it starts as a completely reasonable engineering decision.&lt;/p>
&lt;p>&lt;strong>The first form: phantom dependencies.&lt;/strong>&lt;/p>
&lt;p>An engineer builds Model C and declares it depends on Model B — because Model B produces a table that Model C reads. Fair enough. But six months later, someone refactors Model C and it no longer reads from Model B. The &lt;code>ref()&lt;/code> gets removed from the SQL. But the dependency in the Airflow DAG? That stays. Or in dbt, someone adds a &lt;code>ref()&lt;/code> to a model they don&amp;rsquo;t actually need, just to &amp;ldquo;make sure it runs first.&amp;rdquo;&lt;/p>
&lt;p>The result is a DAG with phantom dependencies — tasks that wait for other tasks to complete even though they don&amp;rsquo;t use those tasks&amp;rsquo; outputs. Every phantom dependency adds serial wait time to your pipeline.&lt;/p>
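&lt;p>You can audit for phantom edges mechanically by comparing what each task is &lt;em>declared&lt;/em> to depend on against what it actually reads. A minimal Python sketch (the model names and table sets are hypothetical; in practice the declared edges come from your DAG or dbt manifest, and the reads from compiled SQL or query history):&lt;/p>

```python
def phantom_edges(declared_deps, produces, reads):
    """Declared edges where the downstream never reads the upstream's output."""
    phantoms = []
    for downstream, upstream_list in declared_deps.items():
        for upstream in upstream_list:
            # Phantom: nothing the upstream produces appears in the downstream's reads
            if not produces[upstream].intersection(reads[downstream]):
                phantoms.append((upstream, downstream))
    return phantoms

declared_deps = {"model_c": ["model_b"]}        # the DAG says C waits on B
produces = {"model_b": {"analytics.model_b"}}   # B writes this table
reads = {"model_c": {"analytics.stg_orders"}}   # but C never touches it
print(phantom_edges(declared_deps, produces, reads))  # [('model_b', 'model_c')]
```

&lt;p>Every edge this surfaces is a candidate for deletion and a chance to let two branches run in parallel.&lt;/p>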
&lt;p>&lt;strong>The second form: dependency monsters.&lt;/strong>&lt;/p>
&lt;p>This one is trickier, because it starts with good intentions.&lt;/p>
&lt;p>The customer team needs three new enrichment attributes on &lt;code>dim_customers&lt;/code> — regional segment codes sourced from a CRM export, tenure tier derived from a subscription history table, and a propensity score from a data science model. All reasonable requests. Each one approved and added without much ceremony.&lt;/p>
&lt;p>But &lt;code>dim_customers&lt;/code> is upstream of 40+ models across your DAG. Adding those three attributes means pulling in the CRM extract, joining the subscription history table — which has its own upstream dependencies — and waiting on the propensity score model to complete before any of those 40 downstream models can start. Your pipeline used to finish at 9 AM. Eighteen months and a dozen enrichment requests later, it finishes at 3 PM.&lt;/p>
&lt;p>Each addition was individually justified. Nobody modelled what they&amp;rsquo;d cost collectively. That&amp;rsquo;s how you build a dependency monster.&lt;/p>
&lt;p>The fix isn&amp;rsquo;t to refuse enrichment requests — it&amp;rsquo;s to stop embedding enrichment into your core spine model. Keep &lt;code>dim_customers&lt;/code> lean: identifiers, names, status, the attributes that nearly every consumer genuinely needs. Build a separate &lt;code>dim_customers_extended&lt;/code> model that joins in the expensive enrichment for the consumers who actually need it. Most of your downstream models will never touch the propensity score. There&amp;rsquo;s no reason to make them wait for it.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- dim_customers.sql: lean spine — runs fast, unblocks the DAG
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> customer_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> customer_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> email,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> status,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> created_at
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;stg_customers&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- dim_customers_extended.sql: enrichment for consumers who need it
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- runs independently, doesn&amp;#39;t block the main pipeline
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">c&lt;/span>.customer_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">c&lt;/span>.customer_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">c&lt;/span>.status,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> crm.regional_segment,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sub.tenure_tier,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ps.propensity_score
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;dim_customers&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span> &lt;span style="color:#66d9ef">c&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">LEFT&lt;/span> &lt;span style="color:#66d9ef">JOIN&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;stg_crm_segments&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span> crm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">on&lt;/span> crm.customer_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">c&lt;/span>.customer_id
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">LEFT&lt;/span> &lt;span style="color:#66d9ef">JOIN&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;int_subscription_tenure&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span> sub
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">on&lt;/span> sub.customer_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">c&lt;/span>.customer_id
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">LEFT&lt;/span> &lt;span style="color:#66d9ef">JOIN&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;ml_propensity_scores&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span> ps
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">on&lt;/span> ps.customer_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">c&lt;/span>.customer_id
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now &lt;code>dim_customers_extended&lt;/code> and its expensive upstream dependencies sit on their own branch of the DAG. Consumers who need the enrichment reference the extended model. The rest of the pipeline is unblocked.&lt;/p>
&lt;p>A useful governance rule of thumb before adding a field to a core model: if fewer than half your consumers will ever query that attribute, it probably doesn&amp;rsquo;t belong there. The team who needs it should own the enrichment themselves, as a downstream model they maintain.&lt;/p>
&lt;br>
&lt;p>&lt;strong>Auditing for both problems&lt;/strong>&lt;/p>
&lt;p>In dbt, the &lt;code>dbt_project_evaluator&lt;/code> package surfaces structural issues systematically. Install it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># packages.yml&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">packages&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#f92672">package&lt;/span>: &lt;span style="color:#ae81ff">dbt-labs/dbt_project_evaluator&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">version&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;gt;=0.8.0 &amp;lt;1.0.0&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then run:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>dbt build --select package:dbt_project_evaluator
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>It will flag models with fan-out issues (one model referenced by many downstream models), rejoined upstream concepts (a branch that splits and reconverges, adding avoidable wait states), and other structural smells. It won&amp;rsquo;t catch dependency monsters directly — those require a human asking &amp;ldquo;does every consumer of this model actually use every attribute in it?&amp;rdquo; — but it&amp;rsquo;s a good starting point for the structural audit.&lt;/p>
&lt;p>For Airflow DAGs, the visual DAG view tells you a lot at a glance. A healthy DAG looks like a tree with wide parallel branches. An unhealthy DAG looks like a chain — or worse, a funnel where everything converges through a single overloaded task before it can branch out again. That funnel shape is the visual signature of a dependency monster: many expensive upstreams flowing into one hub model, with dozens of downstream models queued behind it.&lt;/p>
&lt;p>If your DAG looks like a chain, ask this question for every dependency edge: &lt;strong>does downstream task B actually read data produced by upstream task A?&lt;/strong> If the answer is no, remove the dependency. Let them run in parallel.&lt;/p>
&lt;p>I once audited a DAG with 47 tasks running in a strict serial chain. After removing phantom dependencies and restructuring, 31 of those tasks could run in parallel. The pipeline went from 2 hours 15 minutes to 38 minutes. Same tasks, same compute, same data. Just fewer unnecessary wait states.&lt;/p>
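&lt;p>The restructuring behind that kind of speedup is easy to reason about with topological &amp;ldquo;waves&amp;rdquo;: group tasks so that each wave depends only on earlier waves, and everything within a wave can run in parallel. A small Python sketch with a hypothetical five-task DAG:&lt;/p>

```python
def waves(upstreams):
    """Group an acyclic DAG into waves; tasks within a wave can run in parallel."""
    remaining = dict(upstreams)
    done, schedule = set(), []
    while remaining:
        # A task is ready once all of its upstreams have completed
        ready = sorted(t for t, ups in remaining.items() if done.issuperset(ups))
        schedule.append(ready)
        done.update(ready)
        for t in ready:
            remaining.pop(t)
    return schedule

upstreams = {
    "extract": [],
    "load_a": ["extract"], "load_b": ["extract"], "load_c": ["extract"],
    "publish": ["load_a", "load_b", "load_c"],
}
print(waves(upstreams))
# [['extract'], ['load_a', 'load_b', 'load_c'], ['publish']]
```

&lt;p>The number of waves is the serial depth of your DAG; fewer waves means more parallelism and a shorter wall-clock runtime for the same compute.&lt;/p>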
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="the-critical-path-is-the-only-thing-worth-optimising">The critical path is the only thing worth optimising&lt;/h3>
&lt;br>
&lt;p>Here&amp;rsquo;s a concept from project management that applies directly to pipeline engineering: the &lt;strong>critical path&lt;/strong>.&lt;/p>
&lt;p>The critical path is the longest chain of dependent tasks through your DAG. It determines your pipeline&amp;rsquo;s minimum possible runtime. Every other path through the DAG has &amp;ldquo;float&amp;rdquo; — slack time where tasks can be delayed without affecting the overall completion time.&lt;/p>
&lt;p>This means: &lt;strong>optimising a task that isn&amp;rsquo;t on the critical path has zero impact on your pipeline&amp;rsquo;s end-to-end runtime.&lt;/strong> You could make a non-critical task 10x faster and your pipeline would finish at exactly the same time.&lt;/p>
&lt;p>Finding the critical path requires knowing each task&amp;rsquo;s duration and dependency structure. For a dbt project, you can approximate it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Using dbt&amp;#39;s run results stored in Snowflake (if you log them)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Or query your orchestrator&amp;#39;s task instance table
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">WITH&lt;/span> task_durations &lt;span style="color:#66d9ef">AS&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> task_id &lt;span style="color:#66d9ef">AS&lt;/span> model_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AVG&lt;/span>(&lt;span style="color:#66d9ef">EXTRACT&lt;/span>(EPOCH &lt;span style="color:#66d9ef">FROM&lt;/span> (end_date &lt;span style="color:#f92672">-&lt;/span> start_date))) &lt;span style="color:#66d9ef">AS&lt;/span> avg_duration
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> task_instance
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHERE&lt;/span> dag_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;dbt_daily&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> &lt;span style="color:#66d9ef">state&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;success&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> execution_date &lt;span style="color:#f92672">&amp;gt;=&lt;/span> &lt;span style="color:#66d9ef">CURRENT_DATE&lt;/span> &lt;span style="color:#f92672">-&lt;/span> INTERVAL &lt;span style="color:#e6db74">&amp;#39;7 days&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> task_id
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND(avg_duration, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> avg_seconds,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ROUND(avg_duration &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">60&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> avg_minutes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> task_durations
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> avg_duration &lt;span style="color:#66d9ef">DESC&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LIMIT&lt;/span> &lt;span style="color:#ae81ff">20&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The longest tasks are &lt;em>candidates&lt;/em> for the critical path, but only if they&amp;rsquo;re on the dependency chain that determines total runtime. A 20-minute model that runs in parallel with a 45-minute model isn&amp;rsquo;t the bottleneck — the 45-minute model is.&lt;/p>
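&lt;p>Combining those per-task durations with the dependency graph gives you the critical path itself. A minimal Python sketch (hypothetical model names, durations in minutes) using a memoised longest-path walk:&lt;/p>

```python
from functools import lru_cache

# Hypothetical DAG: each task's direct upstreams, and durations in minutes
upstreams = {
    "stg_orders": [], "stg_customers": [],
    "dim_customers": ["stg_customers"],
    "fct_orders": ["stg_orders", "dim_customers"],
    "mart_revenue": ["fct_orders"],
}
duration = {"stg_orders": 20, "stg_customers": 5, "dim_customers": 10,
            "fct_orders": 45, "mart_revenue": 15}

@lru_cache(maxsize=None)
def finish_time(task):
    """Earliest completion: the longest upstream chain plus own duration."""
    return max((finish_time(u) for u in upstreams[task]), default=0) + duration[task]

# The critical path ends at the task with the latest finish time;
# walk back through the slowest upstream at each step to recover it
node = max(upstreams, key=finish_time)
path = [node]
while upstreams[node]:
    node = max(upstreams[node], key=finish_time)
    path.append(node)
print(list(reversed(path)), finish_time(path[0]))
# ['stg_orders', 'fct_orders', 'mart_revenue'] 80
```

&lt;p>Note that &lt;code>dim_customers&lt;/code> plus its staging model takes 15 minutes but isn&amp;rsquo;t on the path: it runs in parallel with the 20-minute &lt;code>stg_orders&lt;/code>, so speeding it up wouldn&amp;rsquo;t move the finish time at all.&lt;/p>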
&lt;p>Once you&amp;rsquo;ve identified the critical path, you have three options for shortening it: make the slow tasks faster (query optimisation, incremental models), reduce the number of tasks on the path (remove unnecessary dependencies), or parallelise sequential tasks (split a monolithic model into independent pieces).&lt;/p>
&lt;hr>
&lt;h3 id="shifting-right-diagnose-it-before-it-breaks-your-sla">Shifting right: diagnose it before it breaks your SLA&lt;/h3>
&lt;br>
&lt;p>Remember Priya&amp;rsquo;s scatter plot? That pattern — pipeline completion drifting later and later — has a name: &lt;strong>shifting right&lt;/strong>. And it has predictable causes.&lt;/p>
&lt;p>Track it with a simple query against your orchestrator&amp;rsquo;s metadata:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Airflow: track pipeline completion time drift
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> execution_date::DATE &lt;span style="color:#66d9ef">AS&lt;/span> run_date,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(end_date)::TIME &lt;span style="color:#66d9ef">AS&lt;/span> completion_time,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">EXTRACT&lt;/span>(EPOCH &lt;span style="color:#66d9ef">FROM&lt;/span> (&lt;span style="color:#66d9ef">MAX&lt;/span>(end_date) &lt;span style="color:#f92672">-&lt;/span> &lt;span style="color:#66d9ef">MIN&lt;/span>(start_date))) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">60&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> total_runtime_minutes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> task_instance
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dag_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;daily_analytics&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> &lt;span style="color:#66d9ef">state&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;success&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> execution_date &lt;span style="color:#f92672">&amp;gt;=&lt;/span> &lt;span style="color:#66d9ef">CURRENT_DATE&lt;/span> &lt;span style="color:#f92672">-&lt;/span> INTERVAL &lt;span style="color:#e6db74">&amp;#39;60 days&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> run_date
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> run_date;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Plot &lt;code>completion_time&lt;/code> over &lt;code>run_date&lt;/code>. If it trends upward, you&amp;rsquo;re shifting right. The four root causes, in order of how often I see them:&lt;/p>
&lt;p>&lt;strong>1. Data volume growth.&lt;/strong> Your table had 10 million rows when you built the model. Now it has 200 million. That JOIN that took 8 seconds now takes 3 minutes. This is the most common cause and the easiest to fix — switch to incremental materialisation and the growth stops mattering.&lt;/p>
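&lt;p>The switch is usually a few lines of config. A sketch of an incremental dbt model (the model and column names are hypothetical) where each run processes only rows newer than what&amp;rsquo;s already loaded:&lt;/p>

```sql
-- fct_events.sql: hypothetical incremental model
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
    event_id,
    customer_id,
    event_ts,
    event_type
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
-- On incremental runs, scan only rows newer than the current high-water mark
WHERE event_ts > (SELECT MAX(event_ts) FROM {{ this }})
{% endif %}
```

&lt;p>The first run builds the table in full; every run after that touches only the new slice, so runtime tracks daily volume instead of total table size.&lt;/p>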
&lt;p>&lt;strong>2. Resource contention.&lt;/strong> You&amp;rsquo;ve added more pipelines and they all run in the same window, competing for the same Snowflake warehouse. In Snowflake, you can see this directly:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Find warehouse queueing (queries waiting for compute)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DATE_TRUNC(&lt;span style="color:#e6db74">&amp;#39;hour&amp;#39;&lt;/span>, start_time) &lt;span style="color:#66d9ef">AS&lt;/span> hour,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#f92672">*&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> total_queries,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SUM&lt;/span>(&lt;span style="color:#66d9ef">CASE&lt;/span> &lt;span style="color:#66d9ef">WHEN&lt;/span> queued_overload_time &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">THEN&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#66d9ef">ELSE&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">END&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> queued_queries,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AVG&lt;/span>(queued_overload_time) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> avg_queue_seconds
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.query_history
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_time &lt;span style="color:#f92672">&amp;gt;=&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">7&lt;/span>, &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> warehouse_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;TRANSFORM_WH&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> warehouse_name, hour
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">HAVING&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> queued_queries &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> hour;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If &lt;code>queued_queries&lt;/code> is non-zero during your pipeline window, queries are waiting for warehouse compute. Either scale up the warehouse during that window, split workloads across dedicated warehouses, or stagger pipeline start times to reduce concurrency.&lt;/p>
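&lt;p>If you pull &lt;code>QUERY_HISTORY&lt;/code> rows into a notebook, the same check can be sketched in plain Python. The field names below mirror the view; the sample rows are made up for illustration:&lt;/p>

```python
# Sketch: the queueing check from the SQL above, run in Python over exported
# QUERY_HISTORY rows. Field names mirror the view; the sample rows are made up.
from collections import defaultdict

def queueing_by_hour(rows):
    """Per hour: how many queries queued, and the mean queue time in seconds."""
    stats = defaultdict(lambda: {"queued_queries": 0, "total_queue_ms": 0, "n": 0})
    for r in rows:
        s = stats[r["hour"]]
        s["n"] += 1
        s["total_queue_ms"] += r["queued_overload_time"]
        if r["queued_overload_time"] > 0:
            s["queued_queries"] += 1
    # Keep only hours with queueing, mirroring the HAVING clause.
    return {
        hour: {
            "queued_queries": s["queued_queries"],
            "avg_queue_seconds": s["total_queue_ms"] / s["n"] / 1000,
        }
        for hour, s in stats.items()
        if s["queued_queries"] > 0
    }

rows = [
    {"hour": 4, "queued_overload_time": 0},
    {"hour": 5, "queued_overload_time": 12000},  # milliseconds, as in the view
    {"hour": 5, "queued_overload_time": 8000},
]
print(queueing_by_hour(rows))  # only hour 5 shows up
```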
&lt;p>&lt;strong>3. Dependency chain lengthening.&lt;/strong> Someone added a new intermediate model between your staging and mart layers. That model takes 4 minutes. But because it&amp;rsquo;s on the critical path, the entire pipeline now finishes 4 minutes later. This is death by a thousand cuts — each addition is small, but they accumulate.&lt;/p>
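&lt;p>The cumulative effect is easy to quantify: pipeline finish time is the longest path through the dependency graph, so any runtime added on that path delays completion by its full duration. A minimal sketch, with made-up model names and runtimes in minutes:&lt;/p>

```python
# Sketch: pipeline completion as the longest path through the model DAG.
# Model names, dependencies, and runtimes (minutes) are made up for illustration.
durations = {"stg_orders": 10, "int_orders_enriched": 4, "fct_orders": 20}
upstream = {"int_orders_enriched": ["stg_orders"], "fct_orders": ["int_orders_enriched"]}

def critical_path(model):
    """Earliest finish time: own runtime plus the slowest upstream chain."""
    deps = upstream.get(model, [])
    return durations[model] + (max(critical_path(d) for d in deps) if deps else 0)

# Before the intermediate model the chain was stg -> fct: 30 minutes.
# On the critical path, the 4-minute model adds its full runtime.
print(critical_path("fct_orders"))  # 34
```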
&lt;p>&lt;strong>4. Upstream source delays.&lt;/strong> Your pipeline starts at 4 AM because the source system&amp;rsquo;s extract used to land in S3 by 3:45 AM. But the source system has grown too, and now the extract doesn&amp;rsquo;t land until 4:30 AM. Your pipeline sensors wait, and everything shifts right.&lt;/p>
&lt;p>For upstream delays in an AWS environment, replace time-based scheduling with event-driven triggers. Instead of scheduling your Airflow DAG at 4 AM and hoping the data is there, trigger it when the data actually arrives:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Airflow DAG: trigger on S3 file landing using AWS sensor&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> airflow.providers.amazon.aws.sensors.s3 &lt;span style="color:#f92672">import&lt;/span> S3KeySensor
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> airflow &lt;span style="color:#f92672">import&lt;/span> DAG
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> datetime &lt;span style="color:#f92672">import&lt;/span> datetime
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">with&lt;/span> DAG(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;daily_analytics&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> start_date&lt;span style="color:#f92672">=&lt;/span>datetime(&lt;span style="color:#ae81ff">2026&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> schedule_interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">None&lt;/span>, &lt;span style="color:#75715e"># triggered externally, not on a schedule&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> catchup&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">False&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>) &lt;span style="color:#66d9ef">as&lt;/span> dag:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wait_for_source &lt;span style="color:#f92672">=&lt;/span> S3KeySensor(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> task_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;wait_for_orders_extract&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> bucket_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;your-data-lake&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> bucket_key&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;raw/orders/dt={{ ds }}/orders.parquet&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> aws_conn_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;aws_default&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> mode&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;reschedule&amp;#39;&lt;/span>, &lt;span style="color:#75715e"># frees the worker slot while waiting&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> poke_interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">300&lt;/span>, &lt;span style="color:#75715e"># check every 5 minutes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> timeout&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">7200&lt;/span>, &lt;span style="color:#75715e"># fail after 2 hours if file never arrives&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>mode='reschedule'&lt;/code> is critical. Without it, the sensor occupies a worker slot for the entire time it&amp;rsquo;s waiting. With &lt;code>reschedule&lt;/code>, it checks, releases the slot, and checks again later. This prevents the classic deadlock where all your worker slots are consumed by sensors and no actual work can run.&lt;/p>
&lt;p>Even better: use S3 event notifications to trigger a Lambda function that kicks off the DAG via Airflow&amp;rsquo;s REST API. Zero polling, zero wasted slots:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Lambda function triggered by S3 PutObject event&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> boto3
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> requests
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> os
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">handler&lt;/span>(event, context):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> bucket &lt;span style="color:#f92672">=&lt;/span> event[&lt;span style="color:#e6db74">&amp;#39;Records&amp;#39;&lt;/span>][&lt;span style="color:#ae81ff">0&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;s3&amp;#39;&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;bucket&amp;#39;&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;name&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> key &lt;span style="color:#f92672">=&lt;/span> event[&lt;span style="color:#e6db74">&amp;#39;Records&amp;#39;&lt;/span>][&lt;span style="color:#ae81ff">0&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;s3&amp;#39;&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;object&amp;#39;&lt;/span>][&lt;span style="color:#e6db74">&amp;#39;key&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Trigger Airflow DAG via REST API&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> airflow_url &lt;span style="color:#f92672">=&lt;/span> os&lt;span style="color:#f92672">.&lt;/span>environ[&lt;span style="color:#e6db74">&amp;#39;AIRFLOW_API_URL&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> response &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>airflow_url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/api/v1/dags/daily_analytics/dagRuns&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> json&lt;span style="color:#f92672">=&lt;/span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;conf&amp;#34;&lt;/span>: {&lt;span style="color:#e6db74">&amp;#34;source_bucket&amp;#34;&lt;/span>: bucket, &lt;span style="color:#e6db74">&amp;#34;source_key&amp;#34;&lt;/span>: key}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> auth&lt;span style="color:#f92672">=&lt;/span>(os&lt;span style="color:#f92672">.&lt;/span>environ[&lt;span style="color:#e6db74">&amp;#39;AIRFLOW_USER&amp;#39;&lt;/span>], os&lt;span style="color:#f92672">.&lt;/span>environ[&lt;span style="color:#e6db74">&amp;#39;AIRFLOW_PASSWORD&amp;#39;&lt;/span>]),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> headers&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;Content-Type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;statusCode&amp;#34;&lt;/span>: response&lt;span style="color:#f92672">.&lt;/span>status_code}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This eliminates the &amp;ldquo;safety margin&amp;rdquo; scheduling pattern entirely. Your pipeline runs as soon as the data is available — not 30 minutes after you &lt;em>hope&lt;/em> it will be available.&lt;/p>
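&lt;p>The freshness gain is simple arithmetic: under a fixed schedule, data sits idle from the moment it lands until the cron fires. A small sketch with illustrative timestamps:&lt;/p>

```python
# Sketch: how much freshness event-driven triggering buys over a cron with a
# safety margin. Timestamps are illustrative.
from datetime import datetime, timedelta

data_ready = datetime(2026, 4, 18, 3, 52)       # extract lands in S3
scheduled_start = datetime(2026, 4, 18, 4, 30)  # cron time, padded "to be safe"

# Time-based schedule: the data sits idle until the cron fires.
idle = max(scheduled_start - data_ready, timedelta(0))
print(f"freshness gained by triggering on arrival: {idle.total_seconds() / 60:.0f} min")  # 38 min
```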
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="partition-and-cluster-to-eliminate-full-table-scans">Partition and cluster to eliminate full table scans&lt;/h3>
&lt;br>
&lt;p>The single most effective performance optimisation in Snowflake is making sure your queries only read the data they need. Snowflake&amp;rsquo;s micro-partition pruning does this automatically — but only if your data is physically organised in a way that aligns with your query patterns.&lt;/p>
&lt;p>Clustering keys tell Snowflake how to organise data within micro-partitions. If you consistently filter on &lt;code>order_date&lt;/code>, clustering on that column means queries with a &lt;code>WHERE order_date = '2026-03-01'&lt;/code> clause scan a tiny fraction of the table instead of the whole thing.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Add clustering to a large fact table
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">ALTER&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> analytics.fct_orders
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CLUSTER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> (order_date, customer_segment);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Choose clustering keys based on how the table is actually queried, not how it&amp;rsquo;s loaded. The best candidates are columns that appear in WHERE clauses, JOIN conditions, and range filters. Limit yourself to 2–3 keys — more than that and Snowflake can&amp;rsquo;t maintain effective clustering.&lt;/p>
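&lt;p>The pruning mechanism itself is easy to picture: Snowflake keeps min/max metadata per micro-partition and skips any partition whose range cannot contain the filter value. A toy sketch with illustrative partition metadata:&lt;/p>

```python
# Sketch: min/max metadata per micro-partition lets a date filter skip
# partitions entirely. Partition contents are illustrative.
partitions = [
    {"id": 1, "min_date": "2026-01-01", "max_date": "2026-01-31"},
    {"id": 2, "min_date": "2026-02-01", "max_date": "2026-02-28"},
    {"id": 3, "min_date": "2026-03-01", "max_date": "2026-03-31"},
]

def partitions_to_scan(target_date):
    """Keep only partitions whose [min, max] range could contain target_date."""
    return [p["id"] for p in partitions
            if p["min_date"] <= target_date <= p["max_date"]]

print(partitions_to_scan("2026-03-01"))  # [3]; two of three partitions pruned
```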
&lt;p>Check whether your existing tables benefit from clustering:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Check clustering depth (lower is better, 0 is perfect)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">table_name&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> clustering_key,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_constant_partition_count,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_partition_count,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> average_overlaps,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> average_depth
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.table_storage_metrics
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> table_catalog &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;YOUR_DATABASE&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> clustering_key &lt;span style="color:#66d9ef">IS&lt;/span> &lt;span style="color:#66d9ef">NOT&lt;/span> &lt;span style="color:#66d9ef">NULL&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> active_bytes &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> average_depth &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An &lt;code>average_depth&lt;/code> above 5 means your clustering isn&amp;rsquo;t effective — queries are still scanning more partitions than they should. Either the clustering key doesn&amp;rsquo;t match query patterns, or the table has had so many small writes that the clustering has degraded. Enabling automatic clustering (which costs credits as Snowflake reclusters in the background) addresses the latter; manual &lt;code>ALTER TABLE ... RECLUSTER&lt;/code> is deprecated in Snowflake.&lt;/p>
&lt;p>On the AWS side, if you&amp;rsquo;re using S3 as a data lake with Parquet or Iceberg, the equivalent is &lt;strong>partition layout&lt;/strong>. Partition your S3 data by the columns you filter most:&lt;/p>



&lt;div class="goat svg-container ">
 
 &lt;svg
 xmlns="http://www.w3.org/2000/svg"
 font-family="Menlo,Lucida Console,monospace"
 
 viewBox="0 0 312 121"
 >
 &lt;g transform='translate(8,16)'>
&lt;polygon points='176.000000,80.000000 164.000000,74.400002 164.000000,85.599998' fill='currentColor' transform='rotate(90.000000, 168.000000, 80.000000)'>&lt;/polygon>
&lt;text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>s&lt;/text>
&lt;text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>3&lt;/text>
&lt;text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>:&lt;/text>
&lt;text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='32' y='20' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>y&lt;/text>
&lt;text text-anchor='middle' x='40' y='20' fill='currentColor' style='font-size:1em'>v&lt;/text>
&lt;text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>o&lt;/text>
&lt;text text-anchor='middle' x='48' y='20' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>u&lt;/text>
&lt;text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>n&lt;/text>
&lt;text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>r&lt;/text>
&lt;text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>y&lt;/text>
&lt;text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>-&lt;/text>
&lt;text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>s&lt;/text>
&lt;text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>d&lt;/text>
&lt;text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>r&lt;/text>
&lt;text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>=&lt;/text>
&lt;text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>m&lt;/text>
&lt;text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='104' y='36' fill='currentColor' style='font-size:1em'>2&lt;/text>
&lt;text text-anchor='middle' x='104' y='52' fill='currentColor' style='font-size:1em'>o&lt;/text>
&lt;text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>-&lt;/text>
&lt;text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>n&lt;/text>
&lt;text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>l&lt;/text>
&lt;text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>2&lt;/text>
&lt;text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>6&lt;/text>
&lt;text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>h&lt;/text>
&lt;text text-anchor='middle' x='128' y='68' fill='currentColor' style='font-size:1em'>d&lt;/text>
&lt;text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>k&lt;/text>
&lt;text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='136' y='52' fill='currentColor' style='font-size:1em'>=&lt;/text>
&lt;text text-anchor='middle' x='136' y='68' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='144' y='68' fill='currentColor' style='font-size:1em'>y&lt;/text>
&lt;text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>3&lt;/text>
&lt;text text-anchor='middle' x='152' y='68' fill='currentColor' style='font-size:1em'>=&lt;/text>
&lt;text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='160' y='68' fill='currentColor' style='font-size:1em'>1&lt;/text>
&lt;text text-anchor='middle' x='160' y='84' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='160' y='100' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='168' y='68' fill='currentColor' style='font-size:1em'>5&lt;/text>
&lt;text text-anchor='middle' x='168' y='100' fill='currentColor' style='font-size:1em'>v&lt;/text>
&lt;text text-anchor='middle' x='176' y='68' fill='currentColor' style='font-size:1em'>/&lt;/text>
&lt;text text-anchor='middle' x='176' y='84' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='176' y='100' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='184' y='84' fill='currentColor' style='font-size:1em'>n&lt;/text>
&lt;text text-anchor='middle' x='184' y='100' fill='currentColor' style='font-size:1em'>n&lt;/text>
&lt;text text-anchor='middle' x='192' y='84' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='192' y='100' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='200' y='84' fill='currentColor' style='font-size:1em'>s&lt;/text>
&lt;text text-anchor='middle' x='200' y='100' fill='currentColor' style='font-size:1em'>s&lt;/text>
&lt;text text-anchor='middle' x='208' y='84' fill='currentColor' style='font-size:1em'>_&lt;/text>
&lt;text text-anchor='middle' x='208' y='100' fill='currentColor' style='font-size:1em'>_&lt;/text>
&lt;text text-anchor='middle' x='216' y='84' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='216' y='100' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='224' y='100' fill='currentColor' style='font-size:1em'>0&lt;/text>
&lt;text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>1&lt;/text>
&lt;text text-anchor='middle' x='232' y='100' fill='currentColor' style='font-size:1em'>2&lt;/text>
&lt;text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>.&lt;/text>
&lt;text text-anchor='middle' x='240' y='100' fill='currentColor' style='font-size:1em'>.&lt;/text>
&lt;text text-anchor='middle' x='248' y='84' fill='currentColor' style='font-size:1em'>p&lt;/text>
&lt;text text-anchor='middle' x='248' y='100' fill='currentColor' style='font-size:1em'>p&lt;/text>
&lt;text text-anchor='middle' x='256' y='84' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='256' y='100' fill='currentColor' style='font-size:1em'>a&lt;/text>
&lt;text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>r&lt;/text>
&lt;text text-anchor='middle' x='264' y='100' fill='currentColor' style='font-size:1em'>r&lt;/text>
&lt;text text-anchor='middle' x='272' y='84' fill='currentColor' style='font-size:1em'>q&lt;/text>
&lt;text text-anchor='middle' x='272' y='100' fill='currentColor' style='font-size:1em'>q&lt;/text>
&lt;text text-anchor='middle' x='280' y='84' fill='currentColor' style='font-size:1em'>u&lt;/text>
&lt;text text-anchor='middle' x='280' y='100' fill='currentColor' style='font-size:1em'>u&lt;/text>
&lt;text text-anchor='middle' x='288' y='84' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='288' y='100' fill='currentColor' style='font-size:1em'>e&lt;/text>
&lt;text text-anchor='middle' x='296' y='84' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;text text-anchor='middle' x='296' y='100' fill='currentColor' style='font-size:1em'>t&lt;/text>
&lt;/g>

 &lt;/svg>
 
&lt;/div>
&lt;p>When Snowflake external tables or AWS Athena query this structure with a date filter, they skip entire directories. A query for March 15th reads two files instead of scanning the entire &lt;code>events/&lt;/code> prefix. The savings compound with data volume — at a billion rows, proper partitioning can reduce query times from minutes to seconds.&lt;/p>
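&lt;p>The skipping works because a partition filter becomes a key prefix, so the engine only lists objects under that prefix. A toy sketch with illustrative object keys:&lt;/p>

```python
# Sketch: a directory-style partition filter becomes a key prefix, so only
# matching objects are ever listed. Keys are illustrative.
keys = [
    "events/year=2026/month=03/day=14/events_001.parquet",
    "events/year=2026/month=03/day=15/events_001.parquet",
    "events/year=2026/month=03/day=15/events_002.parquet",
]

def files_for_day(year, month, day):
    """Objects a date-filtered query has to read under Hive-style partitioning."""
    prefix = f"events/year={year}/month={month:02d}/day={day:02d}/"
    return [k for k in keys if k.startswith(prefix)]

print(files_for_day(2026, 3, 15))  # two files; day=14 is never touched
```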
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="make-every-task-idempotent-or-debugging-becomes-a-nightmare">Make every task idempotent or debugging becomes a nightmare&lt;/h3>
&lt;br>
&lt;p>Here&amp;rsquo;s a pattern I see constantly: a pipeline fails midway through, the engineer reruns it, and now there are duplicate rows in the target table. Because the first run inserted half the data before failing, and the rerun inserted &lt;em>all&lt;/em> the data — including the half that already existed.&lt;/p>
&lt;p>Idempotency means a task produces the same result whether it runs once or ten times. This isn&amp;rsquo;t just a nice-to-have — it&amp;rsquo;s the foundation that makes everything else in pipeline engineering possible. Without it, you can&amp;rsquo;t safely retry. You can&amp;rsquo;t backfill. You can&amp;rsquo;t debug.&lt;/p>
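&lt;p>The property is easy to demonstrate outside any warehouse: an upsert keyed on a unique ID converges to the same state no matter how many times it runs. A minimal sketch with made-up events:&lt;/p>

```python
# Sketch: an upsert keyed on event_id is idempotent; a blind append is not.
# Events are made up for illustration.
target = {}

def load(events):
    for e in events:
        target[e["event_id"]] = e  # rerun overwrites the same key, never duplicates

batch = [{"event_id": "a1", "value": 10}, {"event_id": "a2", "value": 20}]
load(batch)
load(batch)          # retry after a partial failure
print(len(target))   # 2 rows, not 4
```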
&lt;p>In dbt, incremental models with a &lt;code>unique_key&lt;/code> are idempotent by default — the MERGE statement handles duplicates:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> config(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> materialized&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;incremental&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> unique_key&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;event_id&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> incremental_strategy&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;merge&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> event_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> user_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> event_type,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> event_timestamp,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> properties
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;stg_events&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> is_incremental() &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> event_timestamp &lt;span style="color:#f92672">&amp;gt;=&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> DATEADD(&lt;span style="color:#e6db74">&amp;#39;day&amp;#39;&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#66d9ef">MAX&lt;/span>(event_timestamp))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> this &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#960050;background-color:#1e0010">{&lt;/span>&lt;span style="color:#f92672">%&lt;/span> endif &lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#960050;background-color:#1e0010">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For non-dbt loads — say, a Lambda function loading data from an API into Snowflake — use MERGE instead of INSERT:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>MERGE &lt;span style="color:#66d9ef">INTO&lt;/span> raw.api_customers &lt;span style="color:#66d9ef">AS&lt;/span> target
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">USING&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#960050;background-color:#1e0010">$&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>:id::STRING &lt;span style="color:#66d9ef">AS&lt;/span> customer_id,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">$&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>:name::STRING &lt;span style="color:#66d9ef">AS&lt;/span> customer_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">$&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>:email::STRING &lt;span style="color:#66d9ef">AS&lt;/span> email,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">$&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>:updated_at::TIMESTAMP_NTZ &lt;span style="color:#66d9ef">AS&lt;/span> updated_at
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#f92672">@&lt;/span>raw.s3_stage&lt;span style="color:#f92672">/&lt;/span>customers&lt;span style="color:#f92672">/&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> (FILE_FORMAT &lt;span style="color:#f92672">=&amp;gt;&lt;/span> &lt;span style="color:#e6db74">&amp;#39;json_format&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>) &lt;span style="color:#66d9ef">AS&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ON&lt;/span> target.customer_id &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>.customer_id
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHEN&lt;/span> MATCHED &lt;span style="color:#66d9ef">AND&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>.updated_at &lt;span style="color:#f92672">&amp;gt;&lt;/span> target.updated_at &lt;span style="color:#66d9ef">THEN&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">UPDATE&lt;/span> &lt;span style="color:#66d9ef">SET&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> customer_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>.customer_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> email &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>.email,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> updated_at &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">source&lt;/span>.updated_at
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">WHEN&lt;/span> &lt;span style="color:#66d9ef">NOT&lt;/span> MATCHED &lt;span style="color:#66d9ef">THEN&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">INSERT&lt;/span> (customer_id, customer_name, email, updated_at)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">VALUES&lt;/span> (&lt;span style="color:#66d9ef">source&lt;/span>.customer_id, &lt;span style="color:#66d9ef">source&lt;/span>.customer_name, &lt;span style="color:#66d9ef">source&lt;/span>.email, &lt;span style="color:#66d9ef">source&lt;/span>.updated_at);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>WHEN MATCHED AND source.updated_at &amp;gt; target.updated_at&lt;/code> condition prevents overwriting newer data with older data during reruns. This matters when you&amp;rsquo;re replaying historical loads or running overlapping backfills.&lt;/p>
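The guard is easiest to see outside SQL. Here is a minimal Python sketch of the same timestamp-guarded upsert, with entirely hypothetical rows, showing why a rerun carrying stale data cannot clobber a newer record:

```python
# Minimal sketch of the timestamp-guarded upsert: a rerun with older
# data must not overwrite a newer row. All rows here are hypothetical.
def merge_row(target: dict, source_row: dict) -> None:
    key = source_row["customer_id"]
    existing = target.get(key)
    # Mirrors WHEN MATCHED AND source.updated_at > target.updated_at
    if existing is None or source_row["updated_at"] > existing["updated_at"]:
        target[key] = source_row

target = {"c1": {"customer_id": "c1", "email": "new@x.com", "updated_at": 200}}

# Replaying an older batch: the stale row is ignored
merge_row(target, {"customer_id": "c1", "email": "old@x.com", "updated_at": 100})
assert target["c1"]["email"] == "new@x.com"

# A genuinely newer row still wins
merge_row(target, {"customer_id": "c1", "email": "newer@x.com", "updated_at": 300})
```

Run the same batch twice and the second pass is a no-op, which is exactly the property you want from a backfill.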
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="most-we-need-real-time-requests-actually-need-we-need-faster-batch">Most &amp;ldquo;we need real-time&amp;rdquo; requests actually need &amp;ldquo;we need faster batch&amp;rdquo;&lt;/h3>
&lt;br>
&lt;p>Before you reach for Kafka, Kinesis, or any streaming infrastructure, have this conversation with your stakeholders: &amp;ldquo;When you say real-time, what do you actually mean?&amp;rdquo;&lt;/p>
&lt;p>In my experience, the answers fall into three categories:&lt;/p>
&lt;p>&lt;strong>&amp;ldquo;I want data from today, not yesterday.&amp;rdquo;&lt;/strong> That&amp;rsquo;s daily batch with a morning refresh. You already have this. Maybe you need to move the refresh earlier.&lt;/p>
&lt;p>&lt;strong>&amp;ldquo;I want data that&amp;rsquo;s at most an hour old.&amp;rdquo;&lt;/strong> That&amp;rsquo;s micro-batch — running your existing pipeline every 15–60 minutes instead of once a day. No new infrastructure required. Your existing SQL, dbt, and Airflow tools work unchanged at shorter intervals.&lt;/p>
&lt;p>&lt;strong>&amp;ldquo;I need to see events within seconds of them happening.&amp;rdquo;&lt;/strong> &lt;em>This&lt;/em> is actual real-time. And it&amp;rsquo;s genuinely rare. Fraud detection, safety monitoring, real-time bidding — these need streaming. Your internal sales dashboard almost certainly does not.&lt;/p>
&lt;p>For the micro-batch pattern in Airflow, it&amp;rsquo;s a scheduling change:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">with&lt;/span> DAG(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;micro_batch_analytics&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> schedule_interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;*/15 * * * *&amp;#39;&lt;/span>, &lt;span style="color:#75715e"># every 15 minutes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> catchup&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">False&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> max_active_runs&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#75715e"># prevent overlap&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>) &lt;span style="color:#66d9ef">as&lt;/span> dag:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...tasks here&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>max_active_runs=1&lt;/code> is important. Without it, if a run takes longer than 15 minutes, the next run starts before the previous one finishes. They compete for the same warehouse, both run slower, and you&amp;rsquo;ve created the resource contention problem from the shifting-right section.&lt;/p>
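The failure mode is simple arithmetic, and worth checking whenever you shorten a schedule. A small sketch with a hypothetical 20-minute runtime on a 15-minute schedule shows every consecutive pair of runs overlapping, which is exactly what `max_active_runs=1` prevents:

```python
# Sketch: a 15-minute schedule with a runtime that has drifted to
# 20 minutes. Times are minutes from an arbitrary start; data is made up.
INTERVAL_MIN = 15
RUNTIME_MIN = 20

starts = [i * INTERVAL_MIN for i in range(4)]   # 0, 15, 30, 45
ends = [s + RUNTIME_MIN for s in starts]        # 20, 35, 50, 65

# A run overlaps its successor if the successor starts before it ends
overlaps = [starts[i + 1] < ends[i] for i in range(3)]
print(overlaps)  # every pair overlaps once runtime exceeds the interval
```

With the cap in place, Airflow skips or queues the next run instead of stacking concurrent runs onto the same warehouse.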
&lt;p>In Snowflake, pair this with Snowpipe for continuous S3 ingestion:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create a pipe for continuous loading from S3
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> PIPE raw.orders_pipe
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AUTO_INGEST &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">TRUE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AS&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">COPY&lt;/span> &lt;span style="color:#66d9ef">INTO&lt;/span> raw.orders_stream
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> &lt;span style="color:#f92672">@&lt;/span>raw.s3_orders_stage
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> FILE_FORMAT &lt;span style="color:#f92672">=&lt;/span> (&lt;span style="color:#66d9ef">TYPE&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;PARQUET&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> MATCH_BY_COLUMN_NAME &lt;span style="color:#f92672">=&lt;/span> CASE_INSENSITIVE;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Snowpipe loads files within minutes of them landing in S3. Your micro-batch dbt pipeline then transforms whatever arrived since the last run. The combination delivers data freshness measured in minutes — not seconds, but close enough for the vast majority of business use cases — without any streaming infrastructure.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="measure-time-to-consumer-not-just-pipeline-runtime">Measure time-to-consumer, not just pipeline runtime&lt;/h3>
&lt;br>
&lt;p>Pipeline runtime tells you how long your DAG takes. Time-to-consumer tells you how long a business event takes to reach a human decision-maker. These are different numbers, and the second one is what your stakeholders actually care about.&lt;/p>
&lt;p>Time-to-consumer decomposes into stages:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Source extraction delay&lt;/strong>: Time between the event occurring and the data landing in your lake&lt;/li>
&lt;li>&lt;strong>Pipeline processing time&lt;/strong>: Your DAG runtime (the part you control most directly)&lt;/li>
&lt;li>&lt;strong>Warehouse serving time&lt;/strong>: Query execution time when a dashboard or report reads the data&lt;/li>
&lt;li>&lt;strong>Cache/refresh delay&lt;/strong>: How often the BI tool refreshes its cache&lt;/li>
&lt;/ul>
&lt;p>Track it by stamping timestamps at each handoff:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Build a freshness tracking model in dbt
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- mart_data_freshness.sql
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;fct_orders&amp;#39;&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> model_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(order_timestamp) &lt;span style="color:#66d9ef">AS&lt;/span> latest_source_event,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(_loaded_at) &lt;span style="color:#66d9ef">AS&lt;/span> latest_load_time,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>() &lt;span style="color:#66d9ef">AS&lt;/span> measured_at,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TIMESTAMPDIFF(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;minute&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(order_timestamp),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ) &lt;span style="color:#66d9ef">AS&lt;/span> minutes_since_latest_event,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TIMESTAMPDIFF(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;minute&amp;#39;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">MAX&lt;/span>(_loaded_at),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CURRENT_TIMESTAMP&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ) &lt;span style="color:#66d9ef">AS&lt;/span> minutes_since_latest_load
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#960050;background-color:#1e0010">{{&lt;/span> &lt;span style="color:#66d9ef">ref&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;fct_orders&amp;#39;&lt;/span>) &lt;span style="color:#960050;background-color:#1e0010">}}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Schedule this model to run after every pipeline completion. Over time, you&amp;rsquo;ll see whether &lt;code>minutes_since_latest_event&lt;/code> is stable, improving, or drifting. If it&amp;rsquo;s drifting, you can pinpoint which stage is responsible by comparing the event timestamp, load timestamp, and measurement timestamp.&lt;/p>
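You can automate the drift check Priya did by hand: fit a line through daily completion times and project when the trend crosses the SLA. A minimal sketch with made-up numbers (a pipeline drifting roughly 2 minutes per day from a 5:47 AM baseline):

```python
# Fit a least-squares slope to completion times (minutes after midnight)
# and project the SLA breach. Sample data is illustrative only.
def fit_slope(ys):
    """Least-squares slope of ys against day index 0..n-1, in minutes/day."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# 30 days of completion times drifting 2 min/day (5:47 AM = 347 minutes)
completions = [347 + 2 * d for d in range(30)]
slope = fit_slope(completions)

sla = 9 * 60  # 9:00 AM SLA, in minutes after midnight
days_left = (sla - completions[-1]) / slope
print(f"drift: {slope:.1f} min/day, SLA breach in ~{days_left:.0f} days")
```

Wire the slope into an alert and the trend line gets noticed by a machine instead of depending on a curious engineer.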
&lt;p>In dbt, you can also use source freshness tests to alert when upstream data is stale:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># models/staging/_sources.yml&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">sources&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#f92672">name&lt;/span>: &lt;span style="color:#ae81ff">raw&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">database&lt;/span>: &lt;span style="color:#ae81ff">raw_db&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">schema&lt;/span>: &lt;span style="color:#ae81ff">public&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">loaded_at_field&lt;/span>: &lt;span style="color:#ae81ff">_loaded_at&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">freshness&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">warn_after&lt;/span>: { &lt;span style="color:#f92672">count: 2, period&lt;/span>: &lt;span style="color:#ae81ff">hour }&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">error_after&lt;/span>: { &lt;span style="color:#f92672">count: 6, period&lt;/span>: &lt;span style="color:#ae81ff">hour }&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">tables&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#f92672">name&lt;/span>: &lt;span style="color:#ae81ff">orders&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#f92672">name&lt;/span>: &lt;span style="color:#ae81ff">customers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#f92672">name&lt;/span>: &lt;span style="color:#ae81ff">events&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Running &lt;code>dbt source freshness&lt;/code> checks whether the upstream data is as fresh as you expect. If &lt;code>raw.orders&lt;/code> hasn&amp;rsquo;t received new rows in 6 hours, the test fails — and that&amp;rsquo;s a signal that the source system&amp;rsquo;s extract is delayed, not that your pipeline is broken.&lt;/p>
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="the-view-on-view-problem-the-compound-interest-of-bad-performance">The view-on-view problem: the compound interest of bad performance&lt;/h3>
&lt;br>
&lt;p>This is the architectural anti-pattern I see most often in Snowflake environments, and it&amp;rsquo;s one of those problems that starts small and compounds quietly.&lt;/p>
&lt;p>A view in Snowflake doesn&amp;rsquo;t store any data — it re-executes its SQL every time it&amp;rsquo;s queried. That&amp;rsquo;s fine for a simple view. But when View C references View B, which references View A, which scans a large table — every query against View C triggers the entire chain from scratch.&lt;/p>
&lt;p>I&amp;rsquo;ve seen environments where analysts querying a &amp;ldquo;simple&amp;rdquo; dashboard view were unknowingly triggering five layers of nested views, each with its own JOINs and aggregations. The query took 4 minutes. After materialising the two most expensive intermediate layers as tables (refreshed daily), the same query took 3 seconds.&lt;/p>
&lt;p>The fix is to audit your materialisation strategy:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Find views that reference other views (potential nesting)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> referencing_object_name &lt;span style="color:#66d9ef">AS&lt;/span> downstream_view,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> referenced_object_name &lt;span style="color:#66d9ef">AS&lt;/span> upstream_object,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> referenced_object_type
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> snowflake.account_usage.object_dependencies
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> referencing_object_type &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;VIEW&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">AND&lt;/span> referenced_object_type &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;VIEW&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> downstream_view;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If this query returns results, you have view-on-view nesting. Not all of it is bad — a simple renaming view on top of a complex view is fine. But if you find three or more layers, or if any of the intermediate views involve heavy JOINs or aggregations, materialise the expensive layers as tables.&lt;/p>
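Once you have the dependency pairs, measuring chain depth is a short recursion. A sketch over hypothetical edges shaped like the query's output (downstream view, upstream view):

```python
# Compute nesting depth per view from (downstream, upstream) edge pairs,
# like those returned by OBJECT_DEPENDENCIES. Edges here are hypothetical.
from functools import lru_cache

edges = [
    ("dash_v", "agg_v"),
    ("agg_v", "join_v"),
    ("join_v", "base_v"),
]
deps: dict[str, list[str]] = {}
for down, up in edges:
    deps.setdefault(down, []).append(up)

@lru_cache(maxsize=None)
def depth(view: str) -> int:
    """1 for a view on base tables, +1 per nested view layer."""
    return 1 + max((depth(u) for u in deps.get(view, [])), default=0)

worst = max(deps, key=depth)
print(worst, depth(worst))  # anything at 3+ is a materialisation candidate
```

Sorting views by depth gives you the audit list: the deepest chains are where materialising one intermediate layer pays off most.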
&lt;p>In dbt terms, the decision framework is:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Ephemeral&lt;/strong>: Simple CTEs, column renaming, type casting. Zero cost.&lt;/li>
&lt;li>&lt;strong>View&lt;/strong>: Light transformations queried infrequently by a few consumers.&lt;/li>
&lt;li>&lt;strong>Table&lt;/strong>: Complex transformations queried frequently. Rebuilt every run.&lt;/li>
&lt;li>&lt;strong>Incremental&lt;/strong>: Large tables where only a fraction of rows change. The right choice for most fact tables.&lt;/li>
&lt;/ul>
&lt;p>If a model takes more than 30 seconds to build and is queried more than once between builds, it should be a table or incremental — not a view.&lt;/p>
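That rule of thumb can be written down as a function, which makes it easy to apply mechanically across a project. The thresholds come from the text; the function name and the changed-fraction cutoff are illustrative assumptions, not dbt API:

```python
# Rule-of-thumb materialisation picker: build cost vs. read frequency.
# The 30s and read-count thresholds come from the article; the 10%
# changed-rows cutoff for incremental is an illustrative assumption.
def suggest_materialisation(build_seconds: float,
                            reads_between_builds: int,
                            changed_fraction: float) -> str:
    if build_seconds > 30 and reads_between_builds > 1:
        # Expensive and read often: stop re-running the SQL per query
        return "incremental" if changed_fraction < 0.1 else "table"
    return "view"

print(suggest_materialisation(240, 50, 0.01))  # large, mostly-stable fact table
print(suggest_materialisation(2, 3, 1.0))      # cheap staging transformation
```

It will not capture every nuance (freshness requirements, warehouse sizing), but it catches the common case: expensive views being re-executed on every dashboard load.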
&lt;hr>
&lt;br>
&lt;br>
&lt;h3 id="coming-back-to-priyas-scatter-plot">Coming back to Priya&amp;rsquo;s scatter plot&lt;/h3>
&lt;br>
&lt;p>Six weeks after Priya showed me that graph, we&amp;rsquo;d fixed the three root causes: an intermediate model that had grown from 5 million to 180 million rows without being switched to incremental, a phantom dependency chain that added 12 minutes of serial wait time, and a warehouse that was being shared between transformation and BI queries during the morning refresh window.&lt;/p>
&lt;p>Pipeline completion moved from 7:23 AM back to 5:15 AM. The finance team got their dashboards before their morning coffee. And Priya? She built a monitoring dashboard that tracked completion time drift automatically — the thing I should have built from the start.&lt;/p>
&lt;p>Here&amp;rsquo;s what I learned from that experience, and what I hope you take from this article: &lt;strong>pipeline performance isn&amp;rsquo;t a problem you solve once.&lt;/strong> It&amp;rsquo;s a trajectory you manage. Data grows. Teams add models. Source systems change. The pipeline that&amp;rsquo;s fast today will be slow in six months unless someone is watching the trend line.&lt;/p>
&lt;p>The engineers who build reliable data platforms aren&amp;rsquo;t the ones who build the fastest pipeline on day one. They&amp;rsquo;re the ones who notice when completion time drifts by 3 minutes per week — and fix it before anyone else has to ask.&lt;/p>
&lt;p>Track your completion times. Audit your dependencies quarterly. Profile your critical path after every significant change. And teach your team that a green DAG doesn&amp;rsquo;t mean a healthy pipeline. Sometimes green just means it hasn&amp;rsquo;t failed &lt;em>yet&lt;/em>.&lt;/p>
&lt;p>&lt;br>&lt;br>&lt;/p></content:encoded><category>Data Engineering</category><category>Cloud Architecture</category><category>Snowflake</category><category>AWS</category><category>Airflow</category><category>Pipeline Optimization</category><category>dbt</category><category>Data Freshness</category></item></channel></rss>