What You’re Getting


Ever inherited a data warehouse and wondered if anything actually works? Ever explained to stakeholders why “high-quality data” matters but couldn’t quantify it?

This toolkit solves that. It’s the collection of scripts I’ve built and refined across years of engagements where data teams needed to go from “we think our data is okay” to “here’s exactly what’s wrong and here’s how we’re fixing it.”

This isn’t generic code you’ll find on Stack Overflow. These are scripts that work across PostgreSQL, SQL Server, MySQL, Oracle, Snowflake, Trino, and Delta Lake. They implement the six data quality dimensions (Completeness, Uniqueness, Timeliness, Validity, Accuracy, Consistency) in ways that actually help you make decisions.


What Makes This Different


Most data quality tools force you into their framework. They’re expensive, require specialized skills, and generate reports nobody reads. These scripts do the opposite.

They work where your data lives. No new infrastructure. No vendor lock-in. Just SQL (ANSI standard, with platform-specific notes) and Python for the advanced stuff.

They generate actionable intelligence, not dashboards. Every query produces a work list. Records that failed validation. Duplicates that need merging. Fields that need fixing. Your data stewards can actually do something with the output.

They scale from exploration to production. Start by profiling a table to understand what you have. Once you know what matters, formalize those checks into production monitoring. The toolkit supports both modes.


What’s Included


You get 15 SQL scripts organized into two sections:

Data Profiling Scripts (Exploratory Analysis)

  • 00_basic_statistics.sql - Dynamic query generators that profile entire tables
  • 01_completeness_analysis.sql - NULL and empty string detection
  • 02_uniqueness_analysis.sql - Duplicate identification (exact and fuzzy)
  • 03_value_distribution_analysis.sql - Frequency analysis and cardinality
  • 04_data_type_validation.sql - Type checking and casting validation
  • 05_pattern_analysis.sql - Email, phone, postal code format validation
  • 06_outlier_detection.sql - Z-score and percentile-based anomaly detection
  • 07_cross_column_analysis.sql - Relationship and dependency validation
  • 08_temporal_analysis.sql - Data freshness and update frequency checks

Data Quality Checks (Production Monitoring)

  • 00_dq_metadata_tables.sql - Enterprise metadata framework (7 tables + 3 views)
  • 01_completeness_queries.sql - Production NULL checks with thresholds
  • 02_uniqueness_queries.sql - Duplicate monitoring and alerting
  • 03_timeliness_queries.sql - Data latency and SLA tracking
  • 04_validity_queries.sql - Format and range enforcement
  • 05_accuracy_consistency_queries.sql - Cross-system reconciliation

Plus:

  • fuzzy_duplicate_detection.py - Advanced Python script for fuzzy matching (Levenshtein distance, Jaro-Winkler, phonetic matching, multi-field scoring)
  • Data_Quality_Business_Case_Calculator.xlsx - ROI calculator with worked examples

The Metadata Framework


Here’s something most data quality tools don’t give you: a complete metadata infrastructure.

The 00_dq_metadata_tables.sql script creates seven interconnected tables and three views that turn ad-hoc checks into a program (a simplified sketch of the core tables appears after the list). Together they cover:

  • Check Registry - Catalog all your quality rules with ownership, thresholds, and schedules
  • Results History - Track quality scores over time for trending
  • Failed Records - Detailed log of which records failed and why
  • Work Queues - Automatic routing of failures to data stewards
  • Threshold Configuration - Flexible pass/warn/fail thresholds by dimension
  • Benefits Tracking - Quantify ROI with before/after measurements
  • Dashboard Views - Pre-built executive and operational dashboards
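
To make that concrete, here is a deliberately simplified sketch of the first two tables. The shipped script defines the full framework; every name, column, and type below is an illustrative stand-in, not the toolkit's actual schema.

  -- Simplified, illustrative sketch only - the real 00_dq_metadata_tables.sql
  -- defines the complete framework with its own names and columns.
  CREATE TABLE dq_check_registry (
      check_id        INTEGER PRIMARY KEY,
      check_name      VARCHAR(200) NOT NULL,
      dimension       VARCHAR(30)  NOT NULL,      -- e.g. 'Completeness'
      target_table    VARCHAR(200) NOT NULL,
      owner           VARCHAR(100),
      warn_threshold  DECIMAL(5,2),               -- e.g. 95.00 (% passing)
      fail_threshold  DECIMAL(5,2),               -- e.g. 90.00
      schedule        VARCHAR(50)                 -- e.g. 'daily'
  );

  CREATE TABLE dq_check_results (
      result_id       INTEGER PRIMARY KEY,
      check_id        INTEGER NOT NULL REFERENCES dq_check_registry (check_id),
      run_timestamp   TIMESTAMP NOT NULL,
      rows_checked    BIGINT,
      rows_failed     BIGINT,
      pass_rate       DECIMAL(5,2),               -- passing rows / checked rows * 100
      status          VARCHAR(10)                 -- 'PASS', 'WARN', or 'FAIL'
  );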

This framework answers questions like:

  • “How has email completeness trended over the last quarter?” (worked through as an example after this list)
  • “Which checks are consistently failing?”
  • “What’s the backlog of records needing remediation?”
  • “What’s the actual cost avoidance from fixing data quality?”
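
The first question above, for instance, reduces to a simple aggregate over the results history. Here's what that can look like against the illustrative table from the sketch, with a hypothetical check_id of 101 for the email completeness check and PostgreSQL-style date functions:

  -- Weekly trend of a completeness check over the last quarter.
  -- Table/column names come from the illustrative sketch above; check_id 101
  -- is hypothetical. date_trunc/INTERVAL are PostgreSQL syntax - adapt as needed.
  SELECT
      date_trunc('week', run_timestamp) AS week_start,
      AVG(pass_rate)                    AS avg_pass_rate,
      MIN(pass_rate)                    AS worst_run
  FROM dq_check_results
  WHERE check_id = 101
    AND run_timestamp >= CURRENT_DATE - INTERVAL '3 months'
  GROUP BY date_trunc('week', run_timestamp)
  ORDER BY week_start;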

How to Use This


If you’re exploring a new dataset: Start with 00_basic_statistics.sql. Run the dynamic variant to generate a profile query for your table. Copy, paste, execute. You’ll see row counts, NULL percentages, distinct value counts - everything you need to understand what you’re dealing with.
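
Conceptually, the dynamic variant builds that profile for you by reading column names from the catalog and generating one SQL snippet per column. A minimal sketch of the idea, using PostgreSQL-style string concatenation against a hypothetical your_schema.customers table (the shipped generator is more complete and covers platform differences):

  -- Generate a per-column profiling snippet for every column of a table.
  -- 'your_schema' and 'customers' are placeholders. Run this, then execute
  -- the SQL it produces.
  SELECT
      'SELECT ''' || column_name || ''' AS column_name, '
      || 'COUNT(*) AS row_count, '
      || 'SUM(CASE WHEN ' || column_name || ' IS NULL THEN 1 ELSE 0 END) AS null_count, '
      || 'COUNT(DISTINCT ' || column_name || ') AS distinct_count '
      || 'FROM your_schema.customers' AS generated_profile_sql
  FROM information_schema.columns
  WHERE table_schema = 'your_schema'
    AND table_name   = 'customers';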

Then pick the dimension that matters most. Worried about duplicates? Run 02_uniqueness_analysis.sql. Concerned about missing data? Use 01_completeness_analysis.sql. Each script includes both static examples (copy and customize) and dynamic generators (let the code write the SQL for you).

If you’re building production monitoring: Use the Data Quality Checks scripts. They’re designed to run on a schedule (Airflow, dbt, cron, whatever you have). Each query produces pass/warn/fail status and can log results to the metadata framework for trending.
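
A minimal sketch of what one of those checks can look like (the suppliers table, email column, and 98/95 thresholds are placeholders; the shipped queries are richer and also write results into the metadata framework):

  -- Illustrative completeness check with pass/warn/fail thresholds.
  -- Table, column, and thresholds are placeholders.
  SELECT
      rows_checked,
      rows_failed,
      pass_rate,
      CASE
          WHEN pass_rate >= 98 THEN 'PASS'
          WHEN pass_rate >= 95 THEN 'WARN'
          ELSE 'FAIL'
      END AS status
  FROM (
      SELECT
          COUNT(*) AS rows_checked,
          SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END) AS rows_failed,
          100.0 * SUM(CASE WHEN email IS NOT NULL AND email <> '' THEN 1 ELSE 0 END)
              / NULLIF(COUNT(*), 0) AS pass_rate
      FROM suppliers
  ) t;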

If you need advanced duplicate detection: The SQL scripts handle exact duplicates well. But what about “Acme Corp” vs “ACME Corporation” vs “Acme Corp.”? That’s where fuzzy_duplicate_detection.py comes in. It connects to any database (PostgreSQL, MySQL, SQL Server, Oracle, SQLite), extracts the data, and uses sophisticated algorithms (Levenshtein distance, Jaro-Winkler, Metaphone, NYSIIS) to find near-matches.

The Python script supports multi-field weighted scoring too. You can say “match if name is 85% similar AND (email OR phone) matches” and get results ranked by confidence. Perfect for deduplication projects.
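
For a feel of that rule in plain SQL, here is a rough single-engine analogue, assuming PostgreSQL with the pg_trgm extension and a hypothetical customers table - trigram similarity stands in for the script's Levenshtein/Jaro-Winkler scoring, which is what you'd actually want for portable, confidence-ranked results:

  -- Rough analogue of: name >= 85% similar AND (email OR phone) matches.
  -- Requires CREATE EXTENSION pg_trgm; PostgreSQL only; all names illustrative.
  SELECT
      a.customer_id AS id_a,
      b.customer_id AS id_b,
      similarity(a.customer_name, b.customer_name) AS name_score
  FROM customers a
  JOIN customers b
    ON a.customer_id < b.customer_id              -- compare each pair once
  WHERE similarity(a.customer_name, b.customer_name) >= 0.85
    AND (a.email = b.email OR a.phone = b.phone)
  ORDER BY name_score DESC;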

If you need to justify investment: Open the Business Case Calculator. It includes a worked example showing how to quantify the cost of poor quality (example: $120K annually from missing supplier emails causing payment delays) and calculate expected savings from fixing it. The spreadsheet covers effort estimation, resource planning, and ROI calculation.


Cross-Platform Compatibility


Every script includes portability notes in the header explaining platform-specific syntax differences. For example, date arithmetic varies (PostgreSQL uses + INTERVAL '1 day', SQL Server uses DATEADD(), Oracle uses + 1), so the comments show you how to adapt each query to your ecosystem.
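
As a quick illustration, “rows updated in the last day” written three ways (orders and updated_at are placeholder names):

  -- PostgreSQL
  SELECT COUNT(*) FROM orders
  WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL '1 day';

  -- SQL Server
  SELECT COUNT(*) FROM orders
  WHERE updated_at >= DATEADD(day, -1, GETDATE());

  -- Oracle
  SELECT COUNT(*) FROM orders
  WHERE updated_at >= SYSDATE - 1;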

The goal is ANSI SQL compliance wherever possible, with clear alternatives when platforms diverge.


Technical Requirements


For SQL Scripts:

  • Access to a SQL database (any platform listed above)
  • Read access to information_schema or equivalent (needed only for the dynamic query generators)
  • Write permissions if you want to create the metadata tables (optional)

For Python Fuzzy Matching:

  • Python 3.8+
  • Libraries: sqlalchemy, pandas, python-Levenshtein, jellyfish, rapidfuzz
  • Appropriate database driver (psycopg2, pymysql, pyodbc, cx_Oracle)

For Business Case Calculator:

  • Microsoft Excel or compatible spreadsheet software

Performance Considerations


For large tables (10M+ rows):

  • Use SAMPLE clauses (examples provided; see the sketch after this list)
  • Leverage partitioning for time-series data
  • Create indexes on frequently checked columns
  • Use columnar storage (Parquet, ORC) for analytical queries
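
For example, a profiling pass over roughly 1% of a table instead of the whole thing (sampling syntax varies by platform; orders and email are placeholder names):

  -- PostgreSQL
  SELECT COUNT(*) AS sampled_rows,
         SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails
  FROM orders TABLESAMPLE SYSTEM (1);

  -- SQL Server:  ... FROM orders TABLESAMPLE (1 PERCENT);
  -- Snowflake:   ... FROM orders SAMPLE (1);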

For Python fuzzy matching:

  • Blocking strategies (group by first letter, phonetic code, geography)
  • Sampling support for initial testing
  • Parallel processing for multi-core systems
  • Database connection pooling for batch operations

The profiling scripts include both full-table and sampled versions. Start with a sample to understand the data, then decide if a full scan is necessary.


What This Costs


$47 AUD

That’s less than an hour of data engineer time. You’re getting:

  • 15 production-ready SQL scripts
  • 1 advanced Python script with fuzzy matching
  • Enterprise metadata framework (7 tables + 3 views)
  • ROI calculator with worked examples
  • Comprehensive documentation

What it actually saves: The average data engineer spends 3-5 hours per week writing ad-hoc queries to check data quality. These scripts eliminate that. At $150/hour, that’s $450-$750 weekly, or $23K-$39K annually per engineer.

The Business Case Calculator includes a worked example showing how a single dimension (email completeness) can justify $95K in annual savings. Apply that across six dimensions and multiple critical datasets, and the ROI becomes obvious.


Purchase & Download


Click the button below to purchase. You’ll receive immediate access to download all files (SQL scripts, Python script, Excel calculator, documentation).

Buy Data Quality Toolkit - $47 AUD

You can also check the full methodology guide on the blog.


License Terms


You get a personal, non-transferable license for your own work (personal projects and commercial work within your organization). You can modify the scripts, learn from them, and use the techniques in your job. You own what you create with them.

What you can’t do: Share the original scripts publicly, redistribute them, or include them in products you sell. Think of it like a cookbook - use the recipes all you want, but don’t photocopy the book for your friends.

Full terms in LICENSE.md included with the download. Your Australian Consumer Law rights are protected.