The CSV Test Suite Nobody Writes
In October 2020, roughly 16,000 positive COVID-19 test results vanished from the UK’s public health reporting for nearly a week. Not because the tests weren’t run. Not because the labs didn’t report them. The results were collected, transmitted, and received — inside CSV files.
The problem? Public Health England was importing those CSV files into Microsoft Excel’s legacy .xls format. The format has a hard row limit of 65,536. When the files grew past that limit, Excel didn’t throw an error. It didn’t warn anyone. It just silently dropped the extra rows. Sixteen thousand people who tested positive for a deadly virus during a second wave went untraced. An estimated 50,000 of their contacts were never notified. And the system this happened in? Part of a £12 billion Test and Trace programme.
I remember reading that story and feeling something beyond frustration. I felt recognition. Because I’d seen this pattern before — not at that scale, not with those stakes — but the same fundamental failure. Someone produced a CSV. Someone else consumed it. And nobody, at any point in the chain, had written a test that asked: “What happens when this file gets bigger than we expect?”
That’s the moment I became a bit obsessive about CSV testing. Not because CSV is complicated — it’s deceptively simple. But because that simplicity breeds a dangerous confidence. Everyone thinks they know how to handle a CSV. Everyone thinks the format is too basic to need a test suite. And that assumption has cost organisations billions of dollars, corrupted scientific research, and — in the case of COVID — put actual lives at risk.
I’m telling you this because CSV testing is one of those things that sounds unnecessary until you’ve spent a weekend rebuilding a pipeline because a vendor changed their quoting convention and nobody noticed for three weeks. The engineers I’ve worked with who do this well — who write proper CSV test suites — don’t do it because they’re cautious by nature. They do it because they’ve been burned. And they decided it wouldn’t happen again.
This article is the test suite I wish someone had handed me years ago. We’re going to take RFC 4180, the closest thing CSV has to a standard, and turn it into concrete test cases. We’ll build manifest files that make file delivery verifiable. And we’ll walk through how to systematically break down everything you know about CSV edge cases — embedded commas, encoding traps, null representation chaos, the lot — into tests you can run during development, not discoveries you make in production at 2am.
Whether you’re producing CSVs for a downstream consumer or receiving them from a vendor who swears the file is “fine,” this is your playbook.
Why “I Know How to Test” Isn’t Enough
Here’s what I’ve noticed when I ask data engineers how they test their CSV files: most of them describe validation, not testing. They’ll say things like “we check the row count” or “we verify the columns match the schema.” And those are good things to do. But they’re runtime checks — they catch problems after they’ve already arrived at your door.
Testing is different. Testing happens during development. Testing means deliberately creating malformed files and confirming your system handles them correctly. Testing means asking “what’s the worst thing a vendor could send me?” and then actually building that worst-case file and feeding it through your pipeline.
The gap between validation and testing is where the expensive surprises live. Validation tells you this specific file has a problem. Testing tells you your system has a problem — before any real data gets near it.
So how do you build a proper CSV test suite? You need a specification to test against. And despite its limitations, that specification is RFC 4180.
RFC 4180: Your Test Specification (and Its Gaps)
RFC 4180 was published in October 2005 by Yakov Shafranovich. It’s universally treated as “the CSV standard,” but its opening line tells a different story: it’s an “Informational” memo that “does not specify an Internet standard of any kind.” It was written to document common practice and register the text/csv MIME type. That’s it.
But here’s the thing — even a flawed specification is better than no specification. RFC 4180 gives us seven concrete rules, and each one translates directly into test cases. Let’s walk through them.
Rule 1: Records Are Separated by CRLF
The spec mandates \r\n (carriage return + line feed) as the line ending. Not just \n (Unix/macOS default). Not just \r (old Mac).
Test cases for producers:
def test_line_endings_are_crlf(csv_content: bytes):
    """Every record must end with CRLF, not LF-only or CR-only.

    Note: this is a naive byte-level check. It assumes no embedded
    newlines inside quoted fields, so run it on files that have
    already passed the quoting tests."""
    lines = csv_content.split(b'\r\n')
    for i, line in enumerate(lines):
        # A bare LF left over here means a Unix line ending slipped through
        assert b'\n' not in line, (
            f"Row {i+1}: Found bare LF (Unix line ending). "
            f"RFC 4180 requires CRLF."
        )
        # A bare CR left over here means an old-Mac line ending slipped through
        assert b'\r' not in line, (
            f"Row {i+1}: Found bare CR (old Mac line ending). "
            f"RFC 4180 requires CRLF."
        )
Test cases for consumers:
def test_consumer_handles_mixed_line_endings(parser):
    """Consumer should handle LF, CR, and CRLF gracefully."""
    test_cases = {
        "unix_lf": b"name,age\nAlice,30\nBob,25\n",
        "windows_crlf": b"name,age\r\nAlice,30\r\nBob,25\r\n",
        "old_mac_cr": b"name,age\rAlice,30\rBob,25\r",
        "mixed": b"name,age\r\nAlice,30\nBob,25\r",
    }
    for label, content in test_cases.items():
        result = parser(content)
        assert len(result) == 2, (
            f"Line ending style '{label}': Expected 2 data rows, "
            f"got {len(result)}"
        )
This distinction matters. If you’re a producer, your job is to emit compliant files. If you’re a consumer, your job is to survive whatever arrives. The test mindset is different for each role — producers test for correctness, consumers test for resilience.
Rule 2: The Last Record May or May Not Have a Trailing CRLF
This sounds trivial, but it catches people. Some systems emit a final newline, others don’t. Your parser needs to handle both without creating a phantom empty row at the end.
def test_trailing_newline_does_not_create_empty_row(parser):
    """A trailing CRLF should not produce an extra empty record."""
    with_trailing = b"name,age\r\nAlice,30\r\n"
    without_trailing = b"name,age\r\nAlice,30"
    assert len(parser(with_trailing)) == len(parser(without_trailing))
Rule 3: There Should Be a Header Row
The RFC says “should” — not “must.” This ambiguity is a source of real problems. Some vendors send headers, some don’t, and some send headers that change between deliveries.
Not having headers isn’t always a mistake. From a security perspective, omitting headers is actually defensible. Think about what a header row really is: it’s a plaintext schema sitting right there in the first line of your file. If someone intercepts a CSV with columns named ssn, credit_score, annual_income, and account_balance, they don’t need to decode anything. The header tells them exactly what they’ve stolen and which column to exploit first.
Stripping headers from files in transit — especially files containing PII, financial data, or health records — reduces the value of an intercepted file to an attacker. A file full of numbers and strings without column names is significantly harder to weaponise than one with a helpful schema attached.
But here’s the catch: headerless files only work if you have a reliable mechanism to reconstruct the schema at the consumer end. The column definitions need to live somewhere — and that somewhere should be your manifest file or your data contract. The manifest we build later in this article includes a columns array with names, types, and ordering for exactly this reason. The producer strips the headers, the manifest carries the schema, and the consumer rebuilds the structure on ingestion.
This means your tests need to handle both scenarios:
def test_header_row_present_and_matches_schema(csv_file, expected_columns):
    """When headers are expected, verify they match the schema exactly."""
    header = csv_file.readline().strip().split(',')
    assert header == expected_columns, (
        f"Header mismatch.\n"
        f"Expected: {expected_columns}\n"
        f"Got: {header}"
    )

def test_no_duplicate_column_names(csv_file):
    """Duplicate column names cause silent column overwrites in pandas."""
    header = csv_file.readline().strip().split(',')
    duplicates = [h for h in header if header.count(h) > 1]
    assert not duplicates, (
        f"Duplicate column names found: {set(duplicates)}. "
        f"Pandas will silently rename these to name.1, name.2, etc."
    )

def test_headerless_file_matches_manifest_schema(csv_file, manifest):
    """When headers are intentionally omitted for security,
    the first data row must match the manifest's column count
    and the manifest must define column names and types."""
    first_row = csv_file.readline().strip().split(',')
    expected_cols = manifest['files'][0]['columns']
    assert len(first_row) == len(expected_cols), (
        f"Headerless file has {len(first_row)} fields but manifest "
        f"defines {len(expected_cols)} columns. Without headers, "
        f"column alignment is entirely dependent on the manifest."
    )
    # Verify manifest actually defines names (not just count)
    for col_def in expected_cols:
        assert 'name' in col_def, (
            "Manifest column definitions must include 'name' — "
            "headerless files are only safe if the schema is "
            "fully specified in the manifest."
        )
Whether you include headers or not, the decision must be explicit and documented in your data contract. The worst scenario isn’t a headerless file or a headered file — it’s a file where nobody agreed which one it should be, and the answer changes depending on who ran the export.
Rule 4: Each Row Should Contain the Same Number of Fields
This is where “ragged rows” come from — and they’re surprisingly common. A free-text field containing an unescaped comma silently creates an extra column in that row, shifting everything to the right.
def test_consistent_column_count(csv_reader):
    """Every row must have the same number of fields as the header."""
    header = next(csv_reader)
    expected = len(header)
    ragged_rows = []
    for i, row in enumerate(csv_reader, start=2):
        if len(row) != expected:
            ragged_rows.append({
                'row': i,
                'expected': expected,
                'got': len(row),
                'content': row[:3]  # First 3 fields for debugging
            })
    assert not ragged_rows, (
        f"Found {len(ragged_rows)} ragged rows. "
        f"First: row {ragged_rows[0]['row']} has "
        f"{ragged_rows[0]['got']} fields (expected {expected}). "
        f"This usually means an unescaped comma or newline in the data."
    )
This test is the single most valuable check you can run against a CSV file. If it passes, most of the structural problems aren’t present. If it fails, you know exactly where to look.
Rules 5-7: Quoting and Escaping
These three rules govern how fields handle special characters:
- Fields containing commas, double-quotes, or line breaks must be enclosed in double-quotes
- Double-quotes inside a quoted field are escaped by doubling them: ""
- Spaces are part of field data and must not be trimmed
These rules are where the most insidious bugs hide:
# -- Producer tests: Do we generate valid quoted fields? --

def test_fields_with_commas_are_quoted():
    """A field containing a comma must be wrapped in double-quotes."""
    record = generate_csv_row(address="456 Oak Ave, Apt 2B")
    assert '"456 Oak Ave, Apt 2B"' in record

def test_fields_with_quotes_are_escaped():
    """Double-quotes in data must be doubled, and field must be quoted."""
    record = generate_csv_row(company='Acme "Holdings" Ltd')
    assert '"Acme ""Holdings"" Ltd"' in record

def test_fields_with_newlines_are_quoted():
    """A field containing a line break must be wrapped in double-quotes."""
    record = generate_csv_row(notes="Line one\r\nLine two")
    # The entire field, including the newline, should be inside quotes
    assert '"Line one\r\nLine two"' in record

# -- Consumer tests: Can we parse valid-but-tricky fields? --

def test_parse_embedded_comma():
    content = b'name,address\r\n"Alice","456 Oak Ave, Apt 2B"\r\n'
    result = parse(content)
    assert result[0]['address'] == '456 Oak Ave, Apt 2B'

def test_parse_embedded_newline():
    content = b'name,notes\r\n"Alice","Line one\r\nLine two"\r\n'
    result = parse(content)
    assert result[0]['notes'] == 'Line one\r\nLine two'

def test_parse_escaped_quotes():
    content = b'name,company\r\n"Alice","Acme ""Holdings"" Ltd"\r\n'
    result = parse(content)
    assert result[0]['company'] == 'Acme "Holdings" Ltd'
That covers the seven RFC rules. But here’s the catch — the RFC is only half the story. The really nasty problems come from everything the RFC doesn’t cover.
Testing What the RFC Forgot
RFC 4180 says nothing about character encoding, data types, null values, or how Excel will mangle your data. These are the gaps where production failures breed. Each one becomes a test category.
Encoding Tests
The most common encoding problem is invisible: the UTF-8 BOM (Byte Order Mark). Excel on Windows prepends three bytes (EF BB BF) to CSV files saved as UTF-8. Those bytes become an invisible character attached to your first column name. You try df['Name'] and get a KeyError because the actual column is '\ufeffName'.
def test_no_bom_prefix(csv_path):
    """UTF-8 BOM causes invisible characters in the first column name."""
    with open(csv_path, 'rb') as f:
        first_bytes = f.read(3)
    assert first_bytes != b'\xef\xbb\xbf', (
        "File contains UTF-8 BOM. This will prepend an invisible "
        "character to the first column name and break column lookups."
    )

def test_file_is_valid_utf8(csv_path):
    """Catch encoding mismatches before they become mojibake."""
    with open(csv_path, 'rb') as f:
        content = f.read()
    try:
        content.decode('utf-8')
    except UnicodeDecodeError as e:
        # Try to identify the actual encoding
        import chardet
        detected = chardet.detect(content)
        raise AssertionError(
            f"File is not valid UTF-8. Decode error at byte {e.start}. "
            f"Detected encoding: {detected['encoding']} "
            f"(confidence: {detected['confidence']:.0%}). "
            f"This is likely a Latin-1 or Windows-1252 file."
        )

def test_no_null_bytes(csv_path):
    """Null bytes (0x00) indicate binary data or a UTF-16 file
    being read as UTF-8."""
    with open(csv_path, 'rb') as f:
        content = f.read()
    null_positions = [i for i, b in enumerate(content) if b == 0]
    assert not null_positions, (
        f"Found {len(null_positions)} null bytes. "
        f"First at position {null_positions[0]}. "
        f"This might be a UTF-16 encoded file."
    )
If you’re a consumer, you should also test that your parser survives non-UTF-8 input gracefully — does it raise a clear error, or does it silently produce mojibake? Build test fixtures with Latin-1 encoded files containing accented characters (café, naïve, München) and verify your system either decodes correctly or fails explicitly.
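A minimal sketch of that fixture and the two acceptable outcomes. The helper name `make_latin1_fixture` is illustrative, not from any real library:

```python
def make_latin1_fixture() -> bytes:
    """Accented characters encoded as Latin-1, not UTF-8."""
    return "name,city\r\ncafé,München\r\n".encode('latin-1')

def test_latin1_bytes_are_not_silently_utf8():
    """Decoding Latin-1 bytes as UTF-8 must fail loudly."""
    content = make_latin1_fixture()
    try:
        content.decode('utf-8')
        raise AssertionError("Expected UnicodeDecodeError; silent "
                             "mojibake means the parser is lying")
    except UnicodeDecodeError:
        pass  # Good: the failure is explicit

def test_latin1_decodes_correctly_when_declared():
    """With the right declared encoding, the data survives intact."""
    content = make_latin1_fixture()
    assert 'café' in content.decode('latin-1')
```

Either outcome is acceptable for a consumer; what is never acceptable is a parser that decodes the wrong encoding without complaint.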
Null Value Tests
RFC 4180 has no concept of null. An empty field (,,) is just an empty string. But every system represents null differently, and when you’re receiving files from multiple vendors, the chaos compounds.
# These are all real null representations I've seen in production
NULL_VARIANTS = [
    '',        # Empty string (most common)
    'NULL',    # MySQL Workbench default
    'null',    # JavaScript/JSON convention
    'None',    # Python convention
    r'\N',     # PostgreSQL COPY convention
    'N/A',     # Business user convention
    'n/a',     # Lowercase variant
    'NA',      # R convention
    '#N/A',    # Excel formula error
    '-',       # Dashboard export convention
    'NaN',     # Pandas/NumPy convention
    '(null)',  # Some BI tools
]

def test_null_representation_is_consistent(csv_path, null_convention=''):
    """All null values should use the agreed convention.
    Mixed conventions cause type inference failures."""
    import csv
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        violations = []
        for row_num, row in enumerate(reader, start=2):
            for col_idx, value in enumerate(row):
                if value in NULL_VARIANTS and value != null_convention:
                    violations.append({
                        'row': row_num,
                        'column': header[col_idx],
                        'value': repr(value),
                        'expected': repr(null_convention)
                    })
    assert not violations, (
        f"Found {len(violations)} non-standard null representations. "
        f"First: row {violations[0]['row']}, "
        f"column '{violations[0]['column']}' contains "
        f"{violations[0]['value']} (expected {violations[0]['expected']})"
    )
This is a test you should agree on with your vendor as part of a data contract — which we’ll get to shortly. Without explicit agreement, you’ll spend more time debugging null handling than you will on actual analysis.
The Leading Zeros Problem
Excel silently strips leading zeros from any field it interprets as numeric. ZIP code 07102 becomes 7102. Product code 00456 becomes 456. Numbers over 15 digits lose precision permanently. This is not a CSV problem — it’s an Excel problem — but it affects every CSV that passes through Excel, which is most of them.
def test_leading_zeros_preserved(csv_path, zero_padded_columns):
    """Fields that should retain leading zeros still have them.
    Excel strips these silently on open-and-save.
    zero_padded_columns maps column name to expected width,
    e.g. {'zip_code': 5, 'product_code': 5}."""
    import csv
    violations = []
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        for row_num, row in enumerate(reader, start=2):
            for col, width in zero_padded_columns.items():
                value = row[col]
                # A numeric value shorter than the expected width
                # almost certainly lost its leading zeros
                if value and value.isdigit() and len(value) != width:
                    violations.append((row_num, col, value))
    assert not violations, (
        f"Possible stripped leading zeros. First: row {violations[0][0]}, "
        f"column '{violations[0][1]}' has value '{violations[0][2]}' "
        f"(expected {zero_padded_columns[violations[0][1]]} digits)."
    )

def test_no_scientific_notation(csv_path, numeric_columns):
    """Numbers should not be in scientific notation (1.23E+11).
    This happens when Excel converts long numbers."""
    import csv, re
    sci_notation = re.compile(r'^-?\d+\.?\d*[eE][+-]?\d+$')
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        violations = []
        for row_num, row in enumerate(reader, start=2):
            for col in numeric_columns:
                if sci_notation.match(row[col] or ''):
                    violations.append((row_num, col, row[col]))
    assert not violations, (
        f"Scientific notation found — data may have been "
        f"corrupted by Excel. First: row {violations[0][0]}, "
        f"column '{violations[0][1]}', value '{violations[0][2]}'"
    )
Date Format Tests
Is 03/04/2024 March 4th or April 3rd? Unless you know the source system’s locale, you literally cannot tell for dates where both day and month values are 12 or below. The only safe format is ISO 8601: YYYY-MM-DD.
def test_dates_are_iso8601(csv_path, date_columns):
    """Date fields should use ISO 8601 (YYYY-MM-DD) format.
    Ambiguous formats like MM/DD/YYYY vs DD/MM/YYYY
    are programmatically unresolvable."""
    import csv, re
    iso_pattern = re.compile(r'^\d{4}-\d{2}-\d{2}')
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        violations = []
        for row_num, row in enumerate(reader, start=2):
            for col in date_columns:
                val = row.get(col, '')
                if val and not iso_pattern.match(val):
                    violations.append((row_num, col, val))
    assert not violations, (
        f"Non-ISO date format found. First: row {violations[0][0]}, "
        f"column '{violations[0][1]}', value '{violations[0][2]}'. "
        f"Ambiguous date formats cannot be safely parsed."
    )
Invisible Character Tests
These are the ones that make you question your sanity. Non-breaking spaces (U+00A0), zero-width spaces (U+200B), zero-width joiners — they look identical to normal characters (or invisible entirely) but break numeric conversion, string matching, and joins.
INVISIBLE_CHARS = {
    '\u00a0': 'Non-breaking space',
    '\u200b': 'Zero-width space',
    '\u200c': 'Zero-width non-joiner',
    '\u200d': 'Zero-width joiner',
    '\ufeff': 'BOM / Zero-width no-break space',
    '\u2028': 'Line separator',
    '\u2029': 'Paragraph separator',
    '\u00ad': 'Soft hyphen',
}

def test_no_invisible_characters(csv_path):
    """Invisible Unicode characters break joins, type casting,
    and string comparisons while being impossible to spot visually."""
    with open(csv_path, encoding='utf-8') as f:
        content = f.read()
    found = []
    for char, name in INVISIBLE_CHARS.items():
        positions = [i for i, c in enumerate(content) if c == char]
        if positions:
            found.append(f"{name} (U+{ord(char):04X}): "
                         f"{len(positions)} occurrences")
    assert not found, (
        "Invisible characters detected:\n" +
        "\n".join(f"  - {f}" for f in found)
    )
Manifest Files: Testing Secure Delivery
All of those tests assume you’ve received the file intact. But how do you know? A file could be truncated during transfer, silently corrupted on disk, or swapped out entirely. This is where manifest files come in — they’re the contract between producer and consumer.
A manifest is metadata about the file that travels alongside it. At minimum, it should contain the filename, a cryptographic checksum, the row count, and the file size. NIST has formally deprecated MD5 and SHA-1, so use SHA-256.
Here’s what a solid manifest looks like:
{
  "manifest_version": "1.0",
  "generated_at": "2026-03-04T08:00:00Z",
  "source_system": "vendor-crm",
  "batch_id": "batch-20260304-001",
  "files": [
    {
      "filename": "customers_20260304.csv",
      "sha256": "a3f2b8c9d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7...",
      "row_count": 50432,
      "row_count_includes_header": true,
      "file_size_bytes": 8234567,
      "column_count": 12,
      "encoding": "UTF-8",
      "delimiter": ",",
      "quote_char": "\"",
      "has_header": true,
      "columns": [
        { "name": "id", "type": "integer" },
        { "name": "name", "type": "string" },
        { "name": "email", "type": "string" },
        { "name": "created_date", "type": "date", "format": "YYYY-MM-DD" }
      ]
    }
  ]
}
Now here’s the test suite that validates a delivery against its manifest:
import csv
import hashlib
import json
import os

def validate_delivery(manifest_path, data_dir):
    """Validate a CSV delivery against its manifest.
    Returns list of errors. Empty list = valid delivery."""
    errors = []
    # 1. Parse manifest
    with open(manifest_path) as f:
        manifest = json.load(f)
    for file_spec in manifest['files']:
        filepath = os.path.join(data_dir, file_spec['filename'])
        # 2. File existence
        if not os.path.exists(filepath):
            errors.append(f"MISSING: {file_spec['filename']} not found")
            continue
        # 3. File size
        actual_size = os.path.getsize(filepath)
        if actual_size == 0:
            errors.append(f"EMPTY: {file_spec['filename']} is zero bytes")
            continue
        if actual_size != file_spec['file_size_bytes']:
            errors.append(
                f"SIZE MISMATCH: {file_spec['filename']} "
                f"expected {file_spec['file_size_bytes']} bytes, "
                f"got {actual_size}"
            )
        # 4. Checksum
        sha256 = hashlib.sha256()
        with open(filepath, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                sha256.update(chunk)
        actual_hash = sha256.hexdigest()
        if actual_hash != file_spec['sha256']:
            errors.append(
                f"CHECKSUM MISMATCH: {file_spec['filename']} "
                f"has been modified or corrupted in transit"
            )
        # 5. Row count
        # Note: counts physical lines, which over-counts if quoted
        # fields contain embedded newlines; count records with
        # csv.reader for those files.
        with open(filepath, encoding=file_spec.get('encoding', 'utf-8')) as f:
            actual_rows = sum(1 for _ in f)
        expected = file_spec['row_count']
        if not file_spec.get('row_count_includes_header', True):
            expected += 1  # we counted the header line too
        if actual_rows != expected:
            errors.append(
                f"ROW COUNT MISMATCH: {file_spec['filename']} "
                f"expected {expected} rows, got {actual_rows}"
            )
        # 6. Column count and names
        with open(filepath, encoding=file_spec.get('encoding', 'utf-8')) as f:
            reader = csv.reader(f)
            header = next(reader)
        if len(header) != file_spec['column_count']:
            errors.append(
                f"COLUMN COUNT MISMATCH: {file_spec['filename']} "
                f"expected {file_spec['column_count']} columns, "
                f"got {len(header)}"
            )
        if 'columns' in file_spec:
            expected_names = [c['name'] for c in file_spec['columns']]
            if header != expected_names:
                errors.append(
                    f"COLUMN NAME MISMATCH: {file_spec['filename']}\n"
                    f"  Expected: {expected_names}\n"
                    f"  Got:      {header}"
                )
    return errors
The key insight about manifest validation is that it should be a gate, not a check. If the manifest fails, you don’t process the file. Period. No “well, the row count is close enough.” No “the checksum is probably fine.” Either the delivery matches the contract or it gets quarantined.
For orchestration tools like Airflow, the pattern looks like this:
# Airflow DAG pattern: manifest-gated processing
wait_for_manifest = FileSensor(
task_id="wait_for_manifest",
filepath="/landing/{{ ds_nodash }}/manifest.json",
poke_interval=120,
timeout=7200,
)
validate = BranchPythonOperator(
task_id="validate_delivery",
python_callable=validate_and_route,
# Routes to either 'load_data' or 'quarantine_and_alert'
)
wait_for_manifest >> validate >> [load_data, quarantine_and_alert]
As a producer, you should generate the manifest after writing the CSV file — computing the checksum on the exact bytes that were written. Ship the manifest alongside the data file. If you’re using SFTP or S3, upload the data file first, then the manifest. The consumer watches for the manifest as a “ready” signal — its presence means the data file is complete.
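The producer side can be sketched as a minimal manifest builder that hashes the exact bytes on disk. Field names mirror the example manifest earlier; the line-count shortcut assumes no embedded newlines in quoted fields, and the helper names are illustrative:

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def build_manifest(csv_path: str, source_system: str, batch_id: str) -> dict:
    """Build a manifest from the exact bytes written to disk."""
    with open(csv_path, 'rb') as f:
        data = f.read()
    # Naive row count: assumes CRLF endings and no embedded newlines
    row_count = data.count(b'\r\n')
    return {
        "manifest_version": "1.0",
        "generated_at": datetime.now(timezone.utc)
                                .strftime('%Y-%m-%dT%H:%M:%SZ'),
        "source_system": source_system,
        "batch_id": batch_id,
        "files": [{
            "filename": os.path.basename(csv_path),
            "sha256": hashlib.sha256(data).hexdigest(),
            "row_count": row_count,
            "row_count_includes_header": True,
            "file_size_bytes": len(data),
        }],
    }

def write_manifest(manifest: dict, path: str) -> None:
    # Written last: its presence is the consumer's "ready" signal
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)
```

Note the ordering baked into `write_manifest`: the data file is already complete before the manifest appears, so the consumer never races a half-written file.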
As a consumer, you should test your manifest validation against deliberately corrupted scenarios:
def test_detects_truncated_file():
    """Simulate a file that was cut short during transfer."""
    ...

def test_detects_missing_file():
    """Manifest references a file that doesn't exist."""
    ...

def test_detects_extra_rows():
    """File has more rows than manifest claims —
    could indicate duplicate data or appended garbage."""
    ...

def test_detects_wrong_checksum():
    """File was modified after manifest was generated."""
    ...

def test_rejects_zero_byte_file():
    """Empty files should never pass validation."""
    ...
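As a sketch, here is the first scenario made concrete. It inlines the size and checksum comparisons so it runs standalone; in a real suite you would route the truncated delivery through your manifest validator instead:

```python
import hashlib
import os
import tempfile

def test_detects_truncated_file_concrete():
    """A file cut short mid-transfer must fail both the size
    and the checksum checks from its manifest entry."""
    data = b"id,name\r\n1,Alice\r\n2,Bob\r\n"
    spec = {
        "filename": "customers.csv",
        "sha256": hashlib.sha256(data).hexdigest(),
        "file_size_bytes": len(data),
    }
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, spec["filename"])
        # Simulate a transfer interrupted halfway through
        with open(path, 'wb') as f:
            f.write(data[:len(data) // 2])
        with open(path, 'rb') as f:
            actual = f.read()
        # Both checks must fire on the truncated file
        assert len(actual) != spec["file_size_bytes"]
        assert hashlib.sha256(actual).hexdigest() != spec["sha256"]
```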
The CSV Test Matrix: Thinking Like a Tester
Most people know about edge cases — embedded commas, encoding problems, null ambiguity — but they don’t know how to systematically translate that knowledge into a test suite. They end up with ad hoc checks scattered across their codebase, catching some things and missing others.
The trick is to think in categories, not individual cases. Every CSV problem falls into one of five categories, and each category has a predictable set of test cases.
Category 1: Structural Integrity
Does the file conform to the expected shape?
| Test | What It Catches | Producer/Consumer |
|---|---|---|
| Consistent column count across all rows | Unescaped delimiters in data | Both |
| Header row present and matches schema (when expected) | Schema drift, renamed columns | Consumer |
| Headerless file matches manifest column count | Misaligned data when headers stripped for security | Consumer |
| No duplicate column names | Silent column overwrites | Consumer |
| No trailing delimiter creating phantom column | Off-by-one column counts | Both |
| File is not empty / not just a header | Failed exports, empty result sets | Consumer |
| Row count within expected range | Truncation, duplication, data loss | Consumer |
Category 2: Encoding and Character Safety
Are the bytes what we think they are?
| Test | What It Catches | Producer/Consumer |
|---|---|---|
| File is valid UTF-8 | Latin-1/Windows-1252 masquerading as UTF-8 | Both |
| No BOM prefix | Invisible first-column corruption | Producer |
| No null bytes | Binary data or wrong encoding family | Consumer |
| No invisible Unicode characters | Broken joins, failed type casts | Both |
| No smart quotes or em-dashes | Copy-paste from Word/email | Both |
Category 3: Quoting and Escaping
Are special characters handled correctly?
| Test | What It Catches | Producer/Consumer |
|---|---|---|
| Fields with commas are quoted | Row splitting, column misalignment | Producer |
| Fields with double-quotes use "" escaping | Unclosed quotes consuming multiple rows | Producer |
| Fields with newlines are quoted | Row count mismatches, multi-line chaos | Producer |
| Consumer survives unclosed quotes gracefully | Vendor bugs eating your pipeline | Consumer |
| No backslash escaping (non-standard) | Interop failures with strict parsers | Producer |
Category 4: Data Type Safety
Will downstream systems interpret values correctly?
| Test | What It Catches | Producer/Consumer |
|---|---|---|
| Dates in ISO 8601 format | Ambiguous MM/DD vs DD/MM | Both |
| Leading zeros preserved | Excel corruption of ZIPs, product codes | Both |
| No scientific notation in numeric fields | Excel corruption of long numbers | Both |
| Consistent null representation | Mixed NULL/None/N/A/empty conventions | Both |
| Boolean fields use consistent values | TRUE/True/true/1/yes/Y chaos | Both |
| Numeric fields contain only valid numbers | Currency symbols, percentage signs, commas in numbers | Consumer |
Category 5: Delivery Integrity
Did the file arrive intact and complete?
| Test | What It Catches | Producer/Consumer |
|---|---|---|
| SHA-256 checksum matches manifest | Corruption in transit | Consumer |
| File size matches manifest | Truncation, wrong file | Consumer |
| Row count matches manifest | Silent data loss (the COVID problem) | Consumer |
| Column count matches manifest | Schema changes without notice | Consumer |
| File is not a duplicate of previous delivery | Accidental re-sends | Consumer |
| Manifest is present and valid JSON | Incomplete delivery | Consumer |
This matrix is your test plan. Print it out. Stick it on the wall next to your desk. When you’re building a new CSV integration, walk through each category and ask: “Do I have a test for this?”
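One way to make the matrix executable is to pair each broken sample with the predicate it must fail. This sketch uses inline byte samples and deliberately simple checks; a real suite would point each entry at a fixture file:

```python
# Map matrix rows to cheap byte-level predicates. A healthy file
# passes every check; each broken sample must fail exactly one.
CHECKS = {
    "no_bom_prefix":   lambda b: not b.startswith(b'\xef\xbb\xbf'),
    "quotes_balanced": lambda b: b.count(b'"') % 2 == 0,
    "crlf_only":       lambda b: b.count(b'\n') == b.count(b'\r\n'),
}

BROKEN_SAMPLES = [
    (b'\xef\xbb\xbfid,name\r\n1,Alice\r\n', "no_bom_prefix"),
    (b'id,name\r\n1,"Alice\r\n',            "quotes_balanced"),
    (b'id,name\n1,Alice\n',                 "crlf_only"),
]

def test_each_broken_sample_is_detected():
    for sample, check_name in BROKEN_SAMPLES:
        assert not CHECKS[check_name](sample), (
            f"Check '{check_name}' failed to flag a broken sample")
```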
Building Test Fixtures: The Art of Deliberate Breakage
A test suite is only as good as its fixtures. And for CSV testing, your fixtures need to be deliberately, creatively broken. Here’s how I build mine.
Start with a golden file — a perfectly compliant CSV that passes every test. This is your baseline. Then create variations that each break exactly one thing: a BOM prefix, a ragged row, Unix line endings, a stray Latin-1 byte, an unclosed quote.
Each fixture is a test case. Your producer tests verify you never generate files that look like the broken fixtures. Your consumer tests verify you detect and handle each type of breakage correctly.
The most valuable fixture? unclosed_quote.csv. A single unclosed double-quote can cause a parser to consume every subsequent row as part of one giant field — potentially the entire rest of the file. I’ve seen this eat 2 million rows. Test for it explicitly.
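Here is a sketch of generating such a fixture set from a golden baseline, each variation breaking exactly one rule. The filenames are illustrative:

```python
import os

GOLDEN = b'id,name,city\r\n1,Alice,"Oslo, Norway"\r\n2,Bob,Lund\r\n'

def write_fixtures(outdir: str) -> dict:
    """Write the golden file plus one-fault-each broken variations."""
    fixtures = {
        "golden.csv": GOLDEN,
        "bom_prefix.csv": b'\xef\xbb\xbf' + GOLDEN,
        "ragged_row.csv": GOLDEN + b'3,Carol,Lund,extra_field\r\n',
        "unix_endings.csv": GOLDEN.replace(b'\r\n', b'\n'),
        # 0xC5 is Latin-1 for 'Å': invalid as a lone UTF-8 byte
        "latin1_bytes.csv": GOLDEN.replace(b'Alice', b'\xc5lice'),
        # The parser-killer: one quote never closes
        "unclosed_quote.csv": GOLDEN.replace(b'"Oslo, Norway"',
                                             b'"Oslo, Norway'),
    }
    os.makedirs(outdir, exist_ok=True)
    for name, content in fixtures.items():
        with open(os.path.join(outdir, name), 'wb') as f:
            f.write(content)
    return fixtures
```

Because each fixture differs from the golden file by one fault, a failing test immediately tells you which rule your producer or consumer mishandles.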
What to Put in Your Data Contract
Tests tell you whether a specific file is good or bad. A data contract tells you what “good” means. Without one, you’re testing against assumptions — and your assumptions will eventually differ from your vendor’s.
Here’s what a data contract for CSV exchange should specify. Think of it as the RFC that RFC 4180 forgot to write:
File format:
- Delimiter (comma, pipe, tab — be explicit)
- Quote character (double-quote)
- Escape method (doubled quotes, per RFC 4180)
- Line ending (CRLF)
- Encoding (UTF-8, no BOM)
- Header row policy (present, omitted for security, or conditional — and if omitted, where the schema lives)
- Column names and order (in manifest if headers are stripped)
Data type conventions:
- Date format (ISO 8601: YYYY-MM-DD)
- Timestamp format (ISO 8601: YYYY-MM-DDTHH:MM:SSZ)
- Null representation (empty field — no “NULL”, no “N/A”)
- Boolean representation (true/false)
- Numeric precision (no scientific notation, no thousands separators)
Delivery protocol:
- Transfer method (SFTP, S3, etc.)
- Manifest file required (JSON, SHA-256 checksum)
- Signal file or completion marker
- Delivery schedule and SLA
- Who to contact when something breaks
Change management:
- Minimum notice period for schema changes
- Versioning strategy (new columns appended, never reordered)
- Breaking vs non-breaking change definitions
You don’t need a 50-page legal document. A one-page markdown file committed to a shared repository works. The point isn’t formality — it’s explicitness. When the vendor’s export changes and your pipeline breaks at 3am, you want something to point to that says “this is what we agreed.”
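The contract can even be machine-readable, so the test suite imports it rather than restating it. A sketch with illustrative values and two cheap checks driven straight off the contract:

```python
# A one-page data contract as a dict the tests can import.
# Every value here is an example, not a universal default.
DATA_CONTRACT = {
    "delimiter": ",",
    "quote_char": '"',
    "line_ending": "CRLF",
    "encoding": "UTF-8",
    "bom_allowed": False,
    "has_header": True,
    "date_format": "YYYY-MM-DD",
    "null_representation": "",
    "boolean_values": ("true", "false"),
    "schema_change_notice_days": 30,
}

def contract_violations(sample: bytes) -> list:
    """Structural checks derived directly from the contract."""
    errors = []
    if not DATA_CONTRACT["bom_allowed"] and sample.startswith(b'\xef\xbb\xbf'):
        errors.append("BOM present, but the contract forbids it")
    if (DATA_CONTRACT["line_ending"] == "CRLF"
            and sample.count(b'\n') != sample.count(b'\r\n')):
        errors.append("Bare LF found, but the contract requires CRLF")
    return errors
```

Keeping the contract in code means a vendor renegotiation is a pull request, not a wiki edit nobody reads.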
Putting It All Together: A Reusable Validation Function
Here’s a practical, copy-paste-ready validation function that covers the most critical checks. This isn’t a framework — it’s a function you can drop into any Python project and start using immediately.
```python
import csv
import hashlib
import os
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CSVValidationResult:
    """Results from validating a CSV file."""
    filepath: str
    is_valid: bool = True
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    stats: dict = field(default_factory=dict)

    def add_error(self, message: str):
        self.errors.append(message)
        self.is_valid = False

    def add_warning(self, message: str):
        self.warnings.append(message)


def validate_csv(
    filepath: str,
    expected_columns: Optional[list] = None,
    expected_encoding: str = 'utf-8',
    date_columns: Optional[list] = None,
    zero_padded_columns: Optional[list] = None,
    null_convention: str = '',
    max_row_count: Optional[int] = None,
    min_row_count: Optional[int] = None,
) -> CSVValidationResult:
    """Validate a CSV file against common quality checks.

    Works for both producers (checking your own output)
    and consumers (checking vendor deliveries).
    """
    result = CSVValidationResult(filepath=filepath)

    # --- Check 1: File exists and isn't empty ---
    if not os.path.exists(filepath):
        result.add_error(f"File not found: {filepath}")
        return result

    file_size = os.path.getsize(filepath)
    if file_size == 0:
        result.add_error("File is empty (zero bytes)")
        return result
    result.stats['file_size_bytes'] = file_size

    # --- Check 2: Encoding ---
    with open(filepath, 'rb') as f:
        raw_head = f.read(10000)

    # BOM check
    if raw_head[:3] == b'\xef\xbb\xbf':
        result.add_warning("File contains UTF-8 BOM — this will corrupt "
                           "the first column name in most parsers")

    # Null byte check
    if b'\x00' in raw_head:
        result.add_error("File contains null bytes — likely wrong encoding "
                         "(e.g., UTF-16 read as UTF-8) or binary data")
        return result

    # Encoding validity
    try:
        with open(filepath, 'rb') as f:
            raw_full = f.read()
        raw_full.decode(expected_encoding)
    except UnicodeDecodeError as e:
        result.add_error(f"File is not valid {expected_encoding}. "
                         f"Decode error at byte position {e.start}.")
        return result

    # --- Check 3: Structural integrity ---
    encoding = 'utf-8-sig' if raw_head[:3] == b'\xef\xbb\xbf' else expected_encoding
    with open(filepath, newline='', encoding=encoding) as f:
        reader = csv.reader(f)

        # Header check
        try:
            header = next(reader)
        except StopIteration:
            result.add_error("File has no rows (not even a header)")
            return result

        result.stats['column_count'] = len(header)
        result.stats['columns'] = header

        # Duplicate column check
        seen = {}
        for col in header:
            seen[col] = seen.get(col, 0) + 1
        dupes = {k: v for k, v in seen.items() if v > 1}
        if dupes:
            result.add_error(f"Duplicate column names: {dupes}")

        # Expected columns check
        if expected_columns and header != expected_columns:
            result.add_error(f"Column mismatch.\n"
                             f"  Expected: {expected_columns}\n"
                             f"  Got:      {header}")

        # Row-level checks
        row_count = 0
        ragged_rows = []
        iso_date = re.compile(r'^\d{4}-\d{2}-\d{2}')

        for row_num, row in enumerate(reader, start=2):
            row_count += 1

            # Column count consistency
            if len(row) != len(header):
                ragged_rows.append((row_num, len(row)))

            # Date format check (guard against ragged rows shorter than the header)
            if date_columns:
                for col in date_columns:
                    idx = header.index(col) if col in header else None
                    if idx is not None and idx < len(row) and row[idx]:
                        if not iso_date.match(row[idx]):
                            result.add_warning(f"Row {row_num}, column '{col}': "
                                               f"Non-ISO date '{row[idx]}'")

        result.stats['row_count'] = row_count

        if ragged_rows:
            result.add_error(f"{len(ragged_rows)} rows have wrong column count. "
                             f"Expected {len(header)} columns. "
                             f"First bad row: line {ragged_rows[0][0]} "
                             f"has {ragged_rows[0][1]} fields.")

        # Row count bounds
        if min_row_count and row_count < min_row_count:
            result.add_error(f"Row count {row_count} is below minimum "
                             f"threshold of {min_row_count}")
        if max_row_count and row_count > max_row_count:
            result.add_error(f"Row count {row_count} exceeds maximum "
                             f"threshold of {max_row_count}")

    # --- Check 4: Invisible characters ---
    invisible = {
        '\u00a0': 'Non-breaking space',
        '\u200b': 'Zero-width space',
        '\ufeff': 'BOM (mid-file)',
    }
    text = raw_full.decode(encoding, errors='replace')
    for char, name in invisible.items():
        count = text.count(char)
        if count > 0:
            result.add_warning(f"Found {count} '{name}' characters (U+{ord(char):04X})")

    # --- Compute checksum for manifest comparison ---
    sha256 = hashlib.sha256(raw_full).hexdigest()
    result.stats['sha256'] = sha256

    return result
```

Use it like this:
```python
result = validate_csv(
    'vendor_delivery/customers_20260304.csv',
    expected_columns=['id', 'name', 'email', 'city', 'state', 'zip'],
    date_columns=['created_date', 'updated_date'],
    min_row_count=1000,
    max_row_count=100000,
)

if not result.is_valid:
    for error in result.errors:
        print(f"ERROR: {error}")
    # Route to quarantine
else:
    print(f"Valid. {result.stats['row_count']} rows, "
          f"SHA-256: {result.stats['sha256'][:16]}...")
    # Proceed to loading
```

What 16,000 Missing Records Taught Us
Let’s come back to where we started. Sixteen thousand COVID test results, dropped silently because nobody tested what would happen when the file exceeded a row limit. The labs did their job. The CSV was generated correctly. The file was transmitted successfully. Every individual component worked. The system failed because nobody tested the boundaries — the places where one component’s assumptions meet another component’s limitations.
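That particular boundary takes only a few lines to test. Here’s a minimal sketch — the `safe_for_xls` helper is hypothetical, not part of the validator above — that checks whether a CSV would survive a round trip through the legacy .xls format, whose hard limit is 65,536 rows:

```python
import csv
import io

XLS_MAX_ROWS = 65_536  # hard row limit of the legacy .xls format

def safe_for_xls(csv_text: str) -> bool:
    """Return False if opening this CSV in legacy Excel would drop rows."""
    total_rows = sum(1 for _ in csv.reader(io.StringIO(csv_text)))
    return total_rows <= XLS_MAX_ROWS

# Boundary test: one row over the limit must be caught, not silently truncated
header = "id,result\n"
at_limit = header + "".join(f"{i},positive\n" for i in range(XLS_MAX_ROWS - 1))
over_limit = at_limit + "99999,positive\n"

print(safe_for_xls(at_limit))    # True — exactly at the limit
print(safe_for_xls(over_limit))  # False — Excel would drop the extra row
```

The point isn’t this specific limit — it’s that the test exercises the exact boundary value and one past it, which is where silent truncation hides.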
That’s what CSV testing is really about. It’s not about whether you can parse a comma-separated file — of course you can. It’s about whether you’ve thought through every place where the producer’s assumptions and the consumer’s assumptions might diverge. The vendor who exports NULL as the literal string “NULL.” The CRM that allows newlines in address fields. The business user who opens your carefully formatted CSV in Excel, saves it, and sends it back with stripped zeros and mangled dates.
Every one of those is a test case. Every one of those is a boundary you can verify during development instead of discovering in production.
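Written out as tests — `parse_rows` below is a hypothetical stand-in for whatever ingestion function you actually use — those three boundaries might look like this:

```python
import csv
import io

def parse_rows(csv_text: str) -> list:
    """Parse CSV text into dicts — a stand-in for any ingestion function."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def test_literal_null_string_is_not_a_missing_value():
    rows = parse_rows('id,name\n1,NULL\n')
    # The parser hands back the literal string "NULL" — ingestion code must
    # decide explicitly whether that means missing or the word itself.
    assert rows[0]['name'] == 'NULL'

def test_newline_inside_quoted_field_stays_in_one_row():
    rows = parse_rows('id,address\n1,"12 Main St\nFlat 4"\n')
    assert len(rows) == 1
    assert '\n' in rows[0]['address']

def test_leading_zeros_survive_parsing():
    rows = parse_rows('id,zip\n1,02134\n')
    # csv keeps every field as a string; the zero is only lost if something
    # downstream coerces to int (as Excel does on open-and-save).
    assert rows[0]['zip'] == '02134'
```

Each test pins down one producer/consumer divergence, so when a vendor changes a convention, the suite fails loudly instead of the pipeline failing quietly.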
The engineers who sleep well at night aren’t the ones who assume their CSV files are fine. They’re the ones who wrote the tests to prove it.
