Introduction


Imagine navigating a sprawling network of interconnected threads, each strand holding a vital clue. That’s the world of data for us, and profiling is our key to unlocking its secrets. It’s like deciphering a cryptic message, each character a piece of information waiting to be understood.

But why is this so important? Ever encountered an error in your analysis, or a misleading conclusion based on faulty data? Data profiling helps us avoid these pitfalls by ensuring the data we work with is accurate, consistent, and ready to yield valuable insights. It’s like building a sturdy foundation before constructing a skyscraper.

So, how do we tackle this data beast? Let’s break it down into key steps:

Cracking the Code: Understanding Keys and Relationships


Think of keys as unique identifiers for each data point. Identifying these “candidate keys” is crucial for ensuring data integrity and linking different pieces of information. Imagine analyzing customer orders – you might find “order ID” to be a strong candidate key, while “customer name” alone could be less reliable due to potential duplicates.

Here’s where SQL comes in handy. We can use queries like the one below to explore the data:

-- your_table is the table being profiled; your_column is the column whose cardinality we want
SELECT 
	ORDINAL_POSITION
	, COLUMN_NAME
	, DATA_TYPE
	, (
		SELECT 
			COUNT(DISTINCT your_column) 
		FROM 
			your_table
	) AS "Cardinality (Count)"
	, (
		SELECT 
			COUNT(DISTINCT your_column) * 1.0000 / 
			CASE 
				WHEN COUNT(1) > 0 THEN COUNT(1) * 1.0000 
				ELSE 1.0000 
			END * 100 
		FROM 
			your_table
	) AS "Cardinality (Percent)"
FROM 
	INFORMATION_SCHEMA.COLUMNS
WHERE 
	TABLE_NAME = 'your_table';
  • This query delves into the metadata of a specific table to uncover potential keys.
  • It calculates the cardinality of a chosen column (the number of distinct values), both as a raw count and as a percentage of total rows.
  • This helps us pinpoint columns that might serve as reliable identifiers for each data point, ensuring consistency and integrity.
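
For a specific column we suspect is a candidate key, such as the order ID mentioned above, a direct test is to compare the total row count with the number of distinct values. A minimal sketch, using a hypothetical orders table and order_id column:

SELECT 
	COUNT(*) AS total_rows
	, COUNT(DISTINCT order_id) AS distinct_order_ids
FROM 
	orders;

If the two numbers match, every row carries its own non-null order_id and the column is a strong candidate key; any gap points to duplicates or nulls that rule it out.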

Decoding the Data Types: From Numbers to Words and Beyond


Just like different languages have their own alphabets, data has its own types. Understanding these types, like numbers, text, dates, and more, is essential for handling and manipulating the data effectively. Imagine trying to add apples and oranges – it wouldn’t make much sense, right?

Think of a “varchar(255)” field meant to hold colour names. If the only values we see at first are “Red” and “Orange”, a downstream system might assume varchar(10) is plenty and model the field that way. But what happens when “Razzle Dazzle Rose” shows up? It gets truncated, leading to errors and inconsistencies.
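
As a quick check, we can compare the longest value actually stored against the length declared in the schema. A minimal sketch, assuming a hypothetical colour column in your_table:

SELECT 
	MAX(LENGTH(colour)) AS longest_value
	, COUNT(*) AS total_rows
FROM 
	your_table;

Setting this alongside CHARACTER_MAXIMUM_LENGTH from the metadata queries below shows how much headroom the declared type really leaves.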

Profiling helps us anticipate and avoid such issues by analyzing data types and ensuring they match the actual content.

SELECT 
	COLUMN_NAME
	, DATA_TYPE
FROM 
	INFORMATION_SCHEMA.COLUMNS
WHERE 
	TABLE_NAME = 'your_table';

SELECT 
	COLUMN_NAME
	, CHARACTER_MAXIMUM_LENGTH
FROM 
	INFORMATION_SCHEMA.COLUMNS
WHERE 
	TABLE_NAME = 'your_table';

SELECT 
	null_or_not
	, COUNT(*) AS total_rows
FROM
	(
	SELECT 
		CASE WHEN column_name IS NULL THEN 'Null'
			 ELSE 'Non-Null' 
		END AS null_or_not
	FROM 
		table_name
	) count_null
GROUP BY 
	null_or_not;

SELECT 
	column_name
	, COUNT(*) AS frequency 
FROM 
	table_name
GROUP BY 
	column_name 
ORDER BY 
	frequency DESC;

SELECT * 
FROM 
	table_name 
WHERE 
	column_name !~ '^[0-9]+$';

SELECT * 
FROM 
	table_name 
WHERE 
	column_name !~ '^[A-Za-z]+$';
  • These queries extract crucial information about data types:
  • The first query reveals the general data type (e.g., varchar, integer, date) of each column.
  • The second query delves deeper, providing the maximum character length for text-based columns, essential for understanding potential truncation issues.
  • The third query counts the null versus non-null values in a column, which is easily turned into a percentage. The same approach works for empty strings and for strings with leading or trailing spaces (see the sketch after this list).
  • The fourth query looks at the frequency distribution of column values.
  • The fifth query returns rows whose values are not purely numeric, telling us whether the column really only contains numbers.
  • The sixth query returns rows whose values are not purely alphabetic characters.
  • By proactively identifying and addressing these nuances, we prevent errors and ensure accurate analysis.
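
As noted for the third query, the null check extends naturally to other suspect values. A minimal sketch, again treating table_name and column_name as placeholders:

SELECT 
	CASE 
		WHEN column_name IS NULL THEN 'Null'
		WHEN TRIM(column_name) = '' THEN 'Empty or whitespace only'
		WHEN column_name LIKE ' %' THEN 'Leading space'
		WHEN column_name LIKE '% ' THEN 'Trailing space'
		ELSE 'Populated' 
	END AS value_quality
	, COUNT(*) AS total_rows
FROM 
	table_name
GROUP BY 
	1 -- group by the first select expression (value_quality)
ORDER BY 
	total_rows DESC;

The same kind of profiling can also be done in Python with pandas: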
import pandas as pd
import psycopg2  # or another database connector (pyodbc, sqlalchemy, etc.)
import seaborn as sns
import matplotlib.pyplot as plt

# Connect to the database
conn = psycopg2.connect(database="database_name", user="username", password="password")

# Load data from the table
df = pd.read_sql("SELECT * FROM table_name", conn)

# Perform profiling: column overview and summary statistics
df.info()
print(df.describe())

# Data type analysis
print(df.dtypes)

# Null and empty value analysis
null_counts = df.isnull().sum()
null_percentages = null_counts / df.shape[0] * 100
print(null_counts)
print(null_percentages)

# Distinct value analysis
distinct_counts = df.nunique()
print(distinct_counts)

# Frequency counts for a single column, e.g.
# print(df['column_name'].value_counts())

# Correlations between numeric columns, visualised as a heatmap
correlations = df.select_dtypes('number').corr()
sns.heatmap(correlations, annot=True)
plt.show()

# Box plots to understand the distribution of numeric columns
df.boxplot()
# df[['column1', 'column2']].boxplot()
plt.show()

# Close the connection
conn.close()
  • In this example we connect to a database. The connector here is psycopg2, but you could use pyodbc, sqlalchemy or spark.
  • info() returns the data types along with non-null counts.
  • describe() returns summary statistics (min, max, quartiles, etc.).
  • nunique() returns the number of unique values per column.
  • value_counts() provides frequency counts for a single column.
  • dtypes returns the data types (as pandas has inferred each field to be).
  • isnull() / isna() identify null or missing values.
  • df.corr() computes pairwise correlations, which we can visualise with a seaborn heatmap.
  • boxplot() plots numeric columns so we can understand the distribution of the data.

This is a rather simplistic example; more complex checks might cover numeric precision, date and time formats, timezones of the data, and more.
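
For instance, when dates arrive as text, the same pattern-matching trick used for the numeric check above will flag values that break the expected layout. A minimal sketch, assuming a hypothetical text column order_date and an expected YYYY-MM-DD format:

SELECT * 
FROM 
	table_name 
WHERE 
	order_date !~ '^\d{4}-\d{2}-\d{2}$';

Any rows returned deserve a closer look before date parsing or timezone conversion is attempted.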

The foundation of this analysis is to understand what the data looks like as a baseline, setting yourself up for validation, testing, and quality controls in the future.

Classifying the Information: Sensitive Secrets and Everyday Facts


Not all data is created equal. Some, like customer addresses or financial details, require careful handling due to privacy concerns. Data profiling helps us classify information based on its sensitivity and confidentiality. Think of it like sorting valuables into different vaults for safekeeping.

We can categorize data based on its “Business Sensitivity” or “Risk Exposure.” This helps us determine who needs access, and whether certain information should be masked, encrypted, or even excluded entirely from our analysis. It’s like drawing a map of the data landscape, highlighting the areas that require extra caution.

SQL

SELECT 
	column_name
	, COUNT(*)
FROM 
	your_table
WHERE 
	column_name ~ '^\d{3}-\d{2}-\d{4}$'  
	-- US Social Security Number pattern (###-##-####)
	OR 
	column_name ~ '[A-Za-z]\d[A-Za-z]\s\d[A-Za-z]\d' 
	-- Canadian postcode pattern (A1A 1A1)
	OR 
	column_name ~ '\d{10}'                       
	-- 10-digit phone numbers
GROUP BY 
	column_name;
  • This query uses regular expressions to match specific patterns within column values, such as Social Security Number formats, postcode formats, or 10-digit phone numbers. Regular expressions offer powerful pattern-matching capabilities, but they require careful crafting and testing.
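
Once a column is flagged as sensitive, the same patterns can drive simple masking for anyone who does not need the raw values. A minimal sketch, assuming a text column column_name that may hold 10-digit phone numbers:

SELECT 
	CASE 
		WHEN column_name ~ '^\d{10}$' 
			THEN REPEAT('*', 6) || RIGHT(column_name, 4) 
		ELSE column_name 
	END AS masked_value
FROM 
	your_table;

Whether masking like this is sufficient, or whether the data must be encrypted or excluded entirely, is exactly the decision the classification step is meant to inform.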

Challenges and Triumphs: The Art of Data Profiling


Data profiling isn’t always smooth sailing. We might encounter inconsistencies, misinterpret patterns, or face technical hurdles. But by learning from these challenges and adopting best practices, like using clear documentation and sharing findings, we can continuously improve our profiling skills.

Remember, data profiling is an investment in the future. The time and effort spent today laying a strong foundation will pay off in the form of accurate, reliable, and insightful data analyses tomorrow. So, grab your detective hat, sharpen your SQL skills, and get ready to untangle the web of data – the secrets it holds are waiting to be discovered!