Data Modeling Showdown: Kimball vs One Big Table vs Relational
Introduction
When architecting a data warehouse, one of the most crucial decisions is choosing the right data modeling approach. Like selecting the right tool for a job, each modeling methodology has its strengths and ideal use cases. Today, we’ll explore three popular approaches: Kimball’s dimensional modeling (star schema), the one big table approach, and traditional relational modeling.
The Dataset: Understanding Our Example
To illustrate these approaches, let’s consider a retail sales system with these core components:
- Orders (transactions with amount, date, etc.)
- Customers (who made the purchase)
- Products (what was purchased)
- Stores (where the purchase occurred)
This real-world scenario will help us demonstrate how each modeling approach handles relationships, redundancy, and querying patterns.
Three Ways to Model: A Deep Dive
1. The One Big Table Approach
The simplest approach is to denormalize everything into a single table. While this might seem inefficient, it can be surprisingly effective for certain use cases, especially when combined with JSON columns for analytical data.
Benefits of JSON columns in the one big table approach:
- Flexible schema for metrics that change over time
- Efficient storage of time-series data within a single row
- Ability to store both daily snapshots and cumulative metrics
- Easy querying of complex nested structures
Here’s how our retail data looks in a one big table model:
-- Creating the one big table with JSON columns
CREATE TABLE sales_flat (
order_id INTEGER PRIMARY KEY,
order_date DATE,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_segment VARCHAR(50),
product_name VARCHAR(100),
product_price DECIMAL(10,2),
product_category VARCHAR(50),
store_name VARCHAR(100),
store_region VARCHAR(50),
store_manager VARCHAR(100),
order_amount DECIMAL(10,2),
quantity_sold INTEGER,
daily_metrics JSONB, -- Stores daily snapshots
cumulative_metrics JSONB -- Stores running totals
);
-- Sample data insertion with JSON columns
INSERT INTO sales_flat VALUES
(1, '2024-01-14', 'John Doe', 'john@email.com', 'Retail', 'Gaming Laptop', 1299.99,
'Electronics', 'Downtown Store', 'East', 'Jane Smith', 1299.99, 1,
'{
"daily_sales": 1299.99,
"daily_units": 1,
"avg_basket_size": 1299.99,
"unique_customers": 1
}',
'{
"total_sales": 1299.99,
"total_units": 1,
"customer_lifetime_value": 1299.99
}'
);
Advantages:
- No joins needed for queries
- Straightforward for business users to understand
- Excellent for BI tools that prefer flat structures
- Fast query performance for many analytical queries
Disadvantages:
- Data redundancy (same customer/product info repeated)
- Higher storage requirements
- Update anomalies (changing a store manager requires updating many rows; see the example below)
- Potential data inconsistencies
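To make that update anomaly concrete, here is a hypothetical correction sketched against the sales_flat table above. Because the manager is repeated on every sales row, a single personnel change rewrites every row for that store, whereas the normalized model later in this post would touch exactly one row:
-- Update anomaly: one manager change rewrites every matching sales row
UPDATE sales_flat
SET store_manager = 'New Manager'
WHERE store_name = 'Downtown Store';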
2. The Kimball (Star Schema) Approach
The Kimball approach, also known as dimensional modeling, organizes data into fact and dimension tables. This creates a structure resembling a star when visualized:
-- Create dimension tables
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_segment VARCHAR(50)
);
CREATE TABLE dim_product (
product_key SERIAL PRIMARY KEY,
product_name VARCHAR(100),
product_price DECIMAL(10,2),
product_category VARCHAR(50)
);
CREATE TABLE dim_store (
store_key SERIAL PRIMARY KEY,
store_name VARCHAR(100),
store_region VARCHAR(50),
store_manager VARCHAR(100)
);
-- Create fact table
CREATE TABLE fact_sales (
sale_key SERIAL PRIMARY KEY,
customer_key INTEGER REFERENCES dim_customer(customer_key),
product_key INTEGER REFERENCES dim_product(product_key),
store_key INTEGER REFERENCES dim_store(store_key),
sale_date DATE,
amount DECIMAL(10,2),
quantity INTEGER
);
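To show how the pieces fit together, here is a rough sketch of loading the same sample sale into the star schema: dimension rows go in first, and the fact row then references their surrogate keys (assuming the SERIAL keys start at 1):
-- Load dimensions first, then reference their surrogate keys from the fact table
INSERT INTO dim_customer (customer_name, customer_email, customer_segment)
VALUES ('John Doe', 'john@email.com', 'Retail');
INSERT INTO dim_product (product_name, product_price, product_category)
VALUES ('Gaming Laptop', 1299.99, 'Electronics');
INSERT INTO dim_store (store_name, store_region, store_manager)
VALUES ('Downtown Store', 'East', 'Jane Smith');
INSERT INTO fact_sales (customer_key, product_key, store_key, sale_date, amount, quantity)
VALUES (1, 1, 1, '2024-01-14', 1299.99, 1);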
Advantages:
- Optimized for analytical queries
- Reduced data redundancy compared to one big table
- Clear separation between metrics and attributes
- Industry standard for data warehousing
Disadvantages:
- Requires joins for most queries
- More complex to maintain than one big table
- Initial setup requires careful design
3. The Relational (Normalized) Approach
The traditional relational approach focuses on normalizing data to eliminate redundancy. This results in more tables but ensures data consistency:
-- Create normalized tables
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE regions (
region_id SERIAL PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
segment VARCHAR(50)
);
CREATE TABLE stores (
store_id SERIAL PRIMARY KEY,
region_id INTEGER REFERENCES regions(region_id),
name VARCHAR(100),
manager VARCHAR(100)
);
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
category_id INTEGER REFERENCES categories(category_id),
name VARCHAR(100),
price DECIMAL(10,2)
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
store_id INTEGER REFERENCES stores(store_id),
order_date DATE,
total_amount DECIMAL(10,2)
);
CREATE TABLE order_items (
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER,
unit_price DECIMAL(10,2)
);
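For comparison with the flat model's single wide row, the same sale sketched against this normalized schema splits across several tables (again assuming the SERIAL keys start at 1):
-- The same sale split across the normalized tables
INSERT INTO categories (name) VALUES ('Electronics');
INSERT INTO regions (name) VALUES ('East');
INSERT INTO customers (name, email, segment) VALUES ('John Doe', 'john@email.com', 'Retail');
INSERT INTO stores (region_id, name, manager) VALUES (1, 'Downtown Store', 'Jane Smith');
INSERT INTO products (category_id, name, price) VALUES (1, 'Gaming Laptop', 1299.99);
INSERT INTO orders (customer_id, store_id, order_date, total_amount) VALUES (1, 1, '2024-01-14', 1299.99);
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES (1, 1, 1, 1299.99);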
Advantages:
- Minimal data redundancy
- Strong data integrity
- Efficient for OLTP workloads
- Flexible for complex relationships
Disadvantages:
- Complex queries requiring many joins
- Can be slower for analytical queries
- More difficult for business users to understand
- Requires careful indexing strategy (see the sketch after this list)
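In practice, "careful indexing" mostly means covering the foreign-key and date columns that drive the joins and filters in the queries later in this post. A minimal sketch for the schema above:
-- Indexes on the join and filter columns used by the analytical queries below
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_products_category ON products(category_id);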
Making the Choice: Real-World Considerations
When choosing between these approaches, consider these key factors:
Data Volume and Scale
One Big Table
- Best for smaller datasets (< 1TB)
- Works well when data fits in memory
- Consider compression ratios due to redundancy
Kimball/Star Schema
- Excellent for large-scale data warehouses
- Efficient for petabyte-scale analytics
- Good balance of storage and performance
Relational
- Scales well for transactional data
- Storage efficient due to normalization
- May require partitioning for very large datasets (sketched below)
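As a sketch of what that partitioning could look like in PostgreSQL, using declarative range partitioning on the order date (the table and partition names here are illustrative, not part of the schema above):
-- Range-partitioning orders by month (PostgreSQL declarative partitioning)
CREATE TABLE orders_partitioned (
order_id BIGINT,
customer_id INTEGER,
store_id INTEGER,
order_date DATE NOT NULL,
total_amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_01 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');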
Query Pattern Examples
Let’s compare how each model handles common analytical queries:
Basic Aggregation Queries
-- One Big Table with JSON: Multiple metrics at once
SELECT
product_category,
SUM(order_amount) as total_sales,
SUM((daily_metrics->>'daily_units')::integer) as total_units,
AVG((daily_metrics->>'avg_basket_size')::decimal) as avg_basket,
SUM((cumulative_metrics->>'total_sales')::decimal) as lifetime_sales
FROM
sales_flat
GROUP BY
product_category;
-- Star Schema: Requires dimension joins
SELECT
dp.product_category,
SUM(fs.amount) as total_sales,
SUM(fs.quantity) as total_units,
AVG(fs.amount / fs.quantity) as avg_basket
FROM
fact_sales fs
JOIN dim_product dp
ON fs.product_key = dp.product_key
GROUP BY
dp.product_category;
-- Relational: Multiple joins needed
SELECT
c.name as category,
SUM(oi.quantity * oi.unit_price) as total_sales,
SUM(oi.quantity) as total_units,
AVG(oi.unit_price) as avg_price
FROM
order_items oi
JOIN products p
ON oi.product_id = p.product_id
JOIN categories c
ON p.category_id = c.category_id
GROUP BY
c.name;
Time-Based Analysis
-- One Big Table with JSON: Historical metrics in single query
SELECT
order_date,
product_category,
daily_metrics->>'daily_sales' as daily_sales,
cumulative_metrics->>'total_sales' as running_total,
daily_metrics->>'unique_customers' as daily_customers
FROM
sales_flat
WHERE
order_date BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY
order_date;
-- Star Schema: product dimension join with a date filter on the fact table
SELECT
fs.sale_date,
dp.product_category,
SUM(fs.amount) as daily_sales,
COUNT(DISTINCT fs.customer_key) as daily_customers
FROM
fact_sales fs
JOIN dim_product dp
ON fs.product_key = dp.product_key
WHERE
fs.sale_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
fs.sale_date, dp.product_category
ORDER BY
fs.sale_date;
-- Relational: Complex joins and aggregations
SELECT
DATE(o.order_date) as sale_date,
c.name as category,
SUM(oi.quantity * oi.unit_price) as daily_sales,
COUNT(DISTINCT o.customer_id) as daily_customers
FROM
orders o
JOIN order_items oi
ON o.order_id = oi.order_id
JOIN products p
ON oi.product_id = p.product_id
JOIN categories c
ON p.category_id = c.category_id
WHERE
o.order_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
DATE(o.order_date), c.name
ORDER BY
sale_date;
Performance Comparison
Here’s a comparison of typical query performance across different operations:
| Operation Type | One Big Table | Star Schema | Relational |
|---|---|---|---|
| Simple Aggregations | Fast | Medium | Slow |
| Point Queries | Medium | Fast | Fast |
| Complex Joins | N/A | Medium | Slow |
| Updates | Slow | Medium | Fast |
| Storage Efficiency | Poor | Good | Excellent |
Recommendations
Choose One Big Table when:
- Your dataset is relatively small (though larger datasets can still work for some use cases)
- You need quick insights without complex transformations
- Business users need direct access to data
- You’re prototyping or doing proof-of-concept work
Choose Kimball/Star Schema when:
- Building an enterprise data warehouse
- Supporting complex analytical queries
- Need to balance performance with storage efficiency
- Have well-defined business processes and dimensions
Choose Relational when:
- Building an OLTP system
- Data integrity is crucial
- Handling complex relationships between entities
- Storage space is at a premium
Testing the Concepts
To help you experiment with these modeling approaches, let's use Python and uv to set up a test environment. You can use it to try out the queries and see the performance characteristics firsthand.
With uv, we can run a single script that declares its own dependencies inline; uv creates a virtual environment with those dependencies for us:
#!/usr/bin/env python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "pandas>=2.0.0",
#     "tabulate>=0.9.0",
# ]
# ///
import sqlite3
import json
from datetime import datetime, timedelta
import random
import pandas as pd
from tabulate import tabulate
class DataModelTester:
    def __init__(self, db_path=":memory:"):
        """Initialize with an in-memory SQLite database by default"""
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()

    def setup_one_big_table(self):
        """Create and populate the one big table model"""
        print("Setting up One Big Table model...")
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS sales_flat (
                order_id INTEGER PRIMARY KEY,
                order_date TEXT,
                customer_name TEXT,
                customer_email TEXT,
                customer_segment TEXT,
                product_name TEXT,
                product_price REAL,
                product_category TEXT,
                store_name TEXT,
                store_region TEXT,
                store_manager TEXT,
                order_amount REAL,
                quantity_sold INTEGER,
                daily_metrics TEXT, -- JSON string for SQLite
                cumulative_metrics TEXT -- JSON string for SQLite
            )
        ''')
        # Generate sample data
        base_date = datetime.now()
        for i in range(100):
            order_date = (base_date - timedelta(days=random.randint(0, 30))).strftime('%Y-%m-%d')
            daily_metrics = {
                "daily_sales": random.uniform(100, 1000),
                "daily_units": random.randint(1, 10),
                "avg_basket_size": random.uniform(50, 200),
                "unique_customers": random.randint(1, 5)
            }
            cumulative_metrics = {
                "total_sales": random.uniform(1000, 5000),
                "total_units": random.randint(10, 50),
                "customer_lifetime_value": random.uniform(500, 2000)
            }
            self.cursor.execute('''
                INSERT INTO sales_flat VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                i,
                order_date,
                f'Customer {i}',
                f'customer{i}@example.com',
                random.choice(['Retail', 'Wholesale', 'Online']),
                f'Product {i}',
                random.uniform(10, 1000),
                random.choice(['Electronics', 'Clothing', 'Food']),
                f'Store {i % 5}',
                random.choice(['North', 'South', 'East', 'West']),
                f'Manager {i % 3}',
                random.uniform(100, 1000),
                random.randint(1, 10),
                json.dumps(daily_metrics),
                json.dumps(cumulative_metrics)
            ))
        self.conn.commit()
        print("One Big Table model setup complete!")
    def test_one_big_table_queries(self):
        """Run test queries on the one big table model"""
        print("\nTesting One Big Table Queries:")
        # Basic aggregation query
        query = '''
            SELECT
                product_category,
                SUM(order_amount) as total_sales,
                COUNT(DISTINCT customer_email) as unique_customers,
                AVG(CAST(json_extract(daily_metrics, '$.daily_units') AS FLOAT)) as avg_daily_units
            FROM sales_flat
            GROUP BY product_category
        '''
        df = pd.read_sql_query(query, self.conn)
        print("\nProduct Category Analysis:")
        print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))
        # Time-based analysis
        query = '''
            SELECT
                order_date,
                COUNT(*) as num_orders,
                SUM(order_amount) as daily_sales,
                AVG(CAST(json_extract(daily_metrics, '$.avg_basket_size') AS FLOAT)) as avg_basket
            FROM sales_flat
            GROUP BY order_date
            ORDER BY order_date DESC
            LIMIT 5
        '''
        df = pd.read_sql_query(query, self.conn)
        print("\nRecent Daily Sales Analysis:")
        print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))

    def cleanup(self):
        """Clean up the database"""
        self.conn.close()
        print("\nTest environment cleaned up!")
if __name__ == "__main__":
    print("Starting Data Model Test Environment...")
    tester = DataModelTester()
    # Setup and test one big table model
    tester.setup_one_big_table()
    tester.test_one_big_table_queries()
    # Clean up
    tester.cleanup()
You can run this script easily using uv's run command with inline dependencies:
uv run test_data_models.py
This will automatically create a virtual environment, install the required packages, and run the test script. The script creates an in-memory SQLite database to demonstrate the one big table approach, populates it with sample data, and runs example queries to show how the model performs.
Feel free to modify the script to test different scenarios or add implementations of the star schema and relational models to compare their performance characteristics.
Conclusion
The choice of data model significantly impacts your system’s performance, maintainability, and usability. While the one big table approach offers simplicity and quick implementation, the star schema provides a robust foundation for analytics, and the relational model ensures data integrity.
Remember:
- Start with your use case and requirements
- Consider your data volume and growth
- Think about your users and their query patterns
- Plan for future scalability needs
The best approach might even be a hybrid - using different models for different parts of your system based on specific requirements. The key is understanding these trade-offs and making an informed decision that aligns with your business needs.