Data Modeling Showdown: Kimball vs One Big Table vs Relational
Introduction
When architecting a data warehouse, one of the most crucial decisions is choosing the right data modeling approach. Like selecting the right tool for a job, each modeling methodology has its strengths and ideal use cases. Today, we’ll explore three popular approaches: Kimball’s dimensional modeling (star schema), the one big table approach, and traditional relational modeling.
The Dataset: Understanding Our Example
To illustrate these approaches, let’s consider a retail sales system with these core components:
- Orders (transactions with amount, date, etc.)
- Customers (who made the purchase)
- Products (what was purchased)
- Stores (where the purchase occurred)
This real-world scenario will help us demonstrate how each modeling approach handles relationships, redundancy, and querying patterns.
Three Ways to Model: A Deep Dive
1. The One Big Table Approach
The simplest approach is to denormalize everything into a single table. While this might seem inefficient, it can be surprisingly effective for certain use cases, especially when combined with JSON columns for analytical data.
Benefits of JSON columns in the one big table approach:
- Flexible schema for metrics that change over time
- Efficient storage of time-series data within a single row
- Ability to store both daily snapshots and cumulative metrics
- Easy querying of complex nested structures
Here’s how our retail data looks in a one big table model:
-- Creating the one big table with JSON columns
CREATE TABLE sales_flat (
order_id INTEGER PRIMARY KEY,
order_date DATE,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_segment VARCHAR(50),
product_name VARCHAR(100),
product_price DECIMAL(10,2),
product_category VARCHAR(50),
store_name VARCHAR(100),
store_region VARCHAR(50),
store_manager VARCHAR(100),
order_amount DECIMAL(10,2),
quantity_sold INTEGER,
daily_metrics JSONB, -- Stores daily snapshots
cumulative_metrics JSONB -- Stores running totals
);
-- Sample data insertion with JSON columns
INSERT INTO sales_flat VALUES
(1, '2024-01-14', 'John Doe', 'john@email.com', 'Retail', 'Gaming Laptop', 1299.99,
'Electronics', 'Downtown Store', 'East', 'Jane Smith', 1299.99, 1,
'{
"daily_sales": 1299.99,
"daily_units": 1,
"avg_basket_size": 1299.99,
"unique_customers": 1
}',
'{
"total_sales": 1299.99,
"total_units": 1,
"customer_lifetime_value": 1299.99
}'
);
Advantages:
- No joins needed for queries
- Straightforward for business users to understand
- Excellent for BI tools that prefer flat structures
- Fast query performance for many analytical queries
Disadvantages:
- Data redundancy (same customer/product info repeated)
- Higher storage requirements
- Update anomalies (changing a store manager requires updating many rows; see the example below)
- Potential data inconsistencies
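To make that update anomaly concrete, here is a hypothetical correction sketched against the sales_flat table above. Because the manager is repeated on every sales row, a single personnel change rewrites every row for that store, whereas the normalized model later in this post would touch exactly one row:
-- Update anomaly: one manager change rewrites every matching sales row
UPDATE sales_flat
SET store_manager = 'New Manager'
WHERE store_name = 'Downtown Store';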
2. The Kimball (Star Schema) Approach
The Kimball approach, also known as dimensional modeling, organizes data into fact and dimension tables. This creates a structure resembling a star when visualized:
-- Create dimension tables
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_segment VARCHAR(50)
);
CREATE TABLE dim_product (
product_key SERIAL PRIMARY KEY,
product_name VARCHAR(100),
product_price DECIMAL(10,2),
product_category VARCHAR(50)
);
CREATE TABLE dim_store (
store_key SERIAL PRIMARY KEY,
store_name VARCHAR(100),
store_region VARCHAR(50),
store_manager VARCHAR(100)
);
-- Create fact table
CREATE TABLE fact_sales (
sale_key SERIAL PRIMARY KEY,
customer_key INTEGER REFERENCES dim_customer(customer_key),
product_key INTEGER REFERENCES dim_product(product_key),
store_key INTEGER REFERENCES dim_store(store_key),
sale_date DATE,
amount DECIMAL(10,2),
quantity INTEGER
);
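To show how the pieces fit together, here is a rough sketch of loading the same sample sale into the star schema: dimension rows go in first, and the fact row then references their surrogate keys (assuming the SERIAL keys start at 1):
-- Load dimensions first, then reference their surrogate keys from the fact table
INSERT INTO dim_customer (customer_name, customer_email, customer_segment)
VALUES ('John Doe', 'john@email.com', 'Retail');
INSERT INTO dim_product (product_name, product_price, product_category)
VALUES ('Gaming Laptop', 1299.99, 'Electronics');
INSERT INTO dim_store (store_name, store_region, store_manager)
VALUES ('Downtown Store', 'East', 'Jane Smith');
INSERT INTO fact_sales (customer_key, product_key, store_key, sale_date, amount, quantity)
VALUES (1, 1, 1, '2024-01-14', 1299.99, 1);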
Advantages:
- Optimized for analytical queries
- Reduced data redundancy compared to one big table
- Clear separation between metrics and attributes
- Industry standard for data warehousing
Disadvantages:
- Requires joins for most queries
- More complex to maintain than one big table
- Initial setup requires careful design
3. The Relational (Normalized) Approach
The traditional relational approach focuses on normalizing data to eliminate redundancy. This results in more tables but ensures data consistency:
-- Create normalized tables
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE regions (
region_id SERIAL PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
segment VARCHAR(50)
);
CREATE TABLE stores (
store_id SERIAL PRIMARY KEY,
region_id INTEGER REFERENCES regions(region_id),
name VARCHAR(100),
manager VARCHAR(100)
);
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
category_id INTEGER REFERENCES categories(category_id),
name VARCHAR(100),
price DECIMAL(10,2)
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
store_id INTEGER REFERENCES stores(store_id),
order_date DATE,
total_amount DECIMAL(10,2)
);
CREATE TABLE order_items (
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER,
unit_price DECIMAL(10,2)
);
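For comparison with the flat model's single wide row, the same sale sketched against this normalized schema splits across several tables (again assuming the SERIAL keys start at 1):
-- The same sale split across the normalized tables
INSERT INTO categories (name) VALUES ('Electronics');
INSERT INTO regions (name) VALUES ('East');
INSERT INTO customers (name, email, segment) VALUES ('John Doe', 'john@email.com', 'Retail');
INSERT INTO stores (region_id, name, manager) VALUES (1, 'Downtown Store', 'Jane Smith');
INSERT INTO products (category_id, name, price) VALUES (1, 'Gaming Laptop', 1299.99);
INSERT INTO orders (customer_id, store_id, order_date, total_amount) VALUES (1, 1, '2024-01-14', 1299.99);
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES (1, 1, 1, 1299.99);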
Advantages:
- Minimal data redundancy
- Strong data integrity
- Efficient for OLTP workloads
- Flexible for complex relationships
Disadvantages:
- Complex queries requiring many joins
- Can be slower for analytical queries
- More difficult for business users to understand
- Requires careful indexing strategy (see the sketch after this list)
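In practice, "careful indexing" mostly means covering the foreign-key and date columns that drive the joins and filters in the queries later in this post. A minimal sketch for the schema above:
-- Indexes on the join and filter columns used by the analytical queries below
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_products_category ON products(category_id);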
Making the Choice: Real-World Considerations
When choosing between these approaches, consider these key factors:
Data Volume and Scale
One Big Table
- Best for smaller datasets (< 1TB)
- Works well when data fits in memory
- Consider compression ratios due to redundancy
Kimball/Star Schema
- Excellent for large-scale data warehouses
- Efficient for petabyte-scale analytics
- Good balance of storage and performance
Relational
- Scales well for transactional data
- Storage efficient due to normalization
- May require partitioning for very large datasets (sketched below)
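As a sketch of what that partitioning could look like in PostgreSQL, using declarative range partitioning on the order date (the table and partition names here are illustrative, not part of the schema above):
-- Range-partitioning orders by month (PostgreSQL declarative partitioning)
CREATE TABLE orders_partitioned (
order_id BIGINT,
customer_id INTEGER,
store_id INTEGER,
order_date DATE NOT NULL,
total_amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_01 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');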
Query Pattern Examples
Let’s compare how each model handles common analytical queries:
Basic Aggregation Queries
-- One Big Table with JSON: Multiple metrics at once
SELECT
product_category,
SUM(order_amount) as total_sales,
SUM((daily_metrics->>'daily_units')::integer) as total_units,
AVG((daily_metrics->>'avg_basket_size')::decimal) as avg_basket,
SUM((cumulative_metrics->>'total_sales')::decimal) as lifetime_sales
FROM
sales_flat
GROUP BY
product_category;
-- Star Schema: Requires dimension joins
SELECT
dp.product_category,
SUM(fs.amount) as total_sales,
SUM(fs.quantity) as total_units,
AVG(fs.amount / fs.quantity) as avg_basket
FROM
fact_sales fs
JOIN dim_product dp
ON fs.product_key = dp.product_key
GROUP BY
dp.product_category;
-- Relational: Multiple joins needed
SELECT
c.name as category,
SUM(oi.quantity * oi.unit_price) as total_sales,
SUM(oi.quantity) as total_units,
AVG(oi.unit_price) as avg_price
FROM
order_items oi
JOIN products p
ON oi.product_id = p.product_id
JOIN categories c
ON p.category_id = c.category_id
GROUP BY
c.name;
Time-Based Analysis
-- One Big Table with JSON: Historical metrics in single query
SELECT
order_date,
product_category,
daily_metrics->>'daily_sales' as daily_sales,
cumulative_metrics->>'total_sales' as running_total,
daily_metrics->>'unique_customers' as daily_customers
FROM
sales_flat
WHERE
order_date BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY
order_date;
-- Star Schema: product dimension join with a date filter on the fact table
SELECT
fs.sale_date,
dp.product_category,
SUM(fs.amount) as daily_sales,
COUNT(DISTINCT fs.customer_key) as daily_customers
FROM
fact_sales fs
JOIN dim_product dp
ON fs.product_key = dp.product_key
WHERE
fs.sale_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
fs.sale_date, dp.product_category
ORDER BY
fs.sale_date;
-- Relational: Complex joins and aggregations
SELECT
DATE(o.order_date) as sale_date,
c.name as category,
SUM(oi.quantity * oi.unit_price) as daily_sales,
COUNT(DISTINCT o.customer_id) as daily_customers
FROM
orders o
JOIN order_items oi
ON o.order_id = oi.order_id
JOIN products p
ON oi.product_id = p.product_id
JOIN categories c
ON p.category_id = c.category_id
WHERE
o.order_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
DATE(o.order_date), c.name
ORDER BY
sale_date;
Performance Comparison
Here’s a comparison of typical query performance across different operations:
| Operation Type | One Big Table | Star Schema | Relational |
|---|---|---|---|
| Simple Aggregations | Fast | Medium | Slow |
| Point Queries | Medium | Fast | Fast |
| Complex Joins | N/A | Medium | Slow |
| Updates | Slow | Medium | Fast |
| Storage Efficiency | Poor | Good | Excellent |
Recommendations
Choose One Big Table when:
- Your dataset is relatively small (though larger datasets can still work for some use cases)
- You need quick insights without complex transformations
- Business users need direct access to data
- You’re prototyping or doing proof-of-concept work
Choose Kimball/Star Schema when:
- Building an enterprise data warehouse
- Supporting complex analytical queries
- Need to balance performance with storage efficiency
- Have well-defined business processes and dimensions
Choose Relational when:
- Building an OLTP system
- Data integrity is crucial
- Handling complex relationships between entities
- Storage space is at a premium
Testing the Concepts
To help you experiment with these modeling approaches, let's use Python and uv to set up a test environment. You can use it to try out the queries and see the performance characteristics firsthand.
With uv, we can run a single script that declares its own dependencies inline; uv creates a virtual environment with those dependencies for us:
#!/usr/bin/env python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "pandas>=2.0.0",
#     "tabulate>=0.9.0",
# ]
# ///
import sqlite3
import json
from datetime import datetime, timedelta
import random
import pandas as pd
from tabulate import tabulate
class DataModelTester:
    def __init__(self, db_path=":memory:"):
        """Initialize with an in-memory SQLite database by default"""
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()

    def setup_one_big_table(self):
        """Create and populate the one big table model"""
        print("Setting up One Big Table model...")
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS sales_flat (
                order_id INTEGER PRIMARY KEY,
                order_date TEXT,
                customer_name TEXT,
                customer_email TEXT,
                customer_segment TEXT,
                product_name TEXT,
                product_price REAL,
                product_category TEXT,
                store_name TEXT,
                store_region TEXT,
                store_manager TEXT,
                order_amount REAL,
                quantity_sold INTEGER,
                daily_metrics TEXT, -- JSON string for SQLite
                cumulative_metrics TEXT -- JSON string for SQLite
            )
        ''')
        # Generate sample data
        base_date = datetime.now()
        for i in range(100):
            order_date = (base_date - timedelta(days=random.randint(0, 30))).strftime('%Y-%m-%d')
            daily_metrics = {
                "daily_sales": random.uniform(100, 1000),
                "daily_units": random.randint(1, 10),
                "avg_basket_size": random.uniform(50, 200),
                "unique_customers": random.randint(1, 5)
            }
            cumulative_metrics = {
                "total_sales": random.uniform(1000, 5000),
                "total_units": random.randint(10, 50),
                "customer_lifetime_value": random.uniform(500, 2000)
            }
            self.cursor.execute('''
                INSERT INTO sales_flat VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                i,
                order_date,
                f'Customer {i}',
                f'customer{i}@example.com',
                random.choice(['Retail', 'Wholesale', 'Online']),
                f'Product {i}',
                random.uniform(10, 1000),
                random.choice(['Electronics', 'Clothing', 'Food']),
                f'Store {i % 5}',
                random.choice(['North', 'South', 'East', 'West']),
                f'Manager {i % 3}',
                random.uniform(100, 1000),
                random.randint(1, 10),
                json.dumps(daily_metrics),
                json.dumps(cumulative_metrics)
            ))
        self.conn.commit()
        print("One Big Table model setup complete!")
    def test_one_big_table_queries(self):
        """Run test queries on the one big table model"""
        print("\nTesting One Big Table Queries:")
        # Basic aggregation query
        query = '''
            SELECT
                product_category,
                SUM(order_amount) as total_sales,
                COUNT(DISTINCT customer_email) as unique_customers,
                AVG(CAST(json_extract(daily_metrics, '$.daily_units') AS FLOAT)) as avg_daily_units
            FROM sales_flat
            GROUP BY product_category
        '''
        df = pd.read_sql_query(query, self.conn)
        print("\nProduct Category Analysis:")
        print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))
        # Time-based analysis
        query = '''
            SELECT
                order_date,
                COUNT(*) as num_orders,
                SUM(order_amount) as daily_sales,
                AVG(CAST(json_extract(daily_metrics, '$.avg_basket_size') AS FLOAT)) as avg_basket
            FROM sales_flat
            GROUP BY order_date
            ORDER BY order_date DESC
            LIMIT 5
        '''
        df = pd.read_sql_query(query, self.conn)
        print("\nRecent Daily Sales Analysis:")
        print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))

    def cleanup(self):
        """Clean up the database"""
        self.conn.close()
        print("\nTest environment cleaned up!")
if __name__ == "__main__":
    print("Starting Data Model Test Environment...")
    tester = DataModelTester()
    # Setup and test one big table model
    tester.setup_one_big_table()
    tester.test_one_big_table_queries()
    # Clean up
    tester.cleanup()
You can run this script easily using uv's run command with inline dependencies:
uv run test_data_models.py
This will automatically create a virtual environment, install the required packages, and run the test script. The script creates an in-memory SQLite database to demonstrate the one big table approach, populates it with sample data, and runs example queries to show how the model performs.
Feel free to modify the script to test different scenarios or add implementations of the star schema and relational models to compare their performance characteristics.
Conclusion
The choice of data model significantly impacts your system’s performance, maintainability, and usability. While the one big table approach offers simplicity and quick implementation, the star schema provides a robust foundation for analytics, and the relational model ensures data integrity.
Remember:
- Start with your use case and requirements
- Consider your data volume and growth
- Think about your users and their query patterns
- Plan for future scalability needs
The best approach might even be a hybrid - using different models for different parts of your system based on specific requirements. The key is understanding these trade-offs and making an informed decision that aligns with your business needs.