Ghost in the data
Setting Up Your Data Engineering Environment on Windows

Introduction
Setting up a development environment for data engineering on Windows requires some specific considerations that differ from Unix-based systems. This guide will walk you through creating a robust Python development environment on Windows, with detailed explanations of each component and why it’s important.

Clean Slate: Removing Existing Python Installations
Before starting, it’s important to remove any existing Python installations to avoid conflicts:
  • Open Windows Settings > Apps > Apps & Features
  • Search for “Python”
  • Uninstall any Python versions listed
Also check and remove Python from these locations:

  • Python
  • DBT
  • Windows
  • UV Package Manager
  • VSCode
Monday, February 3, 2025
Setting Up Your Data Engineering Environment on MacOS

Introduction
Setting up a development environment for data engineering on MacOS requires careful consideration of package management, Python version control, and tool configuration. This guide will walk you through the process, explaining not just how to set up these tools, but why each component is important.

Clean Slate: Removing Existing Python Installations
Before we begin, it’s important to ensure we’re starting with a clean slate. Multiple Python installations can cause confusion and conflicts. Let’s remove any existing Python installations:

  • Python
  • DBT
  • MacOS
  • UV Package Manager
  • VSCode
Sunday, February 2, 2025
Data Modeling Showdown: Kimball vs One Big Table vs Relational

Introduction
When architecting a data warehouse, one of the most crucial decisions is choosing the right data modeling approach. Like selecting the right tool for a job, each modeling methodology has its strengths and ideal use cases. Today, we’ll explore three popular approaches: Kimball’s dimensional modeling (star schema), the one big table approach, and traditional relational modeling.

The Dataset: Understanding Our Example
To illustrate these approaches, let’s consider a retail sales system with these core components:

  • Data Warehouse
  • SQL
  • Star Schema
  • Database Design
  • Performance Optimization
Saturday, January 25, 2025
Data Industry Trends: What to Expect in 2025

Introduction
The data industry has kicked off 2025 with transformative developments that are fundamentally reshaping our approach to data management and analytics. The landscape is witnessing seismic shifts - from Databricks’ historic funding round to Boomi’s strategic acquisition of Rivery, and the industry-shaking Iceberg buyout. Yet amid this technological evolution, a critical question emerges: how will these advancements translate into tangible value for organizations? As we navigate through this dynamic environment, the focus extends beyond identifying dominant technologies to understanding their practical impact on business outcomes. Let’s explore the key trends that are defining the data world in 2025, and more importantly, how they’re reshaping the way organizations leverage their data assets.

  • Industry Trends
  • Apache Iceberg
  • AI
  • Data Solutions
  • SQL
  • Data Governance
Saturday, January 18, 2025
Automating Python Virtual Environments with Zsh on macOS

Managing Python virtual environments can be tedious when you have to manually activate and deactivate them as you move between projects. I decided to create a script that automatically activates a virtual environment when you cd into a directory containing a .venv folder and deactivates it when you leave.

Installing Zsh
While macOS has shipped with zsh as the default shell since Catalina (10.15), you may want to ensure you have the latest version. The easiest way to install or update zsh is using Homebrew:

  • Python
  • Virtual Environments
  • Zsh
  • macOS
  • Automation
Tuesday, January 14, 2025
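
The excerpt above describes the behavior of the script but does not show it, so here is a rough sketch of the decision logic only, written in Python rather than zsh. It assumes the simplest reading of the description: a virtual environment is activated when the directory you enter directly contains a .venv folder, and deactivated otherwise. The function names has_venv and on_directory_change are hypothetical and are not taken from the original post.

```python
from pathlib import Path
from typing import Optional, Tuple


def has_venv(directory: Path) -> bool:
    """Return True if the directory directly contains a .venv folder."""
    return (directory / ".venv").is_dir()


def on_directory_change(
    new_dir: Path, active_venv: Optional[Path]
) -> Tuple[str, Optional[Path]]:
    """Decide what a shell hook should do after a cd.

    Returns (action, venv) where action is "activate", "deactivate",
    or "nothing", and venv is the environment that should be active
    afterwards (None if no environment should be active).
    """
    # Assumption: only the directory itself is checked, not its parents.
    target = new_dir / ".venv" if has_venv(new_dir) else None

    if target == active_venv:
        return "nothing", active_venv
    if target is not None:
        return "activate", target
    return "deactivate", None
```

In a real shell setup the same check would typically run from a hook that fires on every directory change, with the "activate" and "deactivate" actions mapped to sourcing the environment's activate script and calling deactivate.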
UV: A Game-Changer for Data Engineering Scripts

Introduction
While pip install has been the go-to package installer for Python developers, UV brings game-changing performance improvements to dependency management. UV achieves significantly faster installation speeds through several clever optimizations:
  • Parallel Downloads: Unlike pip’s sequential approach, UV downloads multiple packages simultaneously, dramatically reducing wait times for large dependency sets.
  • Wheel-First Strategy: UV prioritizes pre-built wheels over source distributions, avoiding time-consuming compilation steps when possible.
  • Rust-Based Implementation: Built with Rust’s memory safety and concurrent processing capabilities, UV handles package resolution more efficiently than pip’s Python-based implementation.
In real-world testing, UV often installs packages 5-10x faster than pip, particularly in environments with many dependencies. For data professionals working with complex libraries like pandas, numpy, scikit-learn, or pyspark, this speed difference isn’t just convenient – it’s transformative for workflow efficiency.

  • UV
  • Python
  • Data Testing
  • Data Transformation
  • Development Tools
Saturday, January 11, 2025
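
To make the parallel-versus-sequential download point from the UV excerpt above more concrete, here is a minimal Python sketch. It does not call UV or pip and says nothing about how either tool is implemented. Downloads are simulated with time.sleep, and the package list, the delays, and the helper names (fake_download, install_sequential, install_parallel, timed) are all made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical packages mapped to simulated download times in seconds.
PACKAGES = {"pandas": 0.2, "numpy": 0.2, "scikit-learn": 0.2, "pyspark": 0.2}


def fake_download(name: str) -> str:
    """Stand-in for a network download; sleeps instead of fetching anything."""
    time.sleep(PACKAGES[name])
    return name


def install_sequential(names):
    """Fetch packages one after another."""
    return [fake_download(name) for name in names]


def install_parallel(names, workers=4):
    """Fetch packages concurrently using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(fake_download, names))


def timed(fn, names):
    """Run fn(names) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(names)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    names = list(PACKAGES)
    _, sequential_seconds = timed(install_sequential, names)
    _, parallel_seconds = timed(install_parallel, names)
    print(f"sequential: {sequential_seconds:.2f}s")
    print(f"parallel:   {parallel_seconds:.2f}s")
```

With four simulated downloads of equal length, the sequential run takes roughly the sum of the delays while the parallel run takes roughly the longest single delay. Real installs also spend time on resolution, unpacking, and compilation, so this sketch only illustrates the download portion of the claim.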
Individual Contributor to Senior Manager of Data

Introduction
Starting a new role at any organization, whether it’s a school, a workplace, or another setting, typically begins with a focus on individual contribution. Your success is directly tied to your personal efforts. You have control over the pace and quality of your work, and ultimately, you are solely accountable for your outcomes. This phase allows you to develop the skills and discipline necessary to excel in more complex roles.

The Path to Success as an Individual Contributor
I likely spent longer in this phase than most. I always had the mindset of making my manager, and by extension my team, look good. This meant not only delivering quality work but also taking full accountability for my tasks.

    Saturday, August 10, 2024
    Enhance Workplace Relationships

    Introduction: A Tale of Two Tribes and the Modern Workplace
    Imagine a serene summer camp in the rugged heart of Robbers Cave State Park, Oklahoma, 1954. Two groups of boys, unaware of each other’s existence, are about to embark on an adventure that mirrors the timeless tale of rivalry and reconciliation, a story that still resonates in the corridors of contemporary workplaces. The Robbers Cave Experiment, conducted by social psychologist Muzafer Sherif, is not just a fascinating study on group dynamics; it’s a blueprint for understanding and enhancing cooperation in any setting where diverse minds meet. This experiment beautifully illustrates how perceived differences can dissolve into unity, given the right conditions and shared objectives.

    • Robbers Cave Experiment
    • Workplace Relationships
    • Team Collaboration
    • Intergroup Conflict
    • Employee Productivity
    Saturday, April 6, 2024
    Mastering Data Engineering: Insights and Best Practices

    Introduction
    I have been working with data for a bit over 17 years now, and I have seen it evolve from its nascent stages to a cornerstone of the tech industry. The journey has been nothing short of revolutionary, impacting businesses and society at large. Along the way, the role of a data engineer has expanded, requiring not just technical skills but a deep understanding of business, security, and the human element within technology.

    • Culture
    • Continuous Learning
    • Data Quality
    • Professional Growth
    • Data Pipeline
    • Data System Resilience
    • Team Collaboration
    Saturday, March 30, 2024
    How to Find and Attract Top Data Engineers

    Introduction
    When filling open positions, I tend to get inundated with resumes. As I sift through applications, my reaction varies from “this might work” to a straightforward “no”. Rarely do I encounter a resume that makes me exclaim, “This person is exceptional! We need them on our team.” Despite reviewing thousands of job applications, I find that a standout Data Engineer is hard to come by. I believe there’s a reason for this rarity. The truth is that the most talented Data Engineers, along with top professionals in any field, are seldom actively seeking employment.

    • Culture
    • Employee Engagement
    • Hiring Strategies
    • Talent Acquisition
    • Recruitment
    Thursday, March 14, 2024
    Docker and Airflow: A Comprehensive Setup Guide

    Introduction
    Docker and Airflow are like peanut butter and jelly for data engineers; they just work perfectly together. Docker simplifies deployment by wrapping your applications in containers, ensuring consistency across environments. It’s like having a genie that makes sure your software behaves the same, no matter where you deploy it. On the flip side, Airflow is the maestro of orchestrating complex workflows, making it a go-to tool for managing data pipelines in various organizations.

    • Apache Airflow
    • ETL Pipeline
    • Data Engineering
    • PostgreSQL
    Saturday, March 9, 2024
    Optimizing CI/CD with SlimCi DBT for Efficient Data Engineering

    Introduction
    In the rapidly evolving landscape of software development and data engineering, the ability to adapt and respond to changes quickly is not just an advantage; it’s a necessity. One of the core practices enabling this agility is Continuous Integration (CI), a methodology that encourages developers to integrate their work into a shared repository early and often. At its heart, CI embodies the “fail fast” principle, a philosophy that values early detection of errors and inconsistencies, allowing teams to address issues before they escalate into more significant problems.

    • Pipeline
    Saturday, February 17, 2024