Ghost in the data
Streamlining Data Pipeline Reliability: The Write-Audit-Publish Pattern

Introduction: Why Safe Data Pipelines Matter

In the world of data engineering, there’s a constant challenge we all face: how do we ensure our production data remains reliable and error-free when deploying updates? Anyone who’s experienced the cold sweat of a bad deployment affecting critical business data knows this pain all too well. Enter the Write-Audit-Publish pattern, a robust approach that can significantly reduce the risk of data pipeline failures. This pattern, which shares DNA with the well-known Blue-Green deployment strategy from software engineering, creates a safety net that can save your team countless hours of troubleshooting and emergency fixes.
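As a rough illustration of the idea, here is a minimal sketch of the three steps using SQLite from the Python standard library. The table names (`orders`, `orders_staging`), the audit rules, and the function name are hypothetical choices for this example, not taken from the post, and a real pipeline would likely run these steps as separate tasks in an orchestrator.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage rows, audit them, and publish only if every check passes.

    Table names and audit rules are illustrative assumptions.
    """
    cur = conn.cursor()

    # Write: land the new batch in a staging table, never in production.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)
    conn.commit()

    # Audit: run data quality checks against the staged data only.
    total = cur.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    bad = cur.execute(
        "SELECT COUNT(*) FROM orders_staging "
        "WHERE order_id IS NULL OR amount IS NULL OR amount < 0"
    ).fetchone()[0]
    if total == 0 or bad > 0:
        # Leave staging in place for inspection; production is untouched.
        return False

    # Publish: swap the audited table in as the production table.
    cur.execute("DROP TABLE IF EXISTS orders")
    cur.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()
    return True
```

Calling `write_audit_publish(sqlite3.connect(":memory:"), [(1, 10.0), (2, 5.5)])` publishes the batch, while a batch containing a negative amount is rejected and the previously published table stays as it was.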

  • Write-Audit-Publish
  • WAP Pattern
  • Airflow
  • Data Reliability
  • Blue-Green Deployment
  • Data Quality
  • Python
Sunday, May 18, 2025
Setting Up Your Data Engineering Environment on Windows

Introduction

Setting up a development environment for data engineering on Windows requires some specific considerations that differ from Unix-based systems. This guide will walk you through creating a robust Python development environment on Windows, with detailed explanations of each component and why it’s important.

Clean Slate: Removing Existing Python Installations

Before starting, it’s important to remove any existing Python installations to avoid conflicts:

  • Open Windows Settings > Apps > Apps & Features
  • Search for “Python”
  • Uninstall any Python versions listed

Also check and remove Python from these locations:

  • Python
  • DBT
  • Windows
  • UV Package Manager
  • VSCode
Monday, February 3, 2025
Setting Up Your Data Engineering Environment on MacOS

Introduction

Setting up a development environment for data engineering on MacOS requires careful consideration of package management, Python version control, and tool configuration. This guide will walk you through the process, explaining not just how to set up these tools, but why each component is important.

Clean Slate: Removing Existing Python Installations

Before we begin, it’s important to ensure we’re starting with a clean slate. Multiple Python installations can cause confusion and conflicts. Let’s remove any existing Python installations:

  • Python
  • DBT
  • MacOS
  • UV Package Manager
  • VSCode
Sunday, February 2, 2025
Automating Python Virtual Environments with Zsh on macOS

Automating Python Virtual Environments with Zsh on macOS Managing Python virtual environments can be tedious - having to manually activate and deactivate them as you move between projects. I decided to create a script that will automatically activate virtual environment. When you cd into a directory containing a .venv folder. When you leave, it will deactivate it. Installing Zsh While macOS comes with zsh as the default shell since Catalina (10.15), you may want to ensure you have the latest version. The easiest way to install or update zsh is using Homebrew:

  • Python
  • Virtual Environments
  • Zsh
  • macOS
  • Automation
Tuesday, January 14, 2025
UV: A Game-Changer for Data Engineering Scripts

Introduction

While pip install has been the go-to package installer for Python developers, UV brings game-changing performance improvements to dependency management. UV achieves significantly faster installation speeds through several clever optimizations:

  • Parallel Downloads: Unlike pip’s sequential approach, UV downloads multiple packages simultaneously, dramatically reducing wait times for large dependency sets.
  • Wheel-First Strategy: UV prioritizes pre-built wheels over source distributions, avoiding time-consuming compilation steps when possible.
  • Rust-Based Implementation: Built with Rust’s memory safety and concurrent processing capabilities, UV handles package resolution more efficiently than pip’s Python-based implementation.

In real-world testing, UV often installs packages 5-10x faster than pip, particularly in environments with many dependencies. For data professionals working with complex libraries like pandas, numpy, scikit-learn, or pyspark, this speed difference isn’t just convenient; it’s transformative for workflow efficiency.

  • UV
  • Python
  • Data Testing
  • Data Transformation
  • Development Tools
Saturday, January 11, 2025
Data Vault Data Modeling with Python and dbt

Introduction

Data Vault is a data modeling technique designed specifically for use in Data Warehouses. It is a hybrid approach that combines the best elements of Third Normal Form (3NF) and Star Schema to provide a flexible and scalable data modeling solution.

Hubs, Links, Satellites

A Data Vault consists of three main components: Hubs, Links, and Satellites. Hubs are the backbone of the Data Vault architecture and represent the entities within the data model. They are the core data elements and contain the primary key information.
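To make the Hub idea concrete, here is a small sketch of how hub rows might be derived from source records in plain Python. The `HubCustomer` class, the `customer_id` business key, and the MD5 hash key are assumptions made for illustration (hashed keys are a common Data Vault 2.0 convention); they are not taken from the post.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HubCustomer:
    """One row of a hypothetical customer hub."""
    customer_hk: str      # hash key derived from the business key
    customer_id: str      # business key as delivered by the source
    load_ts: datetime     # when the key was first loaded
    record_source: str    # which system supplied the key


def hash_key(business_key: str) -> str:
    """Normalize the business key, then hash it into a fixed-width key."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_hub(records, record_source):
    """Return one hub row per distinct business key found in the records."""
    load_ts = datetime.now(timezone.utc)
    hub = {}
    for record in records:
        business_key = record["customer_id"].strip()
        hk = hash_key(business_key)
        # Keep only the first occurrence of each key: a hub stores keys once.
        if hk not in hub:
            hub[hk] = HubCustomer(hk, business_key, load_ts, record_source)
    return list(hub.values())
```

For example, `build_hub([{"customer_id": "c-1"}, {"customer_id": " C-1 "}], "crm")` yields a single hub row, because both records normalize to the same business key.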

  • Data Vault
  • Python
  • DBT
  • ETL
  • Data Warehouse Architecture
Sunday, February 26, 2023