Ghost in the data
Your First 90 Days as a Data Engineer: A Strategic Guide

Introduction Landing your first data engineering role, or starting at a new company, is both exhilarating and daunting. After navigating multiple interviews and accepting an offer, you’ve finally arrived at your desk with a new laptop and company swag (if you’re lucky). Even now, after solving countless problems ranging from minor bugs to enterprise-scale data challenges, I still occasionally feel that flutter of uncertainty in my stomach when starting a new role. What if I don’t know what I’m doing? What if I make a mistake?

  • Onboarding
  • Professional Growth
  • Team Collaboration
  • Career Advice
  • Data Culture
Sunday, February 23, 2025
Mastering Data Interviews: A Comprehensive Guide

Introduction After nearly two decades in the data engineering field, I’ve sat on both sides of the interview table countless times. Whether you’re a seasoned professional looking to change roles or a newcomer trying to break into the field, the interview process for data engineering positions can be both challenging and mysterious. There’s often uncertainty about what questions you’ll face, what skills you need to demonstrate, and what interviewers are really looking for beneath the surface.

  • Interviews
  • Technical Assessment
  • Career Growth
  • SQL
  • Data Modeling
  • Problem Solving
Saturday, February 22, 2025
Maximizing Data Impact: A Guide to Effective Data Engineering

Introduction Creating impact goes far beyond writing efficient code or building robust pipelines. It’s about understanding how your work translates into tangible value for stakeholders across the organization. Types of Impact Our work forms the backbone of data-driven decision making in organizations. However, measuring and communicating this impact isn’t always straightforward. If you feel your work isn’t making a meaningful difference, it might be time to pivot your focus or approach. Understanding the various ways we create value helps guide these decisions and ensures we’re contributing in ways that matter.

  • Data Impact
  • Visualization
  • Stakeholder Management
  • Team Enablement
  • Data Quality
Saturday, February 15, 2025
Data Modeling Showdown: Kimball vs One Big Table vs Relational

Introduction When architecting a data warehouse, one of the most crucial decisions is choosing the right data modeling approach. Like selecting the right tool for a job, each modeling methodology has its strengths and ideal use cases. Today, we’ll explore three popular approaches: Kimball’s dimensional modeling (star schema), the one big table approach, and traditional relational modeling. The Dataset: Understanding Our Example To illustrate these approaches, let’s consider a retail sales system with these core components:

  • Data Warehouse
  • SQL
  • Star Schema
  • Database Design
  • Performance Optimization
Saturday, January 25, 2025
Data Industry Trends: What to Expect in 2025

Introduction The data industry has kicked off 2025 with transformative developments that are fundamentally reshaping our approach to data management and analytics. The landscape is witnessing seismic shifts - from Databricks’ historic funding round to Boomi’s strategic acquisition of Rivery, and the industry-shaking Iceberg buyout. Yet amid this technological evolution, a critical question emerges: how will these advancements translate into tangible value for organizations? As we navigate through this dynamic environment, the focus extends beyond identifying dominant technologies to understanding their practical impact on business outcomes. Let’s explore the key trends that are defining the data world in 2025, and more importantly, how they’re reshaping the way organizations leverage their data assets.

  • Industry Trends
  • Apache Iceberg
  • AI
  • Data Solutions
  • SQL
  • Data Governance
Saturday, January 18, 2025
UV: A Game-Changer for Data Engineering Scripts

Introduction While pip install has been the go-to package installer for Python developers, UV brings game-changing performance improvements to dependency management. UV achieves significantly faster installation speeds through several clever optimizations: Parallel Downloads: Unlike pip’s sequential approach, UV downloads multiple packages simultaneously, dramatically reducing wait times for large dependency sets. Wheel-First Strategy: UV prioritizes pre-built wheels over source distributions, avoiding time-consuming compilation steps when possible. Rust-Based Implementation: Built with Rust’s memory safety and concurrent processing capabilities, UV handles package resolution more efficiently than pip’s Python-based implementation. In real-world testing, UV often installs packages 5-10x faster than pip, particularly in environments with many dependencies. For data professionals working with complex libraries like pandas, numpy, scikit-learn, or pyspark, this speed difference isn’t just convenient – it’s transformative for workflow efficiency.

  • UV
  • Python
  • Data Testing
  • Data Transformation
  • Development Tools
Saturday, January 11, 2025
Mastering Data Engineering: Insights and Best Practices

Introduction I have been working with data for a bit over 17 years now, and I have seen it evolve from its nascent stages to a cornerstone of the tech industry. The journey has been nothing short of revolutionary, impacting businesses and society at large. Along the way, the role of a data engineer has expanded, requiring not just technical skills but a deep understanding of business, security, and the human element within technology.

  • Culture
  • Continuous Learning
  • Data Quality
  • Professional Growth
  • Data Pipeline
  • Data System Resilience
  • Team Collaboration
Saturday, March 30, 2024
How to Find and Attract Top Data Engineers

Introduction Whenever I have open positions to fill, I tend to get inundated with resumes. As I sift through applications, my reaction varies from “this might work” to a straightforward “no”. Rarely do I encounter a resume that makes me exclaim, “This person is exceptional! We need them on our team.” Despite reviewing thousands of job applications, I still find that the search for a standout Data Engineer is a hard one. I believe there’s a reason for this rarity. The truth is that the most talented Data Engineers, along with top professionals in any field, are seldom actively seeking employment.

  • Culture
  • Employee Engagement
  • Hiring Strategies
  • Talent Acquisition
  • Recruitment
Thursday, March 14, 2024
Data Vault Data Modeling with Python and dbt

Introduction Data Vault is a data modeling technique that is specifically designed for use in Data Warehouses. It is a hybrid approach that combines the best elements of 3rd Normal Form (3NF) and Star Schema to provide a flexible and scalable data modeling solution. Hubs, Links, Satellites A Data Vault consists of three main components: Hubs, Links, and Satellites. Hubs are the backbone of the Data Vault architecture and represent the entities within the data model. They are the core data elements and contain the primary key information.
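As a rough illustration of the Hub idea described here, the following Python sketch builds hub rows from a list of business keys. The excerpt only says that Hubs hold the primary key information; the hash key, load date, and record source columns, along with the names `hub_hash_key`, `build_hub_rows`, and `customer_id`, are assumptions borrowed from common Data Vault practice rather than details taken from the post.

```python
import hashlib
from datetime import datetime, timezone


def hub_hash_key(business_key: str) -> str:
    """Derive a deterministic hash key from a business key (assumed convention)."""
    # Normalize first so that ' c001' and 'C001 ' map to the same hub row.
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_hub_rows(business_keys, record_source, load_date):
    """Return one hub row per distinct business key, in first-seen order."""
    seen = set()
    rows = []
    for key in business_keys:
        hash_key = hub_hash_key(key)
        if hash_key in seen:
            continue  # a hub stores each business key only once
        seen.add(hash_key)
        rows.append(
            {
                "customer_hk": hash_key,              # hypothetical column name
                "customer_id": key.strip().upper(),   # the business key itself
                "load_date": load_date,
                "record_source": record_source,
            }
        )
    return rows


load_date = datetime(2023, 2, 26, tzinfo=timezone.utc)
hub_customer = build_hub_rows([" c001", "C001 ", "C002"], "crm", load_date)
```

Here `hub_customer` ends up with two rows, because the first two inputs normalize to the same business key. Links and Satellites would then reference `customer_hk` rather than the raw key.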

  • Data Vault
  • Python
  • DBT
  • ETL
  • Data Warehouse Architecture
Sunday, February 26, 2023
Choosing the Right File Format for Big Data: A Comparison of Parquet, JSON, ORC, Avro, and CSV

Introduction How you store your data is a critical part of data engineering, because the file format determines the speed, efficiency, and compatibility of data storage and retrieval. Let’s have a look at some of the popular file formats: Parquet, JSON, ORC, Avro, and CSV. We’ll compare their pros and cons, the performance differences between reading and writing, and the importance of predicate pushdown and projection pushdown. What are predicate pushdown and projection pushdown? Predicate pushdown and projection pushdown are two performance optimization techniques used in big data processing. They allow query engines to reduce the amount of data that needs to be processed by pushing down filter conditions and column projections to the storage layer.
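To make the two techniques concrete, here is a toy sketch in plain Python. It is not how any particular engine or file format implements pushdown: `RowGroup`, `scan`, and the sample data are hypothetical, and the min/max statistics are computed on the fly, whereas a columnar format such as Parquet would typically read them from file metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows stored column by column (hypothetical toy structure)."""
    columns: dict

    def stats(self, name):
        # Computed here for simplicity; real formats store min/max in metadata.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, wanted_columns, filter_column, min_value):
    """Return rows where filter_column >= min_value, reading as little as possible."""
    rows = []
    groups_read = 0
    for group in row_groups:
        # Predicate pushdown: skip the whole group if its max rules out any match.
        _, high = group.stats(filter_column)
        if high < min_value:
            continue
        groups_read += 1
        matches = [
            i for i, value in enumerate(group.columns[filter_column])
            if value >= min_value
        ]
        # Projection pushdown: only the requested columns are materialized.
        for i in matches:
            rows.append({name: group.columns[name][i] for name in wanted_columns})
    return rows, groups_read


row_groups = [
    RowGroup({"order_id": [1, 2, 3], "amount": [10, 25, 40], "country": ["NZ", "AU", "NZ"]}),
    RowGroup({"order_id": [4, 5, 6], "amount": [120, 300, 150], "country": ["US", "NZ", "AU"]}),
]

rows, groups_read = scan(row_groups, ["order_id", "amount"], "amount", 130)
```

With this data the first row group is skipped entirely because its largest `amount` is below the threshold, and the `country` column is never touched. Row-oriented text formats such as CSV and JSON offer no comparable statistics or column layout, so a reader generally has to parse every row.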

  • File Formats
  • ORC
  • AVRO
  • CSV
  • JSON
  • Parquet
  • Schema Evolution
Sunday, February 12, 2023