Ghost in the data
Embracing Defensive Engineering: A Proactive Approach to Data Pipeline Integrity

Introduction

Have you ever had a data pipeline fall apart due to unexpected errors? In the ever-evolving landscape of data, surprises lurk around every corner. Defensive engineering, a methodology focused on preempting and mitigating data anomalies, plays a crucial role in building reliable data pipelines. It’s not just about fixing problems as they arise; it’s about anticipating potential issues and addressing them before they wreak havoc. Below I’ll explore the various facets of defensive engineering, from the basics of handling nulls and type mismatches to the more complex challenges of ensuring data integrity and handling late-arriving data. Whether you’re a seasoned data engineer or just starting out, understanding these principles is key to creating data pipelines that are not just functional, but also robust and secure in the face of unpredictable data challenges.
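As a rough illustration of the "handling nulls and type mismatches" idea mentioned above, here is a minimal sketch in plain Python. The record layout (`customer_id`, `amount`) and the `clean_record` helper are hypothetical and chosen only for this example; they are not taken from the post itself.

```python
def clean_record(record):
    """Validate one raw record defensively.

    Returns a (cleaned_record, errors) pair instead of raising, so a
    pipeline can route bad rows to a quarantine area and keep running.
    Field names here are illustrative assumptions.
    """
    errors = []

    # Null handling: a missing key and an explicit None are treated alike.
    customer_id = record.get("customer_id")
    if customer_id is None:
        errors.append("customer_id is missing")

    # Type mismatch handling: amount may arrive as a string, a number, or junk.
    raw_amount = record.get("amount")
    amount = None
    if raw_amount is None or raw_amount == "":
        errors.append("amount is missing")
    else:
        try:
            amount = float(raw_amount)
        except (TypeError, ValueError):
            errors.append(f"amount is not numeric: {raw_amount!r}")

    return {"customer_id": customer_id, "amount": amount}, errors


good, good_errors = clean_record({"customer_id": 1, "amount": "12.5"})
bad, bad_errors = clean_record({"customer_id": None, "amount": "abc"})
```

Returning the errors alongside the cleaned record, rather than raising on the first problem, is one possible design choice: it lets the caller decide whether to drop, quarantine, or repair the row.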

  • Data Modelling
Sunday, February 11, 2024
Navigating the Data Labyrinth: The Art of Data Profiling

Introduction

Imagine navigating a sprawling network of interconnected threads, each strand holding a vital clue. That’s the world of data for us, and profiling is our key to unlocking its secrets. It’s like deciphering a cryptic message, each character a piece of information waiting to be understood. But why is this so important? Ever encountered an error in your analysis, or a misleading conclusion based on faulty data? Data profiling helps us avoid these pitfalls by ensuring the data we work with is accurate, consistent, and ready to yield valuable insights. It’s like building a sturdy foundation before constructing a skyscraper.

  • Data Modelling
Sunday, January 28, 2024
Taming the Chaos: Your Guide to Data Normalisation

Introduction

Have you ever felt like you were drowning in a sea of data, where every byte seemed to play a game of hide and seek? In the digital world, where data reigns supreme, it’s not uncommon to find oneself navigating through a labyrinth of disorganised, redundant, and inconsistent information. But fear not, brave data navigators! There exists a beacon of order in this chaos: data normalisation. Data normalisation isn’t just a set of rules to follow; it’s the art of bringing structure and clarity to your data universe. It’s about transforming a jumbled jigsaw puzzle into a masterpiece of organisation, where every piece fits perfectly. Let’s embark on a journey to demystify this hero of the database world and discover how it can turn your data nightmares into a dream of efficiency and accuracy.

  • Data Modelling
Sunday, January 21, 2024
Delta-lake - Z-Ordering, Z-Cube, Liquid Clustering and Partitions

Introduction

Ever feel like your data lake is more of a data swamp, swallowing queries whole and spitting out eternity? You’re not alone. Managing massive datasets can be a Herculean task, especially when it comes to squeezing out those precious milliseconds of query performance. But fear not, data warriors, for Delta Lake has hidden treasures waiting to be unearthed: Z-ordering, Z-cube, and liquid clustering.

Partition Pruning: The OG Hero

Before we dive into these exotic beasts, let’s pay homage to the OG hero of data organization: partition pruning. Imagine your data lake as a meticulously organized library, with each book (partition) shelved by a specific topic (partition column). When a query saunters in, it doesn’t have to wander through every aisle. It simply heads straight for the relevant section, drastically reducing the time it takes to find what it needs. That’s the magic of partition pruning!
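The library analogy above can be sketched in a few lines of plain Python. This is a toy model of the idea only, not Delta Lake's implementation or API; the `event_date` partition column, the sample rows, and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample data; event_date plays the role of the partition column.
rows = [
    {"event_date": "2024-01-01", "user": "a", "amount": 10},
    {"event_date": "2024-01-01", "user": "b", "amount": 5},
    {"event_date": "2024-01-02", "user": "a", "amount": 7},
    {"event_date": "2024-01-03", "user": "c", "amount": 3},
]


def build_partitions(rows, column):
    """Group rows by the partition column, like shelving books by topic."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def query(partitions, wanted_value):
    """Return matching rows plus the number of partitions actually read."""
    scanned = 0
    result = []
    for value, part_rows in partitions.items():
        if value != wanted_value:
            continue  # pruned: this partition's rows are never touched
        scanned += 1
        result.extend(part_rows)
    return result, scanned


partitions = build_partitions(rows, "event_date")
matches, scanned = query(partitions, "2024-01-02")
```

Here the query reads one of the three partitions; the rows in the other two are skipped based on the partition value alone.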

  • Delta-lake
Sunday, January 14, 2024
2023 - Books that are worth your time?

Introduction

As a Data Engineer, it’s crucial to constantly improve your skills and knowledge to stay ahead of the curve. Whether it’s working with large data sets, building efficient data pipelines, or collaborating with a team, there are many different aspects to consider. To help you succeed, I’ve put together a list of books that cover a range of topics, from culture and team building to Python and SQL. Each of the books I’ve selected offers valuable insights and practical advice to help you become a better Data Engineer. Whether you’re looking to strengthen your coding skills, learn how to effectively communicate with your team, or improve your organization’s data processes, there’s something here for everyone. So, without further ado, let’s dive into the books that can help you take your skills to the next level.

  • Development
Sunday, March 5, 2023
Data Vault Data Modeling with Python and dbt

Introduction

Data Vault is a data modeling technique that is specifically designed for use in Data Warehouses. It is a hybrid approach that combines the best elements of 3rd Normal Form (3NF) and Star Schema to provide a flexible and scalable data modeling solution.

Hubs, Links, Satellites

A Data Vault consists of three main components: Hubs, Links, and Satellites. Hubs are the backbone of the Data Vault architecture and represent the entities within the data model. They are the core data elements and contain the primary key information.

  • Data Vault
  • Python
  • DBT
  • ETL
  • Data Warehouse Architecture
Sunday, February 26, 2023
Navigating Incident Response Management with DevOps

Introduction

Incident response management (IRM) is a critical aspect of any organization’s overall security and risk management strategy. In today’s fast-paced, technology-driven world, IT incidents can occur at any time, and it’s important to have a plan in place to effectively manage these incidents and minimize the impact they have on your organization. The IRM lifecycle is a structured approach to managing incidents, from identification to resolution, and it involves a range of activities, including communication, coordination, and control. In this post, I’ll explore the IRM lifecycle in detail, and discuss the roles and responsibilities of different individuals during each stage. I’ll also compare traditional incident management with DevOps incident management, and discuss the advantages of adopting a DevOps approach.

  • Incident Response
  • Risk Management
Sunday, February 19, 2023
Choosing the Right File Format for Big Data: A Comparison of Parquet, JSON, ORC, Avro, and CSV

Introduction

How you store your data is a critical component of data engineering, as it determines the speed, efficiency, and compatibility of data storage and retrieval. Let’s have a look at some of the popular file formats: Parquet, JSON, ORC, Avro, and CSV. We’ll compare their pros and cons, performance differences between reading and writing, and the importance of predicate pushdown and projection pushdown.

What is Predicate pushdown and Projection pushdown?

Predicate pushdown and projection pushdown are two performance optimization techniques used in big data processing. They allow query engines to reduce the amount of data that needs to be processed by pushing down filter conditions and column projections to the storage layer.
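To make the two terms above concrete, here is a small sketch in plain Python using only the standard library. The `scan` function is a hypothetical stand-in for a storage-layer reader, and the sample data and column names are invented for this example. It only shows where the filter and column selection are applied; a row-based format like CSV still has to be read line by line, whereas columnar formats such as Parquet can use stored metadata to skip data entirely.

```python
import csv
import io

# Hypothetical sample file contents.
RAW = "id,country,amount\n1,AU,10\n2,NZ,20\n3,AU,30\n"


def scan(source, columns=None, predicate=None):
    """Toy storage-layer scan.

    predicate: row filter applied while reading (predicate pushdown).
    columns:   columns to keep while reading (projection pushdown).
    """
    reader = csv.DictReader(io.StringIO(source))
    for row in reader:
        if predicate is not None and not predicate(row):
            continue  # filtered at the source, never handed to the engine
        if columns is not None:
            row = {name: row[name] for name in columns}
        yield row


# With pushdown: the "engine" only ever receives the rows and columns it needs.
pushed = list(
    scan(RAW, columns=["id", "amount"], predicate=lambda r: r["country"] == "AU")
)

# Without pushdown: everything is read first, then filtered and trimmed later.
everything = list(scan(RAW))
late = [
    {"id": r["id"], "amount": r["amount"]}
    for r in everything
    if r["country"] == "AU"
]
```

Both paths produce the same answer; the difference is how much data crosses the boundary between storage and the query engine (two trimmed rows versus three full rows in this toy case).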

  • File Formats
  • ORC
  • AVRO
  • CSV
  • JSON
  • Parquet
  • Schema Evolution
Sunday, February 12, 2023
Onboarding a data team

Introduction

Onboarding is so important, both for making a great first impression and for setting up the scaffolding of what a new employee should expect the culture to be like at a company. Needless to say, it’s not normally a great experience: it usually starts with spending the first two weeks getting access to systems and tools. Then, once you have access, especially in a remote working environment, sometimes you get only a brief introduction with your manager and a small glimpse of what you will be working on. This common scenario makes people feel less connected to the team, with no excitement or passion for the work. They feel undervalued, which isn’t a great start from day one or two.

  • Development
  • Onboarding
  • Employee Engagement
  • Culture
  • Tools and Access
  • Remote Work
Sunday, January 29, 2023
Create a Minecraft (PE) Bedrock Server in GCP

Introduction

My eldest daughter plays Minecraft Pocket Edition (PE) a lot with her friends. The PE edition really limits them to Wii U, Xbox or iPad. This is the most convenient gaming version for them as they all have iPads. Occasionally the world they are all playing on gets corrupted and she loses all the work they have done. There are ways to recover it, but it’s rather time consuming.

  • GCP
  • Minecraft
  • Cloud Gaming
  • Server Setup
  • Family Gaming
  • Bedrock Edition
Sunday, January 15, 2023
Journalling

Introduction

Journalling is a habit that helps us reflect on the year, but also allows us to coach ourselves to be better. There is an element of getting it down in writing that allows us to “slow” down and focus on what we are writing. There is something about journalling in physical form: when you reflect back on a past journal there is some sort of bond you have with your handwriting, and you can also go back to that moment in time and remember when you wrote it.

  • Development
  • Journal
  • Journaling Techniques
  • Success Habits
  • Self-Reflection
  • Mindfulness Practices
  • Inspirational Quote
  • Personal Growth
Sunday, January 8, 2023
Setting up GitBash with SSH on Windows

Introduction

Git is an everyday tool for me. I remember reading about Distributed Version Control from Joel Spolsky years ago, when we decided to switch from SVN to Git. Although I use it daily, most of the time it is set and forget when it comes to SSH keys for repositories. So whenever I get a new laptop or need to re-initialize a repo, I have to re-teach myself the steps to get it working. So this is a bit of a guide to help me navigate and get back on track when I’m lost next time.

  • Github
  • SSH
  • Git
  • GitBash
  • SSH Keys
Sunday, January 1, 2023