Introduction


How you store your data is a critical part of data engineering: the file format you choose determines the speed, efficiency, and compatibility of data storage and retrieval.

Let's take a look at some of the most popular file formats: Parquet, JSON, ORC, Avro, and CSV. We'll compare their pros and cons, their read and write performance, and the importance of predicate pushdown and projection pushdown.

What are predicate pushdown and projection pushdown?


Predicate pushdown and projection pushdown are two performance optimization techniques used in big data processing. They allow query engines to reduce the amount of data that needs to be processed by pushing down filter conditions and column projections to the storage layer.

Predicate pushdown refers to the process of applying filter conditions to the data before it is read into memory. By filtering the data at the storage layer, predicate pushdown reduces the amount of data that needs to be read and processed, which can lead to significant performance improvements for analytical queries.

Example:

select * 
from 
  table 
where 
  condition

The query engine sends the query to the storage layer, including the filter condition (e.g., the WHERE clause). The storage layer applies the filter before the data is read into memory, reducing the amount of data the query engine has to process. The result set returned to the query engine contains only the rows that satisfy the condition, leading to improved performance and reduced I/O.
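As a rough sketch of the same idea in Python, pyarrow can push a filter down into a Parquet read (the file name table.parquet and the column amount are placeholders, and pyarrow is assumed to be installed):

import pyarrow.parquet as pq

# The filter is applied inside the Parquet reader: row groups whose column
# statistics cannot satisfy the condition are skipped instead of being read.
table = pq.read_table('table.parquet', filters=[('amount', '>', 100)])
df = table.to_pandas()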

Projection pushdown refers to the process of reading only the required columns from the data, instead of reading the entire row. By reducing the amount of data that needs to be read, projection pushdown can also lead to improved performance and reduced I/O.

Example:

select 
  column1, 
  column2 
from 
  table

The query engine sends the query to the storage layer, including the list of required columns (e.g., SELECT column1, column2). The storage layer reads only those columns instead of entire rows, reducing the amount of data the query engine has to process. The result set returned to the query engine contains only the required columns, leading to improved performance and reduced I/O.
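The Python equivalent, again as a minimal sketch with placeholder file and column names, is to ask the Parquet reader for specific columns only:

import pandas as pd

# Only column1 and column2 are read from the file; all other columns are skipped.
df = pd.read_parquet('table.parquet', columns=['column1', 'column2'])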

Both predicate pushdown and projection pushdown are supported by several big data file formats, including Parquet and ORC, which are optimized for analytical processing. By leveraging these techniques, query engines can perform faster and more efficient big data processing.

What are the common file formats used in Big Data?


Parquet

Parquet is a columnar file format that is optimized for fast, efficient, and reliable storage and retrieval of large amounts of data. It was designed for use in the Apache Hadoop ecosystem and is used in many big data and data warehousing applications.

Pros:

Columnar storage: Parquet stores data in columns rather than rows, which makes it highly efficient for analytical queries that only access a small subset of columns.

Compression: Parquet supports various compression algorithms, including Snappy, Gzip, and LZO, to reduce the storage footprint of data (see the sketch at the end of this section).

Predicate pushdown: Parquet supports predicate pushdown, which allows query engines to filter data before it is read from disk. This can greatly improve query performance.

Cons:

Complexity: Parquet has a more complex data model and file layout than simpler formats such as CSV or JSON, which can make it challenging to use for some applications.

Limited compatibility: Parquet is not compatible with all data processing tools and systems.

Performance:
  • Read: Parquet’s columnar storage format makes it highly efficient for reading data, especially for analytical queries that only access a subset of columns.
  • Write: Writing data to a Parquet file can be slower than with other file formats, especially for small amounts of data.
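To illustrate the compression and columnar-read points above, here is a minimal, self-contained sketch using Pandas (it assumes pyarrow or fastparquet is installed; the file names are placeholders):

import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane'], 'age': [30, 32]})

# Pick a compression codec per file: Snappy is a common default, Gzip is smaller but slower.
df.to_parquet('data_snappy.parquet', compression='snappy')
df.to_parquet('data_gzip.parquet', compression='gzip')

# Columnar layout: reading a single column does not load the rest of the file.
ages = pd.read_parquet('data_snappy.parquet', columns=['age'])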

JSON

JSON is a text-based data format that is commonly used to store and exchange data on the web. It is easy to use and is supported by a wide range of programming languages and data processing tools.

Pros:

Human-readable: JSON data is easy to read and understand, which makes it a good choice for applications that need to be human-readable.

Wide compatibility: JSON is supported by many programming languages and data processing tools, which makes it a versatile choice for many applications.

Cons:

Inefficient storage: JSON data is stored in a row-based format, which is less efficient than columnar storage for analytical queries.

No predicate pushdown: JSON does not support predicate pushdown, which means that query engines must read the entire dataset before filtering data.

Performance:
  • Read: Reading JSON data can be slower than with other file formats, especially for large amounts of data.
  • Write: Writing JSON data is relatively fast, especially for small amounts of data.

ORC


ORC (Optimized Row Columnar) is a file format that is optimized for fast, efficient, and reliable storage and retrieval of large amounts of data. It is similar to Parquet in many ways but is optimized for use with Apache Hive and other systems in the Hadoop ecosystem.

Pros:

Columnar storage: ORC stores data in columns, which makes it highly efficient for analytical queries that only access a subset of columns.

Compression: ORC supports various compression algorithms, including Snappy, Zlib, and LZO, to reduce the storage footprint of data.

Predicate pushdown: ORC supports predicate pushdown, which can greatly improve query performance by filtering data before it is read from disk.

Cons:

Limited compatibility: ORC is not compatible with all data processing tools and systems outside of the Hadoop ecosystem.

Performance:
  • Read: ORC’s columnar storage format makes it highly efficient for reading data, especially for analytical queries that only access a subset of columns.
  • Write: Writing data to an ORC file can be slower than with other file formats, especially for small amounts of data.

Avro


Avro is a data serialization system that is designed for high performance and ease of use. It is widely used in the Hadoop ecosystem and supports rich data structures and schema evolution.

Pros:

Schema evolution: Avro supports schema evolution, which allows you to change the schema of your data without having to rewrite the entire dataset (see the sketch at the end of this section).

Compression: Avro supports various compression algorithms, including Snappy, Deflate, and Bzip2, to reduce the storage footprint of data.

Cons:

Limited compatibility: Avro is not compatible with all data processing tools and systems.

Performance:
  • Read: Reading Avro data is relatively fast, especially for large amounts of data.
  • Write: Writing Avro data can be slower than with other file formats, especially for small amounts of data.
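To illustrate the schema evolution point above, here is a minimal sketch using the fastavro library; the Person record and its fields are made up for the example. Data written with an old schema is read back with a newer schema that adds a field with a default value:

import fastavro

# Writer schema (version 1): records contain only a 'name' field.
v1 = fastavro.parse_schema({
    'name': 'Person', 'type': 'record',
    'fields': [{'name': 'name', 'type': 'string'}]})

# Reader schema (version 2): adds an 'age' field with a default value.
v2 = fastavro.parse_schema({
    'name': 'Person', 'type': 'record',
    'fields': [{'name': 'name', 'type': 'string'},
               {'name': 'age', 'type': 'int', 'default': -1}]})

with open('people.avro', 'wb') as out:
    fastavro.writer(out, v1, [{'name': 'John'}])

# Old records are resolved against the new schema; the missing field gets its default.
with open('people.avro', 'rb') as inp:
    records = list(fastavro.reader(inp, reader_schema=v2))
    # records == [{'name': 'John', 'age': -1}]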

CSV


CSV (Comma-Separated Values) is a simple and widely used file format for storing data. It is easy to use and is supported by a wide range of programming languages and data processing tools.

Pros:

Simple: CSV is a simple file format that is easy to use and understand.

Wide compatibility: CSV is supported by many programming languages and data processing tools, which makes it a versatile choice for many applications.

Cons:

Inefficient storage: CSV data is stored in a row-based format, which is less efficient than columnar storage for analytical queries.

No predicate pushdown: CSV does not support predicate pushdown, which means that query engines must read the entire dataset before filtering data.

Performance:
  • Read: Reading CSV data can be slower than with other file formats, especially for large amounts of data.
  • Write: Writing CSV data is relatively fast, especially for small amounts of data.

Considerations


Type 2 data (slowly changing dimension Type 2, or SCD2) is data that changes over time while maintaining a history of previous values. When choosing a file format for type 2 data, you should consider factors such as storage size, performance, and ease of use.

Based on these factors, the best file format for type 2 data would be Parquet or ORC. Both of these file formats are optimized for big data storage and use columnar storage, which allows them to store data more efficiently and reduce storage size compared to other file formats. They also provide excellent performance for analytical processing, making them well-suited for type 2 data.

In addition, Parquet and ORC both support predicate pushdown, which allows query engines to reduce the amount of data that needs to be processed by pushing down filter conditions to the storage layer. This can be particularly useful for type 2 data, where you may need to filter data based on the version or time period of the changes.
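For example, assuming a type 2 dimension stored as Parquet with hypothetical valid_from and valid_to timestamp columns, an "as of" filter can be pushed down to the reader like this:

from datetime import datetime

import pandas as pd

# Only rows valid on the given date are read; the filter is pushed down to the
# Parquet reader (pyarrow engine), so non-matching row groups are skipped.
as_of = datetime(2023, 1, 1)
current_rows = pd.read_parquet(
    'customer_dim.parquet',
    engine='pyarrow',
    filters=[('valid_from', '<=', as_of), ('valid_to', '>=', as_of)],
)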

Finally, both Parquet and ORC have good compatibility with popular big data processing tools, such as Apache Spark and Hive, making them easy to use for big data processing.

Avro, on the other hand, is a good choice for projects that require data interoperability, schema evolution, big data processing, and schema management. However, it's worth noting that Avro's storage-size benefits are not as significant as those of Parquet and ORC, and its performance for analytical processing may not match those formats.

Python Tutorial


In this section, we will show you how to save a Pandas dataframe in each of the five file formats and how to read from each file format. The Parquet, ORC, and Avro examples rely on the pyarrow and fastavro libraries, so make sure they are installed (for example, pip install pyarrow fastavro).

First, let’s create a simple Pandas dataframe:

import pandas as pd

data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [30, 32, 35, 29],
        'city': ['New York', 'London', 'Paris', 'Berlin']}
df = pd.DataFrame(data)

Saving the dataframe

Parquet

df.to_parquet('data.parquet')

JSON

df.to_json('data.json')

ORC

import pyarrow as pa
import pyarrow.orc as orc

# Convert the dataframe to an Arrow table, then write it out in ORC format
table = pa.Table.from_pandas(df)
orc.write_table(table, 'data.orc')

AVRO

import fastavro

# Define an Avro schema that matches the dataframe's columns
schema = fastavro.parse_schema({
    'name': 'Person', 'type': 'record',
    'fields': [{'name': 'name', 'type': 'string'},
               {'name': 'age', 'type': 'int'}, {'name': 'city', 'type': 'string'}]})

with open('data.avro', 'wb') as out:
    fastavro.writer(out, schema, df.to_dict(orient='records'))

CSV

df.to_csv('data.csv', index=False)

Reading the data frame

Parquet

df = pd.read_parquet('data.parquet')

JSON

df = pd.read_json('data.json')

ORC

# pyarrow.orc was imported as 'orc' in the saving section above
table = orc.read_table('data.orc')
df = table.to_pandas()

AVRO

with open('data.avro', 'rb') as inp:
    df = pd.DataFrame(list(fastavro.reader(inp)))

CSV

df = pd.read_csv('data.csv')

In conclusion, each of these file formats has its own pros and cons, and the best choice depends on the specific use case and requirements. Parquet, ORC, and Avro are optimized for analytical queries and big data processing, while JSON and CSV are simpler and more versatile. To get the best performance, it is important to choose the right file format and to take advantage of features like predicate pushdown and projection pushdown when available.