Modern AI models rely heavily on large volumes of data to deliver accurate predictions. However, loading and preprocessing large datasets can be time-consuming and frustrating.
At my previous job, for instance, training a machine learning model on tens of millions of records required an entire day just to load and preprocess the data. The company had carried out a thorough digital transformation in which all equipment data was automatically uploaded to the cloud. Despite this, there was a significant drawback: the data could only be downloaded from the cloud system as CSV, and every collaborating department handled the data directly in that format.
While the CSV format is easy to open, process, and visualize, it is poorly suited to the large datasets used in machine learning tasks. To address this, it is crucial to understand how to handle data efficiently using alternative formats and parallel processing techniques.
In this post, we will explain different data formats and walk through Python code that compares the time taken to load and preprocess large datasets. We will also explore ways to leverage parallel processing to speed up data handling.
Various Data Formats for Big Data Processing
Different data formats are used for big data processing, each with its advantages and disadvantages. Here, we will compare common data formats and their suitability for big data tasks.
CSV (Comma-Separated Values)
- Description: A plain text format where each value is separated by a comma.
- Advantages: Human-readable, widely supported by analysis tools.
- Disadvantages: Slow read and write speeds for large datasets.
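Even when CSV cannot be avoided, its cost can be reduced by streaming the file in chunks instead of loading it all at once. Below is a minimal sketch using pandas; the file name and chunk size are placeholders, and the column name column2 matches the sample dataset generated later in this post.

import pandas as pd

# Stream a large CSV in fixed-size chunks to keep memory usage bounded.
# "large_data.csv" and the chunk size of 1,000,000 rows are placeholder values.
total_rows = 0
running_sum = 0.0

for chunk in pd.read_csv("large_data.csv", chunksize=1_000_000):
    # Each chunk is an ordinary DataFrame, so preprocessing can run per chunk.
    total_rows += len(chunk)
    running_sum += chunk["column2"].sum()

print(f"rows: {total_rows}, mean of column2: {running_sum / total_rows:.4f}")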
Parquet
- Description: A columnar storage format that provides efficient data compression and encoding schemes.
- Advantages: Fast read speed, efficient storage, ideal for big data.
- Disadvantages: Requires specialized libraries for handling.
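In pandas, Parquet support comes from an optional engine such as pyarrow or fastparquet, which is what "requires specialized libraries" means in practice. The sketch below assumes pyarrow is installed and reuses the column names from the dataset generated later in this post.

import pandas as pd

# Writing Parquet from pandas requires pyarrow or fastparquet to be installed.
df = pd.DataFrame({"column1": [15, 54, 23], "column3": ["C", "C", "A"]})
df.to_parquet("sample.parquet", index=False)

# Because Parquet is columnar, only the requested columns are read from disk,
# which is one reason it scans large datasets quickly.
subset = pd.read_parquet("sample.parquet", columns=["column1"])
print(subset.head())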
HDF5 (Hierarchical Data Format version 5)
- Description: A binary format designed for large scientific datasets.
- Advantages: Supports complex data hierarchies, efficient access to large data.
- Disadvantages: Can be slower to read compared to Parquet.
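pandas writes HDF5 through the PyTables library (the tables package). One useful feature is that data stored in 'table' format with declared data columns can be queried on disk, so only matching rows are loaded. A minimal sketch under those assumptions:

import pandas as pd

df = pd.DataFrame({"column1": [15, 54, 23, 84], "column3": ["C", "C", "A", "C"]})

# 'table' format plus data_columns enables on-disk queries; requires PyTables.
df.to_hdf("sample.h5", key="df", mode="w", format="table", data_columns=["column1"])

# Only rows matching the condition are read into memory.
filtered = pd.read_hdf("sample.h5", key="df", where="column1 > 50")
print(filtered)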
JSON (JavaScript Object Notation)
- Description: A lightweight data interchange format.
- Advantages: Human-readable, easily parsed by most programming languages.
- Disadvantages: Inefficient for large datasets due to verbose structure.
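For large JSON files, the line-delimited variant (JSON Lines) at least allows streaming reads, even though the format itself stays verbose. A minimal sketch, assuming a small sample file written with orient='records' and lines=True:

import pandas as pd

df = pd.DataFrame({"column1": [15, 54, 23], "column3": ["C", "C", "A"]})

# orient='records' with lines=True writes one JSON object per line (JSON Lines).
df.to_json("sample.json", orient="records", lines=True)

# With lines=True, pandas can read the file in chunks instead of parsing it
# all at once; chunksize returns an iterator of DataFrames.
reader = pd.read_json("sample.json", lines=True, chunksize=2)
for chunk in reader:
    print(len(chunk), "rows in this chunk")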
Python Code to Compare Data Loading and Processing
In this evaluation, we generate random datasets of 10 million, 50 million, and 100 million records and save them in CSV, Parquet, HDF5, and JSON formats. We then measure the time taken to load the data in each format, along with its storage size on disk. A sample of the generated data is shown below, followed by the full benchmarking script.
column1 | column2              | column3
15      | 0.032499886126697186 | C
54      | 0.3494937008068031   | C
23      | 0.531620365          | A
84      | 0.5575162159224915   | C
48      | 0.003971126          | B
84      | 0.20341102566009028  | C
22      | 0.8039966596242509   | B
32      | 0.20436228384007749  | C
9       | 0.062749613          | D
import pandas as pd
import numpy as np
import time
import os
import matplotlib.pyplot as plt

# Generate large dataset
def generate_data(num_rows):
    data = {
        'column1': np.random.randint(0, 100, num_rows),
        'column2': np.random.random(num_rows),
        'column3': np.random.choice(['A', 'B', 'C', 'D'], num_rows)
    }
    return pd.DataFrame(data)

# Save data in different formats
def save_data_formats(df, base_filename):
    df.to_csv(f"{base_filename}.csv", index=False)
    df.to_parquet(f"{base_filename}.parquet", index=False)
    df.to_hdf(f"{base_filename}.h5", key='df', mode='w', format='table')
    df.to_json(f"{base_filename}.json", orient='records', lines=True)

# Measure load time and file size
def measure_load_time_and_size(base_filename):
    times = {}
    sizes = {}

    # CSV
    start_time = time.time()
    df_csv = pd.read_csv(f"{base_filename}.csv")
    times['CSV'] = time.time() - start_time
    sizes['CSV'] = os.path.getsize(f"{base_filename}.csv")

    # Parquet
    start_time = time.time()
    df_parquet = pd.read_parquet(f"{base_filename}.parquet")
    times['Parquet'] = time.time() - start_time
    sizes['Parquet'] = os.path.getsize(f"{base_filename}.parquet")

    # HDF5
    start_time = time.time()
    df_hdf5 = pd.read_hdf(f"{base_filename}.h5")
    times['HDF5'] = time.time() - start_time
    sizes['HDF5'] = os.path.getsize(f"{base_filename}.h5")

    # JSON
    start_time = time.time()
    df_json = pd.read_json(f"{base_filename}.json", orient='records', lines=True)
    times['JSON'] = time.time() - start_time
    sizes['JSON'] = os.path.getsize(f"{base_filename}.json")

    return times, sizes

# Set data sizes
data_sizes = [10_000_000, 50_000_000, 100_000_000]

# Store results
results = {}
size_results = {}

for size in data_sizes:
    print(f"Processing data size: {size}")
    df = generate_data(size)
    base_filename = f"data_{size}"
    save_data_formats(df, base_filename)
    times, sizes = measure_load_time_and_size(base_filename)
    os.remove(f"{base_filename}.csv")
    os.remove(f"{base_filename}.parquet")
    os.remove(f"{base_filename}.h5")
    os.remove(f"{base_filename}.json")
    results[size] = times
    size_results[size] = sizes

# Print results
for size, times in results.items():
    print(f"\nData size: {size}")
    for format_name, time_taken in times.items():
        print(f"{format_name}: {time_taken:.2f} seconds")

for size, sizes in size_results.items():
    print(f"\nData size: {size}")
    for format_name, size_taken in sizes.items():
        print(f"{format_name}: {size_taken / (1024 * 1024):.2f} MB")
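Note that running this script end to end assumes the optional pandas dependencies for the binary formats are installed: pyarrow (or fastparquet) for Parquet and tables (PyTables) for HDF5. Generating and saving 100 million rows also needs several gigabytes of RAM and disk space, so a smaller data size may be more practical for a quick local test.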
Data Load Evaluation Results
Performance Comparison by Data Format
- CSV: One of the slowest formats to read across all data sizes. Its plain-text structure makes it inefficient for large data.
- Parquet: Shows the fastest load times across all data sizes. Its columnar storage format offers efficient compression and scanning, making it ideal for big data.
- HDF5: Similar performance to CSV but slightly faster overall. It is well-suited for storing large scientific data but may not perform as well in read operations compared to Parquet.
- JSON: Consistently has the longest load times. While it is easy to read and parse, its verbose structure makes it inefficient for handling large datasets.
Performance Changes by Data Size
- CSV and HDF5: Load times increase roughly linearly with data size. Neither format gains a structural advantage in this test, so read cost scales in proportion to data volume.
- Parquet: Shows a relatively low increase in load time with increasing data size. Its columnar format and efficient compression contribute to minimal overhead as data grows.
- JSON: Exhibits a steep increase in load time with data size, highlighting its inefficiency for big data processing.
Data Storage Space Evaluation Results
- CSV: Simple and widely used but inefficient for large data storage.
- Parquet: Offers the highest compression rates and is the most efficient for large data storage.
- HDF5: Suitable for big data storage but less efficient than Parquet.
- JSON: Has the largest file size and is the least efficient for storing big data.
Conclusion and Recommendations
- JSON: High readability and versatility but unsuitable for big data processing.
- CSV: Simple and widely compatible but inefficient in terms of performance.
- HDF5: Strong in data storage but falls short in load performance, making it less ideal for big data processing.
- Parquet: Most efficient for handling big data in both load time and storage space, making it the recommended choice for big data tasks.
In conclusion, choosing the right data format is crucial for efficiently handling large datasets in AI and machine learning. By leveraging formats like Parquet, you can achieve significant improvements in both processing time and storage efficiency.