Modern AI models rely heavily on large volumes of data to deliver accurate predictions. However, loading and preprocessing large datasets can be time-consuming and frustrating.
At my previous job, for instance, training a machine learning model on tens of millions of records required an entire day just to load and preprocess the data. The company had carried out a thorough digital transformation in which all equipment data was automatically uploaded to the cloud. Despite this, there was a significant drawback: the data could only be downloaded from the cloud system as CSV, and every collaborating department handled the data directly in that format.
While the CSV format is easy to open, process, and visualize, it is poorly suited to the large datasets used in machine learning tasks. To address this, it is crucial to understand how to handle data efficiently using alternative formats and parallel processing techniques.
In this post, we will explain different data formats and walk through Python code that compares the time taken to load and preprocess large datasets. We will also explore ways to leverage parallel processing to speed up data handling.
Various Data Formats for Big Data Processing
Different data formats are used for big data processing, each with its advantages and disadvantages. Here, we will compare common data formats and their suitability for big data tasks.
CSV (Comma-Separated Values)
- Description: A plain text format where each value is separated by a comma.
- Advantages: Human-readable, widely supported by analysis tools.
- Disadvantages: Slow read and write speeds for large datasets.
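Even when CSV cannot be avoided, its cost can be reduced by streaming the file in chunks instead of loading it all at once. Below is a minimal sketch using pandas; the file name and chunk size are placeholders, and the column name column2 matches the sample dataset generated later in this post.

import pandas as pd

# Stream a large CSV in fixed-size chunks to keep memory usage bounded.
# "large_data.csv" and the chunk size of 1,000,000 rows are placeholder values.
total_rows = 0
running_sum = 0.0

for chunk in pd.read_csv("large_data.csv", chunksize=1_000_000):
    # Each chunk is an ordinary DataFrame, so preprocessing can run per chunk.
    total_rows += len(chunk)
    running_sum += chunk["column2"].sum()

print(f"rows: {total_rows}, mean of column2: {running_sum / total_rows:.4f}")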
Parquet
- Description: A columnar storage format that provides efficient data compression and encoding schemes.
- Advantages: Fast read speed, efficient storage, ideal for big data.
- Disadvantages: Requires specialized libraries for handling.
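In pandas, Parquet support comes from an optional engine such as pyarrow or fastparquet, which is what "requires specialized libraries" means in practice. The sketch below assumes pyarrow is installed and reuses the column names from the dataset generated later in this post.

import pandas as pd

# Writing Parquet from pandas requires pyarrow or fastparquet to be installed.
df = pd.DataFrame({"column1": [15, 54, 23], "column3": ["C", "C", "A"]})
df.to_parquet("sample.parquet", index=False)

# Because Parquet is columnar, only the requested columns are read from disk,
# which is one reason it scans large datasets quickly.
subset = pd.read_parquet("sample.parquet", columns=["column1"])
print(subset.head())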
HDF5 (Hierarchical Data Format version 5)
- Description: A binary format designed for large scientific datasets.
- Advantages: Supports complex data hierarchies, efficient access to large data.
- Disadvantages: Can be slower to read compared to Parquet.
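pandas writes HDF5 through the PyTables library (the tables package). One useful feature is that data stored in 'table' format with declared data columns can be queried on disk, so only matching rows are loaded. A minimal sketch under those assumptions:

import pandas as pd

df = pd.DataFrame({"column1": [15, 54, 23, 84], "column3": ["C", "C", "A", "C"]})

# 'table' format plus data_columns enables on-disk queries; requires PyTables.
df.to_hdf("sample.h5", key="df", mode="w", format="table", data_columns=["column1"])

# Only rows matching the condition are read into memory.
filtered = pd.read_hdf("sample.h5", key="df", where="column1 > 50")
print(filtered)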
JSON (JavaScript Object Notation)
- Description: A lightweight data interchange format.
- Advantages: Human-readable, easily parsed by most programming languages.
- Disadvantages: Inefficient for large datasets due to verbose structure.
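For large JSON files, the line-delimited variant (JSON Lines) at least allows streaming reads, even though the format itself stays verbose. A minimal sketch, assuming a small sample file written with orient='records' and lines=True:

import pandas as pd

df = pd.DataFrame({"column1": [15, 54, 23], "column3": ["C", "C", "A"]})

# orient='records' with lines=True writes one JSON object per line (JSON Lines).
df.to_json("sample.json", orient="records", lines=True)

# With lines=True, pandas can read the file in chunks instead of parsing it
# all at once; chunksize returns an iterator of DataFrames.
reader = pd.read_json("sample.json", lines=True, chunksize=2)
for chunk in reader:
    print(len(chunk), "rows in this chunk")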
Python Code to Compare Data Loading and Processing
In this evaluation, we generate random datasets of 10 million, 50 million, and 100 million records and save them in CSV, Parquet, HDF5, and JSON formats. We then measure the time taken to load the data in each format, along with its storage size on disk. A sample of the generated data is shown below, followed by the full benchmarking script.
column1 | column2              | column3
15      | 0.032499886126697186 | C
54      | 0.3494937008068031   | C
23      | 0.531620365          | A
84      | 0.5575162159224915   | C
48      | 0.003971126          | B
84      | 0.20341102566009028  | C
22      | 0.8039966596242509   | B
32      | 0.20436228384007749  | C
9       | 0.062749613          | D
import pandas as pd
import numpy as np
import time
import os
import matplotlib.pyplot as plt

# Generate large dataset
def generate_data(num_rows):
    data = {
        'column1': np.random.randint(0, 100, num_rows),
        'column2': np.random.random(num_rows),
        'column3': np.random.choice(['A', 'B', 'C', 'D'], num_rows)
    }
    return pd.DataFrame(data)

# Save data in different formats
def save_data_formats(df, base_filename):
    df.to_csv(f"{base_filename}.csv", index=False)
    df.to_parquet(f"{base_filename}.parquet", index=False)
    df.to_hdf(f"{base_filename}.h5", key='df', mode='w', format='table')
    df.to_json(f"{base_filename}.json", orient='records', lines=True)

# Measure load time and file size
def measure_load_time_and_size(base_filename):
    times = {}
    sizes = {}

    # CSV
    start_time = time.time()
    df_csv = pd.read_csv(f"{base_filename}.csv")
    times['CSV'] = time.time() - start_time
    sizes['CSV'] = os.path.getsize(f"{base_filename}.csv")

    # Parquet
    start_time = time.time()
    df_parquet = pd.read_parquet(f"{base_filename}.parquet")
    times['Parquet'] = time.time() - start_time
    sizes['Parquet'] = os.path.getsize(f"{base_filename}.parquet")

    # HDF5
    start_time = time.time()
    df_hdf5 = pd.read_hdf(f"{base_filename}.h5")
    times['HDF5'] = time.time() - start_time
    sizes['HDF5'] = os.path.getsize(f"{base_filename}.h5")

    # JSON
    start_time = time.time()
    df_json = pd.read_json(f"{base_filename}.json", orient='records', lines=True)
    times['JSON'] = time.time() - start_time
    sizes['JSON'] = os.path.getsize(f"{base_filename}.json")

    return times, sizes

# Set data sizes
data_sizes = [10_000_000, 50_000_000, 100_000_000]

# Store results
results = {}
size_results = {}

for size in data_sizes:
    print(f"Processing data size: {size}")
    df = generate_data(size)
    base_filename = f"data_{size}"
    save_data_formats(df, base_filename)
    times, sizes = measure_load_time_and_size(base_filename)
    os.remove(f"{base_filename}.csv")
    os.remove(f"{base_filename}.parquet")
    os.remove(f"{base_filename}.h5")
    os.remove(f"{base_filename}.json")
    results[size] = times
    size_results[size] = sizes

# Print results
for size, times in results.items():
    print(f"\nData size: {size}")
    for format_name, time_taken in times.items():
        print(f"{format_name}: {time_taken:.2f} seconds")

for size, sizes in size_results.items():
    print(f"\nData size: {size}")
    for format_name, size_taken in sizes.items():
        print(f"{format_name}: {size_taken / (1024 * 1024):.2f} MB")
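Note that running this script end to end assumes the optional pandas dependencies for the binary formats are installed: pyarrow (or fastparquet) for Parquet and tables (PyTables) for HDF5. Generating and saving 100 million rows also needs several gigabytes of RAM and disk space, so a smaller data size may be more practical for a quick local test.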
Data Load Evaluation Results
Performance Comparison by Data Format
- CSV: One of the slowest formats to read across all data sizes. Its plain-text structure makes it inefficient for large data.
- Parquet: Shows the fastest load times across all data sizes. Its columnar storage format offers efficient compression and scanning, making it ideal for big data.
- HDF5: Similar performance to CSV but slightly faster overall. It is well-suited for storing large scientific data but may not perform as well in read operations compared to Parquet.
- JSON: Consistently has the longest load times. While it is easy to read and parse, its verbose structure makes it inefficient for handling large datasets.
Performance Changes by Data Size
- CSV and HDF5: Load times increase roughly linearly with data size. Neither format gains a structural advantage in this test, so read cost scales in proportion to data volume.
- Parquet: Shows a relatively low increase in load time with increasing data size. Its columnar format and efficient compression contribute to minimal overhead as data grows.
- JSON: Exhibits a steep increase in load time with data size, highlighting its inefficiency for big data processing.
Data Storage Space Evaluation Results
- CSV: Simple and widely used but inefficient for large data storage.
- Parquet: Offers the highest compression rates and is the most efficient for large data storage.
- HDF5: Suitable for big data storage but less efficient than Parquet.
- JSON: Has the largest file size and is the least efficient for storing big data.
Conclusion and Recommendations
- JSON: High readability and versatility but unsuitable for big data processing.
- CSV: Simple and widely compatible but inefficient in terms of performance.
- HDF5: Strong in data storage but falls short in load performance, making it less ideal for big data processing.
- Parquet: Most efficient for handling big data in both load time and storage space, making it the recommended choice for big data tasks.
In conclusion, choosing the right data format is crucial for efficiently handling large datasets in AI and machine learning. By leveraging formats like Parquet, you can achieve significant improvements in both processing time and storage efficiency.