How to split your dataset? train_test_split, KFold, StratifiedKFold

Splitting the dataset is one of the crucial steps in training machine learning models. It helps in preventing overfitting to the training data and ensures that the model generalizes well to unseen data. In this post, we will discuss various techniques for dataset splitting, including train_test_split, k-fold cross-validation, and stratified k-fold cross-validation.

train_test_split

train_test_split is the simplest way to randomly divide data into two groups: a training set used to fit the model and a test set used to evaluate it. A common split is 70% of the data for training and 30% for testing.

Advantages

  • Simple and fast: the model is trained and evaluated only once.
  • Shuffling before the split reduces the risk of ordering bias in the data.

Disadvantages

  • A single random split may not be representative of the dataset, so the measured score can vary from seed to seed.
  • With smaller datasets, the train or test set might be too small, leading to unreliable estimates and poor generalization.

Usage

from sklearn.model_selection import train_test_split

# Load the dataset
X, y = load_data()

# Split the dataset: 70% train, 30% test
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate the model
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

Here, X is the feature data and y is the target data. test_size sets the fraction of samples held out for testing (0.3 = 30%), and random_state fixes the seed so the random split is reproducible.
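For classification, train_test_split also accepts a stratify argument that preserves the class proportions in both splits. A minimal sketch, using a small synthetic imbalanced label array (the data here is made up purely for illustration):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy imbalanced data: 80 samples of class 0, 20 of class 1 (synthetic)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(np.bincount(y_train))  # [56 14] -> still an 80/20 ratio
print(np.bincount(y_test))   # [24  6] -> still an 80/20 ratio
```

Without stratify, a small test set can end up with too few minority-class samples purely by chance.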

k-fold cross-validation

k-fold cross-validation splits the data into k groups (folds) of roughly equal size. Each fold is used as the test set exactly once, while the remaining k-1 folds are used for training. This way, every sample is used for both training and testing.

Advantages

  • Maximizes the use of data for training and evaluation.
  • Provides a more reliable assessment of model performance.

Disadvantages

  • Computationally expensive as the model is trained k times.
  • Higher overhead for large datasets.

Usage

from sklearn.model_selection import KFold

# Load the dataset
X, y = load_data()

# Create KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Run KFold cross-validation (the indexing below assumes X and y are NumPy arrays)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
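The loop above can be condensed with cross_val_score, which runs the same fit-and-score cycle for every fold. A sketch using a small built-in dataset and a logistic regression model in place of the load_data()/model placeholders above (both choices are just examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# A built-in dataset and a simple model stand in for load_data() and model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold; the mean is a more stable estimate than a single split
scores = cross_val_score(model, X, y, cv=kf)
print(scores, scores.mean())
```

Reporting the mean (and standard deviation) of the fold scores is the usual way to summarize k-fold results.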

Stratified k-fold cross-validation

Stratified k-fold cross-validation is a variant of k-fold cross-validation that maintains the class distribution within each fold. This is particularly useful for imbalanced datasets.

Advantages

  • Preserves the class distribution in every fold, which is important for imbalanced datasets.
  • Every fold contains samples from each class, so each evaluation reflects all classes.

Disadvantages

  • Only applies to classification tasks, since it needs discrete class labels to stratify on.
  • Provides little benefit over standard k-fold if the dataset is already balanced.

Usage

from sklearn.model_selection import StratifiedKFold

# Load the dataset
X, y = load_data()

# Create StratifiedKFold (shuffle is off by default; pass shuffle=True and random_state to shuffle)
skf = StratifiedKFold(n_splits=5)

# Run StratifiedKFold cross-validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)

Here, n_splits specifies the number of folds, and split(X, y) returns the train and test indices for each fold. Unlike KFold, the labels y must be passed to split() so the class proportions can be computed.
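To see the stratification at work, we can count the minority-class samples in each test fold on a synthetic imbalanced label array (made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

# With 5 folds, each test fold gets 10 / 5 = 2 minority samples
skf = StratifiedKFold(n_splits=5)
minority_per_fold = [int(np.sum(y[test_idx] == 1))
                     for _, test_idx in skf.split(X, y)]
print(minority_per_fold)  # [2, 2, 2, 2, 2]
```

A plain KFold on these unshuffled labels would instead put all ten minority samples into the last fold, leaving four folds with none.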

Image Source: https://amueller.github.io/
