Splitting the dataset is one of the crucial steps in training machine learning models. It helps in preventing overfitting to the training data and ensures that the model generalizes well to unseen data. In this post, we will discuss various techniques for dataset splitting, including train_test_split, k-fold cross-validation, and stratified k-fold cross-validation.
train_test_split
train_test_split is one of the simplest methods for randomly dividing data into two groups: a training set and a test set. Typically, 70% of the data is used for training and 30% for testing.
Advantages
- Fast: the model is trained and evaluated only once, which matters for large datasets.
- Random shuffling helps the split reflect the overall distribution of the data.
Disadvantages
- Random splitting might not reflect the characteristics of the dataset.
- With smaller datasets, the train or test set might be too small, leading to poor generalization.
Usage
```python
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate the model
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```
Here, X is the feature data and y is the target data. test_size determines the proportion of the dataset used for testing, and random_state sets the seed for the random shuffle so the split is reproducible.
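As a quick sanity check, the sketch below runs the split on a small toy array (the 100-sample data here is made up for illustration) and confirms that test_size=0.3 puts 30% of the samples in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with 4 features, balanced binary labels
X = np.arange(400).reshape(100, 4)
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 70 30
```

Because random_state is fixed, rerunning this snippet always produces the same 70/30 partition.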
k-fold cross-validation
k-fold cross-validation splits the data into k groups (folds). Each fold is used as the test set exactly once, while the remaining folds are used for training. This ensures that every sample is used for both training and testing.
Advantages
- Maximizes the use of data for training and evaluation.
- Provides a more reliable assessment of model performance.
Disadvantages
- Computationally expensive, as the model is trained k times.
- Higher overhead for large datasets.
Usage
```python
from sklearn.model_selection import KFold

# Load the dataset
X, y = load_data()

# Create KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Run KFold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train and evaluate the model
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
```
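In practice, the fit/score loop above can be replaced with scikit-learn's cross_val_score helper, which runs the same procedure and returns one score per fold. The sketch below substitutes synthetic data from make_classification and a LogisticRegression model for the load_data() and model placeholders, purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic data stands in for load_data() in the example above
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# Runs the train/evaluate loop internally; returns one accuracy score per fold
scores = cross_val_score(model, X, y, cv=kf)
print(scores.mean())
```

Averaging the per-fold scores gives the single performance estimate usually reported for cross-validation.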
Stratified k-fold cross-validation
Stratified k-fold cross-validation is a variant of k-fold cross-validation that preserves the class distribution within each fold. This is particularly useful for imbalanced datasets.
Advantages
- Accounts for class imbalance: each fold mirrors the overall class distribution.
- Helps in improving generalization by ensuring the model sees a variety of samples from different classes.
Disadvantages
- More computationally intensive than standard k-fold cross-validation.
- May not provide additional benefits if the dataset is already balanced.
Usage
```python
from sklearn.model_selection import StratifiedKFold

# Load the dataset
X, y = load_data()

# Create StratifiedKFold
skf = StratifiedKFold(n_splits=5)

# Run StratifiedKFold cross-validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
```
Here, n_splits specifies the number of folds, and split() returns the training and test indices for each fold. Note that unlike KFold, StratifiedKFold's split() also takes y, which it needs in order to balance the classes across folds.
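To see the stratification at work, the sketch below builds a deliberately imbalanced toy label array (90 samples of class 0, 10 of class 1, made up for illustration) and counts the classes in each test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels (hypothetical): 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 2))  # feature values do not affect the split itself

skf = StratifiedKFold(n_splits=5)
fold_counts = [np.bincount(y[test_index]) for _, test_index in skf.split(X, y)]
print(fold_counts)  # every test fold holds 18 class-0 and 2 class-1 samples
```

A plain KFold on the same sorted labels could easily produce test folds containing only one class; StratifiedKFold keeps the 9:1 ratio in every fold.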
Image Source: https://amueller.github.io/