Testing and Training Datasets

Data Science with Julia

Julia Workshop


Several Julia packages offer functionality for splitting datasets into training and testing sets. Here are the most commonly used and recommended options:

  1. MLDataUtils.jl: This package is specifically designed for machine learning data utilities and provides the splitobs function which, together with shuffleobs and stratifiedobs, offers a convenient and flexible way to split data. It’s generally the preferred approach.

  2. DataFrames.jl (with some manual work): If your data is in a DataFrame, you can use it in conjunction with random number generation to create indices for your training and testing sets. This requires a bit more manual work than splitobs but is still a viable option.

  3. ScikitLearn.jl: This package provides Julia bindings to the popular Python scikit-learn library. It includes functions like train_test_split, which you might be familiar with if you’ve used scikit-learn in Python. However, since the other options are native Julia packages, they are generally preferred for Julia projects.

1. Using MLDataUtils.jl (Recommended):

using MLDataUtils

# Example data (replace with your actual data)
X = rand(100, 5)    # 100 samples (rows), 5 features
y = rand(Bool, 100) # 100 labels (binary classification)

# MLDataUtils treats the *last* dimension as the observation dimension by
# default; our samples are rows, so we pass obsdim = 1 throughout.
# splitobs itself does not shuffle, so pair it with shuffleobs:
(X_train, y_train), (X_test, y_test) =
    splitobs(shuffleobs((X, y), obsdim = 1), at = 0.8, obsdim = 1)

# To get indices instead of the actual data, split a range of indices:
train_indices, test_indices = splitobs(1:100, at = 0.8)
X_train = X[train_indices, :]
X_test = X[test_indices, :]
y_train = y[train_indices]
y_test = y[test_indices]

# For stratified splitting (important if classes are imbalanced),
# stratifiedobs shuffles and preserves the class proportions of y:
(X_train, y_train), (X_test, y_test) = stratifiedobs((X, y), p = 0.8, obsdim = 1)
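
If you also want a validation set, splitobs accepts a tuple of fractions for a multi-way split. A minimal sketch, assuming the same X and y as above:

# Three-way split: 60% train, 20% validation, 20% test (the remainder)
train, val, test = splitobs(shuffleobs((X, y), obsdim = 1), at = (0.6, 0.2), obsdim = 1)
(X_train, y_train) = train
(X_val, y_val) = val
(X_test, y_test) = test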

2. Using DataFrames.jl (Manual approach):

using DataFrames, Random

# Example DataFrame (replace with your DataFrame)
df = DataFrame(A = rand(100), B = rand(100), C = rand(Bool, 100))

# Set a seed for reproducibility
Random.seed!(123)

# Create a random permutation of row indices
n_rows = nrow(df)
perm = randperm(n_rows)

# Split into training and testing indices
train_size = floor(Int, 0.8 * n_rows)  # 80% of rows for training
train_indices = perm[1:train_size]
test_indices = perm[train_size+1:end]

# Create training and testing DataFrames
df_train = df[train_indices, :]
df_test = df[test_indices, :]

# Or, to get the data as matrices:
X = Matrix(df[:, Not(:C)])
y = df.C
X_train = X[train_indices, :]
X_test = X[test_indices, :]
y_train = y[train_indices]
y_test = y[test_indices]
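
If you use this manual pattern in several places, it is easy to wrap in a small helper. The function traintest_split below is our own hypothetical name, not part of DataFrames.jl; a minimal sketch of the approach above:

using DataFrames, Random

# Hypothetical helper (not provided by DataFrames.jl) wrapping the manual split
function traintest_split(df::DataFrame; frac = 0.8, seed = 123)
    rng = MersenneTwister(seed)      # local RNG keeps the split reproducible
    perm = randperm(rng, nrow(df))   # random permutation of row indices
    n_train = floor(Int, frac * nrow(df))
    return df[perm[1:n_train], :], df[perm[n_train+1:end], :]
end

df_train, df_test = traintest_split(df, frac = 0.8, seed = 123)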

3. Using ScikitLearn.jl (Less common in pure Julia workflows):

using ScikitLearn
using ScikitLearn.CrossValidation: train_test_split

# Example data
X = rand(100, 5)
y = rand(Bool, 100)

# Keyword names mirror Python's scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
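
Whichever method you choose, it is worth a quick sanity check that the two subsets partition the data exactly and have comparable class balance; a minimal sketch using only the standard library:

using Statistics

# The subsets should cover every row of X exactly once
@assert size(X_train, 1) + size(X_test, 1) == size(X, 1)

# Class balance should be similar across subsets (unless you stratified)
println("positive rate, train: ", mean(y_train))
println("positive rate, test:  ", mean(y_test))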

Recommendation:

The MLDataUtils.jl package, with splitobs() for splitting (paired with shuffleobs() for shuffling) and stratifiedobs() for stratified splits, is the most straightforward and recommended method for splitting datasets in Julia. It is designed specifically for this purpose, integrates well with other Julia machine learning packages, and handles many data containers (vectors, matrices, tuples of arrays) efficiently through lazy views. Use DataFrames.jl only if you are already working heavily with DataFrames and prefer to manipulate indices directly. Avoid ScikitLearn.jl for this task in pure Julia projects, as it introduces a dependency on Python.