Several Julia packages offer functionality for splitting datasets into training and testing sets. Here are the most commonly used and recommended options:
MLDataUtils.jl: This package is designed specifically for machine learning data utilities and provides the splitobs function, a convenient and flexible way to split data. It’s generally the preferred approach.
DataFrames.jl (with some manual work): If your data is in a DataFrame, you can combine it with random number generation to create row indices for your training and testing sets. This takes a bit more manual work than splitobs but is still a viable option.
ScikitLearn.jl: This package implements the scikit-learn interface in Julia and can call the Python scikit-learn library through PyCall.jl. It includes functions like train_test_split, which you might be familiar with if you’ve used scikit-learn in Python. However, since the other options are lighter native Julia packages, they are generally preferred for Julia projects.
1. Using MLDataUtils.jl (Recommended):
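using MLDataUtils
# Example data (replace with your actual data).
# MLDataUtils treats the last array dimension as the observation dimension,
# so X is stored as features x samples.
X = rand(5, 100) # 5 features, 100 samples
y = rand(Bool, 100) # 100 labels (binary classification)
# Split into 80% training and 20% testing. splitobs does not shuffle by itself,
# so shuffle the observations first with shuffleobs.
(X_train, y_train), (X_test, y_test) = splitobs(shuffleobs((X, y)), at = 0.8)
# `at` is a fraction, not a count. To target a fixed number of training
# observations, convert the count to a fraction:
n_train = 80
(X_train, y_train), (X_test, y_test) = splitobs(shuffleobs((X, y)), at = n_train / size(X, 2))
# To get indices instead of the actual data:
train_indices, test_indices = splitobs(shuffleobs(1:100), at = 0.8) # Split indices
X_train = X[:, train_indices]
X_test = X[:, test_indices]
y_train = y[train_indices]
y_test = y[test_indices]
# For stratified splitting (important if classes are imbalanced).
# stratifiedobs shuffles by default and keeps the class proportions in both parts:
(X_train, y_train), (X_test, y_test) = stratifiedobs((X, y), p = 0.8)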
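To make clear what stratified splitting does, here is a minimal sketch that uses only the Random standard library. It is not how stratifiedobs is implemented; stratified_indices is a hypothetical helper written for illustration. It splits each class separately so that both parts keep roughly the original class proportions.

```julia
using Random

# Hypothetical helper (not part of MLDataUtils): returns (train_idx, test_idx)
# where each class contributes about `frac` of its observations to training.
function stratified_indices(y::AbstractVector, frac::Real; rng = Random.default_rng())
    train_idx = Int[]
    test_idx = Int[]
    for label in unique(y)
        idx = shuffle(rng, findall(==(label), y))   # positions of this class, shuffled
        n_train = round(Int, frac * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    # Shuffle again so the classes are not grouped together in the output
    return shuffle(rng, train_idx), shuffle(rng, test_idx)
end

y = vcat(fill(true, 10), fill(false, 90))   # imbalanced labels: 10% positive
train_idx, test_idx = stratified_indices(y, 0.8; rng = MersenneTwister(42))

# Positives in each part: 8 of 80 in training, 2 of 20 in testing (10% in both)
println((count(y[train_idx]), count(y[test_idx])))
```

A plain random split of the same labels could easily leave the test set with zero or one positive example, which is why stratification matters for imbalanced classes.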
2. Using DataFrames.jl (Manual approach):
using DataFrames, Random
# Example DataFrame (replace with your DataFrame)
df = DataFrame(A = rand(100), B = rand(100), C = rand(Bool, 100))
# Set a seed for reproducibility
Random.seed!(123)
# Create a random permutation of row indices
n_rows = nrow(df)
perm = randperm(n_rows)
# Split into training and testing indices
train_size = floor(Int, 0.8 * n_rows)
train_indices = perm[1:train_size]
test_indices = perm[train_size+1:end]
# Create training and testing DataFrames
df_train = df[train_indices, :]
df_test = df[test_indices, :]
# Or, to get the data as matrices:
X = Matrix(df[:, Not(:C)])
y = df.C
X_train = X[train_indices, :]
X_test = X[test_indices, :]
y_train = y[train_indices]
y_test = y[test_indices]
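If you use the manual approach in more than one place, the permutation logic can be wrapped in a small function. The sketch below uses only the Random standard library; train_test_indices is a hypothetical helper name, and passing an explicit rng (rather than seeding the global one) is a design choice assumed here, not something the steps above require.

```julia
using Random

# Hypothetical helper: shuffled, disjoint train/test row indices for `n` rows.
function train_test_indices(n::Integer, train_frac::Real; rng = Random.default_rng())
    0 < train_frac < 1 || throw(ArgumentError("train_frac must be in (0, 1)"))
    perm = randperm(rng, n)                  # random permutation of 1:n
    n_train = floor(Int, train_frac * n)     # same rounding as the manual steps
    return perm[1:n_train], perm[n_train+1:end]
end

# Two calls with identically seeded generators give the same split
a_train, a_test = train_test_indices(100, 0.8; rng = MersenneTwister(123))
b_train, b_test = train_test_indices(100, 0.8; rng = MersenneTwister(123))

println(a_train == b_train)                     # same seed, same split
println(sort(vcat(a_train, a_test)) == 1:100)   # every row used exactly once
```

The returned index vectors can be used directly for row indexing, for example df[a_train, :] and df[a_test, :] on a DataFrame with 100 rows.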
3. Using ScikitLearn.jl (Less common in pure Julia workflows):
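using ScikitLearn
# train_test_split lives in the CrossValidation submodule
using ScikitLearn.CrossValidation: train_test_split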
# Example data
X = rand(100, 5)
y = rand(Bool, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
Recommendation:
The MLDataUtils.jl package with the splitobs() function (and stratifiedobs() for stratified splitting) is the most straightforward and recommended method for splitting datasets in Julia. It is designed specifically for this purpose and integrates well with other Julia machine learning packages. It also works on arrays and tuples of arrays alike, and it returns views of the data rather than copies. Use DataFrames.jl only if you are already working heavily with DataFrames and prefer to manipulate indices directly. Avoid ScikitLearn.jl for this task in pure Julia projects, as it adds a Python dependency through PyCall.jl.