Machine learning with Caret in R

FZ2022 Algorithms and Data Analytics

Author: Sergio Castellanos-Gamboa, PhD

Affiliation: Tecnológico de Monterrey

Published: October 8, 2024

1 Introduction to Machine Learning

Machine learning is a branch of artificial intelligence (AI) that allows computers to learn from data without being explicitly programmed for every task. In machine learning, we build models that identify patterns in data and use these patterns to make predictions on new, unseen data.

1.1 Key Concepts in Machine Learning

Let’s break down some of the basic concepts:

Model: A model is like a mathematical equation that represents the relationship between the input data (features) and the output (what we want to predict). For example, in supervised learning (which we are doing here), a model takes input data (e.g., measurements of flowers) and predicts the target output (e.g., the type of flower).

A simple formula for a linear model might look like this:

$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

Where:

$y$ is the output (what we are trying to predict).
$x_1, x_2, \ldots, x_n$ are the input features (the data we have).
$w_1, w_2, \ldots, w_n$ are the weights (parameters that the model learns).
$b$ is the bias term (another parameter the model learns).
Training: Training is the process of “teaching” the model. The model looks at the data we give it and learns the best possible relationship between the input and output.
Test Data: After we train the model, we need to check how well it performs. We do this by using a separate dataset called the test data. This is data the model hasn’t seen before. The goal is to measure how well the model makes predictions on new data.
Features: These are the input variables used to make predictions. For example, in the famous iris dataset, the features might include measurements like petal length or sepal width.
Target (or Label): This is the output we are trying to predict. In our example, the target is the species of the flower (setosa, versicolor, or virginica).
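To make the linear model formula above concrete, here is a minimal sketch in R that evaluates it for a single observation. The weights, bias, and feature values are hypothetical numbers chosen only for illustration; in a real model the weights and bias would be learned during training.

```r
# Hypothetical parameters (not learned from any data)
w <- c(0.5, -1.2, 2.0)   # weights w_1, w_2, w_3
b <- 0.1                 # bias term

# Hypothetical feature values x_1, x_2, x_3 for one observation
x <- c(1.0, 2.0, 3.0)

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b
y <- sum(w * x) + b
print(y)
```

With these numbers the prediction is 0.5 - 2.4 + 6.0 + 0.1 = 4.2.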

1.2 Types of Machine Learning

Supervised Learning: The model is trained on labeled data. This means the dataset contains both input data and the correct output (the labels). The model learns from this data to make predictions. We will be doing supervised learning in this tutorial.
Unsupervised Learning: The model is trained on data without labels, meaning it only has the input features and no correct output. The model tries to find patterns on its own.
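The difference between the two settings can be sketched with the iris data. The supervised view keeps the Species label; the unsupervised view drops it and lets an algorithm group the rows on its own. k-means is used here only as a familiar example of an unsupervised method and is not part of this tutorial's workflow; the choice of three clusters is an assumption supplied by us.

```r
data(iris)

# Supervised view: features plus the known label (Species)
head(iris[, c("Petal.Length", "Petal.Width", "Species")], 3)

# Unsupervised view: drop the label and let k-means group the rows.
# centers = 3 is an assumption we supply; the algorithm never sees Species.
set.seed(123)
features <- iris[, 1:4]
clusters <- kmeans(features, centers = 3, nstart = 10)

# Compare the discovered clusters with the true species (for inspection only)
table(Cluster = clusters$cluster, Species = iris$Species)
```

The cluster numbers are arbitrary labels, so the table should be read by looking at which species dominates each cluster rather than by matching numbers to names.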

1.3 The Role of Caret in R

caret (short for Classification And REgression Training) is a popular R package that simplifies many tasks in machine learning, including:

Data Preprocessing: Preparing the data for modeling, such as splitting the data into training and test sets, normalizing, or scaling features.
Model Training: It provides easy access to many different machine learning algorithms.
Model Tuning: It helps you find the best settings (called hyperparameters) for your models.
Model Evaluation: Caret offers tools to measure how well your model performs on test data, using metrics like accuracy and error.
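As one example of the preprocessing tools listed above, the sketch below uses caret's preProcess function to center and scale the four numeric iris features. It is applied to the full dataset purely for illustration; in a real workflow the scaling parameters would normally be learned from the training data only.

```r
library(caret)
data(iris)

# Learn centering and scaling parameters from the four numeric features
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))

# Apply the transformation to the same columns
scaled <- predict(pp, iris[, 1:4])

# Each column should now have mean close to 0 and standard deviation 1
round(colMeans(scaled), 3)
round(apply(scaled, 2, sd), 3)
```

Scaling matters most for distance-based methods, because features measured on larger scales would otherwise dominate the distance calculation.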

1.4 Why is Splitting Data Important?

To evaluate how well the model performs on new, unseen data, we divide the dataset into two parts:

Training Data: Used to train the model.
Test Data: Used to evaluate how well the model generalizes to new data. This simulates real-world scenarios, where the model encounters unseen data.

In this tutorial, we will:

Load the data: We’ll use a dataset called iris.
Preprocess the data: Split the data into training and test sets.
Train the model: Use the k-Nearest Neighbors (k-NN) algorithm.
Evaluate the model: Measure its performance on test data using accuracy and other metrics.

2 Loading libraries

# Load required libraries
# Install and load caret
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}

# Install and load ggplot2
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}

# Install and load dplyr
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

For this tutorial, we’ll use the built-in iris dataset. It’s a famous dataset in machine learning with information about different types of flowers. You can replace this with your own dataset if necessary.

# Load dataset
data(iris)
# Preview the dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
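Beyond head(), a few base R calls give a quick overview of the dataset before modeling. This is a small sketch of the checks one might run: the structure of the columns, the number of missing values, and how many rows belong to each species.

```r
data(iris)

# Column types and dimensions (150 rows: 4 numeric features, 1 factor)
str(iris)

# Number of missing values in each column
colSums(is.na(iris))

# Number of flowers per species
table(iris$Species)
```

The iris dataset has no missing values and contains 50 flowers of each species, so the classes are perfectly balanced.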

3 Data Preprocessing

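Before applying machine learning algorithms, it is essential to preprocess the data. Typical steps include splitting the data into training and test sets, scaling features, and handling missing values. In this tutorial we only split the data: iris has no missing values, and the model below is trained without feature scaling (its printout reports "No pre-processing").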

Training data: The model learns from this data.
Test data: We check how well the model performs on this data after it has been trained.

# Split the data into training and test sets (80/20 split)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = .8, 
                                  list = FALSE, 
                                  times = 1)

trainData <- iris[ trainIndex,]
testData  <- iris[-trainIndex,]

# Check the dimensions
dim(trainData)

[1] 120   5

dim(testData)

[1] 30  5

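set.seed(123): Makes the random split reproducible (the same rows are selected every time you run the code).
createDataPartition: Returns the row indices of a stratified random sample covering 80% of the data (p = .8), so each species keeps the same proportion in the training and test sets.
- iris$Species: The column we are trying to predict (the type of flower). The split is stratified on this column.
- list = FALSE: Returns the indices as a matrix rather than a list, so they can be used directly to subset rows.
- times = 1: Creates a single partition.
trainData and testData: The rows selected by trainIndex form the training set (120 rows), and the remaining rows form the test set (30 rows).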
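A quick way to confirm that the split behaved as intended is to count the species in each set and check that no row appears in both. The sketch below repeats the split so that it can be run on its own.

```r
library(caret)
data(iris)

# Same split as above
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 1)
trainData <- iris[ trainIndex, ]
testData  <- iris[-trainIndex, ]

# Species counts in each set
table(trainData$Species)
table(testData$Species)

# Number of rows shared by the two sets (should be 0)
length(intersect(rownames(trainData), rownames(testData)))
```

Because iris has 50 flowers per species, an 80/20 split on Species places 40 flowers of each species in the training set and 10 in the test set.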

4 Model Training

In this section, we’ll train a model using caret. Let’s use a k-Nearest Neighbors (k-NN) algorithm for classification. This algorithm works by classifying a new data point based on its nearest “neighbors” in the training data.
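Before handing the work to caret, the neighbor-voting idea can be sketched by hand in base R. This is a simplified illustration, not caret's implementation: it assumes Euclidean distance and a plain majority vote, holds out one flower as the "new" point, and uses k = 5 as an arbitrary choice.

```r
data(iris)

# Hold out one flower as the "new" point; the rest act as training data
new_point <- as.numeric(iris[150, 1:4])
train_x   <- as.matrix(iris[-150, 1:4])
train_y   <- iris$Species[-150]

k <- 5  # number of neighbors (arbitrary choice for illustration)

# Euclidean distance from the new point to every training row
distances <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))

# Labels of the k closest rows, followed by a majority vote
nearest   <- order(distances)[1:k]
votes     <- table(train_y[nearest])
predicted <- names(votes)[which.max(votes)]

votes
predicted
```

The train function used next automates this kind of prediction and, in addition, chooses the number of neighbors by cross-validation.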

# Train a k-NN model
set.seed(123)
model_knn <- train(Species ~ ., data = trainData, method = "knn", 
                   tuneLength = 5,
                   trControl = trainControl(method = "cv", number = 10))

# Print model details
print(model_knn)

k-Nearest Neighbors 

120 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa 
   5  0.9750000  0.9625
   7  0.9750000  0.9625
   9  0.9833333  0.9750
  11  0.9750000  0.9625
  13  0.9666667  0.9500

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.

train: This function trains a machine learning model. We specify:
- Species ~ .: We’re predicting the Species column using all other columns (. means every column except the target).
- data = trainData: The training data.
- method = "knn": We use the k-Nearest Neighbors method.
- tuneLength = 5: Tries five candidate values of k, the number of neighbors to consider (here 5, 7, 9, 11, and 13), and keeps the best one.
- trainControl(method = "cv", number = 10): Uses 10-fold cross-validation. The training data is split into 10 folds; each fold is held out once for validation while the model is fit on the other nine, and the results are averaged. This gives a more reliable accuracy estimate for each k than a single split would.
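The fold structure behind 10-fold cross-validation can be sketched with caret's createFolds function. The folds generated here are illustrative and are not necessarily the exact folds that train created internally.

```r
library(caret)
data(iris)

# Recreate the training set used in this tutorial
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE)
trainData  <- iris[trainIndex, ]

# Ten stratified folds; each element holds the row positions held out in that fold
set.seed(123)
folds <- createFolds(trainData$Species, k = 10)

# Size of each held-out fold
sapply(folds, length)

# In one round, the model is fit on every row except the held-out fold
fit_rows <- setdiff(seq_len(nrow(trainData)), folds[[1]])
length(fit_rows)
```

With 120 training rows, each held-out fold contains about 12 rows, leaving about 108 for fitting, which is consistent with the "Summary of sample sizes" line in the model printout.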

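After training, print(model_knn) shows the details of the model, the cross-validated Accuracy and Kappa for each candidate k, and the value that was selected. In the run above, k = 9 had the highest cross-validated accuracy (0.9833).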
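The same information can be pulled out of the fitted object programmatically, which is handy when comparing models in code. The sketch below retrains the model so that it can be run on its own; the exact numbers may differ slightly from the printout above depending on the R and caret versions.

```r
library(caret)
data(iris)

# Recreate the training set and the model from this tutorial
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE)
trainData  <- iris[trainIndex, ]

set.seed(123)
model_knn <- train(Species ~ ., data = trainData, method = "knn",
                   tuneLength = 5,
                   trControl = trainControl(method = "cv", number = 10))

# The selected value of k
model_knn$bestTune

# Cross-validated accuracy and Kappa for every candidate k
model_knn$results[, c("k", "Accuracy", "Kappa")]
```

Calling plot(model_knn) draws the cross-validated accuracy against k, which makes the choice of the final value easier to see.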

5 Model Evaluation

We will now evaluate the model’s performance on the test data.

# Make predictions
predictions <- predict(model_knn, testData)

# Confusion matrix
confusionMatrix(predictions, testData$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8278, 0.9992)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.963e-13       
                                          
                  Kappa : 0.95            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.9000
Specificity                 1.0000            0.9500           1.0000
Pos Pred Value              1.0000            0.9091           1.0000
Neg Pred Value              1.0000            1.0000           0.9524
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3000
Detection Prevalence        0.3333            0.3667           0.3000
Balanced Accuracy           1.0000            0.9750           0.9500

predict: This function takes the trained model and the test data and predicts the flower species for each test row.
confusionMatrix: This cross-tabulates the predictions (rows) against the actual species (columns, labeled Reference). It tells us:
- Correct classifications: the counts on the diagonal (10 setosa, 10 versicolor, 9 virginica).
- Incorrect classifications: the off-diagonal counts. Here, one virginica flower was predicted as versicolor, which is a false positive for versicolor and a false negative for virginica.
- Summary metrics such as accuracy: 29 of 30 predictions are correct, giving 0.9667.
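The headline numbers in the report can be recomputed by hand from the table itself. The sketch below types in the counts from the confusion matrix printed above and derives accuracy, per-class sensitivity, and per-class positive predictive value using base R only.

```r
# Counts copied from the confusion matrix above (rows = predictions, columns = reference)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

# Accuracy: correct predictions (diagonal) divided by all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: of the flowers that truly belong to a class, the share predicted correctly
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: of the flowers predicted as a class, the share that are correct
pos_pred_value <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
round(pos_pred_value, 4)
```

These values match the report: accuracy 0.9667, sensitivity 0.9 for virginica, and a positive predictive value of 0.9091 for versicolor.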

6 Accuracy and Error

Let’s calculate some metrics to measure the accuracy and error of the model.

# Accuracy
accuracy <- postResample(pred = predictions, obs = testData$Species)
print(accuracy)

 Accuracy     Kappa 
0.9666667 0.9500000

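postResample: This function compares the predicted species (pred) with the actual species (obs) and returns a named vector of performance metrics.
accuracy: Despite its name, this variable holds both metrics:
- Accuracy: The proportion of correct predictions (0.9667, or 29 of 30). The error rate is its complement, 1 - 0.9667 = 0.0333.
- Kappa: Cohen's Kappa, which measures the agreement between predictions and actual labels after correcting for the agreement expected by chance.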
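To show where these two numbers come from, the sketch below recomputes the error rate and Kappa by hand from the confusion matrix counts in Section 5. It uses the standard definition of Cohen's Kappa: observed agreement minus chance agreement, divided by one minus chance agreement. The variable names are chosen for illustration.

```r
# Counts copied from the confusion matrix in Section 5
# (rows = predictions, columns = reference)
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE)

n <- sum(cm)

# Observed agreement is the accuracy
observed_agreement <- sum(diag(cm)) / n

# Agreement expected by chance, based on the row and column totals
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2

error_rate  <- 1 - observed_agreement
cohen_kappa <- (observed_agreement - expected_agreement) / (1 - expected_agreement)

round(c(Accuracy = observed_agreement, Error = error_rate, Kappa = cohen_kappa), 4)
```

With three equally frequent classes the chance agreement is 1/3, so an accuracy of 29/30 corresponds to a Kappa of 0.95, the value reported by postResample.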

7 Conclusion

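In this tutorial, we trained and evaluated a machine learning model using the caret package in R. We split the iris data into stratified training and test sets, tuned a k-NN classifier with 10-fold cross-validation (which selected k = 9), and measured 96.7% accuracy on the held-out test set. The same workflow of splitting, tuning, and evaluating applies to the many other algorithms that caret supports.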