R Programming for Data Science introduced classification through functions in package class. Here you will be introduced to a powerful package: caret.
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for
data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation
as well as other functionality.
To learn more about package caret, visit the link in the References section.
To get started, load packages tidyverse, caret, and gifski. Install any packages with install.packages("package_name").
library(tidyverse)
library(caret)
library(gifski)
We will examine data that are the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
# data url
wineurl <- paste("https://archive.ics.uci.edu/ml/",
                 "machine-learning-databases/wine/wine.data", sep = "")
# read in data and set as tibble
wine <- as_tibble(read.csv(file = wineurl, header = FALSE))
# assign variable names
colnames(wine) <- c("origin", "alcohol", "acid",
                    "ash", "alcalinity", "magnesium",
                    "phenols", "flavanoids", "nonflavanoid",
                    "proanthocyanins", "color.int", "hue",
                    "od", "proline")
# change origin to factor
wine <- wine %>%
  mutate(origin = factor(origin))
wine
Our goal is to apply the K-NN algorithm and evaluate its prediction accuracy when using the 13 wine features to predict each wine's origin.
Let's examine how the features are related to origin.
Create a scatter plot of flavanoids and phenols. Have the point color and shape reflect the wine origin.
Before we train our classifier it is important to examine the features of our data. The results of this analysis may require further preprocessing and give us insight into what distance metric to use for K-NN.
How many wines exist for each origin? Hint: dplyr::count()
Are there any NA values in the data?
What types of variables are the 13 features? Are they all numeric continuous? (One way to carry out these checks is sketched below.)
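Here is a minimal sketch of one way to carry out these checks, assuming the wine tibble created above:
# scatter plot of flavanoids vs. phenols, colored and shaped by origin
ggplot(wine, aes(x = flavanoids, y = phenols,
                 color = origin, shape = origin)) +
  geom_point()
# number of wines from each origin
wine %>%
  count(origin)
# total number of NA values in the data
sum(is.na(wine))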
Next, we split the 178 rows of wine into a training dataset and a testing dataset. We train our classifier with the training dataset and evaluate the prediction accuracy with the testing dataset.
The function createDataPartition() from package caret can be used to create balanced splits of the data, so that the ratio of outcomes is preserved in the training and testing datasets. This is important when some classes are more common than others.
Function createDataPartition(y, p = 0.5, list = TRUE) has main arguments:
y: a vector of your outcomes, origin from wine
p: proportion of the overall data that goes to training, typically between 0.70 and 0.80
list: should the results be in a list? Generally, we will set this to FALSE.
The result of function createDataPartition() is a list or matrix of row indices that correspond to the training data.
Use function createDataPartition() to get the row indices that will correspond to the training dataset. Set list = FALSE, and choose a split percentage of 70%. Save the result as an object train.index.
Use train.index to subset the rows of wine to create your training dataset. Save the result as an object named wine.train. Hint: dplyr::slice()
Use the other indices to subset the rows of wine to create your testing dataset. Save the result as an object named wine.test. Hint: dplyr::slice(), use a minus sign. (One possible approach is sketched below.)
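A minimal sketch of one possible approach, assuming the wine tibble from above (the set.seed() call is only there so the random split is reproducible):
# set a seed so the split is reproducible
set.seed(1)
# row indices for the training data: 70% split, preserving origin proportions
train.index <- createDataPartition(wine$origin, p = 0.70, list = FALSE) %>%
  as.vector()
# training and testing datasets
wine.train <- wine %>% slice(train.index)
wine.test  <- wine %>% slice(-train.index)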
Max Kuhn details other data splitting functionality at https://topepo.github.io/caret/.
K-NN operates on the basis of a distance metric. Therefore, it is helpful if all our features are standardized. Function preProcess() from package caret will help do this correctly and efficiently. The function performs preprocessing transformations (centering, scaling, etc.) that can be estimated from the training data and applied to any dataset with the same variables.
Function preProcess(x, method = c("center", "scale")) has main arguments:
x: a matrix or data frame, in this case wine.train
method: type of processing to be done; to standardize we want to center and scale each feature
The result of function preProcess() is a list. Type ?preProcess in your console to look at the "Value" section of the help. Function preProcess() doesn't actually preprocess the data; we need to use function predict() with the result of preProcess().
Input wine.train into function preProcess(), and choose methods center and scale. Save the result as an object named train.proc.
Display train.proc$mean and train.proc$std to see the mean and standard deviation of each feature.
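A minimal sketch, assuming wine.train from the previous step:
# estimate centering and scaling parameters from the training data only
train.proc <- preProcess(wine.train, method = c("center", "scale"))
# means and standard deviations that will be used for standardization
train.proc$mean
train.proc$std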
To actually obtain standardized features use function predict(). We will save the standardized objects as train.transformed and test.transformed for the training and testing datasets, respectively.
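For example, something along these lines, assuming train.proc, wine.train, and wine.test from above:
# apply the training-set standardization to both datasets
train.transformed <- predict(train.proc, wine.train)
test.transformed  <- predict(train.proc, wine.test)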
Getting the order of standardizing and splitting the data wrong is one of the most common mistakes. See item 3 in References for a nice explanation of the correct procedure.
The caret train() function lets us train different algorithms using similar syntax. We can use function predict() to obtain predictions.
Function train(form, method, data) has the main arguments we need:
form: a formula relating outcomes to features, in this case origin ~ ., where the dot means all features
method: type of method to use, in this case "knn"
data: transformed training data frame, train.transformed
The result of function train() is a list. Type ?train in your console to look at the "Value" section of the help.
Use train() to train the K-NN classifier. Save the result as an object named train.knn.
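One possible call, assuming train.transformed from the previous step:
# train a K-NN classifier on the standardized training data
train.knn <- train(origin ~ ., method = "knn", data = train.transformed)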
Now, we use predict() and function confusionMatrix() to evaluate the prediction accuracy.
What is the prediction accuracy?
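A sketch of one way to check this, assuming the objects created above (the name knn.pred is just an illustration):
# predict origins for the standardized testing data
knn.pred <- predict(train.knn, newdata = test.transformed)
# confusion matrix and overall prediction accuracy
confusionMatrix(knn.pred, test.transformed$origin)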
When an algorithm includes a tuning parameter, function train() automatically uses cross validation to decide among a few default values. Recall that K-NN has one tuning parameter, \(k\).
To visualize the results of cross validation we can use our object train.knn in conjunction with ggplot().
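For instance, a minimal sketch:
# plot accuracy across the values of k tried during resampling;
# highlight = TRUE marks the best-performing k
ggplot(train.knn, highlight = TRUE)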
By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. The default is to try \(k = 5, 7, 9\). We can change these \(k\) values using the tuneGrid parameter in function train(). The grid of values must be supplied as a data frame with the parameter names as columns.
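For example, a call along these lines (the exact grid of \(k\) values shown here is an assumption) creates an object named train.knn.big.grid, which the questions below refer to:
# try a wider grid of odd k values
train.knn.big.grid <- train(origin ~ ., method = "knn",
                            data = train.transformed,
                            tuneGrid = data.frame(k = seq(3, 51, by = 2)))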
Which parameter value maximized the accuracy for train.knn? How about for train.knn.big.grid? Hint: train.knn$bestTune
Function predict() will always use the \(k\) with the highest accuracy. Use predict() and confusionMatrix() to evaluate the accuracy for train.knn.big.grid.
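A sketch of one way to do this, assuming test.transformed from above (the name big.grid.pred is just an illustration):
# predictions from the model refit with the best k
big.grid.pred <- predict(train.knn.big.grid, newdata = test.transformed)
# confusion matrix and overall prediction accuracy
confusionMatrix(big.grid.pred, test.transformed$origin)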