In supervised machine learning, the learning algorithm is presented with labelled example inputs and outputs. This can be classification, where the output is categorical, or regression, where the output is numerical. As an example, we will explore a data set below which contains a number of variables (features), including whether an email is spam or not; this is a classification problem. Later, we will look at a wage data set, where wage is numerical, so that is regression.
In unsupervised machine learning, no labels are provided and the learning algorithm focuses on detecting structure in the unlabelled data. Common techniques are clustering (e.g., K-means Clustering) and dimensionality reduction (e.g. Principal Component Analysis). We will save these topics for another day and focus on supervised ML in this workshop.
The caret package stands for Classification and Regression Training. Its power is in being a unified framework for applying all kinds of machine learning algorithms from different developers: the commands are the same no matter what method you use. In summary, the main caret functions are:
- Preprocessing: preProcess()
- Data splitting: createDataPartition(), createResample(), createTimeSlices()
- Training and prediction: train(), predict()
- Model comparison: confusionMatrix()
The caret package provides a uniform framework to build and predict using different models, and it allows any algorithm to be run the same way through the predict() function.
Install the caret package if you haven’t already, and load it.
# install.packages("caret") # install the package once
library(caret) # load the package
The kernlab package (loaded alongside caret below) has a data set called spam. It has 58 variables, one of which is type, which classifies each email as either spam or nonspam. So this is a binary classification problem. The goal will be to use features (some combination of the first 57 variables) to predict the 58th variable: whether the email is spam or not.
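To get a feel for the data before modelling, here is a quick optional look (the spam data comes from kernlab, as in the code further below):
# quick look at the spam data from the kernlab package
library(kernlab); data(spam)
dim(spam)        # number of emails (rows) and variables (columns)
table(spam$type) # how many emails fall into each class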
We will always split the data set before proceeding.
We will hold back some of the data and not use it to build our model. Then we can test the model on this reserved set to get an idea of its accuracy. A good rule of thumb is to use 60-80% of the data to train the model and the remaining data to validate it. The caret package has a function createDataPartition() that handles this easily. The code below assigns 75% of the data to the variable training and puts 25% of the data into the variable testing.
Note: for large sample sizes, it is common to partition as 60% training, 20% test (held back), and 20% validation (to refine the model). For medium sample sizes, use 60-80% for training and hold back the remaining percentage as a test set.
Note: many Kaggle data sets are already split for you!
# load packages and data
library(caret); library(kernlab); data(spam)
# Partition data: 75% of the data to train the model
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,] # note the minus sign
dim(training)
GLM is a generalization of linear regression (it allows response variables with error distribution models other than Gaussian). Don’t worry about the details if this is new to you: the idea is that if you know the values of the features, you can predict the response.
Because spam email contains a lot of words like “credit” and “free,” we will predict spam messages using the frequency of these two words.
set.seed(32343)
modelFit<-train(type~credit+free, data=training, method='glm') # note we choose our method to be glm here
modelFit
Let’s inspect the model. You can see the coefficients and intercept (very “line-like”).
modelFit$finalModel
Now let’s use our model for prediction on the testing set. Remember, we KNOW the correct answer with the test set, so we can see how well our model guesses if an email is spam or nonspam based on words like credit and free.
predictions<- predict(modelFit, newdata=testing)
head(predictions)
The variable predictions contains our guesses. But we’d like a succinct way to compare this to the real classification of the emails. We will use a confusion matrix. This is a table that describes the performance of a classification model.
confusionMatrix(predictions, testing$type, positive = "spam") # without positive =, the first level of the factor is taken as the class we are predicting. In this case, nonspam was the first level, so I changed it.
The output of the code above shows a 2 x 2 matrix with Prediction on the left and Reference on the top. The top left cell says that our model correctly predicted nonspam when it was really nonspam 661 times (true negative). The bottom right cell says our model correctly predicted spam when it was really spam 207 times (true positive). The other two cells show errors. These are either false positives/Type I error (we predicted spam but it wasn’t) or false negatives/Type II error (we predicted nonspam but it was).
Accuracy is a measure of how often the classifier is correct: \(Accuracy = \frac{TP + TN}{total}.\)
Cohen's Kappa is a measure of how well the classifier performed compared to how well it would have performed simply by chance.
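To see where the accuracy number comes from, you can tabulate the predictions against the true labels and compute it directly; a minimal sketch using the objects created above:
# accuracy computed by hand from a table of predictions vs. true labels
tab <- table(Predicted = predictions, Actual = testing$type)
tab
sum(diag(tab)) / sum(tab) # proportion of correct predictions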
Let’s revisit the createDataPartition() code from above and unpack it a bit. y = the output we want to split on, which in this case is the type of message (spam or nonspam). p specifies the proportion of the data that goes into the training chunk; in this case we split into two chunks of 75% and 25%. We then subset the data using the output of the createDataPartition() function.
This is the basic plan for any data splitting.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The parameter k refers to the number of groups that we will split the data into.
For k-fold cross validation, the original sample is randomly partitioned into k equal sized subsamples. A single subsample is retained as a test set, and the remaining k-1 subsamples are used for training.
The general procedure is:
1. Shuffle the data and split it into k groups (folds).
2. For each fold, hold that fold out as the test set and train the model on the remaining k-1 folds.
3. Evaluate the trained model on the held-out fold and record the score.
4. Average the k scores to summarize the model’s performance.
The parameter k is often chosen to be 10 in practice.
set.seed(32323)
folds <-createFolds(y=spam$type, k=10,
list=TRUE, returnTrain=TRUE) # or return test
sapply(folds, length)
# resample
set.seed(32323)
folds<-createResample(y=spam$type, times=10, list=TRUE)
If you work in Python instead, the scikit-learn package has built-in functions like KFold() that handle this, so you can avoid doing it manually there as well.
Extremes of this are a single train/test split like the one we started with (only one held-out chunk) and \(k = n\), meaning each observation is held out once; the latter is called leave-one-out cross-validation (LOOCV).
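Rather than managing folds by hand, you can also ask train() to run cross-validation internally through its trControl argument. Here is a minimal sketch on the spam model from above, using 10 folds and the same glm method:
# 10-fold cross-validation handled inside train() via trainControl()
ctrl <- trainControl(method = "cv", number = 10)
cvFit <- train(type ~ credit + free, data = training, method = "glm", trControl = ctrl)
cvFit # the reported accuracy is averaged over the 10 held-out folds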
It’s a good idea to do some exploratory plotting to get a sense of the basic structure of your data and any quirks. For these examples, we will use a data set from the ISLR (Introduction to Statistical Learning in R) package. The data set Wage has 3000 observations with 11 variables (year, age, marital status, race, education, region, …, wage). Note that the wage variable is numerical, so this is a regression problem.
library(ggplot2); library(ISLR); library(caret)
data(Wage)
summary(Wage)
Let’s split into training and test sets. We will do a 70-30% split.
inTrain<- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
train<- Wage[inTrain,]
test<- Wage[-inTrain,]
dim(train); dim(test); # look at size of each
Here are some plots comparing each feature to wage (the variable we want to predict).
featurePlot(x=train[,c('age', 'education', 'jobclass')],
y=train$wage,
plot='pairs')
This is pretty messy. Let’s look at a simpler plot.
qplot(age, wage, data=train)
We can see that there is some type of outlier group at the top which does not conform to the overall trend of the data. Ideally, we want to understand why this might be the case, and a simple first step is to color the points by another variable. Doing so shows that a major explanation could be the job class of the individuals. This will likely be important in a model, and knowing it ahead of time is very useful.
qplot(age, wage, data=train, color=jobclass)
Or using our ggplot2 skills…
ggplot(data=train, aes(age, wage, color=education))+geom_smooth()
We can also look at density plots.
qplot(wage, color=education, data=train, geom='density')
As an example, let’s model how wage varies with age. We use a linear regression model (lm).
lm_wage <- lm(wage ~ age, data = train)
lm_wage
Predictions from this model can be compared to the test set. We are not convinced that a linear model is right for this data because of its structure, so we might consider other types of models, discussed below.
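As a rough check, we can compare the linear model’s predictions with the held-out wages; RMSE (root mean squared error) is a common error measure for regression. A minimal sketch:
# predict on the held-out set and compute the root mean squared error
pred_lm <- predict(lm_wage, newdata = test)
sqrt(mean((pred_lm - test$wage)^2)) # typical size of the prediction error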
There are other methods besides linear models. For example, prediction with trees is common and often performs better than regression when the data is nonlinear. Let’s fit a classification tree to the iris data set (method = “rpart” selects this method):
# load iris data set
data(iris)
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)
# plot the classification tree
rattle::fancyRpartPlot(modFit$finalModel)
# predict on test values
predict(modFit,newdata=testing)
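To see how well the tree does, we can compare its predictions on the held-out set with the true species, just as we did for the spam model. A minimal sketch:
# compare predicted species with the true species in the test set
confusionMatrix(predict(modFit, newdata = testing), testing$Species)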
Bagging is also an option. (The name comes from Bootstrap aggregating.) The idea is that instead of building a single smoother from the complete data set, some number of bootstrap samples (random, with replacement) are drawn. Each sample is different from the original data set, but resembles it in distribution and variability. For each bootstrap sample, a smoother is fit; these can then be averaged, and the mean hugs the data nicely. There is a great example from a 1986 paper on temperature and ozone (Rousseeuw and Leroy) that illustrates this. (Make sure to install the package first.) In the graph, you can see the data points in black, the sample smoothers in gray, and the mean in red.
# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
# create sample from data with replacement
ss <- sample(1:dim(ozone)[1],replace=T)
# draw sample from the data and reorder rows based on ozone
ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
# fit loess function through data (similar to spline)
loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
# prediction from loess curve for the same values each time
ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)
In caret, bagEarth, treebag, and bagFDA are built-in options for this (a quick example with treebag follows the custom code below). You can also write a custom bag function where you control the arguments, like below.
# load relevant package and data
install.packages("party")
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
# custom bagging function
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")
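For comparison, here is a minimal sketch of the same idea using caret’s built-in treebag method (bagged regression trees); it assumes the ipred package, which backs this method, is installed:
# bagged regression trees via the built-in "treebag" method
set.seed(123)
treebagFit <- train(temperature ~ ozone, data = ozone, method = "treebag")
treebagFit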
Boosting is another ML meta-algorithm, primarily used for reducing bias and variance in supervised learning. The idea came from the question “can a set of weak learners be converted into a strong one?” Here, weak means the learner labels examples only slightly better than random guessing, and strong means it is arbitrarily well-correlated with the true classification. The question, posed by Kearns and Valiant, was answered affirmatively by Schapire in 1990 and has had a big impact on the field of ML. For us, the details aren’t necessary, because caret has a command that will implement this! The most common method is gbm (gradient boosting machine). Here’s an example with the Wage data set. This will be a better model than our first (linear) attempt.
library(ISLR)
# load data
data(Wage)
# remove log wage variable (we are trying to predict wage)
Wage <- subset(Wage,select=-c(logwage))
# create train/test data sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# run the gbm model
modFit <- train(wage ~ ., method="gbm",data=training,verbose=FALSE)
# print model summary
print(modFit)
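To see how the boosted model does on the held-out data, we can predict on the test set and compare to the actual wages; a minimal sketch:
# predict wages for the test set and inspect predicted vs. actual
pred_gbm <- predict(modFit, newdata = testing)
qplot(pred_gbm, wage, data = testing) # points near the diagonal indicate good predictions
sqrt(mean((pred_gbm - testing$wage)^2)) # RMSE on the held-out set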
For this exercise, we will use a modified version of the Orange Juice data from the ISLR package. Our goal will be to predict which of the two orange juice brands (Citrus Hill or Minute Maid) the customer purchases based on the other information. We will load it from the website in the code below.
# install.packages(c('caret', 'skimr', 'RANN', 'randomForest', 'fastAdaboost', 'gbm', 'xgboost', 'caretEnsemble', 'C50', 'earth'))
# Load the caret package
library(caret)
# Import dataset
orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')
# Structure of the dataframe
str(orange)
# See top 6 rows and 10 columns
head(orange[, 1:10])
We can see from inspecting the data that there are 1070 observations of 18 variables (the variables include StoreID, PriceCH, PriceMM, etc.).
We will hold back some of the data and not use it to build our model. Then we can test our model on this reserved set to get an idea of the accuracy of our model. A good rule of thumb is to use 80% of the data to train the model and 20% of the data to validate the model. The caret package has a function createDataPartition() that handles this easily. The code below assigns 80% of the data into the variable trainData and puts 20% of the data into the variable testData.
Note: for large sample sizes, it is common to partition as 60% training, 20% test (held back), and 20% validation (to refine the model). For medium sample sizes, use 60-80% for training and hold back the remaining percentage as a test set.
Reminder: many Kaggle data sets are already split for you!
# Create the training and test datasets
set.seed(100)
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE) # this function takes as input the variable we are predicting and the percentage p to go into training. It returns the row numbers that should form the training data set.
# Step 2: Create the training dataset
trainData <- orange[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]
# Store X and Y for later use.
x = trainData[, 2:18] # every column except Purchase is a predictor
y = trainData$Purchase # Purchase is what we are trying to predict
In our first workshop, we discussed ways to handle missing values, and kept things simple. Now we are ready for a more sophisticated approach.
The data set has missing values (NA). If a feature is a continuous variable, it is common practice to replace these values with the mean of the column, and if it is a categorical variable, with the mode. However, this is quite rudimentary and loses important information. The caret package lets us instead predict the missing values by considering the k nearest neighbors.
Briefly, the k-nearest-neighbors algorithm uses a distance metric (e.g., Euclidean or Hamming distance) and fills in each missing value based on the values of its k nearest neighbors (a majority vote for a class, an average for a numeric value).
# Create the knn imputation model on the training data
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model
# Use the imputation model to predict the values of missing data points
library(RANN) # required for knnImpute
trainData <- predict(preProcess_missingdata_model, newdata = trainData)
anyNA(trainData) # FALSE means there are no NAs left
All missing values have been imputed.
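The variable importance and prediction steps below use a fitted model called model_mars, which is not trained in the code above. Here is a minimal sketch of how it could be fit, assuming the MARS method (method = 'earth', which requires the earth package) that the name and the plot title suggest:
# make sure the outcome is a factor, then fit a MARS model with caret defaults
trainData$Purchase <- as.factor(trainData$Purchase)
set.seed(100)
model_mars <- train(Purchase ~ ., data = trainData, method = 'earth')
model_mars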
How do we know if a feature is an important predictor? We can compute variable importance.
varimp_mars <- varImp(model_mars)
plot(varimp_mars, main="Variable Importance with MARS")
From this plot, we see that LoyalCH is the most important variable, followed by PriceDiff and StoreID.
Let’s predict on the test data now, after imputing missing values. (Whatever cleaning or normalizing is done to the training set should be done in the test set.)
# Step 1: Impute missing values
testData2 <- predict(preProcess_missingdata_model, testData)
# Predict on testData2 which has the missing values imputed
predicted <- predict(model_mars, testData2)
head(predicted)
Let’s compute the confusion matrix:
confusionMatrix(reference = testData$Purchase, data = predicted, mode='everything', positive='MM')
As we’ve seen, there are many models to choose from. Fortunately, caret lets us train multiple models and compare them, and even lets us combine the predictions of multiple models to form a final prediction. This won’t be covered today, but hopefully in the future!
If you are interested in the theory of ML, check out The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman at Stanford). They posted the ebook for free here: https://web.stanford.edu/~hastie/Papers/ESLII.pdf. However, what is nice about caret is you can jump right into ML without the theoretical background (some would argue that’s dangerous, but I’m a fan of project-based learning!).
A practical next step would be choosing a Kaggle competition to try some of these models. A gentle introduction would be the Titanic dataset.
A big thanks to the Johns Hopkins University class in Practical Machine Learning, which introduced me to caret.