Statistical models can be built to serve different research purposes, such as explanation or prediction.
You could also be interested in achieving a mix of these two goals.
Machine learning is the process of building predictive models. A “training data set” is used to build a model and optimize the model’s parameters, and then the predictive ability of the model is tested using a “testing data set.”
One of the main innovations of machine learning is that it requires you to build your statistical model using a separate set of data from the one used to test the accuracy of your model.
The original data is split into a training set that is used to build and optimize the statistical model, and a testing set that is used to test the resulting model’s predictive accuracy (i.e., ability to accurately predict new, not-yet-seen data).
Since data sets can be hard to come by, data splitting is used to divide a single data set into a training set and a testing set. Methods for deciding which cases go into the training set and which go into the testing set are known as sampling techniques; typically, cases are assigned to the two sets using random sampling.
You typically want more data in the training set than in the testing set, since the training data is what is used to build your model. A common split is 70% training and 30% testing, but you can adjust these proportions.
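As an illustration, here is a minimal base-R sketch of a 70/30 split. my_data is a hypothetical data frame used only for illustration; later in this tutorial we use caret's createDataPartition function instead.
set.seed(123)
n <- nrow(my_data)                               # my_data is a placeholder data frame
train_rows <- sample(n, size = floor(0.7 * n))   # randomly pick 70% of the row indices
training_set <- my_data[train_rows, ]            # 70% of cases for building the model
testing_set  <- my_data[-train_rows, ]           # remaining 30% for testing predictive accuracy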
Recall that the statistical model is built using only the data in the training set. To avoid overfitting, we do not want the model we build to fit too specifically to the unique error in the training set.
Cross-validation is a technique for splitting the training set multiple times into mini training-testing sets. The model is fit using each of the mini training sets and then its predictive accuracy is measured in each of the mini testing sets. The results are then averaged across these fits.
Here are a couple of cross-validation approaches:
leave-one-out cross-validation: Using the training set, fit the statistical model to n-1 observations. Then, test the predictive accuracy of the model on the single observation that’s been left out. Repeat this n times.
k-fold cross-validation: Randomly split the training set into k chunks (aka, folds) of roughly equal size. One of the chunks is treated as the testing set. The remaining k-1 chunks are treated as the training set for building the model, and the predictive accuracy of the model is examined using the testing set. This process is repeated k times, with a different chunk treated as the testing set each time. Typically, people use values of k = 5 or k = 10 (but it can be any value up to n).
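To make the fold idea concrete, here is a minimal base-R sketch of how 5-fold assignment could work (illustration only; the trainControl function from caret, used later in this tutorial, handles this for you).
set.seed(1)
n <- 100                                   # suppose the training set has 100 observations
folds <- sample(rep(1:5, length.out = n))  # randomly assign each observation to one of 5 folds
table(folds)                               # the folds are roughly equal in size
# for fold k, observations with folds == k form the mini testing set; the rest form the mini training set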
In machine learning, there are two types of parameters: model parameters and hyperparameters. What’s the difference?
Model parameters are estimated from the data (e.g., regression coefficients)
Hyperparameters are values specific to some algorithms that can be “tuned” by the researcher. Different algorithms have different hyperparameters. You won’t know the best values of hyperparameters prior to training a model. You have to rely on rules of thumb or try to find the best values through trial and error. This is the tuning process.
Algorithm: a set of steps that are passed into a model for processing
Model: a complex object that takes inputs and gives an output
Feature: machine learning lingo for a variable
Linear regression, lm()
Logistic regression, LogitBoost()
Support vector machines, svm() or svmLinear()
Random forests, randomForest()
And there are hundreds more… for a full list: names(getModelInfo())
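For example, assuming the caret package is installed:
library(caret)                 # getModelInfo comes from caret
length(names(getModelInfo()))  # how many methods are available
head(names(getModelInfo()))    # the first few method names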
A very important consideration in building predictive models is overfitting. The data in the training set will reflect true, underlying relationships among variables, but it will also contain an amount of error that is unique to the training set. Overfitting is the problem of fitting a model too closely to the training set, which can result in the model having poor predictive power when applied to a new data set. During model training, you want to strike a balance between fitting a model with good accuracy (i.e., one that reduces error) and not accounting for so much of the training set's unique error that you overfit the model.
Regression is used to perform machine learning with a continuous outcome variable. The data we will use for this example is called Prestige from the car package. This data contains various features across a variety of occupations, such as education, percentage of incumbents who are women, and the perceived prestige of the occupation.
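The calls that produced the output below are not shown in the original; a minimal version, assuming the car and caret packages are installed, would be:
library(car)    # contains the Prestige data set
library(caret)  # machine learning functions used throughout this example
data(Prestige)
head(Prestige)  # first few rows (output below)
str(Prestige)   # structure of the data frame (output below)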
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# Convert the integer and factor columns to numeric
Prestige$income <- as.numeric(Prestige$income)
Prestige$census <- as.numeric(Prestige$census)
Prestige$type <- as.numeric(Prestige$type)
# Keep only the variables used in the models below
Prestige <- subset(Prestige, select = c(education, income, women, prestige, census))
Before you perform model training, you should partition the original dataset into a training set and a testing set. Model training is performed only on the training set.
createDataPartition is used to split the original data into a training set and a testing set. The inputs into createDataPartition include y, times, p, and list:
y = the outcome variable
times = the number of times you want to split the data
p = the percentage of the data that goes into the training set
list = FALSE gives the results as a matrix of row numbers for the partition, which you can use to index the original data and split it into training and testing sets
# Randomly sample from the original dataset
set.seed(50) # set.seed fixes the random number generator's seed so results are reproducible; the value in parentheses is arbitrary
# Split the original dataset into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)
training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set
Now that we have split the data into a training set and a testing set, we can move on to training a model using the training set. We will go through just a few of the different algorithms and cross-validation techniques you can use for model training.
caret is an R package that consolidates many different machine learning algorithms into an easy-to-use interface. The train function is used for model training. Its main inputs, which you will see in the examples below, are a model formula and data set, method (the algorithm to use), trControl (the cross-validation settings), tuneGrid (the hyperparameter values to try), and preProc (any pre-processing to apply).
Let's build the model using an algorithm we're familiar with: linear regression.
set.seed(21)
ml_linear_model <- train(income ~. ,
data = training.set,
method = "lm",
trControl = trainControl(method="cv", number = 5), # k-folds CV with k=5
#tuneGrid = , # leaving empty because there are no hyperparameters to tune in this case
preProc = c("center"))
ml_linear_model
## Linear Regression
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 58, 60, 60, 58, 60
## Resampling results:
##
## RMSE Rsquared MAE
## 2831.885 0.6700662 1799.577
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 2831.885 0.6700662 1799.577 1265.345 0.2030662 503.4603
How accurately does this model predict incomes in the testing set?
linear_predicted <- predict(ml_linear_model, testing.set)
# Overall accuracy assessment
postResample(linear_predicted, testing.set$income)
## RMSE Rsquared MAE
## 1609.4258122 0.7938279 1009.9321849
The model we built accounts for 79% of the variability in incomes in the testing set.
One of the advantages of this approach is you can easily run and compare models fit using different types of algorithms. Another commonly used machine learning model is called a Support Vector Machine (SVM).
Unlike linear regression, SVMs do have a hyperparameter that can be tuned. You can see what it's called by passing the model algorithm, called svmLinear, to the getModelInfo function.
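The exact call is not shown in the original; one way to look it up with caret would be:
# look up the tunable parameter(s) for the svmLinear method (produces the listing below)
getModelInfo("svmLinear", regex = FALSE)[["svmLinear"]]$parameters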
## parameter class label
## 1 C numeric Cost
Then, we can choose "tuning" values to try out for this hyperparameter. expand.grid is the function that allows you to specify the tuning values.
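The grid itself is not shown in the original; reconstructed from the C values that appear in the output below, it would look like this:
# candidate values of the cost hyperparameter C (reconstructed to match the output below)
tuning_values <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100))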
Now, let’s build the model using the training set.
set.seed(42)
ml_svm_model <- train(income ~. ,
data = training.set,
method = "svmLinear",
trControl = trainControl(method="cv", number = 5),
tuneGrid = tuning_values,
preProc = c("center"))
ml_svm_model
## Support Vector Machines with Linear Kernel
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 59, 59, 58, 60, 60
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1e-03 4225.181 0.6856207 2720.510
## 1e-02 3173.461 0.6887618 1767.654
## 1e-01 2804.109 0.6916329 1555.539
## 1e+00 2769.176 0.6850269 1564.259
## 1e+01 2772.279 0.6836705 1568.732
## 1e+02 2771.494 0.6836159 1567.516
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1.
## C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1e-03 4225.181 0.6856207 2720.510 1463.573 0.1733353 635.9427
## 2 1e-02 3173.461 0.6887618 1767.654 1508.062 0.1808151 591.4909
## 3 1e-01 2804.109 0.6916329 1555.539 1445.872 0.2021806 511.3837
## 4 1e+00 2769.176 0.6850269 1564.259 1418.832 0.2106898 477.9205
## 5 1e+01 2772.279 0.6836705 1568.732 1421.506 0.2146243 474.2085
## 6 1e+02 2771.494 0.6836159 1567.516 1422.368 0.2146574 475.7424
How accurately does this model predict incomes in the testing set?
svm_predicted <- predict(ml_svm_model, testing.set)
# Overall accuracy assessment
postResample(svm_predicted, testing.set$income)
## RMSE Rsquared MAE
## 1595.482580 0.805849 993.348648
The model we built accounts for 81% of the variability in incomes in the testing set.
Let’s try one more algorithm called Random Forest.
Find the hyperparameters that can be tuned.
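As with svmLinear, the original call is not shown; a sketch using caret:
# look up the tunable hyperparameter for the "rf" (random forest) method
getModelInfo("rf", regex = FALSE)[["rf"]]$parameters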
## parameter class label
## 1 mtry numeric #Randomly Selected Predictors
Build the model using the training set.
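The tuning grid is again not shown in the original; reconstructed from the mtry values in the output below:
# candidate values of mtry (reconstructed to match the output below; note that with only
# 4 predictors, mtry = 5 triggers the "invalid mtry" warnings printed after the train call)
tuning_values <- expand.grid(mtry = 2:5)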
set.seed(47)
ml_rf_model <- train(income ~. ,
data = training.set,
method = "rf",
trControl = trainControl(method="cv", number = 5),
tuneGrid = tuning_values,
preProc = c("center"))
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## Random Forest
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 59, 59, 59, 59, 60
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2959.116 0.5808807 1834.863
## 3 2983.974 0.5831581 1812.909
## 4 2992.354 0.5915796 1787.337
## 5 3014.966 0.5822055 1811.346
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 2959.116 0.5808807 1834.863 983.5608 0.1273506 459.1188
## 2 3 2983.974 0.5831581 1812.909 886.5779 0.1232333 384.7540
## 3 4 2992.354 0.5915796 1787.337 861.0418 0.1376776 360.9088
## 4 5 3014.966 0.5822055 1811.346 879.7178 0.1218599 386.8569
How accurately does this model predict incomes in the testing set?
# Testing predictive ability of model in testing set
rf_predicted <- predict(ml_rf_model, testing.set)
postResample(rf_predicted, testing.set$income)
## RMSE Rsquared MAE
## 1858.0446113 0.7959387 1261.5159226
The model we built accounts for 80% of the variability in incomes in the testing set.
RMSE_Testing <- c(postResample(linear_predicted, testing.set$income)[1], postResample(svm_predicted, testing.set$income)[1], postResample(rf_predicted, testing.set$income)[1])
RSq_Testing <- c(postResample(linear_predicted, testing.set$income)[2], postResample(svm_predicted, testing.set$income)[2], postResample(rf_predicted, testing.set$income)[2])
Algorithm <- c("Linear Regression", "Support Vector Machine", "Random Forest")
compare_models <- data.frame(cbind(Algorithm, RMSE_Testing, RSq_Testing))
compare_models
## Algorithm RMSE_Testing RSq_Testing
## RMSE Linear Regression 1609.4258121626 0.793827907063687
## RMSE.1 Support Vector Machine 1595.48258030158 0.805848953359384
## RMSE.2 Random Forest 1858.04461133748 0.795938730765608
It looks like the model fit using the Support Vector Machine algorithm performs best in the testing set on both measures of model fit (lowest RMSE and highest R-squared).