Background

Statistical models can be built to serve different research purposes, such as explanation or prediction.

  • When one’s purpose is explanation, the researcher writes a statistical model that tests a specific literature-informed theory (the model may have even been pre-registered).
    • Typically, the researcher wants the model to have interpretable parameter estimates that provide an understanding of the nature of the relationship between each predictor and the outcome variable
  • When one’s purpose is prediction, the researcher wants to build a statistical model that does the best at accurately predicting new data.
    • The model does not necessarily need to produce interpretable parameter estimates as long as, when applied to new data, the statistical model does a good job at predicting those values.

You could also be interested in achieving a mix of these two goals.

Machine Learning

Machine learning is the process of building predictive models. A “training data set” is used to build a model and optimize the model’s parameters, and then the predictive ability of the model is tested using a “testing data set.”

Steps of Machine Learning

  • Data Splitting
  • Cross-validation
  • Tuning
  • Model Testing

Training Set vs Testing Set

One of the main innovations of machine learning is that it requires you to build your statistical model using a separate set of data from the one used to test the model’s accuracy.

The original data is split into a training set that is used to build and optimize the statistical model, and a testing set that is used to test the resulting model’s predictive accuracy (i.e., ability to accurately predict new, not-yet-seen data).

Data Splitting

Since data sets can be hard to come by, data splitting is used to divide a single data set into a training set and a testing set. Methods for splitting data into training and testing sets are known as sampling techniques. These range from simply selecting cases from the original data set at random to more structured random sampling schemes for deciding which cases go into the training set and which go into the testing set.

You typically want more data in the training set than in the testing set, since the training data is what is used to build your model. A commonly used split is 70% training and 30% testing, but you can adjust these proportions.
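As a minimal sketch of the idea (using the built-in mtcars data; caret’s createDataPartition function, used later in this document, does the same job in a more careful way), a 70/30 random split in base R looks like this:

# Randomly assign 70% of rows to a toy training set and the rest to a toy testing set
set.seed(1)
train_rows   <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
training.toy <- mtcars[train_rows, ]
testing.toy  <- mtcars[-train_rows, ]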

Cross-Validation

Recall that the statistical model is built using only the data in the training set. To avoid overfitting, we do not want the model we build to fit too specifically to the unique error in the training set.

Cross-validation is a technique for splitting the training set multiple times into mini training-testing sets. The model is fit using each of the mini training sets and then its predictive accuracy is measured in each of the mini testing sets. The results are averaged across each of these models.

Here are a couple of cross-validation approaches:

  • leave-one-out cross-validation: Using the training set, fit the statistical model to n-1 observations. Then, test the predictive accuracy of the model on the single observation that’s been left out. Repeat this n times.

  • k-fold cross-validation: Randomly split the training set into k chunks (aka folds) of roughly equal size. One chunk is treated as the testing set, the remaining k-1 chunks are treated as the training set for building the model, and the predictive accuracy of the model is examined using the held-out chunk. This process is repeated k times, with a different chunk treated as the testing set each time. Typically, people use k = 5 or k = 10 (but k can be any value up to n). Both approaches are specified in the sketch below.
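As a short sketch of how these two approaches are requested in practice (assuming the caret package, which is introduced later in this document), both are specified through caret’s trainControl function and then passed to train via its trControl argument:

library(caret)

cv_5fold <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
cv_loocv <- trainControl(method = "LOOCV")          # leave-one-out cross-validation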

Tuning

In machine learning, there are two types of parameters: model parameters and hyperparameters. What’s the difference?

  • Model parameters are estimated from the data (e.g., regression coefficients)

  • Hyperparameters are values specific to some algorithms that can be “tuned” by the researcher. Different algorithms have different hyperparameters. You won’t know the best values of hyperparameters prior to training a model; you have to rely on rules of thumb or try to find the best values through trial and error. This is the tuning process (see the short sketch below).
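As a hedged illustration of that process (the same pattern appears in the SVM example later in this document), you can look up an algorithm’s tunable hyperparameters with caret’s getModelInfo function and supply candidate values as a grid built with expand.grid:

library(caret)

getModelInfo("svmLinear")$svmLinear$parameters    # svmLinear has one tunable hyperparameter: C (cost)
tuning_values <- expand.grid(C = c(0.01, 0.1, 1)) # candidate values to evaluate during model training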

Terminology Break: Algorithms vs Models

  • Algorithm: a set of steps that is passed into a model for processing

  • Model: a complex object that takes input parameters and gives an output

  • Feature: a variable, in machine learning lingo

Common Machine Learning Algorithms

  • Linear regression, lm()

  • Logistic regression, LogitBoost()

  • Support vector machines, svm() or svmLinear()

  • Random forests, randomForest()

And there are hundreds more… for a full list: names(getModelInfo())
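Assuming caret is loaded, you can inspect that list directly:

library(caret)

length(names(getModelInfo())) # how many algorithms caret currently knows about
head(names(getModelInfo()))   # the first few method names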

Overfitting

A very important consideration when building predictive models is overfitting. The data in the training set will reflect true, underlying relationships among variables, but it will also contain an amount of error that is unique to the training set. Overfitting is the problem of fitting a model too closely to the training set, which can leave the model with poor predictive power when applied to a new dataset. During model training, you want to strike a balance between fitting a model with good accuracy (i.e., one that reduces error) and not accounting for so much of the training set’s unique error that you overfit the model.
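As a toy, hedged illustration of overfitting (separate from the example that follows): an overly flexible model can fit its training data very closely yet predict held-out data worse than a simple one.

# Simulate data where the true relationship is linear with noise
set.seed(1)
dat <- data.frame(x = runif(60))
dat$y <- dat$x + rnorm(60, sd = 0.2)

train.toy <- dat[1:40, ]  # toy training set
test.toy  <- dat[41:60, ] # toy testing set

fit_simple  <- lm(y ~ x, data = train.toy)           # matches the true form of the relationship
fit_complex <- lm(y ~ poly(x, 10), data = train.toy) # overly flexible model

# Compare prediction error (RMSE) on the held-out data
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test.toy$y, predict(fit_simple, test.toy))
rmse(test.toy$y, predict(fit_complex, test.toy)) # typically larger: the flexible model chased training-set noise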

A Regression Example

Regression is used to perform machine learning with a continuous outcome variable. The data we will use for this example is called Prestige, from the car package. This dataset contains various features for a range of occupations, such as education, the percentage of incumbents who are women, and the perceived prestige of the occupation.
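The code below assumes the following setup, which is not shown in the original output:

library(car)   # provides the Prestige data (via the carData package in recent versions of car)
library(caret) # the model training interface used throughout this example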

head(Prestige) 
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
str(Prestige)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# Convert the integer and factor columns to numeric
Prestige$income <- as.numeric(Prestige$income)
Prestige$census <- as.numeric(Prestige$census)
Prestige$type <- as.numeric(Prestige$type)

# Keep only the variables used in the models below
Prestige <- subset(Prestige, select = c(education, income, women, prestige, census))

Data Splitting

Before you perform model training, you should partition the original dataset into a training set and a testing set. Model training is performed only on the training set.

createDataPartition is used to split the original data into a training set and a testing set. The inputs into createDataPartition include y, times, p, and list:

  • y = the outcome variable

  • times = the number of times you want to split the data

  • p = the percentage of the data that goes into the training set

  • list = FALSE returns the result as a matrix of row numbers rather than a list; these row numbers are then used to index the original data and split it into training and testing sets

# Randomly sample from the original dataset
set.seed(50) # set.seed fixes the random number generator's seed so the split is reproducible; the exact value is arbitrary

# Split the original dataset into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)

training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set

Now that we have split the data into a training set and a testing set, we can move on to training a model using the training set. We will go through just a few of the different algorithms and cross-validation techniques you can use for model training.

Model Training in caret

caret is an R package that consolidates many different machine learning algorithms into a single, easy-to-use interface.

The train function is used for model training. It uses the following inputs:

  • form = the model formula; y ~ . means predict y from all other variables in the dataset
  • method = the machine learning algorithm
  • trControl = the cross-validation method
  • tuneGrid = a data frame of the hyperparameters that you want to be evaluated during model training
  • preProc = any pre-processing adjustments that you want done on the predictor data

Let’s build a model using an algorithm we’re already familiar with: linear regression.

set.seed(21)

ml_linear_model <- train(income ~ ., 
                         data = training.set, 
                         method = "lm", 
                         trControl = trainControl(method = "cv", number = 5), # k-fold CV with k = 5 
                         # tuneGrid = , # left out because linear regression has no hyperparameters to tune
                         preProc = c("center")) 


ml_linear_model
## Linear Regression 
## 
## 74 samples
##  4 predictor
## 
## Pre-processing: centered (4) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 58, 60, 60, 58, 60 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2831.885  0.6700662  1799.577
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ml_linear_model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 2831.885 0.6700662 1799.577 1265.345  0.2030662 503.4603

How accurately does this model predict incomes in the testing set? The postResample function returns three testing-set fit measures: RMSE (root mean squared error), Rsquared (the squared correlation between predicted and observed values), and MAE (mean absolute error).

linear_predicted <- predict(ml_linear_model, testing.set) 

# Overall accuracy assessment
postResample(linear_predicted, testing.set$income)
##         RMSE     Rsquared          MAE 
## 1609.4258122    0.7938279 1009.9321849

The model we built accounts for 79% of the variability in incomes in the testing set.

Comparing Models

One of the advantages of this approach is you can easily run and compare models fit using different types of algorithms. Another commonly used machine learning model is called a Support Vector Machine (SVM).

Support Vector Machine

Unlike linear regression, SVMs do have a hyperparameter that can be tuned. You can see what it’s called by passing the model algorithm, called svmLinear, to the getModelInfo function.

getModelInfo("svmLinear")$svmLinear$parameters 
##   parameter   class label
## 1         C numeric  Cost

Then, we can choose “tuning” values to try out for this hyperparameter. expand.grid is the function that allows you to specify the tuning values.

tuning_values <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100)) 

Now, let’s build the model using the training set.

set.seed(42)

ml_svm_model <- train(income ~. , 
                      data = training.set, 
                      method = "svmLinear", 
                      trControl = trainControl(method="cv", number = 5), 
                      tuneGrid = tuning_values, 
                      preProc = c("center"))

ml_svm_model
## Support Vector Machines with Linear Kernel 
## 
## 74 samples
##  4 predictor
## 
## Pre-processing: centered (4) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 59, 59, 58, 60, 60 
## Resampling results across tuning parameters:
## 
##   C      RMSE      Rsquared   MAE     
##   1e-03  4225.181  0.6856207  2720.510
##   1e-02  3173.461  0.6887618  1767.654
##   1e-01  2804.109  0.6916329  1555.539
##   1e+00  2769.176  0.6850269  1564.259
##   1e+01  2772.279  0.6836705  1568.732
##   1e+02  2771.494  0.6836159  1567.516
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 1.
ml_svm_model$results
##       C     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1 1e-03 4225.181 0.6856207 2720.510 1463.573  0.1733353 635.9427
## 2 1e-02 3173.461 0.6887618 1767.654 1508.062  0.1808151 591.4909
## 3 1e-01 2804.109 0.6916329 1555.539 1445.872  0.2021806 511.3837
## 4 1e+00 2769.176 0.6850269 1564.259 1418.832  0.2106898 477.9205
## 5 1e+01 2772.279 0.6836705 1568.732 1421.506  0.2146243 474.2085
## 6 1e+02 2771.494 0.6836159 1567.516 1422.368  0.2146574 475.7424

How accurately does this model predict incomes in the testing set?

svm_predicted <- predict(ml_svm_model, testing.set) 

# Overall accuracy assessment
postResample(svm_predicted, testing.set$income)
##        RMSE    Rsquared         MAE 
## 1595.482580    0.805849  993.348648

The model we built accounts for 81% of the variability in incomes in the testing set.

Random Forest

Let’s try one more algorithm called Random Forest.

Find the hyperparameters that can be tuned.

getModelInfo("rf")$rf$parameters
##   parameter   class                         label
## 1      mtry numeric #Randomly Selected Predictors
tuning_values <- expand.grid(mtry = c(2, 3, 4, 5))

Build the model using the training set. Note that income ~ . involves only four predictors here, so the candidate value mtry = 5 is out of range; randomForest warns that it reset mtry to a valid value in each resample.

set.seed(47)

ml_rf_model <- train(income ~. , 
                  data = training.set, 
                  method = "rf", 
                  trControl = trainControl(method="cv", number = 5), 
                  tuneGrid = tuning_values,
                  preProc = c("center"))
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range

## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range

## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range

## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range

## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
ml_rf_model
## Random Forest 
## 
## 74 samples
##  4 predictor
## 
## Pre-processing: centered (4) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 59, 59, 59, 59, 60 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     2959.116  0.5808807  1834.863
##   3     2983.974  0.5831581  1812.909
##   4     2992.354  0.5915796  1787.337
##   5     3014.966  0.5822055  1811.346
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
ml_rf_model$results
##   mtry     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1    2 2959.116 0.5808807 1834.863 983.5608  0.1273506 459.1188
## 2    3 2983.974 0.5831581 1812.909 886.5779  0.1232333 384.7540
## 3    4 2992.354 0.5915796 1787.337 861.0418  0.1376776 360.9088
## 4    5 3014.966 0.5822055 1811.346 879.7178  0.1218599 386.8569

How accurately does this model predict incomes in the testing set?

# Testing predictive ability of model in testing set
rf_predicted <- predict(ml_rf_model, testing.set)

postResample(rf_predicted, testing.set$income) 
##         RMSE     Rsquared          MAE 
## 1858.0446113    0.7959387 1261.5159226

The model we built accounts for 80% of the variability in incomes in the testing set.

Which model did the best?

RMSE_Testing <- c(postResample(linear_predicted, testing.set$income)[1], 
                  postResample(svm_predicted, testing.set$income)[1], 
                  postResample(rf_predicted, testing.set$income)[1])

RSq_Testing <- c(postResample(linear_predicted, testing.set$income)[2], 
                 postResample(svm_predicted, testing.set$income)[2], 
                 postResample(rf_predicted, testing.set$income)[2])

Algorithm <- c("Linear Regression", "Support Vector Machine", "Random Forest")

compare_models <- data.frame(cbind(Algorithm, RMSE_Testing, RSq_Testing))

compare_models
##                     Algorithm     RMSE_Testing       RSq_Testing
## RMSE        Linear Regression  1609.4258121626 0.793827907063687
## RMSE.1 Support Vector Machine 1595.48258030158 0.805848953359384
## RMSE.2          Random Forest 1858.04461133748 0.795938730765608

It looks like the model fit using the Support Vector Machine algorithm performs best on both testing-set measures of model fit, with the lowest RMSE and the highest R-squared.