library(dplyr)
library(kableExtra)
library(caret)
library(doParallel)

We tune a total of 6 models from the following categories of models:

Data Preparation

# Read in data
train <- readxl::read_excel('Data//StudentData.xlsx')
eval <- readxl::read_excel('Data//StudentEvaluation.xlsx')

The variable Brand Code is a categorical variable, having 4 classes (A, B, C, and D). We opt to use the “one-hot” encoding scheme for this variable, creating 5 new variables for the data: BrandCodeA, BrandCodeB, BrandCodeC, BrandCodeD, and BrandCodeNA.

# One-hot encoding the categorical variable `Brand Code`
train$`Brand Code` <- addNA(train$`Brand Code`)
eval$`Brand Code` <- addNA(eval$`Brand Code`)
brandCodeTrain <- predict(dummyVars(~`Brand Code`, data=train), train)
brandCodeEval <- predict(dummyVars(~`Brand Code`, data=eval), eval)
head(brandCodeTrain, 10)
##    `Brand Code`A `Brand Code`B `Brand Code`C `Brand Code`D `Brand Code`NA
## 1              0             1             0             0              0
## 2              1             0             0             0              0
## 3              0             1             0             0              0
## 4              1             0             0             0              0
## 5              1             0             0             0              0
## 6              1             0             0             0              0
## 7              1             0             0             0              0
## 8              0             1             0             0              0
## 9              0             1             0             0              0
## 10             0             1             0             0              0
head(train$`Brand Code`, 10)
##  [1] B A B A A A A B B B
## Levels: A B C D <NA>
head(brandCodeEval, 10)
##    `Brand Code`A `Brand Code`B `Brand Code`C `Brand Code`D `Brand Code`NA
## 1              0             0             0             1              0
## 2              1             0             0             0              0
## 3              0             1             0             0              0
## 4              0             1             0             0              0
## 5              0             1             0             0              0
## 6              0             1             0             0              0
## 7              1             0             0             0              0
## 8              0             1             0             0              0
## 9              1             0             0             0              0
## 10             0             0             0             1              0
head(eval$`Brand Code`, 10)
##  [1] D A B B B B A B A D
## Levels: A B C D <NA>
train <- cbind(brandCodeTrain, subset(train, select=-c(`Brand Code`)))
eval <- cbind(brandCodeEval, subset(eval, select=-c(`Brand Code`)))

White spaces and special characters in the column names are removed so they does not cause issues in some of the R packages.

# Remove special symbols (white space and `) in names
names(train) <- gsub(patter=c(' |`'), replacement='', names(train))
names(eval) <- gsub(patter=c(' |`'), replacement='', names(eval))

There are a few rows with target variable (PH) missing. These rows are removed, since they cannot be used for training.

# Remove rows in training set with missing target variables
train <- train[complete.cases(train$PH),]

There is one near-zero-variance variable in the data:

# Check near-zero-variance variables
nearZeroVar(train, names=T)
## [1] "BrandCodeNA"  "HydPressure1"

Below, we remove the near-zero-variance predictor, and separate the predictors and target:

# Separate the predictors and target, and remove nzv variable
xTrain <- subset(train, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
xEval <- subset(eval, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
yTrain <- train$PH

The train function from the caret package is used to tune the models. The 5-fold cross validation scheme is used to estimate the model performance based on their RMSE. Below, we create the folds and set up the train control:

set.seed(1)
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
                          method='cv', 
                          number=5,
                          index=cvFolds)

For the missing values, we experiment with three different imputation algorithms provided in the preProcess function:

As will be seen in the “Linear Models” section below, the choice of imputation method does not seem to affect the prediction performance much. We opt to use the knnImpute method due to its high efficiency.

For the linear and non-linear models, the pre-processing step also include centering and scaling (standardizing), so that the variables all have a mean of 0 and standard deviation of 1. For the tree-based models, this step is omitted, since tree models work fine without this step.

The caret package supports parallel processing (multi-core training). This capability significantly lowers the training time:

# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

Model Building and Tuning

Linear Models

In this section, we tune 6 models for the following purpose in mind:

  • Understand the effect of different imputation methods on the model performance
  • Find the optimal hyper-parameters for the respective linear models

For the partial least squares, the models are tuned over the number of components used in the model. 3 models are created, each uses a different imputation method:

# PLS
plsFit1 <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('knnImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set
plsFit2 <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('bagImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 7 on full training set
plsFit3 <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('medianImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set

For the elastic nets, the models are tuned over the two regularization parameters. Likewise, 3 models are tuned, each with different imputation method:

# Elastic Net
enetFit1 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('knnImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set
enetFit2 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('bagImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.4, lambda = 0 on full training set
enetFit3 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('medianImpute', 'center', 'scale'))
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set

The performance of these models are be compared using the resamples function:

resamples(list(PLS1=plsFit1, PLS2=plsFit2, PLS3=plsFit3,
               enet1=enetFit1, enet2=enetFit2, enet3=enetFit3)) %>% summary()
## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS1, PLS2, PLS3, enet1, enet2, enet3 
## Number of resamples: 5 
## 
## MAE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.1040102 0.1063136 0.1071064 0.1067182 0.1079895 0.1081715    0
## PLS2  0.1046314 0.1060063 0.1075259 0.1069578 0.1078970 0.1087286    0
## PLS3  0.1048040 0.1066420 0.1071693 0.1069891 0.1079980 0.1083322    0
## enet1 0.1032539 0.1056163 0.1070119 0.1061140 0.1071730 0.1075147    0
## enet2 0.1042026 0.1058568 0.1070027 0.1063896 0.1070570 0.1078287    0
## enet3 0.1042282 0.1060427 0.1072512 0.1066468 0.1078112 0.1079009    0
## 
## RMSE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.1356567 0.1376909 0.1376948 0.1374621 0.1377316 0.1385362    0
## PLS2  0.1358973 0.1368498 0.1371483 0.1374332 0.1384082 0.1388624    0
## PLS3  0.1363889 0.1377426 0.1379041 0.1378178 0.1384736 0.1385797    0
## enet1 0.1339212 0.1361784 0.1363822 0.1361661 0.1366161 0.1377323    0
## enet2 0.1347075 0.1360830 0.1362463 0.1365080 0.1374337 0.1380693    0
## enet3 0.1348319 0.1363997 0.1368995 0.1368871 0.1369487 0.1393558    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.3625243 0.3667430 0.3699577 0.3699562 0.3739147 0.3766411    0
## PLS2  0.3648928 0.3649674 0.3664653 0.3685208 0.3728756 0.3734029    0
## PLS3  0.3556132 0.3644181 0.3702737 0.3668091 0.3706663 0.3730739    0
## enet1 0.3700989 0.3727973 0.3781011 0.3778583 0.3811035 0.3871907    0
## enet2 0.3703520 0.3733133 0.3738265 0.3752370 0.3791456 0.3795478    0
## enet3 0.3626465 0.3649488 0.3761665 0.3722877 0.3785394 0.3791373    0

As you can see, the performance differences are very small among the different imputation methods. We opt to use the knnImpute method from this point on, due to its efficiency.

plsFit <- plsFit1
enetFit <- enetFit1

plot(plsFit)

plot(enetFit)

The final linear models are:

plsFit$finalModel
## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat,     method = "oscorespls")
enetFit$finalModel
## 
## Call:
## elasticnet::enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Cp statistics of the Lasso fit 
## Cp: 1828.781 1300.154 1029.285  836.820  795.467  791.002  790.239  538.139  458.584  439.565  426.175  352.972  305.674  251.236  243.047  208.556  189.786  190.667  171.832  154.297  154.234  153.671  146.067  138.922  107.104  104.545   88.112   84.251   67.606   55.387   56.761   58.066   54.200   48.890   49.116   32.453   34.000   34.000 
## DF:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 21 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 35 
## Sequence of  moves:
##      MnfFlow BrandCodeC BowlSetpoint Usagecont PressureSetpoint BrandCodeD
## Var       14          3           30        23               31          4
## Step       1          2            3         4                5          6
##      Temperature BrandCodeA CarbPressure1 BrandCodeNA FillOunces PSC
## Var           22          1            15           5          7  11
## Step           7          8             9          10         11  12
##      HydPressure3 CarbFlow PSCCO2 OxygenFiller PSCFill AlchRel CarbTemp
## Var            18       24     13           29      12      33       10
## Step           13       14     15           16      17      18       19
##      PCVolume MFR BrandCodeD FillPressure Density AirPressurer CarbRel
## Var         8  26         -4           16      25           32      34
## Step       20  21         22           23      24           25      26
##      CarbVolume BallingLvl BrandCodeB PressureVacuum HydPressure2
## Var           6         35          2             28           17
## Step         27         28         29             30           31
##      FillerLevel FillerSpeed Balling HydPressure4 CarbPressure BrandCodeD
## Var           20          21      27           19            9         -4
## Step          32          33      34           35           36         37
##        
## Var  38
## Step 38

Non-linear Models

For the KNN method, the model is tuned over the number of k-nearest neighbors used to make prediction:

# KNN
knnFit <- train(x=xTrain,
                y=yTrain,
                method='knn',
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
plot(knnFit)

For the support vector machine, we choose the radial basis kernel function. The hyper-parameters being tuned is the cost value. The scale parameter sigma is fixed and determined by the function analytically.

# Support Vector Machine
svmFit <- train(x=xTrain,
                y=yTrain,
                method="svmRadial",
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
plot(svmFit)

The final non-linear models are:

knnFit$finalModel
## 9-nearest neighbor regression model
svmFit$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0201032562509376 
## 
## Number of Support Vectors : 2175 
## 
## Objective Function Value : -1643.345 
## Training error : 0.271298

Tree-based Models

For the random forest model, the mtry parameter, which is the number of randomly selected predictors in each tree, is tuned to obtain the optimal model. The rf implementation in R does not permit missing values, therefore knnImpute is used in the pre-processing step.

# Random Forest
rf <- train(x=xTrain, 
            y=yTrain, 
            method='rf',
            tuneLength=10,
            trControl=trControl,
            preProc=c('knnImpute'), 
            importance=T)
plot(rf)

For the XGBoost model, below is a list of the parameters being tuned:

  • nrounds : boosting iterations (trees)
  • max_depth : max tree depth
  • eta : learning rate
  • gamma : minimum loss reduction
  • colsample_bytree : subsample ratio of columns
  • min_child_weight : minimum sum of instance weight
  • subsample : subsample ratio of rows

In addition, the XGBoost allows missing value in the data. Here, we experiment with both imputing the missing values (with knnImpute) and not imputing the missing values.

# XGBoost
xgbGrid <- expand.grid(.nrounds=c(100, 500, 1000, 1500), # boosting iterations (trees)
                       .max_depth=c(4, 6, 8, 10), # max tree depth
                       .eta=c(0.001, 0.01, 0.1, 0.5), # learning rate
                       .gamma=c(0),# minimum loss reduction
                       .colsample_bytree=c(0.4, 0.6, 0.8, 1), # subsample ratio of columns
                       .min_child_weight=c(1, 5, 15), # minimum sum of instance weight
                       .subsample=c(0.5, 0.75, 1))  # subsample ratio of rows

xgb1 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl)

xgb2 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl,
              preProce = c('knnImpute'))

# End multi-core processing
stopCluster(cl)
registerDoSEQ()
resamples(list(XGB1=xgb1, XGB2=xgb2)) %>% summary()
## 
## Call:
## summary.resamples(object = .)
## 
## Models: XGB1, XGB2 
## Number of resamples: 5 
## 
## MAE 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## XGB1 0.08727048 0.08757008 0.08768184 0.08783904 0.08788029 0.08879249
## XGB2 0.08697277 0.08718546 0.08743389 0.08767905 0.08782876 0.08897435
##      NA's
## XGB1    0
## XGB2    0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## XGB1 0.1154879 0.1161752 0.1165290 0.1171056 0.1181344 0.1192014    0
## XGB2 0.1156593 0.1157656 0.1164978 0.1170502 0.1180158 0.1193124    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## XGB1 0.5285509 0.5346482 0.5357232 0.5414750 0.5503459 0.5581067    0
## XGB2 0.5281014 0.5351413 0.5416740 0.5423755 0.5419176 0.5650431    0

It appears that the performance difference between imputing and not imputing are negligible. We opt to use the imputed model since it is a slight improvement and knnImpute does not take that much time to perform.

The final tree-based models are:

rf
## Random Forest 
## 
## 2567 samples
##   35 predictor
## 
## Pre-processing: nearest neighbor imputation (35), centered (35),
##  scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 514, 514, 513, 513, 513 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1277131  0.4988421  0.09870782
##    5    0.1221249  0.5299636  0.09319066
##    9    0.1197881  0.5399403  0.09071346
##   13    0.1187889  0.5436157  0.08960861
##   16    0.1185293  0.5426875  0.08926884
##   20    0.1177503  0.5458798  0.08871182
##   24    0.1179296  0.5439091  0.08855143
##   27    0.1179859  0.5399440  0.08836967
##   31    0.1181714  0.5364060  0.08827614
##   35    0.1185114  0.5311831  0.08879924
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
xgb <- xgb2
xgb$finalModel
## ##### xgb.Booster
## Handle is invalid! Suggest using xgb.Booster.complete
## raw: 6.6 Mb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, objective = "reg:linear")
## params (as set within xgb.train):
##   eta = "0.01", max_depth = "10", gamma = "0", colsample_bytree = "0.6", min_child_weight = "5", subsample = "1", objective = "reg:linear", silent = "1"
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 35 
## niter: 1500
## nfeatures : 35 
## xNames : BrandCodeA BrandCodeB BrandCodeC BrandCodeD BrandCodeNA CarbVolume FillOunces PCVolume CarbPressure CarbTemp PSC PSCFill PSCCO2 MnfFlow CarbPressure1 FillPressure HydPressure2 HydPressure3 HydPressure4 FillerLevel FillerSpeed Temperature Usagecont CarbFlow Density MFR Balling PressureVacuum OxygenFiller BowlSetpoint PressureSetpoint AirPressurer AlchRel CarbRel BallingLvl 
## problemType : Regression 
## tuneValue :
##       nrounds max_depth  eta gamma colsample_bytree min_child_weight
## 1068    1500        10 0.01     0              0.6                5
##      subsample
## 1068         1
## obsLevels : NA 
## param :
##  list()

Model Evaluation and Comparison

Variable Importance

Following models have their model-specific variable importance:

  • partial least square
  • random forest
  • xgboost

For other models, the default action in caret is to evaluate the variable importance based on loess smoother fit between the target and the predictors (see https://topepo.github.io/caret/variable-importance.html).

Blow, the ranking of variables are calculated and tabulate below. As can be seen, the variable importance calculated for elastic net, KNN, and SVM are the same, since they do not have model-specific method, and are all calculated based on loess R-squares.

getRank <- function(trainObjects){
  temp <- c()
  methods <- c()
  for(object in trainObjects){
    methods <- c(methods, object$method)
    varimp <- varImp(object)[[1]]
    varimp$variables <- row.names(varimp)
    rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
    temp <- cbind(temp, rank)
    
  }
  temp <- as.data.frame(temp)
  names(temp) <- methods
  temp$Rank <- c(1:dim(temp)[1])
  temp <- select(temp, Rank, everything())
  return(temp)
}

kable(getRank(list(plsFit, rf, xgb, enetFit, knnFit, svmFit)))
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
Rank pls rf xgbTree enet knn svmRadial
1 MnfFlow MnfFlow MnfFlow MnfFlow MnfFlow MnfFlow
2 BrandCodeC BrandCodeC Usagecont MFR MFR MFR
3 BowlSetpoint PressureVacuum BrandCodeC BowlSetpoint BowlSetpoint BowlSetpoint
4 Usagecont OxygenFiller OxygenFiller FillerLevel FillerLevel FillerLevel
5 FillerLevel BallingLvl AlchRel PressureSetpoint PressureSetpoint PressureSetpoint
6 HydPressure3 AirPressurer Temperature Usagecont Usagecont Usagecont
7 PressureSetpoint AlchRel PressureVacuum CarbFlow CarbFlow CarbFlow
8 FillPressure CarbRel FillerSpeed BrandCodeC BrandCodeC BrandCodeC
9 BrandCodeB Usagecont CarbPressure1 HydPressure3 HydPressure3 HydPressure3
10 Temperature Temperature CarbRel FillPressure FillPressure FillPressure
11 HydPressure2 CarbFlow AirPressurer PressureVacuum PressureVacuum PressureVacuum
12 PressureVacuum FillerSpeed BowlSetpoint HydPressure2 HydPressure2 HydPressure2
13 CarbPressure1 HydPressure3 CarbFlow CarbRel CarbRel CarbRel
14 OxygenFiller CarbPressure1 Balling HydPressure4 HydPressure4 HydPressure4
15 CarbFlow Density BallingLvl Temperature Temperature Temperature
16 BrandCodeD BowlSetpoint FillerLevel BrandCodeD BrandCodeD BrandCodeD
17 AlchRel HydPressure2 Density OxygenFiller OxygenFiller OxygenFiller
18 CarbRel Balling PCVolume FillerSpeed FillerSpeed FillerSpeed
19 BrandCodeA PCVolume MFR AlchRel AlchRel AlchRel
20 BallingLvl FillPressure HydPressure2 CarbPressure1 CarbPressure1 CarbPressure1
21 Density FillerLevel CarbVolume PSCCO2 PSCCO2 PSCCO2
22 HydPressure4 CarbVolume HydPressure3 PSC PSC PSC
23 PSC MFR FillOunces FillOunces FillOunces FillOunces
24 FillOunces HydPressure4 FillPressure PCVolume PCVolume PCVolume
25 CarbVolume BrandCodeA HydPressure4 BrandCodeB BrandCodeB BrandCodeB
26 Balling BrandCodeNA PSC BallingLvl BallingLvl BallingLvl
27 BrandCodeNA BrandCodeB CarbPressure CarbTemp CarbTemp CarbTemp
28 PSCCO2 BrandCodeD CarbTemp BrandCodeA BrandCodeA BrandCodeA
29 CarbPressure PressureSetpoint PSCFill CarbPressure CarbPressure CarbPressure
30 PSCFill CarbPressure BrandCodeB PSCFill PSCFill PSCFill
31 MFR FillOunces PressureSetpoint CarbVolume CarbVolume CarbVolume
32 PCVolume PSCFill PSCCO2 Density Density Density
33 FillerSpeed CarbTemp BrandCodeA BrandCodeNA BrandCodeNA BrandCodeNA
34 CarbTemp PSCCO2 BrandCodeD Balling Balling Balling
35 AirPressurer PSC BrandCodeNA AirPressurer AirPressurer AirPressurer

The variable importance calculated based on PLS, RF, XGB, and loess R-squares are plotted below:

plot(varImp(plsFit), main='Variable Importance based on PLS')

plot(varImp(rf), main='Variable Importance based on Random Forest')

plot(varImp(xgb), main='Variable Importance based on XGBoost')

plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')

Model Performance

The models’ performance are listed below:

resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb)) %>% summary()
## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS, ENet, KNN, SVM, RF, XGB 
## Number of resamples: 4 
## 
## MAE 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## PLS  0.10401017 0.10573777 0.10715156 0.10662121 0.10803500 0.10817153
## ENet 0.10325386 0.10502571 0.10631413 0.10584920 0.10713763 0.10751469
## KNN  0.10357093 0.10386052 0.10401850 0.10396019 0.10411817 0.10423283
## SVM  0.09480799 0.09557060 0.09584557 0.09591785 0.09619282 0.09717227
## RF   0.08744219 0.08787527 0.08848266 0.08871182 0.08931921 0.09043978
## XGB  0.08697277 0.08713229 0.08730968 0.08735522 0.08753261 0.08782876
##      NA's
## PLS     0
## ENet    0
## KNN     0
## SVM     0
## RF      0
## XGB     0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.1356567 0.1371824 0.1377113 0.1374039 0.1379328 0.1385362    0
## ENet 0.1339212 0.1356141 0.1362803 0.1357745 0.1364407 0.1366161    0
## KNN  0.1344314 0.1345692 0.1349825 0.1357495 0.1361629 0.1386017    0
## SVM  0.1259120 0.1265990 0.1273402 0.1278324 0.1285736 0.1307371    0
## RF   0.1165398 0.1172607 0.1177933 0.1177503 0.1182830 0.1188748    0
## XGB  0.1156593 0.1157390 0.1161317 0.1164846 0.1168773 0.1180158    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.3625243 0.3656883 0.3683503 0.3689665 0.3716286 0.3766411    0
## ENet 0.3700989 0.3761005 0.3796023 0.3791235 0.3826253 0.3871907    0
## KNN  0.3790417 0.3836021 0.3902105 0.3906420 0.3972504 0.4031051    0
## SVM  0.4293764 0.4446522 0.4546945 0.4522908 0.4623331 0.4703980    0
## RF   0.5379883 0.5399803 0.5430877 0.5458798 0.5489872 0.5593555    0
## XGB  0.5351413 0.5400408 0.5417958 0.5459440 0.5476990 0.5650431    0

It can be seen that the XGBoost model achieves better performance. It has the lowest average RMSE. Based on this, we opt to choose the XGBoost model as our final models.

Below are the boxplots of the RMSE in the CV folds for the models. It can be seen that the linear models have very tight spread in their RMSE, while the non-linear and the tree-based models have higher spread. This means that linear models have lower variance than the non-linear and tree-based models. It makes sense since non-linear models and tree-based models are in general more powerful than the linear models, and therefore prompt to overfit (high variance, low bias).

par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb$resample$RMSE, main='CV RMSE for XGB')

Eval Set Prediction

Below, we make the prediction using the XGB model, and save the result:

(pred <- predict(xgb, newdata=xEval))
##   [1] 8.489114 8.386810 8.499494 8.676891 8.422321 8.508721 8.479796
##   [8] 8.509658 8.524522 8.733335 8.540319 8.524879 8.476252 8.604997
##  [15] 8.263277 8.584721 8.598623 8.507232 8.503490 8.716899 8.680784
##  [22] 8.667629 8.563601 8.511458 8.669805 8.493483 8.405451 8.685325
##  [29] 8.666525 8.699488 8.642038 8.702867 8.616266 8.672192 8.685741
##  [36] 8.555187 8.504755 8.530453 8.691133 8.752850 8.788443 8.525010
##  [43] 8.294233 8.275476 8.710000 8.754517 8.757333 8.724513 8.702127
##  [50] 8.766453 8.628973 8.692094 8.721963 8.720330 8.863279 8.751065
##  [57] 8.442047 8.509911 8.702737 8.782986 8.788027 8.719201 8.736992
##  [64] 8.802861 8.810700 8.704443 8.547973 8.540565 8.610917 8.506020
##  [71] 8.456654 8.384642 8.520053 8.508084 8.515755 8.496242 8.584035
##  [78] 8.662597 8.725308 8.639506 8.773076 8.661763 8.719378 8.628849
##  [85] 8.708687 8.742179 8.739834 8.647678 8.537263 8.726831 8.616635
##  [92] 8.618535 8.537017 8.510901 8.567130 8.626575 8.634910 8.625755
##  [99] 8.660060 8.676085 8.668866 8.717737 8.719065 8.701555 8.777940
## [106] 8.807128 8.490880 8.430260 8.464031 8.511809 8.695340 8.700084
## [113] 8.648747 8.743848 8.640017 8.716689 8.749076 8.740728 8.737554
## [120] 8.699940 8.704624 8.733253 8.755580 8.676638 8.505178 8.452633
## [127] 8.498458 8.298923 8.468449 8.465969 8.553341 8.530916 8.548730
## [134] 8.479448 8.460509 8.530807 8.554157 8.455324 8.437445 8.425599
## [141] 8.464765 8.524144 8.487891 8.522289 8.422502 8.503467 8.469502
## [148] 8.573105 8.473152 8.603023 8.623924 8.594813 8.442616 8.638641
## [155] 8.460707 8.483208 8.416842 8.495853 8.579787 8.516457 8.601458
## [162] 8.627097 8.637206 8.643469 8.654936 8.605709 8.470303 8.739980
## [169] 8.669416 8.767879 8.478296 8.540945 8.331718 8.439440 8.477420
## [176] 8.577559 8.421168 8.489767 8.400572 8.470026 8.429167 8.548959
## [183] 8.507633 8.489511 8.522857 8.308163 8.326262 8.485365 8.404167
## [190] 8.372846 8.354460 8.522525 8.549019 8.459602 8.474385 8.332462
## [197] 8.310847 8.224078 8.243311 8.423572 8.403149 8.416455 8.460292
## [204] 8.375723 8.351536 8.495371 8.497482 8.375500 8.437349 8.289114
## [211] 8.331231 8.385859 8.427534 8.519608 8.426883 8.424350 8.373230
## [218] 8.345346 8.504831 8.502677 8.483864 8.471354 8.445770 8.391487
## [225] 8.611402 8.492319 8.472500 8.467644 8.468987 8.470590 8.470867
## [232] 8.513150 8.493874 8.576015 8.582872 8.501921 8.623503 8.667575
## [239] 8.486345 8.516557 8.488672 8.562017 8.518543 8.518978 8.411890
## [246] 8.417725 8.422590 8.532801 8.582995 8.478495 8.437906 8.460243
## [253] 8.514612 8.561201 8.558359 8.552038 8.488605 8.635681 8.513274
## [260] 8.525599 8.616925 8.639329 8.508380 8.615664 8.255293 8.350628
## [267] 8.123829
eval$PH <- pred
write.csv(eval, "StudentEvaluation_PH_PREDICTED.csv", row.names=FALSE)