Understanding The PH Process

Below we provide a technical write-up of the production process for our mineral water brand “Natural Water”. This document is a supplementary guide to the accompanying business write-up.

Due to recent concerns over the high metal content of bottled mineral waters, the government has asked us to document our production process and provide a predictive model for the pH level of our product. Below we document our entire process, with code, in three parts. Part one contains the Exploratory Data Analysis (EDA), part two covers data preparation for modeling, and part three covers the modeling process. Some documentation accompanies the code; however, the material and code are dense. We have provided all of our code because we expect the regulatory agencies may attempt to reproduce our results. Please see the accompanying business write-up for a brief summary of our process and results.

Exploratory Data Analysis

The first tab, “Notes from EDA”, contains most of the relevant information gained from the EDA. Most notes point to a tab in the EDA analysis. Our readers are expected to understand statistical analysis and graphs, so the notes are somewhat brief. The plots and visualizations have been broken into tabs for presentation purposes, and the code for each tab is included as an appendix at the bottom of that tab. We note the multicollinearity issues present in the data; however, since we ultimately chose an XGBoost model, we decided against dropping the affected predictors.
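As an illustration, highly correlated predictors could be flagged programmatically with caret::findCorrelation, as in the minimal sketch below. It assumes the df object created in the Package imports chunk that follows, and the 0.9 cutoff is purely illustrative; we did not drop any predictors in the final pipeline.

# Sketch: flag highly correlated predictors (illustrative cutoff of 0.9)
corMat <- cor(subset(df, select = -c(Brand.Code, PH)), use = "complete.obs", method = "pearson")
caret::findCorrelation(corMat, cutoff = 0.9, names = TRUE)  # candidate columns to drop for linear models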

Notes from EDA

  • Removed the 4 records with a null PH value (see the Describe Dataset and PH section)
  • Some clear multicollinearity issues (see the Corrplots section)
    • The last 3 columns appear to share nearly identical correlations with the other predictors
      • Balling.Lvl, Carb.Rel, Alch.Rel
      • Carb.Rel seems to have the highest correlation and might be the one of these 3 columns to keep
    • Some of the hydraulic pressures have high correlations with each other (see the Corrplots section)
    • Balling and Density have high correlations with each other and with the last 3 variables
    • Carb.Temp and Carb.Pressure share a high correlation with each other
    • Our predictors do not have very strong correlations with PH
    • Mnf.Flow has the highest (negative) correlation with PH
  • Brand ‘B’ is the most frequent, followed by Brand ‘D’ (see the Brand Differences section)
  • Differences exist among brands in Density, Balling, and Balling.Lvl (see the Brand Differences section)
  • From the histograms we can see Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Carb.Flow, and Balling have some bins with very low counts
  • Possible transformations are needed for linear models (a quick screening sketch follows these notes)
    • Some of the variables, such as Usage.cont, are skewed and have outliers
    • From the density plots we can see the distributions are far from uniform for some variables, such as Hyd.Pressure1, Hyd.Pressure2, and Hyd.Pressure3
  • Rounded PSC.CO2 and Pressure.Setpoint (see the Rounding section under Deeper EDA)
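The screening sketch referenced in the transformation note above quantifies the skew of one flagged predictor and estimates a Box-Cox transformation for it. It is illustrative only and assumes the df object created in the Package imports chunk below; the XGBoost model we ultimately selected does not require transformed predictors.

# Sketch: quantify skew and screen a skewed predictor for a Box-Cox transformation
psych::skew(df$Usage.cont, na.rm = TRUE)
caret::BoxCoxTrans(na.omit(df$Usage.cont))  # prints the estimated lambda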

Package imports

rm(list=ls())
library(MASS)
library(tidyverse)
library(psych)
library(kableExtra)
library(knitr)
library(corrplot)
library(caret)
library(xlsx)
library(mice)
library(gridExtra)
library(grid)
library(cowplot)
library(VIM)
library(reshape2)
library(doParallel)
df <- read.xlsx("train.xlsx",1)
df_w_na <- df  # keep an unfiltered copy that retains the rows with missing PH
df <-  subset(df, !is.na(PH))
df$Pressure.Setpoint = round(df$Pressure.Setpoint, 0)
df$PSC.CO2 = round(df$PSC.CO2, 2)
test <- read.xlsx("test.xlsx",1)

Density Plots

par(mfrow = c(3,5), cex = .5)
# smoothed scatter plot of each numeric predictor (index vs. value)
for (i in names(df)[sapply(df, is.numeric)]) {
 smoothScatter(df[,i], main = i, ylab = "", xlab = "", colramp = colorRampPalette(c("white", "blue")))
 }

Describe Dataset and PH

  • PH is approximately normally distributed
  • mean (8.55) and median (8.54) are nearly equal
  • slight left (negative) tail
  • rows where PH is null are displayed below (4 rows total)
vars n mean sd median trimmed mad min max range skew kurtosis se
Brand.Code* 1 2447 2.51 1.00 2.00 2.51 0.00 1.00 4.00 3.00 0.38 -1.06 0.02
Carb.Volume 2 2557 5.37 0.11 5.35 5.37 0.11 5.04 5.70 0.66 0.39 -0.47 0.00
Fill.Ounces 3 2529 23.97 0.09 23.97 23.98 0.08 23.63 24.32 0.69 -0.02 0.87 0.00
PC.Volume 4 2528 0.28 0.06 0.27 0.27 0.05 0.08 0.48 0.40 0.35 0.67 0.00
Carb.Pressure 5 2540 68.19 3.54 68.20 68.12 3.56 57.00 79.40 22.40 0.18 -0.01 0.07
Carb.Temp 6 2541 141.09 4.03 140.80 140.99 3.85 128.60 154.00 25.40 0.24 0.23 0.08
PSC 7 2534 0.08 0.05 0.08 0.08 0.05 0.00 0.27 0.27 0.85 0.65 0.00
PSC.Fill 8 2544 0.20 0.12 0.18 0.18 0.12 0.00 0.62 0.62 0.94 0.77 0.00
PSC.CO2 9 2528 0.06 0.04 0.04 0.05 0.03 0.00 0.24 0.24 1.73 3.72 0.00
Mnf.Flow 10 2567 24.63 119.50 70.20 21.14 161.60 -100.20 229.40 329.60 0.00 -1.87 2.36
Carb.Pressure1 11 2535 122.57 4.73 123.20 122.53 4.45 105.60 140.20 34.60 0.04 0.13 0.09
Fill.Pressure 12 2549 47.92 3.18 46.40 47.71 2.37 34.60 60.40 25.80 0.55 1.41 0.06
Hyd.Pressure1 13 2556 12.46 12.43 11.40 10.86 16.90 -0.80 58.00 58.80 0.78 -0.14 0.25
Hyd.Pressure2 14 2552 20.99 16.38 28.60 21.09 13.34 0.00 59.40 59.40 -0.31 -1.56 0.32
Hyd.Pressure3 15 2552 20.48 15.97 27.60 20.53 13.79 -1.20 50.00 51.20 -0.32 -1.57 0.32
Hyd.Pressure4 16 2539 96.31 13.10 96.00 95.46 11.86 62.00 142.00 80.00 0.56 0.61 0.26
Filler.Level 17 2551 109.25 15.70 118.40 111.04 9.19 55.80 161.20 105.40 -0.85 0.05 0.31
Filler.Speed 18 2513 3688.11 769.63 3982.00 3920.25 47.44 998.00 4030.00 3032.00 -2.88 6.75 15.35
Temperature 19 2555 65.96 1.38 65.60 65.80 0.89 63.60 76.20 12.60 2.39 10.25 0.03
Usage.cont 20 2562 20.99 2.98 21.79 21.25 3.19 12.08 25.90 13.82 -0.54 -1.02 0.06
Carb.Flow 21 2565 2472.05 1070.43 3030.00 2604.15 323.21 26.00 5104.00 5078.00 -0.99 -0.57 21.14
Density 22 2567 1.17 0.38 0.98 1.15 0.15 0.24 1.92 1.68 0.53 -1.21 0.01
MFR 23 2359 704.05 73.90 724.00 718.16 15.42 31.40 868.60 837.20 -5.09 30.46 1.52
Balling 24 2567 2.20 0.93 1.65 2.13 0.37 0.16 4.01 3.85 0.60 -1.40 0.02
Pressure.Vacuum 25 2567 -5.22 0.57 -5.40 -5.25 0.59 -6.60 -3.60 3.00 0.53 -0.03 0.01
PH 26 2567 8.55 0.17 8.54 8.55 0.18 7.88 9.36 1.48 -0.29 0.06 0.00
Oxygen.Filler 27 2556 0.05 0.05 0.03 0.04 0.02 0.00 0.40 0.40 2.41 8.84 0.00
Bowl.Setpoint 28 2565 109.35 15.29 120.00 111.36 0.00 70.00 140.00 70.00 -0.97 -0.05 0.30
Pressure.Setpoint 29 2555 47.61 2.04 46.00 47.60 0.00 44.00 52.00 8.00 0.20 -1.60 0.04
Air.Pressurer 30 2567 142.83 1.21 142.60 142.58 0.59 140.80 148.20 7.40 2.25 4.73 0.02
Alch.Rel 31 2560 6.90 0.51 6.56 6.84 0.06 5.28 8.62 3.34 0.88 -0.85 0.01
Carb.Rel 32 2559 5.44 0.13 5.40 5.43 0.12 4.96 6.06 1.10 0.50 -0.30 0.00
Balling.Lvl 33 2566 2.05 0.87 1.48 1.98 0.21 0.00 3.66 3.66 0.59 -1.50 0.02

Training data observations where PH is null

Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
738 B 5.34 23.95 0.11 65.2 138.0 0.01 0.18 0.04 -100.0 138.4 NA 0.0 0.0 0.0 90 NA NA NA 14.76 38 0.24 NA -0.17 -5.2 NA NA 120 46 143.2 NA NA 0.00
1535 B 5.22 23.78 0.30 73.2 151.8 0.11 0.10 0.04 NA 128.2 NA 0.2 0.6 31.8 52 NA NA 71.2 18.34 272 0.62 NA 0.75 -5.2 NA 0.4 90 50 142.6 6.24 5.46 2.72
1719 B 5.31 23.85 0.16 65.4 138.6 0.02 0.34 0.02 NA 133.2 NA 0.0 0.0 1.6 NA NA NA NA 23.94 32 NA NA NA -4.8 NA 0.4 70 50 143.4 NA NA 0.00
2214 B 5.27 23.90 0.23 67.2 142.8 0.02 0.34 0.06 0.2 131.0 NA -0.6 0.2 -1.2 NA NA 1406 67.6 23.72 44 0.60 NA 0.70 -5.4 NA 0.4 110 50 142.2 6.54 5.38 1.34
## General Visualizations
kable(describe(df),digits =2,'markdown',booktabs =T)
box_ph <-  ggplot(df, aes(x="PH", y=PH))+geom_boxplot()
hist_ph <- ggplot(df, aes(x=PH))+geom_histogram(bins=30)
plot_grid(box_ph,hist_ph, labels = c('A', 'B'))


## Grab observations where PH is NA
kable(df_w_na[(which(is.na(df_w_na$PH))),],digits =2,'markdown',booktabs =T)

Brand Differences

  • We explore our brand column, which consists of 4 distinct brands (A, B, C, D) plus some records with a missing brand code
Brand_A Brand_B Brand_C Brand_D Brand_Na
Carb.Volume 5.43 5.31 5.30 5.51 5.29
Fill.Ounces 23.98 23.98 23.98 23.96 24.00
PC.Volume 0.27 0.28 0.29 0.26 0.29
Carb.Pressure 69.28 67.20 66.92 70.70 66.26
Carb.Temp 141.23 141.11 141.05 141.13 140.52
PSC 0.08 0.09 0.09 0.08 0.09
PSC.Fill 0.20 0.19 0.21 0.19 0.19
PSC.CO2 0.06 0.06 0.06 0.05 0.06
Mnf.Flow 39.71 20.58 23.28 25.75 27.15
Carb.Pressure1 122.82 122.44 122.19 122.72 123.46
Fill.Pressure 48.22 48.17 48.23 46.96 48.78
Hyd.Pressure1 12.90 12.57 12.33 12.40 10.89
Hyd.Pressure2 21.16 21.17 19.16 21.97 18.44
Hyd.Pressure3 20.97 20.23 19.05 21.59 19.79
Hyd.Pressure4 101.28 100.11 102.54 83.45 95.98
Filler.Level 108.63 107.96 111.60 110.42 112.15
Filler.Speed 3582.08 3732.54 3673.66 3688.98 3517.83
Temperature 66.06 65.89 66.72 65.46 67.23
Usage.cont 21.14 20.97 21.04 21.01 20.73
Carb.Flow 2387.35 2540.59 2311.54 2437.18 2557.43
Density 1.57 0.91 0.92 1.68 0.97
MFR 704.98 705.64 704.03 703.98 685.10
Balling 3.20 1.51 1.63 3.49 1.67
Pressure.Vacuum -5.24 -5.15 -5.31 -5.26 -5.33
PH 8.50 8.57 8.41 8.60 8.49
Oxygen.Filler 0.04 0.05 0.05 0.04 0.05
Bowl.Setpoint 109.11 107.80 111.62 110.89 112.17
Pressure.Setpoint 47.76 47.94 48.03 46.65 47.80
Air.Pressurer 142.74 142.96 142.77 142.69 142.67
Alch.Rel 7.13 6.55 6.56 7.69 6.64
Carb.Rel 5.52 5.36 5.35 5.61 5.34
Balling.Lvl 3.08 1.41 1.52 3.23 1.47

# Let's compare column means by brand code
A <- df[df$Brand.Code == "A",]
Brand_A <- colMeans(A[,2:ncol(A)], na.rm = TRUE)
B <- df[df$Brand.Code == "B",]
Brand_B <- colMeans(B[,2:ncol(B)], na.rm = TRUE)
C <- df[df$Brand.Code == "C",]
Brand_C <- colMeans(C[,2:ncol(C)], na.rm = TRUE)
D <- df[df$Brand.Code == "D",]
Brand_D <- colMeans(D[,2:ncol(D)], na.rm = TRUE)
Na <- df[!df$Brand.Code %in% c("A", "B", "C", "D"),]
Brand_Na <- colMeans(Na[,2:ncol(Na)], na.rm = TRUE)
Brands <- cbind(Brand_A, Brand_B, Brand_C, Brand_D, Brand_Na)
kable(Brands,digits =2,'markdown',booktabs =T)


## Visualizations for the Brand.Code factor column
hist_brand <- df %>% 
  select_if(is.factor) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + 
  geom_bar() + 
  facet_wrap(~key)

box_brand <- ggplot(df, aes(x = Brand.Code, y = PH)) + 
  geom_boxplot()

plot_grid(hist_brand, box_brand)

Corrplots

  • The corrplots below follow this order
    • The first matrix explores all variables
    • The second matrix explores Balling.Lvl, Carb.Rel, and Alch.Rel
    • The third matrix explores Carb.Pressure and Carb.Temp
    • The fourth matrix explores the hydraulic pressures
    • The fifth and sixth matrices show the correlations of PH with the predictors

Entire DF Corrplot

Very High Correlations Corrplot

Moderate Correlations corrplot

PH Correlations with Predictors

## Entire corrplot
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson"))
## very high correlations
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[30:32,30:32, drop=FALSE], cl.pos='n', method = "number")
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[4:5,4:5, drop=FALSE], cl.pos='n', method = "number")
## somewhat high correlations
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[9:14,9:14, drop=FALSE], cl.pos='n', method = "number")
## Get PH Correlations with predictors 
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[1:16,25, drop=FALSE], cl.pos='n', method = "number")
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[17:32,25, drop=FALSE], cl.pos='n', method = "number")

Scatterplots of PH Versus Predictors / Histograms of Predictors


### melt df for plots predictors 2-12
df.m <- melt(df[,c(2:12,26)], "PH")
ggplot(df.m, aes(value, PH)) + 
  geom_point() + 
  facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~variable, scales = "free")
### melt df for plots predictors 13-25
df.m <- melt(df[,c(13:25,26)], "PH")
ggplot(df.m, aes(value, PH)) + 
  geom_point() + 
  facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~variable, scales = "free")
### melt df for plots predictors 27-33
df.m <- melt(df[,c(26:33)], "PH")
ggplot(df.m, aes(value, PH)) + 
  geom_point() + 
  facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~variable, scales = "free")

Deeper EDA

Variables that Seem Discrete

  • Bowl.Setpoint
    • appears to be set in increments of 10, with a few stray values in between
  • Pressure.Setpoint
    • three stray non-integer measurements (46.4, 46.6, 46.8)
  • PSC.CO2
    • floating-point rounding artifacts are visible in the data (e.g. 0.0199999999999996 vs 0.02)
Bowl.Setpoint Freq
70 99
80 96
90 434
100 112
110 437
120 1307
122 1
126 10
130 51
134 2
140 20
Pressure.Setpoint Freq
44 96
46 1322
46.4 1
46.6 1
46.8 1
48 125
50 1002
52 11
PSC.CO2 Freq
0 108
0.0199999999999996 325
0.0200000000000005 288
0.0399999999999991 37
0.04 624
0.0599999999999996 283
0.0600000000000005 219
0.0799999999999992 23
0.08 234
0.0999999999999996 79
0.100000000000001 65
0.119999999999999 10
0.12 72
0.14 30
0.140000000000001 19
0.159999999999999 3
0.16 36
0.18 12
0.180000000000001 8
0.199999999999999 2
0.2 16
0.22 22
0.24 17

Rounding

  • We rounded PSC.CO2 and Pressure.Setpoint. We do not expect these changes to make any difference in the modeling results (the values all appear evenly distributed around the mean of PH), but we include the step in the EDA because it shows how “deeply” we looked into the data.
PSC.CO2 Freq
0 108
0.02 612
0.04 659
0.06 501
0.08 257
0.1 144
0.12 82
0.14 49
0.16 39
0.18 20
0.2 18
0.22 22
0.24 17
Pressure.Setpoint Freq
44 96
46 1322
47 2
48 125
50 999
52 11

PSC.CO2 and Pressure.Setpoint

  • Scatterplots with the newly rounded data

Bowl.Setpoint predictor

  • Bowl.Setpoint mostly increases in steps of 10, but has a few stray integer values toward the upper end. We filtered for those stray observations to look for any trend and then visualized the variable. We do not believe any changes to it are needed.
##      PH
## 1  8.74
## 2  8.74
## 3  8.74
## 4  8.76
## 5  8.62
## 6  8.62
## 7  8.60
## 8  8.62
## 9  8.56
## 10 8.58
## 11 8.44

kable(table(df$Bowl.Setpoint,dnn = "Bowl Setpoint"), digits = 2,'markdown', booktabs =T)
kable(table(df$Pressure.Setpoint,dnn = "Pressure Setpoint"), digits = 2, 'markdown', booktabs =T,caption = "PRESSURE SETPOINT")
kable(table(df$PSC.CO2,dnn = "PSC.CO2"), digits = 2, 'markdown',booktabs =T,caption = "PSC CO2")

## After rounding visualize Pressure.Setpoint
ggplot(df, aes(Pressure.Setpoint, PH)) + 
  geom_point()
## After rounding visualize PSC.CO2
ggplot(df, aes(PSC.CO2, PH)) + 
  geom_point()

df %>% 
    filter(Bowl.Setpoint> 120 & Bowl.Setpoint <130) %>% 
    dplyr::select(PH)
ggplot(df, aes(Bowl.Setpoint, PH)) + 
  geom_point()

df$PSC.CO2 <- round(df$PSC.CO2,2)
## rounded new table for PSC CO2
table(df$PSC.CO2)
## rounded to integer new table for Pressure setpoint
df$Pressure.Setpoint <- round(df$Pressure.Setpoint,0)
table(df$Pressure.Setpoint)

Imputation and Visualizing Missing Data

  • Imputation will ultimately be handled by the caret package (via the preProcess step in train), but we visualize our missing data below
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##      0.0467471757      0.0038955980      0.0148032723      0.0151928321 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##      0.0105181145      0.0101285547      0.0128554733      0.0089598753 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##      0.0151928321      0.0000000000      0.0124659135      0.0070120764 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##      0.0042851578      0.0058433970      0.0058433970      0.0109076743 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##      0.0062329568      0.0210362291      0.0046747176      0.0019477990 
##         Carb.Flow           Density               MFR           Balling 
##      0.0007791196      0.0000000000      0.0810284379      0.0000000000 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##      0.0000000000      0.0000000000      0.0042851578      0.0007791196 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##      0.0046747176      0.0000000000      0.0027269186      0.0031164784 
##       Balling.Lvl 
##      0.0003895598

## 
##  Variables sorted by number of missings: 
##           Variable        Count
##                MFR 0.0810284379
##         Brand.Code 0.0467471757
##       Filler.Speed 0.0210362291
##          PC.Volume 0.0151928321
##            PSC.CO2 0.0151928321
##        Fill.Ounces 0.0148032723
##                PSC 0.0128554733
##     Carb.Pressure1 0.0124659135
##      Hyd.Pressure4 0.0109076743
##      Carb.Pressure 0.0105181145
##          Carb.Temp 0.0101285547
##           PSC.Fill 0.0089598753
##      Fill.Pressure 0.0070120764
##       Filler.Level 0.0062329568
##      Hyd.Pressure2 0.0058433970
##      Hyd.Pressure3 0.0058433970
##        Temperature 0.0046747176
##  Pressure.Setpoint 0.0046747176
##      Hyd.Pressure1 0.0042851578
##      Oxygen.Filler 0.0042851578
##        Carb.Volume 0.0038955980
##           Carb.Rel 0.0031164784
##           Alch.Rel 0.0027269186
##         Usage.cont 0.0019477990
##          Carb.Flow 0.0007791196
##      Bowl.Setpoint 0.0007791196
##        Balling.Lvl 0.0003895598
##           Mnf.Flow 0.0000000000
##            Density 0.0000000000
##            Balling 0.0000000000
##    Pressure.Vacuum 0.0000000000
##                 PH 0.0000000000
##      Air.Pressurer 0.0000000000
## missing values
pMiss <- function(x){sum(is.na(x))/length(x)}
apply(df,2,pMiss)
aggr_plot <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
#tempData <- mice(df,m=5,maxit=50,meth='pmm',seed=500)
#summary(tempData)
#completedData <- complete(tempData,1)
#write.csv(completedData, file = "imputeddf.csv")
#completedData <- read.csv("imputeddf.csv")
#completedData <- completedData[,-1]

Data Preparation

One Hot encode

The variable Brand Code is categorical, with 4 classes (A, B, C, and D) plus some missing values. We opt for a “one-hot” encoding scheme for this variable, treating the missing code as its own level and creating 5 new variables: BrandCodeA, BrandCodeB, BrandCodeC, BrandCodeD, and BrandCodeNA.

Brand CodeA Brand CodeB Brand CodeC Brand CodeD Brand CodeNA
0 1 0 0 0
1 0 0 0 0
0 1 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 0 0 0
0 1 0 0 0
Brand CodeA Brand CodeB Brand CodeC Brand CodeD Brand CodeNA
0 0 0 1 0
1 0 0 0 0
0 1 0 0 0
0 1 0 0 0
0 1 0 0 0
0 1 0 0 0
1 0 0 0 0
0 1 0 0 0
1 0 0 0 0
0 0 0 1 0
# One-hot encoding the categorical variable `Brand Code`
# (`train` and `eval` are the raw training and evaluation sets; their column names still contain spaces)
train$`Brand Code` <- addNA(train$`Brand Code`)
eval$`Brand Code` <- addNA(eval$`Brand Code`)
brandCodeTrain <- predict(dummyVars(~`Brand Code`, data=train), train)
brandCodeEval <- predict(dummyVars(~`Brand Code`, data=eval), eval)

kable(brandCodeTrain[1:10,],digits = 2,'markdown', booktabs =T)
#head(train$`Brand Code`, 10)
kable(brandCodeEval[1:10,],digits = 2,'markdown', booktabs =T)
#head(eval$`Brand Code`, 10)
train <- cbind(brandCodeTrain, subset(train, select=-c(`Brand Code`)))
eval <- cbind(brandCodeEval, subset(eval, select=-c(`Brand Code`)))

Clean Dataset

  • White spaces and special characters in the column names are removed so that they do not cause issues in some of the R packages.
  • A few rows are missing the target variable (PH). These rows are removed, since they cannot be used for training.
  • Below, we remove the near-zero-variance predictor (HydPressure1) and separate the predictors from the target (BrandCodeNA is also flagged by nearZeroVar but is retained):
# Remove special symbols (white space and `) in names
names(train) <- gsub(pattern=c(' |`'), replacement='', names(train))
names(eval) <- gsub(pattern=c(' |`'), replacement='', names(eval))

# Remove rows in training set with missing target variables
train <- train[complete.cases(train$PH),]

# Check near-zero-variance variables
nearZeroVar(train, names=T)
## [1] "BrandCodeNA"  "HydPressure1"
# Separate the predictors and target, and remove nzv variable
xTrain <- subset(train, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
xEval <- subset(eval, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
yTrain <- train$PH

Modeling

Setting Train Parameters

For the missing values, we experiment with three different imputation algorithms provided in the preProcess function:

  • KNN imputation
  • Bagged trees imputation
  • Median imputation

As will be seen in the “Linear Models” section below, the choice of imputation method does not seem to affect the prediction performance much. We opt to use the knnImpute method due to its high efficiency.

For the linear and non-linear models, the pre-processing step also includes centering and scaling (standardizing), so that the variables all have a mean of 0 and a standard deviation of 1. For the tree-based models this step is omitted, since trees are insensitive to monotone rescaling of the predictors.
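For reference, the same pre-processing can be built as a stand-alone step with caret::preProcess, as in the minimal sketch below; in our pipeline, the preProc argument of train() applies these steps inside each resampling fold instead.

# Sketch: stand-alone pre-processing outside of train()
pp <- preProcess(xTrain, method = c('knnImpute', 'center', 'scale'))
xTrainPP <- predict(pp, xTrain)  # imputed, centered, and scaled predictors
# swap 'knnImpute' for 'bagImpute' or 'medianImpute' to compare imputation methods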

The caret package supports parallel processing (multi-core training). This capability significantly lowers the training time:

Code for Train Parameters

set.seed(1)
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
                          method='cv', 
                          number=5,
                          index=cvFolds)

# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

Linear Models

In this section, we tune 6 models with the following purposes in mind:

  • Understand the effect of different imputation methods on the model performance

  • Find the optimal hyper-parameters for the respective linear models

Setting Tuning Parameters

  • For the partial least squares models, we tune over the number of components used in the model. 3 models are created, each using a different imputation method:

  • For the elastic nets, we tune over the two regularization parameters. Likewise, 3 models are tuned, each with a different imputation method:

  • The performance of these models is compared using the resamples function:

  • As can be seen, the performance differences among the imputation methods are very small. We opt to use the knnImpute method from this point on, due to its efficiency.

## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set

Summary Table

## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS1, PLS2, PLS3, enet1, enet2, enet3 
## Number of resamples: 5 
## 
## MAE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.1040102 0.1063136 0.1071064 0.1067182 0.1079895 0.1081715    0
## PLS2  0.1043480 0.1060543 0.1073844 0.1068265 0.1075589 0.1087867    0
## PLS3  0.1048040 0.1066420 0.1071693 0.1069891 0.1079980 0.1083322    0
## enet1 0.1032539 0.1056163 0.1070119 0.1061140 0.1071730 0.1075147    0
## enet2 0.1039317 0.1055278 0.1069091 0.1062466 0.1071942 0.1076704    0
## enet3 0.1042282 0.1060427 0.1072512 0.1066468 0.1078112 0.1079009    0
## 
## RMSE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.1356567 0.1376909 0.1376948 0.1374621 0.1377316 0.1385362    0
## PLS2  0.1355903 0.1367379 0.1369284 0.1373042 0.1382949 0.1389695    0
## PLS3  0.1363889 0.1377426 0.1379041 0.1378178 0.1384736 0.1385797    0
## enet1 0.1339212 0.1361784 0.1363822 0.1361661 0.1366161 0.1377323    0
## enet2 0.1359433 0.1360144 0.1366364 0.1367142 0.1368978 0.1380791    0
## enet3 0.1348319 0.1363997 0.1368995 0.1368871 0.1369487 0.1393558    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS1  0.3625243 0.3667430 0.3699577 0.3699562 0.3739147 0.3766411    0
## PLS2  0.3642204 0.3674205 0.3686586 0.3696541 0.3721007 0.3758705    0
## PLS3  0.3556132 0.3644181 0.3702737 0.3668091 0.3706663 0.3730739    0
## enet1 0.3700989 0.3727973 0.3781011 0.3778583 0.3811035 0.3871907    0
## enet2 0.3678809 0.3705872 0.3740222 0.3741709 0.3785737 0.3797904    0
## enet3 0.3626465 0.3649488 0.3761665 0.3722877 0.3785394 0.3791373    0

Plot of PLS and Elastic Net

## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat,     method = "oscorespls")
## 
## Call:
## elasticnet::enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Cp statistics of the Lasso fit 
## Cp: 1828.781 1300.154 1029.285  836.820  795.467  791.002  790.239  538.139  458.584  439.565  426.175  352.972  305.674  251.236  243.047  208.556  189.786  190.667  171.832  154.297  154.234  153.671  146.067  138.922  107.104  104.545   88.112   84.251   67.606   55.387   56.761   58.066   54.200   48.890   49.116   32.453   34.000   34.000 
## DF:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 21 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 35 
## Sequence of  moves:
##      MnfFlow BrandCodeC BowlSetpoint Usagecont PressureSetpoint BrandCodeD
## Var       14          3           30        23               31          4
## Step       1          2            3         4                5          6
##      Temperature BrandCodeA CarbPressure1 BrandCodeNA FillOunces PSC
## Var           22          1            15           5          7  11
## Step           7          8             9          10         11  12
##      HydPressure3 CarbFlow PSCCO2 OxygenFiller PSCFill AlchRel CarbTemp
## Var            18       24     13           29      12      33       10
## Step           13       14     15           16      17      18       19
##      PCVolume MFR BrandCodeD FillPressure Density AirPressurer CarbRel
## Var         8  26         -4           16      25           32      34
## Step       20  21         22           23      24           25      26
##      CarbVolume BallingLvl BrandCodeB PressureVacuum HydPressure2
## Var           6         35          2             28           17
## Step         27         28         29             30           31
##      FillerLevel FillerSpeed Balling HydPressure4 CarbPressure BrandCodeD
## Var           20          21      27           19            9         -4
## Step          32          33      34           35           36         37
##        
## Var  38
## Step 38
# PLS
plsFit1 <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('knnImpute', 'center', 'scale'))
plsFit2 <- train(x=xTrain,
                 y=yTrain,
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('bagImpute', 'center', 'scale'))
## option: load a saved plsFit2, since the bagImpute pre-processing takes a long time
plsFit2 <- readRDS("plsFit2.rds")

plsFit3 <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('medianImpute', 'center', 'scale'))

# Elastic Net
enetFit1 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('knnImpute', 'center', 'scale'))
enetFit2 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('bagImpute', 'center', 'scale'))
## option: load a saved enetFit2, since the bagImpute pre-processing takes a long time
enetFit2 <- readRDS("enetFit2.rds")

enetFit3 <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('medianImpute', 'center', 'scale'))

## test results of impute methods
resamples(list(PLS1=plsFit1, PLS2=plsFit2, PLS3=plsFit3,
               enet1=enetFit1, enet2=enetFit2, enet3=enetFit3)) %>% summary()


## keep the knnImpute-based models and plot them
plsFit <- plsFit1
enetFit <- enetFit1
plot(plsFit)
plot(enetFit)

## final models
plsFit$finalModel
enetFit$finalModel

Non-linear Models

  • For the KNN method, the model is tuned over the number of k-nearest neighbors used to make predictions:
  • For the support vector machine, we choose the radial basis kernel function. The hyper-parameter being tuned is the cost value C; the scale parameter sigma is fixed and determined analytically (see the sketch after this list).
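The sketch referenced above: caret's svmRadial fixes sigma using an analytic estimate from kernlab::sigest. Running sigest on the complete cases of our predictors, as below, only approximates what train() does internally on the pre-processed data.

# Sketch: analytic estimate of sigma for the radial basis kernel
library(kernlab)
sigest(na.omit(as.matrix(xTrain)), scaled = TRUE)  # low/median/high estimates; caret derives one fixed sigma from these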

Plot KNN Model

Plot SVM Model

The final non-linear models are:

## 9-nearest neighbor regression model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0201032562509376 
## 
## Number of Support Vectors : 2175 
## 
## Objective Function Value : -1643.345 
## Training error : 0.271298
## load in a saved knn model
knnFit <- readRDS("knn.rds")
## or create model
knnFit <- train(x=xTrain,
                y=yTrain,
                method='knn',
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
## plot knn model
plot(knnFit)


## load in svm model
svmFit <- readRDS("Models//svm.rds")

# or create svm model
svmFit <- train(x=xTrain,
                y=yTrain,
                method="svmRadial",
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
## plot svm model
plot(svmFit)

## final models 
knnFit$finalModel
svmFit$finalModel

Tree-based Models

Random Forest:

  • For the Random Forest model, the mtry parameter, the number of randomly selected predictors considered at each split, is tuned to obtain the optimal model.
  • The rf implementation in R does not permit missing values, so knnImpute is used in the pre-processing step.

XGBoost model:

For the XGBoost model, below is a list of the parameters being tuned:

  • nrounds : boosting iterations (trees)
  • max_depth : max tree depth
  • eta : learning rate
  • gamma : minimum loss reduction
  • colsample_bytree : subsample ratio of columns
  • min_child_weight : minimum sum of instance weight
  • subsample : subsample ratio of rows

  • In addition, XGBoost handles missing values in the data natively. Here, we experiment with both imputing the missing values (with knnImpute) and leaving them unimputed.
    • The performance difference between imputing and not imputing is negligible. We opt to use the imputed model (XGB2), since it is a slight improvement and knnImpute does not take much time to run.

XGBoost imputed (XGB2) versus non-imputed (XGB1):

## 
## Call:
## summary.resamples(object = .)
## 
## Models: XGB1, XGB2 
## Number of resamples: 5 
## 
## MAE 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## XGB1 0.08727048 0.08757008 0.08768184 0.08783904 0.08788029 0.08879249
## XGB2 0.08697277 0.08718546 0.08743389 0.08767905 0.08782876 0.08897435
##      NA's
## XGB1    0
## XGB2    0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## XGB1 0.1154879 0.1161752 0.1165290 0.1171056 0.1181344 0.1192014    0
## XGB2 0.1156593 0.1157656 0.1164978 0.1170502 0.1180158 0.1193124    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## XGB1 0.5285509 0.5346482 0.5357232 0.5414750 0.5503459 0.5581067    0
## XGB2 0.5281014 0.5351413 0.5416740 0.5423755 0.5419176 0.5650431    0

The final tree-based models are:

## Random Forest 
## 
## 2567 samples
##   35 predictor
## 
## Pre-processing: nearest neighbor imputation (35), centered (35),
##  scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 514, 514, 513, 513, 513 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1277131  0.4988421  0.09870782
##    5    0.1221249  0.5299636  0.09319066
##    9    0.1197881  0.5399403  0.09071346
##   13    0.1187889  0.5436157  0.08960861
##   16    0.1185293  0.5426875  0.08926884
##   20    0.1177503  0.5458798  0.08871182
##   24    0.1179296  0.5439091  0.08855143
##   27    0.1179859  0.5399440  0.08836967
##   31    0.1181714  0.5364060  0.08827614
##   35    0.1185114  0.5311831  0.08879924
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
## ##### xgb.Booster
## Handle is invalid! Suggest using xgb.Booster.complete
## raw: 6.6 Mb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, objective = "reg:linear")
## params (as set within xgb.train):
##   eta = "0.01", max_depth = "10", gamma = "0", colsample_bytree = "0.6", min_child_weight = "5", subsample = "1", objective = "reg:linear", silent = "1"
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 35 
## niter: 1500
## nfeatures : 35 
## xNames : BrandCodeA BrandCodeB BrandCodeC BrandCodeD BrandCodeNA CarbVolume FillOunces PCVolume CarbPressure CarbTemp PSC PSCFill PSCCO2 MnfFlow CarbPressure1 FillPressure HydPressure2 HydPressure3 HydPressure4 FillerLevel FillerSpeed Temperature Usagecont CarbFlow Density MFR Balling PressureVacuum OxygenFiller BowlSetpoint PressureSetpoint AirPressurer AlchRel CarbRel BallingLvl 
## problemType : Regression 
## tuneValue :
##       nrounds max_depth  eta gamma colsample_bytree min_child_weight
## 1068    1500        10 0.01     0              0.6                5
##      subsample
## 1068         1
## obsLevels : NA 
## param :
##  list()
## Load rf model
rf <- readRDS("Models//rf.rds")

# Or create Random Forest
rf <- train(x=xTrain, 
            y=yTrain, 
            method='rf',
            tuneLength=10,
            trControl=trControl,
            preProc=c('knnImpute'), 
            importance=T)
# plot RF model
plot(rf)

##load models 
xgb1 <- readRDS("Models//xgb1.rds")
xgb2 <- readRDS("Models//xgb2.rds")

# XGBoost
xgbGrid <- expand.grid(.nrounds=c(100, 500, 1000, 1500), # boosting iterations (trees)
                       .max_depth=c(4, 6, 8, 10), # max tree depth
                       .eta=c(0.001, 0.01, 0.1, 0.5), # learning rate
                       .gamma=c(0),# minimum loss reduction
                       .colsample_bytree=c(0.4, 0.6, 0.8, 1), # subsample ratio of columns
                       .min_child_weight=c(1, 5, 15), # minimum sum of instance weight
                       .subsample=c(0.5, 0.75, 1))  # subsample ratio of rows
xgb1 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl)
xgb2 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl,
              preProc = c('knnImpute'))
# End multi-core processing
stopCluster(cl)
registerDoSEQ()
# resample
resamples(list(XGB1=xgb1, XGB2=xgb2)) %>% summary()

## final models 
rf
xgb <- xgb2
xgb$finalModel

Model Evaluation and Comparison

Variable Importance:

The following models have model-specific variable importance measures:

  • Partial Least Square
  • Random Forest
  • Xgboost

For the other models, the default behavior in caret is to evaluate variable importance based on a loess smoother fit between the target and each predictor (see https://topepo.github.io/caret/variable-importance.html).
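A minimal sketch of this loess-based importance is shown below, using caret's filterVarImp with nonpara = TRUE. For illustration we simply drop incomplete rows rather than imputing, so the numbers will differ slightly from the tabulated rankings that follow.

# Sketch: loess R-squared importance for models without a model-specific measure
cc <- complete.cases(xTrain)
loessImp <- filterVarImp(x = xTrain[cc, ], y = yTrain[cc], nonpara = TRUE)
head(loessImp[order(-loessImp$Overall), , drop = FALSE])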

Below, the variable rankings are calculated and tabulated. As can be seen, the variable importance rankings for the elastic net, KNN, and SVM are identical, since these models have no model-specific method and all rely on the loess R-squared values.

Variable Importance df:

## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:corrplot':
## 
##     corrplot
## The following object is masked from 'package:stats':
## 
##     loadings
Rank pls rf xgbTree enet knn svmRadial
1 MnfFlow MnfFlow MnfFlow MnfFlow MnfFlow MnfFlow
2 BrandCodeC BrandCodeC Usagecont MFR MFR MFR
3 BowlSetpoint PressureVacuum BrandCodeC BowlSetpoint BowlSetpoint BowlSetpoint
4 Usagecont OxygenFiller OxygenFiller FillerLevel FillerLevel FillerLevel
5 FillerLevel BallingLvl AlchRel PressureSetpoint PressureSetpoint PressureSetpoint
6 HydPressure3 AirPressurer Temperature Usagecont Usagecont Usagecont
7 PressureSetpoint AlchRel PressureVacuum CarbFlow CarbFlow CarbFlow
8 FillPressure CarbRel FillerSpeed BrandCodeC BrandCodeC BrandCodeC
9 BrandCodeB Usagecont CarbPressure1 HydPressure3 HydPressure3 HydPressure3
10 Temperature Temperature CarbRel FillPressure FillPressure FillPressure
11 HydPressure2 CarbFlow AirPressurer PressureVacuum PressureVacuum PressureVacuum
12 PressureVacuum FillerSpeed BowlSetpoint HydPressure2 HydPressure2 HydPressure2
13 CarbPressure1 HydPressure3 CarbFlow CarbRel CarbRel CarbRel
14 OxygenFiller CarbPressure1 Balling HydPressure4 HydPressure4 HydPressure4
15 CarbFlow Density BallingLvl Temperature Temperature Temperature
16 BrandCodeD BowlSetpoint FillerLevel BrandCodeD BrandCodeD BrandCodeD
17 AlchRel HydPressure2 Density OxygenFiller OxygenFiller OxygenFiller
18 CarbRel Balling PCVolume FillerSpeed FillerSpeed FillerSpeed
19 BrandCodeA PCVolume MFR AlchRel AlchRel AlchRel
20 BallingLvl FillPressure HydPressure2 CarbPressure1 CarbPressure1 CarbPressure1
21 Density FillerLevel CarbVolume PSCCO2 PSCCO2 PSCCO2
22 HydPressure4 CarbVolume HydPressure3 PSC PSC PSC
23 PSC MFR FillOunces FillOunces FillOunces FillOunces
24 FillOunces HydPressure4 FillPressure PCVolume PCVolume PCVolume
25 CarbVolume BrandCodeA HydPressure4 BrandCodeB BrandCodeB BrandCodeB
26 Balling BrandCodeNA PSC BallingLvl BallingLvl BallingLvl
27 BrandCodeNA BrandCodeB CarbPressure CarbTemp CarbTemp CarbTemp
28 PSCCO2 BrandCodeD CarbTemp BrandCodeA BrandCodeA BrandCodeA
29 CarbPressure PressureSetpoint PSCFill CarbPressure CarbPressure CarbPressure
30 PSCFill CarbPressure BrandCodeB PSCFill PSCFill PSCFill
31 MFR FillOunces PressureSetpoint CarbVolume CarbVolume CarbVolume
32 PCVolume PSCFill PSCCO2 Density Density Density
33 FillerSpeed CarbTemp BrandCodeA BrandCodeNA BrandCodeNA BrandCodeNA
34 CarbTemp PSCCO2 BrandCodeD Balling Balling Balling
35 AirPressurer PSC BrandCodeNA AirPressurer AirPressurer AirPressurer

Variable Importance plots:

Model Performance:

The models’ performance is summarized below:

  • It can be seen that the XGBoost model achieves the best performance.
    • It has the lowest average RMSE. Based on this, we choose the XGBoost model as our final model.
## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS, ENet, KNN, SVM, RF, XGB 
## Number of resamples: 4 
## 
## MAE 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## PLS  0.10401017 0.10573777 0.10715156 0.10662121 0.10803500 0.10817153
## ENet 0.10325386 0.10502571 0.10631413 0.10584920 0.10713763 0.10751469
## KNN  0.10357093 0.10386052 0.10401850 0.10396019 0.10411817 0.10423283
## SVM  0.09480799 0.09557060 0.09584557 0.09591785 0.09619282 0.09717227
## RF   0.08744219 0.08787527 0.08848266 0.08871182 0.08931921 0.09043978
## XGB  0.08697277 0.08713229 0.08730968 0.08735522 0.08753261 0.08782876
##      NA's
## PLS     0
## ENet    0
## KNN     0
## SVM     0
## RF      0
## XGB     0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.1356567 0.1371824 0.1377113 0.1374039 0.1379328 0.1385362    0
## ENet 0.1339212 0.1356141 0.1362803 0.1357745 0.1364407 0.1366161    0
## KNN  0.1344314 0.1345692 0.1349825 0.1357495 0.1361629 0.1386017    0
## SVM  0.1259120 0.1265990 0.1273402 0.1278324 0.1285736 0.1307371    0
## RF   0.1165398 0.1172607 0.1177933 0.1177503 0.1182830 0.1188748    0
## XGB  0.1156593 0.1157390 0.1161317 0.1164846 0.1168773 0.1180158    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.3625243 0.3656883 0.3683503 0.3689665 0.3716286 0.3766411    0
## ENet 0.3700989 0.3761005 0.3796023 0.3791235 0.3826253 0.3871907    0
## KNN  0.3790417 0.3836021 0.3902105 0.3906420 0.3972504 0.4031051    0
## SVM  0.4293764 0.4446522 0.4546945 0.4522908 0.4623331 0.4703980    0
## RF   0.5379883 0.5399803 0.5430877 0.5458798 0.5489872 0.5593555    0
## XGB  0.5351413 0.5400408 0.5417958 0.5459440 0.5476990 0.5650431    0

Boxplots for CV performance:

  • Below are the boxplots of the RMSE in the CV folds for the models.
  • It can be seen that the linear models have very tight spread in their RMSE, while the non-linear and the tree-based models have higher spread.
    • This means that linear models have lower variance than the non-linear and tree-based models.
    • This makes sense, since non-linear and tree-based models are in general more flexible than linear models and therefore more prone to overfitting (higher variance, lower bias).

## custom function to rank variable importance
getRank <- function(trainObjects){
  temp <- c()
  methods <- c()
  for(object in trainObjects){
    methods <- c(methods, object$method)
    varimp <- varImp(object)[[1]]
    varimp$variables <- row.names(varimp)
    rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
    temp <- cbind(temp, rank)
    
  }
  temp <- as.data.frame(temp)
  names(temp) <- methods
  temp$Rank <- c(1:dim(temp)[1])
  temp <- select(temp, Rank, everything())
  return(temp)
}

## print ranked variable importance from the custom function
kable(getRank(list(plsFit, rf, xgb, enetFit, knnFit, svmFit)),digits = 2,'markdown', booktabs =T)

## plot varImp for all models 
plot(varImp(plsFit), main='Variable Importance based on PLS')
plot(varImp(rf), main='Variable Importance based on Random Forest')
plot(varImp(xgb), main='Variable Importance based on XGBoost')
plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')

## summary of model scores 
resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb)) %>% summary()

## boxplots for cv scores
par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb$resample$RMSE, main='CV RMSE for XGB')

## Make predictions
(pred <- predict(xgb, newdata=xEval))
eval$PH <- pred
write.csv(eval, "StudentEvaluation_PH_PREDICTED.csv", row.names=FALSE)

Eval Set Prediction

Below, we make predictions with the XGB model and save the result:

(pred <- predict(xgb, newdata=xEval))
##   [1] 8.489114 8.386810 8.499494 8.676891 8.422321 8.508721 8.479796
##   [8] 8.509658 8.524522 8.733335 8.540319 8.524879 8.476252 8.604997
##  [15] 8.263277 8.584721 8.598623 8.507232 8.503490 8.716899 8.680784
##  [22] 8.667629 8.563601 8.511458 8.669805 8.493483 8.405451 8.685325
##  [29] 8.666525 8.699488 8.642038 8.702867 8.616266 8.672192 8.685741
##  [36] 8.555187 8.504755 8.530453 8.691133 8.752850 8.788443 8.525010
##  [43] 8.294233 8.275476 8.710000 8.754517 8.757333 8.724513 8.702127
##  [50] 8.766453 8.628973 8.692094 8.721963 8.720330 8.863279 8.751065
##  [57] 8.442047 8.509911 8.702737 8.782986 8.788027 8.719201 8.736992
##  [64] 8.802861 8.810700 8.704443 8.547973 8.540565 8.610917 8.506020
##  [71] 8.456654 8.384642 8.520053 8.508084 8.515755 8.496242 8.584035
##  [78] 8.662597 8.725308 8.639506 8.773076 8.661763 8.719378 8.628849
##  [85] 8.708687 8.742179 8.739834 8.647678 8.537263 8.726831 8.616635
##  [92] 8.618535 8.537017 8.510901 8.567130 8.626575 8.634910 8.625755
##  [99] 8.660060 8.676085 8.668866 8.717737 8.719065 8.701555 8.777940
## [106] 8.807128 8.490880 8.430260 8.464031 8.511809 8.695340 8.700084
## [113] 8.648747 8.743848 8.640017 8.716689 8.749076 8.740728 8.737554
## [120] 8.699940 8.704624 8.733253 8.755580 8.676638 8.505178 8.452633
## [127] 8.498458 8.298923 8.468449 8.465969 8.553341 8.530916 8.548730
## [134] 8.479448 8.460509 8.530807 8.554157 8.455324 8.437445 8.425599
## [141] 8.464765 8.524144 8.487891 8.522289 8.422502 8.503467 8.469502
## [148] 8.573105 8.473152 8.603023 8.623924 8.594813 8.442616 8.638641
## [155] 8.460707 8.483208 8.416842 8.495853 8.579787 8.516457 8.601458
## [162] 8.627097 8.637206 8.643469 8.654936 8.605709 8.470303 8.739980
## [169] 8.669416 8.767879 8.478296 8.540945 8.331718 8.439440 8.477420
## [176] 8.577559 8.421168 8.489767 8.400572 8.470026 8.429167 8.548959
## [183] 8.507633 8.489511 8.522857 8.308163 8.326262 8.485365 8.404167
## [190] 8.372846 8.354460 8.522525 8.549019 8.459602 8.474385 8.332462
## [197] 8.310847 8.224078 8.243311 8.423572 8.403149 8.416455 8.460292
## [204] 8.375723 8.351536 8.495371 8.497482 8.375500 8.437349 8.289114
## [211] 8.331231 8.385859 8.427534 8.519608 8.426883 8.424350 8.373230
## [218] 8.345346 8.504831 8.502677 8.483864 8.471354 8.445770 8.391487
## [225] 8.611402 8.492319 8.472500 8.467644 8.468987 8.470590 8.470867
## [232] 8.513150 8.493874 8.576015 8.582872 8.501921 8.623503 8.667575
## [239] 8.486345 8.516557 8.488672 8.562017 8.518543 8.518978 8.411890
## [246] 8.417725 8.422590 8.532801 8.582995 8.478495 8.437906 8.460243
## [253] 8.514612 8.561201 8.558359 8.552038 8.488605 8.635681 8.513274
## [260] 8.525599 8.616925 8.639329 8.508380 8.615664 8.255293 8.350628
## [267] 8.123829
eval$PH <- pred
write.csv(eval, "StudentEvaluation_PH_PREDICTED.csv", row.names=FALSE)