Below we provide a technical write up on the production process of our mineral water brand “ABC Beverage”. This is a supplementary guide to the accompanying business write up.
Due to recent fears over the high metallic nature of bottled mineral waters, the government has asked us to document our production process and provide a predictive model for ph levels in our product. Below we will document our entire process with code in three parts. Part one will contain the Exploratory Data Analysis (EDA). Part two will go over data preparation for modeling, and part three will cover the entire modeling process. There will be some documentaion accompanying our code, however,the material and code is dense. We have provided all our code as we expect the regulatory agencies may attempt to reproduce our results. Please see the business write up click here for a brief summary of our process and results
The first tab “Notes from EDA”, contains most of the relevant information gained from EDA. Most notes point towards a tab in the EDA analysis. Our readers are expected to be able to understand statistical analysis and graphs, as such notes are somewhat limited. The plots and visualizations have been broken down into tabs for presentation purpose. The code for each tab has been added as an appendix at the bottom of each tab. We note multicollinearity issues that are present in the data, however, as we eventually chose a XGboost model, we decided against dropping certain predictors.
Package imports
rm(list=ls())
library (MASS)
library(tidyverse)
library(psych)
library(kableExtra)
library(knitr)
library(corrplot)
library(caret)
library(xlsx)
library(mice)
library(gridExtra)
library(grid)
library(cowplot)
library(mice)
library(VIM)
library(reshape2)
library(doParallel)df <- read.xlsx("train.xlsx",1)
df_w_na <- copy(df)
df <- subset(df, !is.na(PH))
df$Pressure.Setpoint = round(df$Pressure.Setpoint, 0)
df$PSC.CO2 = round(df$PSC.CO2, 2)
test <- read.xlsx("test.xlsx",1)par(mfrow = c(3,5), cex = .5)
for (i in colnames(df)) {
smoothScatter(df[,i], main = names(df[i]), ylab = "", xlab = "", colramp = colorRampPalette(c("white", "blue")))
}| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brand.Code* | 1 | 2447 | 2.51 | 1.00 | 2.00 | 2.51 | 0.00 | 1.00 | 4.00 | 3.00 | 0.38 | -1.06 | 0.02 |
| Carb.Volume | 2 | 2557 | 5.37 | 0.11 | 5.35 | 5.37 | 0.11 | 5.04 | 5.70 | 0.66 | 0.39 | -0.47 | 0.00 |
| Fill.Ounces | 3 | 2529 | 23.97 | 0.09 | 23.97 | 23.98 | 0.08 | 23.63 | 24.32 | 0.69 | -0.02 | 0.87 | 0.00 |
| PC.Volume | 4 | 2528 | 0.28 | 0.06 | 0.27 | 0.27 | 0.05 | 0.08 | 0.48 | 0.40 | 0.35 | 0.67 | 0.00 |
| Carb.Pressure | 5 | 2540 | 68.19 | 3.54 | 68.20 | 68.12 | 3.56 | 57.00 | 79.40 | 22.40 | 0.18 | -0.01 | 0.07 |
| Carb.Temp | 6 | 2541 | 141.09 | 4.03 | 140.80 | 140.99 | 3.85 | 128.60 | 154.00 | 25.40 | 0.24 | 0.23 | 0.08 |
| PSC | 7 | 2534 | 0.08 | 0.05 | 0.08 | 0.08 | 0.05 | 0.00 | 0.27 | 0.27 | 0.85 | 0.65 | 0.00 |
| PSC.Fill | 8 | 2544 | 0.20 | 0.12 | 0.18 | 0.18 | 0.12 | 0.00 | 0.62 | 0.62 | 0.94 | 0.77 | 0.00 |
| PSC.CO2 | 9 | 2528 | 0.06 | 0.04 | 0.04 | 0.05 | 0.03 | 0.00 | 0.24 | 0.24 | 1.73 | 3.72 | 0.00 |
| Mnf.Flow | 10 | 2567 | 24.63 | 119.50 | 70.20 | 21.14 | 161.60 | -100.20 | 229.40 | 329.60 | 0.00 | -1.87 | 2.36 |
| Carb.Pressure1 | 11 | 2535 | 122.57 | 4.73 | 123.20 | 122.53 | 4.45 | 105.60 | 140.20 | 34.60 | 0.04 | 0.13 | 0.09 |
| Fill.Pressure | 12 | 2549 | 47.92 | 3.18 | 46.40 | 47.71 | 2.37 | 34.60 | 60.40 | 25.80 | 0.55 | 1.41 | 0.06 |
| Hyd.Pressure1 | 13 | 2556 | 12.46 | 12.43 | 11.40 | 10.86 | 16.90 | -0.80 | 58.00 | 58.80 | 0.78 | -0.14 | 0.25 |
| Hyd.Pressure2 | 14 | 2552 | 20.99 | 16.38 | 28.60 | 21.09 | 13.34 | 0.00 | 59.40 | 59.40 | -0.31 | -1.56 | 0.32 |
| Hyd.Pressure3 | 15 | 2552 | 20.48 | 15.97 | 27.60 | 20.53 | 13.79 | -1.20 | 50.00 | 51.20 | -0.32 | -1.57 | 0.32 |
| Hyd.Pressure4 | 16 | 2539 | 96.31 | 13.10 | 96.00 | 95.46 | 11.86 | 62.00 | 142.00 | 80.00 | 0.56 | 0.61 | 0.26 |
| Filler.Level | 17 | 2551 | 109.25 | 15.70 | 118.40 | 111.04 | 9.19 | 55.80 | 161.20 | 105.40 | -0.85 | 0.05 | 0.31 |
| Filler.Speed | 18 | 2513 | 3688.11 | 769.63 | 3982.00 | 3920.25 | 47.44 | 998.00 | 4030.00 | 3032.00 | -2.88 | 6.75 | 15.35 |
| Temperature | 19 | 2555 | 65.96 | 1.38 | 65.60 | 65.80 | 0.89 | 63.60 | 76.20 | 12.60 | 2.39 | 10.25 | 0.03 |
| Usage.cont | 20 | 2562 | 20.99 | 2.98 | 21.79 | 21.25 | 3.19 | 12.08 | 25.90 | 13.82 | -0.54 | -1.02 | 0.06 |
| Carb.Flow | 21 | 2565 | 2472.05 | 1070.43 | 3030.00 | 2604.15 | 323.21 | 26.00 | 5104.00 | 5078.00 | -0.99 | -0.57 | 21.14 |
| Density | 22 | 2567 | 1.17 | 0.38 | 0.98 | 1.15 | 0.15 | 0.24 | 1.92 | 1.68 | 0.53 | -1.21 | 0.01 |
| MFR | 23 | 2359 | 704.05 | 73.90 | 724.00 | 718.16 | 15.42 | 31.40 | 868.60 | 837.20 | -5.09 | 30.46 | 1.52 |
| Balling | 24 | 2567 | 2.20 | 0.93 | 1.65 | 2.13 | 0.37 | 0.16 | 4.01 | 3.85 | 0.60 | -1.40 | 0.02 |
| Pressure.Vacuum | 25 | 2567 | -5.22 | 0.57 | -5.40 | -5.25 | 0.59 | -6.60 | -3.60 | 3.00 | 0.53 | -0.03 | 0.01 |
| PH | 26 | 2567 | 8.55 | 0.17 | 8.54 | 8.55 | 0.18 | 7.88 | 9.36 | 1.48 | -0.29 | 0.06 | 0.00 |
| Oxygen.Filler | 27 | 2556 | 0.05 | 0.05 | 0.03 | 0.04 | 0.02 | 0.00 | 0.40 | 0.40 | 2.41 | 8.84 | 0.00 |
| Bowl.Setpoint | 28 | 2565 | 109.35 | 15.29 | 120.00 | 111.36 | 0.00 | 70.00 | 140.00 | 70.00 | -0.97 | -0.05 | 0.30 |
| Pressure.Setpoint | 29 | 2555 | 47.61 | 2.04 | 46.00 | 47.60 | 0.00 | 44.00 | 52.00 | 8.00 | 0.20 | -1.60 | 0.04 |
| Air.Pressurer | 30 | 2567 | 142.83 | 1.21 | 142.60 | 142.58 | 0.59 | 140.80 | 148.20 | 7.40 | 2.25 | 4.73 | 0.02 |
| Alch.Rel | 31 | 2560 | 6.90 | 0.51 | 6.56 | 6.84 | 0.06 | 5.28 | 8.62 | 3.34 | 0.88 | -0.85 | 0.01 |
| Carb.Rel | 32 | 2559 | 5.44 | 0.13 | 5.40 | 5.43 | 0.12 | 4.96 | 6.06 | 1.10 | 0.50 | -0.30 | 0.00 |
| Balling.Lvl | 33 | 2566 | 2.05 | 0.87 | 1.48 | 1.98 | 0.21 | 0.00 | 3.66 | 3.66 | 0.59 | -1.50 | 0.02 |
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Display PH is null training data observations
| Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | PH | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 738 | B | 5.34 | 23.95 | 0.11 | 65.2 | 138.0 | 0.01 | 0.18 | 0.04 | -100.0 | 138.4 | NA | 0.0 | 0.0 | 0.0 | 90 | NA | NA | NA | 14.76 | 38 | 0.24 | NA | -0.17 | -5.2 | NA | NA | 120 | 46 | 143.2 | NA | NA | 0.00 |
| 1535 | B | 5.22 | 23.78 | 0.30 | 73.2 | 151.8 | 0.11 | 0.10 | 0.04 | NA | 128.2 | NA | 0.2 | 0.6 | 31.8 | 52 | NA | NA | 71.2 | 18.34 | 272 | 0.62 | NA | 0.75 | -5.2 | NA | 0.4 | 90 | 50 | 142.6 | 6.24 | 5.46 | 2.72 |
| 1719 | B | 5.31 | 23.85 | 0.16 | 65.4 | 138.6 | 0.02 | 0.34 | 0.02 | NA | 133.2 | NA | 0.0 | 0.0 | 1.6 | NA | NA | NA | NA | 23.94 | 32 | NA | NA | NA | -4.8 | NA | 0.4 | 70 | 50 | 143.4 | NA | NA | 0.00 |
| 2214 | B | 5.27 | 23.90 | 0.23 | 67.2 | 142.8 | 0.02 | 0.34 | 0.06 | 0.2 | 131.0 | NA | -0.6 | 0.2 | -1.2 | NA | NA | 1406 | 67.6 | 23.72 | 44 | 0.60 | NA | 0.70 | -5.4 | NA | 0.4 | 110 | 50 | 142.2 | 6.54 | 5.38 | 1.34 |
## General Visualizations
kable(describe(df),digits =2,'markdown',booktabs =T)
box_ph <- ggplot(df, aes(x="PH", y=PH))+geom_boxplot()
hist_ph <- ggplot(df, aes(x=PH))+geom_histogram(stat='count')
plot_grid(box_ph,hist_ph, labels = c('A', 'B'))
## Grab observations where PH is NA
kable(df_w_na[(which(is.na(df_w_na$PH))),],digits =2,'markdown',booktabs =T)| Brand_A | Brand_B | Brand_C | Brand_D | Brand_Na | |
|---|---|---|---|---|---|
| Carb.Volume | 5.43 | 5.31 | 5.30 | 5.51 | 5.29 |
| Fill.Ounces | 23.98 | 23.98 | 23.98 | 23.96 | 24.00 |
| PC.Volume | 0.27 | 0.28 | 0.29 | 0.26 | 0.29 |
| Carb.Pressure | 69.28 | 67.20 | 66.92 | 70.70 | 66.26 |
| Carb.Temp | 141.23 | 141.11 | 141.05 | 141.13 | 140.52 |
| PSC | 0.08 | 0.09 | 0.09 | 0.08 | 0.09 |
| PSC.Fill | 0.20 | 0.19 | 0.21 | 0.19 | 0.19 |
| PSC.CO2 | 0.06 | 0.06 | 0.06 | 0.05 | 0.06 |
| Mnf.Flow | 39.71 | 20.58 | 23.28 | 25.75 | 27.15 |
| Carb.Pressure1 | 122.82 | 122.44 | 122.19 | 122.72 | 123.46 |
| Fill.Pressure | 48.22 | 48.17 | 48.23 | 46.96 | 48.78 |
| Hyd.Pressure1 | 12.90 | 12.57 | 12.33 | 12.40 | 10.89 |
| Hyd.Pressure2 | 21.16 | 21.17 | 19.16 | 21.97 | 18.44 |
| Hyd.Pressure3 | 20.97 | 20.23 | 19.05 | 21.59 | 19.79 |
| Hyd.Pressure4 | 101.28 | 100.11 | 102.54 | 83.45 | 95.98 |
| Filler.Level | 108.63 | 107.96 | 111.60 | 110.42 | 112.15 |
| Filler.Speed | 3582.08 | 3732.54 | 3673.66 | 3688.98 | 3517.83 |
| Temperature | 66.06 | 65.89 | 66.72 | 65.46 | 67.23 |
| Usage.cont | 21.14 | 20.97 | 21.04 | 21.01 | 20.73 |
| Carb.Flow | 2387.35 | 2540.59 | 2311.54 | 2437.18 | 2557.43 |
| Density | 1.57 | 0.91 | 0.92 | 1.68 | 0.97 |
| MFR | 704.98 | 705.64 | 704.03 | 703.98 | 685.10 |
| Balling | 3.20 | 1.51 | 1.63 | 3.49 | 1.67 |
| Pressure.Vacuum | -5.24 | -5.15 | -5.31 | -5.26 | -5.33 |
| PH | 8.50 | 8.57 | 8.41 | 8.60 | 8.49 |
| Oxygen.Filler | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 |
| Bowl.Setpoint | 109.11 | 107.80 | 111.62 | 110.89 | 112.17 |
| Pressure.Setpoint | 47.76 | 47.94 | 48.03 | 46.65 | 47.80 |
| Air.Pressurer | 142.74 | 142.96 | 142.77 | 142.69 | 142.67 |
| Alch.Rel | 7.13 | 6.55 | 6.56 | 7.69 | 6.64 |
| Carb.Rel | 5.52 | 5.36 | 5.35 | 5.61 | 5.34 |
| Balling.Lvl | 3.08 | 1.41 | 1.52 | 3.23 | 1.47 |
## Warning: Ignoring unknown parameters: binwidth, bins, pad
#Lets see brand codes
A <- df[df$Brand.Code == "A",]
Brand_A <- colMeans(A[,2:ncol(A)], na.rm = TRUE)
B <- df[df$Brand.Code == "B",]
Brand_B <- colMeans(B[,2:ncol(B)], na.rm = TRUE)
C <- df[df$Brand.Code == "C",]
Brand_C <- colMeans(C[,2:ncol(C)], na.rm = TRUE)
D <- df[df$Brand.Code == "D",]
Brand_D <- colMeans(D[,2:ncol(D)], na.rm = TRUE)
Na <- df[!df$Brand.Code %in% c("A", "B", "C", "D"),]
Brand_Na <- colMeans(Na[,2:ncol(Na)], na.rm = TRUE)
Brands <- cbind(Brand_A, Brand_B, Brand_C, Brand_D, Brand_Na)
kable(Brands,digits =2,'markdown',booktabs =T)
## Visualizations for Brand code factor column
boxplot_PH <- df %>%
select_if(is.factor) %>%
gather %>%
ggplot(aes(x = value)) +
geom_histogram(stat='count') +
facet_wrap(~key)
Hist_PH <- df %>%
select_if(is.factor) %>%
gather %>%
ggplot(aes(x = value)) +
geom_boxplot(aes(x = df$Brand.Code,y = df$PH)) +
facet_wrap(~key)
plot_grid(BOXPLOT_PH,Hist_PH)Entire DF Corrplot
Very High Correlations Corrplot
Moderate Correlations corrplot
** PH Correlations with Predictors**
## Entire corrplot
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson"))
## very high cvorrelations
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[30:32,30:32, drop=FALSE], cl.pos='n', method = "number")
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[4:5,4:5, drop=FALSE], cl.pos='n', method = "number")
## somewhat high correlations
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[9:14,9:14, drop=FALSE], cl.pos='n', method = "number")
## Get PH Correlations with predictors
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[1:16,25, drop=FALSE], cl.pos='n', method = "number")
corrplot(cor(df[,2:33],use = "complete.obs", method = "pearson")[17:32,25, drop=FALSE], cl.pos='n', method = "number")## Warning: Removed 285 rows containing missing values (geom_point).
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 285 rows containing non-finite values (stat_count).
## Warning: Removed 366 rows containing missing values (geom_point).
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 366 rows containing non-finite values (stat_count).
## Warning: Removed 41 rows containing missing values (geom_point).
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 41 rows containing non-finite values (stat_count).
### melt df for plots predictors 2-12
df.m <- melt(df[,c(2:12,26)], "PH")
ggplot(df.m, aes(value, PH)) +
geom_point() +
facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) +
geom_histogram(stat = "count") +
facet_wrap(~variable, scales = "free")
### melt df for plots predictors 13-25
df.m <- melt(df[,c(13:25,26)], "PH")
ggplot(df.m, aes(value, PH)) +
geom_point() +
facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) +
geom_histogram(stat = "count") +
facet_wrap(~variable, scales = "free")
### melt df for plots predictors 26-33
df.m <- melt(df[,c(26:33)], "PH")
ggplot(df.m, aes(value, PH)) +
geom_point() +
facet_wrap(~variable, scales = "free")
ggplot(df.m, aes(value)) +
geom_histogram(stat = "count") +
facet_wrap(~variable, scales = "free")Variables that Seem Discrete
| Bowl.Setpoint | Freq |
|---|---|
| 70 | 99 |
| 80 | 96 |
| 90 | 434 |
| 100 | 112 |
| 110 | 437 |
| 120 | 1307 |
| 122 | 1 |
| 126 | 10 |
| 130 | 51 |
| 134 | 2 |
| 140 | 20 |
| Pressure.Setpoint | Freq |
|---|---|
| 44 | 96 |
| 46 | 1322 |
| 46.4 | 1 |
| 46.6 | 1 |
| 46.8 | 1 |
| 48 | 125 |
| 50 | 1002 |
| 52 | 11 |
| PSC.CO2 | Freq |
|---|---|
| 0 | 108 |
| 0.0199999999999996 | 325 |
| 0.0200000000000005 | 288 |
| 0.0399999999999991 | 37 |
| 0.04 | 624 |
| 0.0599999999999996 | 283 |
| 0.0600000000000005 | 219 |
| 0.0799999999999992 | 23 |
| 0.08 | 234 |
| 0.0999999999999996 | 79 |
| 0.100000000000001 | 65 |
| 0.119999999999999 | 10 |
| 0.12 | 72 |
| 0.14 | 30 |
| 0.140000000000001 | 19 |
| 0.159999999999999 | 3 |
| 0.16 | 36 |
| 0.18 | 12 |
| 0.180000000000001 | 8 |
| 0.199999999999999 | 2 |
| 0.2 | 16 |
| 0.22 | 22 |
| 0.24 | 17 |
Rounding
| PSC.CO2 | Freq |
|---|---|
| 0 | 108 |
| 0.02 | 612 |
| 0.04 | 659 |
| 0.06 | 501 |
| 0.08 | 257 |
| 0.1 | 144 |
| 0.12 | 82 |
| 0.14 | 49 |
| 0.16 | 39 |
| 0.18 | 20 |
| 0.2 | 18 |
| 0.22 | 22 |
| 0.24 | 17 |
| Pressure.Setpoint | Freq |
|---|---|
| 44 | 96 |
| 46 | 1322 |
| 47 | 2 |
| 48 | 125 |
| 50 | 999 |
| 52 | 11 |
PSC.CO2 and Pressure.setpoint
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 39 rows containing missing values (geom_point).
Bowl.Setpoint predictor
## PH
## 1 8.74
## 2 8.74
## 3 8.74
## 4 8.76
## 5 8.62
## 6 8.62
## 7 8.60
## 8 8.62
## 9 8.56
## 10 8.58
## 11 8.44
## Warning: Removed 2 rows containing missing values (geom_point).
kable(table(df$Bowl.Setpoint,dnn = "Bowl Setpoint"), digits = 2,'markdown', booktabs =T)
kable(table(df$Pressure.Setpoint,dnn = "Pressure Setpoint"), digits = 2, 'markdown', booktabs =T,caption = "PRESSURE SETPOINT")
kable(table(df$PSC.CO2,dnn = "PSC.CO2"), digits = 2, 'markdown',booktabs =T,caption = "PSC CO2")
## After rounding visualize Pressure.Setpoint
ggplot(df, aes(Pressure.Setpoint, PH)) +
geom_point()
## After rounding visualize PSC.CO2
ggplot(df, aes(PSC.CO2, PH)) +
geom_point()
df %>%
filter(Bowl.Setpoint> 120 & Bowl.Setpoint <130) %>%
dplyr::select(PH)
ggplot(df, aes(Bowl.Setpoint, PH)) +
geom_point()
df$PSC.CO2 <- round(df$PSC.CO2,2)
## rounded new table for PSC CO2
table(df$PSC.CO2)
## rounded to integer new table for Pressure setpoint
df$Pressure.Setpoint <- round(df$Pressure.Setpoint,0)
table(df$Pressure.Setpoint)## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 0.0467471757 0.0038955980 0.0148032723 0.0151928321
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 0.0105181145 0.0101285547 0.0128554733 0.0089598753
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 0.0151928321 0.0000000000 0.0124659135 0.0070120764
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## 0.0042851578 0.0058433970 0.0058433970 0.0109076743
## Filler.Level Filler.Speed Temperature Usage.cont
## 0.0062329568 0.0210362291 0.0046747176 0.0019477990
## Carb.Flow Density MFR Balling
## 0.0007791196 0.0000000000 0.0810284379 0.0000000000
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 0.0000000000 0.0000000000 0.0042851578 0.0007791196
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 0.0046747176 0.0000000000 0.0027269186 0.0031164784
## Balling.Lvl
## 0.0003895598
##
## Variables sorted by number of missings:
## Variable Count
## MFR 0.0810284379
## Brand.Code 0.0467471757
## Filler.Speed 0.0210362291
## PC.Volume 0.0151928321
## PSC.CO2 0.0151928321
## Fill.Ounces 0.0148032723
## PSC 0.0128554733
## Carb.Pressure1 0.0124659135
## Hyd.Pressure4 0.0109076743
## Carb.Pressure 0.0105181145
## Carb.Temp 0.0101285547
## PSC.Fill 0.0089598753
## Fill.Pressure 0.0070120764
## Filler.Level 0.0062329568
## Hyd.Pressure2 0.0058433970
## Hyd.Pressure3 0.0058433970
## Temperature 0.0046747176
## Pressure.Setpoint 0.0046747176
## Hyd.Pressure1 0.0042851578
## Oxygen.Filler 0.0042851578
## Carb.Volume 0.0038955980
## Carb.Rel 0.0031164784
## Alch.Rel 0.0027269186
## Usage.cont 0.0019477990
## Carb.Flow 0.0007791196
## Bowl.Setpoint 0.0007791196
## Balling.Lvl 0.0003895598
## Mnf.Flow 0.0000000000
## Density 0.0000000000
## Balling 0.0000000000
## Pressure.Vacuum 0.0000000000
## PH 0.0000000000
## Air.Pressurer 0.0000000000
## missing values
pMiss <- function(x){sum(is.na(x))/length(x)}
apply(df,2,pMiss)
aggr_plot <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
#tempData <- mice(df,m=5,maxit=50,meth='pmm',seed=500)
#summary(tempData)
#completedData <- complete(tempData,1)
#write.csv(completedData, file = "imputeddf.csv")
#completedData <- read.csv("imputeddf.csv")
#completedData <- completedData[,-1]The variable Brand Code is a categorical variable, having 4 classes (A, B, C, and D). We opt to use the “one-hot” encoding scheme for this variable, creating 5 new variables for the data: BrandCodeA, BrandCodeB, BrandCodeC, BrandCodeD, and BrandCodeNA.
Brand CodeA |
Brand CodeB |
Brand CodeC |
Brand CodeD |
Brand CodeNA |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
Brand CodeA |
Brand CodeB |
Brand CodeC |
Brand CodeD |
Brand CodeNA |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
# One-hot encoding the categorical variable `Brand Code`
train$`Brand Code` <- addNA(train$`Brand Code`)
eval$`Brand Code` <- addNA(eval$`Brand Code`)
brandCodeTrain <- predict(dummyVars(~`Brand Code`, data=train), train)
brandCodeEval <- predict(dummyVars(~`Brand Code`, data=eval), eval)
kable(brandCodeTrain[1:10,],digits = 2,'markdown', booktabs =T)
#head(train$`Brand Code`, 10)
kable(brandCodeEval[1:10,],digits = 2,'markdown', booktabs =T)
#head(eval$`Brand Code`, 10)
train <- cbind(brandCodeTrain, subset(train, select=-c(`Brand Code`)))
eval <- cbind(brandCodeEval, subset(eval, select=-c(`Brand Code`)))# Remove special symbols (white space and `) in names
names(train) <- gsub(patter=c(' |`'), replacement='', names(train))
names(eval) <- gsub(patter=c(' |`'), replacement='', names(eval))
# Remove rows in training set with missing target variables
train <- train[complete.cases(train$PH),]
# Check near-zero-variance variables
nearZeroVar(train, names=T)## [1] "BrandCodeNA" "HydPressure1"
# Separate the predictors and target, and remove nzv variable
xTrain <- subset(train, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
xEval <- subset(eval, select=-c(PH,`HydPressure1`)) %>% as.data.frame()
yTrain <- train$PHSetting Train Parameters
For the missing values, we experiment with three different imputation algorithms provided in the preProcess function:
As will be seen in the “Linear Models” section below, the choice of imputation method does not seem to affect the prediction performance much. We opt to use the knnImpute method due to its high efficiency.
For the linear and non-linear models, the pre-processing step also include centering and scaling (standardizing), so that the variables all have a mean of 0 and standard deviation of 1. For the tree-based models, this step is omitted, since tree models work fine without this step.
The caret package supports parallel processing (multi-core training). This capability significantly lowers the training time:
set.seed(1)
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
method='cv',
number=5,
index=cvFolds)
# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)In this section, we tune 6 models for the following purpose in mind:
Understand the effect of different imputation methods on the model performance
Find the optimal hyper-parameters for the respective linear models
Setting Tuning Parameters
For the partial least squares, the models are tuned over the number of components used in the model. 3 models are created, each uses a different imputation method:
For the elastic nets, the models are tuned over the two regularization parameters. Likewise, 3 models are tuned, each with different imputation method:
The performance of these models are be compared using the resamples function:
As you can see, the performance differences are very small among the different imputation methods. We opt to use the knnImpute method from this point on, due to its efficiency.
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 14 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.5, lambda = 0 on full training set
Summary Table
##
## Call:
## summary.resamples(object = .)
##
## Models: PLS1, PLS2, PLS3, enet1, enet2, enet3
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS1 0.1040102 0.1063136 0.1071064 0.1067182 0.1079895 0.1081715 0
## PLS2 0.1043480 0.1060543 0.1073844 0.1068265 0.1075589 0.1087867 0
## PLS3 0.1048040 0.1066420 0.1071693 0.1069891 0.1079980 0.1083322 0
## enet1 0.1032539 0.1056163 0.1070119 0.1061140 0.1071730 0.1075147 0
## enet2 0.1039317 0.1055278 0.1069091 0.1062466 0.1071942 0.1076704 0
## enet3 0.1042282 0.1060427 0.1072512 0.1066468 0.1078112 0.1079009 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS1 0.1356567 0.1376909 0.1376948 0.1374621 0.1377316 0.1385362 0
## PLS2 0.1355903 0.1367379 0.1369284 0.1373042 0.1382949 0.1389695 0
## PLS3 0.1363889 0.1377426 0.1379041 0.1378178 0.1384736 0.1385797 0
## enet1 0.1339212 0.1361784 0.1363822 0.1361661 0.1366161 0.1377323 0
## enet2 0.1359433 0.1360144 0.1366364 0.1367142 0.1368978 0.1380791 0
## enet3 0.1348319 0.1363997 0.1368995 0.1368871 0.1369487 0.1393558 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS1 0.3625243 0.3667430 0.3699577 0.3699562 0.3739147 0.3766411 0
## PLS2 0.3642204 0.3674205 0.3686586 0.3696541 0.3721007 0.3758705 0
## PLS3 0.3556132 0.3644181 0.3702737 0.3668091 0.3706663 0.3730739 0
## enet1 0.3700989 0.3727973 0.3781011 0.3778583 0.3811035 0.3871907 0
## enet2 0.3678809 0.3705872 0.3740222 0.3741709 0.3785737 0.3797904 0
## enet3 0.3626465 0.3649488 0.3761665 0.3722877 0.3785394 0.3791373 0
Plot of PLS and Elastic Net
## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat, method = "oscorespls")
##
## Call:
## elasticnet::enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Cp statistics of the Lasso fit
## Cp: 1828.781 1300.154 1029.285 836.820 795.467 791.002 790.239 538.139 458.584 439.565 426.175 352.972 305.674 251.236 243.047 208.556 189.786 190.667 171.832 154.297 154.234 153.671 146.067 138.922 107.104 104.545 88.112 84.251 67.606 55.387 56.761 58.066 54.200 48.890 49.116 32.453 34.000 34.000
## DF: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 21 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 35
## Sequence of moves:
## MnfFlow BrandCodeC BowlSetpoint Usagecont PressureSetpoint BrandCodeD
## Var 14 3 30 23 31 4
## Step 1 2 3 4 5 6
## Temperature BrandCodeA CarbPressure1 BrandCodeNA FillOunces PSC
## Var 22 1 15 5 7 11
## Step 7 8 9 10 11 12
## HydPressure3 CarbFlow PSCCO2 OxygenFiller PSCFill AlchRel CarbTemp
## Var 18 24 13 29 12 33 10
## Step 13 14 15 16 17 18 19
## PCVolume MFR BrandCodeD FillPressure Density AirPressurer CarbRel
## Var 8 26 -4 16 25 32 34
## Step 20 21 22 23 24 25 26
## CarbVolume BallingLvl BrandCodeB PressureVacuum HydPressure2
## Var 6 35 2 28 17
## Step 27 28 29 30 31
## FillerLevel FillerSpeed Balling HydPressure4 CarbPressure BrandCodeD
## Var 20 21 27 19 9 -4
## Step 32 33 34 35 36 37
##
## Var 38
## Step 38
# PLS
plsFit1 <- train(x=xTrain,
y=yTrain,
method='pls',
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
plsFit2 <- train(x=xTrain,
y=yTrain,
method='pls',
tuneLength=20,
trControl=trControl,
preProc=c('bagImpute', 'center', 'scale'))
## option to load in plsfit2 as pls impute takes time
plsFit2 <- readRDS("plsFit2.rds")
plsFit3 <- train(x=xTrain,
y=yTrain,
method='pls',
tuneLength=20,
trControl=trControl,
preProc=c('medianImpute', 'center', 'scale'))
# Elastic Net
enetFit1 <- train(x=xTrain,
y=yTrain,
method='enet',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
enetFit2 <- train(x=xTrain,
y=yTrain,
method='enet',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trControl,
preProc=c('bagImpute', 'center', 'scale'))
## option to load in plsfit2 as enetfit2 impute takes time
enetFit2 <- readRDS("enetFit2.rds")
enetFit3 <- train(x=xTrain,
y=yTrain,
method='enet',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trControl,
preProc=c('medianImpute', 'center', 'scale'))
## test results of impute methods
resamples(list(PLS1=plsFit1, PLS2=plsFit2, PLS3=plsFit3,
enet1=enetFit1, enet2=enetFit2, enet3=enetFit3)) %>% summary()
## plot and knnimpute models
plsFit <- plsFit1
enetFit <- enetFit1
plot(plsFit)
plot(enetFit)
## final models
plsFit$finalModel
enetFit$finalModelPlot KNN Model
Plot SVM Model
The final non-linear models are:
## 9-nearest neighbor regression model
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0201032562509376
##
## Number of Support Vectors : 2175
##
## Objective Function Value : -1643.345
## Training error : 0.271298
##load in model
knnFit <- readRDS("knn.rds")
## or create model
knnFit <- train(x=xTrain,
y=yTrain,
method='knn',
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
## plot knn model
plot(knnFit)
## load in svm model
svmFit <- readRDS("Models//svm.rds")
# or create svm model
svmFit <- train(x=xTrain,
y=yTrain,
method="svmRadial",
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
## plot svm model
plot(svmFit)
## final models
knnFit$finalModel
svmFit$finalModelRandom Forest:
mtry parameter, which is the number of randomly selected predictors in each tree, is tuned to obtain the optimal model.rf implementation in R does not permit missing values, therefore knnImpute is used in the pre-processing step.XGBoost model:
For the XGBoost model, below is a list of the parameters being tuned:
nrounds : boosting iterations (trees)max_depth : max tree deptheta : learning rategamma : minimum loss reductioncolsample_bytree : subsample ratio of columnsmin_child_weight : minimum sum of instance weightsubsample : subsample ratio of rows
knnImpute) and not imputing the missing values.
knnImpute does not take that much time to perform.XG Boost imputed(XGB2) versus non imputed(XBG1):
##
## Call:
## summary.resamples(object = .)
##
## Models: XGB1, XGB2
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## XGB1 0.08727048 0.08757008 0.08768184 0.08783904 0.08788029 0.08879249
## XGB2 0.08697277 0.08718546 0.08743389 0.08767905 0.08782876 0.08897435
## NA's
## XGB1 0
## XGB2 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## XGB1 0.1154879 0.1161752 0.1165290 0.1171056 0.1181344 0.1192014 0
## XGB2 0.1156593 0.1157656 0.1164978 0.1170502 0.1180158 0.1193124 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## XGB1 0.5285509 0.5346482 0.5357232 0.5414750 0.5503459 0.5581067 0
## XGB2 0.5281014 0.5351413 0.5416740 0.5423755 0.5419176 0.5650431 0
The final tree-based models are:
## Random Forest
##
## 2567 samples
## 35 predictor
##
## Pre-processing: nearest neighbor imputation (35), centered (35),
## scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 514, 514, 513, 513, 513
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1277131 0.4988421 0.09870782
## 5 0.1221249 0.5299636 0.09319066
## 9 0.1197881 0.5399403 0.09071346
## 13 0.1187889 0.5436157 0.08960861
## 16 0.1185293 0.5426875 0.08926884
## 20 0.1177503 0.5458798 0.08871182
## 24 0.1179296 0.5439091 0.08855143
## 27 0.1179859 0.5399440 0.08836967
## 31 0.1181714 0.5364060 0.08827614
## 35 0.1185114 0.5311831 0.08879924
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
## ##### xgb.Booster
## Handle is invalid! Suggest using xgb.Booster.complete
## raw: 6.6 Mb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "reg:linear")
## params (as set within xgb.train):
## eta = "0.01", max_depth = "10", gamma = "0", colsample_bytree = "0.6", min_child_weight = "5", subsample = "1", objective = "reg:linear", silent = "1"
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 35
## niter: 1500
## nfeatures : 35
## xNames : BrandCodeA BrandCodeB BrandCodeC BrandCodeD BrandCodeNA CarbVolume FillOunces PCVolume CarbPressure CarbTemp PSC PSCFill PSCCO2 MnfFlow CarbPressure1 FillPressure HydPressure2 HydPressure3 HydPressure4 FillerLevel FillerSpeed Temperature Usagecont CarbFlow Density MFR Balling PressureVacuum OxygenFiller BowlSetpoint PressureSetpoint AirPressurer AlchRel CarbRel BallingLvl
## problemType : Regression
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 1068 1500 10 0.01 0 0.6 5
## subsample
## 1068 1
## obsLevels : NA
## param :
## list()
## Load rf model
rf <- readRDS("Models//rf.rds")
# Or create Random Forest
rf <- train(x=xTrain,
y=yTrain,
method='rf',
tuneLength=10,
trControl=trControl,
preProc=c('knnImpute'),
importance=T)
# plot RF model
plot(rf)
##load models
xgb1 <- readRDS("Models//xgb1.rds")
xgb2 <- readRDS("Models//xgb2.rds")
# XGBoost
xgbGrid <- expand.grid(.nrounds=c(100, 500, 1000, 1500), # boosting iterations (trees)
.max_depth=c(4, 6, 8, 10), # max tree depth
.eta=c(0.001, 0.01, 0.1, 0.5), # learning rate
.gamma=c(0),# minimum loss reduction
.colsample_bytree=c(0.4, 0.6, 0.8, 1), # subsample ratio of columns
.min_child_weight=c(1, 5, 15), # minimum sum of instance weight
.subsample=c(0.5, 0.75, 1)) # subsample ratio of rows
xgb1 <- train(x = xTrain,
y = yTrain,
method = 'xgbTree',
tuneGrid = xgbGrid,
trControl = trControl)
xgb2 <- train(x = xTrain,
y = yTrain,
method = 'xgbTree',
tuneGrid = xgbGrid,
trControl = trControl,
preProce = c('knnImpute'))
# End multi-core processing
stopCluster(cl)
registerDoSEQ()
# resample
resamples(list(XGB1=xgb1, XGB2=xgb2)) %>% summary()
## final models
rf
xgb <- xgb2
xgb$finalModel
plot(xgb$finalModel)Variable Importance:
Following models have their model-specific variable importance:
For other models, the default action in caret is to evaluate the variable importance based on loess smoother fit between the target and the predictors (see https://topepo.github.io/caret/variable-importance.html).
Blow, the ranking of variables are calculated and tabulate below. As can be seen, the variable importance calculated for elastic net, KNN, and SVM are the same, since they do not have model-specific method, and are all calculated based on loess R-squares.
Variable Importance df:
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
| Rank | pls | rf | xgbTree | enet | knn | svmRadial |
|---|---|---|---|---|---|---|
| 1 | MnfFlow | MnfFlow | MnfFlow | MnfFlow | MnfFlow | MnfFlow |
| 2 | BrandCodeC | BrandCodeC | Usagecont | MFR | MFR | MFR |
| 3 | BowlSetpoint | PressureVacuum | BrandCodeC | BowlSetpoint | BowlSetpoint | BowlSetpoint |
| 4 | Usagecont | OxygenFiller | OxygenFiller | FillerLevel | FillerLevel | FillerLevel |
| 5 | FillerLevel | BallingLvl | AlchRel | PressureSetpoint | PressureSetpoint | PressureSetpoint |
| 6 | HydPressure3 | AirPressurer | Temperature | Usagecont | Usagecont | Usagecont |
| 7 | PressureSetpoint | AlchRel | PressureVacuum | CarbFlow | CarbFlow | CarbFlow |
| 8 | FillPressure | CarbRel | FillerSpeed | BrandCodeC | BrandCodeC | BrandCodeC |
| 9 | BrandCodeB | Usagecont | CarbPressure1 | HydPressure3 | HydPressure3 | HydPressure3 |
| 10 | Temperature | Temperature | CarbRel | FillPressure | FillPressure | FillPressure |
| 11 | HydPressure2 | CarbFlow | AirPressurer | PressureVacuum | PressureVacuum | PressureVacuum |
| 12 | PressureVacuum | FillerSpeed | BowlSetpoint | HydPressure2 | HydPressure2 | HydPressure2 |
| 13 | CarbPressure1 | HydPressure3 | CarbFlow | CarbRel | CarbRel | CarbRel |
| 14 | OxygenFiller | CarbPressure1 | Balling | HydPressure4 | HydPressure4 | HydPressure4 |
| 15 | CarbFlow | Density | BallingLvl | Temperature | Temperature | Temperature |
| 16 | BrandCodeD | BowlSetpoint | FillerLevel | BrandCodeD | BrandCodeD | BrandCodeD |
| 17 | AlchRel | HydPressure2 | Density | OxygenFiller | OxygenFiller | OxygenFiller |
| 18 | CarbRel | Balling | PCVolume | FillerSpeed | FillerSpeed | FillerSpeed |
| 19 | BrandCodeA | PCVolume | MFR | AlchRel | AlchRel | AlchRel |
| 20 | BallingLvl | FillPressure | HydPressure2 | CarbPressure1 | CarbPressure1 | CarbPressure1 |
| 21 | Density | FillerLevel | CarbVolume | PSCCO2 | PSCCO2 | PSCCO2 |
| 22 | HydPressure4 | CarbVolume | HydPressure3 | PSC | PSC | PSC |
| 23 | PSC | MFR | FillOunces | FillOunces | FillOunces | FillOunces |
| 24 | FillOunces | HydPressure4 | FillPressure | PCVolume | PCVolume | PCVolume |
| 25 | CarbVolume | BrandCodeA | HydPressure4 | BrandCodeB | BrandCodeB | BrandCodeB |
| 26 | Balling | BrandCodeNA | PSC | BallingLvl | BallingLvl | BallingLvl |
| 27 | BrandCodeNA | BrandCodeB | CarbPressure | CarbTemp | CarbTemp | CarbTemp |
| 28 | PSCCO2 | BrandCodeD | CarbTemp | BrandCodeA | BrandCodeA | BrandCodeA |
| 29 | CarbPressure | PressureSetpoint | PSCFill | CarbPressure | CarbPressure | CarbPressure |
| 30 | PSCFill | CarbPressure | BrandCodeB | PSCFill | PSCFill | PSCFill |
| 31 | MFR | FillOunces | PressureSetpoint | CarbVolume | CarbVolume | CarbVolume |
| 32 | PCVolume | PSCFill | PSCCO2 | Density | Density | Density |
| 33 | FillerSpeed | CarbTemp | BrandCodeA | BrandCodeNA | BrandCodeNA | BrandCodeNA |
| 34 | CarbTemp | PSCCO2 | BrandCodeD | Balling | Balling | Balling |
| 35 | AirPressurer | PSC | BrandCodeNA | AirPressurer | AirPressurer | AirPressurer |
Variable Importance plots:
Model Performance:
The models’ performance are listed below:
##
## Call:
## summary.resamples(object = .)
##
## Models: PLS, ENet, KNN, SVM, RF, XGB
## Number of resamples: 4
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## PLS 0.10401017 0.10573777 0.10715156 0.10662121 0.10803500 0.10817153
## ENet 0.10325386 0.10502571 0.10631413 0.10584920 0.10713763 0.10751469
## KNN 0.10357093 0.10386052 0.10401850 0.10396019 0.10411817 0.10423283
## SVM 0.09480799 0.09557060 0.09584557 0.09591785 0.09619282 0.09717227
## RF 0.08744219 0.08787527 0.08848266 0.08871182 0.08931921 0.09043978
## XGB 0.08697277 0.08713229 0.08730968 0.08735522 0.08753261 0.08782876
## NA's
## PLS 0
## ENet 0
## KNN 0
## SVM 0
## RF 0
## XGB 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.1356567 0.1371824 0.1377113 0.1374039 0.1379328 0.1385362 0
## ENet 0.1339212 0.1356141 0.1362803 0.1357745 0.1364407 0.1366161 0
## KNN 0.1344314 0.1345692 0.1349825 0.1357495 0.1361629 0.1386017 0
## SVM 0.1259120 0.1265990 0.1273402 0.1278324 0.1285736 0.1307371 0
## RF 0.1165398 0.1172607 0.1177933 0.1177503 0.1182830 0.1188748 0
## XGB 0.1156593 0.1157390 0.1161317 0.1164846 0.1168773 0.1180158 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.3625243 0.3656883 0.3683503 0.3689665 0.3716286 0.3766411 0
## ENet 0.3700989 0.3761005 0.3796023 0.3791235 0.3826253 0.3871907 0
## KNN 0.3790417 0.3836021 0.3902105 0.3906420 0.3972504 0.4031051 0
## SVM 0.4293764 0.4446522 0.4546945 0.4522908 0.4623331 0.4703980 0
## RF 0.5379883 0.5399803 0.5430877 0.5458798 0.5489872 0.5593555 0
## XGB 0.5351413 0.5400408 0.5417958 0.5459440 0.5476990 0.5650431 0
Boxplots for CV performance:
## custom function to rank variable importance
getRank <- function(trainObjects){
temp <- c()
methods <- c()
for(object in trainObjects){
methods <- c(methods, object$method)
varimp <- varImp(object)[[1]]
varimp$variables <- row.names(varimp)
rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
temp <- cbind(temp, rank)
}
temp <- as.data.frame(temp)
names(temp) <- methods
temp$Rank <- c(1:dim(temp)[1])
temp <- select(temp, Rank, everything())
return(temp)
}
## print out ofvarIMp from custom function
kable(getRank(list(plsFit, rf, xgb, enetFit, knnFit, svmFit)),digits = 2,'markdown', booktabs =T)
## plot varImp for all models
plot(varImp(plsFit), main='Variable Importance based on PLS')
plot(varImp(rf), main='Variable Importance based on Random Forest')
plot(varImp(xgb), main='Variable Importance based on XGBoost')
plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')
## summary of model scores
resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb)) %>% summary()
## boxplots for cv scores
par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb$resample$RMSE, main='CV RMSE for XGB')
## Make predictions
(pred <- predict(xgb, newdata=xEval))
eval$PH <- pred
write.csv(eval, "StudentEvaluation_PH_PREDICTED.csv", row.names=FALSE)Below, we make the prediction using the XGB model, and save the result:
(pred <- predict(xgb, newdata=xEval))## [1] 8.489114 8.386810 8.499494 8.676891 8.422321 8.508721 8.479796
## [8] 8.509658 8.524522 8.733335 8.540319 8.524879 8.476252 8.604997
## [15] 8.263277 8.584721 8.598623 8.507232 8.503490 8.716899 8.680784
## [22] 8.667629 8.563601 8.511458 8.669805 8.493483 8.405451 8.685325
## [29] 8.666525 8.699488 8.642038 8.702867 8.616266 8.672192 8.685741
## [36] 8.555187 8.504755 8.530453 8.691133 8.752850 8.788443 8.525010
## [43] 8.294233 8.275476 8.710000 8.754517 8.757333 8.724513 8.702127
## [50] 8.766453 8.628973 8.692094 8.721963 8.720330 8.863279 8.751065
## [57] 8.442047 8.509911 8.702737 8.782986 8.788027 8.719201 8.736992
## [64] 8.802861 8.810700 8.704443 8.547973 8.540565 8.610917 8.506020
## [71] 8.456654 8.384642 8.520053 8.508084 8.515755 8.496242 8.584035
## [78] 8.662597 8.725308 8.639506 8.773076 8.661763 8.719378 8.628849
## [85] 8.708687 8.742179 8.739834 8.647678 8.537263 8.726831 8.616635
## [92] 8.618535 8.537017 8.510901 8.567130 8.626575 8.634910 8.625755
## [99] 8.660060 8.676085 8.668866 8.717737 8.719065 8.701555 8.777940
## [106] 8.807128 8.490880 8.430260 8.464031 8.511809 8.695340 8.700084
## [113] 8.648747 8.743848 8.640017 8.716689 8.749076 8.740728 8.737554
## [120] 8.699940 8.704624 8.733253 8.755580 8.676638 8.505178 8.452633
## [127] 8.498458 8.298923 8.468449 8.465969 8.553341 8.530916 8.548730
## [134] 8.479448 8.460509 8.530807 8.554157 8.455324 8.437445 8.425599
## [141] 8.464765 8.524144 8.487891 8.522289 8.422502 8.503467 8.469502
## [148] 8.573105 8.473152 8.603023 8.623924 8.594813 8.442616 8.638641
## [155] 8.460707 8.483208 8.416842 8.495853 8.579787 8.516457 8.601458
## [162] 8.627097 8.637206 8.643469 8.654936 8.605709 8.470303 8.739980
## [169] 8.669416 8.767879 8.478296 8.540945 8.331718 8.439440 8.477420
## [176] 8.577559 8.421168 8.489767 8.400572 8.470026 8.429167 8.548959
## [183] 8.507633 8.489511 8.522857 8.308163 8.326262 8.485365 8.404167
## [190] 8.372846 8.354460 8.522525 8.549019 8.459602 8.474385 8.332462
## [197] 8.310847 8.224078 8.243311 8.423572 8.403149 8.416455 8.460292
## [204] 8.375723 8.351536 8.495371 8.497482 8.375500 8.437349 8.289114
## [211] 8.331231 8.385859 8.427534 8.519608 8.426883 8.424350 8.373230
## [218] 8.345346 8.504831 8.502677 8.483864 8.471354 8.445770 8.391487
## [225] 8.611402 8.492319 8.472500 8.467644 8.468987 8.470590 8.470867
## [232] 8.513150 8.493874 8.576015 8.582872 8.501921 8.623503 8.667575
## [239] 8.486345 8.516557 8.488672 8.562017 8.518543 8.518978 8.411890
## [246] 8.417725 8.422590 8.532801 8.582995 8.478495 8.437906 8.460243
## [253] 8.514612 8.561201 8.558359 8.552038 8.488605 8.635681 8.513274
## [260] 8.525599 8.616925 8.639329 8.508380 8.615664 8.255293 8.350628
## [267] 8.123829
eval$PH <- pred
write.csv(eval, "StudentEvaluation_PH_PREDICTED.csv", row.names=FALSE)