This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to them.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should clearly show the models you tested and how you selected your final approach.
library(Amelia)
library(AppliedPredictiveModeling)
library(car)
library(caret)
#library(corrgram)
library(corrplot)
library(data.table)
#library(dlookr)
library(dplyr)
library(DT)
library(e1071)
library(forecast)
library(fpp2)
#library(ggcorrplot)
library(ggplot2)
library(glmnet)
#library(glmulti)
library(gridExtra)
#library(Hmisc)
library(kableExtra)
library(knitr)
library(lubridate)
library(MASS)
library(mice)
#library(plotly)
library(pROC)
#library(pscl)
library(psych)
library(RANN)
#library(RColorBrewer)
library(readxl)
library(reshape2)
library(stringr)
library(tidyverse)
library(tseries)
library(urca)
#library(xlsx)
setwd("/Users/elinaazrilyan/Downloads/")
Ins_train_data <- read.csv("./StudentData.csv", stringsAsFactors = FALSE)
Ins_eval_data <- read.csv("./StudentEvaluation.csv", stringsAsFactors = FALSE)
We'll begin with a cursory exploration of the data, then prepare and explore it iteratively wherever required.
dim1 <- dim(Ins_train_data)
print(paste0('Dimension of training set: ', 'Number of rows: ', dim1[1], ', ', 'Number of cols: ', dim1[2]))
## [1] "Dimension of training set: Number of rows: 2571, Number of cols: 33"
print('Head of training data set:')
## [1] "Head of training data set:"
head(Ins_train_data)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1 0.104 0.26 0.04 -100 118.8 46.0
## 2 0.124 0.22 0.04 -100 121.6 46.0
## 3 0.090 0.34 0.16 -100 120.2 46.0
## 4 NA 0.42 0.04 -100 115.2 46.4
## 5 0.026 0.16 0.12 -100 118.4 45.8
## 6 0.090 0.24 0.04 -100 119.6 45.6
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 1 0 NA NA 118 121.2
## 2 0 NA NA 106 118.6
## 3 0 NA NA 82 120.0
## 4 0 0 0 92 117.8
## 5 0 0 0 92 118.6
## 6 0 0 0 116 120.2
## Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling
## 1 4002 66.0 16.18 2932 0.88 725.0 1.398
## 2 3986 67.6 19.90 3144 0.92 726.8 1.498
## 3 4020 67.0 17.76 2914 1.58 735.0 3.142
## 4 4012 65.6 17.42 3062 1.54 730.6 3.042
## 5 4010 65.6 17.68 3054 1.54 722.8 3.042
## 6 4014 66.2 23.82 2948 1.52 738.8 2.992
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1 -4.0 8.36 0.022 120 46.4
## 2 -4.0 8.26 0.026 120 46.8
## 3 -3.8 8.94 0.024 120 46.6
## 4 -4.4 8.24 0.030 120 46.0
## 5 -4.4 8.26 0.030 120 46.0
## 6 -4.4 8.32 0.024 120 46.0
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## 1 142.6 6.58 5.32 1.48
## 2 143.0 6.56 5.30 1.56
## 3 142.0 7.66 5.84 3.28
## 4 146.2 7.14 5.42 3.04
## 5 146.2 7.14 5.44 3.04
## 6 146.6 7.16 5.44 3.02
print('Structure of training data set:')
## [1] "Structure of training data set:"
str(Ins_train_data)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : chr "B" "A" "B" "A" ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
A few fields have missing values, which we'll investigate in greater detail later.
At this stage, we'll explore and prepare the data iteratively. First we'll check for missing values (NA) and impute them if required; then we'll show boxplots of the numeric fields.
Checking for NA.
any(is.na(Ins_train_data))
## [1] TRUE
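Before imputing, it helps to see which columns carry the missing values. A minimal sketch (an addition, not part of the original run) that tallies NAs per column:
na_counts <- colSums(is.na(Ins_train_data))  # missing values per column
sort(na_counts[na_counts > 0], decreasing = TRUE)  # show only the columns with missing data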
Missing values do exist, so we'll impute them with mice(), using predictive mean matching (pmm).
Ins_train_imputed <- mice(Ins_train_data, m = 1, method = "pmm", print = F) %>% mice::complete()
Rechecking for NA after imputation.
any(is.na(Ins_train_imputed))
## [1] FALSE
We observe that the NAs were removed. Next, we'll visualize this with missmap().
Ins_train_imputed %>% missmap(main = "Missing / Observed")
Both is.na() and missmap() confirm that NA were eliminated.
Here, we'll explore the data a little further. First, we'll take a quick look at the min, 1st quartile, median, mean, 3rd quartile, and max.
summary(Ins_train_imputed) # %>% kable
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.371 Mean :23.98 Mean :0.27739
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07800 Median :0.1800
## Mean :68.21 Mean :141.1 Mean :0.08487 Mean :0.1956
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:118.8 1st Qu.:46.00
## Median :0.04000 Median : 64.80 Median :123.2 Median :46.40
## Mean :0.05631 Mean : 24.47 Mean :122.5 Mean :47.93
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.47 Mean :20.96 Mean :20.43 Mean : 96.47
## 3rd Qu.:20.40 3rd Qu.:34.60 3rd Qu.:33.20 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08
## 1st Qu.: 97.4 1st Qu.:3815 1st Qu.:65.20 1st Qu.:18.36
## Median :118.4 Median :3980 Median :65.60 Median :21.80
## Mean :109.2 Mean :3637 Mean :65.97 Mean :20.99
## 3rd Qu.:120.0 3rd Qu.:3996 3rd Qu.:66.40 3rd Qu.:23.75
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90
## Carb.Flow Density MFR Balling
## Min. : 26 Min. :0.240 Min. : 31.4 Min. :-0.170
## 1st Qu.:1142 1st Qu.:0.900 1st Qu.:695.0 1st Qu.: 1.496
## Median :3028 Median :0.980 Median :721.4 Median : 1.648
## Mean :2468 Mean :1.174 Mean :670.9 Mean : 2.197
## 3rd Qu.:3187 3rd Qu.:1.620 3rd Qu.:730.4 3rd Qu.: 3.292
## Max. :5104 Max. :1.920 Max. :868.6 Max. : 4.012
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## Min. :-6.600 Min. :7.880 Min. :0.00240 Min. : 70.0
## 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0
## Median :-5.400 Median :8.540 Median :0.03340 Median :120.0
## Mean :-5.216 Mean :8.546 Mean :0.04711 Mean :109.3
## 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0
## Max. :-3.600 Max. :9.360 Max. :0.40000 Max. :140.0
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## Min. :44.00 Min. :140.8 Min. :5.280 Min. :4.960
## 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340
## Median :46.00 Median :142.6 Median :6.560 Median :5.400
## Mean :47.61 Mean :142.8 Mean :6.896 Mean :5.436
## 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.220 3rd Qu.:5.540
## Max. :52.00 Max. :148.2 Max. :8.620 Max. :6.060
## Balling.Lvl
## Min. :0.00
## 1st Qu.:1.38
## Median :1.48
## Mean :2.05
## 3rd Qu.:3.14
## Max. :3.66
First look at the boxplots.
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_train_imputed[,i])) {
boxplot(Ins_train_imputed[,i], main = names(Ins_train_imputed[i]), col = 4, horizontal = TRUE)
}
}
The boxplots show that some of the variables contain outliers, so we'll cap them: values falling more than 1.5 × IQR below the 1st quartile or above the 3rd quartile are replaced with the 5th and 95th percentiles, respectively.
Ins_train_cap <- Ins_train_imputed
for (i in 2:33) {
qntl <- quantile(Ins_train_cap[,i], probs = c(0.25, 0.75), na.rm = T)
cap_amt <- quantile(Ins_train_cap[,i], probs = c(0.05, 0.95), na.rm = T)
High <- 1.5 * IQR(Ins_train_cap[,i], na.rm = T)
Ins_train_cap[,i][Ins_train_cap[,i] < (qntl[1] - High)] <- cap_amt[1]
Ins_train_cap[,i][Ins_train_cap[,i] > (qntl[2] + High)] <- cap_amt[2]
}
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_train_cap[,i])) {
boxplot(Ins_train_cap[,i], main = names(Ins_train_cap[i]), col = 4, horizontal = TRUE)
}
}
The outliers were capped, and we now see that several fields (PSC.Fill, PSC.CO2, Mnf.Flow, Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Flow, Density, Balling, Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Balling.Lvl) still show high variance.
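As a quick numeric check on that spread, here is a small sketch (an addition, not in the original analysis) printing the standard deviation of each numeric column:
num_cols <- sapply(Ins_train_cap, is.numeric)  # identify the numeric columns
sort(sapply(Ins_train_cap[, num_cols], sd), decreasing = TRUE)  # spread of each field, largest first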
Histograms tell us how the data is distributed in the dataset (numeric fields).
for(i in seq(from = 2, to = length(Ins_train_cap), by = 9)) {
if(i <= 27) {
multi.hist(Ins_train_cap[i:(i + 8)])
} else {
multi.hist(Ins_train_cap[i:(i + 4)])
}
}
Observing the above histograms, I set the critical absolute skewness requiring a Box-Cox transformation at 1 or higher. Based on this critical value, I create a vector transform_cols that will contain the names of the skewed columns.
The columns whose absolute skewness meets or exceeds the critical value of 1 are printed below.
transform_cols <- c()
for(i in seq(from = 2, to = length(Ins_train_cap), by = 1)) {
if(abs(skewness(Ins_train_cap[, i])) >= 1) {
transform_cols <- append(transform_cols, names(Ins_train_cap[i]))
print(paste0(names(Ins_train_cap[i]), ": ", skewness(Ins_train_cap[, i])))
}
}
## [1] "Filler.Speed: -2.20990958899228"
## [1] "MFR: -2.18082694677126"
## [1] "Oxygen.Filler: 1.20318652532815"
## [1] "Air.Pressurer: 2.07962689734424"
Many of these histograms are skewed, so, following the recommendations of "Applied Predictive Modeling" (page 105, 2nd para), I'll apply a Box-Cox transformation to reduce the skewness.
lambda <- NULL
data_imputed_2 <- Ins_train_cap
for (i in 1:length(transform_cols)) {
lambda[transform_cols[i]] <- BoxCox.lambda(abs(Ins_train_cap[, transform_cols[i]]))
data_imputed_2[c(transform_cols[i])] <- BoxCox(Ins_train_cap[transform_cols[i]], lambda[transform_cols[i]])
}
We don't need to examine all the histograms again; it suffices to recheck the skewness.
We observe that the skewness of the transformed columns decreased, and Oxygen.Filler now falls below the cutoff of 1.
for(i in seq(from = 2, to = length(data_imputed_2), by = 1)) {
if(abs(skewness(data_imputed_2[, i])) >= 1) {
print(paste0(names(data_imputed_2[i]), ": ", skewness(data_imputed_2[, i])))
}
}
## [1] "Filler.Speed: -2.1540156029322"
## [1] "MFR: -2.10082207443164"
## [1] "Air.Pressurer: 2.04308315860091"
Now, we’ll explore the Categorical variables.
cat('Brand.Code:')
## Brand.Code:
table(data_imputed_2$Brand.Code)
##
## A B C D
## 120 293 1239 304 615
Observation: in the Brand.Code column, 120 rows are blank, so we'll recode them as "X".
Ins_train_cap_imputed <- data_imputed_2 %>% mutate(Brand.Code = ifelse((Brand.Code == ""), "X", Brand.Code))
cat("Brand.Code:")
## Brand.Code:
table(Ins_train_cap_imputed$Brand.Code)
##
## A B C D X
## 293 1239 304 615 120
At this point the data is prepared, so we'll explore the most correlated variables.
For the correlation analysis, we exclude the only non-numeric field, Brand.Code.
Now, we'll look at the correlation matrix of the remaining variables.
Ins_train_corr <- Ins_train_cap_imputed[-1]
cor_mx = cor(Ins_train_corr, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", type = "upper", order = "original", number.cex = .7,
addCoef.col = "black", # add the correlation coefficients in black
tl.srt = 90, # rotate the text labels 90 degrees
diag = TRUE, # keep the principal diagonal
tl.cex = 0.5)
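To highlight which predictors move most closely with the target, a short sketch (assuming cor_mx as computed above) that ranks the pairwise correlations with PH:
ph_cor <- cor_mx[, "PH"]  # correlations of every variable with PH
sort(abs(ph_cor[names(ph_cor) != "PH"]), decreasing = TRUE)  # strongest relationships first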
At this point, exploration, preparation, and pair-wise correlations of StudentData.csv are done, so I'll repeat the same exercise for StudentEvaluation.csv.
As before, we'll begin with a cursory exploration of the data, then prepare and explore it iteratively wherever required.
dim2 <- dim(Ins_eval_data)
print(paste0('Dimension of evaluation set: ', 'Number of rows: ', dim2[1], ', ', 'Number of cols: ', dim2[2]))
## [1] "Dimension of evaluation set: Number of rows: 267, Number of cols: 33"
print('Head of evaluation data set:')
## [1] "Head of evaluation data set:"
head(Ins_eval_data)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0
## 5 B 5.406667 24.20000 0.1600000 69.4 142.2
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1 0.236 0.40 0.04 -100 116.6 46.0
## 2 0.042 0.22 0.08 -100 118.8 46.2
## 3 0.068 0.10 0.02 -100 120.2 45.8
## 4 0.004 0.20 0.02 -100 124.8 40.0
## 5 0.040 0.30 0.06 -100 115.0 51.4
## 6 0.078 0.22 NA -100 118.6 46.4
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 1 0 NA NA 96 129.4
## 2 0 0 0 112 120.0
## 3 0 0 0 98 119.4
## 4 0 0 0 132 120.2
## 5 0 0 0 94 116.0
## 6 0 0 0 94 120.4
## Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling
## 1 3986 66.0 21.66 2950 0.88 727.6 1.398
## 2 4012 65.6 17.60 2916 1.50 735.8 2.942
## 3 4010 65.6 24.18 3056 0.90 734.8 1.448
## 4 NA 74.4 18.12 28 0.74 NA 1.056
## 5 4018 66.4 21.32 3214 0.88 752.0 1.398
## 6 4010 66.6 18.00 3064 0.84 732.0 1.298
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1 -3.8 NA 0.022 130 45.2
## 2 -4.4 NA 0.030 120 46.0
## 3 -4.2 NA 0.046 120 46.0
## 4 -4.0 NA NA 120 46.0
## 5 -4.0 NA 0.082 120 50.0
## 6 -3.8 NA 0.064 120 46.0
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## 1 142.6 6.56 5.34 1.48
## 2 147.2 7.14 5.58 3.04
## 3 146.6 6.52 5.34 1.46
## 4 146.4 6.48 5.50 1.48
## 5 145.8 6.50 5.38 1.46
## 6 146.0 6.50 5.42 1.44
print('Structure of evaluation data set:')
## [1] "Structure of evaluation data set:"
str(Ins_eval_data)
## 'data.frame': 267 obs. of 33 variables:
## $ Brand.Code : chr "D" "A" "B" "B" ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.27 0.227 0.303 0.186 0.16 ...
## $ Carb.Pressure : num 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.4 0.22 0.1 0.2 0.3 0.22 0.14 0.1 0.48 0.1 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 NA 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 117 119 120 125 115 ...
## $ Fill.Pressure : num 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 96 112 98 132 94 94 108 108 110 106 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 NA 4018 4010 4010 NA 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : int 2950 2916 3056 28 3214 3064 3042 1972 2502 28 ...
## $ Density : num 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
## $ MFR : num 728 736 735 NA 752 ...
## $ Balling : num 1.4 2.94 1.45 1.06 1.4 ...
## $ Pressure.Vacuum : num -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
## $ PH : logi NA NA NA NA NA NA ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 45.2 46 46 46 50 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...
A few fields have missing values, which we'll investigate in greater detail later.
At this stage, we'll explore and prepare the data iteratively. First we'll check for missing values (NA) and impute them if required; then we'll show boxplots of the numeric fields.
Checking for NA.
any(is.na(Ins_eval_data))
## [1] TRUE
Missing values do exist, so we'll impute them with mice(), using predictive mean matching (pmm).
Ins_eval_imputed <- mice(Ins_eval_data, m = 1, method = "pmm", print = F) %>% mice::complete()
Rechecking for NA after imputation.
any(is.na(subset(Ins_eval_imputed, select = -c(PH))))
## [1] FALSE
We observe that the NAs were removed in all columns except PH, the target we need to predict, which is what we want. Next, we'll visualize this with missmap().
Ins_eval_imputed %>% missmap(main = "Missing / Observed")
Both is.na() and missmap() confirm that the NAs in the predictor columns were eliminated.
Now, we'll explore the data a little further. First, we'll take a quick look at the min, 1st quartile, median, mean, 3rd quartile, and max.
summary(Ins_eval_imputed) # %>% kable
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.283 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27600
## Mean :5.368 Mean :23.97 Mean :0.27819
## 3rd Qu.:5.463 3rd Qu.:24.01 3rd Qu.:0.32233
## Max. :5.667 Max. :24.20 Max. :0.46400
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04500 1st Qu.:0.1100
## Median :68.00 Median :140.8 Median :0.07800 Median :0.1800
## Mean :68.25 Mean :141.3 Mean :0.08604 Mean :0.1912
## 3rd Qu.:70.60 3rd Qu.:143.9 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.1 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05116 Mean : 21.03 Mean :123.0 Mean :48.13
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.4 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.60 Median : 98.00
## Mean : 12.01 Mean : 20.04 Mean : 19.54 Mean : 98.26
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90
## 1st Qu.:100.6 1st Qu.:3802 1st Qu.:65.40 1st Qu.:18.10
## Median :118.6 Median :3942 Median :65.80 Median :21.40
## Mean :110.3 Mean :3520 Mean :66.23 Mean :20.88
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60
## Carb.Flow Density MFR Balling
## Min. : 0 Min. :0.060 Min. : 15.6 Min. :0.902
## 1st Qu.:1083 1st Qu.:0.910 1st Qu.:688.6 1st Qu.:1.497
## Median :3038 Median :0.980 Median :720.4 Median :1.648
## Mean :2409 Mean :1.172 Mean :654.3 Mean :2.199
## 3rd Qu.:3215 3rd Qu.:1.600 3rd Qu.:730.7 3rd Qu.:3.242
## Max. :3858 Max. :1.840 Max. :784.8 Max. :3.788
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## Min. :-6.400 Mode:logical Min. :0.00240 Min. : 70.0
## 1st Qu.:-5.600 NA's:267 1st Qu.:0.02070 1st Qu.:100.0
## Median :-5.200 Median :0.03380 Median :120.0
## Mean :-5.176 Mean :0.04791 Mean :109.6
## 3rd Qu.:-4.800 3rd Qu.:0.05630 3rd Qu.:120.0
## Max. :-3.600 Max. :0.39800 Max. :130.0
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## Min. :44.00 Min. :141.2 Min. :6.400 Min. :5.180
## 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340
## Median :46.00 Median :142.6 Median :6.580 Median :5.400
## Mean :47.73 Mean :142.8 Mean :6.903 Mean :5.439
## 3rd Qu.:50.00 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.560
## Max. :52.00 Max. :147.2 Max. :7.820 Max. :5.740
## Balling.Lvl
## Min. :0.000
## 1st Qu.:1.380
## Median :1.480
## Mean :2.051
## 3rd Qu.:3.080
## Max. :3.420
Currently, PH is entirely NA in the evaluation set. We'll insert zeros into the PH column for convenience during the remaining preparation steps.
Ins_eval_imputed$PH[is.na(Ins_eval_imputed$PH)] <- 0
Let's take a first look at the boxplots.
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_eval_imputed[,i])) {
boxplot(Ins_eval_imputed[,i], main = names(Ins_eval_imputed[i]), col = 4, horizontal = TRUE)
}
}
The boxplots show that some of the variables contain outliers, so we'll cap them using the same rule as before, skipping PH since it is the target.
Ins_eval_cap <- Ins_eval_imputed
for (i in 2:33) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
qntl <- quantile(Ins_eval_cap[,i], probs = c(0.25, 0.75), na.rm = T)
cap_amt <- quantile(Ins_eval_cap[,i], probs = c(0.05, 0.95), na.rm = T)
High <- 1.5 * IQR(Ins_eval_cap[,i], na.rm = T)
Ins_eval_cap[,i][Ins_eval_cap[,i] < (qntl[1] - High)] <- cap_amt[1]
Ins_eval_cap[,i][Ins_eval_cap[,i] > (qntl[2] + High)] <- cap_amt[2]
}
par(mfrow = c(3, 3))
for(i in 2:33) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if (is.numeric(Ins_eval_cap[,i])) {
boxplot(Ins_eval_cap[,i], main = names(Ins_eval_cap[i]), col = 4, horizontal = TRUE)
}
}
The outliers were capped, and we now see that several fields (Carb.Volume, PSC.Fill, PSC.CO2, Mnf.Flow, Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Flow, Density, Balling, Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, Balling.Lvl) still show high variance.
Histograms tell us how the numeric fields are distributed.
for(i in seq(from = 2, to = length(Ins_eval_cap), by = 9)) {
if(i <= 27) {
multi.hist(Ins_eval_cap[i:(i + 8)])
} else {
multi.hist(Ins_eval_cap[i:(i + 4)])
}
}
We can ignore PH, the target column, where zeros were inserted.
Observing the above histograms, I set the critical absolute skewness requiring a Box-Cox transformation at 1 or higher. Based on this critical value, I create a vector transform_cols2 that will contain the names of the skewed columns.
The columns whose absolute skewness meets or exceeds the critical value of 1 are printed below.
transform_cols2 <- c()
for(i in seq(from = 2, to = length(Ins_eval_cap), by = 1)) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if(abs(skewness(Ins_eval_cap[, i])) >= 1) {
transform_cols2 <- append(transform_cols2, names(Ins_eval_cap[i]))
print(paste0(names(Ins_eval_cap[i]), ": ", skewness(Ins_eval_cap[, i])))
}
}
## [1] "Filler.Speed: -1.73500259165131"
## [1] "MFR: -1.88345379589438"
## [1] "Oxygen.Filler: 1.49546550563674"
## [1] "Bowl.Setpoint: -1.1242132435293"
## [1] "Air.Pressurer: 2.13733821348594"
Many of these histograms are skewed, so, following the recommendations of "Applied Predictive Modeling" (page 105, 2nd para), I'll apply a Box-Cox transformation to reduce the skewness.
lambda <- NULL
data_imputed_3 <- Ins_eval_cap
for (i in 1:length(transform_cols2)) {
lambda[transform_cols2[i]] <- BoxCox.lambda(abs(Ins_eval_cap[, transform_cols2[i]]))
data_imputed_3[c(transform_cols2[i])] <- BoxCox(Ins_eval_cap[transform_cols2[i]], lambda[transform_cols2[i]])
}
We don't need to examine all the histograms again; it suffices to recheck the skewness.
We observe that the skewness of the transformed columns decreased, and Oxygen.Filler and Bowl.Setpoint now fall below the cutoff of 1.
for(i in seq(from = 2, to = length(data_imputed_3), by = 1)) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if(abs(skewness(data_imputed_3[, i])) >= 1) {
print(paste0(names(data_imputed_3[i]), ": ", skewness(data_imputed_3[, i])))
}
}
## [1] "Filler.Speed: -1.700653029204"
## [1] "MFR: -1.80872582316888"
## [1] "Air.Pressurer: 2.11075447000935"
Now, we’ll explore the Categorical variables.
cat('Brand.Code:')
## Brand.Code:
table(data_imputed_3$Brand.Code)
##
## A B C D
## 8 35 129 31 64
Observation: in the Brand.Code column, 8 rows are blank, so we'll recode them as "X".
Ins_eval_cap_imputed <- data_imputed_3 %>% mutate(Brand.Code = ifelse((Brand.Code == ""), "X", Brand.Code))
cat("Brand.Code:")
## Brand.Code:
table(Ins_eval_cap_imputed$Brand.Code)
##
## A B C D X
## 35 129 31 64 8
At this point the data is prepared, so we'll explore the most correlated variables.
For the correlation analysis, we exclude the non-numeric field Brand.Code and the target PH (which currently holds placeholder zeros).
Now, we'll look at the correlation matrix of the remaining variables.
Ins_cap_corr <- subset(Ins_eval_cap_imputed, select = -c(Brand.Code, PH))
cor_mx = cor(Ins_cap_corr, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", type = "upper", order = "original", number.cex = .7,
addCoef.col = "black", # add the correlation coefficients in black
tl.srt = 90, # rotate the text labels 90 degrees
diag = TRUE, # keep the principal diagonal
tl.cex = 0.5)
At this point, exploration, preparation, and pair-wise correlations of StudentEvaluation.csv are done, so I'll begin the model-building process.
We will use an 80/20 split to create train and test sets from Ins_train_cap_imputed. Since our dataset is not very large, we want to keep as much training data available for modeling as possible.
set.seed(300)
trainingRows <- createDataPartition(Ins_train_cap_imputed$PH, p = 0.8, list = FALSE)
Ins_train <- Ins_train_cap_imputed[trainingRows, ]
Ins_test <- Ins_train_cap_imputed[-trainingRows, ]
Ins_train_Y <- subset( Ins_train, select = PH )
Ins_train_X <- subset( Ins_train, select = -PH )
Ins_test_Y <- subset( Ins_test, select = PH )
Ins_test_X <- subset( Ins_test, select = -PH )
First, we are going to use linear models to describe the relationship between our predictors and PH, assuming that the relationship shows a constant rate of change. We do not have very high hopes for these models, since linear models carry many limitations: real-world relationships are rarely purely linear.
We will start with a Generalized Linear Model (GLM). The GLM generalizes linear regression by relating the linear predictor to the response variable through a link function and by allowing the variance of each measurement to be a function of its predicted value.
set.seed(300)
lmFit1 = train(PH ~ ., data = Ins_train,
metric = 'RMSE',
method = 'glm',
preProcess = c('center', 'scale'),
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE)
)
lmFit1_pred <- predict(lmFit1, Ins_test_X)
lmFit1
## Generalized Linear Model
##
## 2058 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1647, 1648, 1645, 1645, 1647
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1264873 0.4197807 0.1000042
The GLM cross-validated R-squared is not very high, about 0.42, meaning that the model explains roughly 42% of the variability in the data. The cross-validated RMSE for the GLM is 0.126.
Next, we will try a Partial Least Squares (PLS) model. PLS finds a linear regression model by projecting the predicted and observed variables into a new space. If the correlation among predictors is high, partial least squares may be a better option. PLS can also be preferable when the number of predictors exceeds the number of observations.
set.seed(300)
lmFit2 = train(PH ~ ., data = Ins_train,
metric = 'RMSE',
method = 'pls',
preProcess = c('center', 'scale'),
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE)
)
lmFit2_pred <- predict(lmFit2, Ins_test_X)
lmFit2
## Partial Least Squares
##
## 2058 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1647, 1648, 1645, 1645, 1647
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1436120 0.2514351 0.1151878
## 2 0.1357894 0.3318913 0.1078069
## 3 0.1321815 0.3671982 0.1057006
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The PLS cross-validated R-squared is not very high, about 0.37, meaning that the model explains roughly 37% of the variability in the data. The cross-validated RMSE for PLS is 0.132.
Next, we will try some penalized models, we will start with a Ridge model. Ridge regression adds a penalty on the sum of the squared regression parameters.
set.seed(300)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(x = Ins_train_X[,-1], y = Ins_train_Y$PH,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1854, 1852, 1852, 1852, 1854, 1851, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 0.1309255 0.3777028 0.1030942
## 0.007142857 0.1308197 0.3782345 0.1031315
## 0.014285714 0.1308549 0.3777019 0.1032667
## 0.021428571 0.1309238 0.3769163 0.1034247
## 0.028571429 0.1310087 0.3760099 0.1035786
## 0.035714286 0.1311027 0.3750363 0.1037308
## 0.042857143 0.1312017 0.3740260 0.1038798
## 0.050000000 0.1313035 0.3729986 0.1040185
## 0.057142857 0.1314064 0.3719669 0.1041472
## 0.064285714 0.1315096 0.3709395 0.1042720
## 0.071428571 0.1316123 0.3699222 0.1043947
## 0.078571429 0.1317141 0.3689187 0.1045152
## 0.085714286 0.1318148 0.3679314 0.1046326
## 0.092857143 0.1319143 0.3669617 0.1047466
## 0.100000000 0.1320124 0.3660104 0.1048575
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.007142857.
ridge_pred <- predict(ridgeRegFit, Ins_test_X)
The Ridge cross-validated R-squared is not very high, about 0.38, meaning that the model explains roughly 38% of the variability in the data. The cross-validated RMSE for Ridge is 0.131.
Next, we will try an elastic net (ENET) model, which combines both ridge and lasso penalties.
df1_enet <- train(x = as.matrix(Ins_train_X[,-1]),
y = Ins_train_Y$PH,
method='enet',
metric='RMSE',
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))
df1_enet
## Elasticnet
##
## 2058 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1646, 1646, 1647, 1647, 1646
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 0.1532356 0.2081683 0.1240674
## 0e+00 0.525 0.1319731 0.3686170 0.1044801
## 0e+00 1.000 0.1314267 0.3737880 0.1033977
## 1e-04 0.050 0.1532602 0.2081065 0.1240885
## 1e-04 0.525 0.1319860 0.3685108 0.1044957
## 1e-04 1.000 0.1314220 0.3738221 0.1033957
## 1e-01 0.050 0.1574898 0.2078721 0.1280181
## 1e-01 0.525 0.1356316 0.3331510 0.1083463
## 1e-01 1.000 0.1324763 0.3628033 0.1052158
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 1e-04.
enet_pred <- predict(df1_enet, Ins_test_X)
The ENET cross-validated R-squared is not very high, about 0.37, meaning that the model explains roughly 37% of the variability in the data, with a cross-validated RMSE of 0.131; on the held-out test set, R-squared drops to about 0.31 with an RMSE of 0.147.
As expected, none of the linear models performs particularly well based on their R-squared and RMSE values, but let's compare them on the held-out test set and see which performs best.
z<- rbind(
postResample(pred = lmFit1_pred, obs = Ins_test_Y$PH),
postResample(pred = lmFit2_pred, obs = Ins_test_Y$PH),
postResample(pred = ridge_pred, obs = Ins_test_Y$PH),
postResample(pred = enet_pred, obs = Ins_test_Y$PH)
)
data.frame(z,row.names = c('GLM', 'PLS', 'RIDGE', 'ENET'))
## RMSE Rsquared MAE
## GLM 0.1373923 0.3940642 0.1059544
## PLS 0.1408794 0.3623035 0.1105987
## RIDGE 0.1466487 0.3118465 0.1100849
## ENET 0.1466470 0.3125253 0.1101068
The best linear model, based on the highest R-squared and lowest RMSE value, is the GLM.
Next we will try several non-linear models: multivariate adaptive regression splines (MARS), support vector machines (SVM), and K-nearest neighbors (KNN). We expect these models to perform better than the linear models. We will look at tree models separately.
We will continue modeling by tuning and evaluating a MARS model. MARS uses surrogate features instead of the original predictors.
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(x = Ins_train_X,
y = Ins_train_Y$PH,
method = "earth",
preProc=c("center", "scale"),
tuneGrid= marsGrid,
trControl = trainControl(method = "cv"))
## Loading required package: earth
## Warning: package 'earth' was built under R version 3.5.2
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
## Loading required package: TeachingDemos
##
## Attaching package: 'plotmo'
## The following object is masked from 'package:urca':
##
## plotres
Evaluating MARS model’s performance:
marsPred = predict(marsTune, newdata = Ins_test_X)
postResample(pred = marsPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1285241 0.4694460 0.0957915
The MARS R-squared value on the test set is 0.47, meaning that the model explains about 47% of the variability in the data. RMSE for MARS is 0.129.
The next model we will tune and evaluate is an SVM model; we will pre-process the data by centering and scaling and use a tune length of 10. The benefits of SVM regression are that, since squared residuals are not used, large outliers have only a limited effect on the regression equation, and samples that the model fits well have no effect on the regression equation.
set.seed(200)
svmTuned = train(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
method="svmRadial",
preProc=c("center", "scale"),
tuneLength=10,
trControl = trainControl(method = "repeatedcv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 1853, 1851, 1852, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1228866 0.4600632 0.09332724
## 0.50 0.1197312 0.4842512 0.09003390
## 1.00 0.1172018 0.5041908 0.08752911
## 2.00 0.1152842 0.5193898 0.08591904
## 4.00 0.1138643 0.5321515 0.08486795
## 8.00 0.1139884 0.5351238 0.08508197
## 16.00 0.1160372 0.5256920 0.08645511
## 32.00 0.1203155 0.5045933 0.08969016
## 64.00 0.1263315 0.4749866 0.09432672
## 128.00 0.1340231 0.4422123 0.10035297
##
## Tuning parameter 'sigma' was held constant at a value of 0.0224276
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0224276 and C = 4.
SVMPred = predict(svmTuned, newdata = Ins_test_X[,-1])
postResample(pred = SVMPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.13548672 0.42300016 0.09431234
The SVM R-squared value on the test set is 0.42, meaning that the model explains about 42% of the variability in the data. RMSE for SVM is 0.135.
The next non-linear model we will tune and evaluate is a KNN model. The KNN approach predicts a new sample using the K closest samples from the training set.
set.seed(333)
knnModel <- train(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2058, 2058, 2058, 2058, 2058, 2058, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1349896 0.3775088 0.09978210
## 7 0.1310545 0.3975710 0.09804134
## 9 0.1296132 0.4053502 0.09766549
## 11 0.1290303 0.4089141 0.09778222
## 13 0.1287028 0.4115663 0.09791686
## 15 0.1288645 0.4103348 0.09827205
## 17 0.1288008 0.4113970 0.09859539
## 19 0.1289462 0.4106035 0.09908758
## 21 0.1292937 0.4077857 0.09957645
## 23 0.1294880 0.4064450 0.09990046
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
Evaluating the model’s performance:
knnPred <- predict(knnModel, newdata = Ins_test_X[,-1])
postResample(pred = knnPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1378597 0.4062405 0.1037003
The KNN R-squared value on the test set is 0.41, meaning that the model explains about 41% of the variability in the data. RMSE for KNN is 0.138.
It looks like non-linear models are performing better than the linear models based on their R-squared values, but let’s compare those and see which model performs best.
z<- rbind(
postResample(pred = marsPred, obs = Ins_test_Y$PH),
postResample(pred = SVMPred, obs = Ins_test_Y$PH),
postResample(pred = knnPred, obs = Ins_test_Y$PH)
)
data.frame(z,row.names = c('MARS', 'SVM', 'KNN'))
## RMSE Rsquared MAE
## MARS 0.1285241 0.4694460 0.09579150
## SVM 0.1354867 0.4230002 0.09431234
## KNN 0.1378597 0.4062405 0.10370028
The best non-linear model, based on the highest R-squared and lowest RMSE value, is MARS.
We will now try some tree-based models. Decision tree analysis builds a tree-shaped set of splitting rules that breaks a complex problem into smaller, more homogeneous pieces; each branch of the decision tree leads to a possible outcome.
First, we will try a Random Forest model. These models achieve variance reduction by aggregating strong, complex learners that exhibit low bias. Because each learner is built independently of all previous learners, random forests are robust to a noisy response.
suppressWarnings(library(randomForest))
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(333)
RF_model <- randomForest(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
importance = TRUE,
ntree = 700
)
RFPred <- predict(RF_model, newdata = Ins_test_X[,-1])
postResample(pred = RFPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.11449670 0.60234244 0.08037439
The Random Forest R-squared value on the test set is 0.60, meaning that the model explains about 60% of the variability in the data. RMSE for Random Forest is 0.114.
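Since the forest was grown with importance = TRUE, its internal variable-importance measures could also be inspected; a brief sketch (not run here):
importance(RF_model)  # permutation (%IncMSE) and node-purity importance for each predictor
varImpPlot(RF_model, n.var = 10, main = "Random Forest variable importance")  # plot the top 10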
Next, we will try a Boosted Tree Model. The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function.
suppressWarnings(library(gbm))
## Loaded gbm 2.1.8
set.seed(333)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 5, by = 2),
.n.trees = seq(300, 1000, by = 100),
.shrinkage = c(0.05, 0.1),
.n.minobsinnode = 5)
gbmTune <- suppressWarnings(train(Ins_train_X[,-1], Ins_train_Y$PH,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
)
GBM_Pred <- predict(gbmTune, newdata = Ins_test_X[,-1])
postResample(pred = GBM_Pred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.11744268 0.55769696 0.08526757
The Boosted Tree R-squared value on the test set is 0.56, meaning that the model explains about 56% of the variability in the data. RMSE for the Boosted Tree is 0.117.
Next, we will try a Single Tree Model. Basic regression trees partition the data into smaller groups that are more homogenous with respect to the response.
set.seed(333)
rpartTune <- train(Ins_train_X, Ins_train_Y$PH,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
ST_Pred <- predict(rpartTune, newdata = Ins_test_X)
postResample(pred = ST_Pred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1352398 0.4127301 0.1047643
The basic regression tree R-squared value on the test set is 0.41, meaning that the model explains about 41% of the variability in the data. RMSE for the basic regression tree is 0.135.
Next, we will try a Cubist Model. Cubist is a rule–based model. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree.
suppressWarnings(library(Cubist))
set.seed(333)
cubistMod <- cubist(Ins_train_X,
Ins_train_Y$PH,
committees = 6
)
cubistModPred <- predict(cubistMod, newdata = Ins_test_X)
postResample(pred = cubistModPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.11062085 0.61097487 0.08041779
The Cubist R-squared value on the test set is 0.61, meaning that the model explains about 61% of the variability in the data. RMSE for Cubist is 0.111.
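The Cubist fit above uses a fixed committees value of 6. A possible refinement, not run here, would be to tune committees and neighbors through caret; a hedged sketch with illustrative grid values:
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20), neighbors = c(0, 5, 9))  # illustrative values, not tuned here
cubistTune <- train(x = Ins_train_X, y = Ins_train_Y$PH,
method = "cubist",  # caret's interface to Cubist
tuneGrid = cubistGrid,
trControl = trainControl(method = "cv", number = 5))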
Finally, we will try a Bagged Trees model. Bagging effectively reduces the variance of a prediction through its aggregation process.
set.seed(333)
suppressWarnings(library(ipred))
baggedTree <- ipredbagg(Ins_train_Y$PH, Ins_train_X)
baggedTreePred <- predict(baggedTree, newdata = Ins_test_X)
postResample(pred = baggedTreePred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1246965 0.5096903 0.0963989
The Bagged Tree R-squared value on the test set is 0.51, meaning that the model explains about 51% of the variability in the data. RMSE for the Bagged Tree is 0.125.
It looks like Tree Models are performing better than non-linear models and linear models based on their R-squared values, but let’s compare those and see which model performs best.
z<- rbind(
postResample(pred = RFPred, obs = Ins_test_Y$PH),
postResample(pred = GBM_Pred, obs = Ins_test_Y$PH),
postResample(pred = ST_Pred, obs = Ins_test_Y$PH),
postResample(pred = cubistModPred, obs = Ins_test_Y$PH),
postResample(pred = baggedTreePred, obs = Ins_test_Y$PH)
)
data.frame(z,row.names = c('Random Forrest', 'Boosted Trees', 'Single Tree', 'Cubist', 'Bagged Tree'))
## RMSE Rsquared MAE
## Random Forrest 0.1144967 0.6023424 0.08037439
## Boosted Trees 0.1174427 0.5576970 0.08526757
## Single Tree 0.1352398 0.4127301 0.10476434
## Cubist 0.1106208 0.6109749 0.08041779
## Bagged Tree 0.1246965 0.5096903 0.09639890
Based on the combination of R-squared and RMSE values for all the models we tried, the best model is Cubist, and that is what we will use for our predictions. The Random Forest model also performs very well compared to the other models we tuned and evaluated. Overall, tree models outperform the linear and other non-linear models on both RMSE and R-squared.
Here is the list of the most relevant variables in the Cubist model:
varImp(cubistMod)
## Overall
## Mnf.Flow 78.0
## Brand.Code 20.5
## Alch.Rel 57.0
## Oxygen.Filler 50.5
## Bowl.Setpoint 34.5
## Pressure.Vacuum 40.0
## Filler.Speed 33.0
## Usage.cont 27.5
## Balling 43.5
## Carb.Rel 34.0
## Balling.Lvl 36.0
## Carb.Flow 29.0
## Hyd.Pressure1 22.5
## Carb.Pressure1 30.5
## Temperature 32.0
## Density 37.5
## Hyd.Pressure3 27.5
## Hyd.Pressure2 26.0
## Air.Pressurer 2.0
## Carb.Temp 18.5
## PC.Volume 14.0
## Carb.Pressure 16.5
## MFR 16.5
## Filler.Level 16.0
## Fill.Pressure 8.5
## Carb.Volume 12.0
## Hyd.Pressure4 11.5
## Pressure.Setpoint 7.5
## PSC 6.5
## PSC.CO2 4.5
## PSC.Fill 3.5
## Fill.Ounces 3.0
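For the non-technical report, these importances may be easier to read as a chart. A small sketch (an addition) plotting the ten most important predictors from the data frame returned by varImp():
imp <- varImp(cubistMod)  # data frame with an "Overall" importance column
imp <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]  # sort descending
dotchart(rev(imp$Overall[1:10]), labels = rev(rownames(imp)[1:10]),
xlab = "Cubist importance", main = "Top 10 predictors of PH")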
Now that we have identified the best model, we can use the evaluation data to make PH predictions and output them in an Excel-readable format. We add the predicted PH values to our evaluation data set.
final_predictions <- predict(cubistMod, newdata=Ins_eval_cap_imputed)
Ins_eval_cap_imputed$PH <- final_predictions
final_predictions_df <- data.frame(Ins_eval_cap_imputed)
head(final_predictions_df)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0
## 5 B 5.406667 24.09800 0.1600000 69.4 142.2
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1 0.1934 0.40 0.04 -100 116.6 46.0
## 2 0.0420 0.22 0.08 -100 118.8 46.2
## 3 0.0680 0.10 0.02 -100 120.2 45.8
## 4 0.0040 0.20 0.02 -100 124.8 40.0
## 5 0.0400 0.30 0.06 -100 115.0 51.4
## 6 0.0780 0.22 0.06 -100 118.6 46.4
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 1 0 0 0 96 129.4
## 2 0 0 0 112 120.0
## 3 0 0 0 98 119.4
## 4 0 0 0 126 120.2
## 5 0 0 0 94 116.0
## 6 0 0 0 94 120.4
## Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling
## 1 7939410.7 66.0 21.66 2950 0.88 264578.30 1.398
## 2 8043319.4 65.6 17.60 2916 1.50 270575.24 2.942
## 3 8035302.4 65.6 24.18 3056 0.90 269840.31 1.448
## 4 509801.6 69.0 18.12 28 0.74 26994.26 1.056
## 5 8067394.3 66.4 21.32 3214 0.88 282620.39 1.398
## 6 8035302.4 66.6 18.00 3064 0.84 267787.82 1.298
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1 -3.8 8.772433 -3.823330 8446.705 45.2
## 2 -4.4 8.481736 -3.512143 7197.162 46.0
## 3 -4.2 8.958870 -3.083419 7197.162 46.0
## 4 -4.0 8.572984 -1.888125 7197.162 46.0
## 5 -4.0 8.686432 -2.503876 7197.162 50.0
## 6 -3.8 8.500370 -2.752303 7197.162 46.0
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## 1 0.9930600 6.56 5.34 1.48
## 2 0.9932421 7.14 5.58 3.04
## 3 0.9932421 6.52 5.34 1.46
## 4 0.9932421 6.48 5.50 1.48
## 5 0.9932421 6.50 5.38 1.46
## 6 0.9932421 6.50 5.42 1.44
write_csv(final_predictions_df, "PH_Result.csv")
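write_csv() produces a file that Excel opens directly. If a native .xlsx workbook is preferred instead, one option (assuming the writexl package is installed; it is not loaded above) would be:
# writexl is an assumption here -- install.packages("writexl") if needed
writexl::write_xlsx(final_predictions_df, "PH_Result.xlsx")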