This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report as a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach.
library(Amelia)
library(AppliedPredictiveModeling)
library(car)
library(caret)
#library(corrgram)
library(corrplot)
library(data.table)
#library(dlookr)
library(dplyr)
library(DT)
library(e1071)
library(forecast)
library(fpp2)
#library(ggcorrplot)
library(ggplot2)
library(glmnet)
#library(glmulti)
library(gridExtra)
#library(Hmisc)
library(kableExtra)
library(knitr)
library(lubridate)
library(MASS)
library(mice)
#library(plotly)
library(pROC)
#library(pscl)
library(psych)
library(RANN)
#library(RColorBrewer)
library(readxl)
library(reshape2)
library(stringr)
library(tidyverse)
library(tseries)
library(urca)
#library(xlsx)
Initially, we'll do a cursory exploration of the data. After that, we'll iteratively prepare and explore the data wherever required.
dim1 <- dim(Ins_train_data)
print(paste0('Dimension of training set: ', 'Number of rows: ', dim1[1], ', ', 'Number of cols: ', dim1[2]))
## [1] "Dimension of training set: Number of rows: 2571, Number of cols: 33"
## [1] "Head of training data set:"
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2 0.104
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6 0.124
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8 0.090
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6 NA
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8 0.026
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4 0.090
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.26 0.04 -100 118.8 46.0 0
## 2 0.22 0.04 -100 121.6 46.0 0
## 3 0.34 0.16 -100 120.2 46.0 0
## 4 0.42 0.04 -100 115.2 46.4 0
## 5 0.16 0.12 -100 118.4 45.8 0
## 6 0.24 0.04 -100 119.6 45.6 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 118 121.2 4002
## 2 NA NA 106 118.6 3986
## 3 NA NA 82 120.0 4020
## 4 0 0 92 117.8 4012
## 5 0 0 92 118.6 4010
## 6 0 0 116 120.2 4014
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36
## 2 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26
## 3 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94
## 4 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24
## 5 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26
## 6 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 120 46.4 142.6 6.58 5.32
## 2 0.026 120 46.8 143.0 6.56 5.30
## 3 0.024 120 46.6 142.0 7.66 5.84
## 4 0.030 120 46.0 146.2 7.14 5.42
## 5 0.030 120 46.0 146.2 7.14 5.44
## 6 0.024 120 46.0 146.6 7.16 5.44
## Balling.Lvl
## 1 1.48
## 2 1.56
## 3 3.28
## 4 3.04
## 5 3.04
## 6 3.02
## [1] "Structure of training data set:"
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : chr "B" "A" "B" "A" ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
A few fields have missing values, which we'll investigate in greater detail later.
At this stage, we'll explore and prepare the data iteratively: first we'll check for NA and, if required, impute them; then we'll show boxplots of the numeric fields.
Checking for NA.
## [1] TRUE
NA values do exist, so we'll impute them with mice().
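The imputation chunk isn't echoed in the report, so here is a minimal sketch of what it presumably looks like; the mice() settings (m, maxit) and the seed are assumptions, and only the before/after anyNA() results above and below come from the report.
anyNA(Ins_train_data)                        # TRUE: missing values are present
set.seed(100)                                # seed value is an assumption
mice_train <- mice(Ins_train_data, m = 5, maxit = 5, printFlag = FALSE)
Ins_train_imputed <- complete(mice_train)    # a single completed data set
anyNA(Ins_train_imputed)                     # FALSE after imputation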
Rechecking for NA after imputation.
## [1] FALSE
We observe that NA were removed. In the following, we’ll visualize with missmap().
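The missingness-map chunk isn't echoed either; a one-line sketch (the plot title is an assumption):
missmap(Ins_train_imputed, main = "Missingness map after imputation")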
Both is.na() and missmap() confirm that NA were eliminated.
Here, we'll explore the data a little further. First, we'll take a quick look at the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each field.
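The chunk behind the summary below isn't echoed; it is presumably just a call to summary():
summary(Ins_train_imputed)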
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## Mode :character Median :5.347 Median :23.97 Median :0.27200
## Mean :5.370 Mean :23.97 Mean :0.27759
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04900 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07800 Median :0.1800
## Mean :68.21 Mean :141.1 Mean :0.08467 Mean :0.1958
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.0000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.0200 1st Qu.:-100.00 1st Qu.:118.8 1st Qu.:46.00
## Median :0.0400 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.0564 Mean : 24.55 Mean :122.6 Mean :47.93
## 3rd Qu.:0.0800 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.2400 Max. : 229.40 Max. :140.2 Max. :60.40
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.40 Median : 96.00
## Mean :12.46 Mean :20.94 Mean :20.43 Mean : 96.53
## 3rd Qu.:20.40 3rd Qu.:34.60 3rd Qu.:33.30 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 97.4 1st Qu.:3819 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1151
## Median :118.4 Median :3980 Median :65.60 Median :21.78 Median :3028
## Mean :109.2 Mean :3639 Mean :65.97 Mean :20.99 Mean :2469
## 3rd Qu.:120.0 3rd Qu.:3996 3rd Qu.:66.40 3rd Qu.:23.76 3rd Qu.:3187
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:695.0 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :721.4 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :672.2 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:730.4 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04712 Mean :109.3 Mean :47.61
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.898 Mean :5.436 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
First look at the boxplots.
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_train_imputed[,i])) {
boxplot(Ins_train_imputed[,i], main = names(Ins_train_imputed[i]), col = 4, horizontal = TRUE)
}
}
The boxplots show that some of the variables have outliers in them. So, we'll cap them.
Ins_train_cap <- Ins_train_imputed
for (i in 2:33) {
qntl <- quantile(Ins_train_cap[,i], probs = c(0.25, 0.75), na.rm = T)
cap_amt <- quantile(Ins_train_cap[,i], probs = c(0.05, 0.95), na.rm = T)
High <- 1.5 * IQR(Ins_train_cap[,i], na.rm = T)
Ins_train_cap[,i][Ins_train_cap[,i] < (qntl[1] - High)] <- cap_amt[1]
Ins_train_cap[,i][Ins_train_cap[,i] > (qntl[2] + High)] <- cap_amt[2]
}
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_train_cap[,i])) {
boxplot(Ins_train_cap[,i], main = names(Ins_train_cap[i]), col = 4, horizontal = TRUE)
}
}
The outliers were capped, and we now see that several fields (PSC.Fill, PSC.CO2, Mnf.Flow, Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Flow, Density, Balling, Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Balling.Lvl) still show high variance.
Histograms tell us how the data is distributed in the dataset (numeric fields).
for(i in seq(from = 2, to = length(Ins_train_cap), by = 9)) {
if(i <= 27) {
multi.hist(Ins_train_cap[i:(i + 8)])
} else {
multi.hist(Ins_train_cap[i:(i + 4)])
}
}
Observing the above histograms, I set the critical absolute skewness that warrants a Box-Cox transformation at 1, matching the threshold used in the code below. Based on this critical value, I create a vector transform_cols, which will contain the names of the skewed columns.
The columns whose absolute skewness meets or exceeds the critical value of 1 are printed below.
transform_cols <- c()
for(i in seq(from = 2, to = length(Ins_train_cap), by = 1)) {
if(abs(skewness(Ins_train_cap[, i])) >= 1) {
transform_cols <- append(transform_cols, names(Ins_train_cap[i]))
print(paste0(names(Ins_train_cap[i]), ": ", skewness(Ins_train_cap[, i])))
}
}
## [1] "Filler.Speed: -2.20526405478563"
## [1] "MFR: -2.1781688233427"
## [1] "Oxygen.Filler: 1.18620089215705"
## [1] "Air.Pressurer: 2.07962689734424"
Many of these histograms are skewed. So, following the recommendation of "Applied Predictive Modeling" (page 105, 2nd paragraph), I'll apply a Box-Cox transformation to reduce the skewness.
lambda <- NULL
data_imputed_2 <- Ins_train_cap
for (i in 1:length(transform_cols)) {
lambda[transform_cols[i]] <- BoxCox.lambda(abs(Ins_train_cap[, transform_cols[i]]))
data_imputed_2[c(transform_cols[i])] <- BoxCox(Ins_train_cap[transform_cols[i]], lambda[transform_cols[i]])
}
Now, we don't need to observe all the histograms again; it will suffice to check the skewness.
We observe that the skewness of the transformed columns decreased, and for some (e.g., Oxygen.Filler) it dropped below 1.
for(i in seq(from = 2, to = length(data_imputed_2), by = 1)) {
if(abs(skewness(data_imputed_2[, i])) >= 1) {
print(paste0(names(data_imputed_2[i]), ": ", skewness(data_imputed_2[, i])))
}
}
## [1] "Filler.Speed: -2.15015516889914"
## [1] "MFR: -2.0980773732998"
## [1] "Air.Pressurer: 2.04308315860091"
Now, we’ll explore the Categorical variables.
## Brand.Code:
##
## A B C D
## 120 293 1239 304 615
Observation: in the Brand.Code column, 120 rows are empty, so we'll recode them as "X".
Ins_train_cap_imputed <- data_imputed_2 %>% mutate(Brand.Code = ifelse((Brand.Code == ""), "X", Brand.Code))
cat("Brand.Code:")
## Brand.Code:
##
## A B C D X
## 293 1239 304 615 120
At this point the data is prepared. So, we’ll explore the top correlated variables.
For the correlation analysis, we'll drop the only non-numeric field, Brand.Code.
Now, we’ll look at the correlation matrix of the variables.
Ins_train_corr <- Ins_train_cap_imputed[-1]
cor_mx = cor(Ins_train_corr, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", type = "upper", order = "original",
         number.cex = .7, addCoef.col = "black", # add the correlation coefficients
         tl.srt = 90,  # rotate the text labels
         diag = TRUE,  # keep the principal diagonal visible
         tl.cex = 0.5)
At this point, exploration, preparation, and pairwise correlations of StudentData.csv are done. A quick ranking of the predictors most correlated with PH is sketched below, after which we'll repeat the same exercise for StudentEvaluation.csv.
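That ranking isn't shown as a chunk in the report; a minimal sketch using the cor_mx matrix above (no particular ordering is asserted here):
# Rank predictors by the absolute value of their Pearson correlation with PH.
ph_cor <- sort(abs(cor_mx[, "PH"]), decreasing = TRUE)
head(ph_cor[names(ph_cor) != "PH"], 10)  # the ten strongest associations with the target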
Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data, wherever required.
dim2 <- dim(Ins_eval_data)
print(paste0('Dimension of evaluation set: ', 'Number of rows: ', dim2[1], ', ', 'Number of cols: ', dim2[2]))
## [1] "Dimension of evaluation set: Number of rows: 267, Number of cols: 33"
## [1] "Head of evaluation data set:"
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6 0.236
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0 0.042
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4 0.068
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0 0.004
## 5 B 5.406667 24.20000 0.1600000 69.4 142.2 0.040
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2 0.078
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.40 0.04 -100 116.6 46.0 0
## 2 0.22 0.08 -100 118.8 46.2 0
## 3 0.10 0.02 -100 120.2 45.8 0
## 4 0.20 0.02 -100 124.8 40.0 0
## 5 0.30 0.06 -100 115.0 51.4 0
## 6 0.22 NA -100 118.6 46.4 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 96 129.4 3986
## 2 0 0 112 120.0 4012
## 3 0 0 98 119.4 4010
## 4 0 0 132 120.2 NA
## 5 0 0 94 116.0 4018
## 6 0 0 94 120.4 4010
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 21.66 2950 0.88 727.6 1.398 -3.8 NA
## 2 65.6 17.60 2916 1.50 735.8 2.942 -4.4 NA
## 3 65.6 24.18 3056 0.90 734.8 1.448 -4.2 NA
## 4 74.4 18.12 28 0.74 NA 1.056 -4.0 NA
## 5 66.4 21.32 3214 0.88 752.0 1.398 -4.0 NA
## 6 66.6 18.00 3064 0.84 732.0 1.298 -3.8 NA
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 130 45.2 142.6 6.56 5.34
## 2 0.030 120 46.0 147.2 7.14 5.58
## 3 0.046 120 46.0 146.6 6.52 5.34
## 4 NA 120 46.0 146.4 6.48 5.50
## 5 0.082 120 50.0 145.8 6.50 5.38
## 6 0.064 120 46.0 146.0 6.50 5.42
## Balling.Lvl
## 1 1.48
## 2 3.04
## 3 1.46
## 4 1.48
## 5 1.46
## 6 1.44
## [1] "Structure of evaluation data set:"
## 'data.frame': 267 obs. of 33 variables:
## $ Brand.Code : chr "D" "A" "B" "B" ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.27 0.227 0.303 0.186 0.16 ...
## $ Carb.Pressure : num 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.4 0.22 0.1 0.2 0.3 0.22 0.14 0.1 0.48 0.1 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 NA 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 117 119 120 125 115 ...
## $ Fill.Pressure : num 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 96 112 98 132 94 94 108 108 110 106 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 NA 4018 4010 4010 NA 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : int 2950 2916 3056 28 3214 3064 3042 1972 2502 28 ...
## $ Density : num 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
## $ MFR : num 728 736 735 NA 752 ...
## $ Balling : num 1.4 2.94 1.45 1.06 1.4 ...
## $ Pressure.Vacuum : num -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
## $ PH : logi NA NA NA NA NA NA ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 45.2 46 46 46 50 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...
A few fields have missing values, which we'll investigate in greater detail later.
At this stage, we'll explore and prepare the data iteratively: first we'll check for NA and, if required, impute them; then we'll show boxplots of the numeric fields.
Checking for NA.
## [1] TRUE
NA values do exist, so we'll impute them with mice().
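As with the training set, the imputation chunk isn't echoed; a minimal sketch with assumed settings. Note that PH, which is entirely missing in the evaluation set, cannot be imputed by mice() and is handled separately below.
set.seed(100)                               # seed value is an assumption
mice_eval <- mice(Ins_eval_data, m = 5, maxit = 5, printFlag = FALSE)
Ins_eval_imputed <- complete(mice_eval)     # PH remains NA here; it is zero-filled later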
Rechecking for NA after imputation.
## [1] FALSE
We observe that the NA values were removed in all columns except PH, the target, which is entirely missing in the evaluation set; that is what we want. In the following, we'll visualize with missmap().
Both is.na() and missmap() confirm that NA were eliminated.
Now, we'll explore the data a little further. First, we'll take a quick look at the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each field.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27600
## Mean :5.368 Mean :23.97 Mean :0.27903
## 3rd Qu.:5.463 3rd Qu.:24.01 3rd Qu.:0.32267
## Max. :5.667 Max. :24.20 Max. :0.46400
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.0040 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.0460 1st Qu.:0.1100
## Median :68.00 Median :140.8 Median :0.0760 Median :0.1800
## Mean :68.25 Mean :141.3 Mean :0.0861 Mean :0.1917
## 3rd Qu.:70.60 3rd Qu.:143.9 3rd Qu.:0.1120 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.2460 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.1 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05146 Mean : 21.03 Mean :123.0 Mean :48.15
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.4 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.60 Median : 98.00
## Mean : 12.01 Mean : 20.04 Mean : 19.54 Mean : 98.16
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3798 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3918 Median :65.80 Median :21.44 Median :3038
## Mean :110.2 Mean :3524 Mean :66.27 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## Density MFR Balling Pressure.Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.910 1st Qu.:688.6 1st Qu.:1.497 1st Qu.:-5.600
## Median :0.980 Median :721.8 Median :1.648 Median :-5.200
## Mean :1.175 Mean :643.7 Mean :2.199 Mean :-5.178
## 3rd Qu.:1.600 3rd Qu.:730.9 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.02070 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03380 Median :120.0 Median :46.00
## Mean :0.04704 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05630 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :141.2 Min. :6.400 Min. :5.180 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.400 Median :1.480
## Mean :142.8 Mean :6.902 Mean :5.441 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.560 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.740 Max. :3.420
Currently, PH is entirely NA in the evaluation set. We'll fill the PH column with zeros for convenience during preprocessing.
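A one-line sketch of that zero fill (the actual chunk isn't shown):
Ins_eval_imputed$PH <- 0   # placeholder only; real predictions are added at the end of the report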
Let's take a first look at the boxplots.
par(mfrow = c(3, 3))
for(i in 2:33) {
if (is.numeric(Ins_eval_imputed[,i])) {
boxplot(Ins_eval_imputed[,i], main = names(Ins_eval_imputed[i]), col = 4, horizontal = TRUE)
}
}
The boxplots show that some of the variables have outliers in them. So, we'll cap them.
Ins_eval_cap <- Ins_eval_imputed
for (i in 2:33) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
qntl <- quantile(Ins_eval_cap[,i], probs = c(0.25, 0.75), na.rm = T)
cap_amt <- quantile(Ins_eval_cap[,i], probs = c(0.05, 0.95), na.rm = T)
High <- 1.5 * IQR(Ins_eval_cap[,i], na.rm = T)
Ins_eval_cap[,i][Ins_eval_cap[,i] < (qntl[1] - High)] <- cap_amt[1]
Ins_eval_cap[,i][Ins_eval_cap[,i] > (qntl[2] + High)] <- cap_amt[2]
}
par(mfrow = c(3, 3))
for(i in 2:33) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if (is.numeric(Ins_eval_cap[,i])) {
boxplot(Ins_eval_cap[,i], main = names(Ins_eval_cap[i]), col = 4, horizontal = TRUE)
}
}
The outliers were capped, and we now see that several fields (Carb.Volume, PSC.Fill, PSC.CO2, Mnf.Flow, Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Flow, Density, Balling, Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, Balling.Lvl) still show high variance.
We'll also redo the boxplots with ggplot2 to check whether they show any differences.
Histograms tell us how the data is distributed in the dataset (numeric fields).
for(i in seq(from = 2, to = length(Ins_eval_cap), by = 9)) {
if(i <= 27) {
multi.hist(Ins_eval_cap[i:(i + 8)])
} else {
multi.hist(Ins_eval_cap[i:(i + 4)])
}
}
We can ignore PH, the target column, where zeros were filled in.
Observing the above histograms, I again set the critical absolute skewness that warrants a Box-Cox transformation at 1, matching the threshold used in the code below. Based on this critical value, I create a vector transform_cols2, which will contain the names of the skewed columns.
The columns whose absolute skewness meets or exceeds the critical value of 1 are printed below.
transform_cols2 <- c()
for(i in seq(from = 2, to = length(Ins_eval_cap), by = 1)) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if(abs(skewness(Ins_eval_cap[, i])) >= 1) {
transform_cols2 <- append(transform_cols2, names(Ins_eval_cap[i]))
print(paste0(names(Ins_eval_cap[i]), ": ", skewness(Ins_eval_cap[, i])))
}
}
## [1] "Filler.Speed: -1.80156905343144"
## [1] "MFR: -1.8308992422637"
## [1] "Oxygen.Filler: 1.39484494713326"
## [1] "Bowl.Setpoint: -1.10691561731224"
## [1] "Air.Pressurer: 2.13570472420319"
Many of these histograms are skewed. So, following the recommendation of "Applied Predictive Modeling" (page 105, 2nd paragraph), I'll apply a Box-Cox transformation to reduce the skewness.
lambda <- NULL
data_imputed_3 <- Ins_eval_cap
for (i in 1:length(transform_cols2)) {
lambda[transform_cols2[i]] <- BoxCox.lambda(abs(Ins_eval_cap[, transform_cols2[i]]))
data_imputed_3[c(transform_cols2[i])] <- BoxCox(Ins_eval_cap[transform_cols2[i]], lambda[transform_cols2[i]])
}
Now, we don't need to observe all the histograms again; it will suffice to check the skewness.
We observe that the skewness of the transformed columns decreased, and for some (e.g., Oxygen.Filler and Bowl.Setpoint) it dropped below 1.
for(i in seq(from = 2, to = length(data_imputed_3), by = 1)) {
if(i == 26) next # skipping the PH column, which is a 26th column position.
if(abs(skewness(data_imputed_3[, i])) >= 1) {
print(paste0(names(data_imputed_3[i]), ": ", skewness(data_imputed_3[, i])))
}
}
## [1] "Filler.Speed: -1.7594910735377"
## [1] "MFR: -1.76563075135146"
## [1] "Air.Pressurer: 2.10901213685818"
Now, we’ll explore the Categorical variables.
## Brand.Code:
##
## A B C D
## 8 35 129 31 64
Observation: in the Brand.Code column, 8 rows are empty, so we'll recode them as "X".
Ins_eval_cap_imputed <- data_imputed_3 %>% mutate(Brand.Code = ifelse((Brand.Code == ""), "X", Brand.Code))
cat("Brand.Code:")
## Brand.Code:
##
## A B C D X
## 35 129 31 64 8
At this point the data is prepared. So, we’ll explore the top correlated variables.
For the correlation analysis, we'll drop the non-numeric field Brand.Code as well as PH, which is currently just a zero-filled placeholder.
Now, we’ll look at the correlation matrix of the variables.
Ins_cap_corr <- subset(Ins_eval_cap_imputed, select = -c(Brand.Code, PH))
cor_mx = cor(Ins_cap_corr, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", type = "upper", order = "original",
         number.cex = .7, addCoef.col = "black", # add the correlation coefficients
         tl.srt = 90,  # rotate the text labels
         diag = TRUE,  # keep the principal diagonal visible
         tl.cex = 0.5)
At this point, exploration, preparation, and pairwise correlations of StudentEvaluation.csv are done, so we'll begin the model-building process.
We will use an 80/20 split to create train and test sets from our Ins_train_cap_imputed data. Since our dataset is not that large, we want as much training data available for modeling as possible.
set.seed(300)
trainingRows <- createDataPartition(Ins_train_cap_imputed$PH, p = 0.8, list = FALSE)
Ins_train <- Ins_train_cap_imputed[trainingRows, ]
Ins_test <- Ins_train_cap_imputed[-trainingRows, ]
Ins_train_Y <- subset( Ins_train, select = PH )
Ins_train_X <- subset( Ins_train, select = -PH )
Ins_test_Y <- subset( Ins_test, select = PH )
Ins_test_X <- subset( Ins_test, select = -PH )
First, we are going to use linear models to predict the relationship between our predictors and PH, assuming a constant rate of change. We do not have very high hopes for these models, since linear models have many limitations; in the real world, relationships are rarely strictly linear.
First, we will try a Generalized Linear Model (GLM). The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the variance of each measurement to be a function of its predicted value.
set.seed(300)
lmFit1 = train(PH ~ ., data = Ins_train,
metric = 'RMSE',
method = 'glm',
preProcess = c('center', 'scale'),
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE)
)
lmFit1_pred <- predict(lmFit1, Ins_test_X)
lmFit1
## Generalized Linear Model
##
## 2058 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1648, 1647, 1645, 1645, 1647
## Resampling results:
##
## RMSE Rsquared MAE
## 0.128585 0.4148387 0.1010517
The GLM R-squared value is not very high at about 0.41, meaning that the model explains roughly 41% of the variability in the data. The cross-validated RMSE for GLM is about 0.129.
Next, we will try a Partial Least Squares (PLS) model. PLS finds a linear regression model by projecting the predicted variable and the observed variables into a new space. If the correlation among predictors is high, partial least squares may be a better option. PLS can also be preferable when the number of predictors is greater than the number of observations.
set.seed(300)
lmFit2 = train(PH ~ ., data = Ins_train,
metric = 'RMSE',
method = 'pls',
preProcess = c('center', 'scale'),
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE)
)
lmFit2_pred <- predict(lmFit2, Ins_test_X)
lmFit2
## Partial Least Squares
##
## 2058 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1648, 1647, 1645, 1645, 1647
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1447657 0.2567686 0.1151792
## 2 0.1370621 0.3349457 0.1082550
## 3 0.1341415 0.3626661 0.1068270
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The PLS R-squared value is not very high at about 0.36 (with ncomp = 3), meaning that the model explains roughly 36% of the variability in the data. The cross-validated RMSE for PLS is about 0.134.
Next, we will try some penalized models, we will start with a Ridge model. Ridge regression adds a penalty on the sum of the squared regression parameters.
set.seed(300)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(x = Ins_train_X[,-1], y = Ins_train_Y$PH,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1853, 1852, 1853, 1853, 1852, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 0.1340832 0.3631380 0.1044715
## 0.007142857 0.1341089 0.3627383 0.1047082
## 0.014285714 0.1342104 0.3617683 0.1049311
## 0.021428571 0.1343182 0.3607605 0.1051267
## 0.028571429 0.1344260 0.3597555 0.1053008
## 0.035714286 0.1345324 0.3587607 0.1054492
## 0.042857143 0.1346371 0.3577796 0.1055832
## 0.050000000 0.1347398 0.3568147 0.1057041
## 0.057142857 0.1348404 0.3558677 0.1058185
## 0.064285714 0.1349389 0.3549393 0.1059298
## 0.071428571 0.1350354 0.3540301 0.1060315
## 0.078571429 0.1351299 0.3531400 0.1061267
## 0.085714286 0.1352225 0.3522689 0.1062168
## 0.092857143 0.1353134 0.3514163 0.1063026
## 0.100000000 0.1354026 0.3505817 0.1063866
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.
The Ridge R-squared value is not very high at about 0.36, meaning that the model explains roughly 36% of the variability in the data. The cross-validated RMSE for Ridge is about 0.134.
Next, we will try an elastic net (ENET) model, which combines both ridge and lasso penalties.
df1_enet <- train(x = as.matrix(Ins_train_X[,-1]),
y = Ins_train_Y$PH,
method='enet',
metric='RMSE',
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))
df1_enet
## Elasticnet
##
## 2058 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1645, 1646, 1646, 1648, 1647
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 0.1542340 0.2109936 0.1243444
## 0e+00 0.525 0.1345301 0.3577722 0.1053774
## 0e+00 1.000 0.1339221 0.3632686 0.1042766
## 1e-04 0.050 0.1542638 0.2106756 0.1243713
## 1e-04 0.525 0.1345353 0.3577272 0.1053862
## 1e-04 1.000 0.1339206 0.3632814 0.1042799
## 1e-01 0.050 0.1596233 0.2022396 0.1293414
## 1e-01 0.525 0.1381741 0.3225790 0.1092207
## 1e-01 1.000 0.1352739 0.3501794 0.1062312
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 1e-04.
The ENET R-squared value is not very high at about 0.36, meaning that the model explains roughly 36% of the variability in the data. The cross-validated RMSE for ENET is about 0.134.
As expected, none of the linear models performs particularly well based on their R-squared and RMSE values, but let's compare them on the held-out test set and see which performs best.
# Test-set predictions for the ridge and elastic net fits; these objects are
# needed below but were not created in the earlier chunks.
ridge_pred <- predict(ridgeRegFit, newdata = Ins_test_X[,-1])
enet_pred <- predict(df1_enet, newdata = as.matrix(Ins_test_X[,-1]))
z <- rbind(
postResample(pred = lmFit1_pred, obs = Ins_test_Y$PH),
postResample(pred = lmFit2_pred, obs = Ins_test_Y$PH),
postResample(pred = ridge_pred, obs = Ins_test_Y$PH),
postResample(pred = enet_pred, obs = Ins_test_Y$PH)
)
data.frame(z, row.names = c('GLM', 'PLS', 'RIDGE', 'ENET'))
## RMSE Rsquared MAE
## GLM 0.1276592 0.4297306 0.1011268
## PLS 0.1350449 0.3613773 0.1091041
## RIDGE 0.1330382 0.3805764 0.1036082
## ENET 0.1330414 0.3805402 0.1036156
The best linear model, based on the highest R-squared and lowest RMSE on the test set, is the GLM.
Next, we will try several non-linear models: multivariate adaptive regression splines (MARS), support vector machines (SVMs), and K-nearest neighbors (KNN). We expect these models to perform better than the linear models. We will look at tree-based models separately.
We will continue modeling by tuning and evaluating a MARS model. MARS uses surrogate features instead of the original predictors.
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(x = Ins_train_X,
y = Ins_train_Y$PH,
method = "earth",
preProc=c("center", "scale"),
tuneGrid= marsGrid,
trControl = trainControl(method = "cv"))
Evaluating MARS model’s performance:
marsPred = predict(marsTune, newdata = Ins_test_X)
postResample(pred = marsPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.11557203 0.53300604 0.08976371
The MARS R-squared value is about 0.53, meaning that the model explains roughly 53% of the variability in the data. The test-set RMSE for MARS is about 0.116.
The next model we will tune and evaluate is an SVM; I will pre-process by centering and scaling the data and use a tune length of 10. The benefits of SVM are that, since squared residuals are not used, large outliers have a limited effect on the regression equation, and samples that the model fits well have no effect on the regression equation.
set.seed(200)
svmTuned = train(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
method="svmRadial",
preProc=c("center", "scale"),
tuneLength=10,
trControl = trainControl(method = "repeatedcv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 1852, 1851, 1853, 1852, 1853, 1853, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1261244 0.4422105 0.09466075
## 0.50 0.1232595 0.4642479 0.09163589
## 1.00 0.1209853 0.4824951 0.08942469
## 2.00 0.1196470 0.4942457 0.08829158
## 4.00 0.1188717 0.5023884 0.08781827
## 8.00 0.1191904 0.5038101 0.08816675
## 16.00 0.1215214 0.4933114 0.09019069
## 32.00 0.1255750 0.4741770 0.09388943
## 64.00 0.1295106 0.4587186 0.09785208
## 128.00 0.1342494 0.4399676 0.10125583
##
## Tuning parameter 'sigma' was held constant at a value of 0.02192774
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02192774 and C = 4.
SVMPred = predict(svmTuned, newdata = Ins_test_X[,-1])
postResample(pred = SVMPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1188546 0.5121645 0.0860854
The SVM R-squared value is about 0.51, meaning that the model explains roughly 51% of the variability in the data. The test-set RMSE for SVM is about 0.119.
The next Non-Linear model we will tune and evaluate is KNN model. The KNN approach predicts a new sample using the K-closest samples from the training set.
set.seed(333)
knnModel <- train(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2058, 2058, 2058, 2058, 2058, 2058, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1359649 0.3751850 0.10032423
## 7 0.1326486 0.3911729 0.09889153
## 9 0.1305133 0.4043576 0.09795531
## 11 0.1296867 0.4096532 0.09798941
## 13 0.1293712 0.4116864 0.09823886
## 15 0.1295657 0.4098444 0.09867567
## 17 0.1298520 0.4074985 0.09924260
## 19 0.1300758 0.4057334 0.09972038
## 21 0.1304105 0.4028147 0.10017743
## 23 0.1307573 0.3999922 0.10069464
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
Evaluating the model’s performance:
knnPred <- predict(knnModel, newdata = Ins_test_X[,-1])
postResample(pred = knnPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.12655367 0.44756505 0.09550007
The KNN R-squared value is about 0.45, meaning that the model explains roughly 45% of the variability in the data. The test-set RMSE for KNN is about 0.127.
It looks like non-linear models are performing better than the linear models based on their R-squared values, but let’s compare those and see which model performs best.
z<- rbind(
postResample(pred = marsPred, obs = Ins_test_Y$PH),
postResample(pred = SVMPred, obs = Ins_test_Y$PH),
postResample(pred = knnPred, obs = Ins_test_Y$PH)
)
data.frame(z, row.names = c('MARS', 'SVM', 'KNN'))
## RMSE Rsquared MAE
## MARS 0.1155720 0.5330060 0.08976371
## SVM 0.1188546 0.5121645 0.08608540
## KNN 0.1265537 0.4475650 0.09550007
The best non-linear model, based on the highest R-squared and lowest RMSE on the test set, is MARS.
We will now try some tree-based models. Decision tree analysis builds a tree-shaped diagram to chart out a course of action or a statistical probability. It breaks a complex problem down into branches, where each branch of the decision tree represents a possible outcome.
First, we will try a Random Forest model, which achieves variance reduction by combining strong, complex learners that exhibit low bias. Because each learner is selected independently of all previous learners, random forests are robust to a noisy response.
set.seed(333)
RF_model <- randomForest(x = Ins_train_X[,-1],
y = Ins_train_Y$PH,
importance = TRUE,
ntree = 700
)
RFPred <- predict(RF_model, newdata = Ins_test_X[,-1])
postResample(pred = RFPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.09973481 0.67639296 0.07373373
The Random Forest R-squared value is about 0.68, meaning that the model explains roughly 68% of the variability in the data. The test-set RMSE for Random Forest is about 0.100.
Next, we will try a Boosted Tree Model. The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function.
set.seed(333)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 5, by = 2),
.n.trees = seq(300, 1000, by = 100),
.shrinkage = c(0.05, 0.1),
.n.minobsinnode = 5)
gbmTune <- suppressWarnings(train(Ins_train_X[,-1], Ins_train_Y$PH,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
)
GBM_Pred <- predict(gbmTune, newdata = Ins_test_X[,-1])
postResample(pred = GBM_Pred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.10374171 0.62573343 0.07771994
The Boosted Tree R-squared value is about 0.63, meaning that the model explains roughly 63% of the variability in the data. The test-set RMSE for the Boosted Tree is about 0.104.
Next, we will try a Single Tree Model. Basic regression trees partition the data into smaller groups that are more homogenous with respect to the response.
set.seed(333)
rpartTune <- train(Ins_train_X, Ins_train_Y$PH,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
ST_Pred <- predict(rpartTune, newdata = Ins_test_X)
postResample(pred = ST_Pred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.12481257 0.45518528 0.09783835
The single regression tree R-squared value is about 0.46, meaning that the model explains roughly 46% of the variability in the data. The test-set RMSE for the single tree is about 0.125.
Next, we will try a Cubist Model. Cubist is a rule–based model. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree.
suppressWarnings(library(Cubist))
set.seed(333)
cubistMod <- cubist(Ins_train_X,
Ins_train_Y$PH,
committees = 6
)
## Warning in cubist.default(Ins_train_X, Ins_train_Y$PH, committees = 6): NAs
## introduced by coercion
cubistModPred <- predict(cubistMod, newdata = Ins_test_X)
postResample(pred = cubistModPred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.1086993 0.6141540 0.0761772
The Cubist R-squared value is about 0.61, meaning that the model explains roughly 61% of the variability in the data. The test-set RMSE for Cubist is about 0.109.
Finally, we will try Bagged Trees Model. Bagging effectively reduces the variance of a prediction through its aggregation process.
set.seed(333)
suppressWarnings(library(ipred))
baggedTree <- ipredbagg(Ins_train_Y$PH, Ins_train_X)
baggedTreePred <- predict(baggedTree, newdata = Ins_test_X)
postResample(pred = baggedTreePred, obs = Ins_test_Y$PH)
## RMSE Rsquared MAE
## 0.11696275 0.52330618 0.09238415
The Bagged Tree R-squared value is about 0.52, meaning that the model explains roughly 52% of the variability in the data. The test-set RMSE for the Bagged Tree is about 0.117.
It looks like Tree Models are performing better than non-linear models and linear models based on their R-squared values, but let’s compare those and see which model performs best.
z<- rbind(
postResample(pred = RFPred, obs = Ins_test_Y$PH),
postResample(pred = GBM_Pred, obs = Ins_test_Y$PH),
postResample(pred = ST_Pred, obs = Ins_test_Y$PH),
postResample(pred = cubistModPred, obs = Ins_test_Y$PH),
postResample(pred = baggedTreePred, obs = Ins_test_Y$PH)
)
data.frame(z, row.names = c('Random Forest', 'Boosted Trees', 'Single Tree', 'Cubist', 'Bagged Tree'))
## RMSE Rsquared MAE
## Random Forest 0.09973481 0.6763930 0.07373373
## Boosted Trees 0.10374171 0.6257334 0.07771994
## Single Tree 0.12481257 0.4551853 0.09783835
## Cubist 0.10869935 0.6141540 0.07617720
## Bagged Tree 0.11696275 0.5233062 0.09238415
Based on the combination of R-squared and RMSE values for all the models we tried, the tree-based models clearly outperform the linear and other non-linear models. Random Forest posts the best test-set RMSE and R-squared, with Cubist close behind; we will use the Cubist model for our predictions, and note that Random Forest is a strong alternative.
Here is the list of most relevant variables in this Cubist model:
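The chunk that produced this table isn't echoed; a minimal sketch, assuming caret's varImp() method for Cubist fits was used:
varImp(cubistMod)   # usage-based importance of each predictor in the Cubist rules and models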
## Overall
## Mnf.Flow 85.5
## Balling 57.0
## Pressure.Vacuum 48.5
## Oxygen.Filler 47.5
## Alch.Rel 46.5
## Balling.Lvl 45.0
## Density 40.0
## Filler.Speed 36.0
## Carb.Rel 34.5
## Carb.Pressure1 32.0
## Bowl.Setpoint 30.0
## Temperature 29.5
## Hyd.Pressure3 27.5
## Usage.cont 25.0
## Carb.Flow 25.0
## MFR 24.5
## Brand.Code 21.0
## Hyd.Pressure2 21.0
## Hyd.Pressure1 20.0
## Filler.Level 18.0
## Carb.Volume 14.0
## Hyd.Pressure4 14.0
## Carb.Pressure 11.5
## PC.Volume 11.0
## Pressure.Setpoint 8.5
## Carb.Temp 8.0
## Fill.Pressure 5.5
## PSC.Fill 5.0
## PSC 4.0
## Air.Pressurer 3.5
## PSC.CO2 3.5
## Fill.Ounces 3.5
Now that we have identified our chosen model, we can use the evaluation data to make PH predictions and output them in an Excel-readable format. We are adding the predicted PH values to our evaluation data set; the export step is sketched after the preview below.
final_predictions <- predict(cubistMod, newdata=Ins_eval_cap_imputed)
Ins_eval_cap_imputed$PH <- final_predictions
final_predictions_df <- data.frame(Ins_eval_cap_imputed)
head(final_predictions_df)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6 0.1934
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0 0.0420
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4 0.0680
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0 0.0040
## 5 B 5.406667 24.09333 0.1600000 69.4 142.2 0.0400
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2 0.0780
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.40 0.04 -100 116.6 46.0 0
## 2 0.22 0.08 -100 118.8 46.2 0
## 3 0.10 0.02 -100 120.2 45.8 0
## 4 0.20 0.02 -100 124.8 40.0 0
## 5 0.30 0.06 -100 115.0 51.4 0
## 6 0.22 0.06 -100 118.6 46.4 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 0 0 96.0 129.4 7939411
## 2 0 0 112.0 120.0 8043319
## 3 0 0 98.0 119.4 8035302
## 4 0 0 125.4 120.2 7569377
## 5 0 0 94.0 116.0 8067394
## 6 0 0 94.0 120.4 8035302
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum
## 1 66.0 21.66 2950 0.88 264578.3 1.398 -3.8
## 2 65.6 17.60 2916 1.50 270575.2 2.942 -4.4
## 3 65.6 24.18 3056 0.90 269840.3 1.448 -4.2
## 4 69.0 18.12 28 0.74 216778.1 1.056 -4.0
## 5 66.4 21.32 3214 0.88 282620.4 1.398 -4.0
## 6 66.6 18.00 3064 0.84 267787.8 1.298 -3.8
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## 1 8.579338 -3.655311 8446.705 45.2 0.9930600 6.56
## 2 8.591945 -3.370005 7197.162 46.0 0.9932421 7.14
## 3 8.807056 -2.973485 7197.162 46.0 0.9932421 6.52
## 4 8.535749 -1.898400 7197.162 46.0 0.9932421 6.48
## 5 8.570547 -2.431042 7197.162 50.0 0.9932421 6.50
## 6 8.375602 -2.664477 7197.162 46.0 0.9932421 6.50
## Carb.Rel Balling.Lvl
## 1 5.34 1.48
## 2 5.58 3.04
## 3 5.34 1.46
## 4 5.50 1.48
## 5 5.38 1.46
## 6 5.42 1.44
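The export chunk itself isn't echoed in the report; a minimal sketch, assuming a plain CSV file (which Excel opens directly) named PH_predictions.csv:
# Write the evaluation data with its predicted PH values to an Excel-readable CSV.
# The file name is an assumption for illustration.
write.csv(final_predictions_df, "PH_predictions.csv", row.names = FALSE)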