Data

train <- read.csv("https://raw.githubusercontent.com/mkivenson/Predictive-Analytics/master/Project%202/Datasets/student_train.csv")
test <- read.csv("https://raw.githubusercontent.com/mkivenson/Predictive-Analytics/master/Project%202/Datasets/student_test.csv")

Data Exploration

Summary

First, we take a look at a summary of the data.

  • There are missing values in all of the predictors except Pressure.Vacuum and Air.Pressurer.
  • The target variable, PH, has four missing values; those rows are removed from the training data below.
  • All predictors except Brand.Code are numeric; Brand.Code will be encoded into dummy variables.
summary(train)
##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##   : 120     Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
##  A: 293     1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917   1st Qu.:65.60  
##  B:1239     Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
##  C: 304     Mean   :5.370   Mean   :23.97   Mean   :0.27712   Mean   :68.19  
##  D: 615     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200   3rd Qu.:70.60  
##             Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
##             NA's   :10      NA's   :38      NA's   :39        NA's   :27     
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.1   Mean   :0.08457   Mean   :0.1954   Mean   :0.05641  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
##  Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
##  NA's   :26      NA's   :33        NA's   :23       NA's   :39       
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1  
##  Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   :-0.80  
##  1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00   1st Qu.: 0.00  
##  Median :  65.20   Median :123.2   Median :46.40   Median :11.40  
##  Mean   :  24.57   Mean   :122.6   Mean   :47.92   Mean   :12.44  
##  3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:20.20  
##  Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :58.00  
##  NA's   :2         NA's   :32      NA's   :22      NA's   :11     
##  Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4     Filler.Level  
##  Min.   : 0.00   Min.   :-1.20   Min.   : 52.00   Min.   : 55.8  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 98.3  
##  Median :28.60   Median :27.60   Median : 96.00   Median :118.4  
##  Mean   :20.96   Mean   :20.46   Mean   : 96.29   Mean   :109.3  
##  3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00   3rd Qu.:120.0  
##  Max.   :59.40   Max.   :50.00   Max.   :142.00   Max.   :161.2  
##  NA's   :15      NA's   :15      NA's   :30       NA's   :20     
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.240  
##  1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144   1st Qu.:0.900  
##  Median :3982   Median :65.60   Median :21.79   Median :3028   Median :0.980  
##  Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468   Mean   :1.174  
##  3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186   3rd Qu.:1.620  
##  Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920  
##  NA's   :57     NA's   :14      NA's   :5       NA's   :2      NA's   :1      
##       MFR           Balling       Pressure.Vacuum        PH       
##  Min.   : 31.4   Min.   :-0.170   Min.   :-6.600   Min.   :7.880  
##  1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600   1st Qu.:8.440  
##  Median :724.0   Median : 1.648   Median :-5.400   Median :8.540  
##  Mean   :704.0   Mean   : 2.198   Mean   :-5.216   Mean   :8.546  
##  3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000   3rd Qu.:8.680  
##  Max.   :868.6   Max.   : 4.012   Max.   :-3.600   Max.   :9.360  
##  NA's   :212     NA's   :1                         NA's   :4      
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04684   Mean   :109.3   Mean   :47.62     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :12        NA's   :2       NA's   :12                       
##     Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :6.560   Median :5.400   Median :1.48  
##  Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :8.620   Max.   :6.060   Max.   :3.66  
##  NA's   :9       NA's   :10      NA's   :1
train <- drop_na(train, "PH") #remove rows without a PH value

Correlation Plot

The correlation plot below reveals high multicollinearity among the predictors, which motivates the dimension-reduction and penalized regression models fit later.

corrplot(cor(subset(train, select = -c(Brand.Code)), use = "complete.obs"),
         method = "color", type = "lower", tl.col = "black", tl.srt = 5)
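
To quantify which predictors drive this, caret's findCorrelation can flag variables with high pairwise correlations; a minimal sketch (the 0.9 cutoff is an assumption, not part of the original analysis):

num_preds <- subset(train, select = -c(Brand.Code))      # numeric predictors only
cor_mat <- cor(num_preds, use = "complete.obs")
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)     # candidates for removal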

More Data Exploration

EDA HERE

Preprocessing

Data exploration revealed the need for imputation and categorical encoding. These steps are completed as part of preprocessing.

Encoding

Brand.Code is a categorical variable that must be encoded prior to imputation. A dummy variable is created for each level of the category. Typically one fewer dummy variable than the number of levels is needed; here, however, the column contains blank entries, so a row of zeros across all four dummy variables indicates a missing brand code.

train$Brand.A <- ifelse(train$Brand.Code == 'A', 1, 0)
train$Brand.B <- ifelse(train$Brand.Code == 'B', 1, 0)
train$Brand.C <- ifelse(train$Brand.Code == 'C', 1, 0)
train$Brand.D <- ifelse(train$Brand.Code == 'D', 1, 0)
train <- subset(train, select = -c(Brand.Code))
datatable(train[c("Brand.A", "Brand.B", "Brand.C", "Brand.D")])
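
The test set needs the same encoding so that the training and test predictors share identical columns; a minimal sketch mirroring the code above, assuming test$Brand.Code has the same levels as in training:

test$Brand.A <- ifelse(test$Brand.Code == 'A', 1, 0)  # mirror the training-set encoding
test$Brand.B <- ifelse(test$Brand.Code == 'B', 1, 0)
test$Brand.C <- ifelse(test$Brand.Code == 'C', 1, 0)
test$Brand.D <- ifelse(test$Brand.Code == 'D', 1, 0)
test <- subset(test, select = -c(Brand.Code))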

Predictor and Target Split

The data arrive pre-split into training and test files, so this step only separates the predictors from the target variable in each.

X_train <- subset(train, select = -c(PH))
y_train <- train$PH
X_test <- subset(test, select = -c(PH))
y_test <- test$PH

Visualizing Missing Data

The following plots provide a visualization of missing data. There does not seem to be a significant pattern in the missing values, and none of the predictors are sparse (the highest missing rate, for MFR, is about 8%).

aggr(train[, sapply(train, is.numeric)], col = c('navyblue', 'red'), numbers = TRUE,
     sortVars = TRUE, labels = names(train), cex.axis = .7, gap = 3,
     ylab = c("Histogram of missing data", "Pattern"))
## Warning in plot.aggr(res, ...): not enough vertical space to display frequencies
## (too many combinations)

## 
##  Variables sorted by number of missings: 
##           Variable        Count
##                MFR 0.0810284379
##       Filler.Speed 0.0210362291
##          PC.Volume 0.0151928321
##            PSC.CO2 0.0151928321
##        Fill.Ounces 0.0148032723
##                PSC 0.0128554733
##     Carb.Pressure1 0.0124659135
##      Hyd.Pressure4 0.0109076743
##      Carb.Pressure 0.0105181145
##          Carb.Temp 0.0101285547
##           PSC.Fill 0.0089598753
##      Fill.Pressure 0.0070120764
##       Filler.Level 0.0062329568
##      Hyd.Pressure2 0.0058433970
##      Hyd.Pressure3 0.0058433970
##        Temperature 0.0046747176
##  Pressure.Setpoint 0.0046747176
##      Hyd.Pressure1 0.0042851578
##      Oxygen.Filler 0.0042851578
##        Carb.Volume 0.0038955980
##           Carb.Rel 0.0031164784
##           Alch.Rel 0.0027269186
##         Usage.cont 0.0019477990
##          Carb.Flow 0.0007791196
##      Bowl.Setpoint 0.0007791196
##        Balling.Lvl 0.0003895598
##           Mnf.Flow 0.0000000000
##            Density 0.0000000000
##            Balling 0.0000000000
##    Pressure.Vacuum 0.0000000000
##                 PH 0.0000000000
##      Air.Pressurer 0.0000000000
##            Brand.A 0.0000000000
##            Brand.B 0.0000000000
##            Brand.C 0.0000000000
##            Brand.D 0.0000000000

KNN Imputation

To fill in missing values, k-nearest-neighbor (KNN) imputation is applied via caret's knnImpute method, which also centers and scales the predictors. KNN imputation is unsupervised, so it does not require the target variable; the predictor/target split performed earlier ensures that only predictor data enters the imputation.

result <- preProcess(X_train, method = c("knnImpute"), k = 10)
X_train <- predict(result, X_train)
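
The same preProcess object can be applied to the test predictors, so the test set is imputed and scaled using statistics learned from the training data only; a one-line sketch, assuming X_test has been encoded with the same dummy variables:

X_test <- predict(result, X_test)   # reuse training-set neighbors and centering; avoids leakage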

Linear Models

Linear Regression

lm <- train(X_train, y_train,
                method = "lm",
                tuneLength = 30,
                trControl = trainControl(method = "cv", 10))
lm
## Linear Regression 
## 
## 2567 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2310, 2310, 2311, 2310, 2311, 2310, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1329494  0.4077779  0.1028825
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Robust Linear Model with PCA

To address the multicollinearity observed earlier, a robust linear model (rlm) is fit on principal components of the predictors.

rlmPCA <- train(X_train, y_train,
                method = "rlm",
                preProcess = "pca",
                tuneLength = 30,
                trControl = trainControl(method = "cv", 10))
rlmPCA 
## Robust Linear Model 
## 
## 2567 samples
##   35 predictor
## 
## Pre-processing: principal component signal extraction (35), centered
##  (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2309, 2310, 2311, 2311, 2309, 2312, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE       Rsquared   MAE      
##   FALSE      psi.huber     8.5468222  0.3549354  8.5456977
##   FALSE      psi.hampel    8.5468222  0.3549354  8.5456977
##   FALSE      psi.bisquare  8.5468217  0.3549455  8.5456972
##    TRUE      psi.huber     0.1389329  0.3540689  0.1089598
##    TRUE      psi.hampel    0.1387010  0.3553241  0.1093217
##    TRUE      psi.bisquare  0.1389408  0.3540800  0.1089015
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.

Partial Least Squares

plsTune <- train(X_train, y_train, 
                 method = "pls", 
                 tuneLength = 10,
                 trControl = trainControl(method = "cv"))
plsTune
## Partial Least Squares 
## 
## 2567 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2311, 2310, 2310, 2310, 2312, 2311, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1496592  0.2491210  0.1188350
##    2     0.1416311  0.3269660  0.1115888
##    3     0.1387773  0.3549088  0.1097609
##    4     0.1370646  0.3704209  0.1080421
##    5     0.1353654  0.3854881  0.1059025
##    6     0.1342459  0.3951727  0.1045765
##    7     0.1336177  0.4009232  0.1039205
##    8     0.1333488  0.4031784  0.1037260
##    9     0.1331430  0.4050793  0.1033611
##   10     0.1329442  0.4068206  0.1031016
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.

Ridge Regression

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 5))
ridgeRegFit <- train(X_train, y_train,
                     method = "ridge", 
                     tuneGrid = ridgeGrid, 
                     trControl = trainControl(method = "cv")
                     #preProc = c("center", "scale")
                     )
ridgeRegFit 
## Ridge Regression 
## 
## 2567 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2311, 2310, 2309, 2311, 2311, 2311, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.000   0.1329456  0.4078262  0.1029256
##   0.025   0.1333063  0.4045410  0.1035974
##   0.050   0.1337133  0.4010007  0.1040764
##   0.075   0.1340859  0.3977935  0.1044968
##   0.100   0.1344277  0.3949097  0.1048561
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.

Lasso Regression

lassomodel <- train(X_train, y_train,
               method = "lasso", 
               trControl = trainControl(method = "cv")
               )
lassomodel
## The lasso 
## 
## 2567 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2310, 2311, 2311, 2312, 2309, 2310, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE       Rsquared   MAE      
##   0.1       0.1501594  0.2906638  0.1189614
##   0.5       0.1340395  0.3995317  0.1044615
##   0.9       0.1331261  0.4062691  0.1030580
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.

Elastic Net Regression

enetGrid <- expand.grid(.lambda = c(0, 0.05, .3), 
                        .fraction = seq(.1, 1, length = 10))

enetTune <- train(X_train, y_train,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = trainControl(method = "cv")
                  #preProc = c("center", "scale")
                  )

enetTune
## Elasticnet 
## 
## 2567 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2311, 2310, 2310, 2311, 2311, 2310, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.1       0.1502123  0.2889262  0.1190627
##   0.00    0.2       0.1411710  0.3451036  0.1116336
##   0.00    0.3       0.1373435  0.3732643  0.1081259
##   0.00    0.4       0.1353678  0.3881327  0.1060069
##   0.00    0.5       0.1341995  0.3976907  0.1046679
##   0.00    0.6       0.1336330  0.4022118  0.1039154
##   0.00    0.7       0.1333862  0.4043529  0.1035214
##   0.00    0.8       0.1332050  0.4060155  0.1032622
##   0.00    0.9       0.1331920  0.4062390  0.1031252
##   0.00    1.0       0.1333073  0.4054437  0.1030583
##   0.05    0.1       0.1561784  0.2569834  0.1242210
##   0.05    0.2       0.1470736  0.3106551  0.1164173
##   0.05    0.3       0.1418885  0.3385391  0.1121822
##   0.05    0.4       0.1389627  0.3590661  0.1097055
##   0.05    0.5       0.1370339  0.3740538  0.1078480
##   0.05    0.6       0.1357945  0.3836835  0.1065542
##   0.05    0.7       0.1349849  0.3904633  0.1056027
##   0.05    0.8       0.1343214  0.3961405  0.1048585
##   0.05    0.9       0.1339053  0.3997474  0.1043670
##   0.05    1.0       0.1337652  0.4010906  0.1041229
##   0.30    0.1       0.1588225  0.2385485  0.1264929
##   0.30    0.2       0.1503180  0.2915162  0.1191828
##   0.30    0.3       0.1448361  0.3185452  0.1146231
##   0.30    0.4       0.1416396  0.3336908  0.1120305
##   0.30    0.5       0.1397021  0.3478657  0.1103337
##   0.30    0.6       0.1383471  0.3589361  0.1089971
##   0.30    0.7       0.1377341  0.3652744  0.1082360
##   0.30    0.8       0.1373993  0.3698985  0.1078340
##   0.30    0.9       0.1370237  0.3745216  0.1074446
##   0.30    1.0       0.1367314  0.3785984  0.1071366
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.9 and lambda = 0.

Non-Linear Models

Regression Trees
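
A regression tree can be fit with the same caret interface used above; the sketch below assumes the rpart method and the X_train/y_train objects defined earlier (illustrative only, no fitted results are shown):

rpartTune <- train(X_train, y_train,
                   method = "rpart",             # CART regression tree via the rpart package
                   tuneLength = 10,              # try 10 values of the complexity parameter cp
                   trControl = trainControl(method = "cv", 10))
rpartTune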

Evaluation
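
One way to compare the cross-validated performance of the models above is caret's resamples helper; a minimal sketch, assuming the model objects fit earlier (for a strictly paired comparison, the models should share CV folds via a common trainControl index):

results <- resamples(list(LM = lm, RLM_PCA = rlmPCA, PLS = plsTune,
                          Ridge = ridgeRegFit, Lasso = lassomodel, ENet = enetTune))
summary(results)                   # RMSE, R-squared, and MAE across folds
bwplot(results, metric = "RMSE")   # fold-level RMSE comparison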

Conclusion