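# Packages used below: tidyr (drop_na), corrplot, DT (datatable), VIM (aggr), caret (train, preProcess)
library(tidyr); library(corrplot); library(DT); library(VIM); library(caret)
train <- read.csv("https://raw.githubusercontent.com/mkivenson/Predictive-Analytics/master/Project%202/Datasets/student_train.csv")
test <- read.csv("https://raw.githubusercontent.com/mkivenson/Predictive-Analytics/master/Project%202/Datasets/student_test.csv")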
First, we take a look at a summary of the data.
All numeric predictors have some missing values except Pressure.Vacuum and Air.Pressurer.
The response, PH, has four missing values; those rows should be removed from the training data.
All predictors other than Brand.Code are numeric. Brand.Code will be encoded into dummy variables.
summary(train)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## : 120 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## A: 293 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60
## B:1239 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## C: 304 Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19
## D: 615 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## NA's :10 NA's :38 NA's :39 NA's :27
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## NA's :26 NA's :33 NA's :23 NA's :39
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00
## Median : 65.20 Median :123.2 Median :46.40 Median :11.40
## Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
## NA's :2 NA's :32 NA's :22 NA's :11
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3
## Median :28.60 Median :27.60 Median : 96.00 Median :118.4
## Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3
## 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
## Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
## NA's :15 NA's :15 NA's :30 NA's :20
## Filler.Speed Temperature Usage.cont Carb.Flow Density
## Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240
## 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900
## Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980
## Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174
## 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620
## Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
## NA's :57 NA's :14 NA's :5 NA's :2 NA's :1
## MFR Balling Pressure.Vacuum PH
## Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880
## 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440
## Median :724.0 Median : 1.648 Median :-5.400 Median :8.540
## Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546
## 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680
## Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360
## NA's :212 NA's :1 NA's :4
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :6.560 Median :5.400 Median :1.48
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
train <- drop_na(train, "PH") #remove rows without a PH value
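The NA counts reported by summary() can also be tallied directly, which is easier to scan than the full summary table. The following is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not the project data); it also shows a base R equivalent of dropping rows with a missing response.

```r
# Toy data frame standing in for `train` (values are made up for illustration).
toy <- data.frame(
  PH        = c(8.4, NA, 8.6, 8.5),
  Carb.Flow = c(1144, 3028, NA, NA),
  Density   = c(0.90, 0.98, 1.62, 1.17)
)

# Number of missing values per column, largest first.
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_counts)

# Keep only rows where the response is observed (base R version of the drop_na step).
toy_complete_ph <- toy[!is.na(toy$PH), ]
print(nrow(toy_complete_ph))
```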
This correlation plot shows high multicollinearity in the dataset.
corrplot(cor(subset(train, select = -c(Brand.Code)), use = "complete.obs"), method="color", type="lower", tl.col = "black", tl.srt = 5)
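To back up the visual impression with numbers, the strongly correlated pairs can be listed from the same correlation matrix. This is a sketch on simulated data; the column names and the 0.9 cutoff are assumptions chosen for illustration, not values taken from the analysis.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(n)                  # independent of both
)

cors <- cor(toy, use = "complete.obs")
cors[upper.tri(cors, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal

cutoff <- 0.9  # illustrative threshold (an assumption)
high <- which(abs(cors) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(high_pairs)
```

Running the same steps on the numeric columns of train would give a list of candidate predictors to drop or combine.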
EDA HERE
Data exploration revealed the need for imputation and encoding. Both steps are completed as part of preprocessing.
Brand.Code is a categorical variable that must be encoded before imputation. A dummy variable is created for each of its four levels (A, B, C, D). Typically, one fewer dummy variable than categories is required. However, 120 rows have a blank Brand.Code, so all four dummies are kept and a row with 0 in every dummy variable corresponds to a missing brand code.
train$Brand.A <- ifelse(train$Brand.Code == 'A', 1, 0)
train$Brand.B <- ifelse(train$Brand.Code == 'B', 1, 0)
train$Brand.C <- ifelse(train$Brand.Code == 'C', 1, 0)
train$Brand.D <- ifelse(train$Brand.Code == 'D', 1, 0)
train <- subset(train, select = -c(Brand.Code))
datatable(train[c("Brand.A", "Brand.B", "Brand.C", "Brand.D")])
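The encoding rule can be checked on a small vector. The sketch below assumes, as the summary output suggests, that a missing brand code is stored as an empty string rather than NA; if it were NA, the comparison `brand == lv` would return NA instead of 0. The vector `brand` is made up for illustration.

```r
brand <- c("A", "", "B", "D", "C", "")  # "" marks a missing brand code
levels_to_encode <- c("A", "B", "C", "D")

# One 0/1 indicator column per brand level.
dummies <- sapply(levels_to_encode, function(lv) as.integer(brand == lv))
colnames(dummies) <- paste0("Brand.", levels_to_encode)
print(dummies)

# Rows with a missing brand code have no indicator set, so their row sum is 0.
print(rowSums(dummies))
```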
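# Separate predictors from the response
X_train <- subset(train, select = -c(PH))
y_train <- train$PH
# NOTE: test still contains Brand.Code; apply the same dummy encoding before predicting on X_test
X_test <- subset(test, select = -c(PH))
y_test <- test$PH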
The following plots provide a visualization of missing data. There does not seem to be a significant pattern in the missing values, and none of the predictors are sparse (the highest missing rate is 8%, for MFR).
aggr(train[,sapply(train, is.numeric)], col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(train), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
## Warning in plot.aggr(res, ...): not enough vertical space to display frequencies
## (too many combinations)
##
## Variables sorted by number of missings:
## Variable Count
## MFR 0.0810284379
## Filler.Speed 0.0210362291
## PC.Volume 0.0151928321
## PSC.CO2 0.0151928321
## Fill.Ounces 0.0148032723
## PSC 0.0128554733
## Carb.Pressure1 0.0124659135
## Hyd.Pressure4 0.0109076743
## Carb.Pressure 0.0105181145
## Carb.Temp 0.0101285547
## PSC.Fill 0.0089598753
## Fill.Pressure 0.0070120764
## Filler.Level 0.0062329568
## Hyd.Pressure2 0.0058433970
## Hyd.Pressure3 0.0058433970
## Temperature 0.0046747176
## Pressure.Setpoint 0.0046747176
## Hyd.Pressure1 0.0042851578
## Oxygen.Filler 0.0042851578
## Carb.Volume 0.0038955980
## Carb.Rel 0.0031164784
## Alch.Rel 0.0027269186
## Usage.cont 0.0019477990
## Carb.Flow 0.0007791196
## Bowl.Setpoint 0.0007791196
## Balling.Lvl 0.0003895598
## Mnf.Flow 0.0000000000
## Density 0.0000000000
## Balling 0.0000000000
## Pressure.Vacuum 0.0000000000
## PH 0.0000000000
## Air.Pressurer 0.0000000000
## Brand.A 0.0000000000
## Brand.B 0.0000000000
## Brand.C 0.0000000000
## Brand.D 0.0000000000
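To fill in missing data, k-nearest-neighbor imputation is applied through caret's preProcess with method = "knnImpute" and k = 10. KNN imputation is unsupervised, meaning it does not require a target variable. The predictors were separated from the response earlier so that only predictor data is used for imputation. Note that knnImpute in caret also centers and scales the predictors, so X_train is standardized after this step.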
result <- preProcess(X_train, method = c("knnImpute"), k = 10)
X_train <- predict(result, X_train)
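For intuition about what KNN imputation does, here is a simplified base R sketch. It is not caret's implementation: the function name `knn_impute_sketch` is hypothetical, neighbors are drawn only from fully observed rows, and the toy data is made up. Columns are standardized first so that no single variable dominates the distance.

```r
knn_impute_sketch <- function(X, k = 2) {
  X <- scale(as.matrix(X))  # center and scale each column (NAs are ignored)
  complete <- X[stats::complete.cases(X), , drop = FALSE]
  for (i in which(!stats::complete.cases(X))) {
    miss <- is.na(X[i, ])
    # Euclidean distance to every complete row, using the observed columns only
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, X[i, !miss])
    d  <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    # Fill each missing entry with the mean of its k nearest neighbors
    X[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  X
}

toy <- data.frame(
  x = c(1, 2, 4, 10, 2),
  y = c(1, 2, 4, 10, NA)
)
imputed <- knn_impute_sketch(toy, k = 2)
print(imputed)
```

In the toy data, the last row is closest to rows 2 and 1 on x, so its missing y is filled with the mean of their (standardized) y values.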
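# Named lmFit rather than lm to avoid masking stats::lm.
# method = "lm" has no tuning grid, so tuneLength is omitted.
lmFit <- train(X_train, y_train,
               method = "lm",
               trControl = trainControl(method = "cv", number = 10))
lmFit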
## Linear Regression
##
## 2567 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2310, 2310, 2311, 2310, 2311, 2310, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1329494 0.4077779 0.1028825
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
rlmPCA <- train(X_train, y_train,
method = "rlm",
preProcess = "pca",
tuneLength = 30,
trControl = trainControl(method = "cv", 10))
rlmPCA
## Robust Linear Model
##
## 2567 samples
## 35 predictor
##
## Pre-processing: principal component signal extraction (35), centered
## (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2310, 2311, 2311, 2309, 2312, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 8.5468222 0.3549354 8.5456977
## FALSE psi.hampel 8.5468222 0.3549354 8.5456977
## FALSE psi.bisquare 8.5468217 0.3549455 8.5456972
## TRUE psi.huber 0.1389329 0.3540689 0.1089598
## TRUE psi.hampel 0.1387010 0.3553241 0.1093217
## TRUE psi.bisquare 0.1389408 0.3540800 0.1089015
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.
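The intercept = FALSE rows report an RMSE near 8.55, which is close to the mean PH of 8.546. A plausible explanation is that PCA preprocessing centers the predictors, so a model without an intercept predicts values centered on zero and misses the response by roughly its mean. Rsquared stays similar because caret computes it as a squared correlation, which ignores a constant offset. The simulation below illustrates the effect; the data and coefficients are assumptions chosen only to mimic the scale of PH.

```r
set.seed(42)
n <- 500
x <- as.numeric(scale(rnorm(n)))         # centered predictor, like a PCA score
y <- 8.5 + 0.1 * x + rnorm(n, sd = 0.1)  # response with a mean far from zero

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

fit_int   <- lm(y ~ x)      # with intercept
fit_noint <- lm(y ~ 0 + x)  # intercept forced to zero

rmse_int   <- rmse(y, fitted(fit_int))
rmse_noint <- rmse(y, fitted(fit_noint))
print(c(with_intercept = rmse_int, without_intercept = rmse_noint))
```

The no-intercept RMSE lands close to mean(y), mirroring the pattern in the table above.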
plsTune <- train(X_train, y_train,
method = "pls",
tuneLength = 10,
trControl = trainControl(method = "cv"))
plsTune
## Partial Least Squares
##
## 2567 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2311, 2310, 2310, 2310, 2312, 2311, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1496592 0.2491210 0.1188350
## 2 0.1416311 0.3269660 0.1115888
## 3 0.1387773 0.3549088 0.1097609
## 4 0.1370646 0.3704209 0.1080421
## 5 0.1353654 0.3854881 0.1059025
## 6 0.1342459 0.3951727 0.1045765
## 7 0.1336177 0.4009232 0.1039205
## 8 0.1333488 0.4031784 0.1037260
## 9 0.1331430 0.4050793 0.1033611
## 10 0.1329442 0.4068206 0.1031016
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 5))
ridgeRegFit <- train(X_train, y_train,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv")
#preProc = c("center", "scale")
)
ridgeRegFit
## Ridge Regression
##
## 2567 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2311, 2310, 2309, 2311, 2311, 2311, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000 0.1329456 0.4078262 0.1029256
## 0.025 0.1333063 0.4045410 0.1035974
## 0.050 0.1337133 0.4010007 0.1040764
## 0.075 0.1340859 0.3977935 0.1044968
## 0.100 0.1344277 0.3949097 0.1048561
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.
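The selected lambda = 0 means no shrinkage, which is why the ridge RMSE matches the plain linear regression to four decimal places. The sketch below checks this with the textbook ridge estimator, solving (X'X + lambda I) b = X'y, on simulated data. The helper `ridge_coef` is hypothetical, and the package behind caret's "ridge" method may scale lambda differently, so only the lambda = 0 case is meant to carry over directly.

```r
set.seed(7)
n <- 100
X <- scale(matrix(rnorm(n * 3), ncol = 3))  # centered, scaled predictors
y <- as.numeric(X %*% c(0.5, -0.2, 0.1) + rnorm(n, sd = 0.1))
y <- y - mean(y)                            # center the response so no intercept is needed

# Textbook ridge estimator: solve (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  as.numeric(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols <- as.numeric(coef(lm(y ~ 0 + X)))
r0  <- ridge_coef(X, y, lambda = 0)
r10 <- ridge_coef(X, y, lambda = 10)
print(rbind(ols = ols, ridge_0 = r0, ridge_10 = r10))
```

At lambda = 0 the coefficients equal the least-squares ones; at lambda = 10 they shrink toward zero.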
lassomodel <- train(X_train, y_train,
method = "lasso",
trControl = trainControl(method = "cv")
)
lassomodel
## The lasso
##
## 2567 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2310, 2311, 2311, 2312, 2309, 2310, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 0.1501594 0.2906638 0.1189614
## 0.5 0.1340395 0.3995317 0.1044615
## 0.9 0.1331261 0.4062691 0.1030580
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.
enetGrid <- expand.grid(.lambda = c(0, 0.05, .3),
.fraction = seq(.1, 1, length = 10))
enetTune <- train(X_train, y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = "cv")
#preProc = c("center", "scale")
)
enetTune
## Elasticnet
##
## 2567 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2311, 2310, 2310, 2311, 2311, 2310, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.1 0.1502123 0.2889262 0.1190627
## 0.00 0.2 0.1411710 0.3451036 0.1116336
## 0.00 0.3 0.1373435 0.3732643 0.1081259
## 0.00 0.4 0.1353678 0.3881327 0.1060069
## 0.00 0.5 0.1341995 0.3976907 0.1046679
## 0.00 0.6 0.1336330 0.4022118 0.1039154
## 0.00 0.7 0.1333862 0.4043529 0.1035214
## 0.00 0.8 0.1332050 0.4060155 0.1032622
## 0.00 0.9 0.1331920 0.4062390 0.1031252
## 0.00 1.0 0.1333073 0.4054437 0.1030583
## 0.05 0.1 0.1561784 0.2569834 0.1242210
## 0.05 0.2 0.1470736 0.3106551 0.1164173
## 0.05 0.3 0.1418885 0.3385391 0.1121822
## 0.05 0.4 0.1389627 0.3590661 0.1097055
## 0.05 0.5 0.1370339 0.3740538 0.1078480
## 0.05 0.6 0.1357945 0.3836835 0.1065542
## 0.05 0.7 0.1349849 0.3904633 0.1056027
## 0.05 0.8 0.1343214 0.3961405 0.1048585
## 0.05 0.9 0.1339053 0.3997474 0.1043670
## 0.05 1.0 0.1337652 0.4010906 0.1041229
## 0.30 0.1 0.1588225 0.2385485 0.1264929
## 0.30 0.2 0.1503180 0.2915162 0.1191828
## 0.30 0.3 0.1448361 0.3185452 0.1146231
## 0.30 0.4 0.1416396 0.3336908 0.1120305
## 0.30 0.5 0.1397021 0.3478657 0.1103337
## 0.30 0.6 0.1383471 0.3589361 0.1089971
## 0.30 0.7 0.1377341 0.3652744 0.1082360
## 0.30 0.8 0.1373993 0.3698985 0.1078340
## 0.30 0.9 0.1370237 0.3745216 0.1074446
## 0.30 1.0 0.1367314 0.3785984 0.1071366
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.9 and lambda = 0.
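To compare the six fits side by side, the best cross-validated row from each output above can be collected into one table. The numbers below are copied from those outputs; the data frame name `cv_results` is just a label for this sketch. Because no seed was set, each model was evaluated on different folds, so differences in the fourth decimal place should not be read as meaningful.

```r
# Best cross-validated result per model, copied from the outputs above.
cv_results <- data.frame(
  model    = c("lm", "rlm + PCA", "pls", "ridge", "lasso", "enet"),
  RMSE     = c(0.1329494, 0.1387010, 0.1329442, 0.1329456, 0.1331261, 0.1331920),
  Rsquared = c(0.4077779, 0.3553241, 0.4068206, 0.4078262, 0.4062691, 0.4062390)
)

# Sort from lowest to highest RMSE.
cv_results <- cv_results[order(cv_results$RMSE), ]
print(cv_results, row.names = FALSE)
```

Apart from the PCA-based robust model, all fits fall within about 0.0003 RMSE of each other.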