Part 1: Load Data and EDA

Load Data

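library(readxl)  # provides read_excel()

df<-read_excel('StudentData.xlsx')
df_eval<-read_excel('StudentEvaluation.xlsx')
# read_excel() returns a tibble; convert to a base data.frame
df<-data.frame(df)
df_eval<-data.frame(df_eval)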

Data Sample

head(df)
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

Exploratory data analysis

skim(df)
Data summary
Name                df
Number of rows      2571
Number of columns   33

Column type frequency:
  character         1
  numeric           32

Group variables     None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Brand.Code           120           0.95    1    1      0         4           0

Variable type: numeric

skim_variable      n_missing  complete_rate     mean       sd       p0      p25      p50      p75     p100  hist
Carb.Volume               10           1.00     5.37     0.11     5.04     5.29     5.35     5.45     5.70  ▁▆▇▅▁
Fill.Ounces               38           0.99    23.97     0.09    23.63    23.92    23.97    24.03    24.32  ▁▂▇▂▁
PC.Volume                 39           0.98     0.28     0.06     0.08     0.24     0.27     0.31     0.48  ▁▃▇▂▁
Carb.Pressure             27           0.99    68.19     3.54    57.00    65.60    68.20    70.60    79.40  ▁▅▇▃▁
Carb.Temp                 26           0.99   141.09     4.04   128.60   138.40   140.80   143.80   154.00  ▁▅▇▃▁
PSC                       33           0.99     0.08     0.05     0.00     0.05     0.08     0.11     0.27  ▆▇▃▁▁
PSC.Fill                  23           0.99     0.20     0.12     0.00     0.10     0.18     0.26     0.62  ▆▇▃▁▁
PSC.CO2                   39           0.98     0.06     0.04     0.00     0.02     0.04     0.08     0.24  ▇▅▂▁▁
Mnf.Flow                   2           1.00    24.57   119.48  -100.20  -100.00    65.20   140.80   229.40  ▇▁▁▇▂
Carb.Pressure1            32           0.99   122.59     4.74   105.60   119.00   123.20   125.40   140.20  ▁▃▇▂▁
Fill.Pressure             22           0.99    47.92     3.18    34.60    46.00    46.40    50.00    60.40  ▁▁▇▂▁
Hyd.Pressure1             11           1.00    12.44    12.43    -0.80     0.00    11.40    20.20    58.00  ▇▅▂▁▁
Hyd.Pressure2             15           0.99    20.96    16.39     0.00     0.00    28.60    34.60    59.40  ▇▂▇▅▁
Hyd.Pressure3             15           0.99    20.46    15.98    -1.20     0.00    27.60    33.40    50.00  ▇▁▃▇▁
Hyd.Pressure4             30           0.99    96.29    13.12    52.00    86.00    96.00   102.00   142.00  ▁▃▇▂▁
Filler.Level              20           0.99   109.25    15.70    55.80    98.30   118.40   120.00   161.20  ▁▃▅▇▁
Filler.Speed              57           0.98  3687.20   770.82   998.00  3888.00  3982.00  3998.00  4030.00  ▁▁▁▁▇
Temperature               14           0.99    65.97     1.38    63.60    65.20    65.60    66.40    76.20  ▇▃▁▁▁
Usage.cont                 5           1.00    20.99     2.98    12.08    18.36    21.79    23.75    25.90  ▁▃▅▃▇
Carb.Flow                  2           1.00  2468.35  1073.70    26.00  1144.00  3028.00  3186.00  5104.00  ▂▅▆▇▁
Density                    1           1.00     1.17     0.38     0.24     0.90     0.98     1.62     1.92  ▁▅▇▂▆
MFR                      212           0.92   704.05    73.90    31.40   706.30   724.00   731.00   868.60  ▁▁▁▂▇
Balling                    1           1.00     2.20     0.93    -0.17     1.50     1.65     3.29     4.01  ▁▇▇▁▇
Pressure.Vacuum            0           1.00    -5.22     0.57    -6.60    -5.60    -5.40    -5.00    -3.60  ▂▇▆▂▁
PH                         4           1.00     8.55     0.17     7.88     8.44     8.54     8.68     9.36  ▁▅▇▂▁
Oxygen.Filler             12           1.00     0.05     0.05     0.00     0.02     0.03     0.06     0.40  ▇▁▁▁▁
Bowl.Setpoint              2           1.00   109.33    15.30    70.00   100.00   120.00   120.00   140.00  ▁▂▃▇▁
Pressure.Setpoint         12           1.00    47.62     2.04    44.00    46.00    46.00    50.00    52.00  ▁▇▁▆▁
Air.Pressurer              0           1.00   142.83     1.21   140.80   142.20   142.60   143.00   148.20  ▅▇▁▁▁
Alch.Rel                   9           1.00     6.90     0.51     5.28     6.54     6.56     7.24     8.62  ▁▇▂▃▁
Carb.Rel                  10           1.00     5.44     0.13     4.96     5.34     5.40     5.54     6.06  ▁▇▇▂▁
Balling.Lvl                1           1.00     2.05     0.87     0.00     1.38     1.48     3.14     3.66  ▁▇▂▁▆

Data Process

Missing Data:

We can use kNN imputation to replace the missing values identified above. For each record with a missing value, the method finds the k records that are most similar on the other features and uses them to infer the value. For a categorical variable, the most frequent category among the neighbors is used; for a numeric variable, the neighbors' values are aggregated. Assuming kNN() here is the VIM implementation (the subset() call below drops the extra indicator columns it appends), its defaults are k = 5 and the median, not the mean, for numeric variables.

missmap(df)

df <- kNN(df)%>%
      subset(select=Brand.Code:Balling.Lvl)
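
The idea behind kNN imputation can be sketched in a few lines of base R. This is an illustration only, not what the package does internally: the toy data, the helper name knn_impute_column, the choice of k = 3, and the use of plain Euclidean distance are all assumptions made for the example (the VIM documentation describes a Gower-type distance that also handles categorical columns).

```r
# Hypothetical toy data: one missing value in column y.
toy <- data.frame(
  x1 = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  x2 = c(2.0, 2.1, 1.9, 8.0, 8.1, 2.05),
  y  = c(10, 11, 9, 50, 52, NA)
)

# Impute one numeric column from the k nearest rows that have it observed.
# Assumption: Euclidean distance on the other, fully observed columns.
knn_impute_column <- function(data, target, k = 3) {
  predictors    <- setdiff(names(data), target)
  missing_rows  <- which(is.na(data[[target]]))
  complete_rows <- which(!is.na(data[[target]]))
  X <- as.matrix(data[predictors])
  for (i in missing_rows) {
    # Distance from row i to every row with an observed target value.
    diffs <- sweep(X[complete_rows, , drop = FALSE], 2, X[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- complete_rows[order(d)[seq_len(min(k, length(d)))]]
    # Aggregate the neighbours' values with the median.
    data[[target]][i] <- median(data[[target]][nearest])
  }
  data
}

toy_imputed <- knn_impute_column(toy, "y", k = 3)
toy_imputed$y[6]  # 10: the median of the three closest rows (10, 11, 9)
```

The two distant rows (y = 50 and 52) never enter the estimate, which is the point of using neighbors rather than a global mean or median.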

Next, we double check that this worked correctly by checking for missing values.

sapply(df, function(x) sum(is.na(x)))  
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 0                 0                 0 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 0                 0                 0                 0 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 0                 0                 0 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 0                 0                 0                 0 
##       Balling.Lvl 
##                 0

Fortunately, we can see that there are now no missing data points!

Change non-numeric data to factor

df$Brand.Code=as.factor(df$Brand.Code)

We'll need to dummy code our categorical variable. This process creates a new 0/1 column for each level of Brand.Code. Dummy encoding often drops one level, which becomes the baseline: a categorical feature with five unique values then yields four columns, and a row of all zeros represents the reference level. The call below does not drop a level. It keeps all four Brand.Code columns (A to D), as the output further down shows. To drop the baseline with caret, pass fullRank = TRUE to dummyVars().

df_dummy <- dummyVars(~ 0 + ., data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))
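
The difference between keeping every level and dropping a baseline can be seen with base R's model.matrix(). This is a small sketch on a hypothetical factor named brand, not the caret call used in this report.

```r
# Hypothetical factor with the same four levels as Brand.Code.
brand <- factor(c("B", "A", "C", "D", "A"), levels = c("A", "B", "C", "D"))

# One indicator column per level (no baseline dropped).
full <- model.matrix(~ 0 + brand)
colnames(full)       # "brandA" "brandB" "brandC" "brandD"

# Reference coding: the first level (A) becomes the all-zero baseline.
reference <- model.matrix(~ brand)[, -1]
colnames(reference)  # "brandB" "brandC" "brandD"

reference[2, ]       # the second observation is brand A, so every entry is 0
```

Tree-based models are generally not affected by the redundant column, while linear models without regularization usually need the baseline dropped to avoid perfect collinearity.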

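We use preProcess() to apply three transformations to the columns of df_dummy. Box-Cox reduces skew so that a variable's distribution is closer to normal. Center computes the mean of a variable and subtracts it from each value. Scale computes the standard deviation of a variable and divides each value by it. Because PH is one of the columns in df_dummy, the target is transformed along with the predictors, so the error metrics reported later are on this transformed scale rather than in pH units.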

myTrans <- preProcess(df_dummy, method=c("BoxCox","center", "scale"))
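
What these three steps do to a single column can be sketched in base R. The sample x and the fixed lambda = 0 are assumptions for illustration; preProcess() estimates a lambda per column and, since Box-Cox requires strictly positive values, applies it only to columns that qualify.

```r
# Box-Cox transform for a strictly positive vector and a given lambda.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical right-skewed, strictly positive sample.
x <- c(0.5, 1, 1.5, 2, 3, 5, 9, 20)

x_bc       <- box_cox(x, lambda = 0)   # lambda = 0 is the log transform
x_centered <- x_bc - mean(x_bc)        # center: subtract the mean
x_scaled   <- x_centered / sd(x_bc)    # scale: divide by the standard deviation

# After centering and scaling the column has mean 0 and standard deviation 1.
round(c(mean = mean(x_scaled), sd = sd(x_scaled)), 6)
```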

myTrans
## Created from 2571 samples and 36 variables
## 
## Pre-processing:
##   - Box-Cox transformation (23)
##   - centered (36)
##   - ignored (0)
##   - scaled (36)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -1.8000  0.2000  0.1087  2.0000  2.0000
df_dummy2<- predict(myTrans,df_dummy)
head(df_dummy2)
##   Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D Carb.Volume Fill.Ounces
## 1   -0.3585688     0.952322    -0.372919   -0.5612192  -0.2589619 -0.09639336
## 2    2.7877805    -1.049657    -0.372919   -0.5612192   0.5567826  0.36210508
## 3   -0.3585688     0.952322    -0.372919   -0.5612192  -0.7810203  0.97462569
## 4    2.7877805    -1.049657    -0.372919   -0.5612192   0.6788334  0.36210508
## 5    2.7877805    -1.049657    -0.372919   -0.5612192   1.0990267  3.90266109
## 6    2.7877805    -1.049657    -0.372919   -0.5612192   0.1224351 -0.55412721
##    PC.Volume Carb.Pressure   Carb.Temp        PSC   PSC.Fill    PSC.CO2
## 1 -0.2061096    0.02105236  0.05878476  0.5188415  0.5506394 -0.3832964
## 2 -0.6219569    0.07742060 -0.34511132  0.8636297  0.2099599 -0.3832964
## 3 -0.2061096    0.74127055  0.92538400  0.2572756  1.2319986  2.4202742
## 4  0.2842233   -1.50564148 -2.26288459 -0.7373813  1.9133578 -0.3832964
## 5 -3.0352438   -0.26329167 -1.08163853 -1.3564659 -0.3010595  1.4857507
## 6 -0.1067558   -0.43593624 -0.65602336  0.2572756  0.3802996 -0.3832964
##    Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2
## 1 -1.042813     -0.7905954    -0.5655853     -1.004388     -1.280269
## 2 -1.042813     -0.1934247    -0.5655853     -1.004388     -1.280269
## 3 -1.042813     -0.4911406    -0.5655853     -1.004388     -1.280269
## 4 -1.042813     -1.5688250    -0.4315583     -1.004388     -1.280269
## 5 -1.042813     -0.8764776    -0.6333003     -1.004388     -1.280269
## 6 -1.042813     -0.6192637    -0.7014900     -1.004388     -1.280269
##   Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 1      -1.28226     1.5182190    0.7752774    0.4808833  0.04843332 -1.5572324
## 2      -1.28226     0.7652786    0.5823763    0.4517280  1.22743835 -0.4437290
## 3      -1.28226    -1.1386019    0.6857263    0.5138226  0.79519951 -1.1123441
## 4      -1.28226    -0.2670731    0.5238637    0.4991647 -0.25987895 -1.2115773
## 5      -1.28226    -0.2670731    0.5823763    0.4955047 -0.25987895 -1.1358656
## 6      -1.28226     1.3998208    0.7005896    0.5028264  0.20049791  0.9781031
##   Carb.Flow    Density       MFR    Balling Pressure.Vacuum        PH
## 1 0.4026550 -0.7414937 0.3776862 -0.8580648        2.133539 -1.077645
## 2 0.6338693 -0.6022966 0.4024579 -0.7506974        2.133539 -1.643078
## 3 0.3832488  1.0911946 0.5160841  1.0144229        2.484420  2.336050
## 4 0.5438643  1.0108974 0.4549554  0.9070554        1.431776 -1.755349
## 5 0.5351217  1.0108974 0.3474930  0.9070554        1.431776 -1.643078
## 6 0.4199351  0.9699633 0.5691724  0.8533717        1.431776 -1.304635
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer   Alch.Rel
## 1   -0.37820721     0.7087745        -0.5607126    -0.1862599 -0.6169792
## 2   -0.22426969     0.7087745        -0.3496093     0.1533607 -0.6665236
## 3   -0.29866961     0.7087745        -0.4544814    -0.7010823  1.5094568
## 4   -0.08825242     0.7087745        -0.7773469     2.7707509  0.6057262
## 5   -0.08825242     0.7087745        -0.7773469     2.7707509  0.6057262
## 6   -0.29866961     0.7087745        -0.7773469     3.0859285  0.6441651
##      Carb.Rel Balling.Lvl
## 1 -0.91427031  -0.6553539
## 2 -1.08355183  -0.5634377
## 3  2.89505992   1.4127593
## 4 -0.09578199   1.1370109
## 5  0.06252259   1.1370109
## 6  0.06252259   1.1140319

Target/Feature Plots

Correlation

df_col<-df[sapply(df,is.numeric)]
ggcorr(df_col,label = T,label_round = 1 )

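df %>%
  dplyr::select(Brand.Code,PH,Mnf.Flow,Usage.cont,Bowl.Setpoint,Density, Temperature,Air.Pressurer)%>%
  # color by the Brand.Code column of the piped data; plot only columns 2 to 8
  ggpairs(aes(color = Brand.Code, alpha = 0.9), columns = 2:8)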

tbl <- table(df$Brand.Code, df$PH)
barplot(tbl, main="PH per Brand Code",
        col=c("thistle3","darksalmon","cornflowerblue","darkolivegreen3"),
        ylim=c(0, 180), legend=TRUE)

Splitting our data into a Training/Test set

df_rows<-createDataPartition(df_dummy2$PH,p=0.8, list = FALSE)
df_train<-df_dummy2[df_rows,]
df_test<-df_dummy2[-df_rows,]
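
The split above is random, so the rows that land in each set, and every metric that follows, change from run to run unless a seed is fixed first. A minimal base R sketch of a repeatable 80/20 split is shown here; the seed value and the stand-in data frame toy are assumptions, and unlike createDataPartition() this simple version does not balance the split across quantile groups of the outcome.

```r
set.seed(123)  # hypothetical seed; any fixed value makes the split repeatable

# Hypothetical stand-in for df_dummy2 with a numeric target column PH.
toy <- data.frame(PH = rnorm(100), x = runif(100))

# Simple random 80/20 split on row indices.
n_train    <- floor(0.8 * nrow(toy))
train_rows <- sample(seq_len(nrow(toy)), size = n_train)
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

c(train = nrow(toy_train), test = nrow(toy_test))  # 80 and 20
```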

Part 2: Fit the Model

Random Forest

rfModel<-randomForest(PH~.,data=df_train)
rfModel
## 
## Call:
##  randomForest(formula = PH ~ ., data = df_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 0.3255122
##                     % Var explained: 66.95

Check the importance of the variables

importance(rfModel)
##                   IncNodePurity
## Brand.Code.A           9.165471
## Brand.Code.B          15.590595
## Brand.Code.C          94.312222
## Brand.Code.D          10.842818
## Carb.Volume           35.156523
## Fill.Ounces           30.704189
## PC.Volume             42.981434
## Carb.Pressure         23.177384
## Carb.Temp             21.022808
## PSC                   31.181205
## PSC.Fill              25.448667
## PSC.CO2               16.810734
## Mnf.Flow             253.257329
## Carb.Pressure1        70.962994
## Fill.Pressure         34.948208
## Hyd.Pressure1         27.296341
## Hyd.Pressure2         37.200558
## Hyd.Pressure3         41.787610
## Hyd.Pressure4         27.855715
## Filler.Level          93.146597
## Filler.Speed          51.047576
## Temperature           88.393631
## Usage.cont           160.072133
## Carb.Flow             47.870938
## Density               48.878262
## MFR                   41.031190
## Balling               55.651280
## Pressure.Vacuum       71.155322
## Oxygen.Filler         84.570158
## Bowl.Setpoint         78.935494
## Pressure.Setpoint     30.501402
## Air.Pressurer         68.141795
## Alch.Rel              70.988096
## Carb.Rel              78.246228
## Balling.Lvl           69.279240
varImpPlot(rfModel,type=2)
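
The importance matrix is printed in column order, which makes the strongest predictors hard to spot. A short base R sketch of sorting it is shown below; it uses a hand-copied subset of the IncNodePurity values from the output above rather than the fitted rfModel object.

```r
# A few IncNodePurity values copied from the importance() output.
imp <- matrix(
  c(9.165471, 94.312222, 253.257329, 93.146597, 160.072133),
  ncol = 1,
  dimnames = list(
    c("Brand.Code.A", "Brand.Code.C", "Mnf.Flow", "Filler.Level", "Usage.cont"),
    "IncNodePurity"
  )
)

# Sort rows from most to least important; drop = FALSE keeps the matrix shape.
imp_sorted <- imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]
imp_sorted
```

With the real model, the same indexing applied to importance(rfModel) would put Mnf.Flow and Usage.cont at the top, matching the two largest values in the printed output.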

plot(rfModel)

Predict the test dataset

rfPred<-predict(rfModel,df_test)
rfResult<-postResample(rfPred,df_test$PH)
rfResult
##      RMSE  Rsquared       MAE 
## 0.5702752 0.7181757 0.4110123
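
For reference, the three numbers reported by postResample() for a regression can be reproduced by hand. The sketch below uses the six PH values from head(df) as observations and made-up predictions, so the resulting figures are illustrative only. Note that the Rsquared value is the squared correlation between predictions and observations.

```r
obs  <- c(8.36, 8.26, 8.94, 8.24, 8.26, 8.32)  # PH values from head(df)
pred <- c(8.40, 8.30, 8.80, 8.30, 8.20, 8.35)  # made-up predictions

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
rsq  <- cor(pred, obs)^2            # squared Pearson correlation
mae  <- mean(abs(pred - obs))       # mean absolute error

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 4)
```

RMSE penalizes large errors more heavily than MAE, so RMSE is always at least as large as MAE on the same predictions.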

Gradient Boosting

set.seed(123)
gbmModel<-gbm(PH~.,data =df_train,distribution = "gaussian",cv.folds = 5)
gbmModel
## gbm(formula = PH ~ ., distribution = "gaussian", data = df_train, 
##     cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 100.
## There were 35 predictors of which 19 had non-zero influence.
# optimum number of trees
gbm.perf(gbmModel, method = "cv")

## [1] 100
# predict() for gbm models takes the number of trees as n.trees
gbmPre<-predict(gbmModel,df_test,n.trees=100)
gbmResult<-postResample(gbmPre,df_test$PH)
gbmResult
##      RMSE  Rsquared       MAE 
## 0.7867417 0.4371067 0.6073389

Support Vector Machine

set.seed(123)
svmModel <- svm(PH~.,data=df_train,kernel="radial",cost=10, scale = FALSE)
svmModel
## 
## Call:
## svm(formula = PH ~ ., data = df_train, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.02857143 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1764

Predict the test dataset

svmPred<-predict(svmModel,df_test)
svmResult<-postResample(svmPred,df_test$PH)
svmResult
##      RMSE  Rsquared       MAE 
## 0.6577194 0.5926516 0.4752371

Neural Networks

Part 3: Compare the Models

result<-rbind(rfResult,gbmResult,svmResult)
result<-data.frame(result)
rownames(result)<-c('Random Forest','Gradient Boosting','SVM')
result
##                        RMSE  Rsquared       MAE
## Random Forest     0.5702752 0.7181757 0.4110123
## Gradient Boosting 0.7867417 0.4371067 0.6073389
## SVM               0.6577194 0.5926516 0.4752371
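
The table can also be ranked programmatically. The sketch below rebuilds the comparison from the numbers printed above (rather than from the fitted models) and orders the rows by test RMSE.

```r
# Test-set metrics copied from the comparison table.
result <- data.frame(
  RMSE     = c(0.5702752, 0.7867417, 0.6577194),
  Rsquared = c(0.7181757, 0.4371067, 0.5926516),
  MAE      = c(0.4110123, 0.6073389, 0.4752371),
  row.names = c("Random Forest", "Gradient Boosting", "SVM")
)

# Order models from lowest to highest RMSE.
ranked <- result[order(result$RMSE), ]
rownames(ranked)[1]  # "Random Forest"
```

Random Forest has the lowest RMSE and MAE and the highest Rsquared of the three on this test split, followed by SVM and then Gradient Boosting.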

Part 4: Prediction