Part 1: Load Data and EDA

Load Data

library(readxl)  # provides read_excel()

# Read both workbooks and convert the resulting tibbles to base data frames
df <- data.frame(read_excel('StudentData.xlsx'))
df_eval <- data.frame(read_excel('StudentEvaluation.xlsx'))

Data Sample

head(df)

##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

Exploratory data analysis

skim(df)

Data summary
Name	df
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand.Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill.Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill.Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd.Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd.Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd.Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler.Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb.Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure.Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl.Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure.Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch.Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb.Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling.Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

Data Process

Missing Data:

We can use kNN imputation to replace the missing values identified above. For each record with a missing value, the method finds the k records most similar to it on the other features and uses them to infer the missing value. A missing categorical value takes the most common category among those k neighbors; a missing numeric value takes an aggregate of the neighbors' values. (Assuming kNN() here is the VIM implementation, its defaults are k = 5 and the median, not the mean, for numeric columns.)

missmap(df)

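# kNN() (VIM package, assumed) appends logical <column>_imp indicator columns;
# the subset keeps only the original columns
df <- kNN(df) %>%
  subset(select = Brand.Code:Balling.Lvl)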
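To make the idea behind kNN imputation concrete, here is a minimal base-R sketch on made-up data. The helper `impute_knn()` is hypothetical and is not how the imputation above is implemented: it uses plain Euclidean distance on unscaled numeric features and handles a single numeric target column, which is a simplification.

```r
# Toy data (not the beverage data): y is missing in the last row
toy <- data.frame(
  x1 = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  x2 = c(2.0, 2.1, 1.9, 8.0, 8.1, 2.05),
  y  = c(10, 11, 15, 50, 52, NA)
)

# Hypothetical helper: fill each missing value of `target` with an aggregate
# (median by default) of the target values of the k nearest complete rows
impute_knn <- function(data, target, k = 3, agg = median) {
  features <- setdiff(names(data), target)
  X <- as.matrix(data[features])
  missing_rows <- which(is.na(data[[target]]))
  donors <- which(!is.na(data[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every donor row
    diffs <- sweep(X[donors, , drop = FALSE], 2, X[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(dists)[seq_len(k)]]
    data[i, target] <- agg(data[[target]][nearest])
  }
  data
}

imputed_median <- impute_knn(toy, "y", k = 3)              # median of 10, 11, 15
imputed_mean   <- impute_knn(toy, "y", k = 3, agg = mean)  # mean of 10, 11, 15
imputed_median$y[6]  # 11
imputed_mean$y[6]    # 12
```

The two calls differ only in the aggregation function, which shows why it matters whether an implementation averages the neighbors or takes their median.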

Next, we double check that this worked correctly by checking for missing values.

sapply(df, function(x) sum(is.na(x)))

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 0                 0                 0 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 0                 0                 0                 0 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 0                 0                 0 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 0                 0                 0                 0 
##       Balling.Lvl 
##                 0

The output confirms that no missing values remain.

Change non-numeric data to factor

df$Brand.Code <- as.factor(df$Brand.Code)

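We’ll need to dummy code our categorical variable, Brand.Code. This process creates a new 0/1 column for each level. Dummy encoding often drops one level, which becomes the baseline: a categorical feature with five unique values then yields four columns, and a row of all zeros represents the reference level. The call below does not drop a level, so all four Brand.Code levels get their own column (Brand.Code.A through Brand.Code.D in the transformed output further down).

# drop2nd is not a dummyVars() argument and was silently ignored, so it is removed;
# caret's fullRank = TRUE would drop a baseline level (not used here)
df_dummy <- dummyVars(~ 0 + ., data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))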
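The difference between keeping every level and dropping a baseline can be seen with base R's model.matrix() on a toy factor. This is only a sketch: the factor values are made up, and model.matrix() stands in for caret's dummyVars(), whose column names differ.

```r
# Toy factor with four levels (illustrative values only)
brand <- factor(c("A", "B", "C", "D", "B", "A"))

# One 0/1 column per level: no baseline is dropped
full <- model.matrix(~ 0 + brand)

# Baseline (treatment) coding: level "A" is absorbed into the intercept,
# so after removing the intercept column only B, C and D remain
baseline <- model.matrix(~ brand)[, -1]

colnames(full)      # "brandA" "brandB" "brandC" "brandD"
colnames(baseline)  # "brandB" "brandC" "brandD"

# Rows for the baseline level are all zeros in the reduced encoding
baseline[brand == "A", ]
```

Keeping every level is harmless for tree-based models, but the columns are then perfectly collinear (each row sums to 1), which matters for models such as linear regression.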

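We use caret's preProcess() to estimate a set of transformations, which predict() then applies. Box-Cox reduces skew and makes a distribution closer to normal; it is only defined for strictly positive values, so it is applied to only some columns (23 of 36 in the output below). Center subtracts each column's mean from its values. Scale divides each value by the column's standard deviation. Because df_dummy still contains PH and the Brand.Code dummy columns, the target and the dummies are centered and scaled along with the other predictors.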

myTrans <- preProcess(df_dummy, method=c("BoxCox","center", "scale"))

myTrans

## Created from 2571 samples and 36 variables
## 
## Pre-processing:
##   - Box-Cox transformation (23)
##   - centered (36)
##   - ignored (0)
##   - scaled (36)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -1.8000  0.2000  0.1087  2.0000  2.0000
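For intuition about the three steps, here is a minimal base-R sketch on toy values. The helper `box_cox()` is hypothetical and takes lambda as a hand-picked input, whereas preProcess() estimates a lambda per column from the data, so this illustrates the formulas rather than reproducing caret's result.

```r
# Hypothetical helper: Box-Cox transform for a fixed lambda (requires x > 0)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 4, 8, 16)        # right-skewed toy values

bc <- box_cox(x, lambda = 0)       # lambda = 0 reduces to the log transform
z  <- (bc - mean(bc)) / sd(bc)     # center (subtract mean), then scale (divide by sd)

mean(z)  # numerically zero
sd(z)    # 1 up to floating-point error
```

After the log transform the toy values are evenly spaced, so the skew is gone, and the centered and scaled result has mean 0 and standard deviation 1.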

df_dummy2<- predict(myTrans,df_dummy)
head(df_dummy2)

##   Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D Carb.Volume Fill.Ounces
## 1   -0.3585688     0.952322    -0.372919   -0.5612192  -0.2589619 -0.09639336
## 2    2.7877805    -1.049657    -0.372919   -0.5612192   0.5567826  0.36210508
## 3   -0.3585688     0.952322    -0.372919   -0.5612192  -0.7810203  0.97462569
## 4    2.7877805    -1.049657    -0.372919   -0.5612192   0.6788334  0.36210508
## 5    2.7877805    -1.049657    -0.372919   -0.5612192   1.0990267  3.90266109
## 6    2.7877805    -1.049657    -0.372919   -0.5612192   0.1224351 -0.55412721
##    PC.Volume Carb.Pressure   Carb.Temp        PSC   PSC.Fill    PSC.CO2
## 1 -0.2061096    0.02105236  0.05878476  0.5188415  0.5506394 -0.3832964
## 2 -0.6219569    0.07742060 -0.34511132  0.8636297  0.2099599 -0.3832964
## 3 -0.2061096    0.74127055  0.92538400  0.2572756  1.2319986  2.4202742
## 4  0.2842233   -1.50564148 -2.26288459 -0.7373813  1.9133578 -0.3832964
## 5 -3.0352438   -0.26329167 -1.08163853 -1.3564659 -0.3010595  1.4857507
## 6 -0.1067558   -0.43593624 -0.65602336  0.2572756  0.3802996 -0.3832964
##    Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2
## 1 -1.042813     -0.7905954    -0.5655853     -1.004388     -1.280269
## 2 -1.042813     -0.1934247    -0.5655853     -1.004388     -1.280269
## 3 -1.042813     -0.4911406    -0.5655853     -1.004388     -1.280269
## 4 -1.042813     -1.5688250    -0.4315583     -1.004388     -1.280269
## 5 -1.042813     -0.8764776    -0.6333003     -1.004388     -1.280269
## 6 -1.042813     -0.6192637    -0.7014900     -1.004388     -1.280269
##   Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 1      -1.28226     1.5182190    0.7752774    0.4808833  0.04843332 -1.5572324
## 2      -1.28226     0.7652786    0.5823763    0.4517280  1.22743835 -0.4437290
## 3      -1.28226    -1.1386019    0.6857263    0.5138226  0.79519951 -1.1123441
## 4      -1.28226    -0.2670731    0.5238637    0.4991647 -0.25987895 -1.2115773
## 5      -1.28226    -0.2670731    0.5823763    0.4955047 -0.25987895 -1.1358656
## 6      -1.28226     1.3998208    0.7005896    0.5028264  0.20049791  0.9781031
##   Carb.Flow    Density       MFR    Balling Pressure.Vacuum        PH
## 1 0.4026550 -0.7414937 0.3776862 -0.8580648        2.133539 -1.077645
## 2 0.6338693 -0.6022966 0.4024579 -0.7506974        2.133539 -1.643078
## 3 0.3832488  1.0911946 0.5160841  1.0144229        2.484420  2.336050
## 4 0.5438643  1.0108974 0.4549554  0.9070554        1.431776 -1.755349
## 5 0.5351217  1.0108974 0.3474930  0.9070554        1.431776 -1.643078
## 6 0.4199351  0.9699633 0.5691724  0.8533717        1.431776 -1.304635
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer   Alch.Rel
## 1   -0.37820721     0.7087745        -0.5607126    -0.1862599 -0.6169792
## 2   -0.22426969     0.7087745        -0.3496093     0.1533607 -0.6665236
## 3   -0.29866961     0.7087745        -0.4544814    -0.7010823  1.5094568
## 4   -0.08825242     0.7087745        -0.7773469     2.7707509  0.6057262
## 5   -0.08825242     0.7087745        -0.7773469     2.7707509  0.6057262
## 6   -0.29866961     0.7087745        -0.7773469     3.0859285  0.6441651
##      Carb.Rel Balling.Lvl
## 1 -0.91427031  -0.6553539
## 2 -1.08355183  -0.5634377
## 3  2.89505992   1.4127593
## 4 -0.09578199   1.1370109
## 5  0.06252259   1.1370109
## 6  0.06252259   1.1140319

Target/Feature Plots

Correlation

df_col <- df[sapply(df, is.numeric)]
ggcorr(df_col, label = TRUE, label_round = 1)

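# Keep Brand.Code in the data so aes() can refer to it by name; plot only columns 2 to 8
df %>%
  dplyr::select(Brand.Code, PH, Mnf.Flow, Usage.cont, Bowl.Setpoint, Density, Temperature, Air.Pressurer) %>%
  ggpairs(columns = 2:8, mapping = aes(color = Brand.Code, alpha = 0.9))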

tbl <- table(df$Brand.Code, df$PH)
barplot(tbl, main = "PH per Brand Code",
        col = c("thistle3", "darksalmon", "cornflowerblue", "darkolivegreen3"),
        ylim = c(0, 180), legend.text = TRUE)

Splitting our data into a Training/Test set

set.seed(123)  # make the random split reproducible (the seed value is arbitrary)
df_rows <- createDataPartition(df_dummy2$PH, p = 0.8, list = FALSE)
df_train <- df_dummy2[df_rows, ]
df_test <- df_dummy2[-df_rows, ]
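For a numeric outcome, createDataPartition() stratifies the split: it groups the outcome by percentiles and samples within each group, so the training and test sets cover a similar range of the target. A rough base-R sketch of that idea follows; it is not caret's code, the target is simulated, and the choice of five groups is an assumption made for illustration.

```r
set.seed(123)                          # arbitrary seed so the sketch is repeatable
y <- rnorm(200, mean = 8.5, sd = 0.2)  # simulated stand-in for a numeric target

# Bin y into 5 quantile groups
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = 6)),
            include.lowest = TRUE)

# Sample about 80% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts should show similar quartiles of the target
summary(train_y)
summary(test_y)
```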

Part 2: Fit the Model

Random Forest

rfModel<-randomForest(PH~.,data=df_train)
rfModel

## 
## Call:
##  randomForest(formula = PH ~ ., data = df_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 0.3255122
##                     % Var explained: 66.95

Check the importance of the variables

importance(rfModel)

##                   IncNodePurity
## Brand.Code.A           9.165471
## Brand.Code.B          15.590595
## Brand.Code.C          94.312222
## Brand.Code.D          10.842818
## Carb.Volume           35.156523
## Fill.Ounces           30.704189
## PC.Volume             42.981434
## Carb.Pressure         23.177384
## Carb.Temp             21.022808
## PSC                   31.181205
## PSC.Fill              25.448667
## PSC.CO2               16.810734
## Mnf.Flow             253.257329
## Carb.Pressure1        70.962994
## Fill.Pressure         34.948208
## Hyd.Pressure1         27.296341
## Hyd.Pressure2         37.200558
## Hyd.Pressure3         41.787610
## Hyd.Pressure4         27.855715
## Filler.Level          93.146597
## Filler.Speed          51.047576
## Temperature           88.393631
## Usage.cont           160.072133
## Carb.Flow             47.870938
## Density               48.878262
## MFR                   41.031190
## Balling               55.651280
## Pressure.Vacuum       71.155322
## Oxygen.Filler         84.570158
## Bowl.Setpoint         78.935494
## Pressure.Setpoint     30.501402
## Air.Pressurer         68.141795
## Alch.Rel              70.988096
## Carb.Rel              78.246228
## Balling.Lvl           69.279240

varImpPlot(rfModel,type=2)

plot(rfModel)

Predict the test dataset

rfPred<-predict(rfModel,df_test)
rfResult<-postResample(rfPred,df_test$PH)
rfResult

##      RMSE  Rsquared       MAE 
## 0.5702752 0.7181757 0.4110123
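The three numbers reported by postResample() can be reproduced by hand, which helps when reading them. The sketch below uses made-up observed and predicted values; the one caret-specific detail is that Rsquared is the squared correlation between predictions and observations. Note also that PH was centered and scaled along with the predictors, so the RMSE and MAE above are on the transformed scale rather than in pH units.

```r
obs  <- c(8.36, 8.26, 8.94, 8.24, 8.26)  # toy observed values
pred <- c(8.40, 8.30, 8.80, 8.30, 8.20)  # toy predictions

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
rsq  <- cor(pred, obs)^2            # squared correlation, not 1 - SSE/SST
mae  <- mean(abs(pred - obs))       # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two (as in the results above) suggests that a few test rows are predicted much worse than the rest.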

Gradient Boosting

set.seed(123)
gbmModel<-gbm(PH~.,data =df_train,distribution = "gaussian",cv.folds = 5)
gbmModel

## gbm(formula = PH ~ ., distribution = "gaussian", data = df_train, 
##     cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 100.
## There were 35 predictors of which 19 had non-zero influence.

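# optimum number of trees by cross-validation; a result equal to n.trees
# (gbm's default is 100) means the CV error had not yet bottomed out,
# so fitting more trees may help
gbm.perf(gbmModel, method = "cv")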

## [1] 100

# The argument is n.trees, not trees; with it set, gbm no longer prints "Using 100 trees..."
gbmPre <- predict(gbmModel, df_test, n.trees = 100)

gbmResult<-postResample(gbmPre,df_test$PH)
gbmResult

##      RMSE  Rsquared       MAE 
## 0.7867417 0.4371067 0.6073389

Support Vector Machine

set.seed(123)
svmModel <- svm(PH~.,data=df_train,kernel="radial",cost=10, scale = FALSE)
svmModel

## 
## Call:
## svm(formula = PH ~ ., data = df_train, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.02857143 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1764
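The gamma printed above, 0.02857143, is 1/35: e1071's svm() defaults gamma to one over the number of predictor columns, and the training data has 35 predictors. The sketch below shows how gamma enters the radial kernel, K(x, z) = exp(-gamma * ||x - z||^2). The helper `rbf_kernel()` and the two points are made up for illustration.

```r
# Hypothetical helper: radial (RBF) kernel between two numeric vectors
rbf_kernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))

gamma <- 1 / 35   # 35 predictors, matching the gamma in the model summary
x <- rep(0, 35)   # toy point at the origin of the standardized feature space
z <- rep(1, 35)   # toy point one standard deviation away on every feature

k_same <- rbf_kernel(x, x, gamma)  # identical points give similarity 1
k_far  <- rbf_kernel(x, z, gamma)  # squared distance 35, so exp(-1), about 0.368

c(same = k_same, far = k_far)
```

A larger gamma makes similarity fall off faster with distance, so each support vector influences a smaller neighborhood; cost and gamma are therefore usually tuned together.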