## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2 0.104
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6 0.124
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8 0.090
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6 NA
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8 0.026
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4 0.090
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.26 0.04 -100 118.8 46.0 0
## 2 0.22 0.04 -100 121.6 46.0 0
## 3 0.34 0.16 -100 120.2 46.0 0
## 4 0.42 0.04 -100 115.2 46.4 0
## 5 0.16 0.12 -100 118.4 45.8 0
## 6 0.24 0.04 -100 119.6 45.6 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 118 121.2 4002
## 2 NA NA 106 118.6 3986
## 3 NA NA 82 120.0 4020
## 4 0 0 92 117.8 4012
## 5 0 0 92 118.6 4010
## 6 0 0 116 120.2 4014
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36
## 2 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26
## 3 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94
## 4 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24
## 5 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26
## 6 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 120 46.4 142.6 6.58 5.32
## 2 0.026 120 46.8 143.0 6.56 5.30
## 3 0.024 120 46.6 142.0 7.66 5.84
## 4 0.030 120 46.0 146.2 7.14 5.42
## 5 0.030 120 46.0 146.2 7.14 5.44
## 6 0.024 120 46.0 146.6 7.16 5.44
## Balling.Lvl
## 1 1.48
## 2 1.56
## 3 3.28
## 4 3.04
## 5 3.04
## 6 3.02
| Name | df |
|---|---|
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb.Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC.CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf.Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill.Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd.Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd.Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler.Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler.Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb.Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch.Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb.Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling.Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
Missing Data:
We can use kNN imputation to replace the missing values identified above. For each record with a missing value, the method finds the k most similar records based on the other features. A missing categorical value is inferred from those k neighbors, typically by taking their most common category. A missing numeric value is "predicted" by averaging (mean) the neighbors' values.
Next, we double check that this worked correctly by checking for missing values.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 0 0 0 0
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 0 0 0 0
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 0 0 0 0
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## 0 0 0 0
## Filler.Level Filler.Speed Temperature Usage.cont
## 0 0 0 0
## Carb.Flow Density MFR Balling
## 0 0 0 0
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 0 0 0 0
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 0 0 0 0
## Balling.Lvl
## 0
Fortunately, we can see that there are now no missing data points!
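The imputation call itself is not shown in this report, so the following is only a minimal base R sketch of the idea on a toy data frame, not the function used in the analysis. The names `toy` and `knn_impute` are hypothetical, the neighbors are found with plain Euclidean distance on two complete columns, and the categorical value is filled by majority vote.

```r
# Toy data: one numeric and one categorical column, each with a gap in row 6.
toy <- data.frame(
  x1    = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  x2    = c(2.0, 2.1, 1.9, 8.0, 8.1, 2.05),
  y     = c(10, 11, 9, 50, 52, NA),
  brand = c("A", "A", "A", "B", "B", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the gaps in `target` from the k rows that are
# closest (Euclidean distance) on the complete `features` columns.
knn_impute <- function(data, target, features, k = 3) {
  missing_rows  <- which(is.na(data[[target]]))
  observed_rows <- which(!is.na(data[[target]]))
  feats <- as.matrix(data[, features, drop = FALSE])
  for (i in missing_rows) {
    diffs <- sweep(feats[observed_rows, , drop = FALSE], 2, feats[i, ])
    d     <- sqrt(rowSums(diffs^2))
    nn    <- observed_rows[order(d)[seq_len(min(k, length(d)))]]
    vals  <- data[[target]][nn]
    if (is.numeric(vals)) {
      data[[target]][i] <- mean(vals)                     # numeric: mean of neighbors
    } else {
      data[[target]][i] <- names(which.max(table(vals)))  # categorical: majority vote
    }
  }
  data
}

toy_imputed <- knn_impute(toy, "y", c("x1", "x2"))
toy_imputed <- knn_impute(toy_imputed, "brand", c("x1", "x2"))

# Same kind of check as above: count the remaining missing values per column.
colSums(is.na(toy_imputed))
```

In practice the features would usually be put on a common scale before computing distances, so that columns with large ranges do not dominate the neighbor search.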
Change non-numeric data to factor
We’ll need to dummy code our categorical variable, Brand.Code. This process creates a new column for each category value and assigns it a 0 or 1. Note that dummy encoding typically drops one value, which becomes the baseline: a categorical feature with five unique values yields four columns, and a row with 0 in all four represents the reference value. In the transformed data shown further down, however, all four Brand.Code levels (A through D) appear as columns, so no baseline level was dropped in this run.
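As a small illustration of the two encodings, here is a base R sketch using `model.matrix()` on a toy factor. The vector `brand` and the object names are hypothetical and stand in for Brand.Code; this is not the caret call used in the analysis.

```r
# Toy factor with four levels, standing in for Brand.Code.
brand <- factor(c("A", "B", "C", "D", "B"))

# One indicator column per level (no baseline dropped).
full <- model.matrix(~ 0 + brand)

# Treatment coding: the first level ("A") becomes the baseline,
# so only three indicator columns remain after removing the intercept.
reference <- model.matrix(~ brand)[, -1]

colnames(full)       # "brandA" "brandB" "brandC" "brandD"
colnames(reference)  # "brandB" "brandC" "brandD"

# The first observation is level "A": all zeros in the reference encoding.
reference[1, ]
```

Keeping every level is harmless for tree-based models, while linear models generally need the reference encoding to avoid perfectly collinear columns.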
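library(caret)  # provides dummyVars() and preProcess()
# Note: drop2nd is not among the documented dummyVars() arguments; the documented option for dropping a baseline level is fullRank = TRUE.
df_dummy <- dummyVars(~ 0 + ., drop2nd=TRUE, data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))
I will use preProcess() to apply three transformations to the predictors. Box-Cox reduces skew and makes a distribution closer to normal. Scale divides each value of an attribute by that attribute's standard deviation. Center subtracts an attribute's mean from each value. Box-Cox is defined only for strictly positive values, which is consistent with the summary below reporting it for 23 of the 36 variables.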
## Created from 2571 samples and 36 variables
##
## Pre-processing:
## - Box-Cox transformation (23)
## - centered (36)
## - ignored (0)
## - scaled (36)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 -1.8000 0.2000 0.1087 2.0000 2.0000
## Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D Carb.Volume Fill.Ounces
## 1 -0.3585688 0.952322 -0.372919 -0.5612192 -0.2589619 -0.09639336
## 2 2.7877805 -1.049657 -0.372919 -0.5612192 0.5567826 0.36210508
## 3 -0.3585688 0.952322 -0.372919 -0.5612192 -0.7810203 0.97462569
## 4 2.7877805 -1.049657 -0.372919 -0.5612192 0.6788334 0.36210508
## 5 2.7877805 -1.049657 -0.372919 -0.5612192 1.0990267 3.90266109
## 6 2.7877805 -1.049657 -0.372919 -0.5612192 0.1224351 -0.55412721
## PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2
## 1 -0.2061096 0.02105236 0.05878476 0.5188415 0.5506394 -0.3832964
## 2 -0.6219569 0.07742060 -0.34511132 0.8636297 0.2099599 -0.3832964
## 3 -0.2061096 0.74127055 0.92538400 0.2572756 1.2319986 2.4202742
## 4 0.2842233 -1.50564148 -2.26288459 -0.7373813 1.9133578 -0.3832964
## 5 -3.0352438 -0.26329167 -1.08163853 -1.3564659 -0.3010595 1.4857507
## 6 -0.1067558 -0.43593624 -0.65602336 0.2572756 0.3802996 -0.3832964
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2
## 1 -1.042813 -0.7905954 -0.5655853 -1.004388 -1.280269
## 2 -1.042813 -0.1934247 -0.5655853 -1.004388 -1.280269
## 3 -1.042813 -0.4911406 -0.5655853 -1.004388 -1.280269
## 4 -1.042813 -1.5688250 -0.4315583 -1.004388 -1.280269
## 5 -1.042813 -0.8764776 -0.6333003 -1.004388 -1.280269
## 6 -1.042813 -0.6192637 -0.7014900 -1.004388 -1.280269
## Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 1 -1.28226 1.5182190 0.7752774 0.4808833 0.04843332 -1.5572324
## 2 -1.28226 0.7652786 0.5823763 0.4517280 1.22743835 -0.4437290
## 3 -1.28226 -1.1386019 0.6857263 0.5138226 0.79519951 -1.1123441
## 4 -1.28226 -0.2670731 0.5238637 0.4991647 -0.25987895 -1.2115773
## 5 -1.28226 -0.2670731 0.5823763 0.4955047 -0.25987895 -1.1358656
## 6 -1.28226 1.3998208 0.7005896 0.5028264 0.20049791 0.9781031
## Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 0.4026550 -0.7414937 0.3776862 -0.8580648 2.133539 -1.077645
## 2 0.6338693 -0.6022966 0.4024579 -0.7506974 2.133539 -1.643078
## 3 0.3832488 1.0911946 0.5160841 1.0144229 2.484420 2.336050
## 4 0.5438643 1.0108974 0.4549554 0.9070554 1.431776 -1.755349
## 5 0.5351217 1.0108974 0.3474930 0.9070554 1.431776 -1.643078
## 6 0.4199351 0.9699633 0.5691724 0.8533717 1.431776 -1.304635
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## 1 -0.37820721 0.7087745 -0.5607126 -0.1862599 -0.6169792
## 2 -0.22426969 0.7087745 -0.3496093 0.1533607 -0.6665236
## 3 -0.29866961 0.7087745 -0.4544814 -0.7010823 1.5094568
## 4 -0.08825242 0.7087745 -0.7773469 2.7707509 0.6057262
## 5 -0.08825242 0.7087745 -0.7773469 2.7707509 0.6057262
## 6 -0.29866961 0.7087745 -0.7773469 3.0859285 0.6441651
## Carb.Rel Balling.Lvl
## 1 -0.91427031 -0.6553539
## 2 -1.08355183 -0.5634377
## 3 2.89505992 1.4127593
## 4 -0.09578199 1.1370109
## 5 0.06252259 1.1370109
## 6 0.06252259 1.1140319
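To make the three transformations concrete, here is a minimal base R sketch on a toy vector. It assumes a hand-picked lambda of 0, whereas preProcess() estimates a lambda for each eligible predictor; the helper `box_cox` and the vector `x` are hypothetical.

```r
# Box-Cox transform for a fixed lambda; x must be strictly positive.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 2, 4, 8, 16, 32)          # right-skewed toy values

x_bc       <- box_cox(x, lambda = 0)  # lambda = 0 reduces to the log transform
x_centered <- x_bc - mean(x_bc)       # center: subtract the mean
x_scaled   <- x_centered / sd(x_bc)   # scale: divide by the standard deviation

# After centering and scaling the values have mean 0 and standard deviation 1.
c(mean = mean(x_scaled), sd = sd(x_scaled))
```

The order here (Box-Cox, then center, then scale) follows the order in which caret applies these steps.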
Correlation
df %>%
  dplyr::select(PH, Mnf.Flow, Usage.cont, Bowl.Setpoint, Density, Temperature, Air.Pressurer) %>%
  ggpairs(aes(color = df$Brand.Code, alpha = 0.9))
tbl <- table(df$Brand.Code, df$PH)
barplot(tbl, main = "PH per Brand Code",
        col = c("thistle3", "darksalmon", "cornflowerblue", "darkolivegreen3"), ylim = range(c(0, 180)), legend = TRUE)
##
## Call:
## randomForest(formula = PH ~ ., data = df_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 0.3255122
## % Var explained: 66.95
Check the importance of the variables
## IncNodePurity
## Brand.Code.A 9.165471
## Brand.Code.B 15.590595
## Brand.Code.C 94.312222
## Brand.Code.D 10.842818
## Carb.Volume 35.156523
## Fill.Ounces 30.704189
## PC.Volume 42.981434
## Carb.Pressure 23.177384
## Carb.Temp 21.022808
## PSC 31.181205
## PSC.Fill 25.448667
## PSC.CO2 16.810734
## Mnf.Flow 253.257329
## Carb.Pressure1 70.962994
## Fill.Pressure 34.948208
## Hyd.Pressure1 27.296341
## Hyd.Pressure2 37.200558
## Hyd.Pressure3 41.787610
## Hyd.Pressure4 27.855715
## Filler.Level 93.146597
## Filler.Speed 51.047576
## Temperature 88.393631
## Usage.cont 160.072133
## Carb.Flow 47.870938
## Density 48.878262
## MFR 41.031190
## Balling 55.651280
## Pressure.Vacuum 71.155322
## Oxygen.Filler 84.570158
## Bowl.Setpoint 78.935494
## Pressure.Setpoint 30.501402
## Air.Pressurer 68.141795
## Alch.Rel 70.988096
## Carb.Rel 78.246228
## Balling.Lvl 69.279240
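The table above is printed in column order, which makes the ranking hard to read. For a regression forest, IncNodePurity is the total decrease in residual sum of squares from splits on a variable, so larger values mean more influence. The sketch below sorts a handful of values copied from the table; the vector `imp` is a hypothetical stand-in for the full importance output.

```r
# A few IncNodePurity values copied from the table above.
imp <- c(Brand.Code.A = 9.165471,
         Brand.Code.C = 94.312222,
         Mnf.Flow     = 253.257329,
         Filler.Level = 93.146597,
         Temperature  = 88.393631,
         Usage.cont   = 160.072133)

# Rank from most to least important.
imp_sorted <- sort(imp, decreasing = TRUE)
head(imp_sorted, 3)
```

Among all 35 predictors in the table, Mnf.Flow and Usage.cont have the two largest values.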
Predict the test dataset
## RMSE Rsquared MAE
## 0.5702752 0.7181757 0.4110123
## gbm(formula = PH ~ ., distribution = "gaussian", data = df_train,
## cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 100.
## There were 35 predictors of which 19 had non-zero influence.
## [1] 100
## Using 100 trees...
## RMSE Rsquared MAE
## 0.7867417 0.4371067 0.6073389
##
## Call:
## svm(formula = PH ~ ., data = df_train, kernel = "radial", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 0.02857143
## epsilon: 0.1
##
##
## Number of Support Vectors: 1764
Predict the test dataset
## RMSE Rsquared MAE
## 0.6577194 0.5926516 0.4752371
result <- rbind(rfResult, gbmResult, svmResult)
result <- data.frame(result)
rownames(result) <- c('Random Forest', 'Gradient Boosting', 'SVM')
result
## RMSE Rsquared MAE
## Random Forest 0.5702752 0.7181757 0.4110123
## Gradient Boosting 0.7867417 0.4371067 0.6073389
## SVM 0.6577194 0.5926516 0.4752371
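The calls that produced rfResult, gbmResult and svmResult are not shown. The column names match what caret's postResample() returns, so the sketch below assumes metrics of that kind and computes them by hand in base R. The function `eval_metrics` and the vectors `obs` and `pred` are hypothetical, and R-squared is taken as the squared correlation between predictions and observations.

```r
# RMSE, R-squared (squared Pearson correlation) and MAE for a set of predictions.
eval_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs)))
}

obs  <- c(1, 2, 3, 4)            # hypothetical observed values
pred <- c(1.5, 1.5, 3.5, 3.5)    # hypothetical predictions

eval_metrics(pred, obs)
# RMSE = 0.5, Rsquared = 0.8, MAE = 0.5
```

Lower RMSE and MAE and higher R-squared are better, which is why the random forest row ranks first in the table above.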