You are given a simple data set from a beverage manufacturing company. It consists of 2,571 rows (cases) and 33 columns (variables). The goal is to use this data to predict PH, one of the columns in the set. pH (potential of hydrogen) is a measure of acidity/alkalinity; it must stay within a critical range, so it is important to understand what influences it and to be able to predict its values. This is production data, and pH is a KPI (Key Performance Indicator).
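The code in this report relies on several packages that are presumably loaded in a setup chunk not shown here; the block below is a sketch of the libraries implied by the functions used later (the package-to-function mapping is inferred, not taken from the original).
# Sketch of the assumed setup chunk (inferred from the functions used below)
library(tidyverse)      # dplyr/tidyr/ggplot2: glimpse(), mutate(), drop_na(), ggplot()
library(caret)          # preProcess(), createDataPartition(), train(), postResample(), varImp()
library(visdat)         # vis_miss()
library(DataExplorer)   # plot_histogram()
library(corrplot)       # corrplot()
library(RColorBrewer)   # brewer.pal()
library(Boruta)         # Boruta(), getSelectedAttributes()
library(MLmetrics)      # MAPE()
library(reshape2)       # melt()
# caret::train() will additionally load pls, earth, randomForest and Cubist for the respective methods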
data2<-read.csv("StudentData - TO MODEL.csv")
data2_eval<-read.csv("StudentEvaluation- TO PREDICT.csv")
head(data2)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2 0.104
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6 0.124
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8 0.090
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6 NA
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8 0.026
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4 0.090
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.26 0.04 -100 118.8 46.0 0
## 2 0.22 0.04 -100 121.6 46.0 0
## 3 0.34 0.16 -100 120.2 46.0 0
## 4 0.42 0.04 -100 115.2 46.4 0
## 5 0.16 0.12 -100 118.4 45.8 0
## 6 0.24 0.04 -100 119.6 45.6 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 118 121.2 4002
## 2 NA NA 106 118.6 3986
## 3 NA NA 82 120.0 4020
## 4 0 0 92 117.8 4012
## 5 0 0 92 118.6 4010
## 6 0 0 116 120.2 4014
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36
## 2 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26
## 3 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94
## 4 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24
## 5 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26
## 6 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 120 46.4 142.6 6.58 5.32
## 2 0.026 120 46.8 143.0 6.56 5.30
## 3 0.024 120 46.6 142.0 7.66 5.84
## 4 0.030 120 46.0 146.2 7.14 5.42
## 5 0.030 120 46.0 146.2 7.14 5.44
## 6 0.024 120 46.0 146.6 7.16 5.44
## Balling.Lvl
## 1 1.48
## 2 1.56
## 3 3.28
## 4 3.04
## 5 3.04
## 6 3.02
In the dataset, 32 variables are numeric; only `Brand.Code` is character. I am transforming `Brand.Code` to a factor for the convenience of our models.
glimpse(data2)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
data2$`Brand.Code` <- as.factor(data2$`Brand.Code`)
summary(data2)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## : 120 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## A: 293 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60
## B:1239 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## C: 304 Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19
## D: 615 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## NA's :10 NA's :38 NA's :39 NA's :27
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## NA's :26 NA's :33 NA's :23 NA's :39
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00
## Median : 65.20 Median :123.2 Median :46.40 Median :11.40
## Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
## NA's :2 NA's :32 NA's :22 NA's :11
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3
## Median :28.60 Median :27.60 Median : 96.00 Median :118.4
## Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3
## 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
## Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
## NA's :15 NA's :15 NA's :30 NA's :20
## Filler.Speed Temperature Usage.cont Carb.Flow Density
## Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240
## 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900
## Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980
## Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174
## 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620
## Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
## NA's :57 NA's :14 NA's :5 NA's :2 NA's :1
## MFR Balling Pressure.Vacuum PH
## Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880
## 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440
## Median :724.0 Median : 1.648 Median :-5.400 Median :8.540
## Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546
## 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680
## Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360
## NA's :212 NA's :1 NA's :4
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :6.560 Median :5.400 Median :1.48
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
vis_miss(data2)
Only about 1% of the values in the whole dataset are missing. The MFR variable has the most missing values: 8.25% of the MFR column (212 of 2,571 rows) is missing.
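As a quick numeric check on the vis_miss() plot, the per-column percentage of missing values can be computed directly; a minimal sketch (not part of the original output):
# Percentage of missing values per column, sorted; MFR should top the list at roughly 8.25%
sort(colMeans(is.na(data2)) * 100, decreasing = TRUE)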
We can observe that some variables have more than one mode.
Carb.Pressure, Carb.Temp, Carb.Volume, Fill.Ounces, PC.Volume, PH, Pressure.Vacuum, PSC, PSC.Fill, and Temperature have relatively normal distributions.
plot_histogram(data2)
The distributions of the variables are quite diverse. For instance, Filler.Level and Carb.Temp have narrow ranges, whereas Carb.Flow is spread over a wide range. Most variables contain outliers; Filler.Speed and MFR in particular have a distinctly large number of them.
Therefore, handling outliers will be crucial to improving the performance of our models. Centering and scaling will be applied to reduce their influence.
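To put numbers behind the outlier observation, one could count the points falling outside the 1.5 x IQR boxplot whiskers for each numeric column; a rough sketch under that assumption (not in the original analysis):
# Count observations beyond the 1.5*IQR whiskers for each numeric column
outlier_counts <- sapply(data2[sapply(data2, is.numeric)], function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
})
sort(outlier_counts, decreasing = TRUE)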
ggplot(data = reshape2::melt(data2) , aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=5,aes(fill=variable)) +
coord_flip() + theme(legend.position = "none")
## Warning: Removed 724 rows containing non-finite values (stat_boxplot).
corrplot(cor(data2[,-1], use = "na.or.complete"), order = "alphabet",
         col = brewer.pal(8, "Set3"))
The plot shows that most of the variables have weak correlations with one another. However, multicollinearity can be observed between Balling and Balling.Lvl.
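caret's findCorrelation() can make that multicollinearity explicit by naming the predictors whose pairwise correlations exceed a chosen cutoff; a minimal sketch (the 0.90 cutoff is an illustrative choice, not from the original analysis):
# Name the numeric predictors with pairwise correlations above 0.90
cor_matrix <- cor(data2[, -1], use = "na.or.complete")
findCorrelation(cor_matrix, cutoff = 0.90, names = TRUE)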
Since this model is going to predict PH, we explore which variables have strong relationships with PH. This may help with variable selection, an important aspect of model building: it helps produce predictive models free from correlated variables, bias, and unwanted noise.
# Perform Boruta search
boruta_output <- Boruta(PH ~ ., data=na.omit(data2), doTrace=0, maxRuns = 1000)
# Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
From the plot we can see that PSC.CO2, PSC, PSC.Fill, and Carb.Temp are insignificant, while Carb.Rel appears to be highly significant.
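The same reading can be confirmed directly from the Boruta object rather than from the plot; a small sketch:
# Summarize Boruta's final decision per predictor (Confirmed / Tentative / Rejected)
print(boruta_output)
table(boruta_output$finalDecision)
# The rejected set should match the plot: PSC, PSC.CO2, PSC.Fill, Carb.Temp
names(boruta_output$finalDecision)[boruta_output$finalDecision == "Rejected"]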
Next, we use the preProcess function from the caret package to center, scale, impute (k-nearest neighbors), apply Box-Cox transformations, and remove highly correlated predictors from our dataset. We exclude the Brand.Code and PH columns.
#dataprocess <- data.matrix(subset(data2, select = -c(Brand.Code))) # convert to a numeric matrix, excluding the character column (Brand.Code)
preprocessing <- preProcess(subset(data2, select = -c(Brand.Code, PH)), method = c("center", "scale", "knnImpute", "corr", "BoxCox"))
preprocessing
## Created from 2134 samples and 31 variables
##
## Pre-processing:
## - Box-Cox transformation (19)
## - centered (25)
## - ignored (0)
## - 5 nearest neighbor imputation (25)
## - removed (6)
## - scaled (25)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 -2.0000 -0.2000 -0.1842 1.0000 2.0000
processe2 <- predict(preprocessing, subset(data2, select = -c(Brand.Code)))
processe2 <- processe2 %>% drop_na(PH)
vis_miss(as.data.frame(processe2))
c <- ncol(processe2) - 1
stack(sort(cor(processe2[, c + 1], processe2[, 1:c])[,], decreasing = TRUE))
## values ind
## 1 0.8397435768 Alch.Rel
## 2 0.7846690832 Carb.Volume
## 3 0.4295227573 Carb.Pressure
## 4 0.1655308292 PH
## 5 0.1262737242 Bowl.Setpoint
## 6 0.0269777802 Hyd.Pressure1
## 7 0.0230715756 Carb.Pressure1
## 8 0.0167511558 Hyd.Pressure2
## 9 0.0038708424 Carb.Temp
## 10 0.0032691401 MFR
## 11 0.0003110381 Pressure.Vacuum
## 12 -0.0152131729 PSC.Fill
## 13 -0.0198118459 Oxygen.Filler
## 14 -0.0312594022 Usage.cont
## 15 -0.0358325321 Mnf.Flow
## 16 -0.0531924694 Carb.Flow
## 17 -0.0643355562 PSC.CO2
## 18 -0.0682015477 PSC
## 19 -0.1001586534 Air.Pressurer
## 20 -0.1089324787 Temperature
## 21 -0.1187077889 Fill.Ounces
## 22 -0.1329121923 PC.Volume
## 23 -0.1656156064 Fill.Pressure
## 24 -0.2393783404 Pressure.Setpoint
## 25 -0.4390215899 Hyd.Pressure4
# Feature Creation
processe2 <- data.frame(processe2)
processe2 <- processe2 %>%
mutate(Mnf.Flow = if_else(Mnf.Flow < 0, 1, 0)) %>%
mutate(Hyd.Pressure1 = if_else(Hyd.Pressure1 <= 0 ,1, 0)) %>%
mutate(Hyd.Pressure2 = if_else(Hyd.Pressure2 <= 0, 1, 0)) %>%
mutate(Carb.Flow = if_else(Carb.Flow < 2000, 1, 0))
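The mutate() calls above replace four heavily skewed or bimodal predictors with binary indicators (note that the comparisons are made on the centered and scaled values). A quick sanity check on the share of rows flagged, purely illustrative:
# Proportion of rows flagged as 1 for each engineered indicator
colMeans(processe2[, c("Mnf.Flow", "Hyd.Pressure1", "Hyd.Pressure2", "Carb.Flow")], na.rm = TRUE)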
The data is split in the ratio of 80:20 for train and test sets respectively. A random seed is set for reproducibility.
ph <- data.matrix(processe2$PH) #ph --target
set.seed(100)
s <- ph %>% createDataPartition(p = 0.8, list = FALSE, times = 1)
train <- processe2[s, ] #student train
train <- subset(train, select = -c(PH)) # exclude target column
test <- processe2[-s, ] #student validation set
test <- subset(test, select = -c(PH)) #exclude target column
train_ph <- ph[s, ] # ph
test_ph <- ph[-s] # ph
ctrl <- trainControl(method = "cv", number = 10)
After preprocessing our data, a linear model shows that about half of the predictors are statistically significant. Our first attempt at a model gives us an RMSE of 0.144 and an R-squared of 0.297 on the test set, with a MAPE of 0.01317. Let's see if we can find a model with a better fit.
lm <- train(train, train_ph, method = "lm", trControl = ctrl)
summary(lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54082 -0.08425 0.01217 0.09318 0.64248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5460901 0.0031169 2741.898 < 2e-16 ***
## Carb.Volume -0.0041917 0.0082489 -0.508 0.61141
## Fill.Ounces -0.0085440 0.0033060 -2.584 0.00982 **
## PC.Volume -0.0060449 0.0038927 -1.553 0.12061
## Carb.Pressure 0.0003508 0.0125407 0.028 0.97768
## Carb.Temp 0.0060987 0.0113774 0.536 0.59199
## PSC -0.0051752 0.0033749 -1.533 0.12533
## PSC.Fill -0.0050991 0.0032476 -1.570 0.11655
## PSC.CO2 -0.0058976 0.0031977 -1.844 0.06528 .
## Mnf.Flow -0.0795196 0.0062685 -12.686 < 2e-16 ***
## Carb.Pressure1 0.0370769 0.0038514 9.627 < 2e-16 ***
## Fill.Pressure 0.0057435 0.0045865 1.252 0.21062
## Hyd.Pressure1 0.0029024 0.0052597 0.552 0.58114
## Hyd.Pressure2 0.0241311 0.0067796 3.559 0.00038 ***
## Hyd.Pressure4 -0.0051637 0.0043919 -1.176 0.23984
## Temperature -0.0293192 0.0034670 -8.457 < 2e-16 ***
## Usage.cont -0.0275841 0.0040617 -6.791 1.45e-11 ***
## Carb.Flow 0.0091266 0.0041978 2.174 0.02981 *
## MFR -0.0076758 0.0035765 -2.146 0.03198 *
## Pressure.Vacuum 0.0048421 0.0042225 1.147 0.25162
## Oxygen.Filler -0.0128752 0.0043916 -2.932 0.00341 **
## Bowl.Setpoint 0.0258166 0.0048404 5.334 1.07e-07 ***
## Pressure.Setpoint -0.0106848 0.0046342 -2.306 0.02123 *
## Air.Pressurer -0.0001223 0.0033295 -0.037 0.97071
## Alch.Rel 0.0083527 0.0065589 1.273 0.20299
## Carb.Rel 0.0094487 0.0065765 1.437 0.15095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1409 on 2029 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3332
## F-statistic: 42.05 on 25 and 2029 DF, p-value: < 2.2e-16
lm_pred <- predict(lm,newdata = test)
table <- data.frame(lm = postResample(pred = lm_pred , obs = test_ph))
table
## lm
## RMSE 0.1444703
## Rsquared 0.2972350
## MAE 0.1119767
MAPE(lm_pred, test_ph)
## [1] 0.01317228
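MAPE() here presumably comes from the MLmetrics package; it is simply the mean absolute percentage error, which could be reproduced by hand as a cross-check (an illustrative sketch):
# Mean absolute percentage error computed manually; should agree with MLmetrics::MAPE()
mean(abs(lm_pred - test_ph) / abs(test_ph))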
Our next attempt is a partial least squares (PLS) model. Plotting the components shows that the model is best tuned at the 7th component. The RMSE is slightly lower and the R-squared marginally higher than the linear model, but the difference is not significant. The variable that influences this model the most is Mnf.Flow, with significant weight also given to Bowl.Setpoint and Usage.cont. The residuals look randomly distributed, which is what we expect.
pls_model <- train(
train,
train_ph,
method = 'pls',
trControl = ctrl,
tuneLength = 20
)
summary(pls_model)
## Data: X dimension: 2055 25
## Y dimension: 2055 1
## Fit method: oscorespls
## Number of components considered: 10
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 16.12 22.41 32.53 40.69 45.33 48.66 52.07
## .outcome 23.29 30.47 32.45 33.19 33.71 34.01 34.09
## 8 comps 9 comps 10 comps
## X 56.50 60.33 63.64
## .outcome 34.12 34.12 34.13
plot(pls_model)
plot(varImp(pls_model), top = 10)
pred <- predict(pls_model, test)
postResample(pred=pred, obs=test_ph)
## RMSE Rsquared MAE
## 0.1444667 0.2972877 0.1119500
residuals <- test_ph - pred
plot(residuals,
main="Residuals for PLS model",
xlab = 'Observation Number in Test',
ylab = 'Residuals')
MAPE(pred, test_ph)
## [1] 0.01316916
The next model we use is a k-nearest neighbors (KNN) model. The inputs are straightforward, and after running it we see it is best tuned at k = 7. Using our new KNN model against the test set produces an RMSE of 0.131 and an R-squared of 0.43. This is our most performant model so far, because it has a higher R-squared and a lower RMSE than our previous attempts. The MAPE is also lower than the previous models, at 0.01138.
knn_model <- train(
train,
train_ph,
method = 'knn',
trControl = ctrl,
tuneLength = 10
)
knn_model
## k-Nearest Neighbors
##
## 2055 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1850, 1849, 1850, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1300976 0.4424617 0.09600541
## 7 0.1292228 0.4474741 0.09652588
## 9 0.1295887 0.4428510 0.09712231
## 11 0.1294215 0.4447368 0.09756081
## 13 0.1301966 0.4391178 0.09805839
## 15 0.1301364 0.4402089 0.09838328
## 17 0.1310205 0.4322555 0.09940137
## 19 0.1321252 0.4224370 0.10030266
## 21 0.1326929 0.4179984 0.10081663
## 23 0.1328843 0.4160379 0.10137886
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
plot(knn_model)
pred <- predict(knn_model, test)
postResample(pred=pred, obs=test_ph)
## RMSE Rsquared MAE
## 0.13110804 0.43201195 0.09655134
MAPE(pred, test_ph)
## [1] 0.01138328
Residuals
residuals <- test_ph - pred
plot(residuals,
main="Residuals for KNN model",
xlab = 'Observation Number in Test',
ylab = 'Residuals')
The optimal MARS model is achieved at degree = 3 and nprune = 40. On the test set the RMSE is 0.1353 and the R-squared is 0.3883, with a MAPE of 0.01207. This is marginally worse than the KNN model, but not by enough to make a practical difference.
set.seed(100)
parameter_grid <- floor(expand.grid(degree=1:4, nprune=seq(5, 50, by=5)))
mars_model <- train(
train,
train_ph,
method = 'earth',
metric='RMSE',
trControl = ctrl,
tuneGrid = parameter_grid
)
plot(mars_model)
pred <- predict(mars_model$finalModel, test)
postResample(pred=pred, obs=test_ph)
## RMSE Rsquared MAE
## 0.1353495 0.3883327 0.1024484
MAPE(pred, test_ph)
## [1] 0.01206694
Residuals
residuals <- test_ph - pred
plot(residuals,
main="Residuals for MARS model",
xlab = 'Observation Number in Test',
ylab = 'Residuals')
The next model is a random forest, which gives us the smallest error so far: an RMSE of 0.1075 and a MAPE of 0.00912 on the test set. The R-squared is also the highest so far, at 0.615. The final value used for the model was mtry = 25.
rf_model <- train(
train,
train_ph,
method = 'rf',
trControl = ctrl,
tunelength=10
)
rf_model
## Random Forest
##
## 2055 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1850, 1849, 1849, 1848, 1849, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1189895 0.5647016 0.09077114
## 13 0.1088697 0.6141481 0.07951260
## 25 0.1081666 0.6119878 0.07809214
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 25.
plot(rf_model)
pred <- predict(rf_model, test)
postResample(pred=pred, obs=test_ph)
## RMSE Rsquared MAE
## 0.10754407 0.61468327 0.07746621
plot(varImp(rf_model), top = 10)
MAPE(pred, test_ph)
## [1] 0.009118084
Residuals
residuals <- test_ph - pred
plot(residuals,
main="Residuals for Random Forrest model",
xlab = 'Observation Number in Test',
ylab = 'Residuals')
The last model we fit was a Cubist model, which produced slightly better results than the random forest.
parameter_grid <- expand.grid(committees = c(1, 10, 50, 100), neighbors = c(0, 1, 5, 9))
cubist_model <- train(
train,
train_ph,
method = 'cubist',
trControl = ctrl,
tuneGrid=parameter_grid
)
plot(cubist_model)
plot(varImp(cubist_model), top = 10)
pred <- predict(cubist_model, test)
postResample(pred=pred, obs=test_ph)
## RMSE Rsquared MAE
## 0.10377080 0.64094431 0.07603397
MAPE(pred, test_ph)
## [1] 0.008956511
Residuals
residuals <- test_ph - pred
plot(residuals,
main="Residuals for Cubist model",
xlab = 'Observation Number in Test',
ylab = 'Residuals')
For our final predictions we will use the Cubist and random forest models, since they gave us the smallest errors and the highest R-squared values.
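Before scoring the evaluation set, it is convenient to gather the test-set metrics of all six models into one table; a sketch that assumes all of the fitted model objects above are still in memory:
# Collect test-set RMSE / R-squared / MAE for every fitted model into a single comparison table
models <- list(lm = lm, pls = pls_model, knn = knn_model,
               mars = mars_model, rf = rf_model, cubist = cubist_model)
comparison <- sapply(models, function(m) postResample(predict(m, test), test_ph))
round(t(comparison), 4)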
data2_temp <- data2_eval %>%
  select(Carb.Volume, Fill.Ounces, PC.Volume, Carb.Pressure, Carb.Temp, PSC, PSC.Fill,
         PSC.CO2, Mnf.Flow, Carb.Pressure1, Fill.Pressure, Hyd.Pressure1, Hyd.Pressure2,
         Hyd.Pressure4, Temperature, Usage.cont, Carb.Flow, MFR, Pressure.Vacuum,
         Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, Air.Pressurer, Alch.Rel, Carb.Rel)
preprocessing_eval <- preProcess(data2_temp, method = c("center", "scale", "knnImpute", "corr", "BoxCox"))
process_eval <- predict(preprocessing_eval, data2_temp)
vis_miss(as.data.frame(process_eval))
Cubist
cubist_ph <- predict(cubist_model, process_eval)
cubist_eval <- data2_eval
cubist_eval$PH <- cubist_ph
write.csv(cubist_eval, "cubist_evaluation.csv")
Random Forest
rf_ph <- predict(rf_model, process_eval)
rf_eval <- data2_eval
rf_eval$PH <- rf_ph
write.csv(rf_eval, "random_forest_evaluation.csv")
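As a final sanity check (not part of the original write-up), the distributions of the predicted pH values can be compared against the pH range observed in the training data:
# Predicted pH should fall within a plausible range relative to the training data
summary(cubist_ph)
summary(rf_ph)
summary(data2$PH)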