This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of pH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should clearly show the models you tested and how you selected your final approach.
Please submit both RPubs links and .rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your model's predictions for pH.
We are a team of data scientists working for a CPG (consumer packaged goods) company. Due to regulation, we need to better predict the pH of our beverages.
With the given data, we cannot confidently predict the pH of our beverages, and the issues begin with data gathering. Our best model achieved an R-squared of 67.8% and an RMSE of 0.0116 on log-transformed pH; back-transforming, \(e^{0.0116} \approx 1.01\), so the typical error is a multiplicative factor of about 1.01 on the pH scale. Since pH is itself already a logarithmic measure, even this small-looking error matters. Below is an overview of the RMSE and R-squared for each individual Brand Code (on the original pH scale). We can see the highest is Brand B, with an R-squared of only .74; we aim to achieve an R-squared of at least .85.
## # A tibble: 4 x 3
## Brand.Code RMSE RSquared
## <chr> <chr> <chr>
## 1 A 0.10487783 0.53631742
## 2 B 0.08658230 0.74858830
## 3 C 0.1543169 0.2629705
## 4 D 0.07831330 0.67088021
The forecasting methods we tried included both linear and non-linear models. Linear methods included multiple linear regression, partial least squares, and stepwise AIC-optimized regression. Non-linear methods included k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), decision trees (rpart), and XGBoost. XGBoost offered the highest R-squared and the lowest RMSE.
This brings us to the issues with the data and how we can resolve them in the face of government regulation. The first and foremost issue is that we lost about 5% of our data because the beverages were not properly labelled. We also noticed several missing values and implausibly wide ranges within the sensor readings themselves, which raises the question: how often are our sensors calibrated or tested?
Given the pressure of governmental regulation, our first step should be to build better data QA procedures. We should train staff on the importance of labeling, so that every bottle we test carries its full brand code and measurement data.
Second, we should build a new process to verify that our sensors are working properly, recording when each was last calibrated/inspected in a table so we can vouch for the quality of the data we receive.
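# Load the historical production data (Train) and the data to be scored (Test) from GitHub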
myfile <- "https://raw.githubusercontent.com/jhumms/624/main/data-dump/StudentData.csv"
Train <- read_csv(myfile)
## Rows: 2571 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Brand Code
## dbl (32): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
myfile <- "https://raw.githubusercontent.com/jhumms/624/main/data-dump/StudentEvaluation.csv"
Test <- read_csv(myfile)
## Rows: 267 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Brand Code
## dbl (31): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
## lgl (1): PH
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
rm(myfile)
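# Back-transform the best model's log-scale RMSE (derived later in this report)
# into a multiplicative error factor on the pH scale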
exp(0.01159245)
## [1] 1.01166
The first step is to get a good look at our data and make sure it is clean enough to model.
dim(Train)
## [1] 2571 33
dim(Test)
## [1] 267 33
We see that the Train table has 2571 entries and 33 columns, while Test has the same 33 columns but only 267 entries. Let’s glimpse the data to see what we can find.
head(Train)
## # A tibble: 6 x 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # i 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, ...
We can see that there are different branded beverages with different fills, pressures, carb temps, etc. When we clean the data, it will be important to clean it by group so that we are not filling in incorrect values; a quick sketch of why follows. Then we run the summary.
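As an illustration only (the imputation below is ultimately done globally via preProcess), brand-level medians differ enough that filling values without regard to brand could blur real differences:
Train %>%
  group_by(`Brand Code`) %>%
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)))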
summary(Train)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
summary(Test)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03370 Median :120.0 Median :46.00
## Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :3 NA's :1 NA's :2
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
When we look at the summary stats, we can see there are quite a few NAs in both Train and Test. We will impute the missing values during preprocessing, but first we want to see whether any variables have distributions that would make them hard to predict with.
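As a quick sketch of the scale of the problem (base R only), we can tally the missing values per column before deciding how to handle them:
sort(colSums(is.na(Train)), decreasing = TRUE)  # columns with the most NAs first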
par(mfrow=c(2,5))
for (i in 2:length(Train)) {
boxplot(Train[,i], main=names(Train[i]), type="l")
}
BPT <-Test %>% dplyr::select(-PH)
par(mfrow=c(2,5))
for (i in 2:length(BPT)) {
boxplot(BPT[,i], main=names(BPT[i]), type="l")
}
rm(BPT)
The only one that wouldn’t appear to work is MFR, which shows extreme variation and also a large number of NAs. We will begin by removing it and making sure Brand Code is a factor.
Train <- Train %>% dplyr::select(-MFR)
Train$`Brand Code` <- as.factor(Train$`Brand Code`)
unique(Train$`Brand Code`)
## [1] B A C D <NA>
## Levels: A B C D
Test <- Test %>% dplyr::select(-MFR)
Test$`Brand Code` <- as.factor(Test$`Brand Code`)
We also noticed that a few Brand Codes are NA (about 120 entries, roughly 5% of the data); we will remove those entries.
Train <- Train %>%
filter(!is.na(`Brand Code`))
Train <- Train %>%
filter(!is.na(PH))
Test <- Test %>%
filter(!is.na(`Brand Code`))
Now let’s preprocess the Train and Test datasets.
set.seed(100)
# First we will join them together so nzv works and does not cause mismatched data sets
Train <- rbind(Train, Test)
# Train
#remove pH from the train data set in order to only transform the predictors
train_features <- Train %>%
dplyr::select(-c(PH))
#remove nzv, correlated values, center and scale, apply BoxCox for normalization
preProc <- preProcess(as.data.frame(train_features), method=c("knnImpute","nzv","corr",
"center", "scale", "BoxCox"))
#get the transformed features
preProc_train <- predict(preProc, as.data.frame(train_features))
#add the PH response variable to the preProcessed train features
preProc_train$PH <- Train$PH
# Split datasets
test_features <- preProc_train %>% filter(is.na(PH))
preProc_train <- preProc_train %>% filter(!is.na(PH))
preProc_train$PH <- log(preProc_train$PH)
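# Optional sanity check (a sketch; preProc is the caret preProcess object fit above):
# inspect which transformation was applied to which retained predictor
str(preProc$method)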
summary(preProc_train)
## Brand Code Carb Volume Fill Ounces PC Volume
## A: 293 Min. :-3.470696 Min. :-3.944068 Min. :-3.654285
## B:1235 1st Qu.:-0.755370 1st Qu.:-0.625751 1st Qu.:-0.608422
## C: 304 Median :-0.232343 Median :-0.003976 Median :-0.058849
## D: 615 Mean : 0.003934 Mean : 0.003439 Mean : 0.006839
## 3rd Qu.: 0.828635 3rd Qu.: 0.619185 3rd Qu.: 0.601500
## Max. : 2.870134 Max. : 4.071319 Max. : 3.015515
## Carb Pressure Carb Temp PSC
## Min. :-3.582425 Min. :-3.499888 Min. :-2.686287
## 1st Qu.:-0.686852 1st Qu.:-0.656547 1st Qu.:-0.622540
## Median : 0.010128 Median :-0.043490 Median : 0.019801
## Mean : 0.004525 Mean : 0.001779 Mean : 0.001113
## 3rd Qu.: 0.673590 3rd Qu.: 0.682859 3rd Qu.: 0.658694
## Max. : 2.859695 Max. : 2.862130 Max. : 2.792421
## PSC Fill PSC CO2 Mnf Flow Carb Pressure1
## Min. :-1.657007 Min. :-1.32240 Min. :-1.038655 Min. :-3.62123
## 1st Qu.:-0.808039 1st Qu.:-0.85004 1st Qu.:-1.036981 1st Qu.:-0.80414
## Median :-0.128864 Median :-0.37768 Median : 0.387427 Median : 0.09221
## Mean : 0.005334 Mean : 0.01164 Mean : 0.004988 Mean :-0.01776
## 3rd Qu.: 0.550311 3rd Qu.: 0.56705 3rd Qu.: 0.981628 3rd Qu.: 0.60440
## Max. : 3.606599 Max. : 4.34595 Max. : 1.609305 Max. : 3.76296
## Fill Pressure Hyd Pressure2 Hyd Pressure4
## Min. :-5.419682 Min. :-1.273252 Min. :-3.263706
## 1st Qu.:-0.572797 1st Qu.:-1.273252 1st Qu.:-0.624114
## Median :-0.438025 Median : 0.472212 Median : 0.031719
## Mean :-0.007289 Mean : 0.006838 Mean : 0.000661
## 3rd Qu.: 0.696492 3rd Qu.: 0.835850 3rd Qu.: 0.488667
## Max. : 3.348518 Max. : 2.326766 Max. : 2.982425
## Filler Level Filler Speed Temperature Usage cont
## Min. :-2.802104 Min. :-3.27593 Min. :-1.894117 Min. :-2.526635
## 1st Qu.:-0.831631 1st Qu.: 0.19424 1st Qu.:-0.547963 1st Qu.:-0.923576
## Median : 0.554903 Median : 0.41456 Median :-0.226721 Median : 0.213849
## Mean :-0.008277 Mean :-0.03438 Mean :-0.009198 Mean : 0.004474
## 3rd Qu.: 0.687469 3rd Qu.: 0.44627 3rd Qu.: 0.398435 3rd Qu.: 0.956966
## Max. : 4.269514 Max. : 0.51007 Max. : 6.529949 Max. : 1.841344
## Carb Flow Density Pressure Vacuum
## Min. :-2.263851 Min. :-4.752948 Min. :-2.439385
## 1st Qu.:-1.209594 1st Qu.:-0.685435 1st Qu.:-0.688367
## Median : 0.522135 Median :-0.423374 Median :-0.338163
## Mean : 0.002307 Mean : 0.001968 Mean :-0.006279
## 3rd Qu.: 0.667837 3rd Qu.: 1.161153 3rd Qu.: 0.362244
## Max. : 2.448752 Max. : 1.646233 Max. : 2.813670
## Oxygen Filler Pressure Setpoint Air Pressurer
## Min. :-1.995139 Min. :-1.958909 Min. :-1.722258
## 1st Qu.:-0.368743 1st Qu.:-0.782188 1st Qu.:-0.524201
## Median : 0.026936 Median :-0.782188 Median :-0.188363
## Mean : 0.003717 Mean :-0.008229 Mean : 0.000388
## 3rd Qu.: 0.640851 3rd Qu.: 1.161820 3rd Qu.: 0.144660
## Max. : 3.203252 Max. : 1.969996 Max. : 4.231559
## Carb Rel PH
## Min. :-4.337567 Min. :2.064
## 1st Qu.:-0.788069 1st Qu.:2.133
## Median :-0.295034 Median :2.145
## Mean :-0.001671 Mean :2.146
## 3rd Qu.: 0.942580 3rd Qu.:2.161
## Max. : 4.198497 Max. :2.236
summary(test_features)
## Brand Code Carb Volume Fill Ounces PC Volume
## A: 35 Min. :-2.27848 Min. :-2.63696 Min. :-3.37495
## B:129 1st Qu.:-0.82186 1st Qu.:-0.62575 1st Qu.:-0.69036
## C: 31 Median :-0.23234 Median :-0.08177 Median : 0.01963
## D: 64 Mean :-0.02601 Mean :-0.03374 Mean : 0.02317
## 3rd Qu.: 0.88900 3rd Qu.: 0.46326 3rd Qu.: 0.79986
## Max. : 2.60202 Max. : 2.65402 Max. : 2.92003
##
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :-2.46115 Min. :-3.05670 Min. :-2.47258 Min. :-1.4872
## 1st Qu.:-0.80648 1st Qu.:-0.60432 1st Qu.:-0.70127 1st Qu.:-0.6382
## Median : 0.01013 Median :-0.04349 Median : 0.01980 Median :-0.1289
## Mean : 0.01339 Mean : 0.03905 Mean : 0.02216 Mean :-0.0300
## 3rd Qu.: 0.72745 3rd Qu.: 0.72978 3rd Qu.: 0.65869 3rd Qu.: 0.5333
## Max. : 2.44098 Max. : 2.86213 Max. : 2.51979 Max. : 3.6066
##
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :-1.32240 Min. :-1.03866 Min. :-2.04195 Min. :-3.82465
## 1st Qu.:-0.85004 1st Qu.:-1.03698 1st Qu.:-0.63341 1st Qu.:-0.57280
## Median :-0.37768 Median :-0.19840 Median : 0.17757 Median : 0.01940
## Mean :-0.10994 Mean :-0.04713 Mean : 0.07075 Mean : 0.05924
## 3rd Qu.: 0.09469 3rd Qu.: 0.96907 3rd Qu.: 0.60440 3rd Qu.: 0.75568
## Max. : 4.34595 Max. : 1.46536 Max. : 2.86661 Max. : 3.30451
##
## Hyd Pressure2 Hyd Pressure4 Filler Level Filler Speed
## Min. :-4.30357 Min. :-2.5675 Min. :-2.28422 Min. :-3.2720
## 1st Qu.:-1.27325 1st Qu.:-0.4547 1st Qu.:-0.63573 1st Qu.: 0.0666
## Median : 0.33888 Median : 0.1871 Median : 0.58419 Median : 0.3358
## Mean :-0.06819 Mean : 0.1053 Mean : 0.07062 Mean :-0.1905
## 3rd Qu.: 0.82373 3rd Qu.: 0.6350 3rd Qu.: 0.70232 3rd Qu.: 0.4423
## Max. : 2.44798 Max. : 2.8755 Max. : 3.49185 Max. : 0.4901
##
## Temperature Usage cont Carb Flow Density
## Min. :-1.72029 Min. :-2.35617 Min. :-2.28798 Min. :-9.0191
## 1st Qu.:-0.38660 1st Qu.:-1.01470 1st Qu.:-1.27548 1st Qu.:-0.6178
## Median :-0.06829 Median : 0.07005 Median : 0.53327 Median :-0.4234
## Mean : 0.17552 Mean :-0.04992 Mean :-0.02174 Mean :-0.0186
## 3rd Qu.: 0.47482 3rd Qu.: 0.94116 3rd Qu.: 0.69753 3rd Qu.: 1.0852
## Max. : 6.11728 Max. : 1.29502 Max. : 1.29241 Max. : 1.5153
##
## Pressure Vacuum Oxygen Filler Pressure Setpoint Air Pressurer
## Min. :-2.08918 Min. :-1.995139 Min. :-1.95891 Min. :-1.376317
## 1st Qu.:-0.68837 1st Qu.:-0.372889 1st Qu.:-0.78219 1st Qu.:-0.524201
## Median : 0.01204 Median : 0.044555 Median :-0.78219 Median :-0.188363
## Mean : 0.06045 Mean : 0.009222 Mean : 0.07158 Mean :-0.001272
## 3rd Qu.: 0.71245 3rd Qu.: 0.586101 3rd Qu.: 1.16182 3rd Qu.:-0.021502
## Max. : 2.81367 Max. : 3.195121 Max. : 1.97000 Max. : 3.479129
##
## Carb Rel PH
## Min. :-2.187610 Min. : NA
## 1st Qu.:-0.788069 1st Qu.: NA
## Median :-0.295034 Median : NA
## Mean : 0.005197 Mean :NaN
## 3rd Qu.: 0.942580 3rd Qu.: NA
## Max. : 2.213065 Max. : NA
## NA's :259
Now that we have the NAs out of the way, we can look at the corrplot to check for collinearity. We will look at all pairwise correlations, then flag any above .85.
m <- stats::cor(preProc_train[,2:26], method = "pearson")
corrplot::corrplot(m)
We can see that there is some collinearity, especially between Balling, Density, Alch Rel, Carb Rel, and Balling Lvl. If we choose ridge regression or any of the more advanced models, we will not have to worry much about it. But if we choose a simpler model, we will have to account for the multicollinearity by merging/removing columns.
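To make the .85 threshold concrete, a minimal sketch using caret::findCorrelation flags the columns whose pairwise correlations exceed the cutoff (the preprocessing above already applied caret’s "corr" filter, so this is only a cross-check):
high_corr <- caret::findCorrelation(m, cutoff = 0.85, names = TRUE)
high_corr  # column names recommended for removal at this cutoff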
Let’s check the histogram for PH across each group, then in general see how the data plots against each other:
for (i in unique(Train$`Brand Code`)) {
print(Train %>% filter(`Brand Code` == i) %>%
ggplot(aes(x=PH)) +
geom_histogram(bins=25))
}
## Warning: Removed 129 rows containing non-finite values (stat_bin).
## Warning: Removed 35 rows containing non-finite values (stat_bin).
## Warning: Removed 31 rows containing non-finite values (stat_bin).
## Warning: Removed 64 rows containing non-finite values (stat_bin).
for (i in unique(Train$`Brand Code`)) {
print(Train %>% filter(`Brand Code` == i) %>%
ggplot(aes(x=log(PH))) +
geom_histogram(bins=25))
}
## Warning: Removed 129 rows containing non-finite values (stat_bin).
## Warning: Removed 35 rows containing non-finite values (stat_bin).
## Warning: Removed 31 rows containing non-finite values (stat_bin).
## Warning: Removed 64 rows containing non-finite values (stat_bin).
#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)
train_data <- preProc_train[training_set,]
eval_data <- preProc_train[-training_set,]
The distribution of PH is closer to normal after the log transform, which will help with the forecasting. Now let’s see what the relationship looks like between PH and the other variables.
And now it is time for modelling.
We consider these linear regression models: multi-linear regression, partial least squares, and stepwise AIC-optimized regression. We utilize the train() function for all three models, feeding the same datasets for X and Y and specifying the model-building technique via the "method" argument. The predict() function, in combination with postResample(), is used to generate summary statistics for how each model performed on the evaluation data.
First, a multi-linear regression is created to predict the pH response variable.
#Remove PH from sets to feed models
set.seed(100)
y_train <- subset(train_data, select = -c(PH))
y_test <- subset(eval_data, select = -c(PH))
#generate model
linear_model <- train(x= y_train, y= train_data$PH,
method='lm',
trControl=trainControl(method = "cv", number = 10))
#evaluate model
lmPred <- predict(linear_model, newdata = y_test)
lmResample <- postResample(pred=lmPred, obs = eval_data$PH)
Next, PLS regression is performed on the data to predict PH. PLS was chosen given the multicollinearity detected earlier in the exploratory data analysis phase.
set.seed(100)
#generate model
pls_model <- train(y_train, train_data$PH,
method='pls',
metric='Rsquared',
tuneLength=10,
trControl=trainControl(method = "cv", number = 10))
#evaluate model metrics
plsPred <-predict(pls_model, newdata=y_test)
plsReSample <- postResample(pred=plsPred, obs = eval_data$PH)
Lastly, for our linear models, a stepwise AIC model is run using stepAIC. The stepwise regression is performed by specifying the direction parameter as "both."
set.seed(100)
#generate model
initial <- lm(PH ~ . , data = train_data)
AIC_model <- stepAIC(initial, direction = "both",
trace = 0)
#evaluate model metrics
AIC_Pred <-predict(AIC_model, newdata=y_test)
aicResample <- postResample(pred=AIC_Pred, obs=eval_data$PH)
We now verify model performance and identify the strongest performer among our linear regression models.
display <- rbind("Linear Regression" = lmResample,
"Stepwise AIC" = aicResample,
"Partial Least Squares" = plsReSample)
display
## RMSE Rsquared MAE
## Linear Regression 0.01568033 0.4098999 0.01204503
## Stepwise AIC 0.01577032 0.4031818 0.01207796
## Partial Least Squares 0.01571769 0.4070956 0.01208152
Simple linear regression performed best among the linear models.
Next we build non-linear models. We will try k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), and XGBoost. These models are not based on simple linear combinations of the predictors.
Predictors with the largest scales contribute most to the distance between samples, so centering and scaling the data during pre-processing is important; a toy illustration follows.
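This sketch uses made-up numbers (not our data): with raw units, the large-scale column dominates the distance entirely, while scaling lets both columns contribute:
toy <- data.frame(speed = c(4000, 3900), volume = c(0.27, 0.47))
dist(toy)         # distance driven almost entirely by speed
dist(scale(toy))  # after centering and scaling, both columns contribute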
set.seed(100)
knnModel <- train(PH~., data = train_data,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
#knnModel
knnPred <- predict(knnModel, newdata = eval_data)
knn_metrics <- postResample(pred = knnPred, obs = eval_data$PH)
#knn_metrics
Support Vector Machines follow the framework of robust regression, where we seek to minimize the effect of outliers on the regression equations. The radial kernel we are using has an additional parameter, sigma, which impacts the smoothness of the upper and lower boundary.
set.seed(100)
tc <- trainControl(method = "cv",
number = 5,
classProbs = T)
svmModel <- train(PH~., data = train_data,
method = "svmRadial",
preProcess = c("BoxCox","center", "scale"),
trControl = tc,
tuneLength = 9)
#svmModel
svmPred <- predict(svmModel, newdata = eval_data)
svm_metrics <- postResample(pred = svmPred, obs = eval_data$PH)
#svm_metrics
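The tuned values of the cost C and the kernel width sigma mentioned above can be read directly off the fitted object (a quick check, not part of the original output):
svmModel$bestTune  # chosen sigma (kernel smoothness) and cost C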
MARS breaks each predictor into two groups using a “hinge” function of the original variable, based on a cut point that achieves the smallest error, and models the relationship between the predictor and the outcome as linear within each group.
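As a minimal sketch of the hinge idea (the cut point of 10 is an arbitrary illustrative value, not one chosen by the model):
h_right <- function(x, cut = 10) pmax(0, x - cut)  # nonzero only when x > cut
h_left  <- function(x, cut = 10) pmax(0, cut - x)  # nonzero only when x < cut
h_right(c(5, 10, 15))  # 0 0 5
h_left(c(5, 10, 15))   # 5 0 0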
set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars <- train(PH~., data = train_data,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
#mars
marsPred <- predict(mars, newdata = eval_data)
mars_metrics <- postResample(pred = marsPred, obs = eval_data$PH)
#mars_metrics
Observed vs. Predicted: Non-Linear Models with Reduced Predictor Set.
knnModel_pred <- knnModel %>% predict(eval_data)
# Model performance metrics
knn_Accuracy <- data.frame(
Model = "k-Nearest Neighbors",
RMSE = caret::RMSE(knnModel_pred,eval_data$PH),
Rsquare = caret::R2(knnModel_pred,eval_data$PH))
pred_svm <- svmModel %>% predict(eval_data)
# Model SVM performance metrics
SMV_Acc <- data.frame(
Model = "Support Vector Machine",
RMSE = caret::RMSE(pred_svm, eval_data$PH),
Rsquare = caret::R2(pred_svm, eval_data$PH)
)
# Make MARS predictions
pred_mars <- mars %>% predict(eval_data)
# Model MARS performance metrics
MARS_Acc <- data.frame(
Model = "MARS Tuned",
RMSE = caret::RMSE(pred_mars, eval_data$PH),
Rsquare = caret::R2(pred_mars, eval_data$PH)
)
names(MARS_Acc)[names(MARS_Acc) == 'y'] <- "Rsquare"
### code for the plot
par(mar = c(4, 4, 4, 4))
par(mfrow=c(2,2))
plot(knnModel_pred, eval_data$PH, ylab="Observed", col = "red")
abline(0, 1, lwd=2)
plot(pred_svm, eval_data$PH, ylab="Observed", col = "dark green")
abline(0, 1, lwd=2)
plot(pred_mars, eval_data$PH, ylab="Observed", col = "blue")
abline(0, 1, lwd=2)
mtext("Observed Vs. Predicted - Non - Linear Models with Reduced Predictor Set", side = 3, line = -2, outer = TRUE)
rbind( "KNN" = knn_metrics,
"SVM" = svm_metrics,
"MARS" = mars_metrics)
## RMSE Rsquared MAE
## KNN 0.01410296 0.5268104 0.01069230
## SVM 0.01296496 0.5982959 0.00923041
## MARS 0.01443122 0.5026712 0.01099409
The best performing non-linear model was the Support Vector Machine (SVM); its top predictors can be inspected as sketched below.
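One hedged way to view them (caret’s model-agnostic importance for a fitted train object; this plot is a sketch, not part of the original run):
plot(varImp(svmModel), top = 10)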
The next model we developed was a decision tree. Decision trees are capable of capturing nonlinear relationships as well as feature importance. They are highly interpretable, but not as accurate as other tree-based methods. The model we created has a maximum depth of 12 and a minimum split of 30 observations. The results and decision tree structure are presented below.
colnames(train_data) <- make.names(colnames(train_data))
colnames(eval_data) <- make.names(colnames(eval_data))
y_train <- subset(train_data, select = -c(PH))
y_test <- subset(eval_data, select = -c(PH))
rpart.model <- train(PH~.,
data = train_data,
method = "rpart",
tuneLength = 200,
control = rpart.control(maxdepth = 12, minsplit = 30),
trControl=trainControl(method = "cv", number = 10))
rpart.pred <- predict(rpart.model, newdata = eval_data)
rpart.metrics <- postResample(pred = rpart.pred, obs = eval_data$PH)
rpart.metrics
## RMSE Rsquared MAE
## 0.01407714 0.54268663 0.01029654
rpart.plot(rpart.model$finalModel)
The last model we created was an XGBoost tree model. XGBoost is an ensemble algorithm in which many decision trees are combined into a single model to improve overall accuracy. The downside of ensemble methods is that they are more difficult to interpret compared to individual models. The optimal XGBoost model was found via a grid search, with the best results occurring at 1000 decision trees, a learning rate of 0.05, a maximum tree depth of 5, and a gamma of 0. The results of the model can be seen below.
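Only the winning combination is re-fit in the chunk below. For reference, a grid of the sort such a search might sweep is sketched here; the values are illustrative assumptions, not the exact grid we ran:
search_grid <- expand.grid(
  nrounds = c(500, 1000),
  eta = c(0.01, 0.05, 0.1),
  max_depth = c(3, 5, 7),
  gamma = c(0, 1),
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
# passing search_grid as tuneGrid would evaluate all 36 combinations by cross-validation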
set.seed(100)
hyperparameters <- expand.grid(
nrounds = 1000,
eta = 0.05,
max_depth = 5,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
#fit XGBoost model and display training and testing data at each iteration
xgb.model <- train(PH~.,
data = train_data,
method = "xgbTree",
tuneGrid = hyperparameters,
trControl = trainControl(method = "cv", number = 10),
verbosity = 0)
xgb.pred <- predict(xgb.model, newdata = eval_data)
xgb.metrics <- postResample(pred = xgb.pred, obs = eval_data$PH)
xgb.metrics
## RMSE Rsquared MAE
## 0.011592453 0.678232868 0.008349152
x = 1:nrow(y_test) # visualize predicted vs. observed values on the evaluation set
plot(x, xgb.pred, col = "red", type = "l", ylab = "log(PH)")
lines(x, eval_data$PH, col = "blue", type = "l")
legend("topright", legend = c("predicted PH", "observed PH"),
col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))
Next, the metrics for each of the best performing linear, non-linear, and tree-based models are compiled to evaluate the performance of each and select the best model.
The RMSE, \(R^2\), and MAE were recorded for each model presented above and compared to find the optimal model to predict PH. The linear models produced the worst results, the nonlinear models were middle of the road, and the tree-based models achieved the best performance metrics. The overall winner is XGBoost, with the lowest RMSE (0.0116), best \(R^2\) (0.678), and lowest MAE (0.0083) on the log scale.
rbind(
mars_metrics,
knn_metrics,
svm_metrics,
rpart.metrics,
xgb.metrics)
## RMSE Rsquared MAE
## mars_metrics 0.01443122 0.5026712 0.010994095
## knn_metrics 0.01410296 0.5268104 0.010692299
## svm_metrics 0.01296496 0.5982959 0.009230410
## rpart.metrics 0.01407714 0.5426866 0.010296535
## xgb.metrics 0.01159245 0.6782329 0.008349152
The variables that are most impactful for predicting PH are Mnf.Flow, Usage.cont, and Oxygen.Filler. The data table exhibits every variable’s importance to the XGBoost model, and the plot showcases the top 10 contributors to PH predictive power.
importance_matrix = xgb.importance(model = xgb.model$finalModel)
importance_matrix
## Feature Gain Cover Frequency
## 1: Mnf.Flow 0.090764650 0.057847898 0.041183604
## 2: Usage.cont 0.073691131 0.051414743 0.051515886
## 3: Oxygen.Filler 0.069929482 0.053250823 0.051321853
## 4: Filler.Speed 0.060896230 0.061880612 0.057142857
## 5: Density 0.055382737 0.046687133 0.039631336
## 6: Air.Pressurer 0.053122417 0.021980742 0.031336406
## 7: Pressure.Vacuum 0.052936328 0.046811866 0.035168567
## 8: Filler.Level 0.050772493 0.050162577 0.044821732
## 9: Temperature 0.050210750 0.034198783 0.033907349
## 10: Brand.CodeC 0.048456593 0.006668869 0.008100897
## 11: Carb.Pressure1 0.044816360 0.062124384 0.050254669
## 12: Carb.Flow 0.039835555 0.068796798 0.054765947
## 13: Carb.Rel 0.036243622 0.032272672 0.030026680
## 14: PC.Volume 0.031626245 0.065659455 0.062672811
## 15: Carb.Volume 0.030590650 0.026348764 0.059616784
## 16: Fill.Ounces 0.026087850 0.036814203 0.058210041
## 17: Hyd.Pressure2 0.024254686 0.040538146 0.032694640
## 18: Brand.CodeD 0.020456094 0.007071217 0.007179238
## 19: PSC.Fill 0.019870716 0.028172596 0.033470774
## 20: Fill.Pressure 0.019474674 0.039478291 0.033228232
## 21: PSC 0.019268703 0.052298188 0.046034441
## 22: Hyd.Pressure4 0.018065634 0.018877671 0.027940820
## 23: Carb.Pressure 0.016709273 0.034344252 0.036429784
## 24: Carb.Temp 0.015247775 0.032394505 0.034537958
## 25: Brand.CodeB 0.012020200 0.002091347 0.013291293
## 26: Pressure.Setpoint 0.010084745 0.006564119 0.004753820
## 27: PSC.CO2 0.009184408 0.015249345 0.020761581
## Feature Gain Cover Frequency
xgb.plot.importance(importance_matrix[1:10,])
Let’s check individually what each Brand Code comes back with for R-squared and RMSE. Predictions and observations are back-transformed with exp(), so these metrics are on the original pH scale.
for (i in unique(train_data$Brand.Code)) {
xgb.pred <- predict(xgb.model, newdata = eval_data %>% filter(Brand.Code == i))
d <- eval_data %>% filter(Brand.Code == i)
xgb.metrics <- postResample(pred = exp(xgb.pred), obs = exp(d$PH) )
print(i)
print(xgb.metrics)
}
## [1] "B"
## RMSE Rsquared MAE
## 0.08658230 0.74858830 0.06580456
## [1] "A"
## RMSE Rsquared MAE
## 0.10487783 0.53631742 0.08042107
## [1] "C"
## RMSE Rsquared MAE
## 0.1543169 0.2629705 0.1051827
## [1] "D"
## RMSE Rsquared MAE
## 0.07831330 0.67088021 0.06067395
Next, the PH values for the provided Test data set are predicted using the predict function along with the optimal model identified above, the XGBoost model.
colnames(test_features) <- make.names(colnames(test_features))
# predict log-scale pH with the final XGBoost model, then back-transform with exp()
pred <- round(predict(xgb.model, newdata = test_features), 2)
pH <- data.frame(pH = exp(pred))
Finally, the predicted pH values are exported to a CSV file using write_excel_csv, which Excel can open directly.
write_excel_csv(pH, "Predictions.csv")
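If a native .xlsx workbook is preferred over CSV, one option (an assumption; the writexl package is not used elsewhere in this report) is:
# writexl::write_xlsx(pH, "Predictions.xlsx")  # requires install.packages("writexl")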