Instructions

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Exploratory Data Analysis

Problem statement

We are a team of Data Scientists working for a PCG company. Due to regulation, we need to better predict the PH in our beverages.

Summary

With the given data, we cannot confidently predict the PH for our line of variables, the issues begin with data gathering. We were able to achieve the highest RSqsaured of 67.8% and an RMSE of 1.01. Since PH is already a logged number, our error is essentially \(1^{10}\). Below is an overview of the RMSE and RSqaured for each individual Brand Code. We can see the highest is B and only has an RSqaured of .74. We aim to achieve an RSquared of at minimum .85.

## # A tibble: 4 x 3
##   Brand.Code RMSE       RSqaured  
##   <chr>      <chr>      <chr>     
## 1 A          0.10487783 0.53631742
## 2 B          0.08658230 0.74858830
## 3 C          0.1543169  0.2629705 
## 4 D          0.07831330 0.67088021

The forecasting methods we tried included Linear Modelling and Non-Linear Methods. Linear included multi-linear regression, partial least squares, AIC optimized. Non Linear included k-nearest neighbors (KNN), support vector machines (SVM) and multivariate adaptive regression splines (MARS), rpart, and XGBoost. With XGBoost offerinng the highest RSqaured and lowest RMSE.

This brings us to the issues with the data and how we can resolve them in the face of government regulation. The first and foremost issue is that we lost about 5% of our data because the beverages were not properly labelled. Then, we noticed within the sensors themselves there were several missing values and values with wide ranges. This begs the question, how often are our sensors calibrated or tested?

The first steps we should take given the pressure for governmental regulation should be to build out better data QA procedures. We should teach the importance of labeling data so that any bottles we are tested have the full name and data.

Second, we should build out a new process to ensure our sensors are working properly and that we have the information for when they were last calibrated/inspected in a table where we can properly ensure the quality of the data we are receiving.

Load & Review Train Data

  1. Ensure Data Quality While performing the ETL we noticed that there were missing data points across several processes as well as several missing brand entries.
myfile <- "https://raw.githubusercontent.com/jhumms/624/main/data-dump/StudentData.csv"
Train <- read_csv(myfile)
## Rows: 2571 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): Brand Code
## dbl (32): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
myfile <- "https://raw.githubusercontent.com/jhumms/624/main/data-dump/StudentEvaluation.csv"
Test <- read_csv(myfile)
## Rows: 267 Columns: 33
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): Brand Code
## dbl (31): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
## lgl  (1): PH
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
rm(myfile)

exp(0.01159245)
## [1] 1.01166

The first step is to get a good look at our data and make sure it is clean enough to model.

dim(Train)
## [1] 2571   33
dim(Test)
## [1] 267  33

We see that the Train Table has 2571 entries and 33 columns. While Test has all 33 columns but only 267 entries. Let’s glimpse the data to see what we can find.

head(Train)
## # A tibble: 6 x 33
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # i 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, ...

We can see that there are different Branded beverages with different fills, pressures, carb temps, etc. When we clean the data, it will be important to clean it by each group so that we are not filling in incorrect values. Now let’s run the summary.

summary(Train)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
summary(Test)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd Pressure1    Hyd Pressure2    Hyd Pressure3    Hyd Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##     PH          Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Mode:logical   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  NA's:267       1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00    
##                 Median :0.03370   Median :120.0   Median :46.00    
##                 Mean   :0.04666   Mean   :109.6   Mean   :47.73    
##                 3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00    
##                 Max.   :0.39800   Max.   :130.0   Max.   :52.00    
##                 NA's   :3         NA's   :1       NA's   :2        
##  Air Pressurer      Alch Rel        Carb Rel     Balling Lvl   
##  Min.   :141.2   Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :142.6   Median :6.580   Median :5.40   Median :1.480  
##  Mean   :142.8   Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:142.8   3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :147.2   Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :1       NA's   :3       NA's   :2

Distributions

When we look at the summary stats, we can see there are quite a few NAs in both Train and Test, I think it is safe to remove the NAs the values, we will replace with preprocess, but first we want to see if any variables have distributions that would make it hard to predict with.

par(mfrow=c(2,5))
for (i in 2:length(Train)) {
        boxplot(Train[,i], main=names(Train[i]), type="l")

}

BPT <-Test %>% dplyr::select(-PH)
par(mfrow=c(2,5))

for (i in 2:length(BPT)) {
        boxplot(BPT[,i], main=names(BPT[i]), type="l")

}

rm(BPT)

The only one that wouldn’t appear to work is MFR, which has crazy variation and also a large amount of NAs. We will begin by removing that and making sure Brand Code is a factor.

Train <- Train %>% dplyr::select(-MFR)
Train$`Brand Code` <- as.factor(Train$`Brand Code`)
unique(Train$`Brand Code`)
## [1] B    A    C    D    <NA>
## Levels: A B C D
Test <- Test %>% dplyr::select(-MFR)
Test$`Brand Code` <- as.factor(Test$`Brand Code`)

We also noticed that there are a few Brand Codes that are NA (120 ~ 5% of the data), we will remove those entries.

Train <- Train %>%
 filter(!is.na(`Brand Code`))

Train <- Train %>%
 filter(!is.na(PH))

Test <- Test %>%
 filter(!is.na(`Brand Code`))

Now let’s process Train and test datasets

set.seed(100)

# First we will join them together so nzv works and does not cause mismatched data sets
Train <- rbind(Train, Test)

# Train

#remove pH from the train data set in order to only transform the predictors
train_features <- Train %>% 
  dplyr::select(-c(PH))
#remove nzv, correlated values, center and scale, apply BoxCox for normalization
preProc <- preProcess(as.data.frame(train_features), method=c("knnImpute","nzv","corr",
                                               "center", "scale", "BoxCox"))
#get the transformed features
preProc_train <- predict(preProc, as.data.frame(train_features))
#add the PH response variable to the preProcessed train features
preProc_train$PH <- Train$PH 

# Split datasets


test_features <- preProc_train %>% filter(is.na(PH))

preProc_train <- preProc_train %>% filter(!is.na(PH))

preProc_train$PH <- log(preProc_train$PH)
summary(preProc_train)
##  Brand Code  Carb Volume         Fill Ounces          PC Volume        
##  A: 293     Min.   :-3.470696   Min.   :-3.944068   Min.   :-3.654285  
##  B:1235     1st Qu.:-0.755370   1st Qu.:-0.625751   1st Qu.:-0.608422  
##  C: 304     Median :-0.232343   Median :-0.003976   Median :-0.058849  
##  D: 615     Mean   : 0.003934   Mean   : 0.003439   Mean   : 0.006839  
##             3rd Qu.: 0.828635   3rd Qu.: 0.619185   3rd Qu.: 0.601500  
##             Max.   : 2.870134   Max.   : 4.071319   Max.   : 3.015515  
##  Carb Pressure         Carb Temp              PSC           
##  Min.   :-3.582425   Min.   :-3.499888   Min.   :-2.686287  
##  1st Qu.:-0.686852   1st Qu.:-0.656547   1st Qu.:-0.622540  
##  Median : 0.010128   Median :-0.043490   Median : 0.019801  
##  Mean   : 0.004525   Mean   : 0.001779   Mean   : 0.001113  
##  3rd Qu.: 0.673590   3rd Qu.: 0.682859   3rd Qu.: 0.658694  
##  Max.   : 2.859695   Max.   : 2.862130   Max.   : 2.792421  
##     PSC Fill            PSC CO2            Mnf Flow         Carb Pressure1    
##  Min.   :-1.657007   Min.   :-1.32240   Min.   :-1.038655   Min.   :-3.62123  
##  1st Qu.:-0.808039   1st Qu.:-0.85004   1st Qu.:-1.036981   1st Qu.:-0.80414  
##  Median :-0.128864   Median :-0.37768   Median : 0.387427   Median : 0.09221  
##  Mean   : 0.005334   Mean   : 0.01164   Mean   : 0.004988   Mean   :-0.01776  
##  3rd Qu.: 0.550311   3rd Qu.: 0.56705   3rd Qu.: 0.981628   3rd Qu.: 0.60440  
##  Max.   : 3.606599   Max.   : 4.34595   Max.   : 1.609305   Max.   : 3.76296  
##  Fill Pressure       Hyd Pressure2       Hyd Pressure4      
##  Min.   :-5.419682   Min.   :-1.273252   Min.   :-3.263706  
##  1st Qu.:-0.572797   1st Qu.:-1.273252   1st Qu.:-0.624114  
##  Median :-0.438025   Median : 0.472212   Median : 0.031719  
##  Mean   :-0.007289   Mean   : 0.006838   Mean   : 0.000661  
##  3rd Qu.: 0.696492   3rd Qu.: 0.835850   3rd Qu.: 0.488667  
##  Max.   : 3.348518   Max.   : 2.326766   Max.   : 2.982425  
##   Filler Level        Filler Speed       Temperature          Usage cont       
##  Min.   :-2.802104   Min.   :-3.27593   Min.   :-1.894117   Min.   :-2.526635  
##  1st Qu.:-0.831631   1st Qu.: 0.19424   1st Qu.:-0.547963   1st Qu.:-0.923576  
##  Median : 0.554903   Median : 0.41456   Median :-0.226721   Median : 0.213849  
##  Mean   :-0.008277   Mean   :-0.03438   Mean   :-0.009198   Mean   : 0.004474  
##  3rd Qu.: 0.687469   3rd Qu.: 0.44627   3rd Qu.: 0.398435   3rd Qu.: 0.956966  
##  Max.   : 4.269514   Max.   : 0.51007   Max.   : 6.529949   Max.   : 1.841344  
##    Carb Flow            Density          Pressure Vacuum    
##  Min.   :-2.263851   Min.   :-4.752948   Min.   :-2.439385  
##  1st Qu.:-1.209594   1st Qu.:-0.685435   1st Qu.:-0.688367  
##  Median : 0.522135   Median :-0.423374   Median :-0.338163  
##  Mean   : 0.002307   Mean   : 0.001968   Mean   :-0.006279  
##  3rd Qu.: 0.667837   3rd Qu.: 1.161153   3rd Qu.: 0.362244  
##  Max.   : 2.448752   Max.   : 1.646233   Max.   : 2.813670  
##  Oxygen Filler       Pressure Setpoint   Air Pressurer      
##  Min.   :-1.995139   Min.   :-1.958909   Min.   :-1.722258  
##  1st Qu.:-0.368743   1st Qu.:-0.782188   1st Qu.:-0.524201  
##  Median : 0.026936   Median :-0.782188   Median :-0.188363  
##  Mean   : 0.003717   Mean   :-0.008229   Mean   : 0.000388  
##  3rd Qu.: 0.640851   3rd Qu.: 1.161820   3rd Qu.: 0.144660  
##  Max.   : 3.203252   Max.   : 1.969996   Max.   : 4.231559  
##     Carb Rel               PH       
##  Min.   :-4.337567   Min.   :2.064  
##  1st Qu.:-0.788069   1st Qu.:2.133  
##  Median :-0.295034   Median :2.145  
##  Mean   :-0.001671   Mean   :2.146  
##  3rd Qu.: 0.942580   3rd Qu.:2.161  
##  Max.   : 4.198497   Max.   :2.236
summary(test_features)
##  Brand Code  Carb Volume        Fill Ounces         PC Volume       
##  A: 35      Min.   :-2.27848   Min.   :-2.63696   Min.   :-3.37495  
##  B:129      1st Qu.:-0.82186   1st Qu.:-0.62575   1st Qu.:-0.69036  
##  C: 31      Median :-0.23234   Median :-0.08177   Median : 0.01963  
##  D: 64      Mean   :-0.02601   Mean   :-0.03374   Mean   : 0.02317  
##             3rd Qu.: 0.88900   3rd Qu.: 0.46326   3rd Qu.: 0.79986  
##             Max.   : 2.60202   Max.   : 2.65402   Max.   : 2.92003  
##                                                                     
##  Carb Pressure        Carb Temp             PSC              PSC Fill      
##  Min.   :-2.46115   Min.   :-3.05670   Min.   :-2.47258   Min.   :-1.4872  
##  1st Qu.:-0.80648   1st Qu.:-0.60432   1st Qu.:-0.70127   1st Qu.:-0.6382  
##  Median : 0.01013   Median :-0.04349   Median : 0.01980   Median :-0.1289  
##  Mean   : 0.01339   Mean   : 0.03905   Mean   : 0.02216   Mean   :-0.0300  
##  3rd Qu.: 0.72745   3rd Qu.: 0.72978   3rd Qu.: 0.65869   3rd Qu.: 0.5333  
##  Max.   : 2.44098   Max.   : 2.86213   Max.   : 2.51979   Max.   : 3.6066  
##                                                                            
##     PSC CO2            Mnf Flow        Carb Pressure1     Fill Pressure     
##  Min.   :-1.32240   Min.   :-1.03866   Min.   :-2.04195   Min.   :-3.82465  
##  1st Qu.:-0.85004   1st Qu.:-1.03698   1st Qu.:-0.63341   1st Qu.:-0.57280  
##  Median :-0.37768   Median :-0.19840   Median : 0.17757   Median : 0.01940  
##  Mean   :-0.10994   Mean   :-0.04713   Mean   : 0.07075   Mean   : 0.05924  
##  3rd Qu.: 0.09469   3rd Qu.: 0.96907   3rd Qu.: 0.60440   3rd Qu.: 0.75568  
##  Max.   : 4.34595   Max.   : 1.46536   Max.   : 2.86661   Max.   : 3.30451  
##                                                                             
##  Hyd Pressure2      Hyd Pressure4      Filler Level       Filler Speed    
##  Min.   :-4.30357   Min.   :-2.5675   Min.   :-2.28422   Min.   :-3.2720  
##  1st Qu.:-1.27325   1st Qu.:-0.4547   1st Qu.:-0.63573   1st Qu.: 0.0666  
##  Median : 0.33888   Median : 0.1871   Median : 0.58419   Median : 0.3358  
##  Mean   :-0.06819   Mean   : 0.1053   Mean   : 0.07062   Mean   :-0.1905  
##  3rd Qu.: 0.82373   3rd Qu.: 0.6350   3rd Qu.: 0.70232   3rd Qu.: 0.4423  
##  Max.   : 2.44798   Max.   : 2.8755   Max.   : 3.49185   Max.   : 0.4901  
##                                                                           
##   Temperature         Usage cont         Carb Flow           Density       
##  Min.   :-1.72029   Min.   :-2.35617   Min.   :-2.28798   Min.   :-9.0191  
##  1st Qu.:-0.38660   1st Qu.:-1.01470   1st Qu.:-1.27548   1st Qu.:-0.6178  
##  Median :-0.06829   Median : 0.07005   Median : 0.53327   Median :-0.4234  
##  Mean   : 0.17552   Mean   :-0.04992   Mean   :-0.02174   Mean   :-0.0186  
##  3rd Qu.: 0.47482   3rd Qu.: 0.94116   3rd Qu.: 0.69753   3rd Qu.: 1.0852  
##  Max.   : 6.11728   Max.   : 1.29502   Max.   : 1.29241   Max.   : 1.5153  
##                                                                            
##  Pressure Vacuum    Oxygen Filler       Pressure Setpoint  Air Pressurer      
##  Min.   :-2.08918   Min.   :-1.995139   Min.   :-1.95891   Min.   :-1.376317  
##  1st Qu.:-0.68837   1st Qu.:-0.372889   1st Qu.:-0.78219   1st Qu.:-0.524201  
##  Median : 0.01204   Median : 0.044555   Median :-0.78219   Median :-0.188363  
##  Mean   : 0.06045   Mean   : 0.009222   Mean   : 0.07158   Mean   :-0.001272  
##  3rd Qu.: 0.71245   3rd Qu.: 0.586101   3rd Qu.: 1.16182   3rd Qu.:-0.021502  
##  Max.   : 2.81367   Max.   : 3.195121   Max.   : 1.97000   Max.   : 3.479129  
##                                                                               
##     Carb Rel               PH     
##  Min.   :-2.187610   Min.   : NA  
##  1st Qu.:-0.788069   1st Qu.: NA  
##  Median :-0.295034   Median : NA  
##  Mean   : 0.005197   Mean   :NaN  
##  3rd Qu.: 0.942580   3rd Qu.: NA  
##  Max.   : 2.213065   Max.   : NA  
##                      NA's   :259

Now that we have the NAs out of the way we can take a look at the corrplot and see if there is any collinearity. We will look at all, and then any that are above .85

Variable Correlations

m <- stats::cor(preProc_train[,2:26], method = "pearson")

corrplot::corrplot(m)

We can see that there is some collinearity, especially between Balling, Density, Alch Rel, Carb Rel, and Balling Lvl. If we choose Ridge regression or any of the more advanced models, we will not have to worry too much about it. But if we are choosing a more simple model, we will have to make sure we account for the multicollinearity by merging/removing columns.

Let’s check the histogram for PH across each group, then in general see how the data plots against each other:

for (i in unique(Train$`Brand Code`)) {
  print(Train %>% filter(`Brand Code` == i) %>%
    ggplot(aes(x=PH)) +
    geom_histogram(bins=25))
}
## Warning: Removed 129 rows containing non-finite values (stat_bin).

## Warning: Removed 35 rows containing non-finite values (stat_bin).

## Warning: Removed 31 rows containing non-finite values (stat_bin).

## Warning: Removed 64 rows containing non-finite values (stat_bin).

for (i in unique(Train$`Brand Code`)) {
  print(Train %>% filter(`Brand Code` == i) %>%
    ggplot(aes(x=log(PH))) +
    geom_histogram(bins=25))
}
## Warning: Removed 129 rows containing non-finite values (stat_bin).

## Warning: Removed 35 rows containing non-finite values (stat_bin).

## Warning: Removed 31 rows containing non-finite values (stat_bin).

## Warning: Removed 64 rows containing non-finite values (stat_bin).

#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)
train_data <- preProc_train[training_set,]
eval_data <- preProc_train[-training_set,]

The distribution around PH is better as logged and will help us with the forecasting. Now let’s see what the relationship looks like between PH and the other variables.

And now it is time for modelling:

Build Models

Linear Models

We consider these linear regression models: multi-linear regression, partial least squares, AIC optimized . We utilize the train() function for all three models, feeding the same datasets for X and Y, and specifying the proper model-building technique via the “method” variable.

Moreover, the prediction() function in combination with postResample()are used to generate summary statistics for how our model performed on the evaluation data:

Multi-linear regression

First, a multi-linear regression is created to predict the pH response variable.

#Remove PH from sets to feed models
set.seed(100)
y_train <- subset(train_data, select = -c(PH))
y_test <- subset(eval_data, select = -c(PH))
#generate model
linear_model <- train(x= y_train, y= train_data$PH,
                      method='lm',
                      trControl=trainControl(method = "cv", number = 10))
#evaluate model
lmPred <- predict(linear_model, newdata = y_test)
lmResample <- postResample(pred=lmPred, obs = eval_data$PH)

Partial Least Squares

Next PLS regression is performed on the data to predict PH. PLS was chosen given the multicolliniarity detected earlier in the exploratory data analysis phase.

set.seed(100)
#generate model
pls_model <- train(y_train, train_data$PH,
                      method='pls',
                      metric='Rsquared',
                      tuneLength=10,
                      trControl=trainControl(method = "cv",  number = 10))
#evaluate model metrics
plsPred <-predict(pls_model, newdata=y_test)
plsReSample <- postResample(pred=plsPred, obs = eval_data$PH)

Stepwise AIC optimized

Lastly, for our linear models, a stepwise AIC model is run using stepAIC. The stepwise regression is performed by specifying the direction parameter with “both.”

set.seed(100)
#generate model
initial <- lm(PH ~ . , data = train_data)
AIC_model <- stepAIC(initial, direction = "both",
                     trace = 0)
#evaluate model metrics
AIC_Pred <-predict(AIC_model, newdata=y_test)
aicResample <- postResample(pred=AIC_Pred, obs=eval_data$PH)

Linear Model Metrics

We need to verify model performance and identify the strongest performing model in our multi-linear regression subset.

display <- rbind("Linear Regression" = lmResample,
                 "Stepwise AIC" = aicResample,
                 "Partial Least Squares" = plsReSample)
display 
##                             RMSE  Rsquared        MAE
## Linear Regression     0.01568033 0.4098999 0.01204503
## Stepwise AIC          0.01577032 0.4031818 0.01207796
## Partial Least Squares 0.01571769 0.4070956 0.01208152

Simple Linear Regression seemed to have better output among the linear models.

Non-linear Models

Building non-linear models. We will try k-nearest neighbors (KNN), support vector machines (SVM) and multivariate adaptive regression splines (MARS), and XGBoost. These models are not based on simple linear combinations of the predictors.

K-Nearest Neighbors

Predictors with the largest scales will contribute most to the distance between samples so centering and scaling the data during pre-processing is important.

set.seed(100)
knnModel <- train(PH~., data = train_data, 
                  method = "knn",
                  preProc = c("center", "scale"), 
                  tuneLength = 10)
#knnModel
knnPred <- predict(knnModel, newdata = eval_data)
knn_metrics <- postResample(pred = knnPred, obs = eval_data$PH)
#knn_metrics

Support Vector Machines

Support Vector Machines follow the framework of robust regression where we seek to minimize the effect of outliers on the regression equations. The radial kernel we are using has an additional parameter which impacts the smoothness of the upper and lower boundary.

set.seed(100)
tc <- trainControl(method = "cv",
                           number = 5,
                           classProbs = T)
svmModel <- train(PH~., data = train_data,
                    method = "svmRadial",
                    preProcess = c("BoxCox","center", "scale"),
                    trControl = tc,
                    tuneLength = 9)
#svmModel
svmPred <- predict(svmModel, newdata = eval_data)
svm_metrics <- postResample(pred = svmPred, obs = eval_data$PH)
#svm_metrics

MARS

MARS features breaks the predictor into two groups, a “hinge” function of the original based on a cut point that achieves the smallest error, and models linear relationships between the predictor and the outcome in each group.

set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars <- train(PH~., data = train_data,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
#mars
marsPred <- predict(mars, newdata = eval_data)
mars_metrics <- postResample(pred = marsPred, obs = eval_data$PH)
#mars_metrics

Observed Vs. Predicted - Non - Linear Models with Reduced Predictor Set.

knnModel_pred <- knnModel %>% predict(eval_data)
# Model performance metrics
knn_Accuracy <- data.frame(
  Model = "k-Nearest Neighbors",
  RMSE = caret::RMSE(knnModel_pred,eval_data$PH),
  Rsquare = caret::R2(knnModel_pred,eval_data$PH))
pred_svm <- svmModel %>% predict(eval_data)
# Model SVM performance metrics
SMV_Acc <- data.frame(
  Model = "Support Vector Machine",
  RMSE = caret::RMSE(pred_svm, eval_data$PH),
  Rsquare = caret::R2(pred_svm, eval_data$PH)
)

# Make MARS predictions
pred_mars <- mars %>% predict(eval_data)
# Model MARS performance metrics
MARS_Acc <- data.frame(
  Model = "MARS Tuned",
  RMSE = caret::RMSE(pred_mars, eval_data$PH),
  Rsquare = caret::R2(pred_mars, eval_data$PH)
)
names(MARS_Acc)[names(MARS_Acc) == 'y'] <- "Rsquare"

### code for the plot
par(mar = c(4, 4, 4, 4))
par(mfrow=c(2,2))
plot(knnModel_pred, eval_data$PH, ylab="Observed", col = "red")
abline(0, 1, lwd=2)
plot(pred_svm, eval_data$PH, ylab="Observed", col = "dark green")
abline(0, 1, lwd=2)
plot(pred_mars, eval_data$PH, ylab="Observed", col = "blue")
abline(0, 1, lwd=2)
mtext("Observed Vs. Predicted  - Non - Linear Models with Reduced Predictor Set", side = 3, line = -2, outer = TRUE)

Non-Linear Model Metrics

rbind( "KNN" = knn_metrics,
       "SVM" = svm_metrics,
       "MARS" = mars_metrics) 
##            RMSE  Rsquared        MAE
## KNN  0.01410296 0.5268104 0.01069230
## SVM  0.01296496 0.5982959 0.00923041
## MARS 0.01443122 0.5026712 0.01099409

The top predictors in our best performing non-linear model with Support Vector Machine (SVM).

Tree based Models

Decision Tree

The next model we developed was a decision tree. Decision trees are capable of capturing nonlinear relationships as well as feature importance. They are highly interpretable, but are not as accurate as other tree based methods. The model we created has a maximum depth of 12 leaves and a minimum split of 30 observations. The results and decision tree structure are presented below.

colnames(train_data) <- make.names(colnames(train_data))
colnames(eval_data) <- make.names(colnames(eval_data))

y_train <- subset(train_data, select = -c(PH))
y_test <- subset(eval_data, select = -c(PH))

rpart.model <- train(PH~.,
                   data = train_data,
                   method = "rpart",
                   tuneLength = 200,
                   control = rpart.control(maxdepth = 12, minsplit = 30),
                   trControl=trainControl(method = "cv", number = 10))


rpart.pred <- predict(rpart.model, newdata = eval_data)

rpart.metrics <- postResample(pred = rpart.pred, obs = eval_data$PH)

rpart.metrics
##       RMSE   Rsquared        MAE 
## 0.01407714 0.54268663 0.01029654
rpart.plot(rpart.model$finalModel)

XGBoost

The last model we created was an XGBoost tree model. XGBoost is a ensemble algorithm, in which many decision trees are combined into a single model to improve overall accuracy. The downside of ensemble methods are that they are more difficult to interpret compared to individual models. The optimal XGBoost model was found via a grid search where the best results occurred with 1000 decision trees, a learning rate of 0.05, a maximum tree depth of 5, and a gamma of 0. The results of the model can be seen below.

set.seed(100)




hyperparameters <- expand.grid(
  nrounds = 1000,
  eta = 0.05,
  max_depth = 5,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)


#fit XGBoost model and display training and testing data at each iteration


xgb.model <- train(PH~.,
                   data = train_data,
                   method = "xgbTree",
                   tuneGrid = hyperparameters,
                   trControl = trainControl(method = "cv", number = 10),
                   verbosity = 0)


xgb.pred <- predict(xgb.model, newdata = eval_data)

xgb.metrics <- postResample(pred = xgb.pred, obs = eval_data$PH)

xgb.metrics
##        RMSE    Rsquared         MAE 
## 0.011592453 0.678232868 0.008349152
x = 1:nrow(y_test)                   # visualize the model, actual and predicted data
plot(x, xgb.pred, col = "red", type = "l")
lines(x, eval_data$PH, col = "blue", type = "l")
legend(x = 1, y = 38,  legend = c("original y_test", "predicted y_test"), 
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))

Evaluate & Select Models

Next, the metrics for each of the best performing Linear, Non-Linear, and Tree based models are complied to evaluate the performance of each and select the best model.

The RMSE, \(R^2\), and MAE were recorded for each model presented above to be compared to find the optimal model to predict PH. The linear models produced the worst results, nonlinear produced middle of the road results, and the tree based models achieved the best performance metrics. The best of the best model is the XGBoost with the lowest RMSE (0.10), best \(R^2\) (0.66), and lowest MAE (0.07).

rbind(
  mars_metrics,
  knn_metrics,
  svm_metrics,
  rpart.metrics,
  xgb.metrics)
##                     RMSE  Rsquared         MAE
## mars_metrics  0.01443122 0.5026712 0.010994095
## knn_metrics   0.01410296 0.5268104 0.010692299
## svm_metrics   0.01296496 0.5982959 0.009230410
## rpart.metrics 0.01407714 0.5426866 0.010296535
## xgb.metrics   0.01159245 0.6782329 0.008349152

Variable Importance

The variables that are most impactful for predicting PH are Mnf.Flow, Usage.cont, and Oxygen.Filter. The data table exhibits all the variables importance to the XGBoost model. The plot showcases the top 10 contributors to PH predictive power.

importance_matrix = xgb.importance(data = train_data, model = xgb.model$finalModel)
## Warning in xgb.importance(data = train_data, model = xgb.model$finalModel):
## xgb.importance: parameters 'data', 'label' and 'target' are deprecated
importance_matrix
##               Feature        Gain       Cover   Frequency
##  1:          Mnf.Flow 0.090764650 0.057847898 0.041183604
##  2:        Usage.cont 0.073691131 0.051414743 0.051515886
##  3:     Oxygen.Filler 0.069929482 0.053250823 0.051321853
##  4:      Filler.Speed 0.060896230 0.061880612 0.057142857
##  5:           Density 0.055382737 0.046687133 0.039631336
##  6:     Air.Pressurer 0.053122417 0.021980742 0.031336406
##  7:   Pressure.Vacuum 0.052936328 0.046811866 0.035168567
##  8:      Filler.Level 0.050772493 0.050162577 0.044821732
##  9:       Temperature 0.050210750 0.034198783 0.033907349
## 10:       Brand.CodeC 0.048456593 0.006668869 0.008100897
## 11:    Carb.Pressure1 0.044816360 0.062124384 0.050254669
## 12:         Carb.Flow 0.039835555 0.068796798 0.054765947
## 13:          Carb.Rel 0.036243622 0.032272672 0.030026680
## 14:         PC.Volume 0.031626245 0.065659455 0.062672811
## 15:       Carb.Volume 0.030590650 0.026348764 0.059616784
## 16:       Fill.Ounces 0.026087850 0.036814203 0.058210041
## 17:     Hyd.Pressure2 0.024254686 0.040538146 0.032694640
## 18:       Brand.CodeD 0.020456094 0.007071217 0.007179238
## 19:          PSC.Fill 0.019870716 0.028172596 0.033470774
## 20:     Fill.Pressure 0.019474674 0.039478291 0.033228232
## 21:               PSC 0.019268703 0.052298188 0.046034441
## 22:     Hyd.Pressure4 0.018065634 0.018877671 0.027940820
## 23:     Carb.Pressure 0.016709273 0.034344252 0.036429784
## 24:         Carb.Temp 0.015247775 0.032394505 0.034537958
## 25:       Brand.CodeB 0.012020200 0.002091347 0.013291293
## 26: Pressure.Setpoint 0.010084745 0.006564119 0.004753820
## 27:           PSC.CO2 0.009184408 0.015249345 0.020761581
##               Feature        Gain       Cover   Frequency
xgb.plot.importance(importance_matrix[1:10,])

Let’s check individually what each Brand code would come back with for RSqaured and RMSE

for (i in unique(train_data$Brand.Code)) {


xgb.pred <- predict(xgb.model, newdata = eval_data %>% filter(Brand.Code == i))

d <- eval_data %>% filter(Brand.Code == i)

xgb.metrics <- postResample(pred = exp(xgb.pred), obs = exp(d$PH) )

print(i)
print(xgb.metrics)

}
## [1] "B"
##       RMSE   Rsquared        MAE 
## 0.08658230 0.74858830 0.06580456 
## [1] "A"
##       RMSE   Rsquared        MAE 
## 0.10487783 0.53631742 0.08042107 
## [1] "C"
##      RMSE  Rsquared       MAE 
## 0.1543169 0.2629705 0.1051827 
## [1] "D"
##       RMSE   Rsquared        MAE 
## 0.07831330 0.67088021 0.06067395

Predict PH Values

Next, using the SVM model which was identified as the best performing model above, the PH values are predicted with the provided Test data set.

Next, the predictions for PH are generated using the predict function along with the optimal model identified earlier as the XGBoost model.

colnames(test_features) <- make.names(colnames(test_features))
predict <- round(predict(xgb.model, newdata=test_features),2)

pH <- as.data.frame(predict)
pH <- rename(pH, pH = predict)
pH$pH <- exp(pH$pH)

Generate Excel file

Finally, the predicted PH values along with the predictor variables are exported to a CSV file using write_excel_csv.

write_excel_csv(pH, "Predictions.csv")