Introduction

The goal of this project is to develop a model that predicts the pH of our products and to understand which factors drive that prediction. To do this, we will explore several modeling techniques. Because pH is a continuous variable, we will consider the following regression models:

  1. Linear Regression
  2. Decision Tree - Single Tree
  3. Random Forest
  4. Support Vector Machine
  5. Neural Network

Each of these has its own strengths and weaknesses. Because one of our main objectives is to explain the manufacturing process, we anticipate that one of our decision criteria will be how well we can explain the model and the relationship between the explanatory variables and the response variable.

Setup

We begin by importing the libraries used to explore the data and to build our machine learning models. The caret library will be the primary package we use to build those models.

Import and clean data

Next, we import the data that will be used to train and evaluate our models. One of the first pre-processing steps is to standardize the variable names in our dataframe.

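# Read the raw data from the Excel workbook (read_excel() comes from readxl)
sample_data_raw <- read_excel('StudentData.xlsx')

# Standardize the column names (clean_names() comes from janitor) and store as a tibble
sample_data <- clean_names(sample_data_raw)
sample_data <- as_tibble(sample_data)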
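
To make the name standardization step concrete, the sketch below shows roughly what it does to column names. It is a simplified base R approximation written for illustration, not janitor's actual implementation, and the function name standardize_names and the raw column names in the toy data frame are hypothetical.

```r
# Simplified approximation of column-name standardization (illustration only)
standardize_names <- function(x) {
  x <- tolower(trimws(x))            # lower-case and trim surrounding spaces
  x <- gsub("[^a-z0-9]+", "_", x)    # runs of other characters become one underscore
  gsub("^_+|_+$", "", x)             # drop leading and trailing underscores
}

# Hypothetical raw column names, as they might appear in a spreadsheet
toy <- data.frame(
  `Brand Code` = c("A", "B"),
  `Carb Volume` = c(5.34, 5.43),
  `PH` = c(8.36, 8.26),
  check.names = FALSE
)

names(toy) <- standardize_names(names(toy))
print(names(toy))
```

Consistent lower-case, underscore-separated names let every later step refer to columns such as carb_volume without backticks or quoting.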

Data Exploration

We will begin by looking at the data to understand which variables are present and how they are distributed. The main things we will explore are:

  1. Missing values
  2. Shape and distribution of the response variable
  3. Shape and distribution of the explanatory variables
  4. Any outliers within the data
  5. Correlation between explanatory variables and the response variable
  6. Correlation between explanatory variables

sample_data %>% summary()
##   brand_code         carb_volume     fill_ounces      pc_volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  carb_pressure     carb_temp          psc             psc_fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     psc_co2           mnf_flow       carb_pressure1  fill_pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  hyd_pressure1   hyd_pressure2   hyd_pressure3   hyd_pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   filler_level    filler_speed   temperature      usage_cont      carb_flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     density           mfr           balling       pressure_vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        ph        oxygen_filler     bowl_setpoint   pressure_setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  air_pressurer      alch_rel        carb_rel      balling_lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
sample_data %>% str()
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
##  $ brand_code       : chr [1:2571] "B" "A" "B" "A" ...
##  $ carb_volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
##  $ fill_ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
##  $ pc_volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
##  $ carb_pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ carb_temp        : num [1:2571] 141 140 145 133 137 ...
##  $ psc              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ psc_fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
##  $ psc_co2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
##  $ mnf_flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ carb_pressure1   : num [1:2571] 119 122 120 115 118 ...
##  $ fill_pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ hyd_pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hyd_pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ hyd_pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ hyd_pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
##  $ filler_level     : num [1:2571] 121 119 120 118 119 ...
##  $ filler_speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
##  $ temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ usage_cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
##  $ carb_flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
##  $ density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ mfr              : num [1:2571] 725 727 735 731 723 ...
##  $ balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
##  $ pressure_vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ ph               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ oxygen_filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ bowl_setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
##  $ pressure_setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ air_pressurer    : num [1:2571] 143 143 142 146 146 ...
##  $ alch_rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ carb_rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ balling_lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
sample_data %>% count(brand_code)
## # A tibble: 5 × 2
##   brand_code     n
##   <chr>      <int>
## 1 A            293
## 2 B           1239
## 3 C            304
## 4 D            615
## 5 <NA>         120

Our initial view of the data shows that all of our explanatory variables are numerical, with the exception of brand_code. There are 2,571 records and 33 variables (including our response variable) in the dataset. Among the numerical variables, only mfr has a substantial number of missing values: 212, or 8.2% of the records. Outside of that variable, the largest number of missing values is 57 (filler_speed), or 2.2% of the records. Finally, brand_code has 120 missing values, or 4.7% of the records. We will have to decide how to handle these missing values later.
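
The missing-value percentages are simple ratios of the missing counts to the 2,571 records. The sketch below recomputes them from the counts reported in the summary() output, then shows one way to tabulate missing values for every column at once. The helper name missing_summary and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
# Missing-value counts taken from the summary() output above
n_records <- 2571
missing_counts <- c(mfr = 212, filler_speed = 57, brand_code = 120)

# Share of records with a missing value, in percent, rounded to one decimal
missing_pct <- round(100 * missing_counts / n_records, 1)
print(missing_pct)

# Hypothetical helper: count and percentage of missing values per column
missing_summary <- function(df) {
  data.frame(
    variable = names(df),
    n_missing = colSums(is.na(df)),
    pct_missing = round(100 * colMeans(is.na(df)), 1),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for sample_data
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", NA, NA, "y"))
print(missing_summary(toy))
```

Calling missing_summary(sample_data) and sorting by pct_missing would give the same ranking described in the text, without reading the counts off the summary by hand.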

In the following code, we look at the distribution of each numerical variable to see whether any of them has a problematic distribution.

# Show a histogram of each numeric variable
numeric_columns <- sample_data %>% select_if(is.numeric) %>% colnames()

# Confirm the first numeric column
numeric_columns[[1]]
## [1] "carb_volume"
for (col in numeric_columns) {

  # aes_string() is deprecated, so the column is selected by name with .data[[col]].
  # bins is set explicitly, and na.rm = TRUE drops missing values without a warning.
  plt <- sample_data %>%
    ggplot(aes(x = .data[[col]])) +
    geom_histogram(bins = 30, color = "black", fill = "white", na.rm = TRUE) +
    labs(x = col)
  print(plt)

}

[Figures: 32 histograms, one per numeric variable, in column order from carb_volume to balling_lvl.]

One of the first, and most important, observations is that the ph variable appears to be approximately normally distributed. Based on this, we do not expect to need any special adjustments or transformations for that variable. However, there are several other variables that we may want to adjust. Notable candidates include:

  1. psc
  2. psc_fill
  3. psc_co2
  4. mnf_flow
  5. hyd_pressure1
  6. hyd_pressure2
  7. hyd_pressure3
  8. filler_level
  9. filler_speed
  10. temperature
  11. usage_cont
  12. carb_flow
  13. density
  14. mfr
  15. balling
  16. pressure_vacuum
  17. oxygen_filler
  18. bowl_setpoint
  19. pressure_setpoint
  20. air_pressurer
  21. alch_rel
  22. carb_rel
  23. balling_lvl
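
The list above was assembled by eye from the histograms. A numeric screen can complement that judgment. The sketch below is an assumption rather than the method used in this analysis: it computes the moment-based sample skewness of each column and flags columns whose absolute skewness exceeds 1, a common rule of thumb. The function name, the threshold, and the toy data are all hypothetical.

```r
# Moment-based sample skewness; missing values are ignored
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # population-style standard deviation
  mean((x - m)^3) / s^3
}

# Toy data standing in for the numeric columns of sample_data
toy <- data.frame(
  symmetric = c(-2, -1, 0, 1, 2),
  right_skewed = c(1, 1, 1, 1, 10)
)

skew <- sapply(toy, skewness)
print(round(skew, 2))

# Assumed rule of thumb: |skewness| > 1 marks a candidate for transformation
flagged <- names(skew)[abs(skew) > 1]
print(flagged)
```

Skewness alone will not catch every problem. A bimodal variable can have skewness near zero, so the histograms remain the primary check.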

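Next, we look at the correlation between the numerical variables. Rows with any missing value are dropped first, so the correlations are computed on complete cases only.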
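
A correlation matrix over 32 variables is hard to scan for the strongest relationships. As a companion to the full matrix printed below, the following sketch lists only the variable pairs whose absolute correlation meets a threshold. The function name top_correlations, the 0.8 threshold, and the toy data are assumptions for illustration. It also uses pairwise complete observations, which differs from the complete-case approach used for the full matrix.

```r
# Hypothetical helper: list variable pairs with |r| at or above a threshold
top_correlations <- function(df, threshold = 0.8) {
  num <- df[sapply(df, is.numeric)]
  cm <- cor(num, use = "pairwise.complete.obs")

  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)

  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Toy data: y is almost a linear function of x, z is only weakly related
x <- 1:10
toy <- data.frame(
  x = x,
  y = 2 * x + rep(c(0.1, -0.1), 5),
  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

pairs_found <- top_correlations(toy, threshold = 0.8)
print(pairs_found)
```

Applied to sample_data, a listing like this would surface pairs such as balling and balling_lvl, which the full matrix shows to be almost perfectly correlated.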

sample_data %>% 
  na.omit() %>%
  select_if(is.numeric) %>%
  cor()
##                    carb_volume  fill_ounces    pc_volume carb_pressure
## carb_volume        1.000000000  0.022648037 -0.243234517  0.4278586099
## fill_ounces        0.022648037  1.000000000 -0.172701967  0.0044669914
## pc_volume         -0.243234517 -0.172701967  1.000000000 -0.1300941805
## carb_pressure      0.427858610  0.004466991 -0.130094181  1.0000000000
## carb_temp         -0.135080260 -0.020073098  0.008599427  0.8243350333
## psc               -0.020625187  0.043056041  0.190961265 -0.0520086907
## psc_fill          -0.012364141  0.069748652 -0.055914847 -0.0179275381
## psc_co2           -0.051977547  0.014705610 -0.055356932 -0.0174011136
## mnf_flow           0.100525364 -0.005787251 -0.203032882  0.0189865815
## carb_pressure1     0.081146212 -0.004186327 -0.238496181  0.0221844474
## fill_pressure     -0.096123314  0.068377098 -0.095744421 -0.0850458606
## hyd_pressure1     -0.051894268 -0.146359721  0.271171741 -0.0704043564
## hyd_pressure2      0.033567314 -0.143047986  0.028768135 -0.0267152675
## hyd_pressure3      0.057610071 -0.107816453 -0.024741466 -0.0082140513
## hyd_pressure4     -0.567908627  0.056086556  0.132413786 -0.3229128556
## filler_level      -0.018377226  0.010856354  0.220434988  0.0097134297
## filler_speed       0.003233452  0.044277046 -0.073766021  0.0320230919
## temperature       -0.189760851 -0.011208233  0.142065435 -0.0490025241
## usage_cont         0.073297402  0.085289046 -0.269204704  0.0004415267
## carb_flow         -0.088782919 -0.116991410  0.254535534 -0.0111938869
## density            0.801808803 -0.080414770 -0.167310345  0.4591743002
## mfr                0.010144634  0.043446000 -0.066678296  0.0292986178
## balling            0.818778791 -0.068439337 -0.204785622  0.4539307252
## pressure_vacuum   -0.079190180  0.052165822 -0.051605143  0.0051279147
## ph                 0.073251595 -0.118207504  0.097746859  0.0788617427
## oxygen_filler     -0.112682435 -0.032064259  0.212285069 -0.0506609705
## bowl_setpoint      0.001993607  0.011188521  0.210972310  0.0187581570
## pressure_setpoint -0.160640146  0.065078231 -0.022210124 -0.1171228371
## air_pressurer     -0.024854172  0.058101102 -0.036605483  0.0219237613
## alch_rel           0.824787575 -0.110432476 -0.177865067  0.4590380468
## carb_rel           0.841072250 -0.118666585 -0.167433257  0.4736224720
## balling_lvl        0.817993347 -0.063820071 -0.210081156  0.4589466196
##                      carb_temp          psc     psc_fill      psc_co2
## carb_volume       -0.135080260 -0.020625187 -0.012364141 -0.051977547
## fill_ounces       -0.020073098  0.043056041  0.069748652  0.014705610
## pc_volume          0.008599427  0.190961265 -0.055914847 -0.055356932
## carb_pressure      0.824335033 -0.052008691 -0.017927538 -0.017401114
## carb_temp          1.000000000 -0.044380293 -0.010968792  0.023660514
## psc               -0.044380293  1.000000000  0.193242059  0.053922116
## psc_fill          -0.010968792  0.193242059  1.000000000  0.201967429
## psc_co2            0.023660514  0.053922116  0.201967429  1.000000000
## mnf_flow          -0.030110321  0.042982180 -0.033801564  0.049931986
## carb_pressure1    -0.014312694 -0.011071066 -0.029772433  0.038316144
## fill_pressure     -0.028433189  0.030908100 -0.010528126  0.078519840
## hyd_pressure1     -0.045431217 -0.019758828 -0.053220514 -0.049037869
## hyd_pressure2     -0.039543090 -0.033813839 -0.080467321 -0.021838338
## hyd_pressure3     -0.035120030 -0.020448188 -0.068958062  0.005757630
## hyd_pressure4     -0.002573081  0.036485173  0.019810466  0.054940092
## filler_level       0.011038984 -0.016061814  0.048227359 -0.042306839
## filler_speed       0.029355933 -0.002395360 -0.017295907 -0.008421406
## temperature        0.061914927  0.014217909  0.030080929  0.037007308
## usage_cont        -0.043584359  0.049860720 -0.025727952  0.032502663
## carb_flow          0.045050619 -0.037361416  0.017579095 -0.013058433
## density            0.021593836 -0.067168725 -0.024708871 -0.045382139
## mfr                0.023442657  0.007659111 -0.008663604 -0.004263878
## balling            0.006005105 -0.064450826 -0.019108545 -0.043036807
## pressure_vacuum    0.037327836  0.055246194  0.033093793 -0.018438819
## ph                 0.034594375 -0.069906559 -0.031374041 -0.088759922
## oxygen_filler      0.012010201  0.002765929 -0.005769535 -0.016971816
## bowl_setpoint      0.007550616 -0.020808347  0.049191096 -0.044820225
## pressure_setpoint -0.027029545  0.042979144 -0.005800041  0.082772777
## air_pressurer      0.038177076  0.061433424 -0.017995701 -0.006984004
## alch_rel           0.004966509 -0.060809035 -0.017045886 -0.050859393
## carb_rel           0.009020802 -0.065354969 -0.015365172 -0.066475871
## balling_lvl        0.010458674 -0.061576860 -0.016064731 -0.047296997
##                       mnf_flow carb_pressure1 fill_pressure hyd_pressure1
## carb_volume        0.100525364    0.081146212   -0.09612331  -0.051894268
## fill_ounces       -0.005787251   -0.004186327    0.06837710  -0.146359721
## pc_volume         -0.203032882   -0.238496181   -0.09574442   0.271171741
## carb_pressure      0.018986582    0.022184447   -0.08504586  -0.070404356
## carb_temp         -0.030110321   -0.014312694   -0.02843319  -0.045431217
## psc                0.042982180   -0.011071066    0.03090810  -0.019758828
## psc_fill          -0.033801564   -0.029772433   -0.01052813  -0.053220514
## psc_co2            0.049931986    0.038316144    0.07851984  -0.049037869
## mnf_flow           1.000000000    0.465355315    0.49111329   0.298997690
## carb_pressure1     0.465355315    1.000000000    0.22002705  -0.012711716
## fill_pressure      0.491113286    0.220027054    1.00000000   0.137594249
## hyd_pressure1      0.298997690   -0.012711716    0.13759425   1.000000000
## hyd_pressure2      0.648337926    0.284710255    0.31836940   0.711851345
## hyd_pressure3      0.764142833    0.375658047    0.43637028   0.605589019
## hyd_pressure4      0.073124200    0.038936494    0.29144885   0.051658157
## filler_level      -0.608380668   -0.425646963   -0.45892889  -0.029684260
## filler_speed       0.024363769    0.012389362   -0.21347499  -0.022850088
## temperature       -0.063477350   -0.070067643    0.06397894  -0.073341745
## usage_cont         0.560192617    0.354875344    0.30257079   0.118691986
## carb_flow         -0.497211539   -0.273632871   -0.18715299  -0.169136801
## density            0.021343082    0.040888366   -0.20887500  -0.003601856
## mfr                0.019593355    0.013931715   -0.21290076  -0.035646540
## balling            0.112393982    0.073508154   -0.15676227   0.025863408
## pressure_vacuum   -0.559943315   -0.284881843   -0.30810832  -0.324133005
## ph                -0.457218313   -0.112108310   -0.31311751  -0.048631533
## oxygen_filler     -0.550066399   -0.278795209   -0.24769592  -0.118317983
## bowl_setpoint     -0.611580420   -0.432671459   -0.42609290  -0.029844050
## pressure_setpoint  0.482530739    0.234201877    0.81528556   0.190079720
## air_pressurer     -0.041525175    0.061506659    0.03280945  -0.209227461
## alch_rel           0.027999719    0.009047055   -0.20467008   0.006667887
## carb_rel          -0.027880871    0.004804776   -0.20272201   0.034387945
## balling_lvl        0.042267589    0.033601804   -0.17885789  -0.007754220
##                   hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## carb_volume          0.03356731   0.057610071  -0.567908627  -0.01837723
## fill_ounces         -0.14304799  -0.107816453   0.056086556   0.01085635
## pc_volume            0.02876814  -0.024741466   0.132413786   0.22043499
## carb_pressure       -0.02671527  -0.008214051  -0.322912856   0.00971343
## carb_temp           -0.03954309  -0.035120030  -0.002573081   0.01103898
## psc                 -0.03381384  -0.020448188   0.036485173  -0.01606181
## psc_fill            -0.08046732  -0.068958062   0.019810466   0.04822736
## psc_co2             -0.02183834   0.005757630   0.054940092  -0.04230684
## mnf_flow             0.64833793   0.764142833   0.073124200  -0.60838067
## carb_pressure1       0.28471026   0.375658047   0.038936494  -0.42564696
## fill_pressure        0.31836940   0.436370279   0.291448848  -0.45892889
## hyd_pressure1        0.71185135   0.605589019   0.051658157  -0.02968426
## hyd_pressure2        1.00000000   0.917661875   0.022980025  -0.41103521
## hyd_pressure3        0.91766187   1.000000000   0.046649319  -0.52796590
## hyd_pressure4        0.02298003   0.046649319   1.000000000  -0.10010344
## filler_level        -0.41103521  -0.527965898  -0.100103442   1.00000000
## filler_speed         0.07028157   0.031692930  -0.259685780   0.06319357
## temperature         -0.13725053  -0.120288087   0.283933984   0.05040323
## usage_cont           0.37224053   0.419761856   0.070916924  -0.35740823
## carb_flow           -0.32187768  -0.363691033  -0.009527157   0.01743320
## density              0.05842429   0.040354370  -0.584381995   0.00772950
## mfr                  0.04937645   0.012841039  -0.280014344   0.06935680
## balling              0.10480326   0.125869294  -0.594996647   0.01188558
## pressure_vacuum     -0.62816754  -0.672258326  -0.054652623   0.35866815
## ph                  -0.21571713  -0.260535183  -0.184518087   0.34879540
## oxygen_filler       -0.29438355  -0.363660021   0.009692254   0.22540536
## bowl_setpoint       -0.41457787  -0.529215855  -0.117019292   0.97724166
## pressure_setpoint    0.34183180   0.459918732   0.288770453  -0.44595292
## air_pressurer       -0.17237591  -0.067700529   0.021731381  -0.12573419
## alch_rel             0.03916606   0.049948952  -0.685576013   0.04355856
## carb_rel             0.03150549   0.013543248  -0.581827500   0.13107856
## balling_lvl          0.02885940   0.041595480  -0.575044299   0.05317923
##                   filler_speed temperature    usage_cont    carb_flow
## carb_volume        0.003233452 -0.18976085  0.0732974016 -0.088782919
## fill_ounces        0.044277046 -0.01120823  0.0852890459 -0.116991410
## pc_volume         -0.073766021  0.14206543 -0.2692047044  0.254535534
## carb_pressure      0.032023092 -0.04900252  0.0004415267 -0.011193887
## carb_temp          0.029355933  0.06191493 -0.0435843590  0.045050619
## psc               -0.002395360  0.01421791  0.0498607203 -0.037361416
## psc_fill          -0.017295907  0.03008093 -0.0257279520  0.017579095
## psc_co2           -0.008421406  0.03700731  0.0325026634 -0.013058433
## mnf_flow           0.024363769 -0.06347735  0.5601926171 -0.497211539
## carb_pressure1     0.012389362 -0.07006764  0.3548753438 -0.273632871
## fill_pressure     -0.213474995  0.06397894  0.3025707885 -0.187152994
## hyd_pressure1     -0.022850088 -0.07334175  0.1186919862 -0.169136801
## hyd_pressure2      0.070281568 -0.13725053  0.3722405332 -0.321877677
## hyd_pressure3      0.031692930 -0.12028809  0.4197618556 -0.363691033
## hyd_pressure4     -0.259685780  0.28393398  0.0709169237 -0.009527157
## filler_level       0.063193573  0.05040323 -0.3574082260  0.017433196
## filler_speed       1.000000000 -0.06794451  0.0466749382 -0.062641899
## temperature       -0.067944506  1.00000000 -0.0950428265  0.140119431
## usage_cont         0.046674938 -0.09504283  1.0000000000 -0.488854072
## carb_flow         -0.062641899  0.14011943 -0.4888540721  1.000000000
## density            0.025454377 -0.19369890 -0.0045531798  0.023058872
## mfr                0.951264445 -0.08156825  0.0397443802 -0.074148067
## balling            0.037938064 -0.23844781  0.0692694347 -0.102299824
## pressure_vacuum    0.045187804  0.04313148 -0.3196915871  0.290026894
## ph                -0.047596639 -0.18152910 -0.3577042377  0.240757744
## oxygen_filler     -0.044126285  0.09338308 -0.3164959408  0.385692082
## bowl_setpoint      0.027688288  0.05010937 -0.3584563273  0.012221618
## pressure_setpoint -0.068525700  0.05904194  0.2626733751 -0.166362628
## air_pressurer     -0.006199674  0.06692883 -0.1025056258  0.137849013
## alch_rel          -0.005598672 -0.24776232 -0.0237408078  0.007527013
## carb_rel           0.005300760 -0.13595194 -0.0390798757 -0.006940815
## balling_lvl        0.003823258 -0.22497504  0.0232069975 -0.055365859
##                        density          mfr      balling pressure_vacuum
## carb_volume        0.801808803  0.010144634  0.818778791    -0.079190180
## fill_ounces       -0.080414770  0.043446000 -0.068439337     0.052165822
## pc_volume         -0.167310345 -0.066678296 -0.204785622    -0.051605143
## carb_pressure      0.459174300  0.029298618  0.453930725     0.005127915
## carb_temp          0.021593836  0.023442657  0.006005105     0.037327836
## psc               -0.067168725  0.007659111 -0.064450826     0.055246194
## psc_fill          -0.024708871 -0.008663604 -0.019108545     0.033093793
## psc_co2           -0.045382139 -0.004263878 -0.043036807    -0.018438819
## mnf_flow           0.021343082  0.019593355  0.112393982    -0.559943315
## carb_pressure1     0.040888366  0.013931715  0.073508154    -0.284881843
## fill_pressure     -0.208875002 -0.212900763 -0.156762274    -0.308108315
## hyd_pressure1     -0.003601856 -0.035646540  0.025863408    -0.324133005
## hyd_pressure2      0.058424287  0.049376448  0.104803262    -0.628167539
## hyd_pressure3      0.040354370  0.012841039  0.125869294    -0.672258326
## hyd_pressure4     -0.584381995 -0.280014344 -0.594996647    -0.054652623
## filler_level       0.007729500  0.069356797  0.011885578     0.358668150
## filler_speed       0.025454377  0.951264445  0.037938064     0.045187804
## temperature       -0.193698895 -0.081568252 -0.238447810     0.043131483
## usage_cont        -0.004553180  0.039744380  0.069269435    -0.319691587
## carb_flow          0.023058872 -0.074148067 -0.102299824     0.290026894
## density            1.000000000  0.039383665  0.953127632    -0.083762802
## mfr                0.039383665  1.000000000  0.051171065     0.043987724
## balling            0.953127632  0.051171065  1.000000000    -0.167052546
## pressure_vacuum   -0.083762802  0.043987724 -0.167052546     1.000000000
## ph                 0.095340535 -0.052377217  0.072297357     0.213173584
## oxygen_filler     -0.046621720 -0.048876320 -0.113757056     0.278478283
## bowl_setpoint      0.019792536  0.042312091  0.027415802     0.358696671
## pressure_setpoint -0.250884909 -0.064390559 -0.210098742    -0.298520301
## air_pressurer     -0.082976446 -0.008182360 -0.102307854     0.175090738
## alch_rel           0.922993173 -0.001003949  0.945725964    -0.054788103
## carb_rel           0.860322233  0.010884909  0.858990647    -0.008434181
## balling_lvl        0.956417297  0.016039036  0.988412082    -0.051626926
##                             ph oxygen_filler bowl_setpoint pressure_setpoint
## carb_volume        0.073251595  -0.112682435   0.001993607      -0.160640146
## fill_ounces       -0.118207504  -0.032064259   0.011188521       0.065078231
## pc_volume          0.097746859   0.212285069   0.210972310      -0.022210124
## carb_pressure      0.078861743  -0.050660970   0.018758157      -0.117122837
## carb_temp          0.034594375   0.012010201   0.007550616      -0.027029545
## psc               -0.069906559   0.002765929  -0.020808347       0.042979144
## psc_fill          -0.031374041  -0.005769535   0.049191096      -0.005800041
## psc_co2           -0.088759922  -0.016971816  -0.044820225       0.082772777
## mnf_flow          -0.457218313  -0.550066399  -0.611580420       0.482530739
## carb_pressure1    -0.112108310  -0.278795209  -0.432671459       0.234201877
## fill_pressure     -0.313117513  -0.247695921  -0.426092896       0.815285564
## hyd_pressure1     -0.048631533  -0.118317983  -0.029844050       0.190079720
## hyd_pressure2     -0.215717133  -0.294383550  -0.414577866       0.341831797
## hyd_pressure3     -0.260535183  -0.363660021  -0.529215855       0.459918732
## hyd_pressure4     -0.184518087   0.009692254  -0.117019292       0.288770453
## filler_level       0.348795399   0.225405359   0.977241656      -0.445952917
## filler_speed      -0.047596639  -0.044126285   0.027688288      -0.068525700
## temperature       -0.181529097   0.093383081   0.050109366       0.059041937
## usage_cont        -0.357704238  -0.316495941  -0.358456327       0.262673375
## carb_flow          0.240757744   0.385692082   0.012221618      -0.166362628
## density            0.095340535  -0.046621720   0.019792536      -0.250884909
## mfr               -0.052377217  -0.048876320   0.042312091      -0.064390559
## balling            0.072297357  -0.113757056   0.027415802      -0.210098742
## pressure_vacuum    0.213173584   0.278478283   0.358696671      -0.298520301
## ph                 1.000000000   0.162016479   0.358504933      -0.316950863
## oxygen_filler      0.162016479   1.000000000   0.224734144      -0.241221158
## bowl_setpoint      0.358504933   0.224734144   1.000000000      -0.440555036
## pressure_setpoint -0.316950863  -0.241221158  -0.440555036       1.000000000
## air_pressurer     -0.008276544   0.116495267  -0.130939728       0.073805550
## alch_rel           0.156053375  -0.050111016   0.063332637      -0.268800293
## carb_rel           0.189575611  -0.039206163   0.152117160      -0.261257572
## balling_lvl        0.103279163  -0.074544984   0.070442468      -0.242343282
##                   air_pressurer     alch_rel     carb_rel  balling_lvl
## carb_volume        -0.024854172  0.824787575  0.841072250  0.817993347
## fill_ounces         0.058101102 -0.110432476 -0.118666585 -0.063820071
## pc_volume          -0.036605483 -0.177865067 -0.167433257 -0.210081156
## carb_pressure       0.021923761  0.459038047  0.473622472  0.458946620
## carb_temp           0.038177076  0.004966509  0.009020802  0.010458674
## psc                 0.061433424 -0.060809035 -0.065354969 -0.061576860
## psc_fill           -0.017995701 -0.017045886 -0.015365172 -0.016064731
## psc_co2            -0.006984004 -0.050859393 -0.066475871 -0.047296997
## mnf_flow           -0.041525175  0.027999719 -0.027880871  0.042267589
## carb_pressure1      0.061506659  0.009047055  0.004804776  0.033601804
## fill_pressure       0.032809452 -0.204670077 -0.202722006 -0.178857893
## hyd_pressure1      -0.209227461  0.006667887  0.034387945 -0.007754220
## hyd_pressure2      -0.172375910  0.039166055  0.031505491  0.028859401
## hyd_pressure3      -0.067700529  0.049948952  0.013543248  0.041595480
## hyd_pressure4       0.021731381 -0.685576013 -0.581827500 -0.575044299
## filler_level       -0.125734192  0.043558564  0.131078561  0.053179225
## filler_speed       -0.006199674 -0.005598672  0.005300760  0.003823258
## temperature         0.066928829 -0.247762315 -0.135951942 -0.224975043
## usage_cont         -0.102505626 -0.023740808 -0.039079876  0.023206997
## carb_flow           0.137849013  0.007527013 -0.006940815 -0.055365859
## density            -0.082976446  0.922993173  0.860322233  0.956417297
## mfr                -0.008182360 -0.001003949  0.010884909  0.016039036
## balling            -0.102307854  0.945725964  0.858990647  0.988412082
## pressure_vacuum     0.175090738 -0.054788103 -0.008434181 -0.051626926
## ph                 -0.008276544  0.156053375  0.189575611  0.103279163
## oxygen_filler       0.116495267 -0.050111016 -0.039206163 -0.074544984
## bowl_setpoint      -0.130939728  0.063332637  0.152117160  0.070442468
## pressure_setpoint   0.073805550 -0.268800293 -0.261257572 -0.242343282
## air_pressurer       1.000000000 -0.082451451 -0.107386102 -0.085434755
## alch_rel           -0.082451451  1.000000000  0.882791450  0.947343341
## carb_rel           -0.107386102  0.882791450  1.000000000  0.872148303
## balling_lvl        -0.085434755  0.947343341  0.872148303  1.000000000

When we look at the correlation between the target variable (ph) and the explanatory variables, none of the explanatory variables is strongly correlated with the response variable. However, the following variables show relatively higher levels of correlation with ph (absolute values between roughly 0.31 and 0.46):

  1. mnf_flow
  2. fill_pressure
  3. filler_level
  4. usage_cont
  5. bowl_setpoint
  6. pressure_setpoint
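
Reading one column of a large correlation matrix by eye is error-prone, so the ranking can also be produced programmatically. The following is a minimal sketch, not part of the original analysis: `rank_by_correlation` is a hypothetical helper, and the toy data frame is synthetic and only stands in for the real data.

```r
# Hypothetical helper: rank numeric predictors by |correlation| with a target column.
rank_by_correlation <- function(df, target) {
  num <- df[sapply(df, is.numeric)]                 # keep numeric columns only
  cors <- cor(num, use = "pairwise.complete.obs")   # tolerate missing values
  r <- cors[, target]
  r <- r[names(r) != target]                        # drop the target itself
  r[order(abs(r), decreasing = TRUE)]               # strongest relationship first
}

# Synthetic toy data for illustration only (not the StudentData values)
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$ph <- 8.5 - 0.5 * toy$x1 + rnorm(100, sd = 0.5)

res <- rank_by_correlation(toy, "ph")
res
```

Assuming the data frame from this report, the equivalent call would be `rank_by_correlation(sample_data, "ph")`.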

Next we will do some exploratory analysis of the categorical variable brand_code. We will look at the count and the distribution of the variable. Additionally, we will look at how ph varies across the brand codes.

sample_data %>% count(brand_code)
## # A tibble: 5 × 2
##   brand_code     n
##   <chr>      <int>
## 1 A            293
## 2 B           1239
## 3 C            304
## 4 D            615
## 5 <NA>         120
sample_data %>% select(brand_code) %>% table() %>% prop.table()
## brand_code
##        A        B        C        D 
## 0.119543 0.505508 0.124031 0.250918
sample_data %>% 
  group_by(brand_code) %>%
  summarize(median_ph = median(ph, na.rm=TRUE),
            mean_ph = mean(ph, na.rm=TRUE),
            sd_ph = sd(ph, na.rm=TRUE))
## # A tibble: 5 × 4
##   brand_code median_ph mean_ph sd_ph
##   <chr>          <dbl>   <dbl> <dbl>
## 1 A               8.52    8.50 0.163
## 2 B               8.56    8.57 0.169
## 3 C               8.42    8.41 0.177
## 4 D               8.62    8.60 0.135
## 5 <NA>            8.51    8.49 0.177
sample_data %>%
  na.omit() %>%
  ggplot(aes(x=brand_code, y=ph)) +
  geom_boxplot()

sample_data %>%
  mutate(brand_code = ifelse(is.na(brand_code), "A",brand_code)) %>%
  na.omit() %>%
  ggplot(aes(x=brand_code, y=ph)) +
  geom_boxplot()

Some of the observations that we have when looking at the brand code are as follows. As mentioned earlier, about 4.7% of the records (120 of 2,571) have no brand code. Among those that do have a brand code, about 50% are brand code B and 25% are brand code D, with the remaining records split relatively evenly between brand codes A and C.

Finally, when we look at the boxplot of the data - along with the summary metrics grouped by brand code - we see some evidence to suggest that there are meaningful differences in the ph levels amongst the groups, with the median ph being 8.52 for Group A, 8.56 for Group B, 8.42 for Group C and 8.62 for Group D.
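
Whether differences like these are larger than what noise alone would produce can be checked with a formal test. The following is a minimal sketch under stated assumptions: the data frame `toy` is synthetic (its group means and spreads are made up to loosely resemble a brand effect), and neither test was run in the original analysis.

```r
# Synthetic toy data for illustration only (not the StudentData values)
set.seed(42)
toy <- data.frame(
  brand_code = factor(rep(c("A", "B", "C", "D"), each = 50)),
  ph = c(rnorm(50, 8.50, 0.16), rnorm(50, 8.57, 0.17),
         rnorm(50, 8.41, 0.18), rnorm(50, 8.60, 0.14))
)

# One-way ANOVA: does mean ph differ across brand codes?
fit <- aov(ph ~ brand_code, data = toy)
summary(fit)

# Rank-based alternative that does not assume normally distributed residuals
kw <- kruskal.test(ph ~ brand_code, data = toy)
kw$p.value
```

A small p-value from either test would support treating brand code as an informative predictor; on the real data the same calls would take `sample_data` in place of `toy`.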

Data Preparation

Now that we’ve done our initial exploratory analysis of the data, we will focus on making the following adjustments to our data:

  1. Change records with NA brand code to brand code “A” - After reviewing the median and mean ph by brand code, we find that the values for the NA records are closest to Brand A. We explored the impact on the Brand A metrics of assigning the NA records to Brand Code A and found that there would be minimal change to the data. Thus, we are going to convert the NA brand code records to Brand Code A

  2. Change the Brand code to a factor variable

  3. Since only a limited number of NA records remain in the data (now that we’ve accounted for the blanks in the Brand Code variable), we will exclude the records with NA values from our overall dataset

#Set NA brand codes to Brand code A
sample_data <- sample_data %>% 
  mutate(brand_code = ifelse(is.na(brand_code), "A",brand_code))

#Change brand code variable to factor
sample_data_mod <- sample_data %>% mutate(brand_code = factor(brand_code, levels = sort(unique(brand_code))))
  
#Remove NA values from the dataset
sample_data_mod <- sample_data_mod %>% na.omit()
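
Before dropping rows, it can help to confirm how many missing values each column has and how many rows `na.omit()` would remove. The following is a minimal sketch; `na_counts` is a hypothetical helper and the toy data frame is synthetic.

```r
# Hypothetical helper: count missing values per column, largest first
na_counts <- function(df) {
  n <- colSums(is.na(df))
  sort(n[n > 0], decreasing = TRUE)
}

# Synthetic toy data for illustration only
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 1, 2), c = 1:4)

na_counts(toy)                              # columns with at least one NA
dropped <- nrow(toy) - nrow(na.omit(toy))   # rows that na.omit() would remove
dropped
```

Running the same two checks on the real data frame before the `na.omit()` step would show whether the share of discarded rows is as small as expected.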

Model Building & Evaluation

Setup training data

We will split the data into training and test sets for use in fitting and testing our models. Additionally, since our data exploration showed evidence that the brand of product appears to be a source of distinction in the ph, we will also upsample the training data so that we train the model on a balanced set of data that includes an equal number of examples from each brand code.

set.seed(1234)

index <- createDataPartition(sample_data_mod$brand_code, p=0.75, list=FALSE)

train <- sample_data_mod[index,]
test <- sample_data_mod[-index,]

trainX <- train %>% select(-ph)
trainY <- train %>% select(ph)

testX <- test %>% select(-ph)
testY <- test %>% select(ph)

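set.seed(111)

#Upsample the training data so that each brand code is equally represented
#Note: train[,-ncol(train)] drops the last column of train, so that column is not present in train_up
train_up <- upSample(x=train[,-ncol(train)],y=train$brand_code)
#upSample appends the class labels as a column named Class; drop it since brand_code is already in x
train_up <- train_up %>% select(-Class)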
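
To make the upsampling step concrete, the general idea can be written in a few lines of base R: every class keeps its original rows and gains extra rows drawn with replacement until it matches the largest class. This is an illustrative sketch only, not caret's implementation; `upsample_by` and the toy data are hypothetical.

```r
# Hypothetical base-R sketch of upsampling by a class column
upsample_by <- function(df, class_col) {
  groups <- split(seq_len(nrow(df)), df[[class_col]])   # row indices per class
  target <- max(lengths(groups))                        # size of the largest class
  idx <- unlist(lapply(groups, function(i) {
    extra <- target - length(i)
    # keep the original rows, then add draws with replacement
    c(i, i[sample.int(length(i), extra, replace = TRUE)])
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

# Synthetic toy data for illustration only
set.seed(111)
toy <- data.frame(brand_code = c("A", "B", "B", "B", "C", "C"), ph = 1:6)
balanced <- upsample_by(toy, "brand_code")
table(balanced$brand_code)
```

Because the added rows are duplicates, upsampling should be applied to the training set only, so that no duplicated record ends up in the data used for evaluation.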

Linear Regression Model

We will use a multiple Linear regression model as the first model that we train and evaluate. After creating and reviewing the performance summary of our first model, we will then use backward elimination to remove parameters that are not statistically significant to our model in order to generate a simpler model and also, hopefully see improvements in our adjusted R-Squared along the way.

model_lm1 <- lm(ph ~ ., data=train_up)
summary(model_lm1)
## 
## Call:
## lm(formula = ph ~ ., data = train_up)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53025 -0.07647  0.00549  0.09279  0.40321 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.203e+01  1.106e+00  10.873  < 2e-16 ***
## brand_codeB        7.980e-02  1.147e-02   6.957 4.23e-12 ***
## brand_codeC       -8.392e-02  1.054e-02  -7.965 2.30e-15 ***
## brand_codeD        2.981e-03  1.682e-02   0.177 0.859330    
## carb_volume       -1.788e-01  9.225e-02  -1.938 0.052705 .  
## fill_ounces       -1.713e-01  3.029e-02  -5.655 1.70e-08 ***
## pc_volume         -8.337e-02  5.433e-02  -1.535 0.125003    
## carb_pressure      2.390e-03  4.575e-03   0.522 0.601381    
## carb_temp         -1.579e-03  3.552e-03  -0.445 0.656560    
## psc                8.043e-02  5.273e-02   1.525 0.127274    
## psc_fill          -5.894e-02  2.032e-02  -2.901 0.003751 ** 
## psc_co2           -7.428e-02  5.831e-02  -1.274 0.202787    
## mnf_flow          -7.173e-04  4.414e-05 -16.250  < 2e-16 ***
## carb_pressure1     5.254e-03  6.721e-04   7.818 7.30e-15 ***
## fill_pressure     -2.508e-03  1.477e-03  -1.698 0.089519 .  
## hyd_pressure1      1.233e-03  3.345e-04   3.685 0.000233 ***
## hyd_pressure2     -2.611e-03  5.225e-04  -4.998 6.12e-07 ***
## hyd_pressure3      3.049e-03  5.679e-04   5.369 8.49e-08 ***
## hyd_pressure4     -2.460e-04  2.940e-04  -0.837 0.402843    
## filler_level      -1.067e-03  7.884e-04  -1.353 0.176164    
## filler_speed       7.044e-05  2.196e-05   3.207 0.001354 ** 
## temperature       -1.116e-02  2.489e-03  -4.482 7.65e-06 ***
## usage_cont        -8.315e-03  1.102e-03  -7.544 5.93e-14 ***
## carb_flow         -1.910e-05  4.001e-06  -4.775 1.88e-06 ***
## density           -3.120e-03  2.502e-02  -0.125 0.900796    
## mfr               -4.438e-04  1.195e-04  -3.715 0.000207 ***
## balling           -2.835e-02  1.475e-02  -1.922 0.054691 .  
## pressure_vacuum   -2.771e-02  6.625e-03  -4.183 2.96e-05 ***
## oxygen_filler     -3.107e-01  7.550e-02  -4.115 3.96e-05 ***
## bowl_setpoint      2.111e-03  8.112e-04   2.602 0.009302 ** 
## pressure_setpoint -1.213e-03  2.013e-03  -0.602 0.546939    
## air_pressurer      5.547e-04  2.489e-03   0.223 0.823657    
## alch_rel           1.048e-01  2.783e-02   3.765 0.000170 ***
## carb_rel           2.259e-01  4.848e-02   4.660 3.30e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1326 on 3118 degrees of freedom
## Multiple R-squared:  0.4351, Adjusted R-squared:  0.4291 
## F-statistic: 72.76 on 33 and 3118 DF,  p-value: < 2.2e-16
model_lm2 <- lm(ph ~ . -alch_rel -air_pressurer -pressure_vacuum -density
               -filler_speed -filler_level -psc_co2 -pressure_setpoint -psc, data=train_up)
summary(model_lm2)
## 
## Call:
## lm(formula = ph ~ . - alch_rel - air_pressurer - pressure_vacuum - 
##     density - filler_speed - filler_level - psc_co2 - pressure_setpoint - 
##     psc, data = train_up)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53439 -0.07831  0.00479  0.09196  0.41078 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.247e+01  1.040e+00  11.992  < 2e-16 ***
## brand_codeB     7.466e-02  1.128e-02   6.619 4.24e-11 ***
## brand_codeC    -8.729e-02  1.023e-02  -8.533  < 2e-16 ***
## brand_codeD     5.461e-02  1.049e-02   5.205 2.07e-07 ***
## carb_volume    -1.665e-01  9.176e-02  -1.815 0.069627 .  
## fill_ounces    -1.598e-01  3.032e-02  -5.269 1.47e-07 ***
## pc_volume      -3.696e-02  4.972e-02  -0.743 0.457262    
## carb_pressure   2.580e-03  4.566e-03   0.565 0.572068    
## carb_temp      -1.706e-03  3.548e-03  -0.481 0.630742    
## psc_fill       -5.657e-02  1.953e-02  -2.896 0.003806 ** 
## mnf_flow       -6.878e-04  4.369e-05 -15.742  < 2e-16 ***
## carb_pressure1  4.993e-03  6.665e-04   7.492 8.76e-14 ***
## fill_pressure  -2.706e-03  1.035e-03  -2.614 0.008979 ** 
## hyd_pressure1   8.971e-04  3.240e-04   2.769 0.005654 ** 
## hyd_pressure2  -2.291e-03  4.826e-04  -4.748 2.15e-06 ***
## hyd_pressure3   3.393e-03  5.216e-04   6.504 9.04e-11 ***
## hyd_pressure4  -2.865e-04  2.935e-04  -0.976 0.328980    
## temperature    -1.111e-02  2.463e-03  -4.510 6.74e-06 ***
## usage_cont     -8.328e-03  1.091e-03  -7.635 2.98e-14 ***
## carb_flow      -1.442e-05  3.726e-06  -3.871 0.000111 ***
## mfr            -1.243e-04  4.279e-05  -2.905 0.003693 ** 
## balling         1.539e-03  7.768e-03   0.198 0.842977    
## oxygen_filler  -2.976e-01  7.509e-02  -3.963 7.57e-05 ***
## bowl_setpoint   1.133e-03  2.624e-04   4.319 1.62e-05 ***
## carb_rel        2.395e-01  4.579e-02   5.230 1.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1333 on 3127 degrees of freedom
## Multiple R-squared:  0.4275, Adjusted R-squared:  0.4231 
## F-statistic: 97.27 on 24 and 3127 DF,  p-value: < 2.2e-16

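Through our backward elimination process, we reduced the model from 33 to 24 terms. Based on the summaries above, the adjusted R-squared moved from 0.4291 to 0.4231, so the simpler model gives up only a small amount of explanatory power rather than improving it. Next we will check the reliability of the model using our test dataset.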
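
Backward elimination can also be automated. The following is a minimal sketch that swaps in a different criterion from the manual screening used here: base R's `step()` removes terms based on AIC, not on p-values. The toy data are synthetic, and the variable names are hypothetical.

```r
# Synthetic toy data for illustration only: y depends on x1 and x2, x3 is pure noise
set.seed(123)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full <- lm(y ~ ., data = toy)

# AIC-based backward elimination (a swapped-in criterion, not p-value screening)
reduced <- step(full, direction = "backward", trace = 0)

# Compare adjusted R-squared before and after elimination
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

Comparing the two adjusted R-squared values side by side, as in the last line, makes it easy to see whether the simpler model actually improves on the full one.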

predict(model_lm1, test)
##        1        2        3        4        5        6        7        8 
## 8.535312 8.552172 8.617410 8.599141 8.592518 8.487506 8.522059 8.537330 
##        9       10       11       12       13       14       15       16 
## 8.563412 8.513961 8.572605 8.559585 8.579524 8.594673 8.522230 8.534854 
##       17       18       19       20       21       22       23       24 
## 8.596204 8.633726 8.661215 8.591473 8.630996 8.618544 8.567257 8.599976 
##       25       26       27       28       29       30       31       32 
## 8.617948 8.638176 8.550960 8.583514 8.615816 8.618693 8.638351 8.599334 
##       33       34       35       36       37       38       39       40 
## 8.568214 8.686447 8.646237 8.595908 8.696560 8.536267 8.383022 8.410642 
##       41       42       43       44       45       46       47       48 
## 8.444845 8.609029 8.608778 8.545063 8.627627 8.608474 8.626407 8.634280 
##       49       50       51       52       53       54       55       56 
## 8.667126 8.556596 8.613031 8.650503 8.638972 8.553280 8.600839 8.644670 
##       57       58       59       60       61       62       63       64 
## 8.597825 8.536441 8.522494 8.528203 8.572763 8.598653 8.673867 8.622717 
##       65       66       67       68       69       70       71       72 
## 8.554984 8.540230 8.565499 8.701430 8.586814 8.584006 8.647969 8.476226 
##       73       74       75       76       77       78       79       80 
## 8.656510 8.652321 8.584494 8.662530 8.643258 8.698844 8.543026 8.704525 
##       81       82       83       84       85       86       87       88 
## 8.777679 8.825176 8.659717 8.737716 8.661774 8.645128 8.701232 8.485669 
##       89       90       91       92       93       94       95       96 
## 8.497086 8.472881 8.575344 8.719753 8.741559 8.682135 8.629311 8.596895 
##       97       98       99      100      101      102      103      104 
## 8.668197 8.615002 8.657667 8.611367 8.620442 8.713801 8.632544 8.607552 
##      105      106      107      108      109      110      111      112 
## 8.638635 8.688853 8.605818 8.634368 8.627923 8.519610 8.593470 8.622751 
##      113      114      115      116      117      118      119      120 
## 8.578370 8.486476 8.474408 8.703535 8.678214 8.732956 8.609192 8.714570 
##      121      122      123      124      125      126      127      128 
## 8.643319 8.583682 8.666468 8.558109 8.621107 8.729932 8.873657 8.653398 
##      129      130      131      132      133      134      135      136 
## 8.629262 8.715678 8.788885 8.674881 8.665523 8.439865 8.469357 8.500706 
##      137      138      139      140      141      142      143      144 
## 8.437576 8.458453 8.482004 8.498254 8.611552 8.689605 8.698472 8.682046 
##      145      146      147      148      149      150      151      152 
## 8.694364 8.739120 8.609007 8.649823 8.598848 8.610661 8.694824 8.726789 
##      153      154      155      156      157      158      159      160 
## 8.681785 8.652567 8.598238 8.617016 8.562521 8.663504 8.642907 8.643729 
##      161      162      163      164      165      166      167      168 
## 8.477988 8.689018 8.617325 8.585246 8.677114 8.512572 8.680095 8.803345 
##      169      170      171      172      173      174      175      176 
## 8.692787 8.666025 8.756548 8.770646 8.682911 8.734524 8.715625 8.725685 
##      177      178      179      180      181      182      183      184 
## 8.667368 8.554460 8.743824 8.761286 8.758101 8.722627 8.656748 8.482051 
##      185      186      187      188      189      190      191      192 
## 8.477618 8.457926 8.501116 8.478556 8.651998 8.654843 8.634054 8.610014 
##      193      194      195      196      197      198      199      200 
## 8.692841 8.634047 8.734718 8.722285 8.694298 8.788174 8.588965 8.560117 
##      201      202      203      204      205      206      207      208 
## 8.612198 8.675494 8.625220 8.629646 8.577714 8.701874 8.699637 8.560705 
##      209      210      211      212      213      214      215      216 
## 8.644735 8.667605 8.684891 8.639527 8.636594 8.688191 8.675120 8.747888 
##      217      218      219      220      221      222      223      224 
## 8.727718 8.565172 8.650489 8.631932 8.675710 8.618460 8.662707 8.737362 
##      225      226      227      228      229      230      231      232 
## 8.757956 8.611127 8.628337 8.631014 8.584984 8.623373 8.597118 8.637044 
##      233      234      235      236      237      238      239      240 
## 8.650168 8.690843 8.508010 8.505688 8.472799 8.547967 8.657178 8.601530 
##      241      242      243      244      245      246      247      248 
## 8.706025 8.772766 8.644880 8.692180 8.727837 8.706923 8.698254 8.673956 
##      249      250      251      252      253      254      255      256 
## 8.713024 8.651606 8.473119 8.493855 8.622884 8.482721 8.484194 8.539246 
##      257      258      259      260      261      262      263      264 
## 8.551515 8.554118 8.504529 8.418718 8.512044 8.453088 8.501460 8.394030 
##      265      266      267      268      269      270      271      272 
## 8.475523 8.524289 8.503736 8.612974 8.281501 8.656562 8.642244 8.537584 
##      273      274      275      276      277      278      279      280 
## 8.489459 8.441599 8.521856 8.408867 8.536206 8.533114 8.550946 8.509091 
##      281      282      283      284      285      286      287      288 
## 8.560228 8.565240 8.567228 8.588729 8.596048 8.532336 8.603991 8.587775 
##      289      290      291      292      293      294      295      296 
## 8.557889 8.567364 8.589820 8.638057 8.560941 8.604563 8.583663 8.328241 
##      297      298      299      300      301      302      303      304 
## 8.341659 8.370631 8.344864 8.268771 8.281730 8.546698 8.542892 8.529807 
##      305      306      307      308      309      310      311      312 
## 8.491231 8.544168 8.476425 8.470104 8.476578 8.566667 8.588488 8.566843 
##      313      314      315      316      317      318      319      320 
## 8.464548 8.490479 8.463749 8.337031 8.410468 8.451988 8.458341 8.454712 
##      321      322      323      324      325      326      327      328 
## 8.556154 8.569134 8.631975 8.588798 8.527219 8.556726 8.499378 8.497597 
##      329      330      331      332      333      334      335      336 
## 8.447940 8.505432 8.479001 8.391495 8.356680 8.272870 8.441314 8.416109 
##      337      338      339      340      341      342      343      344 
## 8.424972 8.416890 8.525539 8.498188 8.464682 8.466235 8.516341 8.505952 
##      345      346      347      348      349      350      351      352 
## 8.480390 8.472402 8.510878 8.271139 8.336505 8.323018 8.328623 8.277577 
##      353      354      355      356      357      358      359      360 
## 8.572634 8.430445 8.441532 8.417792 8.466240 8.438397 8.464246 8.455339 
##      361      362      363      364      365      366      367      368 
## 8.463211 8.436249 8.422607 8.457396 8.443884 8.452611 8.469901 8.440857 
##      369      370      371      372      373      374      375      376 
## 8.424638 8.556547 8.519020 8.500247 8.478301 8.464870 8.418684 8.434603 
##      377      378      379      380      381      382      383      384 
## 8.294607 8.412229 8.373675 8.446003 8.463647 8.509818 8.494136 8.536547 
##      385      386      387      388      389      390      391      392 
## 8.321739 8.457268 8.491027 8.418307 8.272420 8.282787 8.435603 8.498043 
##      393      394      395      396      397      398      399      400 
## 8.476977 8.507094 8.512056 8.349803 8.459772 8.394004 8.499125 8.540402 
##      401      402      403      404      405      406      407      408 
## 8.506349 8.496515 8.498837 8.476264 8.544528 8.529859 8.510080 8.432343 
##      409      410      411      412      413      414      415      416 
## 8.483909 8.570198 8.404957 8.348317 8.323134 8.347512 8.349837 8.328327 
##      417      418      419      420      421      422      423      424 
## 8.546584 8.527267 8.553337 8.537409 8.393011 8.524774 8.454039 8.306176 
##      425      426      427      428      429      430      431      432 
## 8.507428 8.496851 8.480885 8.470022 8.453689 8.492330 8.508521 8.391348 
##      433      434      435      436      437      438      439      440 
## 8.528258 8.540623 8.555924 8.563143 8.587529 8.579418 8.582648 8.536633 
##      441      442      443      444      445      446      447      448 
## 8.495592 8.511122 8.665363 8.506894 8.313142 8.336063 8.347746 8.327580 
##      449      450      451      452      453      454      455      456 
## 8.376291 8.561202 8.570062 8.491047 8.508612 8.545889 8.533885 8.561067 
##      457      458      459      460      461      462      463      464 
## 8.534692 8.591785 8.584092 8.625438 8.477412 8.600057 8.534042 8.482883 
##      465      466      467      468      469      470      471      472 
## 8.495659 8.496797 8.308670 8.472488 8.531140 8.531340 8.450722 8.522283 
##      473      474      475      476      477      478      479      480 
## 8.544303 8.490571 8.502829 8.540052 8.613147 8.538882 8.594406 8.615205 
##      481      482      483      484      485      486      487      488 
## 8.346103 8.314281 8.514047 8.458406 8.475322 8.465305 8.508352 8.581415 
##      489      490      491      492      493      494      495      496 
## 8.499972 8.520424 8.501832 8.593729 8.630822 8.416822 8.389844 8.429733 
##      497      498      499      500      501      502      503      504 
## 8.446820 8.583815 8.540383 8.550342 8.560813 8.509433 8.510931 8.533974 
##      505      506      507      508      509      510      511      512 
## 8.471296 8.506307 8.556053 8.520599 8.514418 8.577183 8.642422 8.538299 
##      513      514      515      516      517      518      519      520 
## 8.528060 8.566496 8.527929 8.553100 8.490989 8.540184 8.503175 8.649171 
##      521      522      523      524      525      526      527      528 
## 8.583863 8.544450 8.600710 8.547345 8.640192 8.344598 8.370644 8.323353 
##      529      530      531 
## 8.363332 8.317567 8.506639
RMSE(predict(model_lm1, test), test$ph)
## [1] 0.1408222

The RMSE for our first linear regression model (model_lm1) on the test data is 0.1408.
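
For reference, RMSE is the square root of the mean squared difference between predicted and observed values. The following is a minimal hand-rolled sketch of that calculation; `rmse` is a hypothetical helper shown for illustration, not caret's `RMSE` function.

```r
# Root mean squared error between predictions and observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Tiny worked example: the errors are 0, 0 and 2, so RMSE = sqrt(4 / 3)
example_rmse <- rmse(c(1, 2, 5), c(1, 2, 3))
example_rmse
```

The same helper could be applied to the reduced model, for example `rmse(predict(model_lm2, test), test$ph)`, to compare both linear models on the test set.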

Decision Tree Model (Single Tree)

The next model that we will try is a single decision tree. We will evaluate one model that uses all of our parameters and then a second one that uses only the parameters that we selected in our better performing linear regression model.

set.seed(1234)
model_tree1 <- rpart(ph ~., data = train)
#rpart.plot(model_tree1)

predict(model_tree1, test)
##        1        2        3        4        5        6        7        8 
## 8.561120 8.561120 8.561120 8.561120 8.672667 8.492941 8.561120 8.561120 
##        9       10       11       12       13       14       15       16 
## 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 
##       17       18       19       20       21       22       23       24 
## 8.561120 8.672667 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 
##       25       26       27       28       29       30       31       32 
## 8.561120 8.561120 8.561120 8.719864 8.561120 8.672667 8.561120 8.561120 
##       33       34       35       36       37       38       39       40 
## 8.672667 8.561120 8.719864 8.719864 8.719864 8.719864 8.494231 8.251818 
##       41       42       43       44       45       46       47       48 
## 8.251818 8.719864 8.561120 8.561120 8.719864 8.719864 8.719864 8.561120 
##       49       50       51       52       53       54       55       56 
## 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 
##       57       58       59       60       61       62       63       64 
## 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.561120 8.719864 
##       65       66       67       68       69       70       71       72 
## 8.672667 8.561120 8.561120 8.672667 8.561120 8.561120 8.672667 8.251818 
##       73       74       75       76       77       78       79       80 
## 8.672667 8.561120 8.561120 8.561120 8.672667 8.672667 8.561120 8.561120 
##       81       82       83       84       85       86       87       88 
## 8.672667 8.672667 8.719864 8.719864 8.672667 8.719864 8.719864 8.251818 
##       89       90       91       92       93       94       95       96 
## 8.251818 8.494231 8.645595 8.719864 8.719864 8.645595 8.719864 8.719864 
##       97       98       99      100      101      102      103      104 
## 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 
##      105      106      107      108      109      110      111      112 
## 8.719864 8.719864 8.719864 8.645595 8.719864 8.719864 8.719864 8.645595 
##      113      114      115      116      117      118      119      120 
## 8.719864 8.494231 8.251818 8.719864 8.645595 8.645595 8.719864 8.719864 
##      121      122      123      124      125      126      127      128 
## 8.719864 8.645595 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 
##      129      130      131      132      133      134      135      136 
## 8.719864 8.719864 8.719864 8.719864 8.719864 8.494231 8.494231 8.494231 
##      137      138      139      140      141      142      143      144 
## 8.494231 8.494231 8.251818 8.494231 8.719864 8.719864 8.719864 8.719864 
##      145      146      147      148      149      150      151      152 
## 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 
##      153      154      155      156      157      158      159      160 
## 8.719864 8.672667 8.672667 8.672667 8.719864 8.719864 8.645595 8.719864 
##      161      162      163      164      165      166      167      168 
## 8.494231 8.645595 8.645595 8.645595 8.645595 8.645595 8.645595 8.645595 
##      169      170      171      172      173      174      175      176 
## 8.645595 8.645595 8.645595 8.719864 8.645595 8.645595 8.719864 8.719864 
##      177      178      179      180      181      182      183      184 
## 8.719864 8.719864 8.719864 8.719864 8.645595 8.645595 8.719864 8.494231 
##      185      186      187      188      189      190      191      192 
## 8.494231 8.494231 8.494231 8.494231 8.719864 8.719864 8.672667 8.561120 
##      193      194      195      196      197      198      199      200 
## 8.561120 8.645595 8.645595 8.645595 8.645595 8.719864 8.645595 8.645595 
##      201      202      203      204      205      206      207      208 
## 8.645595 8.719864 8.719864 8.645595 8.645595 8.645595 8.645595 8.494231 
##      209      210      211      212      213      214      215      216 
## 8.645595 8.645595 8.645595 8.719864 8.719864 8.719864 8.645595 8.645595 
##      217      218      219      220      221      222      223      224 
## 8.645595 8.645595 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 
##      225      226      227      228      229      230      231      232 
## 8.719864 8.719864 8.719864 8.719864 8.719864 8.719864 8.645595 8.645595 
##      233      234      235      236      237      238      239      240 
## 8.645595 8.719864 8.494231 8.494231 8.494231 8.494231 8.645595 8.719864 
##      241      242      243      244      245      246      247      248 
## 8.719864 8.719864 8.719864 8.719864 8.645595 8.645595 8.719864 8.645595 
##      249      250      251      252      253      254      255      256 
## 8.645595 8.645595 8.497333 8.530728 8.481053 8.481053 8.481053 8.530728 
##      257      258      259      260      261      262      263      264 
## 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 8.481053 
##      265      266      267      268      269      270      271      272 
## 8.530728 8.530728 8.530728 8.608429 8.398605 8.481053 8.608429 8.530728 
##      273      274      275      276      277      278      279      280 
## 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 8.608429 8.530728 
##      281      282      283      284      285      286      287      288 
## 8.530728 8.530728 8.608429 8.608429 8.608429 8.530728 8.530728 8.530728 
##      289      290      291      292      293      294      295      296 
## 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 8.530728 
##      297      298      299      300      301      302      303      304 
## 8.445138 8.530728 8.530728 8.445138 8.530728 8.530728 8.530728 8.530728 
##      305      306      307      308      309      310      311      312 
## 8.530728 8.530728 8.445138 8.445138 8.530728 8.608429 8.608429 8.608429 
##      313      314      315      316      317      318      319      320 
## 8.608429 8.530728 8.327917 8.530728 8.530728 8.530728 8.608429 8.608429 
##      321      322      323      324      325      326      327      328 
## 8.608429 8.608429 8.608429 8.608429 8.608429 8.608429 8.481053 8.530728 
##      329      330      331      332      333      334      335      336 
## 8.530728 8.530728 8.530728 8.327917 8.327917 8.327917 8.445138 8.445138 
##      337      338      339      340      341      342      343      344 
## 8.445138 8.445138 8.608429 8.608429 8.327917 8.327917 8.445138 8.445138 
##      345      346      347      348      349      350      351      352 
## 8.445138 8.445138 8.445138 8.327917 8.327917 8.327917 8.327917 8.327917 
##      353      354      355      356      357      358      359      360 
## 8.481053 8.481053 8.497333 8.445138 8.445138 8.445138 8.445138 8.445138 
##      361      362      363      364      365      366      367      368 
## 8.445138 8.445138 8.327917 8.445138 8.327917 8.327917 8.327917 8.445138 
##      369      370      371      372      373      374      375      376 
## 8.445138 8.481053 8.481053 8.608429 8.481053 8.481053 8.327917 8.327917 
##      377      378      379      380      381      382      383      384 
## 8.327917 8.327917 8.327917 8.327917 8.327917 8.481053 8.481053 8.327917 
##      385      386      387      388      389      390      391      392 
## 8.398605 8.327917 8.327917 8.327917 8.445138 8.445138 8.327917 8.445138 
##      393      394      395      396      397      398      399      400 
## 8.327917 8.608429 8.481053 8.327917 8.497333 8.327917 8.327917 8.445138 
##      401      402      403      404      405      406      407      408 
## 8.445138 8.445138 8.445138 8.445138 8.445138 8.445138 8.445138 8.497333 
##      409      410      411      412      413      414      415      416 
## 8.327917 8.608429 8.327917 8.327917 8.327917 8.327917 8.445138 8.327917 
##      417      418      419      420      421      422      423      424 
## 8.327917 8.327917 8.608429 8.608429 8.327917 8.327917 8.445138 8.445138 
##      425      426      427      428      429      430      431      432 
## 8.445138 8.327917 8.327917 8.327917 8.327917 8.327917 8.327917 8.327917 
##      433      434      435      436      437      438      439      440 
## 8.481053 8.481053 8.481053 8.481053 8.608429 8.608429 8.608429 8.481053 
##      441      442      443      444      445      446      447      448 
## 8.445138 8.497333 8.497333 8.497333 8.398605 8.398605 8.398605 8.398605 
##      449      450      451      452      453      454      455      456 
## 8.497333 8.608429 8.608429 8.497333 8.497333 8.497333 8.497333 8.497333 
##      457      458      459      460      461      462      463      464 
## 8.497333 8.608429 8.608429 8.608429 8.497333 8.497333 8.497333 8.497333 
##      465      466      467      468      469      470      471      472 
## 8.497333 8.497333 8.398605 8.481053 8.481053 8.481053 8.497333 8.497333 
##      473      474      475      476      477      478      479      480 
## 8.497333 8.497333 8.497333 8.497333 8.481053 8.497333 8.608429 8.608429 
##      481      482      483      484      485      486      487      488 
## 8.398605 8.398605 8.497333 8.497333 8.497333 8.497333 8.497333 8.608429 
##      489      490      491      492      493      494      495      496 
## 8.497333 8.497333 8.497333 8.481053 8.608429 8.497333 8.497333 8.497333 
##      497      498      499      500      501      502      503      504 
## 8.497333 8.608429 8.497333 8.497333 8.497333 8.608429 8.608429 8.608429 
##      505      506      507      508      509      510      511      512 
## 8.497333 8.497333 8.497333 8.497333 8.497333 8.497333 8.497333 8.497333 
##      513      514      515      516      517      518      519      520 
## 8.497333 8.497333 8.497333 8.608429 8.497333 8.497333 8.497333 8.608429 
##      521      522      523      524      525      526      527      528 
## 8.608429 8.608429 8.608429 8.608429 8.608429 8.171667 8.171667 8.171667 
##      529      530      531 
## 8.171667 8.398605 8.497333
RMSE(predict(model_tree1, test), test$ph)
## [1] 0.1344733
#Model with selected variables based on Linear Regression model

set.seed(1234)
model_tree2 <- rpart(ph ~ . -alch_rel -air_pressurer -pressure_vacuum -density
               -filler_speed -filler_level -psc_co2 -pressure_setpoint, data=train)
predict(model_tree2, test)
##        1        2        3        4        5        6        7        8 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.492941 8.605993 8.605993 
##        9       10       11       12       13       14       15       16 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       17       18       19       20       21       22       23       24 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       25       26       27       28       29       30       31       32 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       33       34       35       36       37       38       39       40 
## 8.605993 8.605993 8.724681 8.605993 8.605993 8.605993 8.494231 8.251818 
##       41       42       43       44       45       46       47       48 
## 8.251818 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       49       50       51       52       53       54       55       56 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       57       58       59       60       61       62       63       64 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       65       66       67       68       69       70       71       72 
## 8.605993 8.605993 8.605993 8.724681 8.605993 8.605993 8.724681 8.251818 
##       73       74       75       76       77       78       79       80 
## 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 8.605993 
##       81       82       83       84       85       86       87       88 
## 8.724681 8.724681 8.605993 8.724681 8.724681 8.724681 8.724681 8.251818 
##       89       90       91       92       93       94       95       96 
## 8.251818 8.494231 8.704970 8.724681 8.704970 8.704970 8.704970 8.704970 
##       97       98       99      100      101      102      103      104 
## 8.704970 8.605993 8.704970 8.704970 8.704970 8.704970 8.704970 8.704970 
##      105      106      107      108      109      110      111      112 
## 8.605993 8.704970 8.704970 8.724681 8.724681 8.704970 8.704970 8.704970 
##      113      114      115      116      117      118      119      120 
## 8.704970 8.494231 8.251818 8.605993 8.605993 8.605993 8.704970 8.704970 
##      121      122      123      124      125      126      127      128 
## 8.605993 8.605993 8.605993 8.704970 8.704970 8.704970 8.704970 8.704970 
##      129      130      131      132      133      134      135      136 
## 8.605993 8.704970 8.704970 8.605993 8.704970 8.494231 8.494231 8.494231 
##      137      138      139      140      141      142      143      144 
## 8.494231 8.494231 8.251818 8.494231 8.605993 8.704970 8.704970 8.605993 
##      145      146      147      148      149      150      151      152 
## 8.704970 8.724681 8.704970 8.704970 8.704970 8.704970 8.704970 8.704970 
##      153      154      155      156      157      158      159      160 
## 8.704970 8.704970 8.704970 8.704970 8.704970 8.704970 8.704970 8.605993 
##      161      162      163      164      165      166      167      168 
## 8.494231 8.605993 8.605993 8.704970 8.704970 8.704970 8.704970 8.704970 
##      169      170      171      172      173      174      175      176 
## 8.704970 8.605993 8.605993 8.605993 8.605993 8.704970 8.704970 8.704970 
##      177      178      179      180      181      182      183      184 
## 8.704970 8.704970 8.724681 8.704970 8.704970 8.724681 8.724681 8.494231 
##      185      186      187      188      189      190      191      192 
## 8.494231 8.494231 8.494231 8.494231 8.704970 8.704970 8.704970 8.704970 
##      193      194      195      196      197      198      199      200 
## 8.704970 8.605993 8.704970 8.704970 8.704970 8.605993 8.704970 8.724681 
##      201      202      203      204      205      206      207      208 
## 8.704970 8.704970 8.605993 8.605993 8.704970 8.605993 8.605993 8.494231 
##      209      210      211      212      213      214      215      216 
## 8.704970 8.704970 8.605993 8.704970 8.704970 8.704970 8.605993 8.704970 
##      217      218      219      220      221      222      223      224 
## 8.704970 8.704970 8.704970 8.704970 8.704970 8.605993 8.704970 8.704970 
##      225      226      227      228      229      230      231      232 
## 8.605993 8.704970 8.605993 8.605993 8.605993 8.704970 8.605993 8.605993 
##      233      234      235      236      237      238      239      240 
## 8.704970 8.704970 8.494231 8.494231 8.494231 8.494231 8.704970 8.605993 
##      241      242      243      244      245      246      247      248 
## 8.704970 8.605993 8.605993 8.605993 8.704970 8.704970 8.704970 8.704970 
##      249      250      251      252      253      254      255      256 
## 8.704970 8.605993 8.493377 8.527018 8.498202 8.527018 8.498202 8.527018 
##      257      258      259      260      261      262      263      264 
## 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 
##      265      266      267      268      269      270      271      272 
## 8.527018 8.527018 8.527018 8.498202 8.301750 8.527018 8.527018 8.527018 
##      273      274      275      276      277      278      279      280 
## 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 
##      281      282      283      284      285      286      287      288 
## 8.527018 8.527018 8.560000 8.689756 8.689756 8.527018 8.527018 8.527018 
##      289      290      291      292      293      294      295      296 
## 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 8.527018 
##      297      298      299      300      301      302      303      304 
## 8.301750 8.527018 8.527018 8.301750 8.527018 8.527018 8.527018 8.527018 
##      305      306      307      308      309      310      311      312 
## 8.527018 8.527018 8.424459 8.424459 8.527018 8.498202 8.689756 8.689756 
##      313      314      315      316      317      318      319      320 
## 8.301750 8.527018 8.424459 8.527018 8.527018 8.527018 8.527018 8.527018 
##      321      322      323      324      325      326      327      328 
## 8.689756 8.689756 8.689756 8.689756 8.527018 8.689756 8.527018 8.527018 
##      329      330      331      332      333      334      335      336 
## 8.527018 8.527018 8.527018 8.301750 8.301750 8.301750 8.424459 8.424459 
##      337      338      339      340      341      342      343      344 
## 8.424459 8.424459 8.560000 8.560000 8.560000 8.301750 8.424459 8.424459 
##      345      346      347      348      349      350      351      352 
## 8.424459 8.424459 8.424459 8.301750 8.301750 8.301750 8.301750 8.301750 
##      353      354      355      356      357      358      359      360 
## 8.498202 8.424459 8.301750 8.424459 8.424459 8.424459 8.424459 8.424459 
##      361      362      363      364      365      366      367      368 
## 8.424459 8.424459 8.301750 8.424459 8.424459 8.424459 8.424459 8.424459 
##      369      370      371      372      373      374      375      376 
## 8.424459 8.424459 8.301750 8.424459 8.493377 8.493377 8.424459 8.424459 
##      377      378      379      380      381      382      383      384 
## 8.498202 8.301750 8.301750 8.424459 8.424459 8.498202 8.424459 8.424459 
##      385      386      387      388      389      390      391      392 
## 8.301750 8.424459 8.424459 8.424459 8.301750 8.301750 8.424459 8.424459 
##      393      394      395      396      397      398      399      400 
## 8.493377 8.498202 8.498202 8.301750 8.498202 8.498202 8.560000 8.424459 
##      401      402      403      404      405      406      407      408 
## 8.424459 8.424459 8.424459 8.424459 8.424459 8.424459 8.424459 8.425373 
##      409      410      411      412      413      414      415      416 
## 8.425373 8.560000 8.425373 8.425373 8.425373 8.425373 8.425373 8.425373 
##      417      418      419      420      421      422      423      424 
## 8.493377 8.493377 8.560000 8.560000 8.425373 8.217143 8.217143 8.425373 
##      425      426      427      428      429      430      431      432 
## 8.217143 8.217143 8.217143 8.217143 8.217143 8.217143 8.217143 8.425373 
##      433      434      435      436      437      438      439      440 
## 8.498202 8.498202 8.498202 8.498202 8.560000 8.560000 8.560000 8.498202 
##      441      442      443      444      445      446      447      448 
## 8.217143 8.493377 8.493377 8.493377 8.425373 8.425373 8.425373 8.425373 
##      449      450      451      452      453      454      455      456 
## 8.425373 8.560000 8.560000 8.560000 8.493377 8.493377 8.493377 8.493377 
##      457      458      459      460      461      462      463      464 
## 8.493377 8.560000 8.560000 8.560000 8.425373 8.493377 8.493377 8.493377 
##      465      466      467      468      469      470      471      472 
## 8.493377 8.493377 8.425373 8.498202 8.498202 8.498202 8.560000 8.493377 
##      473      474      475      476      477      478      479      480 
## 8.493377 8.493377 8.493377 8.493377 8.498202 8.493377 8.560000 8.560000 
##      481      482      483      484      485      486      487      488 
## 8.425373 8.425373 8.493377 8.493377 8.493377 8.493377 8.493377 8.560000 
##      489      490      491      492      493      494      495      496 
## 8.560000 8.493377 8.493377 8.498202 8.560000 8.425373 8.425373 8.425373 
##      497      498      499      500      501      502      503      504 
## 8.425373 8.560000 8.493377 8.493377 8.493377 8.560000 8.560000 8.493377 
##      505      506      507      508      509      510      511      512 
## 8.560000 8.493377 8.493377 8.493377 8.493377 8.493377 8.493377 8.493377 
##      513      514      515      516      517      518      519      520 
## 8.493377 8.493377 8.493377 8.560000 8.560000 8.560000 8.560000 8.560000 
##      521      522      523      524      525      526      527      528 
## 8.560000 8.560000 8.560000 8.560000 8.560000 8.301750 8.301750 8.301750 
##      529      530      531 
## 8.301750 8.301750 8.493377
RMSE(predict(model_tree2, test), test$ph)
## [1] 0.1348702
plot(model_tree1, uniform=TRUE, compress=FALSE, margin=.015)
text(model_tree1, all=TRUE, cex=.5)
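
The two trees above are compared on test-set RMSE. As a minimal sketch of what that number measures, the function below writes the calculation out in base R. The name rmse_manual and the toy vectors are made up for illustration; we assume this matches what caret's RMSE() computes (the square root of the mean squared difference between predictions and observations).

```r
# Root mean squared error written out explicitly (base R only).
# rmse_manual is a hypothetical helper name, not part of caret.
rmse_manual <- function(pred, obs, na.rm = FALSE) {
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

# Toy values, invented purely for illustration (not taken from the data set).
obs  <- c(8.50, 8.60, 8.40, 8.70)
pred <- c(8.55, 8.55, 8.45, 8.60)

rmse_manual(pred, obs)

# The errors are squared, so the argument order does not change the result.
all.equal(rmse_manual(pred, obs), rmse_manual(obs, pred))
```

Because the measure is symmetric, calls written as RMSE(predictions, test$ph) and RMSE(test$ph, predictions) are directly comparable.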

Random Forest

#Model 1
set.seed(1234)

model_rf1 <- randomForest(ph ~ ., data=train_up,
                         ntree = 50, nodesize=15)
model_rf1$rsq
##  [1] 0.6976031 0.7143194 0.7518379 0.7541110 0.7795894 0.7974019 0.8154279
##  [8] 0.8247588 0.8372702 0.8470830 0.8562543 0.8600366 0.8646292 0.8654197
## [15] 0.8692691 0.8694372 0.8730366 0.8735679 0.8764539 0.8776930 0.8788295
## [22] 0.8793116 0.8804491 0.8810166 0.8811327 0.8806551 0.8812119 0.8821570
## [29] 0.8838412 0.8832307 0.8848415 0.8856629 0.8855656 0.8867830 0.8868331
## [36] 0.8876840 0.8882950 0.8888927 0.8895892 0.8891586 0.8891269 0.8893793
## [43] 0.8895869 0.8896685 0.8898406 0.8903273 0.8905896 0.8902392 0.8902180
## [50] 0.8904352
RMSE(predict(model_rf1,test), test$ph)
## [1] 0.1073807
#Model 2
set.seed(1234)

model_rf2 <- randomForest(ph ~ . -alch_rel -air_pressurer -pressure_vacuum -density
               -filler_speed -filler_level -psc_co2 -pressure_setpoint, data=train_up,
                         ntree = 50, nodesize=15)
model_rf2$rsq
##  [1] 0.7484210 0.7413447 0.7350172 0.7615498 0.7885962 0.7956339 0.8041510
##  [8] 0.8113253 0.8235833 0.8276234 0.8314446 0.8389428 0.8439495 0.8462311
## [15] 0.8458698 0.8481294 0.8500746 0.8524150 0.8546256 0.8570357 0.8589027
## [22] 0.8603767 0.8616162 0.8628927 0.8643926 0.8661009 0.8675507 0.8687223
## [29] 0.8697785 0.8697520 0.8710171 0.8715354 0.8727304 0.8725161 0.8728927
## [36] 0.8737550 0.8739372 0.8742044 0.8749446 0.8745692 0.8748572 0.8745765
## [43] 0.8748385 0.8758151 0.8758966 0.8767572 0.8777655 0.8784806 0.8791126
## [50] 0.8795768
RMSE(predict(model_rf2,test), test$ph)
## [1] 0.1102083

Support Vector Machine

#Model 1
train_control <- trainControl(method = "cv", number = 5)
set.seed(1234)



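# NOTE: 'kernal' is a misspelling of 'kernel'. svm() does not recognise it, so
# the default radial kernel is used (see "SVM-Kernel: radial" in the summary).
# Use kernel = 'linear' to actually fit a linear-kernel model.
model_svm <- svm(ph ~., data=train,
                 kernal = 'linear', cost=0.1)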

summary(model_svm)
## 
## Call:
## svm(formula = ph ~ ., data = train, kernal = "linear", cost = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  0.1 
##       gamma:  0.02857143 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1397
ph_prediction <- predict(model_svm, test)

RMSE(test$ph,ph_prediction)
## [1] 0.1318081
#Model 2
train_control <- trainControl(method = "cv", number = 5)
set.seed(1234)


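# NOTE: as in Model 1, the misspelled 'kernal' argument is not recognised,
# so this model is also fitted with the default radial kernel.
model_svm <- svm(ph ~. -alch_rel -air_pressurer -pressure_vacuum -density
               -filler_speed -filler_level -psc_co2 -pressure_setpoint, data=train,
                 kernal = 'linear', cost=0.1)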

summary(model_svm)
## 
## Call:
## svm(formula = ph ~ . - alch_rel - air_pressurer - pressure_vacuum - 
##     density - filler_speed - filler_level - psc_co2 - pressure_setpoint, 
##     data = train, kernal = "linear", cost = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  0.1 
##       gamma:  0.03703704 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1413
ph_prediction <- predict(model_svm, test)

RMSE(test$ph,ph_prediction)
## [1] 0.1351914
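
Both summaries above report a radial kernel even though the calls ask for a linear one. A plausible explanation, which we have not verified against the e1071 source, is that the misspelled kernal argument is absorbed by `...` and never reaches the kernel setting. The sketch below reproduces that behaviour with a toy function; fit_toy and strict_fit are hypothetical names and do not call svm() at all.

```r
# Toy stand-in for a modelling function. Like many R functions, it accepts
# extra arguments through `...` and silently ignores the ones it does not use.
fit_toy <- function(x, kernel = "radial", ...) {
  kernel
}

fit_toy(1:3, kernal = "linear")  # misspelled: swallowed by `...`, returns "radial"
fit_toy(1:3, kernel = "linear")  # spelled correctly: returns "linear"

# A stricter variant that refuses unknown arguments instead of ignoring them.
strict_fit <- function(x, kernel = "radial", ...) {
  extra <- names(list(...))
  if (length(extra) > 0) {
    stop("Unknown argument(s): ", paste(extra, collapse = ", "))
  }
  kernel
}
```

Checking the fitted model's summary, as done above, is the simplest way to confirm that the settings actually used are the ones requested.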

Neural Network

Finally, we train a neural network model. Note that the formula below uses carb_volume as its only predictor.

library(nnet)

my.grid <- expand.grid(.decay = c(0.5,0,1), .size=c(5,6,7))

model_nnet <- train(ph ~ carb_volume, data=train, method="nnet",
      maxit=1000, tuneGrid = my.grid, trace = FALSE, linout = TRUE)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
predict(model_nnet, test)
##        1        2        3        4        5        6        7        8 
## 8.565224 8.547035 8.543868 8.537313 8.545383 8.536075 8.546192 8.547035 
##        9       10       11       12       13       14       15       16 
## 8.548825 8.545383 8.545383 8.546192 8.559843 8.571166 8.566657 8.552817 
##       17       18       19       20       21       22       23       24 
## 8.596397 8.582917 8.571166 8.558585 8.543161 8.537313 8.565936 8.543868 
##       25       26       27       28       29       30       31       32 
## 8.538726 8.541850 8.542488 8.545383 8.543161 8.541850 8.543868 8.545383 
##       33       34       35       36       37       38       39       40 
## 8.540136 8.536188 8.541245 8.543868 8.539632 8.536431 8.540136 8.539632 
##       41       42       43       44       45       46       47       48 
## 8.540136 8.574347 8.551767 8.539632 8.545383 8.539632 8.547035 8.566657 
##       49       50       51       52       53       54       55       56 
## 8.546192 8.540136 8.541850 8.540674 8.542488 8.538726 8.539162 8.543161 
##       57       58       59       60       61       62       63       64 
## 8.540136 8.536115 8.539632 8.540136 8.584737 8.574347 8.565224 8.581133 
##       65       66       67       68       69       70       71       72 
## 8.569628 8.547913 8.547913 8.542488 8.546192 8.539632 8.538726 8.536115 
##       73       74       75       76       77       78       79       80 
## 8.566657 8.563827 8.563827 8.536431 8.537953 8.539162 8.541850 8.566657 
##       81       82       83       84       85       86       87       88 
## 8.537616 8.543161 8.545383 8.569628 8.566657 8.545383 8.546192 8.543161 
##       89       90       91       92       93       94       95       96 
## 8.537616 8.537043 8.539162 8.558585 8.555020 8.572739 8.559843 8.550752 
##       97       98       99      100      101      102      103      104 
## 8.545383 8.540674 8.542488 8.584737 8.569628 8.577669 8.538726 8.541245 
##      105      106      107      108      109      110      111      112 
## 8.545383 8.568125 8.536150 8.552817 8.546192 8.545383 8.562464 8.539632 
##      113      114      115      116      117      118      119      120 
## 8.539632 8.543868 8.539632 8.581133 8.569628 8.561136 8.550752 8.537953 
##      121      122      123      124      125      126      127      128 
## 8.539632 8.547035 8.545383 8.550752 8.539162 8.539162 8.538726 8.559843 
##      129      130      131      132      133      134      135      136 
## 8.544608 8.561136 8.568125 8.543161 8.541850 8.543161 8.545383 8.536431 
##      137      138      139      140      141      142      143      144 
## 8.536115 8.542488 8.541245 8.542488 8.537953 8.545383 8.539632 8.541245 
##      145      146      147      148      149      150      151      152 
## 8.540674 8.572739 8.536075 8.538726 8.538323 8.541850 8.536602 8.569628 
##      153      154      155      156      157      158      159      160 
## 8.575991 8.543868 8.546192 8.543161 8.537313 8.537953 8.540136 8.538323 
##      161      162      163      164      165      166      167      168 
## 8.536293 8.557362 8.565224 8.569628 8.540136 8.543161 8.541245 8.574347 
##      169      170      171      172      173      174      175      176 
## 8.566657 8.563827 8.566657 8.562464 8.536431 8.537616 8.537043 8.569628 
##      177      178      179      180      181      182      183      184 
## 8.566657 8.563827 8.537953 8.536806 8.536068 8.538726 8.539632 8.537953 
##      185      186      187      188      189      190      191      192 
## 8.536293 8.536075 8.538726 8.536516 8.537313 8.543161 8.536115 8.536075 
##      193      194      195      196      197      198      199      200 
## 8.563827 8.562464 8.541850 8.540136 8.541245 8.539632 8.537616 8.537313 
##      201      202      203      204      205      206      207      208 
## 8.537313 8.540674 8.545383 8.559843 8.556174 8.551767 8.561136 8.536115 
##      209      210      211      212      213      214      215      216 
## 8.544608 8.540674 8.547913 8.541245 8.539162 8.541850 8.540674 8.566657 
##      217      218      219      220      221      222      223      224 
## 8.563827 8.552817 8.540674 8.537313 8.540674 8.537313 8.562464 8.562464 
##      225      226      227      228      229      230      231      232 
## 8.571166 8.540674 8.538726 8.536700 8.540136 8.538726 8.536188 8.538726 
##      233      234      235      236      237      238      239      240 
## 8.544608 8.572739 8.536093 8.539162 8.539632 8.541245 8.541245 8.537313 
##      241      242      243      244      245      246      247      248 
## 8.558585 8.575991 8.537616 8.539162 8.558585 8.558585 8.537953 8.537616 
##      249      250      251      252      253      254      255      256 
## 8.537953 8.536602 8.538323 8.538323 8.551767 8.546192 8.549771 8.536115 
##      257      258      259      260      261      262      263      264 
## 8.537313 8.537313 8.537313 8.536068 8.536115 8.545383 8.536806 8.545383 
##      265      266      267      268      269      270      271      272 
## 8.538323 8.540674 8.539162 8.582917 8.543161 8.551767 8.553901 8.536293 
##      273      274      275      276      277      278      279      280 
## 8.540674 8.537616 8.537313 8.577669 8.540136 8.537616 8.566657 8.537313 
##      281      282      283      284      285      286      287      288 
## 8.538726 8.542488 8.598464 8.596397 8.590408 8.537313 8.539632 8.540136 
##      289      290      291      292      293      294      295      296 
## 8.538726 8.536068 8.538323 8.538323 8.537953 8.538323 8.540136 8.540674 
##      297      298      299      300      301      302      303      304 
## 8.536115 8.538726 8.562464 8.537313 8.543868 8.541850 8.538726 8.539162 
##      305      306      307      308      309      310      311      312 
## 8.541850 8.538726 8.536602 8.539162 8.536150 8.574347 8.575991 8.582917 
##      313      314      315      316      317      318      319      320 
## 8.551767 8.536431 8.542488 8.536431 8.550752 8.552817 8.540674 8.557362 
##      321      322      323      324      325      326      327      328 
## 8.581133 8.581133 8.575991 8.565224 8.566657 8.561136 8.566657 8.543868 
##      329      330      331      332      333      334      335      336 
## 8.540674 8.536075 8.540674 8.562464 8.566657 8.542488 8.540674 8.541245 
##      337      338      339      340      341      342      343      344 
## 8.538323 8.537313 8.575991 8.594365 8.557362 8.555020 8.536293 8.538323 
##      345      346      347      348      349      350      351      352 
## 8.537616 8.536075 8.536602 8.542488 8.536806 8.541850 8.539632 8.543868 
##      353      354      355      356      357      358      359      360 
## 8.579383 8.586592 8.566657 8.542488 8.540136 8.550752 8.540674 8.536431 
##      361      362      363      364      365      366      367      368 
## 8.537313 8.537953 8.540674 8.542488 8.542488 8.547913 8.547035 8.537043 
##      369      370      371      372      373      374      375      376 
## 8.542488 8.582917 8.553901 8.577669 8.596397 8.596397 8.547913 8.537313 
##      377      378      379      380      381      382      383      384 
## 8.571166 8.548825 8.552817 8.545383 8.536240 8.548825 8.568125 8.539162 
##      385      386      387      388      389      390      391      392 
## 8.556174 8.557362 8.543868 8.547035 8.541245 8.543161 8.548825 8.540136 
##      393      394      395      396      397      398      399      400 
## 8.547913 8.582917 8.590408 8.555020 8.579383 8.594365 8.592369 8.540136 
##      401      402      403      404      405      406      407      408 
## 8.548825 8.540136 8.545383 8.543161 8.538726 8.536602 8.536115 8.537170 
##      409      410      411      412      413      414      415      416 
## 8.542488 8.600566 8.547913 8.542488 8.547913 8.546192 8.537616 8.552817 
##      417      418      419      420      421      422      423      424 
## 8.545383 8.547035 8.574347 8.584737 8.557362 8.543868 8.543868 8.536093 
##      425      426      427      428      429      430      431      432 
## 8.543868 8.547035 8.542488 8.566657 8.548825 8.543161 8.545383 8.544608 
##      433      434      435      436      437      438      439      440 
## 8.582917 8.577669 8.571166 8.577669 8.572739 8.568125 8.577669 8.588482 
##      441      442      443      444      445      446      447      448 
## 8.539632 8.542488 8.540136 8.548825 8.545383 8.538323 8.540136 8.539162 
##      449      450      451      452      453      454      455      456 
## 8.562464 8.569628 8.563827 8.562464 8.543161 8.541245 8.538323 8.536431 
##      457      458      459      460      461      462      463      464 
## 8.536188 8.594365 8.594365 8.584737 8.539162 8.538323 8.552817 8.543868 
##      465      466      467      468      469      470      471      472 
## 8.537313 8.548825 8.537043 8.577669 8.581133 8.607086 8.558585 8.536806 
##      473      474      475      476      477      478      479      480 
## 8.539632 8.541850 8.541245 8.539162 8.571166 8.540136 8.574347 8.572739 
##      481      482      483      484      485      486      487      488 
## 8.537313 8.542488 8.555020 8.541245 8.545383 8.550752 8.540136 8.556174 
##      489      490      491      492      493      494      495      496 
## 8.584737 8.536293 8.536806 8.577669 8.575991 8.541850 8.541850 8.544608 
##      497      498      499      500      501      502      503      504 
## 8.543868 8.581133 8.536068 8.536602 8.536188 8.577669 8.575991 8.562464 
##      505      506      507      508      509      510      511      512 
## 8.575991 8.536431 8.536115 8.539632 8.537616 8.541850 8.541245 8.537616 
##      513      514      515      516      517      518      519      520 
## 8.537616 8.536806 8.537953 8.565224 8.574347 8.561136 8.569628 8.553901 
##      521      522      523      524      525      526      527      528 
## 8.553901 8.571166 8.562464 8.571166 8.557362 8.536093 8.536093 8.537616 
##      529      530      531 
## 8.537313 8.540674 8.543161
RMSE(predict(model_nnet, test), test$ph)
## [1] 0.1715755

Our neural network model has an RMSE of approximately 0.172 on the test set.
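
To make the tuning grid used for the neural network concrete, the snippet below rebuilds it in base R using the same values as the code above. caret's train() evaluates one candidate model per row of the grid, so this shows how many combinations of decay and size were compared.

```r
# Same grid as in the neural network code above: every combination of the
# three weight-decay values and the three hidden-layer sizes.
my.grid <- expand.grid(.decay = c(0.5, 0, 1), .size = c(5, 6, 7))

nrow(my.grid)  # 9 candidate combinations
my.grid
```

Note that the decay values are 0.5, 0 and 1, so the grid includes a model with no weight decay at all.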

Model Selection

Across our five models, we obtained the following RMSE values when we applied the models to our test data set:

  1. Linear Regression - 0.1358189
  2. Decision Tree (Single Tree) - 0.1344733
  3. Random Forest - 0.1073807
  4. Support Vector Machine - 0.1318081
  5. Neural Network - 0.1715755

Based on the RMSE values, the Random Forest model fitted on all variables performed best among our models, so we will use it to predict pH.
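
The comparison can also be laid out as a small table and sorted, which makes the ranking explicit. This is only a sketch: the Linear Regression value is taken from the list above, and the other values are the test-set RMSEs printed in the model outputs earlier in this report for the first variant of each model. The name rmse_results is made up for this example.

```r
# Test-set RMSE per model, collected by hand from the outputs in this report.
rmse_results <- data.frame(
  model = c("Linear Regression", "Decision Tree", "Random Forest",
            "Support Vector Machine", "Neural Network"),
  rmse  = c(0.1358189, 0.1344733, 0.1073807, 0.1318081, 0.1715755),
  stringsAsFactors = FALSE
)

# Sort from best (lowest RMSE) to worst.
rmse_results <- rmse_results[order(rmse_results$rmse), ]
rmse_results

# The first row is the model with the lowest test-set RMSE.
rmse_results$model[1]
```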

Final Prediction on Eval Data

eval_data_raw <- read_excel('StudentEvaluation.xlsx')

eval_data <- clean_names(eval_data_raw)
eval_data <- as_tibble(eval_data)

# Replace missing brand codes with "A"
eval_data <- eval_data %>% 
  mutate(brand_code = ifelse(is.na(brand_code), "A", brand_code))

# Change brand code variable to factor
eval_data <- eval_data %>%
  mutate(brand_code = factor(brand_code, levels = sort(unique(brand_code))))

# Drop the ph column; it is re-created below from the model predictions
eval_data <- eval_data %>% select(-ph)

# Impute missing numeric values with the column median
eval_data <- eval_data %>%
  mutate(across(where(is.numeric), ~replace_na(., median(., na.rm = TRUE))))

ph_prediction <- predict(model_rf1,eval_data)

eval_data$ph <- ph_prediction

write_xlsx(eval_data, 'StudentEvaluation_final.xlsx')

For our final predictions, we imputed the missing values in the numeric columns with the column median. We then used our Random Forest model to predict the pH for each observation in the evaluation dataset.
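
The code above computes the medians from the evaluation data itself. One variation worth considering, which is not what this report does, is to reuse the medians learned from the training data so that the evaluation set is treated exactly like new, unseen data. The base R sketch below illustrates the idea on two tiny made-up data frames; train_toy, eval_toy and their values are hypothetical.

```r
# Hypothetical toy frames standing in for the training and evaluation data.
train_toy <- data.frame(carb_volume  = c(5.2, 5.4, 5.6),
                        filler_speed = c(3900, 4000, 4010))
eval_toy  <- data.frame(carb_volume  = c(5.3, NA),
                        filler_speed = c(NA, 3950))

# Medians learned from the training data only.
train_medians <- vapply(train_toy, median, numeric(1), na.rm = TRUE)

# Fill each NA in the evaluation data with the matching training median.
eval_imputed <- eval_toy
for (col in names(train_medians)) {
  missing <- is.na(eval_imputed[[col]])
  eval_imputed[[col]][missing] <- train_medians[[col]]
}

eval_imputed
```

Either approach leaves no missing values for the model to trip over. The difference is only in which data the fill-in values come from.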

Conclusion

Although the Random Forest model predicts the pH of the product best in terms of root mean square error, one of its main shortcomings is that it is not easily interpretable. As a result, we can be fairly confident in our pH predictions, but we have limited ability to explain how the model arrives at them. If we are primarily interested in understanding the model, and are willing to sacrifice some predictive ability, then we should use the single decision tree. It has a higher RMSE, but it can be drawn as a tree diagram that helps us understand which factors drive the pH.