Project 2

ASSIGNMENT

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

TRAINING DATA: StudentData.xlsx
TEST DATA: StudentEvaluation.xlsx

# Importing data, removing empty column

library(readxl)  # read_excel()
library(dplyr)   # %>% and filter()
sd_data <- read_excel('StudentData.xlsx')
se_data <- read_excel('StudentEvaluation.xlsx')
# Exploring the data
colnames(sd_data)
##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"
str(sd_data)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
##  $ Brand Code       : chr [1:2571] "B" "A" "B" "A" ...
##  $ Carb Volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
##  $ PC Volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num [1:2571] 141 140 145 133 137 ...
##  $ PSC              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num [1:2571] 119 122 120 115 118 ...
##  $ Fill Pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num [1:2571] 121 119 120 118 119 ...
##  $ Filler Speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
##  $ Temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
##  $ Density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num [1:2571] 725 727 735 731 723 ...
##  $ Balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num [1:2571] 143 143 142 146 146 ...
##  $ Alch Rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
head(sd_data)
## # A tibble: 6 × 33
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # … with 28 more variables: Carb Temp <dbl>, PSC <dbl>, PSC Fill <dbl>,
## #   PSC CO2 <dbl>, Mnf Flow <dbl>, Carb Pressure1 <dbl>, Fill Pressure <dbl>,
## #   Hyd Pressure1 <dbl>, Hyd Pressure2 <dbl>, Hyd Pressure3 <dbl>,
## #   Hyd Pressure4 <dbl>, Filler Level <dbl>, Filler Speed <dbl>,
## #   Temperature <dbl>, Usage cont <dbl>, Carb Flow <dbl>, Density <dbl>,
## #   MFR <dbl>, Balling <dbl>, Pressure Vacuum <dbl>, PH <dbl>,
## #   Oxygen Filler <dbl>, Bowl Setpoint <dbl>, Pressure Setpoint <dbl>, …
# Drop rows with a missing PH and rows where PH is 9 or higher
sd_data <- sd_data %>% 
  filter(!is.na(PH), PH < 9)

##          values               ind
## 1   0.361587534     Bowl Setpoint
## 2   0.352043962      Filler Level
## 3   0.233593699         Carb Flow
## 4   0.219735497   Pressure Vacuum
## 5   0.196051481          Carb Rel
## 6   0.166682228          Alch Rel
## 7   0.164485364     Oxygen Filler
## 8   0.109371168       Balling Lvl
## 9   0.098866734         PC Volume
## 10  0.095546936           Density
## 11  0.076700227           Balling
## 12  0.076213407     Carb Pressure
## 13  0.072132509       Carb Volume
## 14  0.032279368         Carb Temp
## 15 -0.007997231     Air Pressurer
## 16 -0.023809796          PSC Fill
## 17 -0.040882953      Filler Speed
## 18 -0.045196477               MFR
## 19 -0.047066423     Hyd Pressure1
## 20 -0.069873041               PSC
## 21 -0.085259857           PSC CO2
## 22 -0.118335902       Fill Ounces
## 23 -0.118764185    Carb Pressure1
## 24 -0.171434026     Hyd Pressure4
## 25 -0.182659650       Temperature
## 26 -0.222660048     Hyd Pressure2
## 27 -0.268101792     Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463     Fill Pressure
## 30 -0.357611993        Usage cont
## 31 -0.459231253          Mnf Flow

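The features whose correlations with PH are largest in absolute value have the strongest linear association with PH, which makes them the most promising candidate predictors. Correlation only measures linear relationships, so a value near zero does not by itself rule a feature out.

Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH (0.36 down to 0.20), while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations (-0.46 up to -0.27).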
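
The code that produced the correlation table is not shown. Its `values`/`ind` layout is what base R's `stack()` returns, so it could have been built along the following lines. This is a sketch under that assumption, not the original code; the data frame `toy` and its values are synthetic so that the snippet runs on its own.

```r
# Sketch: rank numeric predictors by their correlation with PH.
# `toy` is synthetic stand-in data, not the ABC Beverage data set.
set.seed(1)
n <- 200
toy <- data.frame(
  `Mnf Flow`      = rnorm(n),
  `Bowl Setpoint` = rnorm(n),
  `Carb Temp`     = rnorm(n),
  check.names = FALSE
)
toy$PH <- 8.5 - 0.10 * toy$`Mnf Flow` + 0.08 * toy$`Bowl Setpoint` +
  rnorm(n, sd = 0.1)

predictors <- setdiff(names(toy), "PH")
# Pairwise-complete observations keep missing values from producing NA results
cors <- sapply(toy[predictors], cor, y = toy$PH, use = "pairwise.complete.obs")
# stack() turns the named vector into a two-column data frame (values, ind)
cor_table <- stack(sort(cors, decreasing = TRUE))
print(cor_table)
```

On the real data, `toy` would be replaced by the numeric columns of `sd_data`.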

Next, we check for multicollinearity by examining the correlations between all pairs of predictors. Where two features are strongly correlated with each other, we should avoid including both in the model.
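
One way to carry out this check is to compute the full correlation matrix and list the pairs above a cutoff. The sketch below uses base R only; the 0.75 cutoff is an assumption rather than a value from this report, and the data are synthetic (the column names are borrowed from the data set, the values are made up).

```r
# Sketch: list predictor pairs whose absolute correlation exceeds a cutoff.
set.seed(2)
n <- 300
balling <- rnorm(n)
toy <- data.frame(
  Balling       = balling,
  `Balling Lvl` = balling + rnorm(n, sd = 0.1),  # built to nearly copy Balling
  Temperature   = rnorm(n),
  check.names = FALSE
)

cutoff <- 0.75  # assumed threshold, not taken from the report
cm <- cor(toy, use = "pairwise.complete.obs")
cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal
idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

For each pair that is flagged, one of the two features would be a candidate for removal. caret's `findCorrelation()` automates a similar selection.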

Near-zero variance: features that take the same value across most instances add little predictive information.

# Near-Zero Variance

library(caret)  # nearZeroVar()
nzv <- nearZeroVar(sd_data, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # keep only the features flagged as near-zero variance
##               freqRatio percentUnique zeroVar  nzv
## Hyd Pressure1        31      9.547935   FALSE TRUE

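Hyd Pressure1 displays near-zero variance: its frequency ratio of 31 exceeds caret's default cutoff of 95/5 = 19, and fewer than 10% of its values are unique. We will drop this feature.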
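
To make the two metrics concrete, the sketch below recomputes them by hand in base R. It is a simplified illustration, not caret's implementation: the helper `nzv_metrics()` is hypothetical, its default cutoffs are chosen to mirror caret's defaults, and the vector `x` is synthetic.

```r
# Sketch: recompute the near-zero-variance metrics for a single feature.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by the count of the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of all observations
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    nzv           = freq_ratio > freq_cut && percent_unique <= unique_cut
  )
}

# Synthetic feature: 100 values, 93 of them zero, 6 distinct values in total
x <- c(rep(0, 93), rep(1, 3), 2, 3, 4, 5)
m <- nzv_metrics(x)
print(m)
```

A feature is flagged only when both conditions hold: one value dominates, and there are few distinct values relative to the number of rows.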

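# PH is a continuous outcome, so keep it numeric rather than converting it to a factor
y <- sd_data$PH
# bev_imputed is assumed to hold the imputed training rows followed by the evaluation rows
# separate the evaluation rows, then split the training rows into train/test sets
# before running cross validation
bev_eval <- bev_imputed[2567:2833, ]
bev_train <- bev_imputed[1:2566, ]
set.seed(123)  # arbitrary seed so the partition is reproducible
samples <- createDataPartition(y, p = .75, list = FALSE)
x_train <- bev_train[samples, ]
x_test <- bev_train[-samples, ]
y_train <- y[samples]
y_test <- y[-samples]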

We take a model-stacking approach: we tune each candidate model separately and then combine the tuned models into a metamodel whose predictions are a linear combination of the candidates' predictions.
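
The idea can be illustrated without caretEnsemble. The sketch below is a base R toy example, not the pipeline used in this project: the two base learners (`m_lin`, `m_quad`), the synthetic data frame `d`, and the fold assignment are all assumptions made for illustration. Each base learner predicts only on rows it was not trained on, and the metamodel is then fitted to those out-of-fold predictions.

```r
# Sketch: stack two base learners with a linear metamodel (base R only).
set.seed(3)
n <- 400
d <- data.frame(x1 = runif(n, -2, 2), x2 = rnorm(n))
d$y <- 8.5 + 0.3 * d$x1^2 - 0.2 * d$x2 + rnorm(n, sd = 0.1)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
# Out-of-fold predictions from each base learner, plus the observed outcome
oof <- data.frame(lin = numeric(n), quad = numeric(n), y = d$y)

for (i in seq_len(k)) {
  tr <- d[fold != i, ]
  te <- d[fold == i, ]
  m_lin  <- lm(y ~ x1 + x2, data = tr)   # base learner 1: linear terms only
  m_quad <- lm(y ~ I(x1^2), data = tr)   # base learner 2: quadratic in x1
  oof$lin[fold == i]  <- predict(m_lin, te)
  oof$quad[fold == i] <- predict(m_quad, te)
}

# Metamodel: a linear combination of the base learners' held-out predictions
meta <- lm(y ~ lin + quad, data = oof)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
stack_rmse <- rmse(oof$y, fitted(meta))
print(c(lin = rmse(oof$y, oof$lin), quad = rmse(oof$y, oof$quad),
        stack = stack_rmse))
```

In caretEnsemble, the analogous step is to pass the fitted `caretList` to `caretStack()` with a linear method.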

# Training Grid

# xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
# cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
# mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
# dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
# rf_grid <- expand.grid(mtry = 25)
# tuning_list <-list(
#   caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
#   caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
#   caretModelSpec(method="cubist", tuneGrid = cub_grid),
#   caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
#   caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
# )
# my_control <- trainControl(method = 'cv',
#                            number = 5,
#                            savePredictions = 'final',
#                            # index expects the rows used for TRAINING in each fold
#                            index = createFolds(y_train, 5, returnTrain = TRUE),
#                            allowParallel = TRUE)
# mod_list <- caretList(
#   x = x_train,
#   y = y_train,
#   preProcess = c('center', 'scale'),
#   trControl = my_control,
#   tuneList = tuning_list
# )

My computer was not up to the task of training the models in this grid. After numerous attempts and many hours, it crashed RStudio and then my computer.