Project 2
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
TRAINING DATA: StudentData.xlsx
TEST DATA: StudentEvaluation.xlsx
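# Load packages and import the training and evaluation data
library(readxl)  # read_excel()
library(dplyr)   # %>% and filter()
library(tidyr)   # drop_na()
library(caret)   # nearZeroVar(), createDataPartition(), trainControl()
sd_data <- read_excel('StudentData.xlsx')
se_data <- read_excel('StudentEvaluation.xlsx')
# Exploring the data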
colnames(sd_data)
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
str(sd_data)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
## $ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
## $ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
## $ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num [1:2571] 141 140 145 133 137 ...
## $ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
## $ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num [1:2571] 121 119 120 118 119 ...
## $ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
## $ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
## $ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num [1:2571] 725 727 735 731 723 ...
## $ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
## $ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
head(sd_data)
## # A tibble: 6 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # … with 28 more variables: Carb Temp <dbl>, PSC <dbl>, PSC Fill <dbl>,
## # PSC CO2 <dbl>, Mnf Flow <dbl>, Carb Pressure1 <dbl>, Fill Pressure <dbl>,
## # Hyd Pressure1 <dbl>, Hyd Pressure2 <dbl>, Hyd Pressure3 <dbl>,
## # Hyd Pressure4 <dbl>, Filler Level <dbl>, Filler Speed <dbl>,
## # Temperature <dbl>, Usage cont <dbl>, Carb Flow <dbl>, Density <dbl>,
## # MFR <dbl>, Balling <dbl>, Pressure Vacuum <dbl>, PH <dbl>,
## # Oxygen Filler <dbl>, Bowl Setpoint <dbl>, Pressure Setpoint <dbl>, …
# Drop rows with a missing PH and rows where PH is 9 or higher
sd_data <- sd_data %>%
  filter(!is.na(PH), PH < 9)
## values ind
## 1 0.361587534 Bowl Setpoint
## 2 0.352043962 Filler Level
## 3 0.233593699 Carb Flow
## 4 0.219735497 Pressure Vacuum
## 5 0.196051481 Carb Rel
## 6 0.166682228 Alch Rel
## 7 0.164485364 Oxygen Filler
## 8 0.109371168 Balling Lvl
## 9 0.098866734 PC Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb Pressure
## 13 0.072132509 Carb Volume
## 14 0.032279368 Carb Temp
## 15 -0.007997231 Air Pressurer
## 16 -0.023809796 PSC Fill
## 17 -0.040882953 Filler Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC CO2
## 22 -0.118335902 Fill Ounces
## 23 -0.118764185 Carb Pressure1
## 24 -0.171434026 Hyd Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd Pressure2
## 27 -0.268101792 Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463 Fill Pressure
## 30 -0.357611993 Usage cont
## 31 -0.459231253 Mnf Flow
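The code that produced the table above is not shown. As a rough sketch only, a ranking in the same `values` / `ind` layout could be built with base R's `cor()` and `stack()`. The toy data frame and its column names (`x1`, `x2`, `x3`) are made up for illustration and are not part of the original analysis.

```r
# Hypothetical toy data standing in for the numeric columns of sd_data
set.seed(1)
toy <- data.frame(PH = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
toy$x3 <- toy$PH + rnorm(50, sd = 0.1)  # strongly related to PH by construction

# Correlation of each predictor with PH, using pairwise-complete observations
predictors <- setdiff(names(toy), "PH")
cors <- sapply(toy[predictors],
               function(x) cor(x, toy$PH, use = "pairwise.complete.obs"))

# stack() turns the named vector into a two-column (values, ind) data frame
cor_table <- stack(sort(cors, decreasing = TRUE))
cor_table
```

Sorting in decreasing order puts the strongest positive correlations at the top and the strongest negative ones at the bottom, which matches the layout of the table above.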
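Features whose correlations with PH are largest in absolute value, whether positive or negative, show the strongest linear relationship with it and are the most promising candidate predictors. Correlation captures only linear association, so a low value does not rule out a useful nonlinear effect.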
Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the highest correlations (positive) with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH.
Next we check for multicollinearity by looking at the correlations between all pairs of predictors. Features that are strongly correlated with each other carry largely redundant information, so we should avoid including both members of such a pair.
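One way such a check might look, as a minimal base R sketch: compute the predictor correlation matrix and list every pair whose absolute correlation exceeds a cutoff. The toy data and the 0.9 cutoff are assumptions for illustration; the report does not state which threshold was used.

```r
# Hypothetical toy predictors; x3 is nearly a rescaled copy of x1
set.seed(2)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$x3 <- toy$x1 * 2 + rnorm(100, sd = 0.05)

cor_mat <- cor(toy, use = "pairwise.complete.obs")
cutoff <- 0.9  # assumed threshold, not taken from the report

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[high[, "row"]],
  feature_2 = colnames(cor_mat)[high[, "col"]],
  r         = cor_mat[high]
)
high_pairs
```

The caret package also provides `findCorrelation()`, which takes a correlation matrix and a cutoff and suggests columns to remove; the manual version here just makes the pairs visible.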
Near-zero variance: features that take the same value across most instances add little predictive information.
# Near-Zero Variance
nzv <- nearZeroVar(sd_data, saveMetrics= TRUE)
nzv[nzv$nzv, ]  # show only the features flagged as near-zero variance
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31 9.547935 FALSE TRUE
Hyd Pressure1 displays near-zero variance, so we will drop this feature.
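To make the two reported metrics concrete, here is a small base R sketch of how they can be computed and used to drop a column. `nzv_metrics` is a hypothetical helper written for this illustration, the toy data is invented, and the cutoffs shown are caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`). The helper assumes each column has at least two distinct values and no missing values.

```r
# Hypothetical helper mirroring the two metrics reported by nearZeroVar()
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  c(freqRatio     = counts[[1]] / counts[[2]],            # most common vs. second most common
    percentUnique = 100 * length(counts) / length(x))     # share of distinct values
}

# Toy data: one column dominated by a single value, one fully varied
toy <- data.frame(
  mostly_zero = c(rep(0, 93), rep(10, 3), 20, 30, 40, 50),
  varied      = seq_len(100)
)

metrics <- t(sapply(toy, nzv_metrics))
metrics

# Flag a feature when both conditions hold (caret's default cutoffs)
flag <- metrics[, "freqRatio"] > 95 / 5 & metrics[, "percentUnique"] < 10
toy_reduced <- toy[, !flag, drop = FALSE]
names(toy_reduced)

# On the real data the equivalent drop might be written as:
# sd_data <- dplyr::select(sd_data, -`Hyd Pressure1`)
```

Under these defaults, the Hyd Pressure1 row in the output above (freqRatio 31, percentUnique about 9.55) crosses both cutoffs, which is why it is flagged.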
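# PH is a continuous response, so keep it numeric (as.factor() would turn this into a classification problem)
y <- sd_data$PH
# bev_imputed is defined outside this section; the indices imply the 2,566 training rows come first, followed by the 267 evaluation rows
bev_train <- bev_imputed[1:2566, ]
bev_eval <- bev_imputed[2567:2833, ]
# split the training data into train/test sets before running cross-validation
set.seed(123)  # arbitrary seed so the split is reproducible
samples <- createDataPartition(y, p = .75, list = FALSE)
x_train <- bev_train[samples, ]
x_test <- bev_train[-samples, ]
y_train <- y[samples]
y_test <- y[-samples]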
We take a model-stacking approach: each candidate model is tuned separately, and the tuned models are then combined into a metamodel that forms predictions as a linear combination of their outputs.
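The stacking idea can be illustrated at toy scale in base R. This is only a sketch under stated assumptions: the data is simulated, and two simple `lm()` fits stand in for the tuned base learners, so it is not the configuration used in this report. In caretEnsemble, `caretStack()` plays the role of the metamodel step for a fitted `caretList`.

```r
# Simulated data (hypothetical); y depends on x1 linearly and on x2 quadratically
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2^2 + rnorm(n, sd = 0.3)

# Two deliberately different base learners (stand-ins for the tuned models)
fit_a <- function(d) lm(y ~ x1, data = d)
fit_b <- function(d) lm(y ~ I(x2^2), data = d)

# 5-fold out-of-fold predictions, so the metamodel is trained on
# predictions the base learners made for rows they did not see
folds <- sample(rep(1:5, length.out = n))
oof <- data.frame(a = numeric(n), b = numeric(n), y = toy$y)
for (k in 1:5) {
  held_out <- folds == k
  oof$a[held_out] <- predict(fit_a(toy[!held_out, ]), toy[held_out, ])
  oof$b[held_out] <- predict(fit_b(toy[!held_out, ]), toy[held_out, ])
}

# Metamodel: a linear combination of the base learners' predictions
meta <- lm(y ~ a + b, data = oof)
coef(meta)

# Compare errors (the stacked figure is in-sample at the metamodel level)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_a <- rmse(oof$y, oof$a)
rmse_b <- rmse(oof$y, oof$b)
rmse_stack <- rmse(oof$y, fitted(meta))
c(model_a = rmse_a, model_b = rmse_b, stacked = rmse_stack)
```

Because each base learner captures a different part of the signal, the weighted combination fits better than either one alone, which is the motivation for stacking diverse models.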
# Training Grid
# xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
# cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
# mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
# dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
# rf_grid <- expand.grid(mtry = 25)
# tuning_list <-list(
# caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
# caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
# caretModelSpec(method="cubist", tuneGrid = cub_grid),
# caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
# caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
# )
# my_control <- trainControl(method = 'cv',
# number = 5,
# savePredictions = 'final',
# index = createFolds(y_train, k = 5, returnTrain = TRUE),  # index expects the training rows of each fold
# allowParallel = TRUE)
# mod_list <- caretList(
# x = x_train,
# y = y_train,
# preProcess = c('center', 'scale'),
# trControl = my_control,
# tuneList = tuning_list
# )
My computer was not up to the task of fitting this grid of models. After numerous attempts over many hours, it crashed RStudio and then my computer.