The data science team of Salma Elshahawy, John K. Hancock, and Farhana Zahir have prepared the following technical report to address the issue of understanding ABC’s manufacturing process and its predictive factors. This report is the predictive value of the PH.
The report consists of the following:
PART 1: THE DATASETS
PART 2: DATA PREPARATION
PART 3: EXPERIMENTATION
PART 4: EVALUATE MODELS
PART 5: USE THE BEST MODEL TO FORECAST PH
PART 6: CONCLUSIONS
In this section, we did the following:
Import the Datasets
Evaluate the Dataset
Devise a Data Preparation Strategy
The excel files, StudentData.xlsx and StudentEvaluation.xlsx, are hosted on the team’s Github page. Here, they’re downloaded and read into the dataframes, beverage_training_data and beverage_test_data.
temp_train_file <- tempfile(pattern="StudentData", fileext = ".xlsx")
temp_eval_file <- tempfile(pattern="StudentEvaluation", fileext = ".xlsx")
student_train <- "https://github.com/JohnKHancock/CUNY_DATA624_Project2/blob/main/raw/StudentData.xlsx?raw=true"
student_eval <- "https://github.com/JohnKHancock/CUNY_DATA624_Project2/blob/main/raw/StudentEvaluation.xlsx?raw=true"
student_data <- GET(student_train,
authenticate(Sys.getenv("GITHUB_PAT"), ""),
write_disk(path = temp_train_file))
student_eval <- GET(student_eval,
authenticate(Sys.getenv("GITHUB_PAT"), ""),
write_disk(path = temp_eval_file))After importing the Beverage Training dataset, we see that there are 2,571 observations consisiting of 32 predictor variables and one dependent variable, PH. We also see that “Brand Code” is a factor variable that will need to be handled as well as several observations with a number of NAs.
For the Beverage Testing dataset, we see 267 observations, the 32 predictors, and the dependent variable PH which is all NAs. This is the data that we will have to predict. Same as the training set, We also see that “Brand Code” is a character variable that will need to be handled as well as several observations with a number of NAs.
Beverage Training Data
## [1] 2571 33
## [1] "character"
Beverage Testing Data
After analyzing the data, we devised the following processes in order to prepare the data for analysis
A. Isolate predictors from the dependent variable
B. Correct the Predictor Names
C. Create a data frame of numeric values only
D. Identify and Impute Missing Data
E. Identify and Address Outliers
F. Check for and remove correlated predictors
G. Identify Near Zero Variance Predictors
H. Impute missing values and Create dummy variables for Brand.Code
I. Impute missing data for Dependent Variable PH
For the training set, remove the predictor variable, PH and store it into the variable, y_train.
Correct the space in the predictor names using the make.names function. The space in the names may be problematic. This was applied to both datasets.
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "Oxygen Filler" "Bowl Setpoint"
## [28] "Pressure Setpoint" "Air Pressurer" "Alch Rel"
## [31] "Carb Rel" "Balling Lvl"
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
## [28] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
## [31] "Carb.Rel" "Balling.Lvl"
colnames(predictors_evaluate)<- make.names(colnames(predictors_evaluate))
colnames(predictors_evaluate)## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
## [28] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
## [31] "Carb.Rel" "Balling.Lvl"
We saw earlier that Brand.Code is a categorical value. Because of that we subset the dataframe to remove it. We will handle this variable later.
The predictor MFR has the most missing values at 212. I used knn imputation to handle missing values. After the knn imputation, there are still missing values for Brand.Code which will be handled in a later section.
Training Data
| Predictors | NAs |
|---|---|
| MFR | 212 |
| Filler.Speed | 57 |
| PC.Volume | 39 |
| PSC.CO2 | 39 |
| Fill.Ounces | 38 |
| PSC | 33 |
missingData %>%
ggplot() +
geom_bar(aes(x=reorder(Predictors,NAs), y=NAs, fill=factor(NAs)), stat = 'identity', ) +
labs(x='Predictor', y="NAs", title='Number of missing values') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + coord_flip() missingData <- as.data.frame(colSums(is.na(predictors_imputed)))
colnames(missingData) <- c("NAs")
missingData <- cbind(Predictors = rownames(missingData), missingData)
rownames(missingData) <- 1:nrow(missingData)
missingData <- missingData[missingData$NAs != 0,] %>%
arrange(desc(NAs))
head(missingData)## [1] Predictors NAs
## <0 rows> (or 0-length row.names)
Next we looked at the distributions of the numeric variables. There are only four predictors that are normally distributed. The box plots show a high number of outliers in the data. To correct for this, the pre processing step of center and scale was used. We centered and scaled these distributions.
par(mfrow = c(3, 3))
datasub = melt(predictors_imputed)
suppressWarnings(ggplot(datasub, aes(x= value)) +
geom_density(fill='orange') + facet_wrap(~variable, scales = 'free') )ggplot(data = datasub , aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=8,aes(fill=variable)) +
coord_flip() + theme(legend.position = "none")preprocessing <- preProcess(as.data.frame(predictors_imputed), method = c("center", "scale"))
preprocessing ## Created from 2571 samples and 31 variables
##
## Pre-processing:
## - centered (31)
## - ignored (0)
## - scaled (31)
num_predictors_02 <- spatialSign(num_predictors_01)
num_predictors_02 <- as.data.frame(num_predictors_02)par(mfrow = c(3, 3))
datasub = melt(num_predictors_02)
suppressWarnings(ggplot(datasub, aes(x= value)) +
geom_density(fill='blue') + facet_wrap(~variable, scales = 'free') )ggplot(data = datasub , aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=8,aes(fill=variable)) +
coord_flip() + theme(legend.position = "none")Remove the zero variance predictor. There are no near zero variance predictors
## character(0)
Earlier, we saw that there are 120 missing values for Brand.Code, a factor variable. The imputation strategy here is to impute with the most frequent value, “B”. After imputation, Brand.Code was converted to dummy variables. The converted Brand.Code predictor is joined to the num_predictors_02.
## [1] 120
## [1] "A" "B" "C" "D"
##
## A B C D
## 293 1239 304 615
## factor(0)
## Levels: A B C D
## Dummy Variable Object
##
## Formula: ~Brand.Code
## 1 variables, 1 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
| Brand.Code.A | Brand.Code.B | Brand.Code.C | Brand.Code.D |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
The final step is to impute missing values for the dependent variable, PH, with the median for PH.
missingData <- as.data.frame(colSums(is.na(processed.train)))
colnames(missingData) <- c("NAs")
missingData <- cbind(Predictors = rownames(missingData), missingData)
rownames(missingData) <- 1:nrow(missingData)
missingData <- missingData[missingData$NAs != 0,] %>%
arrange(desc(NAs))
head(missingData)## [1] Predictors NAs
## <0 rows> (or 0-length row.names)
Split the Time Series
Before we begin with the experimentation, We split the training data into train and test sets
evaluation.split <- initial_split(processed.train, prop = 0.7, strata = "PH")
train <- training(evaluation.split)
test <- testing(evaluation.split)Modeling
We examined 12 models. We looked at Linear Models, Non Linear Regression Models, and Tree Based Models. For all of the models, MNF.Flow was the most important predictor with the exception of the bag tree model. Other consistently important predictors include predictor, Brand C and D. Residuals for each model appear random with no discernable patterns. In Part 4, we evaluated the metrics from each model.
set.seed(100)
x_train <- train[, 2:29]
y_train <- as.data.frame(train$PH)
colnames(y_train) <- c("PH")
x_test <- test[, 2:29]
y_test <- as.data.frame(test$PH)
colnames(y_test) <- c("PH")
ctrl <- trainControl(method = "cv", number = 10)Basic linear model
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53571 -0.07828 0.01045 0.08938 0.40342
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.615523 0.014001 615.348 < 2e-16 ***
## Brand.Code.A -0.078380 0.013764 -5.695 1.44e-08 ***
## Brand.Code.B -0.075440 0.020752 -3.635 0.000286 ***
## Brand.Code.C -0.197420 0.022514 -8.769 < 2e-16 ***
## Brand.Code.D NA NA NA NA
## Carb.Volume 0.016082 0.048987 0.328 0.742727
## Fill.Ounces -0.054264 0.018333 -2.960 0.003118 **
## PC.Volume -0.043269 0.022369 -1.934 0.053226 .
## Carb.Pressure -0.021761 0.070960 -0.307 0.759129
## Carb.Temp 0.053966 0.064859 0.832 0.405498
## PSC -0.022043 0.018197 -1.211 0.225907
## PSC.Fill 0.006666 0.017924 0.372 0.710019
## PSC.CO2 -0.020982 0.018518 -1.133 0.257354
## Mnf.Flow -0.384889 0.032089 -11.994 < 2e-16 ***
## Carb.Pressure1 0.148888 0.021024 7.082 2.05e-12 ***
## Fill.Pressure 0.065774 0.028514 2.307 0.021187 *
## Hyd.Pressure1 0.036877 0.028853 1.278 0.201376
## Hyd.Pressure2 0.065856 0.035810 1.839 0.066076 .
## Hyd.Pressure4 -0.001812 0.026705 -0.068 0.945924
## Filler.Level 0.139558 0.025630 5.445 5.90e-08 ***
## Filler.Speed 0.037063 0.046436 0.798 0.424889
## Temperature -0.131774 0.022300 -5.909 4.11e-09 ***
## Usage.cont -0.127890 0.020920 -6.113 1.20e-09 ***
## Carb.Flow 0.058711 0.024474 2.399 0.016548 *
## Density -0.151691 0.046570 -3.257 0.001146 **
## MFR -0.049564 0.044139 -1.123 0.261626
## Pressure.Vacuum -0.011111 0.022752 -0.488 0.625372
## Oxygen.Filler -0.080110 0.023002 -3.483 0.000508 ***
## Pressure.Setpoint -0.075700 0.027139 -2.789 0.005337 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1324 on 1773 degrees of freedom
## Multiple R-squared: 0.4008, Adjusted R-squared: 0.3917
## F-statistic: 43.92 on 27 and 1773 DF, p-value: < 2.2e-16
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.1332655 0.3852603 0.1048218 0.007213209 0.04610082 0.004388144
## lm variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## Mnf.Flow 100.000
## Brand.Code.C 72.957
## Carb.Pressure1 58.809
## Usage.cont 50.689
## Temperature 48.978
## Brand.Code.A 47.180
## Filler.Level 45.087
## Brand.Code.B 29.913
## Oxygen.Filler 28.633
## Density 26.742
## Fill.Ounces 24.250
## Pressure.Setpoint 22.819
## Carb.Flow 19.545
## Fill.Pressure 18.772
## PC.Volume 15.650
## Hyd.Pressure2 14.851
## Hyd.Pressure1 10.148
## PSC 9.588
## PSC.CO2 8.931
## MFR 8.847
Partial Least Squares or PLS
set.seed(100)
plsFit1 <- train(x_train, y_train$PH,
method = "pls",
tuneLength = 25,
trControl = ctrl)## Data: X dimension: 1801 28
## Y dimension: 1801 1
## Fit method: oscorespls
## Number of components considered: 11
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 12.07 33.31 46.71 54.58 60.23 66.57 69.87
## .outcome 30.30 32.63 35.96 38.35 39.17 39.50 39.81
## 8 comps 9 comps 10 comps 11 comps
## X 72.54 74.41 75.95 78.80
## .outcome 39.96 40.03 40.06 40.07
## ncomp
## 11 11
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 8 0.1332896 0.3853693 0.1050999 0.006947977 0.04477563 0.004187502
## pls variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.00
## Brand.Code.C 91.11
## Brand.Code.D 74.85
## Filler.Level 69.57
## Usage.cont 66.87
## Pressure.Setpoint 59.18
## Brand.Code.B 47.70
## Fill.Pressure 47.02
## Pressure.Vacuum 43.73
## Hyd.Pressure2 42.09
## Temperature 38.52
## Carb.Flow 33.73
## Oxygen.Filler 30.53
## Brand.Code.A 30.28
## Carb.Pressure1 29.09
## Hyd.Pressure4 24.87
## PSC 22.39
## Fill.Ounces 19.71
## Hyd.Pressure1 17.01
## Density 14.34
Ridge Regression
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(x_train, y_train$PH,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl)## Length Class Mode
## call 4 -none- call
## actions 29 -none- list
## allset 28 -none- numeric
## beta.pure 812 -none- numeric
## vn 28 -none- character
## mu 1 -none- numeric
## normx 28 -none- numeric
## meanx 28 -none- numeric
## lambda 1 -none- numeric
## L1norm 29 -none- numeric
## penalty 29 -none- numeric
## df 29 -none- numeric
## Cp 29 -none- numeric
## sigma2 1 -none- numeric
## xNames 28 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
## lambda
## 3 0.01428571
## lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3 0.01428571 0.1331142 0.3876786 0.1048537 0.009136161 0.05586865 0.007290587
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.323
## Usage.cont 71.061
## Pressure.Setpoint 56.017
## Fill.Pressure 43.944
## Brand.Code.C 35.957
## Hyd.Pressure1 35.400
## Oxygen.Filler 34.869
## Hyd.Pressure2 31.914
## Pressure.Vacuum 31.715
## Carb.Flow 30.369
## Temperature 24.044
## Density 18.005
## Carb.Pressure1 16.407
## Brand.Code.D 13.635
## Hyd.Pressure4 10.293
## MFR 8.351
## PSC 6.888
## Carb.Volume 6.402
## Fill.Ounces 4.817
KNN
knnModel <- train(x = x_train, y = y_train$PH,
method = "knn",
tuneLength = 25,
trControl = ctrl)
knnModel## k-Nearest Neighbors
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1620, 1621, 1622, 1621, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1289071 0.4307780 0.09872210
## 7 0.1290256 0.4274458 0.09976753
## 9 0.1285311 0.4308696 0.10010117
## 11 0.1280419 0.4351280 0.10010809
## 13 0.1290343 0.4269183 0.10123481
## 15 0.1294052 0.4239534 0.10184088
## 17 0.1295109 0.4236731 0.10221408
## 19 0.1296798 0.4223976 0.10260451
## 21 0.1300127 0.4202426 0.10291240
## 23 0.1304302 0.4166402 0.10335095
## 25 0.1305764 0.4158462 0.10371342
## 27 0.1309964 0.4119659 0.10397707
## 29 0.1314408 0.4076336 0.10437183
## 31 0.1316899 0.4053112 0.10449530
## 33 0.1323309 0.3994751 0.10512665
## 35 0.1327375 0.3955493 0.10547676
## 37 0.1331463 0.3919652 0.10586459
## 39 0.1333990 0.3892265 0.10620414
## 41 0.1339517 0.3842548 0.10655252
## 43 0.1342326 0.3819583 0.10690409
## 45 0.1345271 0.3791287 0.10718533
## 47 0.1350015 0.3742641 0.10756981
## 49 0.1353721 0.3707581 0.10794895
## 51 0.1356639 0.3678003 0.10810084
## 53 0.1357413 0.3671776 0.10819017
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knnPred <- predict(knnModel, newdata = x_test)
knn_res <- postResample(pred = knnPred, obs = y_test$PH)
knn_res## RMSE Rsquared MAE
## 0.1341075 0.4357828 0.1004390
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.323
## Usage.cont 71.061
## Pressure.Setpoint 56.017
## Fill.Pressure 43.944
## Brand.Code.C 35.957
## Hyd.Pressure1 35.400
## Oxygen.Filler 34.869
## Hyd.Pressure2 31.914
## Pressure.Vacuum 31.715
## Carb.Flow 30.369
## Temperature 24.044
## Density 18.005
## Carb.Pressure1 16.407
## Brand.Code.D 13.635
## Hyd.Pressure4 10.293
## MFR 8.351
## PSC 6.888
## Carb.Volume 6.402
## Fill.Ounces 4.817
Neural Network
nnetGrid <- expand.grid(.decay = c(0, .01, 1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(x = x_train,
y = y_train$PH,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
linout = FALSE, trace = FALSE,
MaxNWts = 5* (ncol(x_train) + 1) + 5 + 1,
maxit = 250)## Model Averaged Neural Network
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 7.548155 NaN 7.546248
## 0.00 2 7.548155 NaN 7.546248
## 0.00 3 7.548155 NaN 7.546248
## 0.00 4 7.548155 NaN 7.546248
## 0.00 5 7.548155 NaN 7.546248
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 7.548160 0.04638595 7.546253
## 0.01 2 7.548158 0.05146390 7.546252
## 0.01 3 7.548158 0.04821311 7.546251
## 0.01 4 7.548157 0.04938198 7.546251
## 0.01 5 7.548157 0.05003356 7.546250
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 1.00 1 7.548515 0.04450603 7.546608
## 1.00 2 7.548431 0.04375639 7.546524
## 1.00 3 7.548389 0.04313695 7.546482
## 1.00 4 7.548363 0.04384483 7.546456
## 1.00 5 7.548345 0.04261512 7.546438
## 1.00 6 NaN NaN NaN
## 1.00 7 NaN NaN NaN
## 1.00 8 NaN NaN NaN
## 1.00 9 NaN NaN NaN
## 1.00 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0 and bag = FALSE.
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 28 -none- character
## terms 3 terms call
## coefnames 28 -none- character
## xlevels 0 -none- list
## xNames 28 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 4 -none- list
## size decay bag
## 1 1 0 FALSE
nnetPred <- predict(nnetTune, newdata=x_test)
NNET <- postResample(pred = nnetPred, obs = y_test$PH)
NNET## RMSE Rsquared MAE
## 7.546313 NA 7.544208
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.0444967 -0.00255656 -0.01533572 -0.0009690082 -0.01388185 -0.02476432
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.02733253 -0.06383343 -0.01796363 0.0211377 -0.07885156 -0.02117198
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08041816 -0.004648421 0.08756702 0.06767177 -0.03681121 0.04315959
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler Pressure.Setpoint
## 0.1000897 -0.103669 0.06397241 0.004087661 -0.05229242 -0.1130223
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.323
## Usage.cont 71.061
## Pressure.Setpoint 56.017
## Fill.Pressure 43.944
## Brand.Code.C 35.957
## Hyd.Pressure1 35.400
## Oxygen.Filler 34.869
## Hyd.Pressure2 31.914
## Pressure.Vacuum 31.715
## Carb.Flow 30.369
## Temperature 24.044
## Density 18.005
## Carb.Pressure1 16.407
## Brand.Code.D 13.635
## Hyd.Pressure4 10.293
## MFR 8.351
## PSC 6.888
## Carb.Volume 6.402
## Fill.Ounces 4.817
Multivariate Adaptive Regression Splines (MARS)
set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x = x_train, y = y_train$PH,
method = "earth",
tuneGrid = marsGrid,
trControl = ctrl)
marsTuned## Multivariate Adaptive Regression Spline
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1512491 0.2104052 0.11935097
## 1 3 0.1429723 0.2923063 0.11282176
## 1 4 0.1411747 0.3095787 0.11057112
## 1 5 0.1383362 0.3381902 0.10841522
## 1 6 0.1391648 0.3297423 0.10958049
## 1 7 0.1383856 0.3371574 0.10824672
## 1 8 0.1366790 0.3535772 0.10664396
## 1 9 0.1357293 0.3627081 0.10613276
## 1 10 0.1346032 0.3739627 0.10531150
## 1 11 0.1328562 0.3903256 0.10352115
## 1 12 0.1322423 0.3960232 0.10264091
## 1 13 0.1316478 0.4010136 0.10205327
## 1 14 0.1319410 0.3984270 0.10215082
## 1 15 0.1313941 0.4033764 0.10148739
## 1 16 0.1312224 0.4047243 0.10158651
## 1 17 0.1307316 0.4094399 0.10126565
## 1 18 0.1305490 0.4110641 0.10111011
## 1 19 0.1302390 0.4137795 0.10058515
## 1 20 0.1305649 0.4111408 0.10073974
## 1 21 0.1304526 0.4120958 0.10068330
## 1 22 0.1303694 0.4130050 0.10052329
## 1 23 0.1304248 0.4126426 0.10079779
## 1 24 0.1304229 0.4130461 0.10088656
## 1 25 0.1302218 0.4148954 0.10070464
## 1 26 0.1301330 0.4154645 0.10068667
## 1 27 0.1303311 0.4137986 0.10097572
## 1 28 0.1303412 0.4140682 0.10075750
## 1 29 0.1305394 0.4126508 0.10096984
## 1 30 0.1306181 0.4121157 0.10083498
## 1 31 0.1305637 0.4124257 0.10085856
## 1 32 0.1304635 0.4133839 0.10077203
## 1 33 0.1303501 0.4142317 0.10082221
## 1 34 0.1302887 0.4147064 0.10085682
## 1 35 0.1304079 0.4136023 0.10095888
## 1 36 0.1305100 0.4127716 0.10102440
## 1 37 0.1305100 0.4127716 0.10102440
## 1 38 0.1305100 0.4127716 0.10102440
## 2 2 0.1495282 0.2318041 0.11709558
## 2 3 0.1413626 0.3078472 0.11062236
## 2 4 0.1392298 0.3292158 0.10974376
## 2 5 0.1377947 0.3438899 0.10861701
## 2 6 0.1364721 0.3558759 0.10719276
## 2 7 0.1346738 0.3723283 0.10548004
## 2 8 0.1322354 0.3961861 0.10306367
## 2 9 0.1318670 0.3986459 0.10283865
## 2 10 0.1312618 0.4047675 0.10207636
## 2 11 0.1298063 0.4177025 0.10085769
## 2 12 0.1288457 0.4268240 0.10016016
## 2 13 0.1279448 0.4350628 0.09936107
## 2 14 0.1275795 0.4381088 0.09928713
## 2 15 0.1267544 0.4445442 0.09833877
## 2 16 0.1266954 0.4457516 0.09822790
## 2 17 0.1253916 0.4573028 0.09710860
## 2 18 0.1252078 0.4588626 0.09701196
## 2 19 0.1248681 0.4619553 0.09677054
## 2 20 0.1247959 0.4627747 0.09665633
## 2 21 0.1245650 0.4647098 0.09645497
## 2 22 0.1247639 0.4633985 0.09689047
## 2 23 0.1244335 0.4662407 0.09675330
## 2 24 0.1244335 0.4661557 0.09691331
## 2 25 0.1242193 0.4681984 0.09679243
## 2 26 0.1242007 0.4683947 0.09665251
## 2 27 0.1240529 0.4694822 0.09662905
## 2 28 0.1240480 0.4695364 0.09651391
## 2 29 0.1239637 0.4703552 0.09642612
## 2 30 0.1239724 0.4702576 0.09639823
## 2 31 0.1238557 0.4712662 0.09632827
## 2 32 0.1238232 0.4713762 0.09629974
## 2 33 0.1238502 0.4711859 0.09636653
## 2 34 0.1240042 0.4699042 0.09644997
## 2 35 0.1240180 0.4697838 0.09644714
## 2 36 0.1240940 0.4691551 0.09649490
## 2 37 0.1240940 0.4691551 0.09649490
## 2 38 0.1240940 0.4691551 0.09649490
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 32 and degree = 2.
marsPred <- predict(marsTuned, newdata=x_test)
MARS <- postResample(pred = marsPred, obs = y_test$PH)
MARS## RMSE Rsquared MAE
## 0.13202105 0.45224451 0.09834183
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.0444967 -0.00255656 -0.01533572 -0.0009690082 -0.01388185 -0.02476432
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.02733253 -0.06383343 -0.01796363 0.0211377 -0.07885156 -0.02117198
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08041816 -0.004648421 0.08756702 0.06767177 -0.03681121 0.04315959
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler Pressure.Setpoint
## 0.1000897 -0.103669 0.06397241 0.004087661 -0.05229242 -0.1130223
## earth variable importance
##
## Overall
## Mnf.Flow 100.000
## Brand.Code.C 71.286
## Usage.cont 56.556
## Filler.Level 51.451
## Hyd.Pressure1 46.096
## Temperature 46.096
## Carb.Pressure1 37.879
## Pressure.Vacuum 37.879
## Brand.Code.A 27.457
## Pressure.Setpoint 24.466
## Fill.Pressure 21.352
## Density 20.589
## PC.Volume 11.118
## Filler.Speed 7.099
## Hyd.Pressure2 0.000
Support Vector Machines (SVM)
set.seed(100)
svmLTuned <- train(x = x_train, y = y_train$PH,
method = "svmLinear",
tuneLength = 25,
trControl = trainControl(method = "cv"))
svmLTuned## Support Vector Machines with Linear Kernel
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1352883 0.3713609 0.1040294
##
## Tuning parameter 'C' was held constant at a value of 1
svmLPred <- predict(svmLTuned, newdata=x_test)
svmL<- postResample(pred = svmLPred, obs = y_test$PH)
svmL## RMSE Rsquared MAE
## 0.1422938 0.3660721 0.1057271
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.0444967 -0.00255656 -0.01533572 -0.0009690082 -0.01388185 -0.02476432
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.02733253 -0.06383343 -0.01796363 0.0211377 -0.07885156 -0.02117198
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08041816 -0.004648421 0.08756702 0.06767177 -0.03681121 0.04315959
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler Pressure.Setpoint
## 0.1000897 -0.103669 0.06397241 0.004087661 -0.05229242 -0.1130223
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.323
## Usage.cont 71.061
## Pressure.Setpoint 56.017
## Fill.Pressure 43.944
## Brand.Code.C 35.957
## Hyd.Pressure1 35.400
## Oxygen.Filler 34.869
## Hyd.Pressure2 31.914
## Pressure.Vacuum 31.715
## Carb.Flow 30.369
## Temperature 24.044
## Density 18.005
## Carb.Pressure1 16.407
## Brand.Code.D 13.635
## Hyd.Pressure4 10.293
## MFR 8.351
## PSC 6.888
## Carb.Volume 6.402
## Fill.Ounces 4.817
resamples <- resamples( list(CondInfTree =ctreeModel,
BaggedTree = baggedTreeModel,
BoostedTree = gbmModel,
Cubist=cubistModel) )
summary(resamples)##
## Call:
## summary.resamples(object = resamples)
##
## Models: CondInfTree, BaggedTree, BoostedTree, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## CondInfTree 0.08940068 0.09500435 0.10040473 0.09938600 0.10397095 0.10720325
## BaggedTree 0.07434815 0.07757889 0.08326889 0.08224526 0.08575461 0.09016741
## BoostedTree 0.08292606 0.08531318 0.08879434 0.08878783 0.09074029 0.09652978
## Cubist 0.07526615 0.07622772 0.07877568 0.08017837 0.08409543 0.08686983
## NA's
## CondInfTree 0
## BaggedTree 0
## BoostedTree 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CondInfTree 0.11862944 0.1271589 0.1334408 0.1316086 0.1353915 0.1434740 0
## BaggedTree 0.09770323 0.1028704 0.1103010 0.1096036 0.1161584 0.1217212 0
## BoostedTree 0.10516751 0.1090659 0.1146769 0.1148873 0.1203323 0.1264487 0
## Cubist 0.09621053 0.1028891 0.1045039 0.1061805 0.1129815 0.1146383 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CondInfTree 0.3334229 0.3846787 0.4193517 0.4141590 0.4400073 0.4853761 0
## BaggedTree 0.4944730 0.5706456 0.6019601 0.5883611 0.6221605 0.6587625 0
## BoostedTree 0.4549572 0.5210052 0.5592613 0.5466390 0.5767253 0.6047203 0
## Cubist 0.5454104 0.5798619 0.6346177 0.6136746 0.6464894 0.6678379 0
Single Tree Models - cTree
convert_top_20_to_df <- function(df){
df1 <- as.data.frame(df)
df1['Predictors'] <- rownames(df)
colnames(df1) <- c("Overall", "Predictors")
rownames(df1) <- 1:nrow(df1)
return (df1)
}ctree_20 <- varImp(ctreeModel)
ctree_20 <- ctree_20$importance %>%
arrange(desc(Overall))
ctree_20 <- head(ctree_20,20)
ctree_20## Overall
## Mnf.Flow 100.000000
## Filler.Level 72.322654
## Usage.cont 71.061472
## Pressure.Setpoint 56.017247
## Fill.Pressure 43.943880
## Brand.Code.C 35.956618
## Hyd.Pressure1 35.399527
## Oxygen.Filler 34.868753
## Hyd.Pressure2 31.913795
## Pressure.Vacuum 31.715468
## Carb.Flow 30.369047
## Temperature 24.043599
## Density 18.004649
## Carb.Pressure1 16.406943
## Brand.Code.D 13.635039
## Hyd.Pressure4 10.293074
## MFR 8.350943
## PSC 6.888474
## Carb.Volume 6.402182
## Fill.Ounces 4.817296
ctree_20_df<- convert_top_20_to_df(ctree_20)
ctree_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "blue") +
theme_minimal() +
coord_flip() +
labs(title="rPart Predictor Variable Importance",
y="rPart Importance", x="Predictors") +
scale_y_continuous()cTreePred <- predict(ctreeModel, newdata=x_test)
cTreePred <- postResample(pred = cTreePred, obs = y_test$PH)
cTreePred## RMSE Rsquared MAE
## 0.1378285 0.4099709 0.1007987
Bagged Trees - baggedTreeModel
baggedTreeModel_20 <- varImp(baggedTreeModel)
baggedTreeModel_20 <- baggedTreeModel_20$importance %>%
arrange(desc(Overall))
baggedTreeModel_20 <- head(baggedTreeModel_20,20)
baggedTreeModel_20## Overall
## PC.Volume 100.00000
## Carb.Volume 99.05002
## Fill.Ounces 97.22865
## Carb.Pressure 88.91747
## Carb.Temp 86.38320
## PSC 66.17968
## PSC.Fill 58.72903
## PSC.CO2 53.45121
## Mnf.Flow 48.29440
## Carb.Pressure1 43.83475
## Fill.Pressure 39.03129
## Hyd.Pressure1 34.68856
## Hyd.Pressure4 29.93678
## Filler.Level 29.90330
## Hyd.Pressure2 26.59632
## Filler.Speed 24.19008
## Temperature 23.56332
## Usage.cont 20.99806
## Pressure.Vacuum 19.96054
## Carb.Flow 19.44649
baggedTreeModel_20_df<- convert_top_20_to_df(baggedTreeModel_20)
baggedTreeModel_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "green") +
theme_minimal() +
coord_flip() +
labs(title="baggedTreeModel Predictor Variable Importance",
y="baggedTreeModel Importance", x="Predictors") +
scale_y_continuous()baggedTreeModelPred <- predict(baggedTreeModel, newdata=x_test)
baggedTreeModelPred <- postResample(pred = baggedTreeModelPred, obs = y_test$PH)
baggedTreeModel## Bagged CART
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1096036 0.5883611 0.08224526
Random Forest - rfModel
rfModel_20 <- varImp(rfModel)
rfModel_20 <- rfModel_20 %>%
arrange(desc(Overall))
rfModel_20 <- head(rfModel_20,20)
rfModel_20## Overall
## Mnf.Flow 59.50962
## Brand.Code.C 52.84617
## Usage.cont 50.35397
## Temperature 42.18921
## Oxygen.Filler 41.61435
## Filler.Level 39.43007
## Pressure.Vacuum 35.68012
## Density 31.96549
## Carb.Pressure1 30.94438
## Carb.Flow 30.03117
## Carb.Volume 26.03864
## Filler.Speed 25.92427
## Pressure.Setpoint 24.98219
## Brand.Code.D 24.19307
## Fill.Pressure 24.08583
## Hyd.Pressure2 23.82041
## Hyd.Pressure1 22.38196
## Hyd.Pressure4 21.53386
## MFR 19.60721
## PC.Volume 18.49995
rfModel_20_df<- convert_top_20_to_df(rfModel_20)
rfModel_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "purple") +
theme_minimal() +
coord_flip() +
labs(title="rfModel Predictor Variable Importance",
y="rfModel Importance", x="Predictors") +
scale_y_continuous()rfModelPred <- predict(rfModel, newdata=x_test)
rfModelPred <- postResample(pred = rfModelPred, obs = y_test$PH)
rfModelPred## RMSE Rsquared MAE
## 0.11829053 0.57283468 0.08565728
Gradient Boost Model - gbmModel
gbmModel_20 <- varImp(gbmModel)
gbmModel_20 <- gbmModel_20$importance %>%
arrange(desc(Overall))
gbmModel_20 <- head(gbmModel_20,20)
gbmModel_20## Overall
## Mnf.Flow 100.000000
## Brand.Code.C 36.502244
## Temperature 32.283014
## Usage.cont 31.985841
## Pressure.Vacuum 27.699895
## Oxygen.Filler 26.672509
## Density 23.997295
## Filler.Level 21.153842
## Carb.Pressure1 21.010238
## Filler.Speed 10.789237
## Fill.Pressure 10.139399
## Brand.Code.D 8.435451
## Pressure.Setpoint 8.378807
## Carb.Flow 8.073840
## PC.Volume 7.860987
## Carb.Volume 7.019979
## MFR 6.695749
## Hyd.Pressure4 6.241750
## Hyd.Pressure1 6.102853
## PSC.Fill 5.900576
gbmModel_20_df<- convert_top_20_to_df(gbmModel_20)
gbmModel_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "gold") +
theme_minimal() +
coord_flip() +
labs(title="gbmModel Predictor Variable Importance",
y="gbmModel Importance", x="Predictors") +
scale_y_continuous()gbmModelPred <- predict(gbmModel, newdata=x_test)
gbmModelPred<- postResample(pred = gbmModelPred, obs = y_test$PH)
gbmModelPred## RMSE Rsquared MAE
## 0.12127520 0.54094938 0.08848211
Cubist Model - cubistModel
cubistModel_20 <- varImp(cubistModel)
cubistModel_20 <- cubistModel_20$importance %>%
arrange(desc(Overall))
cubistModel_20 <- head(cubistModel_20,20)
cubistModel_20## Overall
## Mnf.Flow 100.00000
## Density 66.66667
## Filler.Level 57.14286
## Temperature 54.42177
## Pressure.Vacuum 53.74150
## Usage.cont 50.34014
## Carb.Pressure1 46.93878
## Oxygen.Filler 44.89796
## Brand.Code.C 43.53741
## Hyd.Pressure2 40.13605
## Hyd.Pressure1 34.69388
## Carb.Flow 33.33333
## Pressure.Setpoint 33.33333
## Carb.Volume 27.89116
## Brand.Code.D 27.21088
## Fill.Pressure 26.53061
## Filler.Speed 21.08844
## Carb.Pressure 21.08844
## Carb.Temp 20.40816
## MFR 18.36735
cubistModel_20_df<- convert_top_20_to_df(cubistModel_20)
cubistVisualMostImportant <- cubistModel_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "pink") +
theme_minimal() +
coord_flip() +
labs(title="cubistModel Predictor Variable Importance",
y="cubistModel Importance", x="Predictors") +
scale_y_continuous()
cubistVisualMostImportantcubistModelPred <- predict(cubistModel, newdata=x_test)
cubistModelPred<- postResample(pred = cubistModelPred, obs = y_test$PH)
cubistModelPred## RMSE Rsquared MAE
## 0.11438785 0.58856866 0.08000401
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.0444967 -0.00255656 -0.01533572 -0.0009690082 -0.01388185 -0.02476432
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.02733253 -0.06383343 -0.01796363 0.0211377 -0.07885156 -0.02117198
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08041816 -0.004648421 0.08756702 0.06767177 -0.03681121 0.04315959
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler Pressure.Setpoint
## 0.1000897 -0.103669 0.06397241 0.004087661 -0.05229242 -0.1130223
From our experimentation with 12 different models, we saw that the Cubist model had the lowest RMSE (0.10976) value as well as the lowest MAE value (0.081). It also had the highest Rsquared value (0.601).
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| baggedTree Model | 0.109603613797648 | 0.588361134239771 | 0.0822452644644274 |
| Cubist Model | 0.114387852528189 | 0.588568657381733 | 0.0800040130615234 |
| Random Forest Model | 0.118290525448583 | 0.572834681831954 | 0.0856572761904752 |
| Gradient Boost Model | 0.121275198205224 | 0.540949375505972 | 0.0884821070145594 |
| KNN | 0.129025562076891 | 0.427445828767913 | 0.0997675321646709 |
| Multivariate Adaptive Regression Spline | 0.132021047955125 | 0.452244508754447 | 0.0983418268641931 |
| Ridge Regression | 0.13311419911386 | 0.387678566979895 | 0.104853653162224 |
| Linear Model | 0.133265504611133 | 0.385260278178317 | 0.104821844179685 |
| Partial Least Square | 0.133289559913808 | 0.385369348582959 | 0.10509987947128 |
| cTree Model | 0.137828544450021 | 0.409970886592315 | 0.100798661525504 |
| Support Vector Machines - Linear | 0.14229381432959 | 0.366072053176909 | 0.105727076535242 |
| Neural Network | 7.54631311028383 | NA | 7.54420779220779 |
We will use the Cubist model against the Student evaluation data and make predictions of the PH variable.
First, as we did with the Student train data, we have to convert the Brand.Code categorical value in the Student evaluation data to Dummy variables.
## Dummy Variable Object
##
## Formula: ~Brand.Code
## 1 variables, 0 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
dummies2 <- as.data.frame(predict(mod, predictors_evaluate))
predictors_evaluate2 <- subset(predictors_evaluate, select = -c(Brand.Code))
predictors_evaluate2 <- cbind(dummies2,predictors_evaluate)cubistPred <- round(predict(cubistModel, newdata=predictors_evaluate2),2)
head_predictions <- head(cubistPred,10)| x |
|---|
| 8.85 |
| 8.84 |
| 8.85 |
| 9.00 |
| 8.84 |
| 8.85 |
| 8.84 |
| 8.99 |
| 8.84 |
| 9.04 |
The data science team found that the Cubist model is the best for predicting the PH value. The most important predictors from this model are shown in the visualization below. The top five predictors are Mnf.Flow, Density, Temperature, Pressure.Vacuum, and Filler Level. Two discrete categorical factors, Brand Codes C and D, are also in the most important predictors.
We have exported the predicted PH values in the attached excel file.
Note: Uncomment out the code below and update the path to make sure that the data exports to your local path.