The data science team of Salma Elshahawy, John K. Hancock, and Farhana Zahir has prepared the following technical report to help ABC understand its manufacturing process and the factors that predict PH. The report focuses on predicting the value of PH.
The report consists of the following:
PART 1: THE DATASETS
PART 2: DATA PREPARATION
PART 3: EXPERIMENTATION
PART 4: EVALUATE MODELS
PART 5: USE THE BEST MODEL TO FORECAST PH
PART 6: CONCLUSIONS
In this section, we did the following:
Import the Datasets
Evaluate the Dataset
Devise a Data Preparation Strategy
The Excel files, StudentData.xlsx and StudentEvaluation.xlsx, are hosted on the team’s GitHub page. Here, they are downloaded to temporary files and read into the dataframes beverage_training_data and beverage_test_data.
library(httr)  # provides GET(), authenticate(), and write_disk()
# Temporary local files that will hold the downloaded workbooks
temp_train_file <- tempfile(pattern="StudentData", fileext = ".xlsx")
temp_eval_file <- tempfile(pattern="StudentEvaluation", fileext = ".xlsx")
# Raw-file URLs in the team's GitHub repository
student_train <- "https://github.com/JohnKHancock/CUNY_DATA624_Project2/blob/main/raw/StudentData.xlsx?raw=true"
student_eval <- "https://github.com/JohnKHancock/CUNY_DATA624_Project2/blob/main/raw/StudentEvaluation.xlsx?raw=true"
# Download each workbook to its temporary file
student_data <- GET(student_train,
  authenticate(Sys.getenv("GITHUB_PAT"), ""),
  write_disk(path = temp_train_file))
student_eval <- GET(student_eval,
  authenticate(Sys.getenv("GITHUB_PAT"), ""),
  write_disk(path = temp_eval_file))
After importing the Beverage Training dataset, we see that there are 2,571 observations consisting of 32 predictor variables and one dependent variable, PH. “Brand Code” is a character variable that will need to be handled, and several observations contain NAs.
The Beverage Testing dataset has 267 observations, the same 32 predictors, and the dependent variable PH, which is all NAs. These are the values that we will have to predict. As in the training set, “Brand Code” is a character variable that will need to be handled, and several observations contain NAs.
Beverage Training Data
## [1] 2571 33
## [1] "character"
Beverage Testing Data
After analyzing the data, we devised the following steps to prepare the data for analysis:
A. Isolate predictors from the dependent variable
B. Correct the Predictor Names
C. Create a data frame of numeric values only
D. Identify and Impute Missing Data
E. Identify and Address Outliers
F. Check for and remove correlated predictors
G. Identify Near Zero Variance Predictors
H. Impute missing values and Create dummy variables for Brand.Code
I. Impute missing data for Dependent Variable PH
For the training set, remove the dependent variable, PH, and store it in the variable y_train.
Replace the spaces in the predictor names using the make.names function, since spaces in column names can be problematic in formulas and column references. This was applied to both datasets.
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "Oxygen Filler" "Bowl Setpoint"
## [28] "Pressure Setpoint" "Air Pressurer" "Alch Rel"
## [31] "Carb Rel" "Balling Lvl"
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
## [28] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
## [31] "Carb.Rel" "Balling.Lvl"
colnames(predictors_evaluate)<- make.names(colnames(predictors_evaluate))
colnames(predictors_evaluate)
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
## [28] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
## [31] "Carb.Rel" "Balling.Lvl"
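As a small illustration of the renaming step above, the following sketch applies make.names to a handful of the raw column names. The vector raw_names is a hypothetical subset chosen for the example, not the full set of columns.

```r
# Hypothetical subset of the raw column names, for illustration only
raw_names <- c("Brand Code", "Carb Volume", "PSC CO2", "MFR")

# make.names() replaces characters that are invalid in R names (here, spaces) with "."
clean_names <- make.names(raw_names)
print(clean_names)
```

Names that are already syntactically valid, such as MFR, are left unchanged.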
We saw earlier that Brand.Code is a categorical variable, so we subset the dataframe to remove it for now. We will handle this variable later.
The predictor MFR has the most missing values at 212. We used knn imputation to fill the missing values in the numeric predictors. Brand.Code still has missing values after this step; they are handled in a later section.
Training Data
| Predictors | NAs |
|---|---|
| MFR | 212 |
| Filler.Speed | 57 |
| PC.Volume | 39 |
| PSC.CO2 | 39 |
| Fill.Ounces | 38 |
| PSC | 33 |
missingData %>%
ggplot() +
geom_bar(aes(x=reorder(Predictors,NAs), y=NAs, fill=factor(NAs)), stat = 'identity') +
labs(x='Predictor', y="NAs", title='Number of missing values') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + coord_flip()
# Recount the missing values after imputation
missingData <- as.data.frame(colSums(is.na(predictors_imputed)))
colnames(missingData) <- c("NAs")
missingData <- cbind(Predictors = rownames(missingData), missingData)
rownames(missingData) <- 1:nrow(missingData)
missingData <- missingData[missingData$NAs != 0,] %>%
arrange(desc(NAs))
head(missingData)
## [1] Predictors NAs
## <0 rows> (or 0-length row.names)
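To make the idea behind knn imputation concrete, here is a simplified base R sketch. It is not the implementation used by the imputation function in this report (which may, for example, also center and scale the data first). The data frame toy and the helper knn_impute_col are hypothetical names introduced only for this illustration.

```r
# Toy data with one missing value in column b (hypothetical values)
toy <- data.frame(a = c(1, 2, 3, 10), b = c(10, 20, NA, 100))

# Fill missing values in `target` with the mean of the k nearest complete rows,
# where "nearest" is Euclidean distance on the remaining columns.
knn_impute_col <- function(df, target, k = 2) {
  others <- setdiff(names(df), target)
  missing_rows <- which(is.na(df[[target]]))
  donors <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Differences between every donor row and row i on the other columns
    diffs <- sweep(as.matrix(df[donors, others, drop = FALSE]), 2,
                   unlist(df[i, others, drop = FALSE]))
    d <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    df[i, target] <- mean(df[[target]][nearest])
  }
  df
}

imputed <- knn_impute_col(toy, "b", k = 2)
print(imputed)
```

In this toy case the row with a = 3 is closest to the rows with a = 2 and a = 1, so its missing b is filled with the mean of 20 and 10.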
Next we looked at the distributions of the numeric variables. Only four predictors appear approximately normally distributed, and the box plots show a high number of outliers in the data. To reduce the influence of these outliers, we centered and scaled the predictors and then applied the spatial sign transformation.
par(mfrow = c(3, 3))
datasub = melt(predictors_imputed)
suppressWarnings(ggplot(datasub, aes(x= value)) +
geom_density(fill='orange') + facet_wrap(~variable, scales = 'free') )
ggplot(data = datasub , aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=8,aes(fill=variable)) +
coord_flip() + theme(legend.position = "none")
preprocessing <- preProcess(as.data.frame(predictors_imputed), method = c("center", "scale"))
preprocessing
## Created from 2571 samples and 31 variables
##
## Pre-processing:
## - centered (31)
## - ignored (0)
## - scaled (31)
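The centering, scaling, and spatial sign steps used in this section can be sketched in base R as follows. This is a minimal illustration on a hypothetical matrix toy, not the code that produced the results in this report.

```r
# Toy numeric data with one extreme value in the first column (hypothetical values)
toy <- matrix(c(1, 2, 3, 100,
                10, 20, 30, 40), ncol = 2)

# Center each column to mean 0 and scale it to standard deviation 1
scaled <- scale(toy)

# Spatial sign: divide every row by its Euclidean norm, which projects each
# observation onto the unit sphere and limits the pull of extreme values
row_norms <- sqrt(rowSums(scaled^2))
spatial <- scaled / row_norms

print(round(spatial, 3))
```

After the transformation every row has unit length, so no single observation can sit far from the rest of the data.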
num_predictors_02 <- spatialSign(num_predictors_01)
num_predictors_02 <- as.data.frame(num_predictors_02)
par(mfrow = c(3, 3))
datasub = melt(num_predictors_02)
suppressWarnings(ggplot(datasub, aes(x= value)) +
geom_density(fill='blue') + facet_wrap(~variable, scales = 'free') )
ggplot(data = datasub , aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=8,aes(fill=variable)) +
coord_flip() + theme(legend.position = "none")
Identify and remove near-zero-variance predictors. The check found none.
## character(0)
Earlier, we saw that there are 120 missing values for Brand.Code, a factor variable. The imputation strategy here is to impute with the most frequent value, “B”. After imputation, Brand.Code was converted to dummy variables. The converted Brand.Code predictor is joined to the num_predictors_02.
## [1] 120
## [1] "A" "B" "C" "D"
##
## A B C D
## 293 1239 304 615
## factor(0)
## Levels: A B C D
## Dummy Variable Object
##
## Formula: ~Brand.Code
## 1 variables, 1 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
| Brand.Code.A | Brand.Code.B | Brand.Code.C | Brand.Code.D |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
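The two Brand.Code steps described above (mode imputation, then dummy encoding) can be sketched in base R. The vector brand below is hypothetical, and model.matrix is used here as a stand-in for the dummy-variable function whose output is shown above.

```r
# Hypothetical brand codes with missing entries
brand <- c("A", "B", NA, "B", "C", NA, "D", "B")

# Most frequent observed level (table() ignores NA by default)
mode_level <- names(which.max(table(brand)))
brand[is.na(brand)] <- mode_level

# One indicator column per level (a less-than-full-rank encoding)
brand <- factor(brand, levels = c("A", "B", "C", "D"))
dummies <- model.matrix(~ brand - 1)
colnames(dummies) <- paste0("Brand.Code.", levels(brand))

print(head(dummies))
```

Because all four indicator columns are kept, they always sum to one per row. This is why a linear model with an intercept cannot estimate a coefficient for one of them.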
The final step is to impute missing values for the dependent variable, PH, with the median for PH.
missingData <- as.data.frame(colSums(is.na(processed.train)))
colnames(missingData) <- c("NAs")
missingData <- cbind(Predictors = rownames(missingData), missingData)
rownames(missingData) <- 1:nrow(missingData)
missingData <- missingData[missingData$NAs != 0,] %>%
arrange(desc(NAs))
head(missingData)
## [1] Predictors NAs
## <0 rows> (or 0-length row.names)
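As a brief illustration of the median imputation applied to PH, the sketch below fills missing entries of a hypothetical vector ph with the median of the observed values.

```r
# Hypothetical PH readings with two missing values
ph <- c(8.36, NA, 8.94, 8.24, NA, 8.26)

# Replace each NA with the median of the observed readings
ph_median <- median(ph, na.rm = TRUE)
ph[is.na(ph)] <- ph_median

print(ph)
```

The median is less sensitive than the mean to unusually high or low readings, which makes it a reasonable default when only a few values are missing.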
Split the Data
Before we begin with the experimentation, we split the training data into train (70%) and test (30%) sets, stratified on PH.
evaluation.split <- initial_split(processed.train, prop = 0.7, strata = "PH")
train <- training(evaluation.split)
test <- testing(evaluation.split)
Modeling
We examined 12 models spanning linear models, nonlinear regression models, and tree-based models. For every model except the bagged tree model, Mnf.Flow was the most important predictor. Other consistently important predictors include the Brand.Code dummy variables for brands C and D. Residuals for each model appear random with no discernible patterns. In Part 4, we evaluate the metrics from each model.
set.seed(100)
x_train <- train[, 2:29]
y_train <- as.data.frame(train$PH)
colnames(y_train) <- c("PH")
x_test <- test[, 2:29]
y_test <- as.data.frame(test$PH)
colnames(y_test) <- c("PH")
ctrl <- trainControl(method = "cv", number = 10)
Basic linear model
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50289 -0.07797 0.01076 0.08598 0.40461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.627224 0.014264 604.827 < 2e-16 ***
## Brand.Code.A -0.084278 0.013618 -6.189 7.52e-10 ***
## Brand.Code.B -0.097312 0.021305 -4.568 5.28e-06 ***
## Brand.Code.C -0.214816 0.023129 -9.288 < 2e-16 ***
## Brand.Code.D NA NA NA NA
## Carb.Volume -0.027336 0.050197 -0.545 0.586112
## Fill.Ounces -0.045655 0.018421 -2.478 0.013288 *
## PC.Volume -0.040475 0.021897 -1.848 0.064714 .
## Carb.Pressure 0.015385 0.075131 0.205 0.837768
## Carb.Temp 0.037876 0.068645 0.552 0.581176
## PSC -0.013387 0.018070 -0.741 0.458875
## PSC.Fill -0.018989 0.018074 -1.051 0.293569
## PSC.CO2 -0.014464 0.018446 -0.784 0.433067
## Mnf.Flow -0.377757 0.032001 -11.805 < 2e-16 ***
## Carb.Pressure1 0.158639 0.021034 7.542 7.35e-14 ***
## Fill.Pressure 0.078149 0.027974 2.794 0.005268 **
## Hyd.Pressure1 0.017788 0.028596 0.622 0.533995
## Hyd.Pressure2 0.087998 0.036617 2.403 0.016354 *
## Hyd.Pressure4 0.016160 0.027204 0.594 0.552564
## Filler.Level 0.163086 0.025843 6.311 3.50e-10 ***
## Filler.Speed 0.006547 0.047481 0.138 0.890343
## Temperature -0.118405 0.022533 -5.255 1.66e-07 ***
## Usage.cont -0.117676 0.021170 -5.559 3.13e-08 ***
## Carb.Flow 0.075019 0.024442 3.069 0.002178 **
## Density -0.189315 0.048954 -3.867 0.000114 ***
## MFR -0.019341 0.046950 -0.412 0.680434
## Pressure.Vacuum -0.024946 0.023264 -1.072 0.283727
## Oxygen.Filler -0.087742 0.023552 -3.725 0.000201 ***
## Pressure.Setpoint -0.092490 0.026740 -3.459 0.000555 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1331 on 1773 degrees of freedom
## Multiple R-squared: 0.3996, Adjusted R-squared: 0.3905
## F-statistic: 43.71 on 27 and 1773 DF, p-value: < 2.2e-16
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.1341192 0.3840371 0.1050857 0.007437498 0.04013758 0.005812337
## lm variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## Mnf.Flow 100.000
## Brand.Code.C 78.428
## Carb.Pressure1 63.463
## Filler.Level 52.908
## Brand.Code.A 51.864
## Usage.cont 46.464
## Temperature 43.859
## Brand.Code.B 37.968
## Density 31.965
## Oxygen.Filler 30.751
## Pressure.Setpoint 28.466
## Carb.Flow 25.126
## Fill.Pressure 22.763
## Fill.Ounces 20.062
## Hyd.Pressure2 19.417
## PC.Volume 14.661
## Pressure.Vacuum 8.009
## PSC.Fill 7.823
## PSC.CO2 5.539
## PSC 5.168
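The RMSE, Rsquared, and MAE columns reported for each model can be reproduced by hand. The sketch below uses hypothetical observed and predicted PH values, and it assumes that Rsquared is computed as the squared correlation between observed and predicted values, which to our understanding is the default convention in caret.

```r
# Hypothetical observed and predicted PH values
obs  <- c(8.4, 8.6, 8.5, 8.7)
pred <- c(8.5, 8.5, 8.6, 8.6)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- mean(abs(obs - pred))       # mean absolute error
rsq  <- cor(obs, pred)^2            # squared correlation (assumed convention)

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

RMSE and MAE are on the same scale as PH, so a value near 0.13 means the typical prediction error is a little over a tenth of a PH unit.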
Partial Least Squares or PLS
set.seed(100)
plsFit1 <- train(x_train, y_train$PH,
method = "pls",
tuneLength = 25,
trControl = ctrl)
## Data: X dimension: 1801 28
## Y dimension: 1801 1
## Fit method: oscorespls
## Number of components considered: 11
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 12.48 32.90 46.73 54.34 59.16 65.70 69.89
## .outcome 29.79 32.31 35.64 38.01 39.02 39.35 39.63
## 8 comps 9 comps 10 comps 11 comps
## X 72.38 74.4 76.32 79.12
## .outcome 39.82 39.9 39.95 39.96
## ncomp
## 11 11
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 8 0.1341466 0.3836032 0.1053703 0.007069507 0.03865624 0.005592642
## pls variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.00
## Brand.Code.C 88.42
## Brand.Code.D 74.32
## Filler.Level 71.42
## Usage.cont 63.67
## Pressure.Setpoint 59.66
## Brand.Code.B 46.25
## Fill.Pressure 45.66
## Hyd.Pressure2 41.42
## Pressure.Vacuum 39.57
## Carb.Flow 33.69
## Temperature 32.84
## Oxygen.Filler 31.18
## Brand.Code.A 30.20
## Carb.Pressure1 27.29
## Hyd.Pressure4 21.54
## PSC 21.21
## Fill.Ounces 20.34
## Hyd.Pressure1 15.99
## Density 13.39
Ridge Regression
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
ridgeRegFit <- train(x_train, y_train$PH,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl)
## Length Class Mode
## call 4 -none- call
## actions 29 -none- list
## allset 28 -none- numeric
## beta.pure 812 -none- numeric
## vn 28 -none- character
## mu 1 -none- numeric
## normx 28 -none- numeric
## meanx 28 -none- numeric
## lambda 1 -none- numeric
## L1norm 29 -none- numeric
## penalty 29 -none- numeric
## df 29 -none- numeric
## Cp 29 -none- numeric
## sigma2 1 -none- numeric
## xNames 28 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
## lambda
## 3 0.01428571
## lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3 0.01428571 0.1345357 0.3766957 0.1055482 0.007871785 0.05880301 0.005189457
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.742
## Usage.cont 66.059
## Pressure.Setpoint 56.947
## Fill.Pressure 46.135
## Brand.Code.C 35.539
## Oxygen.Filler 32.742
## Hyd.Pressure1 31.840
## Pressure.Vacuum 29.987
## Hyd.Pressure2 28.639
## Carb.Flow 26.675
## Temperature 20.496
## Density 18.039
## Carb.Pressure1 15.249
## Brand.Code.D 12.567
## Hyd.Pressure4 8.053
## MFR 4.863
## PSC.Fill 4.778
## Fill.Ounces 4.747
## PSC 4.534
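To show what the ridge penalty does, here is a sketch of the textbook closed-form ridge estimate on simulated data. It is only an illustration: the data are simulated, ridge_coef is a hypothetical helper, and the package behind the "ridge" method may parameterize lambda differently from this formula.

```r
set.seed(1)

# Simulated, standardized predictors and a centered response (hypothetical data)
X <- scale(matrix(rnorm(200), ncol = 4))
y <- as.numeric(X %*% c(1, -2, 0.5, 0)) + rnorm(50, sd = 0.1)
y <- y - mean(y)

# Textbook ridge estimate: solve (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, lambda = 0)   # lambda = 0 is ordinary least squares
b_ridge <- ridge_coef(X, y, lambda = 10)  # a positive lambda shrinks the coefficients

print(cbind(ols = as.numeric(b_ols), ridge = as.numeric(b_ridge)))
```

A positive lambda always pulls the coefficient vector toward zero, which trades a little bias for lower variance when predictors are correlated.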
KNN
knnModel <- train(x = x_train, y = y_train$PH,
method = "knn",
tuneLength = 25,
trControl = ctrl)
knnModel
## k-Nearest Neighbors
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1620, 1621, 1622, 1621, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1276587 0.4439380 0.09763729
## 7 0.1272656 0.4438134 0.09823129
## 9 0.1272041 0.4432306 0.09897602
## 11 0.1280010 0.4368874 0.10013307
## 13 0.1287157 0.4307457 0.10098574
## 15 0.1293585 0.4255093 0.10171256
## 17 0.1294558 0.4256276 0.10188982
## 19 0.1300873 0.4196128 0.10257053
## 21 0.1304567 0.4176642 0.10312453
## 23 0.1308364 0.4149229 0.10359010
## 25 0.1312923 0.4108721 0.10382238
## 27 0.1318122 0.4065335 0.10419869
## 29 0.1324342 0.4007120 0.10474214
## 31 0.1325288 0.4002344 0.10482740
## 33 0.1329340 0.3969451 0.10527922
## 35 0.1334962 0.3915696 0.10577731
## 37 0.1337796 0.3887952 0.10604974
## 39 0.1341816 0.3848644 0.10626547
## 41 0.1344483 0.3827242 0.10642978
## 43 0.1345961 0.3813827 0.10653220
## 45 0.1348650 0.3790528 0.10681576
## 47 0.1353028 0.3750859 0.10723104
## 49 0.1355825 0.3725323 0.10751230
## 51 0.1358745 0.3699478 0.10768527
## 53 0.1360560 0.3683863 0.10784690
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
knnPred <- predict(knnModel, newdata = x_test)
knn_res <- postResample(pred = knnPred, obs = y_test$PH)
knn_res
## RMSE Rsquared MAE
## 0.1337444 0.4297242 0.1008811
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.742
## Usage.cont 66.059
## Pressure.Setpoint 56.947
## Fill.Pressure 46.135
## Brand.Code.C 35.539
## Oxygen.Filler 32.742
## Hyd.Pressure1 31.840
## Pressure.Vacuum 29.987
## Hyd.Pressure2 28.639
## Carb.Flow 26.675
## Temperature 20.496
## Density 18.039
## Carb.Pressure1 15.249
## Brand.Code.D 12.567
## Hyd.Pressure4 8.053
## MFR 4.863
## PSC.Fill 4.778
## Fill.Ounces 4.747
## PSC 4.534
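The prediction rule behind KNN regression is simple enough to sketch directly: find the k closest training rows and average their responses. The helper knn_predict and the toy values below are hypothetical and are not the implementation used to produce the results above.

```r
# Predict each new row as the mean response of its k nearest training rows
knn_predict <- function(x_train, y_train, x_new, k) {
  apply(x_new, 1, function(row) {
    d <- sqrt(rowSums(sweep(x_train, 2, row)^2))  # Euclidean distances
    mean(y_train[order(d)[seq_len(k)]])
  })
}

# Toy training data with a single predictor (hypothetical values)
x_tr <- matrix(c(0, 1, 2, 10), ncol = 1)
y_tr <- c(8.0, 8.2, 8.4, 9.0)
x_new <- matrix(1.2, ncol = 1)

pred <- knn_predict(x_tr, y_tr, x_new, k = 3)
print(pred)
```

Because the method relies on distances, predictors on larger scales would dominate unless the data are centered and scaled first, as was done in Part 2.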
Neural Network
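nnetGrid <- expand.grid(.decay = c(0, .01, 1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
# Note: linout = FALSE uses a logistic output unit bounded in (0, 1), so predictions
# cannot reach PH values near 8.5; linout = TRUE is the usual choice for regression.
# MaxNWts as written only allows up to 5 hidden units, so sizes 6 to 10 fail (NaN rows).
nnetTune <- train(x = x_train,
y = y_train$PH,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
linout = FALSE, trace = FALSE,
MaxNWts = 5* (ncol(x_train) + 1) + 5 + 1,
maxit = 250)
## Model Averaged Neural Network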
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 7.545910 NaN 7.543986
## 0.00 2 7.545910 NaN 7.543986
## 0.00 3 7.545910 NaN 7.543986
## 0.00 4 7.545910 NaN 7.543986
## 0.00 5 7.545910 NaN 7.543986
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 7.545915 0.04495528 7.543991
## 0.01 2 7.545914 0.05111649 7.543990
## 0.01 3 7.545913 0.04902320 7.543989
## 0.01 4 7.545912 0.05115683 7.543989
## 0.01 5 7.545912 0.05224777 7.543988
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 1.00 1 7.546270 0.04373027 7.544346
## 1.00 2 7.546187 0.04283044 7.544263
## 1.00 3 7.546144 0.04257038 7.544221
## 1.00 4 7.546118 0.04282537 7.544194
## 1.00 5 7.546100 0.04167386 7.544176
## 1.00 6 NaN NaN NaN
## 1.00 7 NaN NaN NaN
## 1.00 8 NaN NaN NaN
## 1.00 9 NaN NaN NaN
## 1.00 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0 and bag = FALSE.
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 28 -none- character
## terms 3 terms call
## coefnames 28 -none- character
## xlevels 0 -none- list
## xNames 28 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 4 -none- list
## size decay bag
## 1 1 0 FALSE
nnetPred <- predict(nnetTune, newdata=x_test)
NNET <- postResample(pred = nnetPred, obs = y_test$PH)
NNET
## RMSE Rsquared MAE
## 7.551577 NA 7.549506
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.04813177 -0.003197337 -0.01273667 -0.0007745981 -0.005304253 -0.02933575
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.03032371 -0.06544576 0.05637876 0.02422739 -0.07935299 -0.01488397
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08985365 -0.004717008 0.08813245 0.0678782 -0.04144125 0.05617176
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler
## 0.09978965 -0.1027022 0.06435543 -0.04451928 -0.0531649
## Pressure.Setpoint
## -0.1146138
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.742
## Usage.cont 66.059
## Pressure.Setpoint 56.947
## Fill.Pressure 46.135
## Brand.Code.C 35.539
## Oxygen.Filler 32.742
## Hyd.Pressure1 31.840
## Pressure.Vacuum 29.987
## Hyd.Pressure2 28.639
## Carb.Flow 26.675
## Temperature 20.496
## Density 18.039
## Carb.Pressure1 15.249
## Brand.Code.D 12.567
## Hyd.Pressure4 8.053
## MFR 4.863
## PSC.Fill 4.778
## Fill.Ounces 4.747
## PSC 4.534
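A likely explanation for the NaN rows at sizes 6 through 10 is the weight limit passed as MaxNWts. The sketch below counts the weights of a single-hidden-layer network with 28 inputs and one output, and compares them with the limit used in the tuning call. The helper n_weights is a hypothetical name introduced for this illustration.

```r
p <- 28  # number of predictors in x_train

# Weights in a single-hidden-layer network with one output:
# size * (p + 1) for the hidden layer (inputs plus bias),
# plus size + 1 for the output unit (hidden units plus bias)
n_weights <- function(size, p) size * (p + 1) + size + 1

max_nwts <- 5 * (p + 1) + 5 + 1  # the limit used in the tuning call
sizes <- 1:10
fits <- n_weights(sizes, p) <= max_nwts

print(data.frame(size = sizes, weights = n_weights(sizes, p), within_limit = fits))
```

Only sizes 1 through 5 stay within the limit, which matches the rows that produced numeric results. Separately, the RMSE of about 7.5 for the fitted sizes is consistent with a logistic output unit that cannot exceed 1 while PH sits near 8.5.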
Multivariate Adaptive Regression Splines (MARS)
set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x = x_train, y = y_train$PH,
method = "earth",
tuneGrid = marsGrid,
trControl = ctrl)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1519885 0.2069925 0.12117213
## 1 3 0.1477618 0.2515970 0.11726343
## 1 4 0.1424258 0.3043244 0.11101682
## 1 5 0.1417822 0.3108149 0.11022990
## 1 6 0.1412256 0.3161642 0.10995544
## 1 7 0.1389010 0.3388722 0.10819473
## 1 8 0.1376190 0.3502356 0.10703124
## 1 9 0.1360027 0.3651705 0.10595323
## 1 10 0.1353731 0.3710853 0.10461416
## 1 11 0.1338847 0.3850378 0.10321531
## 1 12 0.1336934 0.3873287 0.10249648
## 1 13 0.1321820 0.4013358 0.10166137
## 1 14 0.1313438 0.4098211 0.10077596
## 1 15 0.1311334 0.4114799 0.10026809
## 1 16 0.1303441 0.4183990 0.09940438
## 1 17 0.1301287 0.4208661 0.09917623
## 1 18 0.1301095 0.4211559 0.09911392
## 1 19 0.1305966 0.4171911 0.09947776
## 1 20 0.1303873 0.4188426 0.09923815
## 1 21 0.1303715 0.4190653 0.09917720
## 1 22 0.1302760 0.4199451 0.09915404
## 1 23 0.1301411 0.4213366 0.09918895
## 1 24 0.1302279 0.4210868 0.09909730
## 1 25 0.1303454 0.4204012 0.09918533
## 1 26 0.1301412 0.4220622 0.09920596
## 1 27 0.1303435 0.4203209 0.09928267
## 1 28 0.1303182 0.4207295 0.09922077
## 1 29 0.1304451 0.4197610 0.09939381
## 1 30 0.1303191 0.4208290 0.09936981
## 1 31 0.1304550 0.4196858 0.09948421
## 1 32 0.1304538 0.4198923 0.09947104
## 1 33 0.1304889 0.4199242 0.09954173
## 1 34 0.1304756 0.4199250 0.09944804
## 1 35 0.1304902 0.4198425 0.09945640
## 1 36 0.1304902 0.4198425 0.09945640
## 1 37 0.1304902 0.4198425 0.09945640
## 1 38 0.1304902 0.4198425 0.09945640
## 2 2 0.1497988 0.2299110 0.11761859
## 2 3 0.1447098 0.2827332 0.11380271
## 2 4 0.1421922 0.3069325 0.11123244
## 2 5 0.1390474 0.3368872 0.11036701
## 2 6 0.1373709 0.3533328 0.10863264
## 2 7 0.1352897 0.3729062 0.10633755
## 2 8 0.1337972 0.3863367 0.10480632
## 2 9 0.1332891 0.3922441 0.10361520
## 2 10 0.1321610 0.4022180 0.10269268
## 2 11 0.1310175 0.4131337 0.10164173
## 2 12 0.1295949 0.4246106 0.10002913
## 2 13 0.1284230 0.4355194 0.09917013
## 2 14 0.1280578 0.4390897 0.09886663
## 2 15 0.1276429 0.4432846 0.09868998
## 2 16 0.1273042 0.4462102 0.09830851
## 2 17 0.1270549 0.4485341 0.09819020
## 2 18 0.1263373 0.4547638 0.09760116
## 2 19 0.1265575 0.4539746 0.09745340
## 2 20 0.1265358 0.4549548 0.09728968
## 2 21 0.1268974 0.4523892 0.09773736
## 2 22 0.1269260 0.4526335 0.09784708
## 2 23 0.1270210 0.4520443 0.09785512
## 2 24 0.1270968 0.4520115 0.09763839
## 2 25 0.1271764 0.4520938 0.09757818
## 2 26 0.1272201 0.4521101 0.09772548
## 2 27 0.1270748 0.4535961 0.09744366
## 2 28 0.1272515 0.4520723 0.09749072
## 2 29 0.1270928 0.4533569 0.09737816
## 2 30 0.1271823 0.4527102 0.09748969
## 2 31 0.1273158 0.4517213 0.09764303
## 2 32 0.1272847 0.4519497 0.09755007
## 2 33 0.1274342 0.4508035 0.09766991
## 2 34 0.1274968 0.4503143 0.09771274
## 2 35 0.1275724 0.4496776 0.09775650
## 2 36 0.1275724 0.4496776 0.09775650
## 2 37 0.1275724 0.4496776 0.09775650
## 2 38 0.1275724 0.4496776 0.09775650
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 18 and degree = 2.
marsPred <- predict(marsTuned, newdata=x_test)
MARS <- postResample(pred = marsPred, obs = y_test$PH)
MARS
## RMSE Rsquared MAE
## 0.1351018 0.4170886 0.1032485
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.04813177 -0.003197337 -0.01273667 -0.0007745981 -0.005304253 -0.02933575
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.03032371 -0.06544576 0.05637876 0.02422739 -0.07935299 -0.01488397
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08985365 -0.004717008 0.08813245 0.0678782 -0.04144125 0.05617176
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler
## 0.09978965 -0.1027022 0.06435543 -0.04451928 -0.0531649
## Pressure.Setpoint
## -0.1146138
## earth variable importance
##
## Overall
## Mnf.Flow 100.00
## Brand.Code.C 71.17
## Filler.Level 58.17
## Carb.Flow 58.17
## Pressure.Vacuum 50.83
## Carb.Pressure1 50.83
## Brand.Code.A 40.31
## Temperature 34.89
## Hyd.Pressure1 33.37
## Density 27.25
## Brand.Code.D 23.81
## Usage.cont 20.59
## Pressure.Setpoint 0.00
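MARS builds its fit from hinge functions, and a degree of 2 allows products of two hinges. The sketch below shows the basic building block on a hypothetical predictor; the knot location is arbitrary and not taken from the fitted model.

```r
# A hinge is zero on one side of the knot and linear on the other
hinge <- function(x, knot) pmax(0, x - knot)
mirrored_hinge <- function(x, knot) pmax(0, knot - x)

x <- c(-2, -1, 0, 1, 2)  # hypothetical predictor values
right <- hinge(x, knot = 0)
left  <- mirrored_hinge(x, knot = 0)

print(data.frame(x = x, hinge = right, mirrored = left))
```

A weighted sum of such terms gives a piecewise-linear fit, and the nprune parameter controls how many terms are kept after pruning.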
Support Vector Machines (SVM)
set.seed(100)
svmLTuned <- train(x = x_train, y = y_train$PH,
method = "svmLinear",
tuneLength = 25,
trControl = trainControl(method = "cv"))
svmLTuned
## Support Vector Machines with Linear Kernel
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1357066 0.3743779 0.1042566
##
## Tuning parameter 'C' was held constant at a value of 1
svmLPred <- predict(svmLTuned, newdata=x_test)
svmL<- postResample(pred = svmLPred, obs = y_test$PH)
svmL
## RMSE Rsquared MAE
## 0.1416567 0.3594363 0.1054432
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.04813177 -0.003197337 -0.01273667 -0.0007745981 -0.005304253 -0.02933575
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.03032371 -0.06544576 0.05637876 0.02422739 -0.07935299 -0.01488397
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08985365 -0.004717008 0.08813245 0.0678782 -0.04144125 0.05617176
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler
## 0.09978965 -0.1027022 0.06435543 -0.04451928 -0.0531649
## Pressure.Setpoint
## -0.1146138
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 28)
##
## Overall
## Mnf.Flow 100.000
## Filler.Level 72.742
## Usage.cont 66.059
## Pressure.Setpoint 56.947
## Fill.Pressure 46.135
## Brand.Code.C 35.539
## Oxygen.Filler 32.742
## Hyd.Pressure1 31.840
## Pressure.Vacuum 29.987
## Hyd.Pressure2 28.639
## Carb.Flow 26.675
## Temperature 20.496
## Density 18.039
## Carb.Pressure1 15.249
## Brand.Code.D 12.567
## Hyd.Pressure4 8.053
## MFR 4.863
## PSC.Fill 4.778
## Fill.Ounces 4.747
## PSC 4.534
resamples <- resamples( list(CondInfTree =ctreeModel,
BaggedTree = baggedTreeModel,
BoostedTree = gbmModel,
Cubist=cubistModel) )
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: CondInfTree, BaggedTree, BoostedTree, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## CondInfTree 0.08473608 0.08861893 0.09678825 0.09472839 0.09863304 0.10685968
## BaggedTree 0.07378771 0.07915025 0.08041089 0.08156085 0.08497580 0.08983852
## BoostedTree 0.07957218 0.08495870 0.08981795 0.08876060 0.09242079 0.09642614
## Cubist 0.07151854 0.07393146 0.07585573 0.07787389 0.08160881 0.08865890
## NA's
## CondInfTree 0
## BaggedTree 0
## BoostedTree 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CondInfTree 0.11272576 0.1192896 0.1294941 0.1261672 0.1310877 0.1371878 0
## BaggedTree 0.09986836 0.1043449 0.1101592 0.1101612 0.1149333 0.1206988 0
## BoostedTree 0.10603498 0.1142042 0.1156811 0.1164589 0.1205235 0.1277245 0
## Cubist 0.09965847 0.1023029 0.1042195 0.1056102 0.1082524 0.1153712 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CondInfTree 0.3979264 0.4302470 0.4657415 0.4667827 0.5132486 0.5242436 0
## BaggedTree 0.5416168 0.5580249 0.5838407 0.5875821 0.6033266 0.6660155 0
## BoostedTree 0.4940727 0.5174132 0.5323353 0.5393031 0.5516800 0.6073912 0
## Cubist 0.5622753 0.5996309 0.6180414 0.6179799 0.6280174 0.6935418 0
Single Tree Models - cTree
convert_top_20_to_df <- function(df){
  # Move the row names (predictor names) into a regular column
  df1 <- as.data.frame(df)
  df1['Predictors'] <- rownames(df)
  colnames(df1) <- c("Overall", "Predictors")
  rownames(df1) <- 1:nrow(df1)
  return (df1)
}
ctree_20 <- varImp(ctreeModel)
ctree_20 <- ctree_20$importance %>%
arrange(desc(Overall))
ctree_20 <- head(ctree_20,20)
ctree_20
## Overall
## Mnf.Flow 100.000000
## Filler.Level 72.742140
## Usage.cont 66.059001
## Pressure.Setpoint 56.946619
## Fill.Pressure 46.135071
## Brand.Code.C 35.538704
## Oxygen.Filler 32.741803
## Hyd.Pressure1 31.839796
## Pressure.Vacuum 29.986506
## Hyd.Pressure2 28.639461
## Carb.Flow 26.675011
## Temperature 20.495620
## Density 18.039226
## Carb.Pressure1 15.248864
## Brand.Code.D 12.567132
## Hyd.Pressure4 8.053456
## MFR 4.862906
## PSC.Fill 4.778115
## Fill.Ounces 4.747072
## PSC 4.534167
ctree_20_df<- convert_top_20_to_df(ctree_20)
ctree_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "blue") +
theme_minimal() +
coord_flip() +
labs(title="cTree Predictor Variable Importance",
y="cTree Importance", x="Predictors") +
scale_y_continuous()
cTreePred <- predict(ctreeModel, newdata=x_test)
cTreePred <- postResample(pred = cTreePred, obs = y_test$PH)
cTreePred
## RMSE Rsquared MAE
## 0.13074344 0.46173157 0.09763883
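The importance tables in this part are on a 0 to 100 scale and are reshaped with convert_top_20_to_df before plotting. The sketch below illustrates both steps on hypothetical raw importance values. The min-max rescaling in rescale_0_100 is our assumption about how the scaled scores are derived, and the helper is repeated here only so the example is self-contained.

```r
# Hypothetical raw importance scores, with predictor names as row names
raw_importance <- data.frame(
  Overall = c(11.8, 9.3, 0.7, 6.3),
  row.names = c("Mnf.Flow", "Brand.Code.C", "PSC", "Filler.Level")
)

# Assumed scaling: the largest score maps to 100 and the smallest to 0
rescale_0_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100
scaled <- data.frame(Overall = rescale_0_100(raw_importance$Overall),
                     row.names = rownames(raw_importance))
scaled <- scaled[order(-scaled$Overall), , drop = FALSE]

# Same helper as in the report: move row names into a Predictors column
convert_top_20_to_df <- function(df) {
  df1 <- as.data.frame(df)
  df1['Predictors'] <- rownames(df)
  colnames(df1) <- c("Overall", "Predictors")
  rownames(df1) <- 1:nrow(df1)
  return(df1)
}

result <- convert_top_20_to_df(scaled)
print(result)
```

Because the scores are relative within each model, a value of 100 identifies that model's top predictor but does not allow importance to be compared across models.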
Bagged Trees - baggedTreeModel
baggedTreeModel_20 <- varImp(baggedTreeModel)
baggedTreeModel_20 <- baggedTreeModel_20$importance %>%
arrange(desc(Overall))
baggedTreeModel_20 <- head(baggedTreeModel_20,20)
baggedTreeModel_20
## Overall
## Carb.Volume 100.00000
## PC.Volume 99.74968
## Fill.Ounces 97.92218
## Carb.Pressure 92.19660
## Carb.Temp 88.07704
## PSC 68.47249
## PSC.Fill 61.53297
## PSC.CO2 54.13883
## Mnf.Flow 49.23656
## Carb.Pressure1 45.20649
## Fill.Pressure 41.05836
## Hyd.Pressure1 36.46444
## Hyd.Pressure4 32.16729
## Hyd.Pressure2 28.68248
## Filler.Level 28.45692
## Temperature 25.64155
## Filler.Speed 24.32391
## Usage.cont 21.10092
## Density 20.21854
## Carb.Flow 19.80553
baggedTreeModel_20_df<- convert_top_20_to_df(baggedTreeModel_20)
baggedTreeModel_20_df %>%
arrange(Overall)%>%
mutate(name = factor(Predictors, levels=c(Predictors))) %>%
ggplot(aes(x=name, y=Overall)) +
geom_segment(aes(xend = Predictors, yend = 0)) +
geom_point(size = 4, color = "green") +
theme_minimal() +
coord_flip() +
labs(title="baggedTreeModel Predictor Variable Importance",
y="baggedTreeModel Importance", x="Predictors") +
scale_y_continuous()
baggedTreeModelPred <- predict(baggedTreeModel, newdata=x_test)
baggedTreeModelPred <- postResample(pred = baggedTreeModelPred, obs = y_test$PH)
baggedTreeModel
## Bagged CART
##
## 1801 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1621, 1621, 1621, 1621, 1621, 1621, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1101612 0.5875821 0.08156085
Random Forest - rfModel
rfModel_20 <- varImp(rfModel)
rfModel_20 <- rfModel_20 %>%
arrange(desc(Overall))
rfModel_20 <- head(rfModel_20,20)
rfModel_20
## Overall
## Mnf.Flow 61.55518
## Usage.cont 52.20738
## Brand.Code.C 50.95006
## Pressure.Vacuum 43.22820
## Filler.Level 39.82271
## Temperature 38.31964
## Carb.Pressure1 36.84317
## Oxygen.Filler 36.75693
## Density 34.53080
## Filler.Speed 28.01853
## Hyd.Pressure1 27.23747
## Carb.Flow 27.21569
## Pressure.Setpoint 25.79915
## Hyd.Pressure2 24.44479
## Carb.Volume 22.75455
## Brand.Code.D 22.48350
## Fill.Pressure 21.60299
## MFR 19.85477
## Hyd.Pressure4 18.94150
## Brand.Code.B 17.17734
rfModel_20_df <- convert_top_20_to_df(rfModel_20)
rfModel_20_df %>%
  arrange(Overall) %>%
  mutate(name = factor(Predictors, levels = c(Predictors))) %>%
  ggplot(aes(x = name, y = Overall)) +
  geom_segment(aes(xend = Predictors, yend = 0)) +
  geom_point(size = 4, color = "purple") +
  theme_minimal() +
  coord_flip() +
  labs(title = "rfModel Predictor Variable Importance",
       y = "rfModel Importance", x = "Predictors") +
  scale_y_continuous()
rfModelPred <- predict(rfModel, newdata = x_test)
rfModelPred <- postResample(pred = rfModelPred, obs = y_test$PH)
rfModelPred
## RMSE Rsquared MAE
## 0.11744003 0.58984937 0.08682643
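The three values reported by postResample can be reproduced by hand. The sketch below assumes the usual definitions for a numeric outcome: RMSE as the root of the mean squared error, Rsquared as the squared correlation between predictions and observations, and MAE as the mean absolute error. The function name regression_metrics and the toy pH values are illustrative only and do not come from the beverage data.

```r
# Hand-rolled versions of the three metrics (assumed to match what
# caret::postResample reports for a numeric outcome).
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs))
  )
}

# Toy pH values, made up for illustration only.
obs  <- c(8.40, 8.52, 8.36, 8.60, 8.48)
pred <- c(8.45, 8.50, 8.40, 8.55, 8.50)
print(regression_metrics(pred, obs))
```

Because Rsquared here is a squared correlation, it measures how well the predictions track the observations, not how close they are, which is why a model can rank differently on Rsquared than on RMSE or MAE.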
Gradient Boost Model - gbmModel
gbmModel_20 <- varImp(gbmModel)
gbmModel_20 <- gbmModel_20$importance %>%
  arrange(desc(Overall))
gbmModel_20 <- head(gbmModel_20, 20)
gbmModel_20
## Overall
## Mnf.Flow 100.000000
## Pressure.Vacuum 33.129929
## Brand.Code.C 32.497261
## Temperature 31.832688
## Density 30.635680
## Usage.cont 26.303913
## Filler.Level 21.751844
## Carb.Pressure1 20.375560
## Oxygen.Filler 17.319583
## Brand.Code.D 9.621853
## PSC.Fill 9.240987
## MFR 8.478939
## Filler.Speed 7.836084
## Carb.Flow 7.276953
## Pressure.Setpoint 7.209489
## Hyd.Pressure4 6.917653
## Hyd.Pressure1 6.706023
## Fill.Pressure 6.670622
## Fill.Ounces 6.218982
## Carb.Temp 5.983509
gbmModel_20_df <- convert_top_20_to_df(gbmModel_20)
gbmModel_20_df %>%
  arrange(Overall) %>%
  mutate(name = factor(Predictors, levels = c(Predictors))) %>%
  ggplot(aes(x = name, y = Overall)) +
  geom_segment(aes(xend = Predictors, yend = 0)) +
  geom_point(size = 4, color = "gold") +
  theme_minimal() +
  coord_flip() +
  labs(title = "gbmModel Predictor Variable Importance",
       y = "gbmModel Importance", x = "Predictors") +
  scale_y_continuous()
gbmModelPred <- predict(gbmModel, newdata = x_test)
gbmModelPred <- postResample(pred = gbmModelPred, obs = y_test$PH)
gbmModelPred
## RMSE Rsquared MAE
## 0.12161792 0.53764990 0.09197442
Cubist Model - cubistModel
cubistModel_20 <- varImp(cubistModel)
cubistModel_20 <- cubistModel_20$importance %>%
  arrange(desc(Overall))
cubistModel_20 <- head(cubistModel_20, 20)
cubistModel_20
## Overall
## Mnf.Flow 100.00000
## Density 72.02797
## Pressure.Vacuum 66.43357
## Filler.Level 62.93706
## Temperature 53.14685
## Usage.cont 51.04895
## Carb.Pressure1 47.55245
## Oxygen.Filler 44.75524
## Brand.Code.C 40.55944
## Hyd.Pressure2 35.66434
## Carb.Flow 34.96503
## Pressure.Setpoint 34.96503
## Carb.Pressure 28.67133
## Hyd.Pressure1 25.87413
## Carb.Volume 23.77622
## Carb.Temp 23.77622
## Fill.Pressure 23.07692
## Brand.Code.D 20.97902
## Filler.Speed 20.27972
## Hyd.Pressure4 16.08392
cubistModel_20_df <- convert_top_20_to_df(cubistModel_20)
cubistVisualMostImportant <- cubistModel_20_df %>%
  arrange(Overall) %>%
  mutate(name = factor(Predictors, levels = c(Predictors))) %>%
  ggplot(aes(x = name, y = Overall)) +
  geom_segment(aes(xend = Predictors, yend = 0)) +
  geom_point(size = 4, color = "pink") +
  theme_minimal() +
  coord_flip() +
  labs(title = "cubistModel Predictor Variable Importance",
       y = "cubistModel Importance", x = "Predictors") +
  scale_y_continuous()
cubistVisualMostImportant
cubistModelPred <- predict(cubistModel, newdata = x_test)
cubistModelPred <- postResample(pred = cubistModelPred, obs = y_test$PH)
cubistModelPred
## RMSE Rsquared MAE
## 0.11424048 0.58481353 0.08299015
## plotmo grid: Brand.Code.A Brand.Code.B Brand.Code.C Brand.Code.D
## 0 1 0 0
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## -0.04813177 -0.003197337 -0.01273667 -0.0007745981 -0.005304253 -0.02933575
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## -0.03032371 -0.06544576 0.05637876 0.02422739 -0.07935299 -0.01488397
## Hyd.Pressure2 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 0.08985365 -0.004717008 0.08813245 0.0678782 -0.04144125 0.05617176
## Carb.Flow Density MFR Pressure.Vacuum Oxygen.Filler
## 0.09978965 -0.1027022 0.06435543 -0.04451928 -0.0531649
## Pressure.Setpoint
## -0.1146138
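From our experimentation with 12 different models, we selected the Cubist model. For it we recorded an RMSE of 0.10976, an MAE of 0.081, and an Rsquared of 0.601, the best values we observed. The table below, sorted by RMSE, reports different figures for the Cubist model (RMSE 0.1142, MAE 0.0830, Rsquared 0.585); in that table it ranks second on RMSE and MAE, behind the bagged tree model, and third on Rsquared, behind the Random Forest and bagged tree models.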
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| baggedTree Model | 0.110161222176994 | 0.587582121274168 | 0.0815608529120577 |
| Cubist Model | 0.114240479875035 | 0.584813533518392 | 0.082990152334238 |
| Random Forest Model | 0.117440026812849 | 0.589849374176466 | 0.086826432034632 |
| Gradient Boost Model | 0.121617924199404 | 0.537649903321855 | 0.091974418000837 |
| KNN | 0.127265605246562 | 0.443813404567053 | 0.0982312937654356 |
| cTree Model | 0.130743440773608 | 0.461731568560653 | 0.0976388347059531 |
| Linear Model | 0.134119241211064 | 0.384037147180449 | 0.105085743503772 |
| Partial Least Square | 0.134146647737492 | 0.383603238597427 | 0.105370300144996 |
| Ridge Regression | 0.134535682073648 | 0.376695740032475 | 0.105548235473887 |
| Multivariate Adaptive Regression Spline | 0.135101845051539 | 0.417088631109576 | 0.103248537986403 |
| Support Vector Machines - Linear | 0.141656699921108 | 0.359436260223437 | 0.105443174125596 |
| Neural Network | 7.55157663076987 | NA | 7.54950649350649 |
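A comparison table of this kind can be assembled directly from the postResample vectors. The sketch below is one possible way to do it, using only the three test-set results whose output is printed in the previous part (Random Forest, Gradient Boost, and Cubist); the object names results and best are our own.

```r
# Test-set metrics copied from the postResample outputs shown in this report.
results <- rbind(
  "Random Forest Model"  = c(RMSE = 0.11744003, Rsquared = 0.58984937, MAE = 0.08682643),
  "Gradient Boost Model" = c(RMSE = 0.12161792, Rsquared = 0.53764990, MAE = 0.09197442),
  "Cubist Model"         = c(RMSE = 0.11424048, Rsquared = 0.58481353, MAE = 0.08299015)
)

# Rank models from lowest to highest RMSE.
results <- results[order(results[, "RMSE"]), ]
print(round(results, 4))

# Name of the best model under each criterion.
best <- c(
  RMSE     = rownames(results)[which.min(results[, "RMSE"])],
  Rsquared = rownames(results)[which.max(results[, "Rsquared"])],
  MAE      = rownames(results)[which.min(results[, "MAE"])]
)
print(best)
```

Reporting the winner per metric makes it visible when the criteria disagree, as they do for these three models on Rsquared.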
We will apply the Cubist model to the Student evaluation data to predict the PH variable.
First, as we did with the Student training data, we convert the categorical Brand.Code variable in the Student evaluation data to dummy variables.
## Dummy Variable Object
##
## Formula: ~Brand.Code
## 1 variables, 0 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
dummies2 <- as.data.frame(predict(mod, predictors_evaluate))
# Drop the original Brand.Code column, then bind the dummy columns to the remaining predictors.
predictors_evaluate2 <- subset(predictors_evaluate, select = -c(Brand.Code))
predictors_evaluate2 <- cbind(dummies2, predictors_evaluate2)
cubistPred <- round(predict(cubistModel, newdata = predictors_evaluate2), 2)
head_predictions <- head(cubistPred, 10)
| x |
|---|
| 8.76 |
| 8.72 |
| 8.76 |
| 8.85 |
| 8.76 |
| 8.76 |
| 8.73 |
| 8.91 |
| 8.72 |
| 8.90 |
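The dummy-variable step above relies on the evaluation data being encoded with the same columns as the training data. The sketch below illustrates that idea with base R's model.matrix as a stand-in for the dummyVars object used in this report. The toy rows, the Temperature values, and the helper name encode_brand are made up for illustration; only the brand codes A to D come from the report.

```r
# Toy data; the brand codes A-D follow the report, the rows are made up.
train <- data.frame(Brand.Code  = c("A", "B", "C", "D", "B"),
                    Temperature = c(65.2, 66.0, 65.8, 66.4, 65.6),
                    stringsAsFactors = FALSE)
evaluation <- data.frame(Brand.Code  = c("B", "D", "B"),
                         Temperature = c(66.1, 65.9, 66.3),
                         stringsAsFactors = FALSE)

# Fix the factor levels from the training data so both sets share one encoding.
brand_levels <- sort(unique(train$Brand.Code))

encode_brand <- function(df, levels) {
  df$Brand.Code <- factor(df$Brand.Code, levels = levels)
  # "- 1" drops the intercept, giving one indicator column per level.
  dummies <- model.matrix(~ Brand.Code - 1, data = df)
  colnames(dummies) <- sub("^Brand\\.Code", "Brand.Code.", colnames(dummies))
  # Replace the original column with its indicator columns.
  cbind(as.data.frame(dummies), df[setdiff(names(df), "Brand.Code")])
}

evaluation2 <- encode_brand(evaluation, brand_levels)
print(evaluation2)
```

Even though brands A and C never occur in the toy evaluation rows, their indicator columns are still created (all zeros), so the column layout matches what a model trained on all four brands expects. Note that model.matrix drops rows with a missing Brand.Code under its default NA handling, so missing values would need to be imputed beforehand.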
The data science team found that the Cubist model is the best for predicting the PH value. The most important predictors from this model are shown in the visualization below. The top five predictors, in order of importance, are Mnf.Flow, Density, Pressure.Vacuum, Filler.Level, and Temperature. Two brand indicator (dummy) variables, Brand.Code.C and Brand.Code.D, are also among the 20 most important predictors.
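As a quick cross-check of these predictors, the top-five lists of the three models whose importance tables are complete in this section can be compared. The sketch below copies the names from the Random Forest, Gradient Boost, and Cubist importance outputs; the object names top5, shared, and counts are our own.

```r
# Top five predictors per model, copied from the importance outputs above.
top5 <- list(
  rf     = c("Mnf.Flow", "Usage.cont", "Brand.Code.C", "Pressure.Vacuum", "Filler.Level"),
  gbm    = c("Mnf.Flow", "Pressure.Vacuum", "Brand.Code.C", "Temperature", "Density"),
  cubist = c("Mnf.Flow", "Density", "Pressure.Vacuum", "Filler.Level", "Temperature")
)

# Predictors present in every model's top five.
shared <- Reduce(intersect, top5)
print(shared)

# How many of the three top-five lists each predictor appears in.
counts <- sort(table(unlist(top5)), decreasing = TRUE)
print(counts)
```

Predictors that appear in all three lists are less likely to be an artifact of one model's importance measure.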
We have exported the predicted PH values to the attached Excel file.
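The export code itself is not shown in this report. As a dependency-free stand-in, the sketch below writes predictions to a CSV file with base R instead of an Excel workbook. The file location is a hypothetical temporary path, and the ten values are the first ten predictions listed earlier.

```r
# First ten predicted pH values, copied from the table in this report.
predictions <- data.frame(PH = c(8.76, 8.72, 8.76, 8.85, 8.76,
                                 8.76, 8.73, 8.91, 8.72, 8.90))

# Hypothetical output location; a temporary file keeps the sketch side-effect free.
out_file <- tempfile(pattern = "ph_predictions", fileext = ".csv")
write.csv(predictions, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.csv(out_file)
print(nrow(check))
```

Writing an actual .xlsx file would require an add-on package, which we leave out here to keep the sketch self-contained.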