This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Insatll libraries
library(ggplot2)
library(readr)
library(DataExplorer)
library(Amelia)
library(VIM)
library(dplyr)
library(forecast)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)
library(caret)
library(car)
library(forcats)
library(RColorBrewer)
library(randomForest)
library(gbm)
library(Cubist)
library(kableExtra)
library(e1071)#load data
train <- read.csv("TrainingData.csv")
#review
glimpse(train)## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
From the above it is noted that the train data set:
Brand.Code is character type and seems it is an unordered categorical variable which needs to be updated as suchHyd.Pressure4, Filler.Speed, Carb.Flow, Bowl.Setpoint, remaining are floatMnf. Flow -100.20 to 229.40, Carb.Flow 26 to 5104, and PSC 0.002 to 0.270#NA counts by column
#sapply(train, function(x) sum(is.na(x)))
VIM::aggr(train, numbers=T, sortVars=T, bars = FALSE, border= 'white',
cex.axis = .6,
ylab=c("Proportion of NAs", "Combinations"))Based on the above:
MFR variable is missing about 8% of its valuesFiller.Speed is missing about 2%Brand.Code, Pressure.Vacuum, Air.PressurerNext, the distributions of numerical response variable PH, the categorical predictor variable Brand.Code and remaining numerical predictor variables.
Additionally, using train Summary Statistics the values, frequency of each values of the variables can be noted, as well as small visuals of thier distributions.
summary(train)## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
Below is the distribution of the Brand.Code variable. The variable has 5 levels, one of which is empty and appears unlabeled. Of the codes, Brand.Code B has has the highest number of values, followed by D, C, A, and lastly the unlabeled Brand.Code observation.
ggplot(train, aes(x=reorder(Brand.Code, Brand.Code, function(x)-length(x)))) +
geom_bar() + labs(x='Brand.Code')+
labs(title= 'Brand.Code Distribution')+
theme_minimal()Next are the distributions for all the numeric variables.
DataExplorer::plot_histogram(train, nrow = 3L, ncol = 4L)From the above, variables in the training data set exhibit the follow skews in distribution:
Relatively Normal Distributions: Carb.Pressure, Carb.Temp, Fill.Ounces, PC.Volume, PH (response variable)
Left-skew Distributions: Carb.Flow, Filler.Speed, Mnf.Flow, MFR, Bowl.Setpoint, Filler.Level, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Pressure1, Filler.Speed
Right-skew Distributions: Pressure.Setpoint, Fill.Pressure, Hyd.Pressure1, Temperature, Carb.Volume, PSC, PSC.CO2, PSC.Fill, Balling, Density, Hyd.Pressure4, Air.Pressurer, Alch.Rel, Carb.Rel, Oxygen.Filler, Balling.Lvl, Pressure.Vacuum
The relationship between the variables are reviewed using a correlation plot in order to detect multicolliniarity within the training data set. Based on the below, it looks like multicolliniarity is an issue provided the correlations between a lot of the predictor variables.
numeric_values <- train
numeric_values<- numeric_values %>%
select_if(is.numeric) %>%
na.omit()
train_cor <- cor(numeric_values)
#train_cor_filtered <- cor(train_cor[,-findCorrelation(train_cor,cutoff = 0.9,exact = TRUE)])
corrplot(train_cor, tl.col = 'black', col=brewer.pal(n=10, name="RdYlBu"))ph_corr <- as.data.frame(cor(numeric_values[-1], numeric_values$PH))
ph_corr <- cbind("Predictor" = rownames(ph_corr), ph_corr)
rownames(ph_corr) <- 1:nrow(ph_corr)
ph_corr <- ph_corr[-24,]
ggplot(ph_corr, aes(x=fct_reorder(factor(Predictor),V1), y = (V1))) +
geom_col(position="dodge", fill="steelblue") +
coord_flip()+
labs(title="Correlations to pH",
x="Predictors",
y="Correlation Coefficient")+
geom_text(aes(label = round(V1,2)), colour = "black", size = 3,
position = position_stack(vjust = 0.6))+
theme_minimal()Un-informative variables are detected using the nearZeroVar function. There is one variable, Hyd.Pressure1 with near-zero variance which will be removed from in the pre-processing stage.
nzv<- nearZeroVar(train, saveMetrics= TRUE)
filter(nzv, nzv=="TRUE")## freqRatio percentUnique zeroVar nzv
## Hyd.Pressure1 31.11111 9.529366 FALSE TRUE
Brand.CodeNext, the empty value in the Brand.Code variable is updated with the string “Unknown”. Additionally, the variable is updated to an unordered factor type variable.
#add name to empty string
train$Brand.Code[train$Brand.Code == ""] <- "Unknown"
#convert variable type to factor
train <- train %>%
dplyr::mutate(Brand.Code = factor(Brand.Code,
levels = c('A','B','C','D', 'Unknown'),
ordered = FALSE))Pre-Process Training Data for linear and non-linear models
Pre-processing of the data is needed based on the distributions and missing values noted in the training data set. The training data for linear and non-linear needs to be normalized where as the data does not need normalization for tree-based models.
set.seed(624)
#remove pH from the train data set in order to only transform the predictors
train_features <- train %>%
dplyr::select(-c(PH))
#remove nzv, correlated values, center and scale, apply BoxCox for normalization
preProc <- preProcess(train_features, method=c("knnImpute","nzv","corr",
"center", "scale", "BoxCox"))
#get the transformed features
preProc_train <- predict(preProc, train_features)
#add the PH response variable to the preProcessed train features
preProc_train$PH <- train$PH
#there are 4 NAs in the PH response variable, those need to be removed
preProc_train <- na.omit(preProc_train)
#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)
train_data <- preProc_train[training_set,]
eval_data <- preProc_train[-training_set,]We consider these linear regression models: multi-linear regression, partial least squares, AIC optimized . We utilize the train() function for all three models, feeding the same datasets for X and Y, and specifying the proper model-building technique via the “method” variable.
Moreover, the prediction() function in combination with postResample()are used to generate summary statistics for how our model performed on the evaluation data:
First, a multi-linear regression is created to predict the pH response variable.
Multiple linear regression is used to assess the relationship between two variables while taking into account the effect of other variables. By taking into account the effect of other variables, we cancel out the effect of these other variables in order to isolate and measure the relationship between the two variables of interest. This point is the main difference with simple linear regression.
#Remove PH from sets to feed models
set.seed(222)
y_train <- subset(train_data, select = -c(PH))
y_test <- subset(eval_data, select = -c(PH))
#generate model
linear_model <- train(x= y_train, y= train_data$PH,
method='lm',
trControl=trainControl(method = "cv", number = 10))
#evaluate model
lmPred <- predict(linear_model, newdata = y_test)
lmResample <- postResample(pred=lmPred, obs = eval_data$PH)Next PLS regression is performed on the data to predict PH. PLS was chosen given the multicolliniarity detected earlier in the exploratory data analysis phase.
Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space.
set.seed(222)
#generate model
pls_model <- train(y_train, train_data$PH,
method='pls',
metric='Rsquared',
tuneLength=10,
trControl=trainControl(method = "cv", number = 10))
#evaluate model metrics
plsPred <-predict(pls_model, newdata=y_test)
plsReSample <- postResample(pred=plsPred, obs = eval_data$PH)Lastly, for our linear models, a stepwise AIC model is run using stepAIC. The stepwise regression is performed by specifying the direction parameter with “both.”
Stepwise regression is a combination of both backward elimination and forward selection methods. Stepwise method is a modification of the forward selection approach and differs in that variables already in the model do not necessarily stay. As in forward selection, stepwise regression adds one variable to the model at a time. After a variable is added, however, stepwise regression checks all the variables already included again to see whether there is a need to delete any variable that does not provide an improvement to the model based on a certain criterion.
set.seed(222)
#generate model
initial <- lm(PH ~ . , data = train_data)
AIC_model <- stepAIC(initial, direction = "both",
trace = 0)
#evaluate model metrics
AIC_Pred <-predict(AIC_model, newdata=y_test)
aicResample <- postResample(pred=AIC_Pred, obs=eval_data$PH)We need to verify model performance and identify the strongest performing model in our multi-linear regression subset.
display <- rbind("Linear Regression" = lmResample,
"Stepwise AIC" = aicResample,
"Partial Least Squares" = plsReSample)
display %>% kable() %>% kable_paper()| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Linear Regression | 0.1276140 | 0.4344247 | 0.0997148 |
| Stepwise AIC | 0.1279072 | 0.4318320 | 0.1000196 |
| Partial Least Squares | 0.1276828 | 0.4337910 | 0.0995133 |
Building non-linear models. We will try k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), and neural networks. These models are not based on simple linear combinations of the predictors.
K-Nearest Neighbors simply predicts a new sample using the K-closest samples from the training set. The predicted response for the new sample is then the mean of the K neighbors’ responses. Predictors with the largest scales will contribute most to the distance between samples so centering and scaling the data during pre-processing is important.
set.seed(624)
knnModel <- train(PH~., data = train_data,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
#knnModel
knnPred <- predict(knnModel, newdata = eval_data)
knn_metrics <- postResample(pred = knnPred, obs = eval_data$PH)
#knn_metricsSupport Vector Machines follow the framework of robust regression where we seek to minimize the effect of outliers on the regression equations. We find parameter estimates that minimize SSE by not squaring the residuals when they are very large. In addition samples that the model fits well have no effect on the regression equation. A threshold is set using resampling and a kernel function which specifies the relationship between predictors and outcome so that only poorly predicted points called support vectors are used to fit the line. The radial kernel we are using has an additional parameter which impacts the smoothness of the upper and lower boundary.
set.seed(624)
tc <- trainControl(method = "cv",
number = 5,
classProbs = T)
svmModel <- train(PH~., data = train_data,
method = "svmRadial",
preProcess = c("BoxCox","center", "scale"),
trControl = tc,
tuneLength = 9)
#svmModel
svmPred <- predict(svmModel, newdata = eval_data)
svm_metrics <- postResample(pred = svmPred, obs = eval_data$PH)
#svm_metricsMARS uses surrogate features instead of the original predictors. However, whereas PLS and neural networks are based on linear combinations of the predictors, MARS creates two contrasted versions of a predictor to enter the model. MARS features breaks the predictor into two groups, a “hinge” function of the original based on a cut point that achieves the smallest error, and models linear relationships between the predictor and the outcome in each group. The new features are added to a basic linear regression model to estimate the slopes and intercepts.
set.seed(624)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars <- train(PH~., data = train_data,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
#mars
marsPred <- predict(mars, newdata = eval_data)
mars_metrics <- postResample(pred = marsPred, obs = eval_data$PH)
#mars_metricsObserved Vs. Predicted - Non - Linear Models with Reduced Predictor Set.
knnModel_pred <- knnModel %>% predict(eval_data)
# Model performance metrics
knn_Accuracy <- data.frame(
Model = "k-Nearest Neighbors",
RMSE = caret::RMSE(knnModel_pred,eval_data$PH),
Rsquare = caret::R2(knnModel_pred,eval_data$PH))
pred_svm <- svmModel %>% predict(eval_data)
# Model SVM performance metrics
SMV_Acc <- data.frame(
Model = "Support Vector Machine",
RMSE = caret::RMSE(pred_svm, eval_data$PH),
Rsquare = caret::R2(pred_svm, eval_data$PH)
)
#summary(marsTuned)
# Make MARS predictions
pred_mars <- mars %>% predict(eval_data)
# Model MARS performance metrics
MARS_Acc <- data.frame(
Model = "MARS Tuned",
RMSE = caret::RMSE(pred_mars, eval_data$PH),
Rsquare = caret::R2(pred_mars, eval_data$PH)
)
names(MARS_Acc)[names(MARS_Acc) == 'y'] <- "Rsquare"
#rbind(knn_Accuracy,SMV_Acc,MARS_Acc)
### code for the plot
par(mar = c(4, 4, 4, 4))
par(mfrow=c(2,2))
plot(knnModel_pred, eval_data$PH, ylab="Observed", col = "red")
abline(0, 1, lwd=2)
plot(pred_svm, eval_data$PH, ylab="Observed", col = "dark green")
abline(0, 1, lwd=2)
plot(pred_mars, eval_data$PH, ylab="Observed", col = "blue")
abline(0, 1, lwd=2)
mtext("Observed Vs. Predicted - Non - Linear Models with Reduced Predictor Set", side = 3, line = -2, outer = TRUE)Neural networks, like partial least squares, the outcome is modeled by an intermediary set of unobserved variables. These hidden units are linear combinations of the original predictors, but, unlike PLS models, they are not estimated in a hierarchical fashion. There are no constraints that help define these linear combinations. Each unit must then be related to the outcome using another linear combination connecting the hidden units. Treating this model as a nonlinear regression model, the parameters are usually optimized using the back-propagation algorithm to minimize the sum of the squared residuals.
set.seed(624)
NNModel <- avNNet(PH~., data = train_data,
size = 5,
decay = 0.01,
linout = TRUE,
trace = FALSE,
maxit = 500)
#NNModel
pred_NNModel <- NNModel %>% predict(eval_data)
nn_metrics <- postResample(pred = pred_NNModel, obs = eval_data$PH)
#nn_metricsrbind( "KNN" = knn_metrics,
"SVM" = svm_metrics,
"MARS" = mars_metrics,
"Neural Network" = nn_metrics) %>%
kable() %>% kable_paper()| RMSE | Rsquared | MAE | |
|---|---|---|---|
| KNN | 0.1181856 | 0.5169728 | 0.0889128 |
| SVM | 0.1132267 | 0.5605168 | 0.0818392 |
| MARS | 0.1227359 | 0.4787088 | 0.0934747 |
| Neural Network | 0.1116412 | 0.5678863 | 0.0829216 |
In the optimal nonlinear regression model was the Neural model with one of the lowest RMSE and higher R-squared. Below are the most important overall variables using the function varImp.
varImp(NNModel) %>% head(10)## Overall
## X1 6.6457519
## X10 0.7462469
## X101 0.9932248
## X102 0.9562043
## X103 0.7116890
## X104 0.3775559
## X11 0.6425230
## X110 9.6704146
## X111 1.0555018
## X112 10.0602537
The top predictors in our best performing non-linear model with Neural Network.
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.
ggplot(train_data, aes(Oxygen.Filler, PH)) +
geom_point()ggplot(train_data, aes(Mnf.Flow , PH)) +
geom_point()ggplot(train_data, aes(Bowl.Setpoint , PH)) +
geom_point()Checking the top nonlinear predictors overall:
Pre-Process Training Data for Tree Based Models
The training data is pre-processed differently for tree based models since they do not require the training data to be normalized.
set.seed(624)
#remove pH from the train data set in order to only transform the predictors
train_features <- train %>%
dplyr::select(-c(PH))
#remove nzv, correlated values, impute NAs
preProc <- preProcess(train_features, method=c("knnImpute","nzv","corr"))
#get the transformed features
preProc_train <- predict(preProc, train_features)
#add the PH response variable to the preProcessed train features
preProc_train$PH <- train$PH
#there are 4 NAs in the PH response variable, those need to be removed
preProc_train <- na.omit(preProc_train)
#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)
train_data_rf <- preProc_train[training_set,]
eval_data_rf <- preProc_train[-training_set,]
train_rf_predictors <- train_data_rf[-c(26)]
eval_rf_predictors <- eval_data_rf[-c(26)]Below a random forest regression model is created to predict the desired response variable, PH.
Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.
set.seed(624)
#fit the model
rf_model <- randomForest(train_rf_predictors, train_data_rf$PH,
importance = TRUE, ntrees = 500)
#metrics
rfPred <- predict(rf_model, newdata = eval_rf_predictors)
rf_metrics <- postResample(pred = rfPred, obs = eval_data_rf$PH)
#rf_metricsNext, boosted trees are also used to model the data using the train function and defining the method parameter with gbm to generate a prediction for the response variable PH.
Boosted Trees are commonly used in regression. They are an ensemble method similar to bagging, however, instead of building multiple trees in parallel, they build tress sequentially. They use the previous tree to find errors and build a new tree by correcting the previous.
set.seed(624)
#fit model
gbm_model <- train(train_rf_predictors, train_data_rf$PH,
method = "gbm",
verbose = FALSE)
#metrics
gbmPred <- predict(gbm_model, newdata = eval_rf_predictors)
gbm_metrics <- postResample(pred = gbmPred, obs = eval_data_rf$PH)
#gbm_metricsFinally for the last Tree Based Model, a Cubist model is run also using the train function and defining the method parameter with cubist.
In Cubist models, a tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
set.seed(6)
#fit model
cube_model <- train(train_rf_predictors, train_data_rf$PH,
method = "cubist",
verbose = FALSE)
#metrics
cubePred <- predict(cube_model, newdata = eval_rf_predictors)
cube_metrics <- postResample(pred = cubePred, obs = eval_data_rf$PH)
#cube_metricsOf the tree based models produced, the best performing model turned out to be the Cubist Model with an RMSE of 0.095, and an R-squared value of 0.685 indicating the percentage of the variance in the dependent variable, PH that the independent variables explain.
rbind("Random Forest" = rf_metrics,
"Boosted Trees" = gbm_metrics,
"Cubist" = cube_metrics) %>%
kable() %>% kable_paper()| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Random Forest | 0.0990469 | 0.6761053 | 0.0722867 |
| Boosted Trees | 0.1105297 | 0.5789906 | 0.0846337 |
| Cubist | 0.0952903 | 0.6849514 | 0.0683080 |
Next, the metrics for each of the best performing Linear, Non-Linear, and Tree based models are complied to evaluate the performance of each and select the best model.
all_metrics <- as.data.frame(rbind("Linear Regression" = lmResample,
"Neural Network" = nn_metrics,
"Cubist" = cube_metrics))
all_metrics[order(all_metrics$RMSE),] %>%
kable() %>% kable_paper()| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Cubist | 0.0952903 | 0.6849514 | 0.0683080 |
| Neural Network | 0.1116412 | 0.5678863 | 0.0829216 |
| Linear Regression | 0.1276140 | 0.4344247 | 0.0997148 |
Using the RMSE, we see the best performing model of those created is the Cubist model. Next we will proceed to evaluate the model and its important variables using varImp.
plot(varImp(cube_model), top=10)Next, using the Cubist model which was identified as the best performing model above, the PH values are predicted with the provided test data set.
First the data is loaded and reviewed. Using the glimpse function, we note the test data has a similar structure to that of the training data with Brand.Code appearing as chr type which will need to be checked for empty values and converted to unordered factor type. The PH response variable needs to be removed to run the model and will be subset from the features.
#load test data
test <- read.csv("TestData.csv")
#review
glimpse(test)## Rows: 267
## Columns: 33
## $ Brand.Code <chr> "D", "A", "B", "B", "B", "B", "A", "B", "A", "D", "B…
## $ Carb.Volume <dbl> 5.480000, 5.393333, 5.293333, 5.266667, 5.406667, 5.…
## $ Fill.Ounces <dbl> 24.03333, 23.95333, 23.92000, 23.94000, 24.20000, 24…
## $ PC.Volume <dbl> 0.2700000, 0.2266667, 0.3033333, 0.1860000, 0.160000…
## $ Carb.Pressure <dbl> 65.4, 63.2, 66.4, 64.8, 69.4, 73.4, 65.2, 67.4, 66.8…
## $ Carb.Temp <dbl> 134.6, 135.0, 140.4, 139.0, 142.2, 147.2, 134.6, 139…
## $ PSC <dbl> 0.236, 0.042, 0.068, 0.004, 0.040, 0.078, 0.088, 0.0…
## $ PSC.Fill <dbl> 0.40, 0.22, 0.10, 0.20, 0.30, 0.22, 0.14, 0.10, 0.48…
## $ PSC.CO2 <dbl> 0.04, 0.08, 0.02, 0.02, 0.06, NA, 0.00, 0.04, 0.04, …
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 116.6, 118.8, 120.2, 124.8, 115.0, 118.6, 117.6, 121…
## $ Fill.Pressure <dbl> 46.0, 46.2, 45.8, 40.0, 51.4, 46.4, 46.2, 40.0, 43.8…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure3 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure4 <int> 96, 112, 98, 132, 94, 94, 108, 108, 110, 106, 98, 96…
## $ Filler.Level <dbl> 129.4, 120.0, 119.4, 120.2, 116.0, 120.4, 119.6, 131…
## $ Filler.Speed <int> 3986, 4012, 4010, NA, 4018, 4010, 4010, NA, 4010, 10…
## $ Temperature <dbl> 66.0, 65.6, 65.6, 74.4, 66.4, 66.6, 66.8, NA, 65.8, …
## $ Usage.cont <dbl> 21.66, 17.60, 24.18, 18.12, 21.32, 18.00, 17.68, 12.…
## $ Carb.Flow <int> 2950, 2916, 3056, 28, 3214, 3064, 3042, 1972, 2502, …
## $ Density <dbl> 0.88, 1.50, 0.90, 0.74, 0.88, 0.84, 1.48, 1.60, 1.52…
## $ MFR <dbl> 727.6, 735.8, 734.8, NA, 752.0, 732.0, 729.8, NA, 74…
## $ Balling <dbl> 1.398, 2.942, 1.448, 1.056, 1.398, 1.298, 2.894, 3.3…
## $ Pressure.Vacuum <dbl> -3.8, -4.4, -4.2, -4.0, -4.0, -3.8, -4.2, -4.4, -4.4…
## $ PH <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Oxygen.Filler <dbl> 0.022, 0.030, 0.046, NA, 0.082, 0.064, 0.042, 0.096,…
## $ Bowl.Setpoint <int> 130, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 45.2, 46.0, 46.0, 46.0, 50.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 147.2, 146.6, 146.4, 145.8, 146.0, 145.0, 146…
## $ Alch.Rel <dbl> 6.56, 7.14, 6.52, 6.48, 6.50, 6.50, 7.18, 7.16, 7.14…
## $ Carb.Rel <dbl> 5.34, 5.58, 5.34, 5.50, 5.38, 5.42, 5.46, 5.42, 5.44…
## $ Balling.Lvl <dbl> 1.48, 3.04, 1.46, 1.48, 1.46, 1.44, 3.02, 3.00, 3.10…
#subset features from response
test_features <- test %>%
dplyr::select(-c(PH))Using the summary function in addition to the test Summary Statistics, the distributions and NA’s are identified.
summary(test)## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03370 Median :120.0 Median :46.00
## Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :3 NA's :1 NA's :2
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
Next, for easy of visualization the Brand.Code is plotted to identify the spread of the factors and identify the empty values that will need to be updated. The distributions for the remaining features will not be added to the visualization as the distributions for tree based models do not require normalization in the pre-processing stage.
ggplot(test, aes(x=reorder(Brand.Code, Brand.Code, function(x)-length(x)))) +
geom_bar() + labs(x='Brand.Code')+
labs(title= 'Brand.Code Distribution')+
theme_minimal()Below the NAs in the test data set are visualized to easily see how many there are present. In the test data, about 10% of the missing values are within the MFR predictor variable. The NA values present in the test data will be imputed in the same fashion they were imputed in the training data by using the “knnImpute” method in the preProcess function of the caret library.
#NAs in Test data set
#NA counts by column
#sapply(test, function(x) sum(is.na(x)))
VIM::aggr(test_features, numbers=T, sortVars=T, bars = FALSE, border= 'white',
cex.axis = .6,
ylab=c("Proportion of NAs", "Combinations"))In order to run our model using the Cubist model, the test data needs to be cleaned in a similar fashion to that of train data used to run the model. To do so, the Brand.Code predictor needs to converted to a factor and add an “Unknown” level for the empty values in the variable.
#prep data- change Brand.Code to factor
#add name to empty string
test_features$Brand.Code[test_features$Brand.Code == ""] <- "Unknown"
#convert variable type to factor
test_features <- test_features %>%
dplyr::mutate(Brand.Code = factor(Brand.Code,
levels = c('A','B','C','D', 'Unknown'),
ordered = FALSE))The NAs identified in the predictors need to be imputed using the preProcess function. Additionally, near zero variance features and highly correlated variables were removed from the train data using the preProcess function. In order to ensure the same predictors in both the train and test data, the preProcess function will only be run with the method of “knnImpute,” and variables eliminated from the train data will be identified and removed from the test data using setdiff and dplyr’s select function, respectfully.
set.seed(624)
#impute NAs
preProc_test <- preProcess(test_features, method=c("knnImpute"))
#get the transformed features
test_features <- predict(preProc_test, test_features)
#identify the variables that need to be removed
setdiff(colnames(test_features),colnames(train_rf_predictors))## [1] "Hyd.Pressure1" "Hyd.Pressure3" "Filler.Level" "Filler.Speed"
## [5] "Density" "Balling" "Balling.Lvl"
#remove the variables identified above
test_features <- test_features %>%
dplyr::select(-c("Hyd.Pressure1","Hyd.Pressure3","Filler.Level",
"Filler.Speed","Density","Balling","Balling.Lvl" ))Next, the predictions for PH are generated using the predict function along with the optimal model identified earlier as the Cubist model from the Tree Based models.
predict <- round(predict(cube_model, newdata=test_features),2)
pH <- as.data.frame(predict)
pH<- rename(pH, pH = predict)Finally, the predicted PH values along with the predictor variables are exported to a CSV file using write_excel_csv.
write_excel_csv(pH, "Predictions.csv")