Instructions

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Install Libraries

library(ggplot2)
library(readr)
library(DataExplorer)
library(Amelia)
library(VIM)
library(dplyr)
library(forecast)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)
library(caret)
library(car)
library(forcats)
library(RColorBrewer)
library(randomForest)
library(gbm)
library(Cubist)
library(kableExtra)
library(e1071)

Exploratory Data Analysis

Load & Review Train Data

#load data
train <- read.csv("TrainingData.csv")
#review
glimpse(train)
## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

From the above, the following is noted about the train data set:

  • 2571 observations with 33 columns (32 predictor variables)
  • Brand.Code is character type and appears to be an unordered categorical variable; it should be converted to a factor
  • 4 predictor variables are integer type: Hyd.Pressure4, Filler.Speed, Carb.Flow, Bowl.Setpoint; the remaining numeric predictors are doubles
  • Predictor ranges vary widely, for example: Mnf.Flow -100.20 to 229.40, Carb.Flow 26 to 5104, and PSC 0.002 to 0.270 (confirmed with the quick check sketched below)
  • NAs are detected in the first few observations
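
As a quick check of these notes, the column types and the ranges of the numeric predictors can be listed directly (a minimal sketch using functions already loaded above):

#column types
sapply(train, class)

#min and max of each numeric predictor
sapply(dplyr::select_if(train, is.numeric), range, na.rm = TRUE)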

Missing Values

#NA counts by column
#sapply(train, function(x) sum(is.na(x)))
VIM::aggr(train, numbers=T, sortVars=T, bars = FALSE, border= 'white',
          cex.axis = .6,
          ylab=c("Proportion of NAs", "Combinations"))

Based on the above:

  • MFR variable is missing about 8% of its values
  • Filler.Speed is missing about 2%
  • 28 other variables missing about 1% or less of their values
  • 3 variables appear to contain all their values (no NAs present) Brand.Code, Pressure.Vacuum, Air.Pressurer
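
The exact proportions behind these observations can be confirmed with a quick base R check (sketch):

#proportion of NAs per column, sorted from most to least missing
sort(colMeans(is.na(train)), decreasing = TRUE)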

Distributions

Next, the distributions of the numerical response variable PH, the categorical predictor Brand.Code, and the remaining numerical predictors are reviewed.

Additionally, the training data summary statistics below show the range, quartiles, and missing-value counts for each variable, giving a sense of their distributions.

summary(train)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air.Pressurer      Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

Below is the distribution of the Brand.Code variable. The variable has 5 levels, one of which is empty and appears unlabeled. Of the codes, Brand.Code B has the highest number of observations, followed by D, C, A, and lastly the unlabeled Brand.Code observations.

ggplot(train, aes(x=reorder(Brand.Code, Brand.Code, function(x)-length(x)))) +
geom_bar() +  labs(x='Brand.Code')+
labs(title= 'Brand.Code Distribution')+
   theme_minimal()

Next are the distributions for all the numeric variables.

DataExplorer::plot_histogram(train, nrow = 3L, ncol = 4L)

From the above, the variables in the training data set exhibit the following skews in their distributions (these groupings can be confirmed with e1071::skewness, as sketched after the list):

  • Relatively Normal Distributions: Carb.Pressure, Carb.Temp, Fill.Ounces, PC.Volume, PH (response variable)

  • Left-skew Distributions: Carb.Flow, Filler.Speed, Mnf.Flow, MFR, Bowl.Setpoint, Filler.Level, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Pressure1

  • Right-skew Distributions: Pressure.Setpoint, Fill.Pressure, Hyd.Pressure1, Temperature, Carb.Volume, PSC, PSC.CO2, PSC.Fill, Balling, Density, Hyd.Pressure4, Air.Pressurer, Alch.Rel, Carb.Rel, Oxygen.Filler, Balling.Lvl, Pressure.Vacuum
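
As referenced above, a minimal sketch to confirm these groupings with the skewness function from e1071 (positive values indicate right skew, negative values left skew):

skews <- sapply(dplyr::select_if(train, is.numeric), e1071::skewness, na.rm = TRUE)
sort(skews, decreasing = TRUE)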

Variable Correlations

The relationships between the variables are reviewed using a correlation plot in order to detect multicollinearity within the training data set. Based on the plot below, multicollinearity appears to be an issue, given the strong correlations among many of the predictor variables.

numeric_values <- train 

numeric_values<- numeric_values %>% 
  select_if(is.numeric) %>% 
  na.omit()

train_cor <- cor(numeric_values)

#train_cor_filtered <- cor(train_cor[,-findCorrelation(train_cor,cutoff = 0.9,exact = TRUE)])

corrplot(train_cor, tl.col = 'black', col=brewer.pal(n=10, name="RdYlBu"))

#correlation of each numeric predictor with PH
ph_corr <- as.data.frame(cor(numeric_values, numeric_values$PH))

ph_corr <- cbind("Predictor" = rownames(ph_corr), ph_corr)
rownames(ph_corr) <- 1:nrow(ph_corr) 

#drop the PH self-correlation row by name rather than by position
ph_corr <- ph_corr[ph_corr$Predictor != "PH",]

ggplot(ph_corr, aes(x=fct_reorder(factor(Predictor),V1), y = (V1))) +
  geom_col(position="dodge", fill="steelblue") +
  coord_flip()+
  labs(title="Correlations to pH",
    x="Predictors",
    y="Correlation Coefficient")+
    geom_text(aes(label = round(V1,2)), colour = "black", size = 3,
              position = position_stack(vjust = 0.6))+
   theme_minimal()
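
For reference, the predictors that the commented findCorrelation line above would flag can be listed directly (sketch, assuming the same 0.9 cutoff):

#names of predictors flagged as highly correlated at the 0.9 cutoff
caret::findCorrelation(train_cor, cutoff = 0.9, names = TRUE)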

Near-Zero Variance

Uninformative variables are detected using the nearZeroVar function. One variable, Hyd.Pressure1, has near-zero variance and will be removed in the pre-processing stage.

nzv<- nearZeroVar(train, saveMetrics= TRUE)
filter(nzv, nzv=="TRUE")
##               freqRatio percentUnique zeroVar  nzv
## Hyd.Pressure1  31.11111      9.529366   FALSE TRUE

Update Brand.Code

Next, the empty value in the Brand.Code variable is updated with the string “Unknown”. Additionally, the variable is updated to an unordered factor type variable.

#add name to empty string
train$Brand.Code[train$Brand.Code == ""] <- "Unknown"

#convert variable type to factor
train <- train %>% 
  dplyr::mutate(Brand.Code = factor(Brand.Code, 
                         levels = c('A','B','C','D', 'Unknown'), 
                         ordered = FALSE))

Build Models

Pre-Process Training Data for linear and non-linear models

Pre-processing of the data is needed based on the distributions and missing values noted in the training data set. The training data for the linear and non-linear models needs to be normalized, whereas the tree-based models do not require normalization.

set.seed(624)

#remove pH from the train data set in order to only transform the predictors
train_features <- train %>% 
  dplyr::select(-c(PH))

#remove nzv, correlated values, center and scale, apply BoxCox for normalization
preProc <- preProcess(train_features, method=c("knnImpute","nzv","corr",
                                               "center", "scale", "BoxCox"))

#get the transformed features
preProc_train <- predict(preProc, train_features)

#add the PH response variable to the preProcessed train features
preProc_train$PH <- train$PH 

#there are 4 NAs in the PH response variable, those need to be removed
preProc_train <- na.omit(preProc_train)

#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)

train_data <- preProc_train[training_set,]
eval_data <- preProc_train[-training_set,]
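
As a sanity check (sketch), the predictors dropped by the nzv and correlation filters, and the resulting partition sizes, can be inspected:

#predictors removed by the nzv and correlation filters
setdiff(colnames(train_features), colnames(preProc_train))

#partition sizes
dim(train_data)
dim(eval_data)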

Linear Models

We consider three linear regression models: multiple linear regression, partial least squares, and a stepwise AIC-optimized model. The first two are fit with the train() function, feeding the same data sets for X and Y and specifying the model-building technique via the method argument; the stepwise model is fit with lm() and MASS::stepAIC().

The predict() function, in combination with postResample(), is then used to generate summary statistics describing how each model performs on the evaluation data:

Multi-linear regression

First, a multiple linear regression model is created to predict the pH response variable.

Multiple linear regression models the response as a linear combination of several predictors. Because the other predictors are included in the model, the coefficient estimated for each variable isolates its relationship with the response while holding the others constant; this is the main difference from simple linear regression.

#remove PH so only the predictors are fed to the models
set.seed(222)
x_train <- subset(train_data, select = -c(PH))
x_eval <- subset(eval_data, select = -c(PH))

#generate model
linear_model <- train(x = x_train, y = train_data$PH,
                      method = 'lm',
                      trControl = trainControl(method = "cv", number = 10))

#evaluate model
lmPred <- predict(linear_model, newdata = x_eval)
lmResample <- postResample(pred = lmPred, obs = eval_data$PH)

Partial Least Squares

Next PLS regression is performed on the data to predict PH. PLS was chosen given the multicolliniarity detected earlier in the exploratory data analysis phase.

Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space.

set.seed(222)
#generate model
pls_model <- train(x_train, train_data$PH,
                   method = 'pls',
                   metric = 'Rsquared',
                   tuneLength = 10,
                   trControl = trainControl(method = "cv", number = 10))
#evaluate model metrics
plsPred <- predict(pls_model, newdata = x_eval)
plsReSample <- postResample(pred = plsPred, obs = eval_data$PH)
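
The number of latent components chosen by cross-validation can be inspected (sketch):

#optimal number of PLS components; plot(pls_model) shows the full resampling profile
pls_model$bestTune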

Stepwise AIC optimized

Lastly, for our linear models, a stepwise AIC-optimized model is fit using stepAIC, with the direction argument set to "both."

Stepwise regression is a combination of both backward elimination and forward selection methods. Stepwise method is a modification of the forward selection approach and differs in that variables already in the model do not necessarily stay. As in forward selection, stepwise regression adds one variable to the model at a time. After a variable is added, however, stepwise regression checks all the variables already included again to see whether there is a need to delete any variable that does not provide an improvement to the model based on a certain criterion.

set.seed(222)
#generate model
initial <- lm(PH ~ . , data = train_data)
AIC_model <- stepAIC(initial, direction = "both",
                     trace = 0)

#evaluate model metrics
AIC_Pred <- predict(AIC_model, newdata = x_eval)
aicResample <- postResample(pred = AIC_Pred, obs = eval_data$PH)
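
The terms retained by the stepwise search can be reviewed (sketch); the $anova element of the fitted object also records the steps taken:

#formula of the final stepwise model
formula(AIC_model)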

Linear Model Metrics

We verify model performance and identify the strongest performer among the three linear models.

display <- rbind("Linear Regression" = lmResample,
                 "Stepwise AIC" = aicResample,
                 "Partial Least Squares" = plsReSample)
display %>% kable() %>% kable_paper()
                        RMSE       Rsquared   MAE
Linear Regression       0.1276140  0.4344247  0.0997148
Stepwise AIC            0.1279072  0.4318320  0.1000196
Partial Least Squares   0.1276828  0.4337910  0.0995133

Non-linear Models

Next, we build non-linear models: k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), and a neural network. These models are not based on simple linear combinations of the predictors.

K-Nearest Neighbors

K-Nearest Neighbors simply predicts a new sample using the K-closest samples from the training set. The predicted response for the new sample is then the mean of the K neighbors’ responses. Predictors with the largest scales will contribute most to the distance between samples so centering and scaling the data during pre-processing is important.

set.seed(624)
knnModel <- train(PH~., data = train_data, 
                  method = "knn",
                  preProc = c("center", "scale"), 
                  tuneLength = 10)

#knnModel

knnPred <- predict(knnModel, newdata = eval_data)
knn_metrics <- postResample(pred = knnPred, obs = eval_data$PH)
#knn_metrics
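
The neighborhood size selected by resampling can be checked (sketch):

#optimal k; plot(knnModel) shows the full resampling profile
knnModel$bestTune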

Support Vector Machines

Support Vector Machines follow the framework of robust regression, where we seek to minimize the effect of outliers on the regression equations. We find parameter estimates that minimize the SSE while not squaring the residuals when they are very large. In addition, samples that the model fits well have no effect on the regression equation. A threshold is set using resampling, and a kernel function specifies the relationship between the predictors and the outcome, so that only poorly predicted points, called support vectors, are used to fit the regression. The radial kernel used here has an additional parameter that affects the smoothness of the upper and lower boundary.

set.seed(624)
#classProbs applies only to classification, so it is omitted for this regression problem
tc <- trainControl(method = "cv",
                   number = 5)


svmModel <- train(PH~., data = train_data,
                    method = "svmRadial",
                    preProcess = c("BoxCox","center", "scale"),
                    trControl = tc,
                    tuneLength = 9)

#svmModel

svmPred <- predict(svmModel, newdata = eval_data)
svm_metrics <- postResample(pred = svmPred, obs = eval_data$PH)
#svm_metrics
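
The tuned kernel and cost parameters, and the number of support vectors in the final kernlab fit, can be inspected (sketch):

#tuned sigma and cost; printing svmModel$finalModel reports the number of support vectors
svmModel$bestTune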

MARS

MARS uses surrogate features instead of the original predictors. However, whereas PLS and neural networks are based on linear combinations of the predictors, MARS creates two contrasted versions of a predictor to enter the model. MARS breaks the predictor into two groups using a "hinge" function at a cut point chosen to achieve the smallest error, and it models the linear relationship between the predictor and the outcome within each group. The new features are added to a basic linear regression model to estimate the slopes and intercepts.
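
As a toy illustration (not part of the model fit), a hinge pair around a hypothetical cut point c = 0 shows how MARS splits a predictor into two one-sided features:

#each feature is nonzero on only one side of the cut point
hinge <- function(u) pmax(0, u)
x <- seq(-2, 2, by = 0.5)
data.frame(x, right = hinge(x - 0), left = hinge(0 - x))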

set.seed(624)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

mars <- train(PH~., data = train_data,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

#mars

marsPred <- predict(mars, newdata = eval_data)
mars_metrics <- postResample(pred = marsPred, obs = eval_data$PH)
#mars_metrics

Observed vs. Predicted - Non-Linear Models with Reduced Predictor Set

knnModel_pred <- knnModel %>% predict(eval_data)

# Model performance metrics
knn_Accuracy <- data.frame(
  Model = "k-Nearest Neighbors",
  RMSE = caret::RMSE(knnModel_pred,eval_data$PH),
  Rsquare = caret::R2(knnModel_pred,eval_data$PH))


pred_svm <- svmModel %>% predict(eval_data)
# Model SVM performance metrics
SMV_Acc <- data.frame(
  Model = "Support Vector Machine",
  RMSE = caret::RMSE(pred_svm, eval_data$PH),
  Rsquare = caret::R2(pred_svm, eval_data$PH)
)
#summary(marsTuned)
# Make MARS predictions
pred_mars <- mars %>% predict(eval_data)
# Model MARS performance metrics
MARS_Acc <- data.frame(
  Model = "MARS Tuned",
  RMSE = caret::RMSE(pred_mars, eval_data$PH),
  Rsquare = caret::R2(pred_mars, eval_data$PH)
)
names(MARS_Acc)[names(MARS_Acc) == 'y'] <- "Rsquare"
#rbind(knn_Accuracy,SMV_Acc,MARS_Acc)

### code for the plot
par(mar = c(4, 4, 4, 4))
par(mfrow=c(2,2))
plot(knnModel_pred, eval_data$PH, ylab="Observed", col = "red")
abline(0, 1, lwd=2)
plot(pred_svm, eval_data$PH, ylab="Observed", col = "dark green")
abline(0, 1, lwd=2)
plot(pred_mars, eval_data$PH, ylab="Observed", col = "blue")
abline(0, 1, lwd=2)
mtext("Observed Vs. Predicted  - Non - Linear Models with Reduced Predictor Set", side = 3, line = -2, outer = TRUE)

Neural Networks

In neural networks, as in partial least squares, the outcome is modeled by an intermediary set of unobserved variables (hidden units). These hidden units are linear combinations of the original predictors but, unlike in PLS models, they are not estimated in a hierarchical fashion; there are no constraints that help define these linear combinations. Each unit is then related to the outcome through another linear combination connecting the hidden units. Treating this model as a nonlinear regression model, the parameters are usually optimized using the back-propagation algorithm to minimize the sum of the squared residuals.

set.seed(624)
NNModel <- avNNet(PH~., data = train_data,
                   size = 5, 
                   decay = 0.01,
                   linout = TRUE, 
                   trace = FALSE,
                   maxit = 500)

#NNModel

pred_NNModel <- NNModel %>% predict(eval_data)
nn_metrics <- postResample(pred = pred_NNModel, obs = eval_data$PH)
#nn_metrics

Non-Linear Model Metrics

rbind( "KNN" = knn_metrics,
       "SVM" = svm_metrics,
       "MARS" = mars_metrics,
       "Neural Network" = nn_metrics) %>% 
  kable() %>% kable_paper()
                 RMSE       Rsquared   MAE
KNN              0.1181856  0.5169728  0.0889128
SVM              0.1132267  0.5605168  0.0818392
MARS             0.1227359  0.4787088  0.0934747
Neural Network   0.1116412  0.5678863  0.0829216

Select Neural Network as Best Nonlinear Regression Model

The optimal nonlinear regression model was the neural network, which had the lowest RMSE and the highest R-squared among the nonlinear models. Below are the most important overall variables according to the varImp function.

varImp(NNModel) %>% head(10)
##         Overall
## X1    6.6457519
## X10   0.7462469
## X101  0.9932248
## X102  0.9562043
## X103  0.7116890
## X104  0.3775559
## X11   0.6425230
## X110  9.6704146
## X111  1.0555018
## X112 10.0602537

Next, the relationships between the top predictors in our best performing non-linear model, the neural network, and the response are explored.

ggplot(train_data, aes(Oxygen.Filler, PH)) +
  geom_point()

ggplot(train_data, aes(Mnf.Flow , PH)) +
  geom_point()

ggplot(train_data, aes(Bowl.Setpoint    , PH)) +
  geom_point()

Checking the top nonlinear predictors overall (relative importance shown in parentheses):

  • Oxygen.Filler (100) has a positive correlation with PH
  • Mnf.Flow (62) has a negative correlation with PH
  • Bowl.Setpoint (52) has a positive correlation with PH

Tree Based Models

Pre-Process Training Data for Tree Based Models

The training data is pre-processed differently for tree based models since they do not require the training data to be normalized.

set.seed(624)

#remove pH from the train data set in order to only transform the predictors
train_features <- train %>% 
  dplyr::select(-c(PH))

#remove nzv, correlated values, impute NAs
preProc <- preProcess(train_features, method=c("knnImpute","nzv","corr"))

#get the transformed features
preProc_train <- predict(preProc, train_features)

#add the PH response variable to the preProcessed train features
preProc_train$PH <- train$PH 

#there are 4 NAs in the PH response variable, those need to be removed
preProc_train <- na.omit(preProc_train)

#partition data for evaluation
training_set <- createDataPartition(preProc_train$PH, p=0.8, list=FALSE)

train_data_rf <- preProc_train[training_set,]
eval_data_rf <- preProc_train[-training_set,]

#drop the PH response column (by name) to create the predictor sets
train_rf_predictors <- dplyr::select(train_data_rf, -PH)
eval_rf_predictors <- dplyr::select(eval_data_rf, -PH)

Random Forest

Below, a random forest regression model is created to predict the response variable, PH. Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and takes their majority vote for classification or their average prediction for regression.

set.seed(624)
#fit the model
rf_model <- randomForest(train_rf_predictors, train_data_rf$PH, 
                         importance = TRUE, ntree = 500)
#metrics
rfPred <- predict(rf_model, newdata = eval_rf_predictors)
rf_metrics <- postResample(pred = rfPred, obs = eval_data_rf$PH)
#rf_metrics
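
Because importance = TRUE was set, the permutation importance scores from the random forest can be inspected (sketch):

#top 10 predictors by percent increase in MSE when permuted
rf_imp <- randomForest::importance(rf_model)
head(rf_imp[order(-rf_imp[, "%IncMSE"]), ], 10)
#varImpPlot(rf_model)  #graphical version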

Boosted Trees

Next, boosted trees are used to model the data, calling the train function with the method parameter set to "gbm" to generate predictions for the response variable PH.

Boosted trees are commonly used in regression. They are an ensemble method similar to bagging; however, instead of building multiple trees in parallel, they build trees sequentially, using the errors of the previous tree to fit each new tree.

set.seed(624)
#fit model
gbm_model <- train(train_rf_predictors, train_data_rf$PH,
    method = "gbm",
    verbose = FALSE)
#metrics
gbmPred <- predict(gbm_model, newdata = eval_rf_predictors)
gbm_metrics <- postResample(pred = gbmPred, obs = eval_data_rf$PH)
#gbm_metrics
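
The boosting parameters selected by resampling can be checked (sketch):

#chosen number of trees, interaction depth, shrinkage, and minimum node size
gbm_model$bestTune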

Cubist

Finally, for the last tree-based model, a Cubist model is fit, also using the train function with the method parameter set to "cubist".

In Cubist models, a tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.

set.seed(6)
#fit model
cube_model <- train(train_rf_predictors, train_data_rf$PH,
    method = "cubist",
    verbose = FALSE)
#metrics
cubePred <- predict(cube_model, newdata =  eval_rf_predictors)
cube_metrics <- postResample(pred = cubePred, obs = eval_data_rf$PH)
#cube_metrics
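
The Cubist tuning parameters chosen by resampling can be checked (sketch); the full rule listing is also available but verbose:

#chosen number of committees and neighbors
cube_model$bestTune
#summary(cube_model$finalModel)  #prints the full rule set (long)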

Tree Based Model Metrics

Of the tree-based models produced, the best performer is the Cubist model, with an RMSE of 0.095 and an R-squared of 0.685, meaning the model explains roughly 68.5% of the variance in the response variable PH on the evaluation set.

rbind("Random Forest" = rf_metrics,
      "Boosted Trees" = gbm_metrics,
      "Cubist" = cube_metrics) %>% 
  kable() %>% kable_paper()
                RMSE       Rsquared   MAE
Random Forest   0.0990469  0.6761053  0.0722867
Boosted Trees   0.1105297  0.5789906  0.0846337
Cubist          0.0952903  0.6849514  0.0683080

Evaluate & Select Models

Next, the metrics for the best performing linear, non-linear, and tree-based models are compiled to compare their performance and select the final model.

all_metrics <- as.data.frame(rbind("Linear Regression" = lmResample,
      "Neural Network" = nn_metrics,
      "Cubist" = cube_metrics))

all_metrics[order(all_metrics$RMSE),] %>% 
  kable() %>% kable_paper()
                    RMSE       Rsquared   MAE
Cubist              0.0952903  0.6849514  0.0683080
Neural Network      0.1116412  0.5678863  0.0829216
Linear Regression   0.1276140  0.4344247  0.0997148

Based on RMSE, the best performing model is the Cubist model. Next, its most important variables are examined using varImp.

plot(varImp(cube_model), top=10)
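
The same importance scores can be pulled into a table (sketch):

#top 10 predictors by Cubist importance
imp <- varImp(cube_model)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)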

Predict PH Values

Next, using the Cubist model which was identified as the best performing model above, the PH values are predicted with the provided test data set.

Load and Review Test Data

First, the data is loaded and reviewed. Using the glimpse function, we note that the test data has a structure similar to the training data, with Brand.Code appearing as character type; it will need to be checked for empty values and converted to an unordered factor. The PH response variable (entirely NA in the test set) needs to be removed before running the model and is subset out of the features.

#load test data
test <- read.csv("TestData.csv")

#review 
glimpse(test)
## Rows: 267
## Columns: 33
## $ Brand.Code        <chr> "D", "A", "B", "B", "B", "B", "A", "B", "A", "D", "B…
## $ Carb.Volume       <dbl> 5.480000, 5.393333, 5.293333, 5.266667, 5.406667, 5.…
## $ Fill.Ounces       <dbl> 24.03333, 23.95333, 23.92000, 23.94000, 24.20000, 24…
## $ PC.Volume         <dbl> 0.2700000, 0.2266667, 0.3033333, 0.1860000, 0.160000…
## $ Carb.Pressure     <dbl> 65.4, 63.2, 66.4, 64.8, 69.4, 73.4, 65.2, 67.4, 66.8…
## $ Carb.Temp         <dbl> 134.6, 135.0, 140.4, 139.0, 142.2, 147.2, 134.6, 139…
## $ PSC               <dbl> 0.236, 0.042, 0.068, 0.004, 0.040, 0.078, 0.088, 0.0…
## $ PSC.Fill          <dbl> 0.40, 0.22, 0.10, 0.20, 0.30, 0.22, 0.14, 0.10, 0.48…
## $ PSC.CO2           <dbl> 0.04, 0.08, 0.02, 0.02, 0.06, NA, 0.00, 0.04, 0.04, …
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 116.6, 118.8, 120.2, 124.8, 115.0, 118.6, 117.6, 121…
## $ Fill.Pressure     <dbl> 46.0, 46.2, 45.8, 40.0, 51.4, 46.4, 46.2, 40.0, 43.8…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure3     <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure4     <int> 96, 112, 98, 132, 94, 94, 108, 108, 110, 106, 98, 96…
## $ Filler.Level      <dbl> 129.4, 120.0, 119.4, 120.2, 116.0, 120.4, 119.6, 131…
## $ Filler.Speed      <int> 3986, 4012, 4010, NA, 4018, 4010, 4010, NA, 4010, 10…
## $ Temperature       <dbl> 66.0, 65.6, 65.6, 74.4, 66.4, 66.6, 66.8, NA, 65.8, …
## $ Usage.cont        <dbl> 21.66, 17.60, 24.18, 18.12, 21.32, 18.00, 17.68, 12.…
## $ Carb.Flow         <int> 2950, 2916, 3056, 28, 3214, 3064, 3042, 1972, 2502, …
## $ Density           <dbl> 0.88, 1.50, 0.90, 0.74, 0.88, 0.84, 1.48, 1.60, 1.52…
## $ MFR               <dbl> 727.6, 735.8, 734.8, NA, 752.0, 732.0, 729.8, NA, 74…
## $ Balling           <dbl> 1.398, 2.942, 1.448, 1.056, 1.398, 1.298, 2.894, 3.3…
## $ Pressure.Vacuum   <dbl> -3.8, -4.4, -4.2, -4.0, -4.0, -3.8, -4.2, -4.4, -4.4…
## $ PH                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Oxygen.Filler     <dbl> 0.022, 0.030, 0.046, NA, 0.082, 0.064, 0.042, 0.096,…
## $ Bowl.Setpoint     <int> 130, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 45.2, 46.0, 46.0, 46.0, 50.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 147.2, 146.6, 146.4, 145.8, 146.0, 145.0, 146…
## $ Alch.Rel          <dbl> 6.56, 7.14, 6.52, 6.48, 6.50, 6.50, 7.18, 7.16, 7.14…
## $ Carb.Rel          <dbl> 5.34, 5.58, 5.34, 5.50, 5.38, 5.42, 5.46, 5.42, 5.44…
## $ Balling.Lvl       <dbl> 1.48, 3.04, 1.46, 1.48, 1.46, 1.44, 3.02, 3.00, 3.10…
#subset features from response
test_features <- test %>% 
  dplyr::select(-c(PH))

Using the summary function, the distributions and NA counts of the test data are reviewed.

summary(test)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd.Pressure1    Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure.Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##     PH          Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Mode:logical   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  NA's:267       1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00    
##                 Median :0.03370   Median :120.0   Median :46.00    
##                 Mean   :0.04666   Mean   :109.6   Mean   :47.73    
##                 3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00    
##                 Max.   :0.39800   Max.   :130.0   Max.   :52.00    
##                 NA's   :3         NA's   :1       NA's   :2        
##  Air.Pressurer      Alch.Rel        Carb.Rel     Balling.Lvl   
##  Min.   :141.2   Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :142.6   Median :6.580   Median :5.40   Median :1.480  
##  Mean   :142.8   Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:142.8   3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :147.2   Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :1       NA's   :3       NA's   :2

Next, for ease of visualization, Brand.Code is plotted to show the spread of the factor levels and to identify the empty values that will need to be updated. The distributions of the remaining features are not plotted, since tree-based models do not require normalized inputs in the pre-processing stage.

ggplot(test, aes(x=reorder(Brand.Code, Brand.Code, function(x)-length(x)))) +
geom_bar() +  labs(x='Brand.Code')+
labs(title= 'Brand.Code Distribution')+
   theme_minimal()

Below, the NAs in the test data set are visualized to show how many are present. In the test data, the MFR predictor is missing the most values (roughly 12%). The NA values in the test data will be imputed in the same fashion as in the training data, using the "knnImpute" method of the preProcess function from the caret library.

#NAs in Test data set 
#NA counts by column
#sapply(test, function(x) sum(is.na(x)))

VIM::aggr(test_features, numbers=T, sortVars=T, bars = FALSE, border= 'white',
          cex.axis = .6,
          ylab=c("Proportion of NAs", "Combinations"))

In order to score the test data with the Cubist model, it needs to be cleaned in the same fashion as the train data used to fit the model. To do so, the Brand.Code predictor needs to be converted to a factor, with an "Unknown" level added for the empty values in the variable.

#prep data- change Brand.Code to factor
#add name to empty string
test_features$Brand.Code[test_features$Brand.Code == ""] <- "Unknown"

#convert variable type to factor
test_features <- test_features %>% 
  dplyr::mutate(Brand.Code = factor(Brand.Code, 
                         levels = c('A','B','C','D', 'Unknown'), 
                         ordered = FALSE))

The NAs identified in the predictors are imputed using the preProcess function. Additionally, near-zero variance and highly correlated variables were removed from the train data by preProcess. To ensure the train and test data contain the same predictors, preProcess is run here with only the "knnImpute" method, and the variables eliminated from the train data are identified with setdiff and removed with dplyr's select function, respectively.

set.seed(624)
#impute NAs
preProc_test <- preProcess(test_features, method=c("knnImpute"))

#get the transformed features
test_features <- predict(preProc_test, test_features)

#identify the variables that need to be removed 
setdiff(colnames(test_features),colnames(train_rf_predictors))
## [1] "Hyd.Pressure1" "Hyd.Pressure3" "Filler.Level"  "Filler.Speed" 
## [5] "Density"       "Balling"       "Balling.Lvl"
#remove the variables identified above
test_features <- test_features %>% 
  dplyr::select(-c("Hyd.Pressure1","Hyd.Pressure3","Filler.Level",
                   "Filler.Speed","Density","Balling","Balling.Lvl" ))

Generate Predictions

Next, the predictions for PH are generated using the predict function along with the optimal model identified earlier as the Cubist model from the Tree Based models.

#generate predictions with the Cubist model and round to two decimals
predictions <- round(predict(cube_model, newdata = test_features), 2)

pH <- as.data.frame(predictions)
pH <- rename(pH, pH = predictions)
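
A quick sanity check of the predicted values before exporting (sketch):

#distribution of predicted pH values
summary(pH$pH)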

Generate Excel file

Finally, the predicted PH values are exported to an Excel-readable CSV file using write_excel_csv.

write_excel_csv(pH, "Predictions.csv")
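
If a native .xlsx workbook is preferred over a CSV, the writexl package (not loaded above; assumed to be installed) offers a drop-in alternative:

#assumes the writexl package is installed
writexl::write_xlsx(pH, "Predictions.xlsx")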