Introduction

As most of us know, ensuring our beverages are produced at the correct potential for hydrogen (pH) level is an essential driver to our business. This pH score, the measure of acidity and alkalinity in our liquids, must be within a narrow, critical range to ensure long-term sales.

The objective of this project is to consider a number of measures and data points involved in the production of our beverages and build a model to use those factors to be able to predict the pH level of the beverage.

Note that a factor being highly predictive of an outcome does not necessarily mean that the factor caused the outcome. For example, beverages that end up with higher than desired pH levels may also show high bowl setpoints. This does not mean that high bowl setpoints cause excessive pH values. They might both be similarly affected by an unknown cause, or large pH values may cause the high bowl setpoints. Our goal is simply to build a model that predicts pH levels. This process may lead to insights about possible causes of pH level problems, but that outcome is certainly not a given.

To construct this model, we had to first understand our data. We have more than 2800 production records of beverages that feature 33 data points, including Brand Code, Fill Ounces, PSC CO2, Temperature, etc. A full list of data elements will be conveyed in the next section. Some of the production records do not have values for all data elements, which is an obstacle we will had to overcome in building our model using established statistical principles. We also had to review

Data scientists use different methods to build models. No one approach gives the optimal approach for all data sets - and their underlying processes. To predict pH, we started by using linear regression, which constructs equations that look similar to you might have experienced in algebra at school. Next, we saw how effective nonlinear regression approaches such as neural networks and support vector machines (SVMs) are in predicting pH. Finally, we attempted to build models that use regression trees and rule-based models.

Don’t be overwhelmed by mathematical terminology. We - or, rather our computers - are just using different methods to take the 33 different data elements and trying to use them to figure out their relationship to pH levels in our beverages.

Approach

  1. Exploratory Data Analysis
  2. Data Processing
  3. Linear Regression
  4. Nonlinear Regression
  5. Regression Trees and Rule-Based Models
  6. Conclusion

Exploratory Data Analysis

Data Loading

bev_model <- readxl::read_excel('StudentData_TO_MODEL.xlsx',col_names = TRUE, sheet = 'Subset')
bev_score <- readxl::read_excel('StudentEvaluation_TO_PREDICT.xlsx',col_names = TRUE, sheet = 'Subset (2)')

For this predictive model project, we’ve been provided two data sets - a modeling data set we will use to train and test the predictive models and an evaluation data set which will be used to predict unknown pH values and be scored to assess our model performance.

str(bev_model)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

We have 33 data elements for 2571 records that will be used for training. We expect some missing or NULL values - here we see them reflected in the “NA” values in the “Hyd Pressure 2” measurements.

Here, we verify that our scoring data contains the same elements.

dim(bev_score)
## [1] 267  33

As mentioned above, we will generate the model on using the training data set that covers 2571 rows and then evaluate its accuracy - and other metrics - using the 267-record test data set.

Data Exploration

Next, we explore our data using summary statistics.

describe(bev_model)

Brand Code is a string (non-numeric) field, so its lack of a mean makes sense. We see a wide range of values. Some may be on different scales. Despite being mostly numeric fields, some of these might use temperatures in Celsius, whereas others like Carb[onation] Pressure use pounds per square inch. In order to build the most effective predictive model, we may apply data transformations to standardize these numeric data elements.

plot_histogram(bev_model, ggtheme=theme_light())

Note that the scales of the x-axes vary for each element. Nonetheless, we see data elements with normal distributions and many with likely outliers. Some histograms reveal bimodal or even trimodal distributions.

Here, we take a closer look at the pH values present in our training data.

ggplot(bev_model, aes(PH)) + geom_histogram(bins=20) + theme_classic()

A slight left skew is evident, as is a right potential outlier.

Finally, a count of missing (NA) values by variable is conducted.

plot_missing(bev_model, title = "Beverage Training Data: % Missing Values by Data Element")

MFR is missing for more than eight percent of our training records, and Brand Code is NA for almost five percent of them. Those will need to be handled in our next section.

Perhaps the most conceptually difficult aspect of data exploration in predictive modeling is checking for correlation. If two different variables reliably occur together, they can negatively affect our model. They tend to change in unison, and it becomes very difficult for the model to estimate the relationship between the two (or more) correlated independent variables and the dependent variable. If this sounds tricky - don’t worry. In short, if we wanted to predict if it rained, we probably wouldn’t want to include both 1) is the road wet and 2) whether or not drivers were using windshield wipers. One would give us the information contained within the others.

The correlations between variables in our training dataset are below.

cor_bev_model <- cor(bev_model[,-1], use = "na.or.complete")
corrplot(cor_bev_model, order = 'hclust', type = 'lower')

With so many dimensions (variables) in the training data, seeing individual correlations between variables is difficult here. Suffice to say, we have high correlation between certain variables that would negatively affect our predictive model if we use certain approaches.

Data Processing

Before we get to modeling, we will do a little data processing - also known as cleansing. First, we will remove the handful of records (0.16%, as we saw before) from the training set that have missing pH values. This will remove four records. Next, we will label the small number of records that have missing Brand categories as “U” for unknown.

#remove NA pH records
bev_model <- bev_model[complete.cases(bev_model[,26]),]
#update NA Brand Code records to "U"
bev_model$`Brand Code`[is.na(bev_model$`Brand Code`)] <- 'U'

Finally we will impute missing values for the training records that remain. There are a number of advanced methods used by data scientists to impute or “fill in” missing/NA values. Based on past experience and the fact that we don’t have “big data,” we will use k-neaest neighbor (KNN) imputation, which uses a distance function to essentially predict values for missing data elements based on other records that are similar to in it for the elements that are present. Simply, if everyone on your street has two cars, we’ll guess that you do two.

bev_model_i <- as.data.frame(bev_model[, !names(bev_model) %in% c("Brand Code", "PH")])
#https://www.r-bloggers.com/missing-value-treatment/
bev_model_impute <- knnImputation(bev_model_i, k = 10)
bev_model_impute$pH <- bev_model$PH
bev_model_impute$'Brand Code' <- bev_model$`Brand Code`

Before getting to modelling, we will split our modelling set into train and test sets.

set.seed(3456)
trainIndex <- createDataPartition(bev_model_impute$pH, p = .8, 
                                  list = FALSE, 
                                  times = 1)
bev_model_train <- bev_model_impute[ trainIndex,]
bev_model_test  <- bev_model_impute[-trainIndex,]

As mentioned earlier, data processing frequently includes such steps as performing a train-test split, transforming variables so that they’re all on normalized scales, and removing variables that have correlation with others - or have no correlation with the dependent variables, pH in this case. As different modeling approaches handle these data issues in disparate ways, we will handle those data transformations and processing steps in the indivdual sections of model construction.

Linear Regression - Rajwant

In this section, we will try to fit multiple Linear Regression MOdel and its cousins. We will especially try building : + Simple Linear Regression Model + Ridge Regression Model + Lasso Regression Model + Elastic Net Regression Model

We will be doing 10 fold cross-validation,we will repeat it 5 times as per our train control parameter.

Linear Model

metric = 'RMSE'
head(bev_model_train )
# Train control

customTrainControl <- trainControl(method = "repeatedcv", 
                                   number = 10 , 
                                   repeats = 5 ,
                                   verboseIter = F)
#Linear Model
lm <- train(pH ~ .,
            bev_model_train,
            method= 'lm',
            trControl = customTrainControl
          )

lm$results  
lm  # 2055 , 32 predictors,
## Linear Regression 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1850, 1849, 1849, 1850, 1850, 1849, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1346963  0.3955688  0.1039546
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53237 -0.07853  0.00937  0.08774  0.79469 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.053e+01  1.089e+00   9.672  < 2e-16 ***
## `\\`Carb Volume\\``       -1.506e-01  7.709e-02  -1.954 0.050819 .  
## `\\`Fill Ounces\\``       -5.098e-02  3.603e-02  -1.415 0.157233    
## `\\`PC Volume\\``         -9.977e-02  6.047e-02  -1.650 0.099131 .  
## `\\`Carb Pressure\\``      5.075e-04  3.255e-03   0.156 0.876135    
## `\\`Carb Temp\\``          6.018e-05  2.599e-03   0.023 0.981532    
## PSC                       -1.120e-01  6.422e-02  -1.744 0.081336 .  
## `\\`PSC Fill\\``          -3.082e-02  2.611e-02  -1.180 0.237980    
## `\\`PSC CO2\\``           -9.741e-02  7.131e-02  -1.366 0.172137    
## `\\`Mnf Flow\\``          -7.103e-04  5.134e-05 -13.835  < 2e-16 ***
## `\\`Carb Pressure1\\``     6.668e-03  7.832e-04   8.513  < 2e-16 ***
## `\\`Fill Pressure\\``      2.830e-03  1.369e-03   2.066 0.038917 *  
## `\\`Hyd Pressure1\\``     -2.141e-04  4.141e-04  -0.517 0.605219    
## `\\`Hyd Pressure2\\``     -9.818e-04  5.955e-04  -1.649 0.099356 .  
## `\\`Hyd Pressure3\\``      3.551e-03  6.526e-04   5.442 5.90e-08 ***
## `\\`Hyd Pressure4\\``     -2.685e-04  3.492e-04  -0.769 0.442079    
## `\\`Filler Level\\``      -1.231e-03  6.589e-04  -1.868 0.061913 .  
## `\\`Filler Speed\\``      -2.010e-06  8.249e-06  -0.244 0.807489    
## Temperature               -1.126e-02  2.508e-03  -4.489 7.57e-06 ***
## `\\`Usage cont\\``        -7.349e-03  1.273e-03  -5.774 8.93e-09 ***
## `\\`Carb Flow\\``          1.255e-05  4.049e-06   3.100 0.001961 ** 
## Density                   -1.518e-01  3.071e-02  -4.942 8.36e-07 ***
## MFR                        1.810e-05  6.578e-05   0.275 0.783251    
## Balling                   -9.348e-02  2.714e-02  -3.445 0.000583 ***
## `\\`Pressure Vacuum\\``   -2.147e-02  8.585e-03  -2.501 0.012451 *  
## `\\`Oxygen Filler\\``     -3.139e-01  7.823e-02  -4.012 6.23e-05 ***
## `\\`Bowl Setpoint\\``      3.408e-03  6.794e-04   5.016 5.75e-07 ***
## `\\`Pressure Setpoint\\`` -9.917e-03  2.207e-03  -4.492 7.44e-06 ***
## `\\`Air Pressurer\\``     -3.425e-03  2.627e-03  -1.304 0.192517    
## `\\`Alch Rel\\``           6.605e-02  2.267e-02   2.913 0.003618 ** 
## `\\`Carb Rel\\``           3.009e-02  5.340e-02   0.564 0.573145    
## `\\`Balling Lvl\\``        1.444e-01  2.674e-02   5.400 7.47e-08 ***
## `\\`Brand Code\\`B`        7.936e-02  2.507e-02   3.165 0.001572 ** 
## `\\`Brand Code\\`C`       -7.190e-02  2.496e-02  -2.881 0.004006 ** 
## `\\`Brand Code\\`D`        6.603e-02  1.692e-02   3.903 9.83e-05 ***
## `\\`Brand Code\\`U`        6.165e-03  2.773e-02   0.222 0.824100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.133 on 2019 degrees of freedom
## Multiple R-squared:  0.418,  Adjusted R-squared:  0.4079 
## F-statistic: 41.42 on 35 and 2019 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm$finalModel)

We see the method ‘lm’ can explain 39 % of the data of the training set. As per the Linear Model, we see that we have 32 predictors, collected from 2055 samples. QQ plot suggests it has little variations in the beginning of the data. The residual plot shows some condensed around the zero mean lines.

Ridge Regression

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value.

With the tunning parameter, alfa= 0, and lambda is between 0 to 1, we can say that is a ridge model. This model is not sensitive to the variables that has multicollinearity.

 # lambda = 0.00621
ridge2 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            tuneGrid = expand.grid(alpha= 0,
                                   lambda= seq(0.1,0.001,length= 20)),
            trControl = customTrainControl)

# lambda = 0.00716 
ridge <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            tuneGrid = expand.grid(alpha= 0,
                                   lambda= seq(0.001,0.01,length= 20)),
            trControl = customTrainControl)
par(mfrow=c(1,2))
plot(ridge, main="Ridge Model 1 ") # PLot suggest the 

plot(ridge2, main="Ridge Model 2 ")

When we had lambda sequnce between 0 to 1, it was close to 0.01 or less than that, we reduced the lambda to see its impact better in model 1. We can see the Ridge model start increase the RMSE with lambda = 0.007.

ridge  # SHows the best model my rmse and alfa as zero as we are ridge refression 
## glmnet 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1850, 1850, 1849, 1850, 1848, 1849, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.001000000  0.1349738  0.3920773  0.1053958
##   0.001473684  0.1349738  0.3920773  0.1053958
##   0.001947368  0.1349738  0.3920773  0.1053958
##   0.002421053  0.1349738  0.3920773  0.1053958
##   0.002894737  0.1349738  0.3920773  0.1053958
##   0.003368421  0.1349738  0.3920773  0.1053958
##   0.003842105  0.1349738  0.3920773  0.1053958
##   0.004315789  0.1349738  0.3920773  0.1053958
##   0.004789474  0.1349738  0.3920773  0.1053958
##   0.005263158  0.1349738  0.3920773  0.1053958
##   0.005736842  0.1349738  0.3920773  0.1053958
##   0.006210526  0.1349738  0.3920773  0.1053958
##   0.006684211  0.1349738  0.3920773  0.1053958
##   0.007157895  0.1349738  0.3920773  0.1053958
##   0.007631579  0.1349831  0.3920143  0.1054110
##   0.008105263  0.1350291  0.3916464  0.1054790
##   0.008578947  0.1350758  0.3912749  0.1055466
##   0.009052632  0.1351229  0.3909032  0.1056128
##   0.009526316  0.1351691  0.3905403  0.1056762
##   0.010000000  0.1352163  0.3901712  0.1057387
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.007157895.
plot(ridge$finalModel, xvar= "lambda",label= T)

When Log Lamda is big all coefficients are close to zero, but when we relax lambda,coefficients increase.As you can see we have all 35 variables used in the model coefficients even when all the coefficients are close to zero. So increasing the lamda is not helping us get away with multicollinearity attributes or any other nonsignificant variables.

plot(ridge$finalModel,xvar= 'dev', label=T)

This plot very clearly says that this model only explains little more than 40% of the total data. In that too until 30%, we see a slight trend in the coefficients but after that, they have moved too much in different directions, and coefficients has increased a lot for variable 25. These variables may not be significant for the model.

Let’s see how the other imp. variables stand in this mdoel.

plot(varImp(ridge,scale = T))

plot(varImp(ridge,scale = F))

Lasso Regression

#Lasso Regression
# IT slects feature and also drops varaibles  that has multicolinerlty 

set.seed(3456)
customTrainControl <- trainControl(method = "repeatedcv", 
                                   number = 10 , 
                                   repeats = 5 ,
                                   verboseIter = F)
lasso1 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            metric = metric,
            tuneGrid = expand.grid(alpha= 1,
                                   lambda= seq(0.0001,0.001,length= 20)),
            trControl = customTrainControl)

lasso2 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            metric = metric,
            tuneGrid = expand.grid(alpha= 1,
                                   lambda= seq(0.001,0.0001,length= 20)),
            trControl = customTrainControl)
lasso3 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            metric = metric,
            tuneGrid = expand.grid(alpha= 1,
                                   lambda= seq(0.001,0.00005,length= 20)),
            trControl = customTrainControl)

lasso4 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            metric = metric,
            tuneGrid = expand.grid(alpha= 1,
                                   lambda= seq(0.00001,0.00003,length= 20)),
            trControl = customTrainControl)

plot(lasso1)

plot(lasso2)

plot(lasso3) # PLot suggest the 

plot(lasso4)

lasso1$bestTune
lasso4$bestTune
lasso<- lasso4
lasso  # SHows the best model my rmse and alfa as zero as we are ridge refression 
## glmnet 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1850, 1849, 1850, 1851, 1849, 1850, ... 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE       Rsquared   MAE      
##   1.000000e-05  0.1346485  0.3966658  0.1040182
##   1.105263e-05  0.1346485  0.3966658  0.1040182
##   1.210526e-05  0.1346485  0.3966658  0.1040182
##   1.315789e-05  0.1346485  0.3966658  0.1040182
##   1.421053e-05  0.1346485  0.3966658  0.1040182
##   1.526316e-05  0.1346485  0.3966658  0.1040182
##   1.631579e-05  0.1346485  0.3966658  0.1040182
##   1.736842e-05  0.1346485  0.3966658  0.1040182
##   1.842105e-05  0.1346485  0.3966658  0.1040182
##   1.947368e-05  0.1346485  0.3966658  0.1040182
##   2.052632e-05  0.1346485  0.3966658  0.1040182
##   2.157895e-05  0.1346485  0.3966658  0.1040182
##   2.263158e-05  0.1346485  0.3966658  0.1040182
##   2.368421e-05  0.1346485  0.3966658  0.1040182
##   2.473684e-05  0.1346485  0.3966658  0.1040182
##   2.578947e-05  0.1346482  0.3966681  0.1040182
##   2.684211e-05  0.1346475  0.3966727  0.1040182
##   2.789474e-05  0.1346468  0.3966779  0.1040183
##   2.894737e-05  0.1346459  0.3966832  0.1040186
##   3.000000e-05  0.1346451  0.3966884  0.1040190
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 3e-05.

With alpha = 1 and lambda = 3e-05 , we selected lasso model, after multiple model tuning .

par(mfrow=c(2,2))
plot(ridge$finalModel,xvar= 'lambda', label=T,main="Ridge Model Lambda")
plot(ridge$finalModel,xvar= 'dev', label=T,main="Ridge Model Deviance")
plot(lasso$finalModel, xvar= "lambda",label= T,main="Lasso Model Lambda")
plot(lasso$finalModel,xvar= 'dev', label=T,main="Lasso Model Deviance")

We can see that we have 25 variables when log lambda is -6, and it increases fast to log lambda go further down. Similarly, fraction deviance shows both models can explain 40% of the data but the number of coefficients increases at the end but max it goes to 27 variables. With 9 variables we can explain 30% of the data with Lasso model.

Imp variables from this model:

plot(varImp(lasso,scale = T))

plot(varImp(lasso,scale = F))

Elastic Net Model

Let’s apply the Elastic Net model where we can tune both the parameter i.e. Alpha and Lambda. So our model can give us a hybrid model and best performance.

en1 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
           
            tuneGrid = expand.grid(alpha= seq(0.001,0.01,length=10),
                                   lambda= seq(0.00001,0.00003,length= 10)),
            trControl = customTrainControl)




en2 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            preProcess = c("center", "scale"),
            tuneGrid = expand.grid(alpha= seq(0.001,0.01,length=10),
                                   lambda= seq(0.00001,0.00003,length= 10)),
            trControl = customTrainControl)



en3 <- train(pH ~ .,
            bev_model_train,
            method= 'glmnet',
            metric = metric,
            tuneGrid = expand.grid(alpha= seq(0,0.1,length=10),
                                   lambda= seq(0.00001,0.00003,length= 10)),
            trControl = customTrainControl)
en <- en1

par(mfrow=c(2,2))
plot(en1)

plot(en2)

plot(en3) # PLot suggest the 

en1$bestTune
en2$bestTune
en3$bestTune
# As we can see the best tune is close to 0.00003 for lambda and alpha is 0.009 , We choose to see how these models look.

The above plot suggests how the mixing parameter is changing for each regularization parameter and its impact on RMSE. We note from the last plot that alpha of 0.009 and lambda = 3e-05 is the best-tunned parameter of the elastic net model.

par(mfrow=c(2,2))
plot(ridge$finalModel,xvar= 'dev', label=T,main="Ridge Deviance")
plot(ridge$finalModel, xvar= "lambda",label= T,main="Ridge  Lambda")
plot(en1$finalModel,xvar= 'dev', label=T,main="Elastic Net Deviance")
plot(en1$finalModel, xvar= "lambda",label= T,main="Elastic Net Lambda")

For elastic net model when log lambda =2 coefficients is close to zero, when it’s close to -2 it is having 29 variables and coefficients starts increasing rapidly. after log lambda crosses -4. Variable 25 which would have less impact on the model as it’s increasing very fast after log lambda is equal to -2.3.

We don’t’ see many variations in the rsquired, this model seems to explain little more than 40% of the data, which is the same trend we have seen with the Lasso and Ridge model.

Let’s see impo variables:

plot(varImp(en1,scale = T))

plot(varImp(en1,scale = F))

model_list <- list(Linearmodel = lm, Ridge = ridge , Lasso = lasso4, 
                   ElasticNet= en1,
                   ElasticNet2 = en2, 
                   ElasticNet3= en3 )
res <- resamples(model_list)
summary(res)
## 
## Call:
## summary.resamples(object = res)
## 
## Models: Linearmodel, Ridge, Lasso, ElasticNet, ElasticNet2, ElasticNet3 
## Number of resamples: 50 
## 
## MAE 
##                   Min.    1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Linearmodel 0.09299080 0.10052097 0.1035598 0.1039546 0.1069291 0.1171766    0
## Ridge       0.09342707 0.10194429 0.1063105 0.1053958 0.1084192 0.1145358    0
## Lasso       0.09064172 0.10082617 0.1039980 0.1040190 0.1073169 0.1154078    0
## ElasticNet  0.08981577 0.10067957 0.1052122 0.1044438 0.1081815 0.1156158    0
## ElasticNet2 0.09026291 0.09944086 0.1051377 0.1042396 0.1076899 0.1257865    0
## ElasticNet3 0.09195069 0.10013296 0.1040161 0.1041918 0.1075936 0.1169733    0
## 
## RMSE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Linearmodel 0.1207467 0.1301755 0.1344744 0.1346963 0.1382281 0.1517793    0
## Ridge       0.1208264 0.1305683 0.1349234 0.1349738 0.1392830 0.1501176    0
## Lasso       0.1190617 0.1297322 0.1343992 0.1346451 0.1387024 0.1503834    0
## ElasticNet  0.1132916 0.1289169 0.1343682 0.1346282 0.1407235 0.1510983    0
## ElasticNet2 0.1163942 0.1290582 0.1331081 0.1343408 0.1395460 0.1657001    0
## ElasticNet3 0.1164842 0.1285108 0.1346027 0.1344450 0.1394467 0.1488947    0
## 
## Rsquared 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Linearmodel 0.2677024 0.3754875 0.3977370 0.3955688 0.4330148 0.5020104    0
## Ridge       0.3079485 0.3607750 0.3882505 0.3920773 0.4196622 0.4867108    0
## Lasso       0.2907377 0.3667909 0.4007262 0.3966884 0.4269408 0.4883761    0
## ElasticNet  0.2944670 0.3660446 0.3876575 0.3955251 0.4238631 0.5401986    0
## ElasticNet2 0.2434117 0.3688284 0.3878286 0.3980138 0.4330470 0.5138498    0
## ElasticNet3 0.2899435 0.3696444 0.3988657 0.3981281 0.4200143 0.5217361    0

We can very clearly see that R-squared is better with close to 39.75 with the en1 model, but its Mean RMSE is not the winner but we would choose the en1 model due to its better Mean R-squared value.

Best model

en1$bestTune #$ 0.009 can be said to be more of close to zero , and its more of ridgw modwl.
finalLMe<- en1$finalModel
coef(finalLMe,s= en1$bestTune$lambda)
## 36 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)          1.040346e+01
## `Carb Volume`       -1.528416e-01
## `Fill Ounces`       -4.815667e-02
## `PC Volume`         -9.694543e-02
## `Carb Pressure`      3.917142e-04
## `Carb Temp`          1.209669e-04
## PSC                 -1.123043e-01
## `PSC Fill`          -3.008626e-02
## `PSC CO2`           -9.702900e-02
## `Mnf Flow`          -6.849464e-04
## `Carb Pressure1`     6.536669e-03
## `Fill Pressure`      3.018242e-03
## `Hyd Pressure1`     -2.093803e-04
## `Hyd Pressure2`     -6.817635e-04
## `Hyd Pressure3`      3.076954e-03
## `Hyd Pressure4`     -1.979513e-04
## `Filler Level`      -9.067648e-04
## `Filler Speed`      -2.501868e-06
## Temperature         -1.095627e-02
## `Usage cont`        -7.420418e-03
## `Carb Flow`          1.255874e-05
## Density             -1.499491e-01
## MFR                 -2.923339e-07
## Balling             -4.430879e-02
## `Pressure Vacuum`   -1.217569e-02
## `Oxygen Filler`     -2.930130e-01
## `Bowl Setpoint`      3.046156e-03
## `Pressure Setpoint` -1.003592e-02
## `Air Pressurer`     -3.034933e-03
## `Alch Rel`           5.869294e-02
## `Carb Rel`           5.555941e-02
## `Balling Lvl`        8.795306e-02
## `Brand Code`B        6.594803e-02
## `Brand Code`C       -8.280183e-02
## `Brand Code`D        6.340303e-02
## `Brand Code`U       -9.299641e-03
bwplot(res)

xyplot(res,metric = "RMSE")

There is an equal number of datapoint above and below the line which suggests that RMSE for LM and Ridge model is almost the same.

Top 5 Variables

getRank <- function(trainObjects,n=5){
  cn = 0
  temp <- c()
  methods <- c()
  for(object in trainObjects){
    
        methods <- c(methods, object$method)
        varimp <- varImp(object)[[1]]
        varimp$variables <- row.names(varimp)
        rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
        temp <- cbind(temp[1:5], rank[1:5])
  }
  temp <- as.data.frame(temp)
  names(temp) <- methods
  temp$Rank <- c(1:dim(temp)[1])
  temp <- select(temp, Rank, everything())
  return(temp)
}

getRank(list(lm))
getRank(list(lasso))
getRank(list(ridge))
getRank(list(en1))
plot(varImp(lm, scale=T), main='Variable Importance based on Linear Model')

Test from the training set

# Train data Test set 
X_test <- bev_model_test[,-32] # Dropped PH
Y_test <- bev_model_test[,32] # Only PH

test_model <- function(modelName,predData){
options(warn=-1)      #turn off warnings
predicted_result <- predict(modelName, predData)
options(warn=1)  

#We can collect the observed and predicted values into a data frame, then use
# the caret function defaultSummary to estimate the test set performance
DT_model_lm_pred <- data.frame(obs=Y_test,pred=predicted_result)
res_sum <- defaultSummary(DT_model_lm_pred)
mape_score <- MLmetrics::MAPE(predicted_result,Y_test)
return(cbind(res_sum,mape_score))
}
# 
# kable(list(test_model(lm,X_test),
#       test_model(ridge,X_test),
#       test_model(lasso,X_test),
#       test_model(en,X_test)
#       ))



data.frame("LM Model"= defaultSummary(data.frame(obs=Y_test,pred=predict(lm, X_test))),"MAPE" =  MLmetrics::MAPE(predict(lm, X_test),Y_test))
data.frame("Ridge Model"= defaultSummary(data.frame(obs=Y_test,pred=predict(ridge, X_test))),"MAPE" =  MLmetrics::MAPE(predict(ridge, X_test),Y_test))
data.frame("Lasso Model"= defaultSummary(data.frame(obs=Y_test,pred=predict(lasso, X_test))),"MAPE" =  MLmetrics::MAPE(predict(lasso, X_test),Y_test))
data.frame("Elastic Net"= defaultSummary(data.frame(obs=Y_test,pred=predict(en, X_test))),"MAPE" =  MLmetrics::MAPE(predict(en, X_test),Y_test))

Elastic net model seems to be having a better MAPE score of 0.0116. Let’s use the en model to check the test set.

Test with Given Test set

# Score data 
X_Stest <- bev_score[,-26] # Dropped PH
Y_Stest <- bev_score[,26] # Only PH

en_lm_result<- predict(en,X_Stest)

# Plots to show that how 
par(mfrow=c(2,2))
plot(Y_test,main='Plot of Ph from Train data')
plot(en_lm_result,main='Plot Ph from Predicted data')
boxplot(Y_test,label="sad",main='Boxplot of Ph from Train data')
boxplot(as.data.frame(en_lm_result)[,c(1)],main='Boxplot of Predicted Ph test data')

# Write to file 
# xlsx::write.xlsx(as.data.frame(en_lm_result)[,c(1)], file = "Project2_lm.xlsx",  col.name = T, row.names = T, append = T)

Nonlinear Regression - Jimmy

Regression Trees and Rule-Based Models - Samriti

Conclusion - Alain

References

https://readxl.tidyverse.org/

https://www.rdocumentation.org/packages/DataExplorer/versions/0.8.1/topics/plot_missing

https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features#:~:text=The%20stronger%20the%20correlation%2C%20the,tend%20to%20change%20in%20unison

https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17

https://www.r-bloggers.com/missing-value-treatment/

https://stackoverflow.com/questions/51548255/caret-there-were-missing-values-in-resampled-performance-measures < Why we get NA for predict R squared for lasso regression.