Overview

ABC Beverage has new regulations in place, and the leadership team requires the data science team to understand the manufacturing process and its predictive factors, and to report back with a predictive model of PH. The choice of model depends on several factors, such as model accuracy, data relevance, and cross-validation performance.

R packages

We will use R for data modeling. All packages used for data exploration, visualization, preparation, and modeling are loaded at the start of the analysis.

Data Exploration

We will first load the historical dataset, provided in Excel, and use it to analyze and eventually predict the PH of beverages.
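A setup chunk along the following lines would load the packages used throughout this report. The exact list is inferred from the functions called below, so treat it as a sketch rather than the original chunk.

# packages for data import, exploration and preparation (inferred list)
library(readxl)        # read_excel()
library(dplyr)         # glimpse(), select(), filter(), %>%
library(psych)         # describe()
library(DataExplorer)  # plot_histogram(), plot_missing()
library(corrplot)      # correlation plot
library(mice)          # imputation of missing values
library(caret)         # preprocessing, resampling and model training
library(DT)            # datatable()
# caret loads model-specific packages (pls, earth, kernlab, rpart, gbm,
# randomForest, Cubist, xgboost) on demand when train() is called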

# download training data from git repo
tempdata.file <- tempfile(fileext = ".xlsx")
download.file(url="https://github.com/jtul333/Data624/blob/main/StudentData.xlsx?raw=true", 
              destfile = tempdata.file, 
              mode = "wb", 
              quiet = TRUE)

## read excel for training data
traindata.df <- read_excel(tempdata.file, skip=0) 
traindata.df
## # A tibble: 2,571 × 33
##    `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 B                     5.34          24.0       0.263            68.2
##  2 A                     5.43          24.0       0.239            68.4
##  3 B                     5.29          24.1       0.263            70.8
##  4 A                     5.44          24.0       0.293            63  
##  5 A                     5.49          24.3       0.111            67.2
##  6 A                     5.38          23.9       0.269            66.6
##  7 A                     5.31          23.9       0.268            64.2
##  8 B                     5.32          24.2       0.221            67.6
##  9 B                     5.25          24.0       0.263            64.2
## 10 B                     5.27          24.0       0.231            72  
## # ℹ 2,561 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
# download testing data from git repo
download.file(url="https://github.com/jtul333/Data624/blob/main/StudentEvaluation.xlsx?raw=true", 
              destfile = tempdata.file, 
              mode = "wb", 
              quiet = TRUE)

# read excel for testing data
testdata.df <- read_excel(tempdata.file, skip=0)
testdata.df
## # A tibble: 267 × 33
##    `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 D                     5.48          24.0       0.27             65.4
##  2 A                     5.39          24.0       0.227            63.2
##  3 B                     5.29          23.9       0.303            66.4
##  4 B                     5.27          23.9       0.186            64.8
##  5 B                     5.41          24.2       0.16             69.4
##  6 B                     5.29          24.1       0.212            73.4
##  7 A                     5.48          23.9       0.243            65.2
##  8 B                     5.42          24.1       0.123            67.4
##  9 A                     5.41          23.9       0.333            66.8
## 10 D                     5.47          24.0       0.256            72.6
## # ℹ 257 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …

Data summary

There are 31 numeric predictor variables and one categorical predictor, Brand Code, which we convert to a factor; PH is the response. The training dataset has 2,571 observations.

# transform Brand.code to factor
traindata.df$`Brand Code` = as.factor(traindata.df$`Brand Code`)
testdata.df$`Brand Code` = as.factor(testdata.df$`Brand Code`)
glimpse(traindata.df)
## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, C,…
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
describe(traindata.df) %>% dplyr::select(-vars, -trimmed, -mad, -se)
##                      n    mean      sd  median     min     max   range  skew
## Brand Code*       2451    2.51    1.00    2.00    1.00    4.00    3.00  0.38
## Carb Volume       2561    5.37    0.11    5.35    5.04    5.70    0.66  0.39
## Fill Ounces       2533   23.97    0.09   23.97   23.63   24.32    0.69 -0.02
## PC Volume         2532    0.28    0.06    0.27    0.08    0.48    0.40  0.34
## Carb Pressure     2544   68.19    3.54   68.20   57.00   79.40   22.40  0.18
## Carb Temp         2545  141.09    4.04  140.80  128.60  154.00   25.40  0.25
## PSC               2538    0.08    0.05    0.08    0.00    0.27    0.27  0.85
## PSC Fill          2548    0.20    0.12    0.18    0.00    0.62    0.62  0.93
## PSC CO2           2532    0.06    0.04    0.04    0.00    0.24    0.24  1.73
## Mnf Flow          2569   24.57  119.48   65.20 -100.20  229.40  329.60  0.00
## Carb Pressure1    2539  122.59    4.74  123.20  105.60  140.20   34.60  0.05
## Fill Pressure     2549   47.92    3.18   46.40   34.60   60.40   25.80  0.55
## Hyd Pressure1     2560   12.44   12.43   11.40   -0.80   58.00   58.80  0.78
## Hyd Pressure2     2556   20.96   16.39   28.60    0.00   59.40   59.40 -0.30
## Hyd Pressure3     2556   20.46   15.98   27.60   -1.20   50.00   51.20 -0.32
## Hyd Pressure4     2541   96.29   13.12   96.00   52.00  142.00   90.00  0.55
## Filler Level      2551  109.25   15.70  118.40   55.80  161.20  105.40 -0.85
## Filler Speed      2514 3687.20  770.82 3982.00  998.00 4030.00 3032.00 -2.87
## Temperature       2557   65.97    1.38   65.60   63.60   76.20   12.60  2.39
## Usage cont        2566   20.99    2.98   21.79   12.08   25.90   13.82 -0.54
## Carb Flow         2569 2468.35 1073.70 3028.00   26.00 5104.00 5078.00 -0.99
## Density           2570    1.17    0.38    0.98    0.24    1.92    1.68  0.53
## MFR               2359  704.05   73.90  724.00   31.40  868.60  837.20 -5.09
## Balling           2570    2.20    0.93    1.65   -0.17    4.01    4.18  0.59
## Pressure Vacuum   2571   -5.22    0.57   -5.40   -6.60   -3.60    3.00  0.53
## PH                2567    8.55    0.17    8.54    7.88    9.36    1.48 -0.29
## Oxygen Filler     2559    0.05    0.05    0.03    0.00    0.40    0.40  2.66
## Bowl Setpoint     2569  109.33   15.30  120.00   70.00  140.00   70.00 -0.97
## Pressure Setpoint 2559   47.62    2.04   46.00   44.00   52.00    8.00  0.20
## Air Pressurer     2571  142.83    1.21  142.60  140.80  148.20    7.40  2.25
## Alch Rel          2562    6.90    0.51    6.56    5.28    8.62    3.34  0.88
## Carb Rel          2561    5.44    0.13    5.40    4.96    6.06    1.10  0.50
## Balling Lvl       2570    2.05    0.87    1.48    0.00    3.66    3.66  0.59
##                   kurtosis
## Brand Code*          -1.06
## Carb Volume          -0.47
## Fill Ounces           0.86
## PC Volume             0.67
## Carb Pressure        -0.01
## Carb Temp             0.24
## PSC                   0.65
## PSC Fill              0.77
## PSC CO2               3.73
## Mnf Flow             -1.87
## Carb Pressure1        0.14
## Fill Pressure         1.41
## Hyd Pressure1        -0.14
## Hyd Pressure2        -1.56
## Hyd Pressure3        -1.57
## Hyd Pressure4         0.63
## Filler Level          0.05
## Filler Speed          6.71
## Temperature          10.16
## Usage cont           -1.02
## Carb Flow            -0.58
## Density              -1.20
## MFR                  30.46
## Balling              -1.39
## Pressure Vacuum      -0.03
## PH                    0.06
## Oxygen Filler        11.09
## Bowl Setpoint        -0.06
## Pressure Setpoint    -1.60
## Air Pressurer         4.73
## Alch Rel             -0.85
## Carb Rel             -0.29
## Balling Lvl          -1.49

Based on the summary above, the dataset has missing values, so imputation will be needed. The predictors Oxygen Filler, MFR, Filler Speed and Temperature appear highly skewed and will require transformation. This can also be seen in the histogram plots below.

Variables Distribution

Below we show the distribution of the dataset variables. There are two sets of histograms: the red ones show the natural distribution and the green ones show the log10-transformed distribution.

plot_histogram(traindata.df, geom_histogram_args = list("fill" = "red"))

# log histograms
plot_histogram(traindata.df, scale_x = "log10", geom_histogram_args = list("fill" = "springgreen4"))

Missing Data

The summary and the following graphs show the missing data in the training dataset. The plot below shows that more than 8% of the values for the MFR variable are missing. The next feature with the most missing data is Filler Speed, with more than 2% missing. The missing data will be handled through imputation.

colSums(is.na(traindata.df))
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##               120                10                38                39 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                27                26                33                23 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                39                 2                32                22 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                11                15                15                30 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                20                57                14                 5 
##         Carb Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 4                12                 2 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                12                 0                 9                10 
##       Balling Lvl 
##                 1
plot_missing(traindata.df[-1])

Correlation

The plot below shows the correlation among the numeric variables in the dataset. We can see that a few variables are highly correlated. We will handle pairwise predictors with correlation above 0.90 in the data preparation section.

forcorr <- traindata.df[complete.cases(traindata.df),-1]
corrplot::corrplot(cor(forcorr), method = 'ellipse', type = 'lower')

Outliers

# boxplot 
par(mfrow = c(3,4))
for(i in colnames(traindata.df[-1])){
  boxplot(traindata.df[,i], xlab = names(traindata.df[i]),
          main = names(traindata.df[i]), col = "blue", horizontal = T)
}

set.seed(317)

Data Preparation

Handling Missing Data and Outliers

The first step in data preparation is handling missing data and outliers through imputation. We will use the mice package for this. MICE (Multivariate Imputation by Chained Equations) is one of the most commonly used packages for this task.

set.seed(317)

# Training set
traindata.df.clean <- mice(data.frame(traindata.df), method = 'rf', m=2, maxit = 2, print=FALSE)
traindata.df.clean <- complete(traindata.df.clean)

nzv_preds <- nearZeroVar(traindata.df.clean)
traindata.df.clean <- traindata.df.clean[,-nzv_preds]
set.seed(317)

# Testing set
testdata.df.clean <- mice(data.frame(testdata.df), method = 'rf', m=2, maxit = 2, print=FALSE)
testdata.df.clean <- complete(testdata.df.clean)

Create Dummy Variables

The variable Brand Code is a categorical variable with 4 classes (A, B, C, and D). For modeling, we need to convert it into a set of dummy variables.

set.seed(317)
dum.brandcode <- dummyVars(PH ~ Brand.Code, data = traindata.df.clean)
dum.traindata.predict <- predict(dum.brandcode, traindata.df.clean)
traindata.df.clean <- cbind(dum.traindata.predict, traindata.df.clean) %>% dplyr::select(-Brand.Code)
set.seed(317)
dum.brandcode <- dummyVars( ~ Brand.Code, data = testdata.df.clean)
dum.testdata.predict <- predict(dum.brandcode, testdata.df.clean)
testdata.df.clean <- cbind(dum.testdata.predict, testdata.df.clean) %>% dplyr::select(-Brand.Code)

Correlation

highCorr <- findCorrelation(cor(traindata.df.clean), 0.90)
traindata.df.clean <- traindata.df.clean[, -highCorr]
set.seed(317)

Preprocess using transformation

In this step we preprocess the data with caret's preProcess method. We first tried the Box-Cox transformation but observed that the resulting RMSE values were too high; the Yeo-Johnson transformation gave better results.
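For reference, the Box-Cox attempt mentioned above would look roughly like the sketch below (an assumption of how it was coded, not the original chunk); note that Box-Cox only transforms strictly positive columns, so variables containing zeros or negative values pass through unchanged.

# Box-Cox alternative that was tried first and later replaced by Yeo-Johnson
preproc_boxcox <- preProcess(traindata.df.clean, method = "BoxCox")
traindata.df.boxcox <- predict(preproc_boxcox, traindata.df.clean)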

preproc_traindatadf <- preProcess(traindata.df.clean, method = "YeoJohnson")
traindata.df.clean <- predict(preproc_traindatadf, traindata.df.clean)
set.seed(317)

preproc_testdatadf <- preProcess(testdata.df.clean, method = "YeoJohnson")
testdata.df.clean <- predict(preproc_testdatadf, testdata.df.clean)
set.seed(317)

Training and Test Partition

partition <- createDataPartition(traindata.df.clean$PH, p=0.75, list = FALSE)

# training/validation partition for independent variables
X.train <- traindata.df.clean[partition, ] %>% dplyr::select(-PH)
X.test <- traindata.df.clean[-partition, ] %>% dplyr::select(-PH)

# training/validation partition for dependent variable PH
y.train <- traindata.df.clean$PH[partition]
y.test <- traindata.df.clean$PH[-partition]
set.seed(317)

Build Models

Linear Regression

Simple Linear Regression

lm_model <- lm(y.train ~ ., data = X.train)
summary(lm_model)
## 
## Call:
## lm(formula = y.train ~ ., data = X.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50957 -0.07841  0.00947  0.09040  0.81196 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.205e+03  1.447e+03  -0.833 0.404954    
## Brand.Code.A      -8.450e-02  1.834e-02  -4.608 4.34e-06 ***
## Brand.Code.B      -1.290e-02  2.919e-02  -0.442 0.658625    
## Brand.Code.C      -1.492e-01  3.032e-02  -4.920 9.42e-07 ***
## Brand.Code.D              NA         NA      NA       NA    
## Carb.Volume       -6.475e-02  8.059e-02  -0.803 0.421813    
## Fill.Ounces       -9.433e-02  3.797e-02  -2.485 0.013059 *  
## PC.Volume         -1.560e-01  9.664e-02  -1.614 0.106703    
## Carb.Pressure     -2.254e-01  6.190e-01  -0.364 0.715764    
## Carb.Temp          2.156e+03  2.575e+03   0.837 0.402421    
## PSC               -4.284e-02  6.669e-02  -0.642 0.520717    
## PSC.Fill          -5.385e-02  5.781e-02  -0.931 0.351734    
## PSC.CO2           -1.433e-01  7.438e-02  -1.927 0.054149 .  
## Mnf.Flow          -6.465e-04  5.356e-05 -12.072  < 2e-16 ***
## Carb.Pressure1     2.028e-02  2.220e-03   9.137  < 2e-16 ***
## Fill.Pressure      2.312e+00  6.693e-01   3.454 0.000565 ***
## Hyd.Pressure2      7.629e-03  1.222e-03   6.241 5.35e-10 ***
## Hyd.Pressure4      1.294e-01  1.324e-01   0.977 0.328568    
## Filler.Speed      -5.883e-06  8.693e-06  -0.677 0.498663    
## Temperature       -1.123e-02  2.543e-03  -4.417 1.06e-05 ***
## Usage.cont        -7.943e-03  1.338e-03  -5.936 3.46e-09 ***
## Carb.Flow          1.137e-06  4.375e-07   2.599 0.009410 ** 
## MFR               -8.421e-06  5.033e-05  -0.167 0.867154    
## Pressure.Vacuum    1.108e-04  1.697e-04   0.653 0.513999    
## Oxygen.Filler     -2.646e-01  7.858e-02  -3.367 0.000776 ***
## Bowl.Setpoint      2.317e-03  3.068e-04   7.550 6.72e-14 ***
## Pressure.Setpoint -8.418e-03  2.205e-03  -3.817 0.000139 ***
## Air.Pressurer      1.394e-03  2.737e-03   0.509 0.610774    
## Alch.Rel           2.434e-02  2.351e-02   1.035 0.300692    
## Carb.Rel           9.222e-04  5.228e-02   0.018 0.985929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1349 on 1902 degrees of freedom
## Multiple R-squared:  0.4002, Adjusted R-squared:  0.3913 
## F-statistic: 45.32 on 28 and 1902 DF,  p-value: < 2.2e-16
set.seed(317)

We can see that the simple linear regression model explains only about 39% of the variability in the data (adjusted R² = 0.39). Next we will evaluate models that may offer better R², RMSE, and MAE.

Partial Least Squares

Partial least squares (PLS) is an alternative to ordinary least squares (OLS) regression. It reduces the predictors to a smaller set of uncorrelated components and then performs least squares regression on these components, instead of on the original data.

pls_model <- train(x=X.train,
                 y=y.train,
                 method="pls",
                 metric="Rsquared",
                 tuneLength=10, 
                 trControl=trainControl(method = "cv")
                 )

pls_model
## Partial Least Squares 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     0.1701854  0.03552919  0.1362411
##    2     0.1687926  0.04803549  0.1358485
##    3     0.1545949  0.20364304  0.1218054
##    4     0.1542707  0.20722486  0.1216203
##    5     0.1534367  0.21547764  0.1210742
##    6     0.1528542  0.22146186  0.1204187
##    7     0.1464045  0.28416669  0.1142322
##    8     0.1428342  0.31879189  0.1111580
##    9     0.1414288  0.33233305  0.1096653
##   10     0.1395500  0.34964559  0.1082889
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
pls_model$bestTune
##    ncomp
## 10    10
plot(pls_model)

pls_model$results %>% 
  filter(ncomp == pls_model$bestTune$ncomp) %>% 
  dplyr::select(ncomp,RMSE,Rsquared)
##   ncomp    RMSE  Rsquared
## 1    10 0.13955 0.3496456
data.frame(Rsquared=pls_model[["results"]][["Rsquared"]][as.numeric(rownames(pls_model$bestTune))],
           RMSE=pls_model[["results"]][["RMSE"]][as.numeric(rownames(pls_model$bestTune))])
##    Rsquared    RMSE
## 1 0.3496456 0.13955
set.seed(317)

The final value used for the model was ncomp = 10, which corresponds to the best tune. In this case R² is 0.35, so the model explains only about 35% of the variability in the data, although the RMSE is small.

Non Linear Regression

MARS

MARS creates a piecewise linear model, which provides an intuitive stepping stone into non-linearity after grasping the concept of multiple linear regression. MARS captures the nonlinear relationships in the data by assessing cutpoints (knots), similar to step functions.
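The h(·) terms that appear in the fitted model summary further below are hinge basis functions; the small sketch below (illustration only, not part of the original analysis) shows what a single hinge computes.

# a MARS hinge function: h(x - c) = max(0, x - c)
# mirrored pairs of hinges around a knot c give the piecewise-linear segments
hinge <- function(x, knot) pmax(0, x - knot)
hinge(c(1, 2, 3, 4), knot = 2)   # returns 0 0 1 2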

marsGrid <- expand.grid(.degree=1:2, .nprune=2:30)
mars_model <- train(x=X.train, 
                    y=y.train,
                    method = "earth",
                    tuneGrid = marsGrid,
                    trControl = trainControl(method = "cv"))

# final parameters
mars_model$bestTune
##    nprune degree
## 58     30      2
# plot RMSE
plot(mars_model)

summary(mars_model$finalModel)
## Call: earth(x=data.frame[1931,29], y=c(8.36,8.26,8.9...), keepxy=TRUE,
##             degree=2, nprune=30)
## 
##                                                     coefficients
## (Intercept)                                            8.3922373
## Brand.Code.C                                          -0.0811976
## h(0.199351-Mnf.Flow)                                   0.0022027
## h(Mnf.Flow-0.199351)                                   0.0013793
## h(68.4-Temperature)                                    0.0205847
## h(7.16-Alch.Rel)                                       0.0945742
## h(Alch.Rel-7.16)                                       0.1232198
## Brand.Code.C * h(4.618-Hyd.Pressure2)                 -0.0341154
## h(0.018-PSC) * h(Temperature-68.4)                    16.7097112
## h(0.199351-Mnf.Flow) * h(58.5522-Carb.Pressure1)      -0.0001367
## h(Mnf.Flow-0.199351) * h(Filler.Speed-3894)            0.0000068
## h(Mnf.Flow-0.199351) * h(Usage.cont-20.92)            -0.0002783
## h(Mnf.Flow-0.199351) * h(20.92-Usage.cont)            -0.0000895
## h(0.199351-Mnf.Flow) * h(Pressure.Vacuum- -78.7763)   -0.0000178
## h(0.199351-Mnf.Flow) * h(-78.7763-Pressure.Vacuum)    -0.0000257
## h(0.199351-Mnf.Flow) * h(Air.Pressurer-143.8)         -0.0003202
## h(Mnf.Flow-0.199351) * h(146.4-Air.Pressurer)         -0.0003487
## h(Mnf.Flow-0.199351) * h(Alch.Rel-7.16)                0.0007837
## h(Filler.Speed-3906) * h(7.16-Alch.Rel)               -0.0009949
## h(Bowl.Setpoint-90) * h(7.16-Alch.Rel)                 0.0050108
## h(7.16-Alch.Rel) * h(Carb.Rel-5.28)                   -0.4726061
## h(7.16-Alch.Rel) * h(5.28-Carb.Rel)                   -1.4327431
## 
## Selected 22 of 32 terms, and 13 of 29 predictors (nprune=30)
## Termination condition: RSq changed by less than 0.001 at 32 terms
## Importance: Mnf.Flow, Brand.Code.C, Air.Pressurer, Bowl.Setpoint, Alch.Rel, ...
## Number of terms at each degree of interaction: 1 6 15
## GCV 0.01543752    RSS 28.18092    GRSq 0.4840773    RSq 0.5117639
data.frame(Rsquared=mars_model[["results"]][["Rsquared"]][as.numeric(rownames(mars_model$bestTune))],
           RMSE=mars_model[["results"]][["RMSE"]][as.numeric(rownames(mars_model$bestTune))])
##   Rsquared      RMSE
## 1 0.446055 0.1290755
set.seed(317)

RMSE was used to select the optimal model using the smallest value. The final values used for the model were nprune = 30 and degree = 2, which corresponds to the best tune. In this case R² is about 0.45, so the model explains only about 45% of the variability in the data.

Support Vector Machines

The support vector machine algorithm finds a hyperplane in an N-dimensional space (N being the number of features). In classification, the hyperplane acts as a decision boundary between classes; in regression, as used here, the model fits a function that ignores small errors within a margin, so the fit is determined only by the points outside that margin, the support vectors.

svm_model <- train(x=X.train, 
                   y=y.train,
                   method = "svmRadial",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

svm_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1274659  0.4666720  0.09652837
##     0.50  0.1249260  0.4838413  0.09402050
##     1.00  0.1226166  0.5009064  0.09190804
##     2.00  0.1212213  0.5114286  0.09074903
##     4.00  0.1211507  0.5126727  0.09054655
##     8.00  0.1221653  0.5079426  0.09089925
##    16.00  0.1243393  0.4980601  0.09230454
##    32.00  0.1288197  0.4778671  0.09646228
##    64.00  0.1343930  0.4542547  0.10127143
##   128.00  0.1415743  0.4263579  0.10665300
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0227007
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0227007 and C = 4.
summary(svm_model$finalModel)
## Length  Class   Mode 
##      1   ksvm     S4
# plot RMSE
plot(svm_model)

data.frame(Rsquared=svm_model[["results"]][["Rsquared"]][as.numeric(rownames(svm_model$bestTune))],
           RMSE=svm_model[["results"]][["RMSE"]][as.numeric(rownames(svm_model$bestTune))])
##    Rsquared      RMSE
## 1 0.5126727 0.1211507
set.seed(317)

RMSE was used to select the optimal model using the smallest value. We see an improvement here: R² is 0.51, so this model explains about 51% of the variability in the data, and its RMSE (0.12) is lower than that of the previous models.

Trees

Single Tree

Regression trees partition a data set into smaller groups and then fit a simple model for each subgroup. To make the groups homogeneous in the outcome, regression trees determine the predictor to split on and the value of the split, the depth or complexity of the tree, and the prediction equation in the terminal nodes.

set.seed(317)

st_model <- train(x=X.train,
                  y=y.train,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

st_model
## CART 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.01341725  0.1326879  0.4119192  0.1036643
##   0.01353745  0.1327662  0.4111283  0.1038517
##   0.01575474  0.1345745  0.3944922  0.1061004
##   0.01759488  0.1384887  0.3603367  0.1090065
##   0.01784934  0.1388527  0.3570006  0.1094215
##   0.01853256  0.1395969  0.3499675  0.1101374
##   0.02764051  0.1415558  0.3321555  0.1123226
##   0.04201649  0.1450045  0.2998396  0.1154871
##   0.06409967  0.1498520  0.2512457  0.1177793
##   0.21638976  0.1628036  0.1888247  0.1291839
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01341725.
st_model$bestTune
##           cp
## 1 0.01341725
# plot RMSE
plot(st_model)

data.frame(Rsquared=st_model[["results"]][["Rsquared"]][as.numeric(rownames(st_model$bestTune))],
           RMSE=st_model[["results"]][["RMSE"]][as.numeric(rownames(st_model$bestTune))])
##    Rsquared      RMSE
## 1 0.4119192 0.1326879
set.seed(317)

RMSE was used to select the optimal model using the smallest value. The final value used for the model was cp = 0.01341725. R² is 0.41, so this model explains about 41% of the variability in the data, which is low compared to the best value so far, and its RMSE (0.13) is also higher than the SVM model's.
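To inspect the structure of the selected tree, one could plot the final rpart fit; this is an optional snippet, not part of the original analysis.

# plot the pruned regression tree selected by caret (cp = 0.01341725)
plot(st_model$finalModel, uniform = TRUE, margin = 0.1)
text(st_model$finalModel, cex = 0.6)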

Boosted Tree

Boosting algorithms are influenced by learning theory. A boosting algorithm seeks to improve predictive power by training a sequence of weak models, each of which compensates for the weaknesses of its predecessors. The trees in boosting are dependent on past trees, have minimal depth, and do not contribute equally to the final model. Boosting requires us to specify a weak model (e.g. regression, shallow decision trees) and then iteratively improves it.

gbmGrid <- expand.grid(interaction.depth = c(5,10), 
                       n.trees = seq(100, 1000, by = 100), 
                       shrinkage = 0.1,
                       n.minobsinnode = c(5,10))

gbm_model <- train(x=X.train,
                   y=y.train,
                   method = "gbm",
                   tuneGrid = gbmGrid, 
                   trControl = trainControl(method = "cv"),
                   verbose = FALSE)

gbm_model
## Stochastic Gradient Boosting 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  n.trees  RMSE       Rsquared   MAE       
##    5                  5               100     0.1181207  0.5356186  0.09095306
##    5                  5               200     0.1146912  0.5600683  0.08813336
##    5                  5               300     0.1131651  0.5714510  0.08659026
##    5                  5               400     0.1125090  0.5763827  0.08571083
##    5                  5               500     0.1121328  0.5792529  0.08529782
##    5                  5               600     0.1121029  0.5797223  0.08544281
##    5                  5               700     0.1120534  0.5803108  0.08514009
##    5                  5               800     0.1115883  0.5837666  0.08493526
##    5                  5               900     0.1114065  0.5851395  0.08465719
##    5                  5              1000     0.1114236  0.5853826  0.08468017
##    5                 10               100     0.1174006  0.5420473  0.09025112
##    5                 10               200     0.1153192  0.5553027  0.08833686
##    5                 10               300     0.1143151  0.5628791  0.08721418
##    5                 10               400     0.1140410  0.5656746  0.08661415
##    5                 10               500     0.1135934  0.5691245  0.08607892
##    5                 10               600     0.1133626  0.5714354  0.08588050
##    5                 10               700     0.1131615  0.5731913  0.08566205
##    5                 10               800     0.1130978  0.5739798  0.08561127
##    5                 10               900     0.1130722  0.5744855  0.08550557
##    5                 10              1000     0.1129666  0.5755219  0.08538704
##   10                  5               100     0.1129898  0.5734856  0.08624244
##   10                  5               200     0.1110087  0.5879971  0.08472156
##   10                  5               300     0.1104638  0.5921622  0.08380622
##   10                  5               400     0.1105544  0.5917007  0.08367144
##   10                  5               500     0.1105270  0.5923236  0.08342931
##   10                  5               600     0.1103082  0.5939664  0.08305669
##   10                  5               700     0.1102276  0.5945979  0.08296273
##   10                  5               800     0.1101736  0.5950147  0.08286298
##   10                  5               900     0.1101137  0.5954582  0.08275191
##   10                  5              1000     0.1100380  0.5960311  0.08270462
##   10                 10               100     0.1131666  0.5729713  0.08553418
##   10                 10               200     0.1120475  0.5812974  0.08396878
##   10                 10               300     0.1118802  0.5830740  0.08354693
##   10                 10               400     0.1114955  0.5860991  0.08326673
##   10                 10               500     0.1114530  0.5868694  0.08332931
##   10                 10               600     0.1116515  0.5856516  0.08331197
##   10                 10               700     0.1115272  0.5865016  0.08317056
##   10                 10               800     0.1113815  0.5877413  0.08303976
##   10                 10               900     0.1113274  0.5880811  0.08299693
##   10                 10              1000     0.1113835  0.5878071  0.08304368
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth
##  = 10, shrinkage = 0.1 and n.minobsinnode = 5.
gbm_model$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 30    1000                10       0.1              5
plot(gbm_model)

data.frame(Rsquared=gbm_model[["results"]][["Rsquared"]][as.numeric(rownames(gbm_model$bestTune))],
           RMSE=gbm_model[["results"]][["RMSE"]][as.numeric(rownames(gbm_model$bestTune))])
##    Rsquared      RMSE
## 1 0.5739798 0.1130978
set.seed(317)

The tuning parameter shrinkage was held constant at a value of 0.1, and RMSE was used to select the optimal model using the smallest value. The final values used for the model were n.trees = 1000, interaction.depth = 10, shrinkage = 0.1 and n.minobsinnode = 5. For this tune the cross-validated R² is about 0.60 and the RMSE of 0.11 is low compared to the earlier models.

Random Forest

Random forest consists of a large number of individual decision trees that work as an ensemble. Each model in the ensemble is used to generate a prediction for a new sample and these predictions are then averaged to give the forest’s prediction.

rf_model <- train(x=X.train,
                  y=y.train,
                  method = "rf",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

rf_model
## Random Forest 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1152559  0.6033049  0.08847318
##    5    0.1068602  0.6457838  0.08018538
##    8    0.1036800  0.6606325  0.07710210
##   11    0.1028908  0.6614996  0.07600682
##   14    0.1017431  0.6664324  0.07461153
##   17    0.1014897  0.6661951  0.07420499
##   20    0.1014257  0.6647875  0.07381055
##   23    0.1012847  0.6646976  0.07349939
##   26    0.1015562  0.6610503  0.07342562
##   29    0.1016050  0.6597682  0.07332262
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 23.
rf_model$bestTune
##   mtry
## 8   23
plot(rf_model)

data.frame(Rsquared=rf_model[["results"]][["Rsquared"]][as.numeric(rownames(rf_model$bestTune))],
           RMSE=rf_model[["results"]][["RMSE"]][as.numeric(rownames(rf_model$bestTune))])
##    Rsquared      RMSE
## 1 0.6646976 0.1012847
set.seed(317)

RMSE was used to select the optimal model using the smallest value. The final value used for the model was mtry = 23. It has an R² of 0.66 and an RMSE of 0.10; both values are the best among the models tried so far.

Let's look at the informative variables found by the Random Forest model. We will use the varImp method to find these variables.

varImp(rf_model)
## rf variable importance
## 
##   only 20 most important variables shown (out of 29)
## 
##                 Overall
## Mnf.Flow        100.000
## Brand.Code.C     31.195
## Usage.cont       30.628
## Oxygen.Filler    29.873
## Alch.Rel         24.805
## Pressure.Vacuum  20.134
## Air.Pressurer    18.806
## Carb.Rel         18.408
## Temperature      17.875
## Carb.Pressure1   16.174
## Carb.Flow        13.392
## Filler.Speed     12.524
## Bowl.Setpoint    12.377
## PC.Volume         9.613
## MFR               7.949
## Carb.Volume       6.734
## Hyd.Pressure2     6.226
## Fill.Ounces       6.203
## PSC               6.085
## Fill.Pressure     5.779
plot(varImp(rf_model), top=10, main="Random Forest")

set.seed(317)

From the above plot, it is evident that Mnf.Flow is the most informative variable for the PH response.

Cubist

Cubist is a rule-based model. A tree is built in which the terminal leaves contain linear regression models; these models are based on the predictors used in previous splits, along with intermediate models. The tree is reduced to a set of rules, which initially are the paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.

cubist_model <- train(x=X.train,
                      y=y.train,
                      method = "cubist",
                      tuneLength = 10,
                      trControl = trainControl(method = "cv"))

cubist_model
## Cubist 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1739, 1737, 1738, 1738, 1737, 1738, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE       
##    1          0          0.1330532  0.4532330  0.09274212
##    1          5          0.1282295  0.5055688  0.08748197
##    1          9          0.1270212  0.5058917  0.08700673
##   10          0          0.1106746  0.5937901  0.08041293
##   10          5          0.1064742  0.6248150  0.07565165
##   10          9          0.1056872  0.6282083  0.07548381
##   20          0          0.1094413  0.6058863  0.08008476
##   20          5          0.1044339  0.6368167  0.07485124
##   20          9          0.1038807  0.6395624  0.07480966
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 9.
cubist_model$bestTune
##   committees neighbors
## 9         20         9
plot(cubist_model)

data.frame(Rsquared=cubist_model[["results"]][["Rsquared"]][as.numeric(rownames(cubist_model$bestTune))],
           RMSE=cubist_model[["results"]][["RMSE"]][as.numeric(rownames(cubist_model$bestTune))])
##    Rsquared      RMSE
## 1 0.6395624 0.1038807
set.seed(123) 

RMSE was used to select the optimal model using the smallest value. The best tune for the Cubist model, which resulted in the smallest root mean squared error, used 20 committees and 9 neighbors. It had RMSE = 0.10 and R² = 0.64, so it explains about 64% of the variability in the data, second only to the Random Forest model so far, with a low RMSE.
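To see the actual rules and the per-rule linear models behind these numbers, one could print the final Cubist fit (an optional step; the lengthy output is omitted here).

# inspect the rule set and per-rule linear models of the tuned Cubist model
summary(cubist_model$finalModel)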

XGB Tree

set.seed(123) 

xgb_trcontrol <- trainControl(
  method = "cv",
  number = 5,  
  allowParallel = TRUE,
  verboseIter = FALSE,
  returnData = FALSE
)
# Setting up a grid search for the best parameters
xgbGrid <- expand.grid(nrounds = c(100,200),  # number of boosting iterations
                       max_depth = c(10, 15, 20, 25),
                       colsample_bytree = seq(0.5, 0.9, length.out = 5),
                       eta = 0.1,
                       gamma=0,
                       min_child_weight = 1,
                       subsample = 1
                      )
xgb_model <- train(x=X.train,
                  y=y.train,
                  method = "xgbTree",
                  trControl = xgb_trcontrol,
                  tuneGrid = xgbGrid)
## [19:21:30] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
xgb_model$bestTune
##   nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 2     200        10 0.1     0              0.5                1         1
plot(xgb_model)

data.frame(Rsquared=xgb_model[["results"]][["Rsquared"]][as.numeric(rownames(xgb_model$bestTune))],
           RMSE=xgb_model[["results"]][["RMSE"]][as.numeric(rownames(xgb_model$bestTune))])
##    Rsquared      RMSE
## 1 0.5978237 0.1098375
set.seed(317)

RMSE was used to select the optimal model using the smallest value. The XGB tree had RMSE = 0.11 and R² = 0.60, so it explains about 60% of the variability in the data, still below the Random Forest and Cubist models.

Select Model

To select the best model for making predictions on the evaluation data, we will look at 3 metrics.

R² shows the variance explained by a given model. RMSE (Root Mean Squared Error) is the standard deviation of the residuals. MAE (Mean Absolute Error) is the average of the absolute errors. Here we summarize the resampling results to compare these 3 values across all the models, and then check predictions on the validation data we reserved earlier during data partitioning.
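For reference, hand-rolled versions of the three metrics are sketched below; they are equivalent to what caret's postResample() reports, where R² is computed as the squared correlation between observed and predicted values.

# RMSE: square root of the mean squared residual
rmse_metric <- function(obs, pred) sqrt(mean((obs - pred)^2))
# MAE: mean of the absolute residuals
mae_metric <- function(obs, pred) mean(abs(obs - pred))
# R-squared as caret reports it: squared correlation between obs and pred
rsq_metric <- function(obs, pred) cor(obs, pred)^2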

summary(resamples(list(PLS=pls_model, MARS=mars_model, SVM=svm_model, RandFrst=rf_model,  Cubist=cubist_model, SingleTree=st_model,Boosting=gbm_model)))
## 
## Call:
## summary.resamples(object = resamples(list(PLS = pls_model, MARS =
##  mars_model, SVM = svm_model, RandFrst = rf_model, Cubist =
##  cubist_model, SingleTree = st_model, Boosting = gbm_model)))
## 
## Models: PLS, MARS, SVM, RandFrst, Cubist, SingleTree, Boosting 
## Number of resamples: 10 
## 
## MAE 
##                  Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## PLS        0.10104581 0.10392759 0.10777466 0.10828888 0.11181179 0.11991258
## MARS       0.08799430 0.09360964 0.09685786 0.09812694 0.09926750 0.11382982
## SVM        0.08319923 0.08732445 0.09030278 0.09054655 0.09295711 0.10222076
## RandFrst   0.06476638 0.07059229 0.07256585 0.07349939 0.07587763 0.08269445
## Cubist     0.06769922 0.07064207 0.07373361 0.07480966 0.07854536 0.08367675
## SingleTree 0.09413354 0.09830918 0.10368079 0.10366435 0.10831672 0.11383325
## Boosting   0.07227100 0.08088005 0.08130622 0.08270462 0.08493678 0.09662541
##            NA's
## PLS           0
## MARS          0
## SVM           0
## RandFrst      0
## Cubist        0
## SingleTree    0
## Boosting      0
## 
## RMSE 
##                  Min.    1st Qu.     Median      Mean   3rd Qu.      Max. NA's
## PLS        0.12640646 0.13535583 0.13876080 0.1395500 0.1434692 0.1561543    0
## MARS       0.11325827 0.12542012 0.12871849 0.1290755 0.1323502 0.1519357    0
## SVM        0.11292796 0.11492604 0.11897544 0.1211507 0.1214452 0.1406849    0
## RandFrst   0.08698197 0.09546896 0.09901298 0.1012847 0.1067193 0.1170894    0
## Cubist     0.09095591 0.09569312 0.10344363 0.1038807 0.1086207 0.1197946    0
## SingleTree 0.12192700 0.12810045 0.13212779 0.1326879 0.1379240 0.1443405    0
## Boosting   0.09367776 0.10438225 0.10604521 0.1100380 0.1135478 0.1320624    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS        0.2609018 0.3346554 0.3535611 0.3496456 0.3841001 0.4229308    0
## MARS       0.3061129 0.4308917 0.4602051 0.4460550 0.4755359 0.5356680    0
## SVM        0.4099302 0.4966205 0.5265570 0.5126727 0.5470755 0.5944796    0
## RandFrst   0.5860873 0.6307120 0.6714160 0.6646976 0.7046975 0.7383091    0
## Cubist     0.5383955 0.6133924 0.6536651 0.6395624 0.6674408 0.7158958    0
## SingleTree 0.3269190 0.3806521 0.4123231 0.4119192 0.4427852 0.4840696    0
## Boosting   0.4676527 0.5856929 0.6201142 0.5960311 0.6385496 0.6479620    0
bwplot(resamples(list(PLS=pls_model, MARS=mars_model, SVM=svm_model, RandFrst=rf_model,  Cubist=cubist_model, SingleTree=st_model, Boosting=gbm_model)), main = "Models Comparison")

set.seed(317)
pls_pred <- predict(pls_model, newdata = X.test)
mars_pred <- predict(mars_model, newdata = X.test)
svm_pred <- predict(svm_model, newdata = X.test)
rf_pred <- predict(rf_model, newdata = X.test)
cubist_pred <- predict(cubist_model, newdata = X.test)
st_pred<- predict(st_model, newdata = X.test)
gbm_pred <- predict(gbm_model, newdata = X.test)

data.frame(rbind(PLS=postResample(pred=pls_pred,obs = y.test),
                 MARS=postResample(pred=mars_pred,obs = y.test),
                 SVM=postResample(pred=svm_pred,obs = y.test),
                 SingleTree=postResample(pred=st_pred,obs = y.test),
                 RandFrst=postResample(pred=rf_pred,obs = y.test),
                 Boosting=postResample(pred=gbm_pred,obs = y.test),
                 Cubist=postResample(pred=cubist_pred,obs = y.test)))
##                 RMSE  Rsquared        MAE
## PLS        0.1387713 0.3494500 0.10833786
## MARS       0.1249069 0.4742658 0.09553506
## SVM        0.1219557 0.5038570 0.09025843
## SingleTree 0.1321144 0.4111946 0.10318986
## RandFrst   0.1010167 0.6613835 0.07325009
## Boosting   0.1095096 0.5990087 0.08215116
## Cubist     0.1008364 0.6566128 0.07226294
set.seed(317)

We can see that Random Forest and Cubist clearly outperform the other models on the validation set across the 3 metrics R², RMSE and MAE; Random Forest has the highest R², with RMSE and MAE essentially tied with Cubist, so we select Random Forest.

Prediction

Based on the analysis so far, the Random Forest model is the optimal choice. We will now use it to predict the PH values for the evaluation dataset and then write them to a CSV file. Here are the predicted PH values for the evaluation dataset.

# remove PH from evaluation data
testdata.df.clean <- testdata.df.clean %>% dplyr::select(-PH)

# predict final PH values
testdata.df.clean$PH <- predict(rf_model, newdata = testdata.df.clean)
# PH predictions
testdata.df.clean$PH %>% tibble::enframe(name = NULL) %>% datatable()
plot_histogram(testdata.df.clean$PH)

write.csv(testdata.df.clean$PH, "StudentEvaluations_PHPredictions.csv")

Conclusion

After extracting the data from the given files, we first performed data exploration, which helped us find missing data, correlations among the variables, and outliers. Next we performed data preparation: handling missing values and outliers through mice imputation, creating dummy variables for the categorical variable Brand Code, removing highly correlated variables, transforming the data toward normality, and finally partitioning the data 75%/25% into training and validation sets. We then trained various linear regression, non-linear regression, and tree-based models. We found Random Forest to be the optimal model for predicting the PH values of the evaluation data.

We notice that all the predicted values are greater than 8, which means the beverages produced are alkaline. At the start of this study we did not know what kind of beverage manufacturer ABC Beverage was, but from this study we can conclude that the company mainly produces alkaline beverages such as alkaline water, teas, and fruit drinks.