Predictive Analytics - Homework #7

Oluwakemi Omotunde

2019-04-16

7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \[ y = 10\sin(\pi x_1x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the x values are random variables uniformly distributed on [0, 1]. The package mlbench contains a function called mlbench.friedman1 that simulates these data:

Tune several models on these data. For example:
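Below is a minimal sketch of how these data might be simulated and a KNN model tuned with caret; the object names (knnModel, trainingData, testData) are illustrative.

```r
library(mlbench)
library(caret)

set.seed(200)
# simulate 200 training and 5000 test observations from the Friedman (1991) benchmark
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# tune k over 10 values, centering and scaling the predictors
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

# test-set performance of the chosen model
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
```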

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.617103  0.4815238  2.932357
##    7  3.469391  0.5268823  2.827466
##    9  3.404273  0.5542161  2.764704
##   11  3.367277  0.5745575  2.726791
##   13  3.313918  0.6022923  2.681996
##   15  3.310264  0.6142757  2.687478
##   17  3.308316  0.6266659  2.686031
##   19  3.306431  0.6392056  2.690283
##   21  3.317481  0.6435421  2.700244
##   23  3.323616  0.6521553  2.708896
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.
##      RMSE  Rsquared       MAE 
## 3.2286834 0.6871735 2.5939727

The RMSE value we got is 3.23, \(R^2\) is 0.69, and MAE is 2.59. Next we will tune some MARS models and compare.
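A sketch of the basic earth() fit that produces the output below; the call matches the one echoed later in the summary, and the object name marsFit is illustrative.

```r
library(earth)

# fit a MARS model with the default (additive) settings
marsFit <- earth(x = trainingData$x, y = trainingData$y)
marsFit
```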

## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

We see that 12 of 18 terms and 6 of 10 predictors were selected. The importance ranking is X1, X4, X2, X5, X3, X6, while X7 through X10 were unused. Let’s take a look at the summary to get more extensive output.

## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

With summary(), we can see the intercept and coefficients for this model, along with the hinge functions for each of the selected predictors. For example, for X1 the hinge function is h(0.622 - X1). Now, let’s tune the model using external resampling with the train function.
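A sketch of that resampling step, assuming a tuning grid inferred from the output below (degrees 1-2, nprune from 2 to 14); the object name marsTuned is illustrative.

```r
# candidate values for the degree of interaction and the number of retained terms
marsGrid <- expand.grid(degree = 1:2, nprune = seq(2, 14, by = 2))

marsTuned <- train(x = trainingData$x, y = trainingData$y,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneGrid = marsGrid)   # default resampling: 25 bootstrap reps
marsTuned

# test-set performance and variable importance for the tuned model
marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
varImp(marsTuned)
```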

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.461735  0.2126864  3.697085
##   1        4      2.839751  0.6758087  2.283042
##   1        6      2.372350  0.7734818  1.883979
##   1        8      1.866813  0.8598841  1.454385
##   1       10      1.819863  0.8665983  1.399155
##   1       12      1.823581  0.8663811  1.397191
##   1       14      1.841520  0.8636975  1.410815
##   2        2      4.461735  0.2126864  3.697085
##   2        4      2.886416  0.6657159  2.323079
##   2        6      2.347399  0.7772242  1.839180
##   2        8      1.933750  0.8493698  1.504360
##   2       10      1.646725  0.8893854  1.279194
##   2       12      1.534422  0.9047396  1.197307
##   2       14      1.508731  0.9083735  1.174498
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    84.98
## X2    68.87
## X5    48.55
## X3    38.96
## X8     0.00
## X6     0.00
## X7     0.00
## X10    0.00
## X9     0.00

We were able to tune the MARS model and check variable importance. The tuned model ranks X1 as most important, followed by X4, X2, X5, and X3. On the test set, the RMSE is 1.28, \(R^2\) is 0.93, and MAE is 1.01. This RMSE is much better than the 3.23 we got with KNN. I will try an SVM model next.
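A sketch of the radial-basis SVM fit, assuming 10-fold cross-validation and 14 candidate cost values as suggested by the output below; the object name svmRTuned is illustrative.

```r
svmRTuned <- train(x = trainingData$x, y = trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,    # cost values from 0.25 up to 2048
                   trControl = trainControl(method = "cv"))
svmRTuned
```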

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.488560  0.8070561  1.986379
##      0.50  2.248461  0.8168043  1.796747
##      1.00  2.061844  0.8397372  1.644829
##      2.00  1.941198  0.8547226  1.525840
##      4.00  1.885180  0.8632244  1.491907
##      8.00  1.873598  0.8642778  1.492104
##     16.00  1.879064  0.8632513  1.498301
##     32.00  1.879064  0.8632513  1.498301
##     64.00  1.879064  0.8632513  1.498301
##    128.00  1.879064  0.8632513  1.498301
##    256.00  1.879064  0.8632513  1.498301
##    512.00  1.879064  0.8632513  1.498301
##   1024.00  1.879064  0.8632513  1.498301
##   2048.00  1.879064  0.8632513  1.498301
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07022076
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07022076 and C = 8.

Let’s take a look at the finalModel, which contains the model created by the ksvm function.
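Assuming the svmRTuned object from the sketch above, the underlying kernlab fit can be inspected directly:

```r
# the ksvm object produced by the winning sigma/C combination
svmRTuned$finalModel
```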

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 8 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0702207565619088 
## 
## Number of Support Vectors : 155 
## 
## Objective Function Value : -63.1839 
## Training error : 0.008775

The model uses 155 training set data points as support vectors. We will now make predictions on the test set, then look at variable importance.
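A sketch of the test-set predictions and the model-free, loess-based importance that caret reports for SVM models, again assuming the svmRTuned object from above:

```r
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)

# with no model-specific importance for ksvm, varImp falls back to a loess R^2 filter
varImp(svmRTuned)
```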

##      RMSE  Rsquared       MAE 
## 2.0856664 0.8239632 1.5849823

The RMSE is 2.09, \(R^2\) is 0.82, and MAE is 1.58. This RMSE is better than KNN's but not as good as the MARS model's.

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Variable importance is ranked X4, X1, X2, X5, X3, X9, X10, X8, X6, X7.

Overall, the MARS model performed best, with the lowest test RMSE. The MARS model did, in fact, select variables X1-X5 as the informative predictors.

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputations, data splitting, and pre-processing steps as before and train several non-linear regression models.

I first split the data into a data frame of predictor variables and a vector for the response (Yield), then into training and test sets. Previously I used the mice package to handle imputation, but I have realized there is a quicker way to take care of all of my preprocessing: the preProcess function from the caret package allows me to impute, center, scale, and apply a Box-Cox transformation in one step. I also removed near-zero-variance and highly correlated predictors, as sketched below. Now that the preprocessing and splitting are complete, we will fit KNN, neural network, MARS, and SVM models.
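A sketch of this preprocessing, splitting, and the KNN fit, assuming the ChemicalManufacturingProcess data from the AppliedPredictiveModeling package; the imputation method, correlation cutoff, split proportion, and object names are illustrative choices.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
yield      <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]

# impute, Box-Cox transform, center, and scale in a single preProcess call
pp <- preProcess(predictors, method = c("knnImpute", "BoxCox", "center", "scale"))
predictors <- predict(pp, predictors)

# drop near-zero-variance and highly correlated predictors
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) predictors <- predictors[, -nzv]
highCorr <- findCorrelation(cor(predictors), cutoff = 0.90)
if (length(highCorr) > 0) predictors <- predictors[, -highCorr]

# split into training and test sets
set.seed(100)
inTrain <- createDataPartition(yield, p = 0.75, list = FALSE)
trainX <- predictors[inTrain, ];  trainY <- yield[inTrain]
testX  <- predictors[-inTrain, ]; testY  <- yield[-inTrain]

# KNN on the already-preprocessed training data (so no preProc argument here)
knnChem <- train(x = trainX, y = trainY, method = "knn", tuneLength = 10)
knnChem
```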

## k-Nearest Neighbors 
## 
## 132 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.471155  0.3970451  1.185740
##    7  1.473831  0.3952187  1.188562
##    9  1.466505  0.3984768  1.180142
##   11  1.468380  0.3982967  1.185868
##   13  1.468671  0.3995290  1.186408
##   15  1.458793  0.4119094  1.179862
##   17  1.448414  0.4230017  1.168782
##   19  1.448308  0.4284784  1.166523
##   21  1.452348  0.4301752  1.172608
##   23  1.452411  0.4374085  1.170903
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.

From the output, the best RMSE for KNN occurs at k = 19. Next we check test-set performance and variable importance.
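A sketch of that evaluation, assuming the knnChem object and the testX/testY split from the sketch above:

```r
knnChemPred <- predict(knnChem, newdata = testX)
postResample(pred = knnChemPred, obs = testY)

# caret's default filter-based (loess R^2) importance, since KNN has no model-specific measure
varImp(knnChem)
```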

##      RMSE  Rsquared       MAE 
## 1.3398772 0.3645254 1.0810766
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 47)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess13   96.37
## ManufacturingProcess32   88.50
## ManufacturingProcess09   79.09
## BiologicalMaterial06     75.17
## ManufacturingProcess36   70.24
## BiologicalMaterial03     69.76
## ManufacturingProcess06   64.01
## ManufacturingProcess11   60.76
## BiologicalMaterial11     53.96
## BiologicalMaterial08     50.38
## BiologicalMaterial04     50.15
## ManufacturingProcess30   47.47
## ManufacturingProcess33   47.33
## ManufacturingProcess12   39.60
## BiologicalMaterial01     39.21
## BiologicalMaterial09     35.53
## BiologicalMaterial10     27.33
## ManufacturingProcess15   26.57
## ManufacturingProcess26   23.82

On the test set, the RMSE is 1.34, \(R^2\) is 0.36, and MAE is 1.08. Next we fit a basic neural network model, since our data has already been centered and scaled.
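A minimal sketch of a single-hidden-layer neural network via caret; the tuning grid over size and decay is an illustrative assumption, as the actual settings are not shown in the output.

```r
nnetGrid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1))

nnetChem <- train(x = trainX, y = trainY,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  linout = TRUE,    # linear output unit for regression
                  trace = FALSE,
                  maxit = 500)

nnetChemPred <- predict(nnetChem, newdata = testX)
postResample(pred = nnetChemPred, obs = testY)

# filter-based (loess R^2) importance, matching the output shown
varImp(nnetChem, useModel = FALSE)
```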

##      RMSE  Rsquared       MAE 
## 2.4705065 0.1318926 2.0825607
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 47)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess13   96.37
## ManufacturingProcess32   88.50
## ManufacturingProcess09   79.09
## BiologicalMaterial06     75.17
## ManufacturingProcess36   70.24
## BiologicalMaterial03     69.76
## ManufacturingProcess06   64.01
## ManufacturingProcess11   60.76
## BiologicalMaterial11     53.96
## BiologicalMaterial08     50.38
## BiologicalMaterial04     50.15
## ManufacturingProcess30   47.47
## ManufacturingProcess33   47.33
## ManufacturingProcess12   39.60
## BiologicalMaterial01     39.21
## BiologicalMaterial09     35.53
## BiologicalMaterial10     27.33
## ManufacturingProcess15   26.57
## ManufacturingProcess26   23.82

The metrics for the neural network model were worse than for the KNN model, with an RMSE of 2.47, \(R^2\) of 0.13, and an MAE of 2.08. Let’s take a look at our MARS model, then SVM.
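A sketch of how these remaining two models might be fit, with illustrative tuning grids and object names:

```r
# MARS: tune the degree of interaction and the number of retained terms
marsChem <- train(x = trainX, y = trainY,
                  method = "earth",
                  tuneGrid = expand.grid(degree = 1:2, nprune = 2:20))

# radial-basis SVM: tune the cost parameter over 14 values
svmChem <- train(x = trainX, y = trainY,
                 method = "svmRadial",
                 tuneLength = 14)

# compare test-set performance
postResample(predict(marsChem, testX), testY)
postResample(predict(svmChem, testX), testY)
```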