Homework 8: Predictive Analytics

Salma Elshahawy

2021-04-25

GitHub repo | Portfolio | Blog

Problem 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the \(x\) values are random variables uniformly distributed on \([0,1]\) (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)  # featurePlot() below comes from caret
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'.  Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

KNN Model

library(caret)
knnModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel
#> k-Nearest Neighbors 
#> 
#> 200 samples
#>  10 predictor
#> 
#> Pre-processing: centered (10), scaled (10) 
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   RMSE      Rsquared   MAE     
#>    5  3.466085  0.5121775  2.816838
#>    7  3.349428  0.5452823  2.727410
#>    9  3.264276  0.5785990  2.660026
#>   11  3.214216  0.6024244  2.603767
#>   13  3.196510  0.6176570  2.591935
#>   15  3.184173  0.6305506  2.577482
#>   17  3.183130  0.6425367  2.567787
#>   19  3.198752  0.6483184  2.592683
#>   21  3.188993  0.6611428  2.588787
#>   23  3.200458  0.6638353  2.604529
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
#>      RMSE  Rsquared       MAE 
#> 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

MARS Model

MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "earth",
                  tuneGrid = MARS_grid,
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
#> Loading required package: earth
#> Loading required package: Formula
#> Loading required package: plotmo
#> Loading required package: plotrix
#> Loading required package: TeachingDemos
MARS_model
#> Multivariate Adaptive Regression Spline 
#> 
#> 200 samples
#>  10 predictor
#> 
#> Pre-processing: centered (10), scaled (10) 
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
#> Resampling results across tuning parameters:
#> 
#>   degree  nprune  RMSE      Rsquared   MAE     
#>   1        2      4.383438  0.2405683  3.597961
#>   1        3      3.645469  0.4745962  2.930453
#>   1        4      2.727602  0.7035031  2.184240
#>   1        5      2.449243  0.7611230  1.939231
#>   1        6      2.331605  0.7835496  1.833420
#>   1        7      1.976830  0.8421599  1.562591
#>   1        8      1.870959  0.8585503  1.464551
#>   1        9      1.804342  0.8683110  1.410395
#>   1       10      1.787676  0.8711960  1.386944
#>   1       11      1.790700  0.8707740  1.393076
#>   1       12      1.821005  0.8670619  1.419893
#>   1       13      1.858688  0.8617344  1.445459
#>   1       14      1.862343  0.8623072  1.446050
#>   1       15      1.871033  0.8607099  1.457618
#>   2        2      4.383438  0.2405683  3.597961
#>   2        3      3.644919  0.4742570  2.929647
#>   2        4      2.730222  0.7028372  2.183075
#>   2        5      2.481291  0.7545789  1.965749
#>   2        6      2.338369  0.7827873  1.825542
#>   2        7      2.030065  0.8328250  1.602024
#>   2        8      1.890997  0.8551326  1.477422
#>   2        9      1.742626  0.8757904  1.371910
#>   2       10      1.608221  0.8943432  1.255416
#>   2       11      1.474325  0.9111463  1.157848
#>   2       12      1.437483  0.9157967  1.120977
#>   2       13      1.439395  0.9164721  1.128309
#>   2       14      1.428565  0.9184503  1.118634
#>   2       15      1.434093  0.9182413  1.121622
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were nprune = 14 and degree = 2.

The optimal MARS model minimized the RMSE at nprune = 14 and degree = 2.
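
The tuning profile is easier to read as a plot; caret's plot method for a train object charts the resampled RMSE against nprune, one curve per degree:

# Resampled RMSE vs. nprune, one curve per interaction degree
plot(MARS_model)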

MARS_predictions <- predict(MARS_model, newdata = testData$x)
postResample(pred = MARS_predictions, obs = testData$y)
#>      RMSE  Rsquared       MAE 
#> 1.2779993 0.9338365 1.0147070

The test-set RMSE of the MARS model is substantially lower than that of the KNN model. Let’s see which variables are important.

varImp(MARS_model)
#> earth variable importance
#> 
#>    Overall
#> X1  100.00
#> X4   75.40
#> X2   49.00
#> X5   15.72
#> X3    0.00

The MARS model picks up the informative X1-X5 variables, although X3 receives essentially zero relative importance.
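
Importance scores alone can be ambiguous (X3 shows 0.00), so it is worth listing the hinge terms retained by the final fit directly; any predictor absent from the coefficient table was pruned out:

# List the basis functions kept in the final earth model
summary(MARS_model$finalModel)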

SVM Model

SVM_model <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
SVM_model
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 200 samples
#>  10 predictor
#> 
#> Pre-processing: centered (10), scaled (10) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
#> Resampling results across tuning parameters:
#> 
#>   C       RMSE      Rsquared   MAE     
#>     0.25  2.548095  0.7967225  2.001583
#>     0.50  2.290818  0.8118247  1.779392
#>     1.00  2.095918  0.8337552  1.634192
#>     2.00  1.989469  0.8462100  1.539721
#>     4.00  1.901315  0.8565159  1.510413
#>     8.00  1.879808  0.8588137  1.514492
#>    16.00  1.878851  0.8591592  1.518105
#>    32.00  1.878851  0.8591592  1.518105
#>    64.00  1.878851  0.8591592  1.518105
#>   128.00  1.878851  0.8591592  1.518105
#> 
#> Tuning parameter 'sigma' was held constant at a value of 0.06670077
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were sigma = 0.06670077 and C = 16.

The optimal SVM model has a \(\sigma\) of 0.0667 and a C of 16.

SVM_predictions <- predict(SVM_model, newdata = testData$x)
postResample(pred = SVM_predictions, obs = testData$y)
#>      RMSE  Rsquared       MAE 
#> 2.0829707 0.8242096 1.5826017
varImp(SVM_model)
#> loess r-squared variable importance
#> 
#>      Overall
#> X4  100.0000
#> X1   95.5047
#> X2   89.6186
#> X5   45.2170
#> X3   29.9330
#> X9    6.3299
#> X10   5.5182
#> X8    3.2527
#> X6    0.8884
#> X7    0.0000

The SVM model also ranks the informative X1-X5 variables at the top of the list.

Neural Network Model

The best neural network has size = 4 and decay = 0.1.
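
The chunk that trained nnet_model was not echoed in the original document. Below is a minimal sketch of how it was likely fit with caret's avNNet method; the tuning grid, maxit, and MaxNWts values are assumptions, chosen only so that the reported optimum (size = 4, decay = 0.1) falls inside the grid:

# Sketch only: the grid and fitting controls are assumed, not taken
# from the original document.
nnet_grid <- expand.grid(size = 1:10, decay = c(0, 0.01, 0.1), bag = FALSE)
nnet_model <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "avNNet",                  # averaged neural networks
                    tuneGrid = nnet_grid,
                    preProcess = c("center", "scale"),
                    linout = TRUE,                      # linear output for regression
                    trace = FALSE,                      # silence fitting output
                    MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                    maxit = 500)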

nnet_predictions <- predict(nnet_model, newdata = testData$x)
postResample(pred = nnet_predictions, obs = testData$y)
#>      RMSE  Rsquared       MAE 
#> 2.0532730 0.8346365 1.5444574
varImp(nnet_model)
#> loess r-squared variable importance
#> 
#>      Overall
#> X4  100.0000
#> X1   95.5047
#> X2   89.6186
#> X5   45.2170
#> X3   29.9330
#> X9    6.3299
#> X10   5.5182
#> X8    3.2527
#> X6    0.8884
#> X7    0.0000

The top five variables are the informative predictors X1-X5, exactly the ones we want to see listed.

Summary

library(dplyr)       # %>%, mutate, bind_rows, select, arrange
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% 
  mutate("Model" = "KNN")
results <- data.frame(t(postResample(pred = MARS_predictions, obs = testData$y))) %>%
  mutate("Model"= "MARS") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = testData$y))) %>%
  mutate("Model"= "SVM") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = nnet_predictions, obs = testData$y))) %>%
  mutate("Model"= "Neural Network") %>%
  bind_rows(results)
results %>%
  select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
| Model          |     RMSE |  Rsquared |      MAE |
|:---------------|---------:|----------:|---------:|
| MARS           | 1.277999 | 0.9338365 | 1.014707 |
| Neural Network | 2.053273 | 0.8346365 | 1.544457 |
| SVM            | 2.082971 | 0.8242096 | 1.582602 |
| KNN            | 3.204060 | 0.6819919 | 2.568346 |

The MARS model performed the best and identified the right variables as the important ones. It has the highest \(R^2\) (0.934) and the lowest RMSE (1.278) of the four models on the test set.

Problem 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

I will run the data through the nonlinear models from the chapter, keeping each model’s setup as close as possible to that of the original linear model.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Make this reproducible
set.seed(42)
# k-NN imputation of missing values (named to avoid clashing with the
# KNN regression model trained below)
knn_impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df <- predict(knn_impute, ChemicalManufacturingProcess)
df <- df %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pls_model
#> Partial Least Squares 
#> 
#> 144 samples
#>  56 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
#> Resampling results across tuning parameters:
#> 
#>   ncomp  RMSE       Rsquared   MAE      
#>    1     0.8824790  0.3779221  0.6711462
#>    2     1.1458456  0.4219806  0.7086431
#>    3     0.7363066  0.5244517  0.5688553
#>    4     0.8235294  0.5298005  0.5933120
#>    5     0.9670735  0.4846010  0.6371199
#>    6     0.9959036  0.4776684  0.6427478
#>    7     0.9119517  0.4986338  0.6200233
#>    8     0.9068621  0.5012144  0.6293371
#>    9     0.8517370  0.5220166  0.6163795
#>   10     0.8919356  0.5062912  0.6332243
#>   11     0.9173758  0.4934557  0.6463164
#>   12     0.9064125  0.4791526  0.6485663
#>   13     0.9255289  0.4542181  0.6620193
#>   14     1.0239913  0.4358371  0.6944056
#>   15     1.0754710  0.4365214  0.7077991
#>   16     1.1110579  0.4269065  0.7135684
#>   17     1.1492855  0.4210485  0.7222868
#>   18     1.1940639  0.4132534  0.7396357
#>   19     1.2271867  0.4079005  0.7494818
#>   20     1.2077102  0.4022859  0.7470327
#>   21     1.2082648  0.4026711  0.7452969
#>   22     1.2669285  0.3987044  0.7634170
#>   23     1.3663033  0.3970188  0.7957514
#>   24     1.4531634  0.3898475  0.8243034
#>   25     1.5624265  0.3820102  0.8612935
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was ncomp = 3.
pls_predictions <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "PLS")

Part A

Which nonlinear regression model gives the optimal resampling and test set performance?

KNN Model

knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
knn_model
#> k-Nearest Neighbors 
#> 
#> 144 samples
#>  56 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 130, 129, 130, 130, 130, 131, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   RMSE       Rsquared   MAE      
#>    5  0.7085987  0.4846711  0.5709345
#>    7  0.7331379  0.4559107  0.5981516
#>    9  0.7417741  0.4556845  0.6124459
#>   11  0.7505969  0.4526780  0.6277377
#>   13  0.7450412  0.4691830  0.6181457
#>   15  0.7539451  0.4604423  0.6246762
#>   17  0.7663763  0.4474991  0.6306860
#>   19  0.7694326  0.4480450  0.6315047
#>   21  0.7706841  0.4508441  0.6288689
#>   23  0.7780866  0.4419184  0.6350889
#>   25  0.7834333  0.4320541  0.6382165
#>   27  0.7877890  0.4339581  0.6454522
#>   29  0.7981302  0.4187796  0.6533767
#>   31  0.8060716  0.3977540  0.6610186
#>   33  0.8063566  0.4128701  0.6600828
#>   35  0.8144040  0.3964037  0.6671734
#>   37  0.8160580  0.4009313  0.6662999
#>   39  0.8220176  0.3884786  0.6693883
#>   41  0.8274778  0.3750873  0.6746765
#>   43  0.8288000  0.3820179  0.6758220
#>   45  0.8285016  0.3878549  0.6747216
#>   47  0.8331504  0.3808342  0.6774480
#>   49  0.8368096  0.3816669  0.6829726
#>   51  0.8391074  0.3804316  0.6857959
#>   53  0.8442763  0.3704957  0.6915276
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
knn_predictions <- predict(knn_model, test_df)
results <- data.frame(t(postResample(pred = knn_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "KNN") %>% rbind(results)

MARS Model

MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(
  Yield ~ ., data = train_df, method = "earth",
  tuneGrid = MARS_grid,
  # train() passes center/scale through to earth(), which does not
  # accept them and throws an error, so they are omitted here;
  # preProcess = c("center", "scale") would be the supported route.
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
MARS_model
#> Multivariate Adaptive Regression Spline 
#> 
#> 144 samples
#>  56 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 129, 130, 130, 128, 130, 128, ... 
#> Resampling results across tuning parameters:
#> 
#>   degree  nprune  RMSE       Rsquared   MAE      
#>   1        2      0.7744594  0.4159942  0.6084363
#>   1        3      0.7003597  0.5682276  0.5530374
#>   1        4      0.6469215  0.6086750  0.5154043
#>   1        5      0.6404520  0.6073071  0.5042864
#>   1        6      0.6960964  0.5396157  0.5588565
#>   1        7      0.6997921  0.5521698  0.5669813
#>   1        8      0.7147341  0.5423472  0.5621022
#>   1        9      0.7173559  0.5508650  0.5641610
#>   1       10      0.7487512  0.5325936  0.5869088
#>   1       11      0.7415722  0.5353676  0.5847493
#>   1       12      0.7353149  0.5519142  0.5768069
#>   1       13      0.7450704  0.5357967  0.5887685
#>   1       14      0.7419274  0.5400866  0.5844385
#>   1       15      0.7262459  0.5553445  0.5775694
#>   2        2      0.7814439  0.4052275  0.6150579
#>   2        3      0.6961165  0.5526862  0.5636101
#>   2        4      0.6947315  0.5595844  0.5539150
#>   2        5      0.6779962  0.5808479  0.5327456
#>   2        6      0.6727275  0.5828458  0.5188054
#>   2        7      0.6679783  0.5914730  0.5207659
#>   2        8      0.6108172  0.6495008  0.4727198
#>   2        9      0.5989850  0.6605148  0.4657677
#>   2       10      0.6026849  0.6641443  0.4591813
#>   2       11      0.6077285  0.6574897  0.4719777
#>   2       12      0.6015413  0.6605563  0.4640395
#>   2       13      0.6154576  0.6602372  0.4727339
#>   2       14      0.6222890  0.6555194  0.4751916
#>   2       15      0.6102534  0.6703285  0.4650639
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were nprune = 9 and degree = 2.
MARS_predictions <- predict(MARS_model, test_df)
results <- data.frame(t(postResample(pred = MARS_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "MARS") %>% rbind(results)

SVM Model

SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmRadial",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)
SVM_model
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 144 samples
#>  56 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 129, 129, 129, 130, 131, 131, ... 
#> Resampling results across tuning parameters:
#> 
#>   C           RMSE       Rsquared   MAE      
#>         0.25  0.7683015  0.4646282  0.6392703
#>         0.50  0.7174781  0.4848381  0.5965095
#>         1.00  0.6686679  0.5330949  0.5530530
#>         2.00  0.6409061  0.5552794  0.5275947
#>         4.00  0.6356640  0.5619233  0.5166512
#>         8.00  0.6286151  0.5761807  0.5115837
#>        16.00  0.6287131  0.5760690  0.5116965
#>        32.00  0.6287131  0.5760690  0.5116965
#>        64.00  0.6287131  0.5760690  0.5116965
#>       128.00  0.6287131  0.5760690  0.5116965
#>       256.00  0.6287131  0.5760690  0.5116965
#>       512.00  0.6287131  0.5760690  0.5116965
#>      1024.00  0.6287131  0.5760690  0.5116965
#>      2048.00  0.6287131  0.5760690  0.5116965
#>      4096.00  0.6287131  0.5760690  0.5116965
#>      8192.00  0.6287131  0.5760690  0.5116965
#>     16384.00  0.6287131  0.5760690  0.5116965
#>     32768.00  0.6287131  0.5760690  0.5116965
#>     65536.00  0.6287131  0.5760690  0.5116965
#>    131072.00  0.6287131  0.5760690  0.5116965
#>    262144.00  0.6287131  0.5760690  0.5116965
#>    524288.00  0.6287131  0.5760690  0.5116965
#>   1048576.00  0.6287131  0.5760690  0.5116965
#>   2097152.00  0.6287131  0.5760690  0.5116965
#>   4194304.00  0.6287131  0.5760690  0.5116965
#> 
#> Tuning parameter 'sigma' was held constant at a value of 0.015148
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were sigma = 0.015148 and C = 8.
SVM_predictions <- predict(SVM_model, test_df)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "SVM") %>% rbind(results)

Neural Network Model
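
The training chunk was omitted here as well; below is a sketch under the same assumptions as in Problem 7.2 (method avNNet, hypothetical grid), followed by the same predict-and-bind pattern used for the other models:

# Sketch only: the grid and controls are assumed.
nnet_grid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1), bag = FALSE)
nnet_model <- train(
  Yield ~ ., data = train_df, method = "avNNet",
  tuneGrid = nnet_grid,
  linout = TRUE, trace = FALSE, maxit = 500,
  trControl = trainControl(method = "cv", number = 10)
)
nnet_predictions <- predict(nnet_model, test_df)
results <- data.frame(t(postResample(pred = nnet_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "Neural Network") %>% rbind(results)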

Summary

results %>%
  select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
| Model          |      RMSE |  Rsquared |       MAE |
|:---------------|----------:|----------:|----------:|
| PLS            | 0.6192577 | 0.6771122 | 0.5059984 |
| MARS           | 0.6797644 | 0.6313203 | 0.5452501 |
| SVM            | 0.6815833 | 0.6251941 | 0.4926105 |
| KNN            | 0.7149819 | 0.5894607 | 0.5047973 |
| Neural Network | 0.7368477 | 0.5465295 | 0.5848484 |

The PLS model was the best overall, with the highest \(R^2\) (0.677) and the lowest RMSE (0.619). Among the strictly nonlinear models, MARS gave the best test set performance.

Part B

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

varImp(SVM_model)
#> loess r-squared variable importance
#> 
#>   only 20 most important variables shown (out of 56)
#> 
#>                        Overall
#> ManufacturingProcess32  100.00
#> ManufacturingProcess13   93.82
#> ManufacturingProcess09   89.93
#> ManufacturingProcess17   88.20
#> BiologicalMaterial06     82.61
#> BiologicalMaterial03     79.44
#> ManufacturingProcess36   73.85
#> BiologicalMaterial12     72.36
#> ManufacturingProcess06   69.00
#> ManufacturingProcess11   62.34
#> ManufacturingProcess31   56.39
#> BiologicalMaterial02     50.34
#> BiologicalMaterial11     48.53
#> BiologicalMaterial09     44.76
#> ManufacturingProcess30   41.87
#> BiologicalMaterial08     40.24
#> ManufacturingProcess29   38.54
#> ManufacturingProcess33   38.16
#> BiologicalMaterial04     36.92
#> ManufacturingProcess25   36.83
varImp(pls_model)
#> pls variable importance
#> 
#>   only 20 most important variables shown (out of 56)
#> 
#>                        Overall
#> ManufacturingProcess32  100.00
#> ManufacturingProcess09   88.04
#> ManufacturingProcess36   82.20
#> ManufacturingProcess13   82.11
#> ManufacturingProcess17   80.25
#> ManufacturingProcess06   59.06
#> ManufacturingProcess11   55.93
#> BiologicalMaterial02     55.46
#> BiologicalMaterial06     54.64
#> BiologicalMaterial03     54.50
#> ManufacturingProcess33   53.91
#> ManufacturingProcess12   52.04
#> BiologicalMaterial08     49.76
#> BiologicalMaterial12     47.40
#> ManufacturingProcess34   45.47
#> BiologicalMaterial11     45.05
#> BiologicalMaterial01     44.18
#> BiologicalMaterial04     42.95
#> ManufacturingProcess04   39.94
#> ManufacturingProcess28   36.61
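
To make the top-ten comparison concrete, the two lists can be intersected programmatically (a small sketch; top_n_vars is a helper defined here, not a caret function):

# Extract the n highest-ranked predictors from a fitted caret model
top_n_vars <- function(model, n = 10) {
  imp <- varImp(model)$importance
  rownames(imp)[order(-imp$Overall)][1:n]
}
intersect(top_n_vars(SVM_model), top_n_vars(pls_model))  # shared top-ten predictors
setdiff(top_n_vars(SVM_model), top_n_vars(pls_model))    # in the SVM top ten only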

The SVM model’s importance ranking is very similar to the PLS model’s. In both cases the manufacturing process variables dominate the list. The main difference in the top ten is that the SVM model ranks BiologicalMaterial12 highly where the PLS model ranks BiologicalMaterial02 instead.

Part C

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

ggplot(train_df, aes(BiologicalMaterial12, Yield)) +
  geom_point()

This indicates a positive relationship between BiologicalMaterial12 and the yield, although there appears to be a sweet spot, a point of diminishing marginal returns on the material, since the highest yields are not at the far right of the plot.
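
A smoother makes the flattening at higher values easier to see; the loess fit below is added purely for illustration:

# Same scatterplot with an illustrative loess trend line
ggplot(train_df, aes(BiologicalMaterial12, Yield)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)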