R Markdown

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
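As a sanity check, here is a small sketch of my own (not from the book) that generates a few observations directly from the equation above; mlbench.friedman1 does essentially this, plus the five non-informative predictors.

# simulate a handful of rows straight from the Friedman (1991) equation
set.seed(200)
n <- 5
x <- matrix(runif(n * 10), ncol = 10)   # ten uniform predictors; only the first five are informative
y <- 10 * sin(pi * x[, 1] * x[, 2]) +
  20 * (x[, 3] - 0.5)^2 +
  10 * x[, 4] + 5 * x[, 5] +
  rnorm(n, sd = 1)                      # the N(0, sigma^2) noise term with sigma = 1
cbind(round(x[, 1:5], 2), y = round(y, 2))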

library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)

trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd= 1)
testData$x <- data.frame(testData$x)

# This is the book's example of KNN

knn_fit <- train(x = trainingData$x, y = trainingData$y, method = "knn",
  preProcess = c("center", "scale"), tuneLength = 10)
knnPred <- predict(knn_fit, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
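# A quick check of my own (not in the book): which value of k did caret settle on?
knn_fit$bestTune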
############################################################

# This is my example of MARS (multivariate adaptive regression splines)
mars_fit<- train(x = trainingData$x, y = trainingData$y, method = "earth",
  preProcess = c("center", "scale"), tuneLength = 10)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367
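# My addition: inspect which predictors the MARS model actually used, to see whether it
# recovers the informative variables X1 through X5
varImp(mars_fit)
summary(mars_fit$finalModel)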
# note: as the output below shows, caret holds the cost parameter C constant at 1 for
# method = "svmLinear", so tuneLength = 14 has no effect with this linear kernel
svmRTuned <- train(trainingData$x, trainingData$y, method = "svmLinear",
  preProc = c("center", "scale"), tuneLength = 14,
  trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Linear Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.503119  0.7588397  2.021759
## 
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 172 
## 
## Objective Function Value : -54.9759 
## Training error : 0.216921
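# To make the comparison explicit, here is a small sketch of my own that scores the linear SVM
# on the same test set and stacks all three results into one table
svmPred <- predict(svmRTuned, newdata = testData$x)
rbind(
  KNN  = postResample(pred = knnPred,  obs = testData$y),
  MARS = postResample(pred = marsPred, obs = testData$y),
  SVM  = postResample(pred = svmPred,  obs = testData$y)
)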
# The MARS model is the best model here: test-set RMSE of about 1.78 and r-squared of about 0.87,
# versus 0.68 for KNN and a cross-validated r-squared of about 0.76 for the linear SVM. A much
# higher r-squared than that would make me start worrying about overfitting, which is what we
# talked about in class.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

  1. Which nonlinear regression model gives the optimal resampling and test set performance?
  2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
  3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
# Let's load up the libraries we need

library(AppliedPredictiveModeling) # has ChemicalManufacturingProcess data in it
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.1     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(RANN)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(earth)
library(e1071)
library(caret)

# Load the chemical manufacturing process data and make a local copy of the data frame
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess <- ChemicalManufacturingProcess

# This is a cool bit of code that I'm going to save and use with other datasets. I used this in
# Project 1 also. It lets you really visualize missing values.
visdat::vis_miss(ChemicalManufacturingProcess) + theme(axis.text.x = element_text(size = 5))

# always set this seed
set.seed(1024)

# Use the preProcess function to impute missing values with k-nearest neighbors
ChemicalManufacturingProcess_KNN <- preProcess(ChemicalManufacturingProcess, "knnImpute")
ChemicalManufacturingProcess_predictions <- predict(ChemicalManufacturingProcess_KNN, ChemicalManufacturingProcess)
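# Quick check of my own: the knnImpute step should leave no missing values. Note that caret's
# knnImpute also centers and scales every column (including Yield), which is why the metrics
# below are on a standardized scale.
sum(is.na(ChemicalManufacturingProcess_predictions))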

# Now we split the data into training and test sets. We can reuse this training set for all the
# different nonlinear models we will try out
in_train <- createDataPartition(ChemicalManufacturingProcess_predictions$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- ChemicalManufacturingProcess_predictions[in_train, ]
chemical_manufacturing_process_testing_df <- ChemicalManufacturingProcess_predictions[-in_train, ]
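# Sanity check of my own on the split sizes (should be roughly an 80/20 split)
nrow(train_df)
nrow(chemical_manufacturing_process_testing_df)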

# First up is KNN

knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  # note: center/scale passed this way are not applied by caret (the output below reports
  # "No pre-processing"); the same applies to the SVM fits further down. That is fine here
  # because the knnImpute step above already centered and scaled the data.
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 12),
  tuneLength = 20
)
knn_model
## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (12 fold) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.6989784  0.5283036  0.5492972
##    7  0.6975062  0.5370561  0.5573545
##    9  0.7141302  0.5166686  0.5776607
##   11  0.7116015  0.5287752  0.5841152
##   13  0.7169219  0.5176278  0.5880122
##   15  0.7279874  0.5088178  0.5913430
##   17  0.7357253  0.5036560  0.6025929
##   19  0.7414767  0.4966259  0.6042209
##   21  0.7441800  0.5038005  0.6070311
##   23  0.7489150  0.5000087  0.6124285
##   25  0.7524140  0.4955007  0.6168216
##   27  0.7591020  0.4879519  0.6178066
##   29  0.7645591  0.4850406  0.6233705
##   31  0.7667741  0.4837638  0.6238018
##   33  0.7748198  0.4695899  0.6313253
##   35  0.7850041  0.4586164  0.6399696
##   37  0.7870628  0.4689378  0.6408727
##   39  0.7909958  0.4638100  0.6437614
##   41  0.7970462  0.4588364  0.6490915
##   43  0.8023315  0.4543656  0.6558684
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
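# My addition: plot the resampling profile to see RMSE bottom out around k = 7
plot(knn_model)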
# Now we turn to our test data frame to see how well the model generalizes

knn_predictions <- predict(knn_model, chemical_manufacturing_process_testing_df)

results <- data.frame(t(postResample(pred = knn_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
  mutate("Model"= "KNN") 
results
##        RMSE  Rsquared       MAE Model
## 1 0.7176988 0.4460735 0.5796155   KNN
# For the support vector machine, since I presented on SVM, I want to try out kernel
# generalizations of SVM that allow for a 'curvy' fit, so to speak
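# For reference (my note, not from the text): the radial basis function kernel in kernlab is
# K(x, x') = exp(-sigma * ||x - x'||^2); the sigma reported in the radial SVM output below is
# this kernel width, and it is what allows the fit to bend instead of staying linear.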



SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmLinear",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)

SVM_model
## Support Vector Machines with Linear Kernel 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 129, 129, 129, 130, 130, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE      
##   1.80969  0.4703981  0.8635736
## 
## Tuning parameter 'C' was held constant at a value of 1
SVM_predictions <- predict(SVM_model, chemical_manufacturing_process_testing_df)

results <- data.frame(t(postResample(pred = SVM_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
  mutate("Model"= "SVM")

results
##        RMSE  Rsquared       MAE Model
## 1 0.8784625 0.3265185 0.6529406   SVM
varImp(SVM_model, 10)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   90.67
## ManufacturingProcess17   86.79
## ManufacturingProcess09   76.75
## BiologicalMaterial06     74.19
## ManufacturingProcess36   67.41
## BiologicalMaterial03     65.91
## BiologicalMaterial02     63.73
## ManufacturingProcess06   59.73
## ManufacturingProcess31   57.62
## BiologicalMaterial12     56.62
## ManufacturingProcess11   47.48
## ManufacturingProcess29   46.17
## BiologicalMaterial11     43.51
## ManufacturingProcess33   43.22
## ManufacturingProcess02   42.44
## BiologicalMaterial04     41.14
## BiologicalMaterial01     40.37
## BiologicalMaterial08     39.74
## ManufacturingProcess30   36.30
SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmRadial",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)

SVM_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 132, 128, 131, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   C           RMSE       Rsquared   MAE      
##         0.25  0.7534920  0.5201546  0.6211455
##         0.50  0.6907749  0.5705705  0.5637796
##         1.00  0.6449411  0.6221761  0.5204910
##         2.00  0.6251292  0.6433012  0.4999772
##         4.00  0.6175964  0.6504017  0.4898894
##         8.00  0.6086086  0.6636794  0.4820380
##        16.00  0.6086086  0.6636794  0.4820380
##        32.00  0.6086086  0.6636794  0.4820380
##        64.00  0.6086086  0.6636794  0.4820380
##       128.00  0.6086086  0.6636794  0.4820380
##       256.00  0.6086086  0.6636794  0.4820380
##       512.00  0.6086086  0.6636794  0.4820380
##      1024.00  0.6086086  0.6636794  0.4820380
##      2048.00  0.6086086  0.6636794  0.4820380
##      4096.00  0.6086086  0.6636794  0.4820380
##      8192.00  0.6086086  0.6636794  0.4820380
##     16384.00  0.6086086  0.6636794  0.4820380
##     32768.00  0.6086086  0.6636794  0.4820380
##     65536.00  0.6086086  0.6636794  0.4820380
##    131072.00  0.6086086  0.6636794  0.4820380
##    262144.00  0.6086086  0.6636794  0.4820380
##    524288.00  0.6086086  0.6636794  0.4820380
##   1048576.00  0.6086086  0.6636794  0.4820380
##   2097152.00  0.6086086  0.6636794  0.4820380
##   4194304.00  0.6086086  0.6636794  0.4820380
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01625812
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01625812 and C = 8.
SVM_predictions <- predict(SVM_model, chemical_manufacturing_process_testing_df)

results <- data.frame(t(postResample(pred = SVM_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
  mutate("Model"= "SVM")

results
##        RMSE  Rsquared       MAE Model
## 1 0.6300514 0.5704346 0.4876792   SVM
# The radial-kernel SVM gives the best test-set r-squared at about 0.57, better than KNN at
# about 0.45, which in turn beats the linear-kernel SVM at about 0.33.

# Here are the most important predictors, led by ManufacturingProcess13, ManufacturingProcess32,
# and ManufacturingProcess17. Manufacturing process variables dominate the top of the list
# (seven of the top ten, versus three biological materials). Note the "loess r-squared variable
# importance" header: kernlab SVMs don't provide a model-specific importance measure, so caret
# falls back to a model-free loess filter, which is why this table matches the one from the
# linear SVM above.
varImp(SVM_model, 10)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   90.67
## ManufacturingProcess17   86.79
## ManufacturingProcess09   76.75
## BiologicalMaterial06     74.19
## ManufacturingProcess36   67.41
## BiologicalMaterial03     65.91
## BiologicalMaterial02     63.73
## ManufacturingProcess06   59.73
## ManufacturingProcess31   57.62
## BiologicalMaterial12     56.62
## ManufacturingProcess11   47.48
## ManufacturingProcess29   46.17
## BiologicalMaterial11     43.51
## ManufacturingProcess33   43.22
## ManufacturingProcess02   42.44
## BiologicalMaterial04     41.14
## BiologicalMaterial01     40.37
## BiologicalMaterial08     39.74
## ManufacturingProcess30   36.30
# The top predictor, ManufacturingProcess13, has an importance score of 100, but varImp scales
# its scores so that the strongest predictor is always 100; this does not mean a perfect
# relationship with the response, only that it is the strongest of the candidates. It is still
# worth a closer look: if it turns out to be a near-tautological variable, like predicting weight
# with kilograms, we could consider removing it once we learn more about the dataset.
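# A sketch of my own for question 3: plot the top process predictors against Yield in the
# (imputed, standardized) training data to see the direction of each relationship
top_predictors <- c("ManufacturingProcess13", "ManufacturingProcess32",
                    "ManufacturingProcess17", "ManufacturingProcess09")
featurePlot(x = train_df[, top_predictors], y = train_df$Yield,
            plot = "scatter", type = c("p", "smooth"))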