7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five other non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd= 1)
testData$x <- data.frame(testData$x)
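# sanity check (a quick sketch of my own, assuming the data.frame conversion
# named the columns X1 ... X10): rebuild y from the equation above; the
# residual should behave like the N(0, 1) noise term since we set sd = 1
y_formula <- with(testData$x,
                  10 * sin(pi * X1 * X2) + 20 * (X3 - 0.5)^2 + 10 * X4 + 5 * X5)
summary(testData$y - y_formula)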
# this is the book example of knn
knn_fit <- train(x = trainingData$x, y = trainingData$y, method = "knn",
                 preProcess = c("center", "scale"), tuneLength = 10)
knnPred <- predict(knn_fit, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
############################################################
# this is my example of mars (multivariate adaptive regression splines)
mars_fit <- train(x = trainingData$x, y = trainingData$y, method = "earth",
                  preProcess = c("center", "scale"), tuneLength = 10)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
marsPred <- predict(mars_fit, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
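# a quick look (my addition) at which hinge functions the final MARS model
# actually selected
summary(mars_fit$finalModel)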
# an SVM with a linear kernel; note that caret holds C fixed at 1 for
# method = "svmLinear", so tuneLength = 14 has no effect here
svmRTuned <- train(trainingData$x, trainingData$y, method = "svmLinear",
                   preProc = c("center", "scale"), tuneLength = 14,
                   trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Linear Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.503119 0.7588397 2.021759
##
## Tuning parameter 'C' was held constant at a value of 1
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 172
##
## Objective Function Value : -54.9759
## Training error : 0.216921
# Of the three models, MARS performs best on the test set: it has the lowest RMSE (1.78) and the highest r-squared (0.87). That fits the data-generating equation, since MARS's piecewise-linear hinge functions can approximate the sine and quadratic terms directly.
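# side-by-side test-set metrics for all three models (a small sketch reusing
# the predictions computed above)
rbind(
  KNN  = postResample(pred = knnPred, obs = testData$y),
  MARS = postResample(pred = marsPred, obs = testData$y),
  SVM  = postResample(pred = predict(svmRTuned, testData$x), obs = testData$y)
)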
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
# Let's load up our local libraries
library(AppliedPredictiveModeling) # has ChemicalManufacturingProcess data in it
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.1 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(RANN)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(earth)
library(e1071)
library(caret)
# load the chemical manufacturing process data and make a regular copy in the workspace
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess <- ChemicalManufacturingProcess
# visdat::vis_miss() is a handy way to see missing values at a glance. I used it in Project 1 too and plan to reuse it with other datasets.
visdat::vis_miss(ChemicalManufacturingProcess) + theme(axis.text.x = element_text(size = 5))
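# a numeric companion to the plot (sketch): count missing values per column,
# keeping only the columns that have any
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
sort(na_counts[na_counts > 0], decreasing = TRUE)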
# set a seed so the imputation and the data split below are reproducible
set.seed(1024)
# impute missing values with k-nearest neighbors; note that knnImpute also centers and scales every column as a side effect
ChemicalManufacturingProcess_KNN <- preProcess(ChemicalManufacturingProcess, "knnImpute")
ChemicalManufacturingProcess_predictions <- predict(ChemicalManufacturingProcess_KNN, ChemicalManufacturingProcess)
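# sanity check (sketch): the imputation should leave no missing values, and
# every column is now on a standardized scale
sum(is.na(ChemicalManufacturingProcess_predictions))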
# split the data 80/20 into training and test sets; this one training set is reused for every nonlinear model we try below
in_train <- createDataPartition(ChemicalManufacturingProcess_predictions$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- ChemicalManufacturingProcess_predictions[in_train, ]
chemical_manufacturing_process_testing_df <- ChemicalManufacturingProcess_predictions[-in_train, ]
# First up is KNN. The predictors were already centered and scaled by knnImpute, which is why train() reports "No pre-processing" below; center = TRUE and scale = TRUE are not caret pre-processing arguments (that would be preProcess = c("center", "scale")), so they are effectively inert here.
knn_model <- train(
Yield ~ ., data = train_df, method = "knn",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 12),
tuneLength = 20
)
knn_model
## k-Nearest Neighbors
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (12 fold)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.6989784 0.5283036 0.5492972
## 7 0.6975062 0.5370561 0.5573545
## 9 0.7141302 0.5166686 0.5776607
## 11 0.7116015 0.5287752 0.5841152
## 13 0.7169219 0.5176278 0.5880122
## 15 0.7279874 0.5088178 0.5913430
## 17 0.7357253 0.5036560 0.6025929
## 19 0.7414767 0.4966259 0.6042209
## 21 0.7441800 0.5038005 0.6070311
## 23 0.7489150 0.5000087 0.6124285
## 25 0.7524140 0.4955007 0.6168216
## 27 0.7591020 0.4879519 0.6178066
## 29 0.7645591 0.4850406 0.6233705
## 31 0.7667741 0.4837638 0.6238018
## 33 0.7748198 0.4695899 0.6313253
## 35 0.7850041 0.4586164 0.6399696
## 37 0.7870628 0.4689378 0.6408727
## 39 0.7909958 0.4638100 0.6437614
## 41 0.7970462 0.4588364 0.6490915
## 43 0.8023315 0.4543656 0.6558684
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
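# a quick visual of the tuning curve (my addition); RMSE should bottom out
# at k = 7, matching the table above
plot(knn_model)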
# now score the held-out test set to see how the model generalizes
knn_predictions <- predict(knn_model, chemical_manufacturing_process_testing_df)
results <- data.frame(t(postResample(pred = knn_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
mutate("Model"= "KNN")
results
## RMSE Rsquared MAE Model
## 1 0.7176988 0.4460735 0.5796155 KNN
# Next, support vector machines. Since I presented on SVM, I want to compare the plain linear kernel against the radial kernel, which generalizes it and allows a 'curvy' fit. The linear baseline comes first; as the output below shows, caret holds C constant at 1 for svmLinear, so tuneLength = 25 has no effect (see the tuning sketch after the results).
SVM_model <- train(
Yield ~ ., data = train_df, method = "svmLinear",
center = TRUE,
scale = TRUE,
trControl = trainControl(method = "cv"),
tuneLength = 25
)
SVM_model
## Support Vector Machines with Linear Kernel
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 129, 129, 129, 130, 130, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.80969 0.4703981 0.8635736
##
## Tuning parameter 'C' was held constant at a value of 1
SVM_predictions <- predict(SVM_model, chemical_manufacturing_process_testing_df)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
mutate("Model"= "SVM")
results
## RMSE Rsquared MAE Model
## 1 0.8784625 0.3265185 0.6529406 SVM
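# The "C was held constant" line above confirms tuneLength does nothing for
# svmLinear; to really tune the cost, supply an explicit grid. A sketch (these
# C values are illustrative, and this was not run above):
svm_linear_grid <- train(
  Yield ~ ., data = train_df, method = "svmLinear",
  trControl = trainControl(method = "cv"),
  tuneGrid = data.frame(C = 2^seq(-4, 4))
)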
varImp(SVM_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 90.67
## ManufacturingProcess17 86.79
## ManufacturingProcess09 76.75
## BiologicalMaterial06 74.19
## ManufacturingProcess36 67.41
## BiologicalMaterial03 65.91
## BiologicalMaterial02 63.73
## ManufacturingProcess06 59.73
## ManufacturingProcess31 57.62
## BiologicalMaterial12 56.62
## ManufacturingProcess11 47.48
## ManufacturingProcess29 46.17
## BiologicalMaterial11 43.51
## ManufacturingProcess33 43.22
## ManufacturingProcess02 42.44
## BiologicalMaterial04 41.14
## BiologicalMaterial01 40.37
## BiologicalMaterial08 39.74
## ManufacturingProcess30 36.30
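# now the radial kernel, the 'curvy' generalization mentioned above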
SVM_model <- train(
Yield ~ ., data = train_df, method = "svmRadial",
center = TRUE,
scale = TRUE,
trControl = trainControl(method = "cv"),
tuneLength = 25
)
SVM_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 132, 128, 131, 130, 128, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7534920 0.5201546 0.6211455
## 0.50 0.6907749 0.5705705 0.5637796
## 1.00 0.6449411 0.6221761 0.5204910
## 2.00 0.6251292 0.6433012 0.4999772
## 4.00 0.6175964 0.6504017 0.4898894
## 8.00 0.6086086 0.6636794 0.4820380
## 16.00 0.6086086 0.6636794 0.4820380
## 32.00 0.6086086 0.6636794 0.4820380
## 64.00 0.6086086 0.6636794 0.4820380
## 128.00 0.6086086 0.6636794 0.4820380
## 256.00 0.6086086 0.6636794 0.4820380
## 512.00 0.6086086 0.6636794 0.4820380
## 1024.00 0.6086086 0.6636794 0.4820380
## 2048.00 0.6086086 0.6636794 0.4820380
## 4096.00 0.6086086 0.6636794 0.4820380
## 8192.00 0.6086086 0.6636794 0.4820380
## 16384.00 0.6086086 0.6636794 0.4820380
## 32768.00 0.6086086 0.6636794 0.4820380
## 65536.00 0.6086086 0.6636794 0.4820380
## 131072.00 0.6086086 0.6636794 0.4820380
## 262144.00 0.6086086 0.6636794 0.4820380
## 524288.00 0.6086086 0.6636794 0.4820380
## 1048576.00 0.6086086 0.6636794 0.4820380
## 2097152.00 0.6086086 0.6636794 0.4820380
## 4194304.00 0.6086086 0.6636794 0.4820380
##
## Tuning parameter 'sigma' was held constant at a value of 0.01625812
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01625812 and C = 8.
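# visualize the tuning profile (sketch); a log-2 x-axis keeps the doubling
# grid of C values readable
plot(SVM_model, scales = list(x = list(log = 2)))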
SVM_predictions <- predict(SVM_model, chemical_manufacturing_process_testing_df)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = chemical_manufacturing_process_testing_df$Yield))) %>%
mutate("Model"= "SVM")
results
## RMSE Rsquared MAE Model
## 1 0.6300514 0.5704346 0.4876792 SVM
# The radial-kernel SVM gives the best test-set r-squared at about .57, ahead of KNN at .45 and the linear-kernel SVM at .33. The radial kernel's extra flexibility clearly pays off on this data.
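# a cleaner pattern (sketch): SVM_model and SVM_predictions were overwritten
# above, so only the KNN and radial-SVM fits are still in memory; keeping each
# fit under its own name makes comparisons like this a one-liner
models <- list(KNN = knn_model, SVM_radial = SVM_model)
sapply(models, function(m) {
  postResample(pred = predict(m, chemical_manufacturing_process_testing_df),
               obs = chemical_manufacturing_process_testing_df$Yield)
})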
# Here are the most important predictors, led by ManufacturingProcess13, ManufacturingProcess32, and ManufacturingProcess17. (This is the model-free loess importance, so it matches the ranking shown for the linear SVM above.)
varImp(SVM_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 90.67
## ManufacturingProcess17 86.79
## ManufacturingProcess09 76.75
## BiologicalMaterial06 74.19
## ManufacturingProcess36 67.41
## BiologicalMaterial03 65.91
## BiologicalMaterial02 63.73
## ManufacturingProcess06 59.73
## ManufacturingProcess31 57.62
## BiologicalMaterial12 56.62
## ManufacturingProcess11 47.48
## ManufacturingProcess29 46.17
## BiologicalMaterial11 43.51
## ManufacturingProcess33 43.22
## ManufacturingProcess02 42.44
## BiologicalMaterial04 41.14
## BiologicalMaterial01 40.37
## BiologicalMaterial08 39.74
## ManufacturingProcess30 36.30
# The top predictor, ManufacturingProcess13, scores 100 only because varImp rescales importances so the strongest variable gets 100; it is not literally a perfect relationship with the response. Still, if it turned out to be a tautological sort of variable, like predicting weight with kilograms, then learning more about the dataset might justify removing it.
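# a quick look (sketch, using the imputed/standardized data): plot the top
# predictor against Yield with a loess smooth to see the relationship behind
# its importance score
ggplot(ChemicalManufacturingProcess_predictions,
       aes(x = ManufacturingProcess13, y = Yield)) +
  geom_point() +
  geom_smooth(method = "loess")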