Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_8
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
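For intuition about what that function produces, here is a minimal base-R sketch that generates data directly from the equation (the names n, X, and y are illustrative, not part of the assignment code; the exact values will differ from mlbench's):

set.seed(200)
n <- 200
X <- matrix(runif(n * 10), ncol = 10)  # ten U(0,1) predictors; X6-X10 are non-informative
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
10 * X[, 4] + 5 * X[, 5] + rnorm(n, sd = 1)  # N(0, 1) noise

The mlbench call itself follows: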
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y) ## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
knnModel <- train(x = trainingData$x, y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
stopCluster(cl)
# Model results
plot(knnModel)

summary(knnModel$finalModel)  # reconstructed call; the listing below is its output
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 10 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
varImp(knnModel)  # reconstructed call; loess R^2 is caret's model-free importance
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
(knnPerf <- postResample(pred = knnPred, obs = testData$y))
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
.decay = c(0, 0.01, .1),  # weight-decay regularization
.size = c(1:10),          # number of hidden units
## The next option is to use bagging (see the
## next chapter) instead of different random
## seeds.
.bag = FALSE)
ctrl <- trainControl(method='cv', number=10)
nnetModel <- train(trainingData$x, trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,  # weight cap for the largest net: size*(p+1) + size + 1 with size = 10
maxit = 500)
stopCluster(cl)
# Model results
plot(nnetModel)

summary(nnetModel$finalModel)  # reconstructed call
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 10 -none- character
## terms 3 terms call
## coefnames 10 -none- character
## xlevels 0 -none- list
## xNames 10 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 4 -none- list
varImp(nnetModel)  # reconstructed call
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
nnetPred <- predict(nnetModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
(nnetPerf <- postResample(pred = nnetPred, obs = testData$y))
## RMSE Rsquared MAE
## 1.6644075 0.8905623 1.2423150
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmModel <- train(trainingData$x, trainingData$y,
method = "svmRadial", # svmRadial, svmLinear, svmPoly
tuneLength = 14,
trControl = trainControl(method = "cv"))
stopCluster(cl)
# Model results
plot(svmModel)

summary(svmModel$finalModel)  # reconstructed call
## Length Class Mode
## 1 ksvm S4
varImp(svmModel)  # reconstructed call
## loess r-squared variable importance
##
## Overall
## X4 100.0000
## X1 95.5047
## X2 89.6186
## X5 45.2170
## X3 29.9330
## X9 6.3299
## X10 5.5182
## X8 3.2527
## X6 0.8884
## X7 0.0000
svmPred <- predict(svmModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
(svmPerf <- postResample(pred = svmPred, obs = testData$y))
## RMSE Rsquared MAE
## 2.0893300 0.8232353 1.5877434
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)  # degree = max interaction level; nprune = max terms retained after pruning
marsModel <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
stopCluster(cl)
# Model results
plot(marsModel)

summary(marsModel$finalModel)  # reconstructed call
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=14)
##
## coefficients
## (Intercept) 21.905319
## h(0.621722-X1) -15.726181
## h(X1-0.621722) 9.234027
## h(0.601063-X2) -18.253527
## h(X2-0.601063) 10.448545
## h(0.447442-X3) 9.700589
## h(X3-0.606015) 12.674694
## h(0.734892-X4) -9.863526
## h(X4-0.734892) 10.297964
## h(0.850094-X5) -5.324175
## h(0.218266-X1) * h(X2-0.601063) -55.358726
## h(X1-0.218266) * h(X2-0.601063) -29.100250
## h(X1-0.621722) * h(X2-0.295997) -26.833129
## h(0.649253-X1) * h(0.601063-X2) 27.120721
##
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945 RSS 225.8601 GRSq 0.9338437 RSq 0.953688
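Each h(·) above is a hinge function, \(h(u) = \max(0, u)\), so for example h(0.621722 − X1) is active only when X1 < 0.621722; MARS builds its fit from these piecewise-linear basis functions and, at degree 2, their pairwise products.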
varImp(marsModel)  # reconstructed call
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.74
## X5 15.53
## X3 0.00
marsPred <- predict(marsModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
(marsPerf <- postResample(pred = marsPred, obs = testData$y))
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
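The table below stacks the postResample results for a side-by-side comparison; a one-line sketch of how it can be built:

rbind(nnetPerf, marsPerf, svmPerf, knnPerf)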
## RMSE Rsquared MAE
## nnetPerf 1.664408 0.8905623 1.2423150
## marsPerf 1.172263 0.9448890 0.9324923
## svmPerf 2.089330 0.8232353 1.5877434
## knnPerf 3.204059 0.6819919 2.5683461
MARS clearly gives the best test-set performance (lowest RMSE and MAE, highest \(R^2\)), followed by the averaged neural network, the SVM, and finally KNN. All of the models identified X4, X1, X2, X5, and X3 as the top five features. The only difference is that MARS ranked X1 first and X4 second, while the loess-based importance reported for the other models put X4 first and X1 second. MARS did select the informative predictors: its final model used 5 of the 10 predictors, keeping X1-X5 and leaving X6-X10 unused.
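As a quick confirmation, caret's predictors() accessor lists the variables the final MARS model actually uses (this check is not part of the original output; the expected result is inferred from the earth summary above):

predictors(marsModel)  # expected: X1-X5 only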
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
# NOTE: Code copied from my Homework #7
data("ChemicalManufacturingProcess")
cmp <- as_tibble(ChemicalManufacturingProcess)
x_raw <- cmp[,2:58]
y_raw <- as.matrix(cmp$Yield)
print(paste(nrow(x_raw), ncol(x_raw)))
## [1] "176 57"
print(paste(nrow(y_raw), ncol(y_raw)))
## [1] "176 1"
# Various NA plots to inspect data
knitr::kable(miss_var_summary(cmp) %>% filter(n_miss > 0),
caption = 'Missing Values',
format="html",
table.attr="style='width:50%;'") %>%
kableExtra::kable_styling()

| variable | n_miss | pct_miss |
|---|---|---|
| ManufacturingProcess03 | 15 | 8.5227273 |
| ManufacturingProcess11 | 10 | 5.6818182 |
| ManufacturingProcess10 | 9 | 5.1136364 |
| ManufacturingProcess25 | 5 | 2.8409091 |
| ManufacturingProcess26 | 5 | 2.8409091 |
| ManufacturingProcess27 | 5 | 2.8409091 |
| ManufacturingProcess28 | 5 | 2.8409091 |
| ManufacturingProcess29 | 5 | 2.8409091 |
| ManufacturingProcess30 | 5 | 2.8409091 |
| ManufacturingProcess31 | 5 | 2.8409091 |
| ManufacturingProcess33 | 5 | 2.8409091 |
| ManufacturingProcess34 | 5 | 2.8409091 |
| ManufacturingProcess35 | 5 | 2.8409091 |
| ManufacturingProcess36 | 5 | 2.8409091 |
| ManufacturingProcess02 | 3 | 1.7045455 |
| ManufacturingProcess06 | 2 | 1.1363636 |
| ManufacturingProcess01 | 1 | 0.5681818 |
| ManufacturingProcess04 | 1 | 0.5681818 |
| ManufacturingProcess05 | 1 | 0.5681818 |
| ManufacturingProcess07 | 1 | 0.5681818 |
| ManufacturingProcess08 | 1 | 0.5681818 |
| ManufacturingProcess12 | 1 | 0.5681818 |
| ManufacturingProcess14 | 1 | 0.5681818 |
| ManufacturingProcess22 | 1 | 0.5681818 |
| ManufacturingProcess23 | 1 | 0.5681818 |
| ManufacturingProcess24 | 1 | 0.5681818 |
| ManufacturingProcess40 | 1 | 0.5681818 |
| ManufacturingProcess41 | 1 | 0.5681818 |
# Impute missing values using KNN (knn.impute from the bnstruct package)
x_imputed <- knn.impute(as.matrix(x_raw), k=10)
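If the bnstruct dependency is undesirable, caret's preProcess offers a comparable route; note that, unlike knn.impute, caret's knnImpute also centers and scales the predictors as a side effect. A sketch of that alternative (x_imputed_alt is an illustrative name):

pp <- preProcess(as.data.frame(x_raw), method = "knnImpute", k = 10)
x_imputed_alt <- predict(pp, as.data.frame(x_raw))  # imputed AND standardized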
# Check for columns with little variance - candidate to drop
lowVariance <- nearZeroVar(x_imputed, names = TRUE)
head(lowVariance)
## [1] "BiologicalMaterial07"
lowVariance <- nearZeroVar(x_imputed)
x_lowvar <- x_imputed[,-lowVariance]
# Deal with outliers ... impute to median
x_outliers <- outlieR::impute(x_lowvar, fill='median')
# Find and drop high correlation features
correlations <- cor(x_outliers)
highCorr <- findCorrelation(correlations, names=TRUE, cutoff=0.9)
(highCorr)
## [1] "BiologicalMaterial02" "ManufacturingProcess26" "BiologicalMaterial04"
## [4] "BiologicalMaterial12" "ManufacturingProcess11" "ManufacturingProcess27"
## [7] "ManufacturingProcess44" "ManufacturingProcess40"
# Drop the highly correlated columns; x_transf feeds the train/test split below
x_transf <- x_outliers[, !(colnames(x_outliers) %in% highCorr)]
# get training/test split
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training datasets
trainX <- x_transf[trainingRows,]
trainY <- y_raw[trainingRows]
# put remaining rows into the test sets
testX <- x_transf[-trainingRows,]
testY <- y_raw[-trainingRows]
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Yield <- trainY

set.seed(42424)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
.decay = c(0, 0.01, .1),
.size = c(1:10),
## The next option is to use bagging (see the
## next chapter) instead of different random
## seeds.
.bag = FALSE)
ctrl <- trainControl(method='cv', number=10)
nnetTune <- train(trainX, trainY,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainX) + 1) + 10 + 1,
maxit = 500)
stopCluster(cl)
# Model results
plot(nnetTune)

summary(nnetTune$finalModel)  # reconstructed call
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 48 -none- character
## terms 3 terms call
## coefnames 48 -none- character
## xlevels 0 -none- list
## xNames 48 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 4 -none- list
varImp(nnetTune)  # reconstructed call
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 89.09
## ManufacturingProcess13 79.04
## BiologicalMaterial03 77.63
## ManufacturingProcess09 75.31
## ManufacturingProcess31 74.87
## ManufacturingProcess36 72.14
## ManufacturingProcess17 69.93
## ManufacturingProcess06 62.10
## ManufacturingProcess30 55.24
## ManufacturingProcess33 53.74
## BiologicalMaterial11 46.95
## BiologicalMaterial01 40.10
## ManufacturingProcess18 38.18
## BiologicalMaterial09 36.04
## ManufacturingProcess25 35.93
## BiologicalMaterial08 35.81
## ManufacturingProcess20 34.31
## ManufacturingProcess29 34.04
## ManufacturingProcess15 33.15
modelPred <- predict(nnetTune, newdata=testX)
modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(nnet_values <- defaultSummary(modelValues))
## RMSE Rsquared MAE
## 1.2146853 0.6426541 0.8962163
set.seed(42424)
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
marsTune <- train(trainX, trainY,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
stopCluster(cl)
# Model Results
plot(marsTune)

summary(marsTune$finalModel)  # reconstructed call
## Call: earth(x=matrix[144,48], y=c(42.44,42.03,4...), keepxy=TRUE, degree=1,
## nprune=3)
##
## coefficients
## (Intercept) 39.647909
## h(0.553668-ManufacturingProcess09) -1.037195
## h(ManufacturingProcess32- -0.813021) 1.323398
##
## Selected 3 of 21 terms, and 2 of 48 predictors (nprune=3)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GCV 1.369288 RSS 183.7224 GRSq 0.5808792 RSq 0.6039985
varImp(marsTune)  # reconstructed call
## earth variable importance
##
## Overall
## ManufacturingProcess32 100
## ManufacturingProcess09 0
modelPred <- predict(marsTune, newdata=testX)
modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(mars_values <- defaultSummary(modelValues))
## RMSE Rsquared MAE
## 1.3016929 0.5974956 1.0242889
set.seed(42424)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmRTune <- train(trainX, trainY,
method = "svmRadial", # svmRadial, svmLinear, svmPoly
tuneLength = 14,
trControl = trainControl(method = "cv"))
stopCluster(cl)
# Model Results
plot(svmRTune)

summary(svmRTune$finalModel)  # reconstructed call
## Length Class Mode
## 1 ksvm S4
varImp(svmRTune)  # reconstructed call
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 89.09
## ManufacturingProcess13 79.04
## BiologicalMaterial03 77.63
## ManufacturingProcess09 75.31
## ManufacturingProcess31 74.87
## ManufacturingProcess36 72.14
## ManufacturingProcess17 69.93
## ManufacturingProcess06 62.10
## ManufacturingProcess30 55.24
## ManufacturingProcess33 53.74
## BiologicalMaterial11 46.95
## BiologicalMaterial01 40.10
## ManufacturingProcess18 38.18
## BiologicalMaterial09 36.04
## ManufacturingProcess25 35.93
## BiologicalMaterial08 35.81
## ManufacturingProcess20 34.31
## ManufacturingProcess29 34.04
## ManufacturingProcess15 33.15
modelPred <- predict(svmRTune, newdata=testX)
modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(svm_values <- defaultSummary(modelValues))
## RMSE Rsquared MAE
## 1.146482 0.713360 0.859768
set.seed(42424)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
knnTune <- train(trainX, trainY,
method = "knn",
# Center and scaling will occur for new predictions too
preProc = c("center", "scale"),
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
stopCluster(cl)
# Model Results
plot(knnTune)

summary(knnTune$finalModel)  # reconstructed call
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 48 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
varImp(knnTune)  # reconstructed call
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 89.09
## ManufacturingProcess13 79.04
## BiologicalMaterial03 77.63
## ManufacturingProcess09 75.31
## ManufacturingProcess31 74.87
## ManufacturingProcess36 72.14
## ManufacturingProcess17 69.93
## ManufacturingProcess06 62.10
## ManufacturingProcess30 55.24
## ManufacturingProcess33 53.74
## BiologicalMaterial11 46.95
## BiologicalMaterial01 40.10
## ManufacturingProcess18 38.18
## BiologicalMaterial09 36.04
## ManufacturingProcess25 35.93
## BiologicalMaterial08 35.81
## ManufacturingProcess20 34.31
## ManufacturingProcess29 34.04
## ManufacturingProcess15 33.15
modelPred <- predict(knnTune, newdata=testX)
modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(knn_values <- defaultSummary(modelValues))
## RMSE Rsquared MAE
## 1.4621971 0.4875616 1.0596875
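As before, the combined test-set results below can be produced by row-binding the defaultSummary outputs; a minimal sketch:

rbind(nnet_values, mars_values, svm_values, knn_values)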
## RMSE Rsquared MAE
## nnet_values 1.214685 0.6426541 0.8962163
## mars_values 1.301693 0.5974956 1.0242889
## svm_values 1.146482 0.7133600 0.8597680
## knn_values 1.462197 0.4875616 1.0596875
SVM (radial kernel) appears to provide the best RMSE and \(R^2\) of the tuned NNET, MARS, SVM and KNN models, with NNET a close second.
ManufacturingProcess* features dominate the most important features for all models. The top 10 predictors are identical for NNET, SVM and KNN (which share the same loess-based importance); only MARS had a different list, retaining just ManufacturingProcess32 and ManufacturingProcess09. ManufacturingProcess32 ranks first for every model, and ManufacturingProcess13, ManufacturingProcess09, and BiologicalMaterial06 all sit in the top 10 for the other three.
Referring back to Problem 6.3 in Homework #7, these same four features also showed up as top predictors using varImp(). However, the non-linear models improved on the RMSE and \(R^2\) values of their linear counterparts.
df <- cmp %>%
dplyr::select(c('ManufacturingProcess32', 'ManufacturingProcess13', 'ManufacturingProcess09', 'BiologicalMaterial06', 'Yield'))
caret::featurePlot(x = df[, 1:4], y = df$Yield, plot = "pairs", pch = 20)
Since we don't actually know what each process physically maps to in the real world, it's a little hard to apply intuition; I don't know how ManufacturingProcess32 differs from ManufacturingProcess09. That said, we do see clear patterns between the features and Yield, which we would expect given that these were the top predictors. Had we seen no visual relationships, a relationship might still have been detectable in the math, but it would have been concerning. ManufacturingProcess32, ManufacturingProcess09, and BiologicalMaterial06 show Yield increasing as the feature increases, while ManufacturingProcess13 shows Yield decreasing as the feature increases.
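Those directions can also be checked numerically; a quick sketch using simple correlations (the signs in the comments are what the plots suggest, not output from the original report):

# Correlation of each top predictor with Yield
cor(as.matrix(df[, 1:4]), df$Yield, use = "pairwise.complete.obs")
# expect positive signs for ManufacturingProcess32, ManufacturingProcess09, BiologicalMaterial06
# and a negative sign for ManufacturingProcess13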