library(ISLR)
data("Auto")
# Auto
?Auto
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative: origin, name
range(Auto$mpg)
## [1] 9.0 46.6
range(Auto$cylinders)
## [1] 3 8
range(Auto$displacement)
## [1] 68 455
range(Auto$horsepower)
## [1] 46 230
range(Auto$weight)
## [1] 1613 5140
range(Auto$acceleration)
## [1] 8.0 24.8
range(Auto$year)
## [1] 70 82
mean(Auto$mpg)
## [1] 23.44592
mean(Auto$cylinders)
## [1] 5.471939
mean(Auto$displacement)
## [1] 194.412
mean(Auto$horsepower)
## [1] 104.4694
mean(Auto$weight)
## [1] 2977.584
mean(Auto$acceleration)
## [1] 15.54133
mean(Auto$year)
## [1] 75.97959
sd(Auto$mpg)
## [1] 7.805007
sd(Auto$cylinders)
## [1] 1.705783
sd(Auto$displacement)
## [1] 104.644
sd(Auto$horsepower)
## [1] 38.49116
sd(Auto$weight)
## [1] 849.4026
sd(Auto$acceleration)
## [1] 2.758864
sd(Auto$year)
## [1] 3.683737
Auto_removed <- Auto[-(20:80),]
range(Auto_removed$mpg)
## [1] 11.0 46.6
range(Auto_removed$cylinders)
## [1] 3 8
range(Auto_removed$displacement)
## [1] 68 455
range(Auto_removed$horsepower)
## [1] 46 230
range(Auto_removed$weight)
## [1] 1649 4997
range(Auto_removed$acceleration)
## [1] 8.0 24.8
range(Auto_removed$year)
## [1] 70 82
mean(Auto_removed$mpg)
## [1] 24.21692
mean(Auto_removed$cylinders)
## [1] 5.401813
mean(Auto_removed$displacement)
## [1] 189.4411
mean(Auto_removed$horsepower)
## [1] 101.858
mean(Auto_removed$weight)
## [1] 2935.015
mean(Auto_removed$acceleration)
## [1] 15.6139
mean(Auto_removed$year)
## [1] 76.85498
sd(Auto_removed$mpg)
## [1] 7.823191
sd(Auto_removed$cylinders)
## [1] 1.665659
sd(Auto_removed$displacement)
## [1] 101.7607
sd(Auto_removed$horsepower)
## [1] 36.60106
sd(Auto_removed$weight)
## [1] 805.4751
sd(Auto_removed$acceleration)
## [1] 2.758818
sd(Auto_removed$year)
## [1] 3.323488
par(mfrow = c(2,3))
plot(Auto$weight, Auto$horsepower)
plot(Auto$displacement, Auto$weight)
plot(Auto$displacement, Auto$horsepower)
plot(Auto$horsepower, Auto$mpg)
plot(Auto$weight, Auto$mpg)
plot(Auto$displacement, Auto$mpg)
plot(Auto$year, Auto$mpg)
There are clear positive correlations between weight, displacement, and horsepower, indicating that larger engines tend to be more powerful and are found in heavier cars. Additionally, mpg is negatively associated with weight, horsepower, and displacement, suggesting that cars with larger engines and higher weight tend to have lower fuel efficiency. There also seems to be a positive relationship between year and mpg, suggesting that newer cars tend to have higher fuel efficiency.
Based on the plots in part (e), weight, horsepower, and displacement show clear negative relationships with mpg, while year exhibits a positive relationship with mpg. These may be useful in predicting mpg in a regression model.
library(ISLR)
data(College)
# data basic information
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
dim(College)
## [1] 777 18
# The column Apps is the response variable, and others may be treated as predictors.
# For instance, for linear regression model in R, you may use lm(Apps~.,data=College)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
hist(College$Apps) #right skew, use createDataPartition
set.seed(123)
trainIndex <- createDataPartition(College$Apps, p = 0.8, list = FALSE)
train <- College[trainIndex, ]
test <- College[-trainIndex, ]
set.seed(123)
svmTune <- train(Apps ~ ., data = train,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneLength = 14)
svmTune
## Support Vector Machines with Radial Basis Function Kernel
##
## 624 samples
## 17 predictor
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 624, 624, 624, 624, 624, 624, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2696.222 0.6647106 996.7393
## 0.50 2484.534 0.7015293 909.7269
## 1.00 2340.618 0.7266354 862.2084
## 2.00 2261.375 0.7399688 847.6386
## 4.00 2239.103 0.7434481 853.8454
## 8.00 2239.729 0.7422595 858.9893
## 16.00 2243.750 0.7413120 861.2244
## 32.00 2243.750 0.7413120 861.2244
## 64.00 2243.750 0.7413120 861.2244
## 128.00 2243.750 0.7413120 861.2244
## 256.00 2243.750 0.7413120 861.2244
## 512.00 2243.750 0.7413120 861.2244
## 1024.00 2243.750 0.7413120 861.2244
## 2048.00 2243.750 0.7413120 861.2244
##
## Tuning parameter 'sigma' was held constant at a value of 0.06196904
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06196904 and C = 4.
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
test$SVM <- predict(svmTune, newdata = test)
rmse(test$SVM, test$Apps)
## [1] 851.5477
The test RMSE is 851.5477.
indx <- createFolds(train$Apps, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
size = c(1, 3, 5, 7),
bag = FALSE)
set.seed(123)
nnetTune <- train(Apps ~ ., data = train,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProcess = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10000,
maxit = 1000,
allowParallel = FALSE)
nnetTune
## Model Averaged Neural Network
##
## 624 samples
## 17 predictor
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 561, 561, 562, 562, 563, 562, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2857.832 0.5721190 1647.7129
## 0.00 3 2096.280 0.7723060 1149.0270
## 0.00 5 1980.808 0.7808697 1093.2574
## 0.00 7 1780.631 0.8195406 974.2936
## 0.01 1 1843.555 0.8037330 1039.7315
## 0.01 3 1691.100 0.8410310 911.1555
## 0.01 5 1664.568 0.8432421 880.8077
## 0.01 7 1661.861 0.8425060 934.4921
## 0.10 1 1404.843 0.8970443 780.3200
## 0.10 3 1615.845 0.8541443 875.5145
## 0.10 5 1695.660 0.8327359 920.9192
## 0.10 7 1601.724 0.8550899 894.2645
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.
test$NNet <- predict(nnetTune, newdata = test)
rmse(test$NNet, test$Apps)
## [1] 1095.994
The test RMSE is 1095.994.
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTune <- train(Apps ~ ., data = train,
method = "earth",
tuneGrid = expand.grid(degree = 1, nprune = 2:38),
trControl = ctrl)
marsTune
## Multivariate Adaptive Regression Spline
##
## 624 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 561, 561, 562, 562, 563, 562, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 1578.070 0.8557717 731.3260
## 3 1562.095 0.8728635 643.0014
## 4 1210.672 0.9289767 536.7915
## 5 1201.513 0.9285990 536.7608
## 6 1183.260 0.9291783 543.9305
## 7 1171.882 0.9301260 547.0502
## 8 1177.996 0.9284477 554.3384
## 9 1183.033 0.9271378 567.8108
## 10 1188.996 0.9258380 575.2853
## 11 1207.093 0.9231389 584.4312
## 12 1208.557 0.9229228 584.7849
## 13 1217.859 0.9220090 590.2080
## 14 1217.859 0.9220090 590.2080
## 15 1217.859 0.9220090 590.2080
## 16 1217.859 0.9220090 590.2080
## 17 1217.859 0.9220090 590.2080
## 18 1217.859 0.9220090 590.2080
## 19 1217.859 0.9220090 590.2080
## 20 1217.859 0.9220090 590.2080
## 21 1217.859 0.9220090 590.2080
## 22 1217.859 0.9220090 590.2080
## 23 1217.859 0.9220090 590.2080
## 24 1217.859 0.9220090 590.2080
## 25 1217.859 0.9220090 590.2080
## 26 1217.859 0.9220090 590.2080
## 27 1217.859 0.9220090 590.2080
## 28 1217.859 0.9220090 590.2080
## 29 1217.859 0.9220090 590.2080
## 30 1217.859 0.9220090 590.2080
## 31 1217.859 0.9220090 590.2080
## 32 1217.859 0.9220090 590.2080
## 33 1217.859 0.9220090 590.2080
## 34 1217.859 0.9220090 590.2080
## 35 1217.859 0.9220090 590.2080
## 36 1217.859 0.9220090 590.2080
## 37 1217.859 0.9220090 590.2080
## 38 1217.859 0.9220090 590.2080
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 1.
test$MARS <- predict(marsTune, newdata = test)
rmse(test$MARS, test$Apps)
## [1] 895.8359
The test RMSE is 895.8359.
methods = c(rep('SVM', 153), rep('NNet', 153), rep('MARS', 153))
values = c(test$SVM, test$NNet, test$MARS)
aov = aov(values~ methods)
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## methods 2 4.552e+06 2276169 0.338 0.713
## Residuals 456 3.072e+09 6736363
The SVM, neural network, and MARS models produced test RMSE values of 851.5477, 1095.994, and 895.8359, respectively. The SVM model achieved the lowest test error, while the neural network had the highest error. The differences in performance, however, are not statistically significant (p = 0.713 > 0.05).
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
df <- ChemicalManufacturingProcess
yield <- df$Yield
processPredictors <- df[, -which(names(df) == "Yield")]
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imputedData <- mice(processPredictors, m = 1, method = 'pmm', seed = 123)
##
## iter imp variable
## 1 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 2 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 3 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 4 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 5 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## Warning: Number of logged events: 135
predictorsImputed <- complete(imputedData, 1)
# Split data
set.seed(123)
trainIndex2 <- createDataPartition(yield, p = 0.8, list = FALSE)
x_train <- predictorsImputed[trainIndex2, ]
x_test <- predictorsImputed[-trainIndex2, ]
y_train <- yield[trainIndex2]
y_test <- yield[-trainIndex2]
# Pre-process
nzv <- nearZeroVar(x_train)
x_train <- x_train[, -nzv]
x_test <- x_test[, -nzv]
preProc <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(preProc, x_train)
x_test <- predict(preProc, x_test)
# PLS
indx2 <- createFolds(x_train, returnTrain = TRUE)
ctrl2 <- trainControl(method = "cv", index = indx2)
set.seed(100)
plsTune <- train(x = x_train, y = y_train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:50),
trControl = ctrl2)
## Warning: model fit failed for Fold04: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold07: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold08: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold09: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold10: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "oscorespls") :
## Invalid number of components, ncomp
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
testResults <- data.frame(obs = y_test,
PLS = predict(plsTune, newdata = x_test))
# PCR
set.seed(100)
pcrTune <- train(x = x_train, y = y_train,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:50),
trControl = ctrl2)
## Warning: model fit failed for Fold04: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold07: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold08: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold09: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") :
## Invalid number of components, ncomp
## Warning: model fit failed for Fold10: ncomp=50 Error in pls::mvr(.outcome ~ ., data = dat, ncomp = ncomp, method = "svdpc") :
## Invalid number of components, ncomp
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
testResults$PCR <- predict(pcrTune, newdata = x_test)
# Ridge Regression
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 10))
set.seed(100)
ridgeTune <- train(x = x_train, y = y_train,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl2,
preProc = c("BoxCox"))
testResults$Ridge <- predict(ridgeTune, newdata = x_test)
# ENET
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
enetGrid <- expand.grid(lambda = c(0.0001, .001, 0.01, 0.1),
fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = x_train, y = y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl2,
preProc = c("BoxCox"))
testResults$ENET <- predict(enetTune, newdata = x_test)
# Best params
plsTune$bestTune
## ncomp
## 1 1
pcrTune$bestTune
## ncomp
## 8 8
ridgeTune$bestTune
## lambda
## 10 0.1
enetTune$bestTune
## fraction lambda
## 62 0.1 0.1
The optimal tuning parameters selected via cross-validation were:
PLS: ncomp = 1
PCR: ncomp = 8
Ridge: lambda = 0.1
Elastic Net: fraction = 0.1, lambda = 0.1
data.frame(rbind(PLS = postResample(pred = testResults$PLS, obs = y_test),
PCR = postResample(pred = testResults$PCR, obs = y_test),
Ridge = postResample(pred = testResults$Ridge, obs = y_test),
ENET = postResample(pred = testResults$ENET, obs = y_test)))
## RMSE Rsquared MAE
## PLS 1.443262 0.3718903 1.188921
## PCR 1.378808 0.4247325 1.135053
## Ridge 1.367931 0.4974944 1.176098
## ENET 1.473129 0.6171580 1.213936
resamps <- resamples(list(
PLS = plsTune,
PCR = pcrTune,
Ridge = ridgeTune,
ENET = enetTune
))
diffs <- diff(resamps)
summary(diffs)
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## MAE
## PLS PCR Ridge ENET
## PLS 0.01867 -0.97713 -0.52495
## PCR 0.9780062 -0.99580 -0.54362
## Ridge 0.0009615 0.0006877 0.38553
## ENET 0.0004226 0.0009247 0.0245615
##
## RMSE
## PLS PCR Ridge ENET
## PLS 0.03063 -3.02581 -0.53132
## PCR 0.4076167 -3.05644 -0.56195
## Ridge 0.0003462 0.0003714 2.24829
## ENET 0.0008961 0.0014629 0.0003873
##
## Rsquared
## PLS PCR Ridge ENET
## PLS -0.049987 0.191048 0.003871
## PCR 0.007344 0.241035 0.053858
## Ridge 0.000330 2.449e-05 -0.179140
## ENET 1.000000 0.398353 0.011166
Among the models evaluated, Ridge Regression achieved the lowest RMSE (1.368), Elastic Net achieved the highest R2 (0.617), and PCR achieved the lowest MAE (1.135). Based on the resampling comparisons, Ridge Regression shows a statistically significant superior performance in all three metrics compared to the other models. Although ENET achieved the highest R2, its differences are less consistent, with some cases not being statistically significant. Therefore, Ridge Regression has the best predictive ability in this problem.
ridgeImp <- varImp(ridgeTune, scale = FALSE)
plot(ridgeImp, top = 25)
The top 25 most important predictors are shown above. There are more manufacturing predictors than biological represented (15 vs 10), suggesting that manufacturing processes may contribute more to the final yield than the quality of the raw materials.
par(mfrow = c(2,3))
plot(x_train[, "ManufacturingProcess32"], y_train)
plot(x_train[, "BiologicalMaterial05"], y_train)
plot(x_train[, "ManufacturingProcess36"], y_train)
plot(x_train[, "BiologicalMaterial03"], y_train)
plot(x_train[, "ManufacturingProcess13"], y_train)
Based on the plots of the top 5 most important predictors, ManufacturingProcess32 and BiologicalMaterial03 show a clear positive linear relationship with yield, so optimizing these factors further during future runs could improve yield. On the other hand, ManufacturingProcess13 and ManufacturingProcess36 have a negative relationship with yield, so limiting these processes in future runs can help mitigate reductions in yield.