In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two, but they consist of many parts. Please submit a link to your Rpubs page and submit the .rmd file as well.
library(lattice)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug. The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
library(AppliedPredictiveModeling)
data(permeability)
Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
##
## R2
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
glimpse(fingerprints_filtered)
## num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...
388 predictors remain for modeling.
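As a quick sanity check (a small sketch using only objects already created above), the total, removed, and remaining predictor counts can be pulled directly:
# total predictors, near-zero-variance predictors flagged by nearZeroVar, and predictors kept
c(total = ncol(fingerprints),
  removed = length(nearZeroVar(fingerprints)),
  kept = ncol(fingerprints_filtered))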
For the training set, I will use the first 110 observations, and use the rest for my test set:
train_set_x_df <- as.data.frame(fingerprints_filtered[1:110,])
train_set_x <- fingerprints_filtered[1:110,]
test_set_x <- fingerprints_filtered[111:165,]
train_set_y_df <- as.data.frame(permeability[1:110, , drop = FALSE]) # drop = FALSE keeps the "permeability" column name
train_set_y <- permeability[1:110,]
test_set_y <- permeability[111:165,]
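As an aside, a random, stratified split could be used instead of taking the first 110 rows. Below is a minimal sketch with caret::createDataPartition; the object in_train is hypothetical and is not used in the rest of the analysis.
# hypothetical stratified split on the permeability response; not used below
set.seed(100)
in_train <- createDataPartition(permeability[, 1], p = 110/165, list = FALSE)
# train_set_x <- fingerprints_filtered[in_train, ]; test_set_x <- fingerprints_filtered[-in_train, ]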
Let’s take a look at the original data:
### Some initial plots of the data
xyplot(train_set_y_df$permeability ~ train_set_x_df$X4,
ylab = "permeability",
xlab = "X4")
xyplot(train_set_y_df$permeability ~ train_set_x_df$X20,
ylab = "permeability",
xlab = "X20")
xyplot(train_set_y_df$permeability ~ train_set_x_df$X72,
ylab = "permeability",
xlab = "X72")
Since each predictor variable can only be 0 or 1, I don’t think any transformation of the original data is needed before modeling (beyond what the train function applies).
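To back that up, here is a quick one-line check that the filtered predictors really are all binary:
# TRUE if every value in the filtered fingerprint matrix is 0 or 1
all(fingerprints_filtered %in% c(0, 1))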
Tune a PLS model:
set.seed(100)
indx <- createFolds(train_set_y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(100)
plsTune <- train(x = train_set_x, y = train_set_y,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl)
plsTune
## Partial Least Squares
##
## 110 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.71395 0.2143849 8.899378
## 2 10.90826 0.3182098 8.069260
## 3 10.55456 0.3742090 7.866986
## 4 10.88378 0.4246992 8.081647
## 5 11.05445 0.4314871 8.002925
## 6 11.42425 0.4006607 8.315823
## 7 11.69288 0.3700492 8.662212
## 8 12.00423 0.3696711 8.824582
## 9 12.39296 0.3608033 9.275735
## 10 12.76420 0.3300089 9.392145
## 11 13.20669 0.3160044 9.568009
## 12 13.60859 0.3020311 9.879736
## 13 14.10546 0.2954876 10.182043
## 14 14.30560 0.2828452 10.346977
## 15 14.52335 0.2852085 10.341159
## 16 14.87430 0.2768254 10.578734
## 17 15.32083 0.2652363 10.664805
## 18 15.48298 0.2642017 10.762865
## 19 15.64720 0.2554556 10.901658
## 20 15.86152 0.2441421 11.050867
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
xyplot(RMSE ~ ncomp,
data = plsResamples,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o", "g"))
## Warning in draw.key(simpleKey(...), draw = FALSE): not enough rows for columns
plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
The chosen ncomp is 3, with a cross-validated Rsquared of 0.3742090 and RMSE of 10.55456.
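These values can also be pulled programmatically from the caret object rather than read off the table:
# best tuning parameter and its cross-validated performance
plsTune$bestTune
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)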
# Response:
PLS <- predict(plsTune, test_set_x)
glimpse(PLS)
## num [1:55] 29.2 14.7 29.2 12.3 33.6 ...
# Residuals:
PLS - test_set_y
## 111 112 113 114 115 116
## -12.8431093 9.1025829 -12.8431093 -23.0894055 1.9238237 -20.7571261
## 117 118 119 120 121 122
## -20.6841704 -17.4246662 -1.9557921 18.4131846 6.0388850 -18.8703852
## 123 124 125 126 127 128
## -18.7344415 -18.8703852 -18.3711135 10.4383768 9.6481709 0.4657709
## 129 130 131 132 133 134
## 20.7650795 9.4515256 -1.1937257 3.2565552 26.9477790 4.8903566
## 135 136 137 138 139 140
## 9.1354341 5.2938229 -1.3677178 -0.3678805 21.6851413 2.3322837
## 141 142 143 144 145 146
## -16.7250434 -25.7931272 5.2381332 -9.5740983 1.0494965 8.7029740
## 147 148 149 150 151 152
## 0.9474552 4.6237907 14.3518502 -2.8698006 -0.2320333 8.4093749
## 153 154 155 156 157 158
## 8.5223391 3.5816587 4.1149679 24.5950545 -1.8878850 8.4607057
## 159 160 161 162 163 164
## 10.6129820 7.5067071 14.3747168 8.6988405 9.0620765 -12.6505146
## 165
## 0.7935953
# Sum of Squares Total and Error
sst <- sum((test_set_y - mean(test_set_y))^2)
sse <- sum((PLS - test_set_y)^2)
# R squared
rsq <- 1 - sse / sst
rsq
## [1] 0.5711109
The test-set Rsquared value is 0.5711109.
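The same test-set summary can also be computed with caret's postResample, which returns RMSE, Rsquared, and MAE in one call (note it computes Rsquared as a squared correlation, so it can differ slightly from the formula above):
# RMSE, Rsquared (squared correlation), and MAE on the test set
postResample(pred = PLS, obs = test_set_y)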
Let’s try PCR:
set.seed(100)
pcrTune <- train(x = train_set_x, y = train_set_y,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:35),
trControl = ctrl)
pcrTune
## Principal Component Analysis
##
## 110 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.27574 0.09107850 9.262943
## 2 12.35470 0.08509332 9.338148
## 3 12.44100 0.05045470 9.390560
## 4 12.43995 0.06207247 9.419897
## 5 12.66908 0.08470903 9.414710
## 6 11.68133 0.19594032 8.589011
## 7 11.11128 0.25301878 8.319281
## 8 11.07580 0.25792210 8.262428
## 9 10.91109 0.29986383 8.029309
## 10 10.85697 0.31411262 8.058591
## 11 10.92693 0.31881653 8.278232
## 12 11.01372 0.32158958 8.310311
## 13 10.79324 0.35087582 8.019884
## 14 10.74359 0.35385391 8.030234
## 15 10.72466 0.35960927 7.977698
## 16 10.69193 0.35642982 8.061796
## 17 10.77312 0.34514408 8.166548
## 18 10.92369 0.34100319 8.257080
## 19 11.03201 0.32934915 8.203110
## 20 10.90737 0.35724430 8.085212
## 21 10.91733 0.39358365 8.223455
## 22 10.83154 0.38352691 8.123252
## 23 10.88953 0.39775459 8.079413
## 24 10.82934 0.44090555 7.975673
## 25 10.94696 0.43007179 7.989442
## 26 10.89051 0.43106957 7.984514
## 27 11.02230 0.38882073 8.051817
## 28 11.08063 0.39012608 8.096570
## 29 11.17963 0.39735058 8.159222
## 30 11.19468 0.40090978 8.281390
## 31 11.17683 0.40034588 8.292896
## 32 11.39660 0.39273512 8.529296
## 33 11.47548 0.38231032 8.540921
## 34 11.59752 0.37891499 8.658306
## 35 11.61107 0.37625579 8.589752
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
xyplot(RMSE ~ ncomp,
data = pcrResamples,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o", "g"))
## Warning in draw.key(simpleKey(...), draw = FALSE): not enough rows for columns
The chosen ncomp is 16, with a cross-validated Rsquared of 0.35642982 and RMSE of 10.69193, slightly worse than the PLS model's cross-validated results.
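PCR was not scored on the test set above; as a sketch, the same hold-out Rsquared calculation used for PLS could be applied to it for a like-for-like comparison:
# test-set predictions and Rsquared for the tuned PCR model
PCR <- predict(pcrTune, test_set_x)
1 - sum((PCR - test_set_y)^2) / sum((test_set_y - mean(test_set_y))^2)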
Since I don’t know much about the variables, I will also try elastic net:
enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = train_set_x, y = train_set_y,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enetTune
## Elasticnet
##
## 110 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 2.167554e+18 0.2760875 9.400723e+17
## 0.00 0.10 4.329761e+18 0.2924141 1.878416e+18
## 0.00 0.15 6.491967e+18 0.2806316 2.816760e+18
## 0.00 0.20 8.654174e+18 0.2677772 3.755104e+18
## 0.00 0.25 1.081638e+19 0.2641035 4.693448e+18
## 0.00 0.30 1.297859e+19 0.2556734 5.631792e+18
## 0.00 0.35 1.514080e+19 0.2527739 6.570135e+18
## 0.00 0.40 1.730300e+19 0.2474795 7.508479e+18
## 0.00 0.45 1.946521e+19 0.2343342 8.446823e+18
## 0.00 0.50 2.162742e+19 0.2188636 9.385167e+18
## 0.00 0.55 2.378962e+19 0.2150393 1.032351e+19
## 0.00 0.60 2.595183e+19 0.2082588 1.126185e+19
## 0.00 0.65 2.811404e+19 0.2048796 1.220020e+19
## 0.00 0.70 3.027625e+19 0.2055291 1.313854e+19
## 0.00 0.75 3.243845e+19 0.2123844 1.407689e+19
## 0.00 0.80 3.460066e+19 0.2125866 1.501523e+19
## 0.00 0.85 3.676287e+19 0.2048431 1.595357e+19
## 0.00 0.90 3.892507e+19 0.2044992 1.689192e+19
## 0.00 0.95 4.108728e+19 0.2033647 1.783026e+19
## 0.00 1.00 4.324949e+19 0.2132726 1.876861e+19
## 0.01 0.05 4.158780e+01 0.4166435 2.427016e+01
## 0.01 0.10 7.223851e+01 0.3853051 4.090022e+01
## 0.01 0.15 1.022350e+02 0.3684834 5.699183e+01
## 0.01 0.20 1.323981e+02 0.3515061 7.298976e+01
## 0.01 0.25 1.618568e+02 0.3205087 8.867886e+01
## 0.01 0.30 1.911846e+02 0.3015848 1.042170e+02
## 0.01 0.35 2.204765e+02 0.2854697 1.196965e+02
## 0.01 0.40 2.497468e+02 0.2650762 1.351681e+02
## 0.01 0.45 2.788467e+02 0.2512959 1.507699e+02
## 0.01 0.50 3.078696e+02 0.2458830 1.664644e+02
## 0.01 0.55 3.368453e+02 0.2443749 1.821260e+02
## 0.01 0.60 3.658975e+02 0.2408552 1.978034e+02
## 0.01 0.65 3.949548e+02 0.2382542 2.134825e+02
## 0.01 0.70 4.239470e+02 0.2373392 2.291241e+02
## 0.01 0.75 4.521839e+02 0.2391018 2.439440e+02
## 0.01 0.80 4.804139e+02 0.2407181 2.587239e+02
## 0.01 0.85 5.082353e+02 0.2447778 2.733086e+02
## 0.01 0.90 5.359682e+02 0.2485837 2.878741e+02
## 0.01 0.95 5.638551e+02 0.2544388 3.029382e+02
## 0.01 1.00 5.933428e+02 0.2600142 3.186979e+02
## 0.10 0.05 1.067231e+01 0.3808896 7.790510e+00
## 0.10 0.10 1.000326e+01 0.4372548 6.987894e+00
## 0.10 0.15 1.018177e+01 0.4280656 7.295591e+00
## 0.10 0.20 1.047609e+01 0.4211830 7.594456e+00
## 0.10 0.25 1.076066e+01 0.4134969 7.784750e+00
## 0.10 0.30 1.109332e+01 0.3991958 8.005429e+00
## 0.10 0.35 1.142410e+01 0.3849210 8.245894e+00
## 0.10 0.40 1.172818e+01 0.3692020 8.458649e+00
## 0.10 0.45 1.194468e+01 0.3579474 8.616301e+00
## 0.10 0.50 1.212975e+01 0.3518741 8.725441e+00
## 0.10 0.55 1.229628e+01 0.3466773 8.822736e+00
## 0.10 0.60 1.245523e+01 0.3412078 8.927142e+00
## 0.10 0.65 1.262356e+01 0.3361084 9.013968e+00
## 0.10 0.70 1.280228e+01 0.3315977 9.115648e+00
## 0.10 0.75 1.298436e+01 0.3269108 9.244204e+00
## 0.10 0.80 1.316117e+01 0.3224074 9.384807e+00
## 0.10 0.85 1.333322e+01 0.3172381 9.509273e+00
## 0.10 0.90 1.349812e+01 0.3118688 9.619737e+00
## 0.10 0.95 1.365637e+01 0.3068700 9.730565e+00
## 0.10 1.00 1.382811e+01 0.3017281 9.850480e+00
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.1.
plot(enetTune)
Enet <- predict(enetTune, test_set_x)
Enet
## 111 112 113 114 115 116 117 118
## 19.317715 16.716305 19.317715 12.554632 31.613597 16.766909 22.636263 31.205816
## 119 120 121 122 123 124 125 126
## 22.228481 28.039977 43.963613 26.663922 18.909933 26.663922 18.909933 22.177877
## 127 128 129 130 131 132 133 134
## 1.533303 7.330949 17.429763 17.429763 7.093060 10.614248 15.796527 4.958225
## 135 136 137 138 139 140 141 142
## 5.197908 6.740159 4.683686 8.981011 14.120167 4.193123 31.613597 19.062643
## 143 144 145 146 147 148 149 150
## 30.645324 19.623995 7.330949 13.215758 7.330949 16.789378 7.754186 1.533303
## 151 152 153 154 155 156 157 158
## 6.994874 9.341670 3.519439 2.733053 3.519439 21.610132 3.519439 3.519439
## 159 160 161 162 163 164 165
## 10.614248 7.330949 7.754186 7.330949 1.533303 21.214976 5.152676
For the chosen elastic net model (lambda = 0.1, fraction = 0.1), the cross-validated Rsquared is 0.4372548 and the RMSE is 10.00326. Elastic net produced the highest Rsquared and lowest RMSE of the three models, so I would use it as the model.
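For completeness, the elastic net's hold-out performance can be computed the same way as for PLS (a sketch reusing the Enet predictions above):
# test-set Rsquared for the elastic net predictions
1 - sum((Enet - test_set_y)^2) / sum((test_set_y - mean(test_set_y))^2)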
These models performed okay, but I would not recommend any of them to replace the laboratory experiment; the Rsquared values should be higher.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch. The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
I first tried imputation based on K-nearest neighbors, but that removed observations, which I did not want:
# Apply imputation methods based on K-nearest neighbors
ChemicalManufacturingProcessPredictors <- ChemicalManufacturingProcess[, -1]
trans <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
So instead I used the column mean for the missing values:
ChemicalManufacturingProcessCopy <- ChemicalManufacturingProcess
for(i in 1:ncol(ChemicalManufacturingProcessCopy)){
ChemicalManufacturingProcessCopy[is.na(ChemicalManufacturingProcessCopy[,i]), i] <- mean(ChemicalManufacturingProcessCopy[,i], na.rm = TRUE)
}
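An alternative that also keeps every row and the original measurement scale (a sketch, not used below) is caret's medianImpute:
# hypothetical alternative: impute each column's missing values with its median
med_trans <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
chem_imputed <- predict(med_trans, ChemicalManufacturingProcess)
sum(is.na(chem_imputed)) # expected to be 0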
chem_train_set_x_df <- as.data.frame(ChemicalManufacturingProcessCopy[1:110,])
chem_train_set_x <- ChemicalManufacturingProcessCopy[1:110,]
chem_test_set_x <- ChemicalManufacturingProcessCopy[111:nrow(ChemicalManufacturingProcessCopy),]
chem_train_set_y_df <- as.data.frame(ChemicalManufacturingProcess[1:110,])
chem_train_set_y <- ChemicalManufacturingProcess[1:110,]
Let’s plot some of the original data:
### Some initial plots of the data
xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$BiologicalMaterial05,
ylab = "Yield",
xlab = "BiologicalMaterial05")
xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$ManufacturingProcess17,
ylab = "Yield",
xlab = "ManufacturingProcess17")
xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$ManufacturingProcess39,
ylab = "Yield",
xlab = "ManufacturingProcess39")
I don’t think we need any other pre-processing (beyond what the train function applies).
Let’s try ridge regression for this dataset, since it’s a method we have not tried yet:
# Use glmnet:
# choose response variable
y <- chem_train_set_x$Yield
# all the predictors need to be put into a matrix
x <- as.matrix(chem_train_set_x[, -which(names(chem_train_set_x) == "Yield")])
lambdas <- 10^seq(3, -2, by = -.1) # candidate lambda sequence (not passed below; glmnet's default path is used)
fit <- glmnet(x, y, alpha = 0) # alpha = 0 gives ridge regression
summary(fit)
## Length Class Mode
## a0 100 -none- numeric
## beta 5700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 4 -none- call
## nobs 1 -none- numeric
plot(fit, xvar = "lambda", label = TRUE)
# automatically find a value for lambda that is optimal
cv_fit <- cv.glmnet(x, y, alpha = 0)
cv_fit
##
## Call: cv.glmnet(x = x, y = y, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 3.982 63 1.807 0.4753 57
## 1se 16.074 48 2.261 0.3159 57
glmnet_fit <- cv_fit$glmnet.fit
glmnet_fit
##
## Call: glmnet(x = x, y = y, alpha = 0)
##
## Df %Dev Lambda
## 1 57 0.00 1274.00
## 2 57 1.64 1161.00
## 3 57 1.79 1058.00
## 4 57 1.96 963.60
## 5 57 2.15 878.00
## 6 57 2.36 800.00
## 7 57 2.58 728.90
## 8 57 2.82 664.20
## 9 57 3.09 605.20
## 10 57 3.38 551.40
## 11 57 3.69 502.40
## 12 57 4.04 457.80
## 13 57 4.41 417.10
## 14 57 4.82 380.10
## 15 57 5.26 346.30
## 16 57 5.74 315.50
## 17 57 6.27 287.50
## 18 57 6.83 262.00
## 19 57 7.44 238.70
## 20 57 8.10 217.50
## 21 57 8.82 198.20
## 22 57 9.59 180.60
## 23 57 10.42 164.50
## 24 57 11.31 149.90
## 25 57 12.26 136.60
## 26 57 13.28 124.50
## 27 57 14.37 113.40
## 28 57 15.53 103.30
## 29 57 16.77 94.15
## 30 57 18.08 85.78
## 31 57 19.46 78.16
## 32 57 20.92 71.22
## 33 57 22.46 64.89
## 34 57 24.06 59.13
## 35 57 25.74 53.87
## 36 57 27.49 49.09
## 37 57 29.29 44.73
## 38 57 31.16 40.75
## 39 57 33.08 37.13
## 40 57 35.04 33.83
## 41 57 37.04 30.83
## 42 57 39.07 28.09
## 43 57 41.12 25.59
## 44 57 43.18 23.32
## 45 57 45.24 21.25
## 46 57 47.30 19.36
## 47 57 49.33 17.64
## 48 57 51.34 16.07
## 49 57 53.31 14.65
## 50 57 55.23 13.35
## 51 57 57.11 12.16
## 52 57 58.92 11.08
## 53 57 60.67 10.10
## 54 57 62.35 9.20
## 55 57 63.95 8.38
## 56 57 65.48 7.64
## 57 57 66.93 6.96
## 58 57 68.30 6.34
## 59 57 69.60 5.78
## 60 57 70.82 5.26
## 61 57 71.96 4.80
## 62 57 73.03 4.37
## 63 57 74.04 3.98
## 64 57 74.97 3.63
## 65 57 75.85 3.31
## 66 57 76.67 3.01
## 67 57 77.43 2.74
## 68 57 78.13 2.50
## 69 57 78.80 2.28
## 70 57 79.42 2.08
## 71 57 79.99 1.89
## 72 57 80.54 1.72
## 73 57 81.04 1.57
## 74 57 81.52 1.43
## 75 57 81.97 1.30
## 76 57 82.39 1.19
## 77 57 82.80 1.08
## 78 57 83.17 0.99
## 79 57 83.53 0.90
## 80 57 83.88 0.82
## 81 57 84.20 0.75
## 82 57 84.51 0.68
## 83 57 84.81 0.62
## 84 57 85.10 0.56
## 85 57 85.37 0.51
## 86 57 85.64 0.47
## 87 57 85.89 0.43
## 88 57 86.14 0.39
## 89 57 86.37 0.35
## 90 57 86.60 0.32
## 91 57 86.83 0.29
## 92 57 87.04 0.27
## 93 57 87.25 0.24
## 94 57 87.46 0.22
## 95 57 87.65 0.20
## 96 57 87.85 0.18
## 97 57 88.03 0.17
## 98 57 88.22 0.15
## 99 57 88.39 0.14
## 100 57 88.57 0.13
# plot the cross validation error vs. log(lambda)
plot(cv_fit)
# find the optimal lambda value that minimizes the cross-validated MSE
best_lambda <- cv_fit$lambda.min
best_lambda
## [1] 3.981707
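The ridge coefficients at this lambda can be inspected directly from the cv.glmnet object:
# coefficients at the cross-validated optimal lambda
coef(cv_fit, s = "lambda.min")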
# Other way (from the textbook):
set.seed(101)
indx <- createFolds(y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(101)
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))
ridgeTune <- train(x = x, y = y,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale"))
ridgeTune
## Ridge Regression
##
## 110 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 11.840868 0.4927704 4.098290
## 0.007142857 4.361446 0.5264280 1.877559
## 0.014285714 3.758401 0.5393399 1.685806
## 0.021428571 3.438947 0.5486669 1.580761
## 0.028571429 3.219338 0.5565268 1.507241
## 0.035714286 3.052741 0.5637427 1.454526
## 0.042857143 2.919864 0.5706336 1.414526
## 0.050000000 2.810604 0.5772968 1.381886
## 0.057142857 2.718875 0.5837263 1.353724
## 0.064285714 2.640676 0.5898716 1.329083
## 0.071428571 2.573212 0.5956728 1.307293
## 0.078571429 2.514444 0.6010796 1.288169
## 0.085714286 2.462833 0.6060609 1.272049
## 0.092857143 2.417186 0.6106070 1.258030
## 0.100000000 2.376559 0.6147271 1.245576
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
Using the method from the textbook, the optimal lambda is 0.1 (the upper edge of the tuning grid, so a wider grid might do slightly better), with an RMSE of 2.376559 and an Rsquared of 0.6147271. Using glmnet, the best lambda is 3.981707.
chem_test_y <- chem_test_set_x[, which(names(chem_test_set_x) == "Yield")]
chem_test_x <- as.matrix(chem_test_set_x[, -which(names(chem_test_set_x) == "Yield")])
y_predicted <- predict(ridgeTune, chem_test_x)
# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((y_predicted - chem_test_y)^2)
# R squared
rsq <- 1 - sse / sst
rsq
## [1] -1.888747
The test-set Rsquared is -1.888747 using the tuned ridge model (lambda = 0.1) from the previous section. This is a poor value: the model predicts worse on the test set than simply using the mean.
Let’s test the glmnet lambda:
y_predicted <- predict(glmnet_fit, s = best_lambda, newx = chem_test_x)
# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((y_predicted - chem_test_y)^2)
# R squared
rsq <- 1 - sse / sst
rsq
## [1] -0.9570074
This Rsquared (-0.9570074) is also poor, though better than the previous one. In both cases the training (cross-validated) Rsquared is much better than the test Rsquared, which suggests the models are overfitting.
Let’s quickly try elastic net:
enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
fraction = seq(.05, 1, length = 20))
set.seed(101)
enetTune <- train(x = x, y = y,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
enetTune
## Elasticnet
##
## 110 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 0.9539125 0.7630150 0.7582394
## 0.00 0.10 1.5927956 0.6764034 0.9495497
## 0.00 0.15 2.2283107 0.6043608 1.1881296
## 0.00 0.20 2.5371662 0.5888523 1.2979816
## 0.00 0.25 2.8028695 0.5596565 1.4155927
## 0.00 0.30 3.4995606 0.5189109 1.6450227
## 0.00 0.35 4.2844601 0.5130060 1.8818095
## 0.00 0.40 4.9729266 0.5151845 2.0847374
## 0.00 0.45 5.5816641 0.5217100 2.2579832
## 0.00 0.50 5.9665651 0.5321175 2.3663422
## 0.00 0.55 6.2327631 0.5439702 2.4397878
## 0.00 0.60 6.6280498 0.5396910 2.5460374
## 0.00 0.65 7.9127532 0.5202888 2.9378729
## 0.00 0.70 8.9264267 0.4924244 3.2466695
## 0.00 0.75 9.5328209 0.4738516 3.4329179
## 0.00 0.80 10.0269989 0.4708959 3.5787094
## 0.00 0.85 10.4798008 0.4719496 3.7107146
## 0.00 0.90 10.9364212 0.4753977 3.8426291
## 0.00 0.95 11.3869270 0.4832676 3.9712485
## 0.00 1.00 11.8408680 0.4927704 4.0982896
## 0.01 0.05 1.5003480 0.7030680 1.2170637
## 0.01 0.10 1.1594132 0.7418393 0.9377423
## 0.01 0.15 0.9706755 0.7481935 0.7829963
## 0.01 0.20 1.3381025 0.6941970 0.8773705
## 0.01 0.25 1.6882111 0.6883140 0.9753398
## 0.01 0.30 1.7911683 0.6895415 1.0015210
## 0.01 0.35 1.9282823 0.6898041 1.0358408
## 0.01 0.40 2.0939279 0.6522954 1.1114982
## 0.01 0.45 2.2177646 0.6295969 1.1648012
## 0.01 0.50 2.3469798 0.6178605 1.2152260
## 0.01 0.55 2.5526138 0.6102831 1.2823467
## 0.01 0.60 2.7495110 0.6073446 1.3440703
## 0.01 0.65 2.9696063 0.5978812 1.4220332
## 0.01 0.70 3.1934986 0.5816913 1.4989875
## 0.01 0.75 3.2645671 0.5621210 1.5295967
## 0.01 0.80 3.4045837 0.5466330 1.5785543
## 0.01 0.85 3.5839461 0.5395682 1.6359128
## 0.01 0.90 3.7629463 0.5362625 1.6911615
## 0.01 0.95 3.9320086 0.5338424 1.7426281
## 0.01 1.00 4.0559146 0.5322587 1.7813981
## 0.10 0.05 1.6844520 0.6476472 1.3772921
## 0.10 0.10 1.4733019 0.7124996 1.1955158
## 0.10 0.15 1.2879008 0.7350048 1.0459709
## 0.10 0.20 1.1237580 0.7503011 0.9074389
## 0.10 0.25 1.0004605 0.7560037 0.8076171
## 0.10 0.30 0.9702768 0.7490365 0.7778417
## 0.10 0.35 1.0503939 0.7279671 0.7992161
## 0.10 0.40 1.3092331 0.7012701 0.8719739
## 0.10 0.45 1.4775177 0.6943541 0.9213725
## 0.10 0.50 1.5901581 0.6913220 0.9566569
## 0.10 0.55 1.6294809 0.6902932 0.9700293
## 0.10 0.60 1.6632224 0.6911883 0.9780840
## 0.10 0.65 1.7214353 0.6872656 1.0054842
## 0.10 0.70 1.8127012 0.6718898 1.0372066
## 0.10 0.75 1.9367827 0.6606214 1.0737555
## 0.10 0.80 2.0566635 0.6511461 1.1146220
## 0.10 0.85 2.1692143 0.6430294 1.1542290
## 0.10 0.90 2.2397623 0.6352884 1.1872496
## 0.10 0.95 2.3007060 0.6252622 1.2160389
## 0.10 1.00 2.3765589 0.6147271 1.2455765
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.
plot(enetTune)
Enet <- predict(enetTune, chem_test_x)
# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((Enet - chem_test_y)^2)
# R squared
rsq <- 1 - sse / sst
rsq
## [1] 0.09719997
Elastic net produced a better test-set Rsquared (0.097 vs. -1.889) and a lower cross-validated RMSE than ridge, so it would be the better model to pick.
For variable importance I used the ridge model, which keeps every variable. Here are the most important predictors:
ridgeImp <- varImp(ridgeTune, scale = FALSE)
plot(ridgeImp, top = 25, scales = list(y = list(cex = .95)))
The process predictors dominate the list.
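To make that concrete, the top of the ranking can be tallied by predictor type (a sketch that assumes, as caret does for regression models, that the importance data frame has a single Overall column):
# count process vs. biological predictors among the 25 most important
imp_df <- ridgeImp$importance
imp_df$Predictor <- rownames(imp_df)
top25 <- head(imp_df[order(-imp_df$Overall), ], 25)
table(ifelse(grepl("^ManufacturingProcess", top25$Predictor), "Process", "Biological"))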
Looking at the top three predictors:
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess13, y=Yield)) + geom_point()
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess17, y=Yield)) + geom_point()
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess09, y=Yield)) + geom_point()
We now see that as ManufacturingProcess13 increases, Yield decreases. Most of the ManufacturingProcess13 values hover around 33 - 36; all other points look like outliers. ManufacturingProcess17 has a very similar graph to ManufacturingProcess13, except its main range is around 32.5 - 36. ManufacturingProcess09 shows an increasing trend, with most of its data between 42.5 and 47.5; the variance appears greater near 47.5.
This information tells us what values certain predictors should take in order to produce a good yield.
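As a quick numeric check of the directions described above (a sketch using the mean-imputed copy of the data), the correlations of these three predictors with Yield can be computed:
# correlation of Yield with the three predictors discussed above
cor(ChemicalManufacturingProcessCopy$Yield,
    ChemicalManufacturingProcessCopy[, c("ManufacturingProcess13",
                                         "ManufacturingProcess17",
                                         "ManufacturingProcess09")])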