HW 7

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

library(lattice)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

library(AppliedPredictiveModeling)
data(permeability)
  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
## 
##     R2
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
glimpse(fingerprints_filtered)
##  num [1:165, 1:388] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:165] "1" "2" "3" "4" ...
##   ..$ : chr [1:388] "X1" "X2" "X3" "X4" ...

388 predictors remain for modeling.

  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

For the training set, I will use the first 110 observations, and use the rest for my test set:

train_set_x_df <- as.data.frame(fingerprints_filtered[1:110,])
train_set_x <- fingerprints_filtered[1:110,]
test_set_x <- fingerprints_filtered[111:165,]

train_set_y_df <- as.data.frame(permeability[1:110,])
train_set_y <- permeability[1:110,]
test_set_y <- permeability[111:165,]

Let’s take a look at the original data:

### Some initial plots of the data
xyplot(train_set_y_df$permeability ~ train_set_x_df$X4,
       ylab = "permeability",
       xlab = "X4")

xyplot(train_set_y_df$permeability ~ train_set_x_df$X20,
       ylab = "permeability",
       xlab = "X20")

xyplot(train_set_y_df$permeability ~ train_set_x_df$X72,
       ylab = "permeability",
       xlab = "X72")

Since each predictor variable can just be 0 or 1, I don’t think any transformation is needed for the original data before modeling (outside what the train function does).

Tune a PLS model:

set.seed(100)
indx <- createFolds(train_set_y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(100)
plsTune <- train(x = train_set_x, y = train_set_y,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl)
plsTune
## Partial Least Squares 
## 
## 110 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     11.71395  0.2143849   8.899378
##    2     10.90826  0.3182098   8.069260
##    3     10.55456  0.3742090   7.866986
##    4     10.88378  0.4246992   8.081647
##    5     11.05445  0.4314871   8.002925
##    6     11.42425  0.4006607   8.315823
##    7     11.69288  0.3700492   8.662212
##    8     12.00423  0.3696711   8.824582
##    9     12.39296  0.3608033   9.275735
##   10     12.76420  0.3300089   9.392145
##   11     13.20669  0.3160044   9.568009
##   12     13.60859  0.3020311   9.879736
##   13     14.10546  0.2954876  10.182043
##   14     14.30560  0.2828452  10.346977
##   15     14.52335  0.2852085  10.341159
##   16     14.87430  0.2768254  10.578734
##   17     15.32083  0.2652363  10.664805
##   18     15.48298  0.2642017  10.762865
##   19     15.64720  0.2554556  10.901658
##   20     15.86152  0.2441421  11.050867
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"

xyplot(RMSE ~ ncomp,
       data = plsResamples,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))
## Warning in draw.key(simpleKey(...), draw = FALSE): not enough rows for columns

plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))

Chosen ncomp is 3 and the corresponding Rsquared value is 0.3742090 and RMSE is 10.55456.

  1. Predict the response for the test set. What is the test set estimate of R^2?
# Response:
PLS <- predict(plsTune, test_set_x)
glimpse(PLS)
##  num [1:55] 29.2 14.7 29.2 12.3 33.6 ...
# Residuals:
PLS - test_set_y
##         111         112         113         114         115         116 
## -12.8431093   9.1025829 -12.8431093 -23.0894055   1.9238237 -20.7571261 
##         117         118         119         120         121         122 
## -20.6841704 -17.4246662  -1.9557921  18.4131846   6.0388850 -18.8703852 
##         123         124         125         126         127         128 
## -18.7344415 -18.8703852 -18.3711135  10.4383768   9.6481709   0.4657709 
##         129         130         131         132         133         134 
##  20.7650795   9.4515256  -1.1937257   3.2565552  26.9477790   4.8903566 
##         135         136         137         138         139         140 
##   9.1354341   5.2938229  -1.3677178  -0.3678805  21.6851413   2.3322837 
##         141         142         143         144         145         146 
## -16.7250434 -25.7931272   5.2381332  -9.5740983   1.0494965   8.7029740 
##         147         148         149         150         151         152 
##   0.9474552   4.6237907  14.3518502  -2.8698006  -0.2320333   8.4093749 
##         153         154         155         156         157         158 
##   8.5223391   3.5816587   4.1149679  24.5950545  -1.8878850   8.4607057 
##         159         160         161         162         163         164 
##  10.6129820   7.5067071  14.3747168   8.6988405   9.0620765 -12.6505146 
##         165 
##   0.7935953
# Sum of Squares Total and Error
sst <- sum((test_set_y - mean(test_set_y))^2)
sse <- sum((PLS - test_set_y)^2)

# R squared
rsq <- 1 - sse / sst
rsq
## [1] 0.5711109

The Rsquared value is 0.5711109.

  1. Try building other models discussed in this chapter. Do any have better predictive performance?

Let’s try PCR:

set.seed(100)
pcrTune <- train(x = train_set_x, y = train_set_y,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:35),
                 trControl = ctrl)
pcrTune                  
## Principal Component Analysis 
## 
## 110 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE     
##    1     12.27574  0.09107850  9.262943
##    2     12.35470  0.08509332  9.338148
##    3     12.44100  0.05045470  9.390560
##    4     12.43995  0.06207247  9.419897
##    5     12.66908  0.08470903  9.414710
##    6     11.68133  0.19594032  8.589011
##    7     11.11128  0.25301878  8.319281
##    8     11.07580  0.25792210  8.262428
##    9     10.91109  0.29986383  8.029309
##   10     10.85697  0.31411262  8.058591
##   11     10.92693  0.31881653  8.278232
##   12     11.01372  0.32158958  8.310311
##   13     10.79324  0.35087582  8.019884
##   14     10.74359  0.35385391  8.030234
##   15     10.72466  0.35960927  7.977698
##   16     10.69193  0.35642982  8.061796
##   17     10.77312  0.34514408  8.166548
##   18     10.92369  0.34100319  8.257080
##   19     11.03201  0.32934915  8.203110
##   20     10.90737  0.35724430  8.085212
##   21     10.91733  0.39358365  8.223455
##   22     10.83154  0.38352691  8.123252
##   23     10.88953  0.39775459  8.079413
##   24     10.82934  0.44090555  7.975673
##   25     10.94696  0.43007179  7.989442
##   26     10.89051  0.43106957  7.984514
##   27     11.02230  0.38882073  8.051817
##   28     11.08063  0.39012608  8.096570
##   29     11.17963  0.39735058  8.159222
##   30     11.19468  0.40090978  8.281390
##   31     11.17683  0.40034588  8.292896
##   32     11.39660  0.39273512  8.529296
##   33     11.47548  0.38231032  8.540921
##   34     11.59752  0.37891499  8.658306
##   35     11.61107  0.37625579  8.589752
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"

xyplot(RMSE ~ ncomp,
       data = pcrResamples,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))
## Warning in draw.key(simpleKey(...), draw = FALSE): not enough rows for columns

Chosen ncomp is 3 and the corresponding Rsquared value is 0.35642982 and RMSE is 10.69193.

Since I don’t know much about the variables, I will also try elastic net:

enetGrid <- expand.grid(lambda = c(0, 0.01, .1), 
                        fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = train_set_x, y = train_set_y,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enetTune
## Elasticnet 
## 
## 110 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 100, 98, 98, 99, 99, 98, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE          Rsquared   MAE         
##   0.00    0.05      2.167554e+18  0.2760875  9.400723e+17
##   0.00    0.10      4.329761e+18  0.2924141  1.878416e+18
##   0.00    0.15      6.491967e+18  0.2806316  2.816760e+18
##   0.00    0.20      8.654174e+18  0.2677772  3.755104e+18
##   0.00    0.25      1.081638e+19  0.2641035  4.693448e+18
##   0.00    0.30      1.297859e+19  0.2556734  5.631792e+18
##   0.00    0.35      1.514080e+19  0.2527739  6.570135e+18
##   0.00    0.40      1.730300e+19  0.2474795  7.508479e+18
##   0.00    0.45      1.946521e+19  0.2343342  8.446823e+18
##   0.00    0.50      2.162742e+19  0.2188636  9.385167e+18
##   0.00    0.55      2.378962e+19  0.2150393  1.032351e+19
##   0.00    0.60      2.595183e+19  0.2082588  1.126185e+19
##   0.00    0.65      2.811404e+19  0.2048796  1.220020e+19
##   0.00    0.70      3.027625e+19  0.2055291  1.313854e+19
##   0.00    0.75      3.243845e+19  0.2123844  1.407689e+19
##   0.00    0.80      3.460066e+19  0.2125866  1.501523e+19
##   0.00    0.85      3.676287e+19  0.2048431  1.595357e+19
##   0.00    0.90      3.892507e+19  0.2044992  1.689192e+19
##   0.00    0.95      4.108728e+19  0.2033647  1.783026e+19
##   0.00    1.00      4.324949e+19  0.2132726  1.876861e+19
##   0.01    0.05      4.158780e+01  0.4166435  2.427016e+01
##   0.01    0.10      7.223851e+01  0.3853051  4.090022e+01
##   0.01    0.15      1.022350e+02  0.3684834  5.699183e+01
##   0.01    0.20      1.323981e+02  0.3515061  7.298976e+01
##   0.01    0.25      1.618568e+02  0.3205087  8.867886e+01
##   0.01    0.30      1.911846e+02  0.3015848  1.042170e+02
##   0.01    0.35      2.204765e+02  0.2854697  1.196965e+02
##   0.01    0.40      2.497468e+02  0.2650762  1.351681e+02
##   0.01    0.45      2.788467e+02  0.2512959  1.507699e+02
##   0.01    0.50      3.078696e+02  0.2458830  1.664644e+02
##   0.01    0.55      3.368453e+02  0.2443749  1.821260e+02
##   0.01    0.60      3.658975e+02  0.2408552  1.978034e+02
##   0.01    0.65      3.949548e+02  0.2382542  2.134825e+02
##   0.01    0.70      4.239470e+02  0.2373392  2.291241e+02
##   0.01    0.75      4.521839e+02  0.2391018  2.439440e+02
##   0.01    0.80      4.804139e+02  0.2407181  2.587239e+02
##   0.01    0.85      5.082353e+02  0.2447778  2.733086e+02
##   0.01    0.90      5.359682e+02  0.2485837  2.878741e+02
##   0.01    0.95      5.638551e+02  0.2544388  3.029382e+02
##   0.01    1.00      5.933428e+02  0.2600142  3.186979e+02
##   0.10    0.05      1.067231e+01  0.3808896  7.790510e+00
##   0.10    0.10      1.000326e+01  0.4372548  6.987894e+00
##   0.10    0.15      1.018177e+01  0.4280656  7.295591e+00
##   0.10    0.20      1.047609e+01  0.4211830  7.594456e+00
##   0.10    0.25      1.076066e+01  0.4134969  7.784750e+00
##   0.10    0.30      1.109332e+01  0.3991958  8.005429e+00
##   0.10    0.35      1.142410e+01  0.3849210  8.245894e+00
##   0.10    0.40      1.172818e+01  0.3692020  8.458649e+00
##   0.10    0.45      1.194468e+01  0.3579474  8.616301e+00
##   0.10    0.50      1.212975e+01  0.3518741  8.725441e+00
##   0.10    0.55      1.229628e+01  0.3466773  8.822736e+00
##   0.10    0.60      1.245523e+01  0.3412078  8.927142e+00
##   0.10    0.65      1.262356e+01  0.3361084  9.013968e+00
##   0.10    0.70      1.280228e+01  0.3315977  9.115648e+00
##   0.10    0.75      1.298436e+01  0.3269108  9.244204e+00
##   0.10    0.80      1.316117e+01  0.3224074  9.384807e+00
##   0.10    0.85      1.333322e+01  0.3172381  9.509273e+00
##   0.10    0.90      1.349812e+01  0.3118688  9.619737e+00
##   0.10    0.95      1.365637e+01  0.3068700  9.730565e+00
##   0.10    1.00      1.382811e+01  0.3017281  9.850480e+00
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.1.
plot(enetTune)

Enet <- predict(enetTune, test_set_x)
Enet
##       111       112       113       114       115       116       117       118 
## 19.317715 16.716305 19.317715 12.554632 31.613597 16.766909 22.636263 31.205816 
##       119       120       121       122       123       124       125       126 
## 22.228481 28.039977 43.963613 26.663922 18.909933 26.663922 18.909933 22.177877 
##       127       128       129       130       131       132       133       134 
##  1.533303  7.330949 17.429763 17.429763  7.093060 10.614248 15.796527  4.958225 
##       135       136       137       138       139       140       141       142 
##  5.197908  6.740159  4.683686  8.981011 14.120167  4.193123 31.613597 19.062643 
##       143       144       145       146       147       148       149       150 
## 30.645324 19.623995  7.330949 13.215758  7.330949 16.789378  7.754186  1.533303 
##       151       152       153       154       155       156       157       158 
##  6.994874  9.341670  3.519439  2.733053  3.519439 21.610132  3.519439  3.519439 
##       159       160       161       162       163       164       165 
## 10.614248  7.330949  7.754186  7.330949  1.533303 21.214976  5.152676

For the elastic chosen model, the Rsquared value is 0.4372548 and RMSE is 10.00326. Elastic net produced the highest Rsquared and lowest RMSE so I would use that as the model.

  1. Would you recommend any of your models to replace the permeability laboratory experiment?

These models have performed ok, but I would not recommend any of the models I looked at to repace the experiment. I think Rsquared value should be higher.

6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

I first tried using K-nearest neighbors, but that removed observations which I did not want:

# Apply imputation methods based on K-nearest neighbors
ChemicalManufacturingProcessPredictors <- ChemicalManufacturingProcess[, -1]
trans <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))

So intead I used the mean for the missing values:

ChemicalManufacturingProcessCopy <- ChemicalManufacturingProcess
for(i in 1:ncol(ChemicalManufacturingProcessCopy)){
  ChemicalManufacturingProcessCopy[is.na(ChemicalManufacturingProcessCopy[,i]), i] <- mean(ChemicalManufacturingProcessCopy[,i], na.rm = TRUE)
}
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
chem_train_set_x_df <- as.data.frame(ChemicalManufacturingProcessCopy[1:110,])
chem_train_set_x <- ChemicalManufacturingProcessCopy[1:110,]

chem_test_set_x <- ChemicalManufacturingProcessCopy[111:nrow(ChemicalManufacturingProcessCopy),]

chem_train_set_y_df <- as.data.frame(ChemicalManufacturingProcess[1:110,])
chem_train_set_y <- ChemicalManufacturingProcess[1:110,]

Let’s plot some of the original data:

### Some initial plots of the data
xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$BiologicalMaterial05,
       ylab = "permeability",
       xlab = "BiologicalMaterial05")

xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$ManufacturingProcess17,
       ylab = "permeability",
       xlab = "ManufacturingProcess17")

xyplot(chem_train_set_y_df$Yield ~ chem_train_set_x_df$ManufacturingProcess39,
       ylab = "permeability",
       xlab = "ManufacturingProcess39")

I don’t think we need to do any other pre-process (other than what the train function uses).

Let’s try ridge for this dataset since it is one we have not tried yet:

# Use glmnet:
# choose response variable
y <- chem_train_set_x$Yield
# all the predictors need to be put into a matrix
x <- as.matrix(chem_train_set_x[, -which(names(chem_train_set_x) == "Yield")])

lambdas <- 10^seq(3, -2, by = -.1)
fit <- glmnet(x, y, alpha = 0)
summary(fit)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      5700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         4   -none-    call   
## nobs         1   -none-    numeric
plot(fit, xvar = "lambda", label = TRUE)

# automatically find a value for lambda that is optimal
cv_fit <- cv.glmnet(x, y, alpha = 0)
cv_fit
## 
## Call:  cv.glmnet(x = x, y = y, alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure     SE Nonzero
## min  3.982    63   1.807 0.4753      57
## 1se 16.074    48   2.261 0.3159      57
glmnet_fit <- cv_fit$glmnet.fit
glmnet_fit
## 
## Call:  glmnet(x = x, y = y, alpha = 0) 
## 
##     Df  %Dev  Lambda
## 1   57  0.00 1274.00
## 2   57  1.64 1161.00
## 3   57  1.79 1058.00
## 4   57  1.96  963.60
## 5   57  2.15  878.00
## 6   57  2.36  800.00
## 7   57  2.58  728.90
## 8   57  2.82  664.20
## 9   57  3.09  605.20
## 10  57  3.38  551.40
## 11  57  3.69  502.40
## 12  57  4.04  457.80
## 13  57  4.41  417.10
## 14  57  4.82  380.10
## 15  57  5.26  346.30
## 16  57  5.74  315.50
## 17  57  6.27  287.50
## 18  57  6.83  262.00
## 19  57  7.44  238.70
## 20  57  8.10  217.50
## 21  57  8.82  198.20
## 22  57  9.59  180.60
## 23  57 10.42  164.50
## 24  57 11.31  149.90
## 25  57 12.26  136.60
## 26  57 13.28  124.50
## 27  57 14.37  113.40
## 28  57 15.53  103.30
## 29  57 16.77   94.15
## 30  57 18.08   85.78
## 31  57 19.46   78.16
## 32  57 20.92   71.22
## 33  57 22.46   64.89
## 34  57 24.06   59.13
## 35  57 25.74   53.87
## 36  57 27.49   49.09
## 37  57 29.29   44.73
## 38  57 31.16   40.75
## 39  57 33.08   37.13
## 40  57 35.04   33.83
## 41  57 37.04   30.83
## 42  57 39.07   28.09
## 43  57 41.12   25.59
## 44  57 43.18   23.32
## 45  57 45.24   21.25
## 46  57 47.30   19.36
## 47  57 49.33   17.64
## 48  57 51.34   16.07
## 49  57 53.31   14.65
## 50  57 55.23   13.35
## 51  57 57.11   12.16
## 52  57 58.92   11.08
## 53  57 60.67   10.10
## 54  57 62.35    9.20
## 55  57 63.95    8.38
## 56  57 65.48    7.64
## 57  57 66.93    6.96
## 58  57 68.30    6.34
## 59  57 69.60    5.78
## 60  57 70.82    5.26
## 61  57 71.96    4.80
## 62  57 73.03    4.37
## 63  57 74.04    3.98
## 64  57 74.97    3.63
## 65  57 75.85    3.31
## 66  57 76.67    3.01
## 67  57 77.43    2.74
## 68  57 78.13    2.50
## 69  57 78.80    2.28
## 70  57 79.42    2.08
## 71  57 79.99    1.89
## 72  57 80.54    1.72
## 73  57 81.04    1.57
## 74  57 81.52    1.43
## 75  57 81.97    1.30
## 76  57 82.39    1.19
## 77  57 82.80    1.08
## 78  57 83.17    0.99
## 79  57 83.53    0.90
## 80  57 83.88    0.82
## 81  57 84.20    0.75
## 82  57 84.51    0.68
## 83  57 84.81    0.62
## 84  57 85.10    0.56
## 85  57 85.37    0.51
## 86  57 85.64    0.47
## 87  57 85.89    0.43
## 88  57 86.14    0.39
## 89  57 86.37    0.35
## 90  57 86.60    0.32
## 91  57 86.83    0.29
## 92  57 87.04    0.27
## 93  57 87.25    0.24
## 94  57 87.46    0.22
## 95  57 87.65    0.20
## 96  57 87.85    0.18
## 97  57 88.03    0.17
## 98  57 88.22    0.15
## 99  57 88.39    0.14
## 100 57 88.57    0.13
# plot the cross validation error vs. log(lambda)
plot(cv_fit)

# find opimal lambda value that minimizes test MSE
best_lambda <- cv_fit$lambda.min
best_lambda
## [1] 3.981707
# Other way (from the textbook):
set.seed(101)
indx <- createFolds(y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

set.seed(101)
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))
ridgeTune <- train(x = x, y = y,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
ridgeTune
## Ridge Regression 
## 
## 110 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE     
##   0.000000000  11.840868  0.4927704  4.098290
##   0.007142857   4.361446  0.5264280  1.877559
##   0.014285714   3.758401  0.5393399  1.685806
##   0.021428571   3.438947  0.5486669  1.580761
##   0.028571429   3.219338  0.5565268  1.507241
##   0.035714286   3.052741  0.5637427  1.454526
##   0.042857143   2.919864  0.5706336  1.414526
##   0.050000000   2.810604  0.5772968  1.381886
##   0.057142857   2.718875  0.5837263  1.353724
##   0.064285714   2.640676  0.5898716  1.329083
##   0.071428571   2.573212  0.5956728  1.307293
##   0.078571429   2.514444  0.6010796  1.288169
##   0.085714286   2.462833  0.6060609  1.272049
##   0.092857143   2.417186  0.6106070  1.258030
##   0.100000000   2.376559  0.6147271  1.245576
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

Using the method from the textbook, the optimal lambda is 0.1 with a RMSE value of 2.376559 and a Rsquared of 0.6147271.

Using glmnet, the best lambda is 4.79598.

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
chem_test_y <- chem_test_set_x[, which(names(chem_test_set_x) == "Yield")]
chem_test_x <- as.matrix(chem_test_set_x[, -which(names(chem_test_set_x) == "Yield")])

y_predicted <- predict(ridgeTune, chem_test_x)

# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((y_predicted - chem_test_y)^2)

# R squared
rsq <- 1 - sse / sst
rsq
## [1] -1.888747

The Rsquared value is -1.888747 using the best lambda found in the previous section. This is a poor Rsquared value.

Let’s test the glmnet lambda:

y_predicted <- predict(glmnet_fit, s = best_lambda, newx = chem_test_x)

# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((y_predicted - chem_test_y)^2)

# R squared
rsq <- 1 - sse / sst
rsq
## [1] -0.9570074

This Rsquared is also pretty bad, but better than the previous.

The training Rsquared value is much better than the test.

Let’s quick try elastic net:

enetGrid <- expand.grid(lambda = c(0, 0.01, .1), 
                        fraction = seq(.05, 1, length = 20))
set.seed(101)
enetTune <- train(x = x, y = y,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
enetTune
## Elasticnet 
## 
## 110 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE        Rsquared   MAE      
##   0.00    0.05       0.9539125  0.7630150  0.7582394
##   0.00    0.10       1.5927956  0.6764034  0.9495497
##   0.00    0.15       2.2283107  0.6043608  1.1881296
##   0.00    0.20       2.5371662  0.5888523  1.2979816
##   0.00    0.25       2.8028695  0.5596565  1.4155927
##   0.00    0.30       3.4995606  0.5189109  1.6450227
##   0.00    0.35       4.2844601  0.5130060  1.8818095
##   0.00    0.40       4.9729266  0.5151845  2.0847374
##   0.00    0.45       5.5816641  0.5217100  2.2579832
##   0.00    0.50       5.9665651  0.5321175  2.3663422
##   0.00    0.55       6.2327631  0.5439702  2.4397878
##   0.00    0.60       6.6280498  0.5396910  2.5460374
##   0.00    0.65       7.9127532  0.5202888  2.9378729
##   0.00    0.70       8.9264267  0.4924244  3.2466695
##   0.00    0.75       9.5328209  0.4738516  3.4329179
##   0.00    0.80      10.0269989  0.4708959  3.5787094
##   0.00    0.85      10.4798008  0.4719496  3.7107146
##   0.00    0.90      10.9364212  0.4753977  3.8426291
##   0.00    0.95      11.3869270  0.4832676  3.9712485
##   0.00    1.00      11.8408680  0.4927704  4.0982896
##   0.01    0.05       1.5003480  0.7030680  1.2170637
##   0.01    0.10       1.1594132  0.7418393  0.9377423
##   0.01    0.15       0.9706755  0.7481935  0.7829963
##   0.01    0.20       1.3381025  0.6941970  0.8773705
##   0.01    0.25       1.6882111  0.6883140  0.9753398
##   0.01    0.30       1.7911683  0.6895415  1.0015210
##   0.01    0.35       1.9282823  0.6898041  1.0358408
##   0.01    0.40       2.0939279  0.6522954  1.1114982
##   0.01    0.45       2.2177646  0.6295969  1.1648012
##   0.01    0.50       2.3469798  0.6178605  1.2152260
##   0.01    0.55       2.5526138  0.6102831  1.2823467
##   0.01    0.60       2.7495110  0.6073446  1.3440703
##   0.01    0.65       2.9696063  0.5978812  1.4220332
##   0.01    0.70       3.1934986  0.5816913  1.4989875
##   0.01    0.75       3.2645671  0.5621210  1.5295967
##   0.01    0.80       3.4045837  0.5466330  1.5785543
##   0.01    0.85       3.5839461  0.5395682  1.6359128
##   0.01    0.90       3.7629463  0.5362625  1.6911615
##   0.01    0.95       3.9320086  0.5338424  1.7426281
##   0.01    1.00       4.0559146  0.5322587  1.7813981
##   0.10    0.05       1.6844520  0.6476472  1.3772921
##   0.10    0.10       1.4733019  0.7124996  1.1955158
##   0.10    0.15       1.2879008  0.7350048  1.0459709
##   0.10    0.20       1.1237580  0.7503011  0.9074389
##   0.10    0.25       1.0004605  0.7560037  0.8076171
##   0.10    0.30       0.9702768  0.7490365  0.7778417
##   0.10    0.35       1.0503939  0.7279671  0.7992161
##   0.10    0.40       1.3092331  0.7012701  0.8719739
##   0.10    0.45       1.4775177  0.6943541  0.9213725
##   0.10    0.50       1.5901581  0.6913220  0.9566569
##   0.10    0.55       1.6294809  0.6902932  0.9700293
##   0.10    0.60       1.6632224  0.6911883  0.9780840
##   0.10    0.65       1.7214353  0.6872656  1.0054842
##   0.10    0.70       1.8127012  0.6718898  1.0372066
##   0.10    0.75       1.9367827  0.6606214  1.0737555
##   0.10    0.80       2.0566635  0.6511461  1.1146220
##   0.10    0.85       2.1692143  0.6430294  1.1542290
##   0.10    0.90       2.2397623  0.6352884  1.1872496
##   0.10    0.95       2.3007060  0.6252622  1.2160389
##   0.10    1.00       2.3765589  0.6147271  1.2455765
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.
plot(enetTune)

Enet <- predict(enetTune, chem_test_x)

# Sum of Squares Total and Error
sst <- sum((chem_test_y - mean(chem_test_y))^2)
sse <- sum((Enet - chem_test_y)^2)

# R squared
rsq <- 1 - sse / sst
rsq
## [1] 0.09719997

Elastic net produced a better Rsquared value and RMSE than ridge, so that would be a better model to pick.

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I tried the ridge model for this, so every variable is used. Here are the most important variables:

ridgeImp <- varImp(ridgeTune, scale = FALSE)
plot(ridgeImp, top = 25, scales = list(y = list(cex = .95)))

The process predictors dominate the list.

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Looking at the top 3 predictors,

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess13, y=Yield)) + geom_point()

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess17, y=Yield)) + geom_point()

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess09, y=Yield)) + geom_point()

We now know for ManufacturingProcess13, as ManufacturingProcess13 increases, the Yield decreases. Also, most of the values for ManufacturingProcess13 hover around 33 - 36, all other points look like outliers.

ManufacturingProcess17 has a very similar graph to ManufacturingProcess13, except the main range for ManufacturingProcess17 is around 32.5 - 36.

ManufacturingProcess09 has an increasing trend. The main data range is 42.5 - 47.5. For this, the variance seems to be greater around 47.5.

Having this information informs us what values we should have for certain predictors to produce a good yeild.