Data 624 HW7: Linear Regression and Its Cousins
library(tidyverse)
library(fpp2)
library(urca)
library(rio)
library(gridExtra)
#library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(elasticnet)
library(RANN)
seed <- 123
1 HW7: Linear Regression and Its Cousins
In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
1.1 Ex. 6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a.) Start R and use these commands to load the data:
- library(AppliedPredictiveModeling)
- data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Try building other models discussed in this chapter. Do any have better predictive performance?
Would you recommend any of your models to replace the permeability laboratory experiment?
1.1.1 Part a
Start R and use these commands to load the data:
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
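The loading chunk itself is not echoed in this document; a minimal sketch, assuming the standard commands quoted in the problem statement:
library(AppliedPredictiveModeling)
data(permeability)  # loads `fingerprints` (165 x 1107 binary matrix) and `permeability` (response)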
1.1.2 Part b
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer:
719 of the 1,107 binary molecular predictors were removed, leaving 388 predictors for modeling.
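The filtering chunk is not echoed; presumably something like the following produced the output below, and defines the filtered matrix fp used in the rest of this exercise:
nzv <- nearZeroVar(fingerprints)  # column indices of near-zero-variance predictors
str(nzv)                          # 719 predictors flagged for removal
ncol(fingerprints)                # 1107 predictors before filtering
fp <- fingerprints[, -nzv]        # drop the sparse predictors
ncol(fp)                          # 388 predictors remain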
## int [1:719] 7 8 9 10 13 14 17 18 19 22 ...
## [1] 1107
## [1] 388
1.1.3 Part c
Split the data into a training and a test set, pre-process the data, and tune a PLS (Partial Least Square) model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
Answer:
- Train-test split at 75%
set.seed(seed)
trainingRows <- createDataPartition(permeability, p=0.75, list=FALSE) #caret, textbook sec4.9
train_X <- fp[trainingRows, ]
train_Y <- permeability[trainingRows,]
test_X <- fp[-trainingRows, ]
test_Y <- permeability[-trainingRows,]
- Create a PLS model
The best PLS model is selected as the one with the lowest cross-validated RMSE.
set.seed(seed)
pls_1 <- train(x=train_X, y=train_Y, method="pls", tuneLength=20,
preProcess=c("center", "scale"),
trControl=trainControl(method="cv"))
pls_1
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.41334 0.3623773 10.259274
## 2 11.86219 0.4618152 8.397173
## 3 11.71304 0.4659634 8.861406
## 4 11.66161 0.4877544 8.696145
## 5 11.26338 0.5311563 8.021885
## 6 11.43810 0.5298935 8.179149
## 7 11.66505 0.5257160 8.629869
## 8 11.77213 0.5317200 8.877641
## 9 11.95947 0.5208151 9.076055
## 10 12.44031 0.4879913 9.394500
## 11 12.91286 0.4605701 9.468711
## 12 12.88221 0.4686068 9.510687
## 13 13.03911 0.4552614 9.558798
## 14 12.98039 0.4501933 9.396228
## 15 13.06457 0.4507424 9.361827
## 16 13.00332 0.4502875 9.585219
## 17 13.18292 0.4480355 9.661004
## 18 13.23331 0.4462273 9.675576
## 19 13.29171 0.4415160 9.760474
## 20 13.44435 0.4363275 9.801495
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
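The optimal PLS model therefore uses 5 latent variables, with a corresponding resampled \(R^2\) of about 0.531.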
1.1.4 Part d
Predict the response for the test set. What is the test set estimate of \(R^2\)?
Answer:
pls_predict <- predict(pls_1, test_X)
plot(pls_predict, test_Y, main="Observed vs Predicted Permeability of PLS Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(pls_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 12.1654338 0.3593983 8.2578548
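The test-set estimate of \(R^2\) is approximately 0.359, noticeably lower than the resampled estimate of 0.531.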
1.1.5 Part e
Try building other models discussed in this chapter. Do any have better predictive performance?
Answer:
Chapter 6 covers three types of penalized regression models: ridge regression, the lasso, and the elastic net; their penalties are summarized below.
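All three minimize the least-squares loss plus a penalty on the coefficients:
\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{P}\beta_j^2
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{P}\left|\beta_j\right|
\]
The elastic net combines the two penalties, \(\lambda_1 \sum_j \beta_j^2 + \lambda_2 \sum_j |\beta_j|\). In caret's enet method, lambda controls the ridge (L2) penalty and fraction specifies the lasso constraint as a fraction of the L1 norm of the full least-squares solution.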
1.1.5.1 Ridge Regression
- Ridge regression model
set.seed(seed)
ridge_lambda <- data.frame(lambda = seq(0, 0.3, length=30))  # 30 candidate L2 penalties
ridge_1 <- train(x=train_X, y=train_Y, method="ridge",
tuneGrid=ridge_lambda,  # fixed: pass the grid directly rather than re-wrapping it in expand.grid
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
ridge_1
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 13.31355 0.4257943 9.489948
## 0.01034483 62.15255 0.3291853 34.807919
## 0.02068966 13.42988 0.4398304 9.526945
## 0.03103448 13.06314 0.4599117 9.273919
## 0.04137931 12.84222 0.4720473 9.179058
## 0.05172414 12.70529 0.4801400 9.113387
## 0.06206897 12.60931 0.4861717 9.055218
## 0.07241379 12.53670 0.4910593 9.004515
## 0.08275862 12.48777 0.4949549 8.963595
## 0.09310345 12.44965 0.4983657 8.931068
## 0.10344828 12.42250 0.5013294 8.914315
## 0.11379310 12.40210 0.5038764 8.901359
## 0.12413793 12.38961 0.5064060 8.896276
## 0.13448276 12.38060 0.5085803 8.890395
## 0.14482759 12.38435 0.5102651 8.893560
## 0.15517241 12.37985 0.5123803 8.890398
## 0.16551724 12.38608 0.5138800 8.896312
## 0.17586207 12.39146 0.5155138 8.902158
## 0.18620690 12.40188 0.5169707 8.918062
## 0.19655172 12.41694 0.5182106 8.934497
## 0.20689655 12.43309 0.5194247 8.950699
## 0.21724138 12.45263 0.5205351 8.968571
## 0.22758621 12.47204 0.5216033 8.984214
## 0.23793103 12.49450 0.5225826 9.000665
## 0.24827586 12.51887 0.5235025 9.018637
## 0.25862069 12.54486 0.5243477 9.041171
## 0.26896552 12.57344 0.5251166 9.063755
## 0.27931034 12.60197 0.5258823 9.085035
## 0.28965517 12.63676 0.5264487 9.109773
## 0.30000000 12.66622 0.5272410 9.127675
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1551724.
The best ridge regression model is selected as the one with the lowest cross-validated RMSE.
# results row at the optimal lambda (lowest RMSE)
ridge_1$results[which(ridge_1$results$lambda==ridge_1$bestTune$lambda),]
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
ridge_predict <- predict(ridge_1, test_X)
plot(ridge_predict, test_Y, main="Observed vs Predicted Permeability of Ridge Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(ridge_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 12.8768125 0.3982622 8.9082363
1.1.5.2 Lasso Regression
- Lasso regression model
set.seed(seed)
lasso_1 <- train(x=train_X, y=train_Y, method="lasso",
tuneGrid=data.frame(fraction = seq(0, 0.5, length=50)),  # fraction of the full L1 solution path
preProcess=c("center", "scale"), metric="RMSE",
trControl=trainControl(method="cv", number=10))
lasso_1
## The lasso
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00000000 15.57692 NaN 12.728995
## 0.01020408 14.47419 0.4634966 11.835283
## 0.02040816 13.52155 0.4689504 10.972438
## 0.03061224 12.75502 0.4880627 10.160320
## 0.04081633 12.22596 0.4853324 9.493436
## 0.05102041 11.83632 0.4804238 9.042418
## 0.06122449 11.57451 0.4835550 8.717351
## 0.07142857 11.34355 0.4829280 8.438200
## 0.08163265 11.21724 0.4844352 8.237427
## 0.09183673 11.16631 0.4863175 8.132458
## 0.10204082 11.17368 0.4858203 8.098263
## 0.11224490 11.21966 0.4849602 8.119512
## 0.12244898 11.29267 0.4829318 8.142800
## 0.13265306 11.35585 0.4815348 8.146106
## 0.14285714 11.39282 0.4808331 8.123918
## 0.15306122 11.43691 0.4803752 8.125886
## 0.16326531 11.48366 0.4800437 8.125808
## 0.17346939 11.51432 0.4800548 8.112645
## 0.18367347 11.55157 0.4789385 8.115222
## 0.19387755 11.59262 0.4768212 8.118618
## 0.20408163 11.63138 0.4746560 8.119309
## 0.21428571 11.67602 0.4723134 8.130778
## 0.22448980 11.71293 0.4710988 8.134070
## 0.23469388 11.74852 0.4701882 8.133650
## 0.24489796 11.77903 0.4693071 8.146913
## 0.25510204 11.80821 0.4680399 8.172026
## 0.26530612 11.84255 0.4665413 8.197618
## 0.27551020 11.87817 0.4652673 8.230074
## 0.28571429 11.92102 0.4635390 8.266816
## 0.29591837 11.96073 0.4621253 8.299207
## 0.30612245 11.98789 0.4613869 8.321313
## 0.31632653 11.99858 0.4611815 8.334260
## 0.32653061 12.00539 0.4607196 8.340874
## 0.33673469 12.02526 0.4594931 8.365919
## 0.34693878 12.03500 0.4596560 8.386498
## 0.35714286 12.04123 0.4606912 8.397463
## 0.36734694 12.05451 0.4608652 8.413506
## 0.37755102 12.06834 0.4609730 8.425471
## 0.38775510 12.08131 0.4613339 8.437526
## 0.39795918 12.09310 0.4614313 8.446080
## 0.40816327 12.09651 0.4615999 8.448989
## 0.41836735 12.09470 0.4620095 8.447201
## 0.42857143 12.09465 0.4622211 8.450695
## 0.43877551 12.09751 0.4624315 8.459603
## 0.44897959 12.10322 0.4625048 8.471047
## 0.45918367 12.11214 0.4624240 8.487719
## 0.46938776 12.12142 0.4623135 8.502984
## 0.47959184 12.13137 0.4622029 8.518438
## 0.48979592 12.14500 0.4619836 8.537519
## 0.50000000 12.16177 0.4616643 8.554631
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.09183673.
The best lasso regression model is selected as the one with the lowest cross-validated RMSE.
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
lasso_predict <- predict(lasso_1, test_X)
plot(lasso_predict, test_Y, main="Observed vs Predicted Permeability of Lasso Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(lasso_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 11.7485995 0.3021474 8.2396727
1.1.5.3 Elastic Net Regression
- Elastic Net Regression model
set.seed(seed)
# Note: data.frame pairs each lambda with a single fraction (20 combinations);
# a full 20 x 20 grid search would use expand.grid instead.
elastic_1 <- train(x=train_X, y=train_Y, method="enet",
tuneGrid=data.frame(lambda = seq(0,0.3,length=20), fraction=seq(0.05,0.5,length=20)),
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
elastic_1
## Elasticnet
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.05000000 11.87107 0.4805047 9.081472
## 0.01578947 0.07368421 11.44758 0.4855473 8.094838
## 0.03157895 0.09736842 11.45778 0.4860160 8.034581
## 0.04736842 0.12105263 11.50092 0.4859597 8.060801
## 0.06315789 0.14473684 11.54106 0.4868502 8.085338
## 0.07894737 0.16842105 11.58295 0.4881920 8.131431
## 0.09473684 0.19210526 11.62286 0.4905041 8.182796
## 0.11052632 0.21578947 11.67162 0.4919044 8.242982
## 0.12631579 0.23947368 11.72141 0.4928231 8.292377
## 0.14210526 0.26315789 11.79440 0.4924376 8.346969
## 0.15789474 0.28684211 11.85231 0.4925342 8.387146
## 0.17368421 0.31052632 11.92964 0.4921074 8.449229
## 0.18947368 0.33421053 11.99055 0.4927981 8.499881
## 0.20526316 0.35789474 12.04181 0.4934928 8.537313
## 0.22105263 0.38157895 12.09259 0.4946214 8.568601
## 0.23684211 0.40526316 12.12059 0.4967339 8.583769
## 0.25263158 0.42894737 12.14993 0.4991258 8.600702
## 0.26842105 0.45263158 12.18961 0.5010938 8.628437
## 0.28421053 0.47631579 12.23650 0.5028885 8.664567
## 0.30000000 0.50000000 12.28733 0.5047685 8.701255
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.07368421 and
## lambda = 0.01578947.
The best elastic net regression model is selected as the one with the lowest cross-validated RMSE.
By predicting the response for the test set, the test set estimate of \(R^2\) is shown below.
set.seed(seed)
elastic_predict <- predict(elastic_1, test_X)
plot(elastic_predict, test_Y, main="Observed vs Predicted Permeability of Elastic Net Regression Model",
xlab="Predicted Permeability", ylab="Observed Permeability")
abline(0,1,col="royalblue")
postResample(elastic_predict, test_Y)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 11.4592402 0.3454041 7.6291889
1.1.6 Part f
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer:
- Among the models tried, I would recommend the elastic net regression model, as it has the lowest test-set RMSE and MAE. Even so, its test \(R^2\) is only about 0.35, so it is better suited to screening candidate molecules than to fully replacing the permeability laboratory experiment.
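The chunk that produced the comparison below is not echoed; presumably the test-set metrics were collected with caret's postResample, e.g.:
print("PLS:"); postResample(pls_predict, test_Y)
print("Ridge Regression:"); postResample(ridge_predict, test_Y)
print("Lasso Regression:"); postResample(lasso_predict, test_Y)
print("Elastic Net Regression:"); postResample(elastic_predict, test_Y)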
## [1] "PLS:"
## RMSE Rsquared MAE
## 12.1654338 0.3593983 8.2578548
## [1] "Ridge Regression:"
## RMSE Rsquared MAE
## 12.8768125 0.3982622 8.9082363
## [1] "Lasso Regression:"
## RMSE Rsquared MAE
## 11.7485995 0.3021474 8.2396727
## [1] "Elastic Net Regression:"
## RMSE Rsquared MAE
## 11.4592402 0.3454041 7.6291889
1.2 Ex. 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield.
Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a.) Start R and use these commands to load the data:
- library(AppliedPredictiveModeling)
- data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
1.2.1 Part a
Start R and use these commands to load the data:
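The loading chunk is not echoed; presumably:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)     # 176 runs: Yield plus 57 predictors
summary(ChemicalManufacturingProcess)  # produces the summary below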
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :35.25 Min. :4.580 Min. :46.87
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68
## Median :39.97 Median :6.305 Median :55.09
## Mean :40.18 Mean :6.411 Mean :55.69
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74
## Max. :46.34 Max. :8.810 Max. :64.75
##
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :56.97 Min. : 9.38 Min. :13.24
## 1st Qu.:64.98 1st Qu.:11.24 1st Qu.:17.23
## Median :67.22 Median :12.10 Median :18.49
## Mean :67.70 Mean :12.35 Mean :18.60
## 3rd Qu.:70.43 3rd Qu.:13.22 3rd Qu.:19.90
## Max. :78.25 Max. :23.09 Max. :24.85
##
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :48.46 Median :100.0 Median :17.51
## Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :59.38 Max. :100.8 Max. :19.14
##
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :11.44 Min. :1.770 Min. :135.8
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8
## Median :12.84 Median :2.710 Median :146.1
## Mean :12.85 Mean :2.801 Mean :147.0
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6
## Max. :14.08 Max. :6.870 Max. :158.7
##
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :18.35 Min. : 0.00 Min. : 0.00
## 1st Qu.:19.73 1st Qu.:10.80 1st Qu.:19.30
## Median :20.12 Median :11.40 Median :21.00
## Mean :20.20 Mean :11.21 Mean :16.68
## 3rd Qu.:20.75 3rd Qu.:12.15 3rd Qu.:21.50
## Max. :22.21 Max. :14.10 Max. :22.50
## NA's :1 NA's :3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :1.47 Min. :911.0 Min. : 923.0
## 1st Qu.:1.53 1st Qu.:928.0 1st Qu.: 986.8
## Median :1.54 Median :934.0 Median : 999.2
## Mean :1.54 Mean :931.9 Mean :1001.7
## 3rd Qu.:1.55 3rd Qu.:936.0 3rd Qu.:1008.9
## Max. :1.60 Max. :946.0 Max. :1175.3
## NA's :15 NA's :1 NA's :1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median :206.8 Median :177.0 Median :178.0
## Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :227.4 Max. :178.0 Max. :178.0
## NA's :2 NA's :1 NA's :1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :38.89 Min. : 7.500 Min. : 7.500
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000
## Median :45.73 Median : 9.100 Median : 9.400
## Mean :45.66 Mean : 9.179 Mean : 9.386
## 3rd Qu.:46.52 3rd Qu.: 9.550 3rd Qu.: 9.900
## Max. :49.36 Max. :11.600 Max. :11.500
## NA's :9 NA's :10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. : 0.0 Min. :32.10 Min. :4701
## 1st Qu.: 0.0 1st Qu.:33.90 1st Qu.:4828
## Median : 0.0 Median :34.60 Median :4856
## Mean : 857.8 Mean :34.51 Mean :4854
## 3rd Qu.: 0.0 3rd Qu.:35.20 3rd Qu.:4882
## Max. :4549.0 Max. :38.60 Max. :5055
## NA's :1 NA's :1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :5904 Min. : 0 Min. :31.30
## 1st Qu.:6010 1st Qu.:4561 1st Qu.:33.50
## Median :6032 Median :4588 Median :34.40
## Mean :6039 Mean :4566 Mean :34.34
## 3rd Qu.:6061 3rd Qu.:4619 3rd Qu.:35.10
## Max. :6233 Max. :4852 Max. :40.00
##
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :4835 Median :6022 Median :4582
## Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :4971 Max. :6146 Max. :4759
##
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-1.8000 Min. : 0.000 Min. :0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000
## Median :-0.3000 Median : 5.000 Median :3.000
## Mean :-0.1642 Mean : 5.406 Mean :3.017
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000
## Max. : 3.6000 Max. :12.000 Max. :6.000
## NA's :1 NA's :1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 4.000 1st Qu.:4832 1st Qu.:6020
## Median : 8.000 Median :4855 Median :6047
## Mean : 8.834 Mean :4828 Mean :6016
## 3rd Qu.:14.000 3rd Qu.:4877 3rd Qu.:6070
## Max. :23.000 Max. :4990 Max. :6161
## NA's :1 NA's :5 NA's :5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:4560 1st Qu.: 0.000 1st Qu.:19.70
## Median :4587 Median :10.400 Median :19.90
## Mean :4563 Mean : 6.592 Mean :20.01
## 3rd Qu.:4609 3rd Qu.:10.750 3rd Qu.:20.40
## Max. :4710 Max. :11.500 Max. :22.00
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median : 9.100 Median :70.80 Median :158.0
## Mean : 9.161 Mean :70.18 Mean :158.5
## 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :11.200 Max. :72.50 Max. :173.0
## NA's :5 NA's :5
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :56.00 Min. :2.300 Min. :463.0
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0
## Median :64.00 Median :2.500 Median :495.0
## Mean :63.54 Mean :2.494 Mean :495.6
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.5
## Max. :70.00 Max. :2.600 Max. :522.0
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :0.01700 Min. :0.000 Min. :0.000
## 1st Qu.:0.01900 1st Qu.:0.700 1st Qu.:2.000
## Median :0.02000 Median :1.000 Median :3.000
## Mean :0.01957 Mean :1.014 Mean :2.534
## 3rd Qu.:0.02000 3rd Qu.:1.300 3rd Qu.:3.000
## Max. :0.02200 Max. :2.300 Max. :3.000
## NA's :5
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:7.100 1st Qu.:0.00000 1st Qu.:0.00000
## Median :7.200 Median :0.00000 Median :0.00000
## Mean :6.851 Mean :0.01771 Mean :0.02371
## 3rd Qu.:7.300 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.500 Max. :0.10000 Max. :0.20000
## NA's :1 NA's :1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :11.60 Median : 0.8000 Median :1.900
## Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :12.10 Max. :11.0000 Max. :2.100
##
## ManufacturingProcess45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
##
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, plus the variable Yield, which contains the percent yield for each run.
1.2.2 Part b
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Answer:
- The caret class preProcess has the ability to transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction.
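The imputation chunk is not echoed. A sketch using k-nearest-neighbor imputation via preProcess (the RANN package loaded at the top supports knnImpute), which also defines the cmp_predictors object used below; the exact imputation method is an assumption:
cmp_predictors <- ChemicalManufacturingProcess %>% select(-Yield)
cmp_impute <- preProcess(cmp_predictors, method="knnImpute")  # assumption: kNN imputation (also centers/scales)
cmp_predictors <- predict(cmp_impute, cmp_predictors)
sum(is.na(cmp_predictors))  # confirm no missing values remain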
1.2.3 Part c
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Answer:
- Pre-process the data with centering and scaling.
cmp_pre <- preProcess(cmp_predictors, method=c("center", "scale"))
cmp_predictors <- predict(cmp_pre, cmp_predictors)
- Train-test split at 70%
set.seed(0)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield,
p=0.70, list=FALSE) #caret, textbook sec4.9
train_X2 <- cmp_predictors[trainingRows, ]
train_Y2 <- ChemicalManufacturingProcess$Yield[trainingRows]
test_X2 <- cmp_predictors[-trainingRows, ]
test_Y2 <- ChemicalManufacturingProcess$Yield[-trainingRows]
- Create an elastic net regression model
set.seed(seed)
# As in 6.2(e), data.frame pairs lambda with fraction (50 combinations along the
# diagonal of the grid); expand.grid would search the full 50 x 50 grid.
elastic_2 <- train(x=train_X2, y=train_Y2, method="enet",
tuneGrid=data.frame(lambda = seq(0,0.5,length=50), fraction=seq(0,0.5,length=50)),
preProcess=c("center", "scale"),
trControl=trainControl(method="cv", number=10))
elastic_2
## Elasticnet
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 110, 112, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.00000000 1.901440 NaN 1.550630
## 0.01020408 0.01020408 1.827446 0.4715302 1.489765
## 0.02040816 0.02040816 1.781927 0.5154421 1.452926
## 0.03061224 0.03061224 1.740373 0.5501750 1.419992
## 0.04081633 0.04081633 1.701189 0.5760593 1.389037
## 0.05102041 0.05102041 1.665255 0.5897328 1.360337
## 0.06122449 0.06122449 1.631330 0.5991104 1.333962
## 0.07142857 0.07142857 1.598567 0.6055289 1.308824
## 0.08163265 0.08163265 1.567849 0.6094772 1.285673
## 0.09183673 0.09183673 1.538174 0.6130028 1.263410
## 0.10204082 0.10204082 1.510038 0.6152897 1.242095
## 0.11224490 0.11224490 1.484245 0.6164227 1.221695
## 0.12244898 0.12244898 1.459033 0.6176095 1.201066
## 0.13265306 0.13265306 1.435393 0.6179064 1.181772
## 0.14285714 0.14285714 1.413077 0.6175098 1.164097
## 0.15306122 0.15306122 1.391639 0.6169066 1.146605
## 0.16326531 0.16326531 1.370669 0.6167188 1.130151
## 0.17346939 0.17346939 1.349648 0.6178490 1.114648
## 0.18367347 0.18367347 1.328960 0.6196777 1.099119
## 0.19387755 0.19387755 1.309101 0.6221056 1.085584
## 0.20408163 0.20408163 1.296127 0.6197820 1.081108
## 0.21428571 0.21428571 1.291100 0.6112501 1.076820
## 0.22448980 0.22448980 1.292687 0.6015908 1.073770
## 0.23469388 0.23469388 1.298798 0.5933851 1.071336
## 0.24489796 0.24489796 1.296381 0.5918706 1.065447
## 0.25510204 0.25510204 1.295319 0.5898961 1.061638
## 0.26530612 0.26530612 1.302230 0.5847361 1.061831
## 0.27551020 0.27551020 1.309779 0.5808628 1.060294
## 0.28571429 0.28571429 1.319787 0.5773648 1.060404
## 0.29591837 0.29591837 1.330923 0.5745433 1.061366
## 0.30612245 0.30612245 1.343585 0.5716829 1.062866
## 0.31632653 0.31632653 1.357646 0.5690163 1.065103
## 0.32653061 0.32653061 1.366258 0.5686029 1.065763
## 0.33673469 0.33673469 1.374310 0.5687929 1.066584
## 0.34693878 0.34693878 1.382495 0.5695444 1.067658
## 0.35714286 0.35714286 1.390983 0.5704198 1.069274
## 0.36734694 0.36734694 1.401315 0.5714089 1.071955
## 0.37755102 0.37755102 1.415014 0.5722355 1.075890
## 0.38775510 0.38775510 1.427006 0.5734824 1.079853
## 0.39795918 0.39795918 1.436859 0.5752575 1.083323
## 0.40816327 0.40816327 1.447524 0.5768117 1.087171
## 0.41836735 0.41836735 1.458528 0.5784628 1.091176
## 0.42857143 0.42857143 1.486616 0.5746711 1.100797
## 0.43877551 0.43877551 1.508477 0.5729361 1.108207
## 0.44897959 0.44897959 1.514037 0.5757535 1.110933
## 0.45918367 0.45918367 1.520137 0.5785080 1.114653
## 0.46938776 0.46938776 1.525742 0.5810269 1.117965
## 0.47959184 0.47959184 1.532399 0.5831547 1.121583
## 0.48979592 0.48979592 1.539876 0.5851052 1.125881
## 0.50000000 0.50000000 1.557432 0.5861309 1.133355
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2142857 and lambda
## = 0.2142857.
The best elastic net regression model is selected as the one with the lowest cross-validated RMSE.
1.2.4 Part d
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer:
The resampled performance on the training set, at the optimal fraction = lambda = 0.2142857, is \(R^2 = 0.6112501\) and \(RMSE = 1.2911\).
Predicting the response for the test set gives \(R^2 = 0.5955402\) and \(RMSE = 1.1216353\).
The test-set RMSE is lower than the resampled RMSE, so the model appears to perform slightly better on the test set than in resampling.
set.seed(0)
elastic_predict <- predict(elastic_2, test_X2)
plot(elastic_predict, test_Y2, main="Observed vs Predicted Yield of Elastic Net Regression Model",
xlab="Predicted Yield", ylab="Observed Yield")
abline(0,1,col="royalblue")
postResample(elastic_predict, test_Y2)  # presumably how the metrics below were computed
## RMSE Rsquared MAE
## 1.1216353 0.5955402 0.9313962
1.2.5 Part e
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer:
Using the varImp function from the caret package to rank the predictors, the top 20 most important predictors are shown below.
The most important predictor is ManufacturingProcess32, followed by ManufacturingProcess13, BiologicalMaterial03, BiologicalMaterial06, and ManufacturingProcess17, etc.
Among the 20 most important variables there are 12 process predictors and 8 biological predictors, and 6 of the top 10 are process predictors. Thus, the process predictors appear to dominate the list.
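The chunk behind the importance ranking is not echoed; presumably something like:
plot(varImp(elastic_2), top=20)  # top 20 predictors by importance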
1.2.6 Part f
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer:
- According to the correlation plot of the top 20 important predictors, I would focus on modifying manufacturing processes 13, 17, and 36, because they are strongly negatively correlated with Yield; their correlation coefficients with Yield are -0.50, -0.43, and -0.52, respectively. Adjusting these process steps to weaken their negative effect should help improve yield in future runs.
imp <- varImp(elastic_2)$importance  # note: dplyr::arrange() drops row names, so rank with order() instead
rn <- rownames(imp)[order(imp$Overall, decreasing=TRUE)][1:20]
m <- cmp_predictors %>% select(all_of(rn)) %>% cbind(Yield = ChemicalManufacturingProcess$Yield)
library(corrplot)
corrplot(cor(m), type="lower")
cor(m$ManufacturingProcess13, m$Yield)  # presumably the calls producing the values below
cor(m$ManufacturingProcess17, m$Yield)
cor(m$ManufacturingProcess36, m$Yield)
## [1] -0.5036797
## [1] -0.4258069
## [1] -0.5237389