library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(dplyr)
library(elasticnet)
library(ggplot2)
library(lattice)
Linear Regression
Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
Start R and use these commands to load the data:
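The loading commands are omitted from the rendered output; a minimal sketch that reproduces the summaries below, assuming the standard AppliedPredictiveModeling data call:
#Loads the `fingerprints` and `permeability` objects
data(permeability)
summary(permeability)
str(fingerprints)
str(permeability)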
## permeability
## Min. : 0.06
## 1st Qu.: 1.55
## Median : 4.91
## Mean :12.24
## 3rd Qu.:15.47
## Max. :55.60
## num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
## num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:165] "1" "2" "3" "4" ...
## ..$ : chr "permeability"
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
#Filter out the predictors that have low frequencies using nearZeroVar
fingerprints %>% nearZeroVar() %>% length()
## [1] 719
X <- fingerprints[, -nearZeroVar(fingerprints)]
ncol(X)
## [1] 388
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
The optimal ncomp is 11, with a maximum resampled R2 of 0.5271005.
#Split the data into training and test sets using createDataPartition
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- X[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- X[-trainRow, ]
y.test <- permeability[-trainRow, ]#Pre-process training set then cross validation of ncomp parameter for PLS model from 1 to 20.
set.seed(1)
plsFit <- train(x=X.train, y=y.train, method='pls', metric='Rsquared',
                tuneLength=20, trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale'))
plsResult <- plsFit$results
plsFit
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.66375 0.3461521 9.603300
## 2 11.66557 0.4884962 8.364131
## 3 11.86090 0.4462082 9.081715
## 4 12.15438 0.4254096 9.389547
## 5 12.00549 0.4492658 9.076973
## 6 11.70875 0.4618761 8.826006
## 7 11.42799 0.4775314 8.747634
## 8 11.25481 0.4925856 8.533191
## 9 11.03884 0.5066896 8.344461
## 10 10.88506 0.5187689 8.028416
## 11 10.80216 0.5271005 8.012505
## 12 10.81509 0.5247495 8.034195
## 13 10.74326 0.5245241 8.003891
## 14 10.70988 0.5238880 7.892343
## 15 10.87078 0.5198265 8.032440
## 16 11.25383 0.4999003 8.375209
## 17 11.52949 0.4886010 8.619614
## 18 11.46969 0.4907743 8.541789
## 19 11.46414 0.4906053 8.591222
## 20 11.38025 0.5003197 8.484932
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 11.
Predict the response for the test set. What is the test set estimate of R2?
The test set estimate of R2 is 0.2899935.
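The prediction step is not shown in the output; a minimal sketch of the code behind the metrics below, using caret's postResample:
#Predict the test set with the tuned PLS model and compute RMSE/Rsquared/MAE
plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)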
## RMSE Rsquared MAE
## 15.4341377 0.2899935 11.6639869
Try building other models discussed in this chapter. Do any have better predictive performance?
The test set results suggest that the Lasso model performs best, with an R2 of 0.3954258. The Ridge, Lasso, and Enet models all outperform PLS.
#Ridge Regression
set.seed(1)
ridgeFit <- train(x=X.train, y=y.train, method='ridge', metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 12.31893 0.4528045 9.294510
## 0.1 10.79390 0.5292391 7.935830
## 0.2 11.15821 0.5289694 8.241192
## 0.3 11.56911 0.5286210 8.612171
## 0.4 12.09487 0.5259461 9.039822
## 0.5 12.67474 0.5230682 9.494363
## 0.6 13.29786 0.5205254 10.029131
## 0.7 13.95945 0.5179546 10.622090
## 0.8 14.64851 0.5156990 11.217376
## 0.9 15.36147 0.5136360 11.828589
## 1.0 16.09410 0.5117506 12.434283
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.
#Lasso
set.seed(1)
lassoFit <- train(x=X.train, y=y.train, method='lasso', metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "lasso", metric =
## "Rsquared", : missing values found in aggregated results
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 14.62719 NaN 11.901845
## 0.05 13.29247 0.4552939 10.359126
## 0.10 12.80621 0.4554363 9.552588
## 0.15 12.90315 0.4235837 9.618665
## 0.20 12.87604 0.4068843 9.600865
## 0.25 12.75649 0.4114738 9.556420
## 0.30 12.66593 0.4169995 9.558130
## 0.35 12.58089 0.4227670 9.518502
## 0.40 12.50366 0.4283152 9.458058
## 0.45 12.44435 0.4366066 9.404528
## 0.50 12.35249 0.4422982 9.333097
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.1.
# Elasticnet
set.seed(1)
enetFit <- train(x=X.train, y=y.train, method='enet', metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale'))
## Warning: model fit failed for Fold08: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x = X.train, y = y.train, method = "enet", metric =
## "Rsquared", : missing values found in aggregated results
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 14.62719 NaN 11.901845
## 0.0 0.1 12.80621 0.4554363 9.552588
## 0.0 0.2 12.87604 0.4068843 9.600865
## 0.0 0.3 12.66593 0.4169995 9.558130
## 0.0 0.4 12.50366 0.4283152 9.458058
## 0.0 0.5 12.35249 0.4422982 9.333097
## 0.0 0.6 12.25156 0.4459605 9.303113
## 0.0 0.7 12.19984 0.4489988 9.311995
## 0.0 0.8 12.21014 0.4503081 9.280251
## 0.0 0.9 12.24057 0.4526927 9.271295
## 0.0 1.0 12.31893 0.4528045 9.294510
## 0.1 0.0 15.17471 NaN 12.242520
## 0.1 0.1 11.55434 0.4874903 8.229407
## 0.1 0.2 11.63397 0.4714981 8.459077
## 0.1 0.3 11.17731 0.5023828 8.228974
## 0.1 0.4 10.93531 0.5152918 8.045882
## 0.1 0.5 10.80908 0.5216687 7.920525
## 0.1 0.6 10.72893 0.5273809 7.828944
## 0.1 0.7 10.69309 0.5308887 7.777732
## 0.1 0.8 10.72480 0.5308325 7.832275
## 0.1 0.9 10.75986 0.5298811 7.908604
## 0.1 1.0 10.79390 0.5292391 7.935830
## 0.2 0.0 15.17471 NaN 12.242520
## 0.2 0.1 11.48301 0.4945241 8.170117
## 0.2 0.2 11.84209 0.4696152 8.505668
## 0.2 0.3 11.43389 0.4958256 8.375297
## 0.2 0.4 11.19698 0.5114505 8.243848
## 0.2 0.5 11.09985 0.5182971 8.147914
## 0.2 0.6 11.00732 0.5263811 8.075977
## 0.2 0.7 10.99994 0.5305898 8.069433
## 0.2 0.8 11.05283 0.5299608 8.106566
## 0.2 0.9 11.10436 0.5299953 8.173625
## 0.2 1.0 11.15821 0.5289694 8.241192
## 0.3 0.0 15.17471 NaN 12.242520
## 0.3 0.1 11.43015 0.4972166 8.101725
## 0.3 0.2 11.96864 0.4730337 8.473644
## 0.3 0.3 11.75988 0.4899883 8.585111
## 0.3 0.4 11.56666 0.5064758 8.533954
## 0.3 0.5 11.46977 0.5140181 8.420096
## 0.3 0.6 11.41659 0.5217383 8.390925
## 0.3 0.7 11.38872 0.5284379 8.402842
## 0.3 0.8 11.43282 0.5293766 8.443219
## 0.3 0.9 11.50977 0.5285789 8.537666
## 0.3 1.0 11.56911 0.5286210 8.612171
## 0.4 0.0 15.17471 NaN 12.242520
## 0.4 0.1 11.41234 0.4970136 8.041166
## 0.4 0.2 12.07118 0.4773956 8.417483
## 0.4 0.3 12.09220 0.4855603 8.772254
## 0.4 0.4 11.96135 0.5008874 8.794705
## 0.4 0.5 11.91102 0.5097316 8.694116
## 0.4 0.6 11.89320 0.5169703 8.720530
## 0.4 0.7 11.87520 0.5239754 8.757094
## 0.4 0.8 11.92508 0.5256967 8.832045
## 0.4 0.9 12.00938 0.5256961 8.936506
## 0.4 1.0 12.09487 0.5259461 9.039822
## 0.5 0.0 15.17471 NaN 12.242520
## 0.5 0.1 11.40954 0.4963705 7.986894
## 0.5 0.2 12.20206 0.4801158 8.392077
## 0.5 0.3 12.46719 0.4806466 8.951887
## 0.5 0.4 12.40354 0.4958890 9.045601
## 0.5 0.5 12.39951 0.5050895 8.992832
## 0.5 0.6 12.42175 0.5118355 9.100795
## 0.5 0.7 12.42012 0.5187276 9.170141
## 0.5 0.8 12.47215 0.5220491 9.262259
## 0.5 0.9 12.57012 0.5224055 9.371348
## 0.5 1.0 12.67474 0.5230682 9.494363
## 0.6 0.0 15.17471 NaN 12.242520
## 0.6 0.1 11.43256 0.4939374 7.946523
## 0.6 0.2 12.36966 0.4817767 8.395529
## 0.6 0.3 12.85821 0.4782145 9.124133
## 0.6 0.4 12.87990 0.4922499 9.308087
## 0.6 0.5 12.92052 0.5013594 9.353646
## 0.6 0.6 12.97712 0.5077939 9.498849
## 0.6 0.7 13.00404 0.5142719 9.613319
## 0.6 0.8 13.06977 0.5182672 9.741350
## 0.6 0.9 13.17859 0.5193988 9.891632
## 0.6 1.0 13.29786 0.5205254 10.029131
## 0.7 0.0 15.17471 NaN 12.242520
## 0.7 0.1 11.46619 0.4912343 7.912957
## 0.7 0.2 12.55303 0.4833110 8.390999
## 0.7 0.3 13.25892 0.4767863 9.301066
## 0.7 0.4 13.37747 0.4891867 9.594389
## 0.7 0.5 13.47327 0.4977598 9.738364
## 0.7 0.6 13.56088 0.5039024 9.927475
## 0.7 0.7 13.62075 0.5100817 10.119179
## 0.7 0.8 13.70839 0.5144473 10.288818
## 0.7 0.9 13.82941 0.5164615 10.468116
## 0.7 1.0 13.95945 0.5179546 10.622090
## 0.8 0.0 15.17471 NaN 12.242520
## 0.8 0.1 11.50692 0.4886514 7.879098
## 0.8 0.2 12.76479 0.4839868 8.409995
## 0.8 0.3 13.66419 0.4762875 9.519340
## 0.8 0.4 13.89244 0.4866892 9.940965
## 0.8 0.5 14.04197 0.4946135 10.165559
## 0.8 0.6 14.16619 0.5006537 10.436159
## 0.8 0.7 14.25971 0.5068011 10.645189
## 0.8 0.8 14.36988 0.5114073 10.843979
## 0.8 0.9 14.50376 0.5140374 11.043532
## 0.8 1.0 14.64851 0.5156990 11.217376
## 0.9 0.0 15.17471 NaN 12.242520
## 0.9 0.1 11.55121 0.4862199 7.848229
## 0.9 0.2 12.99728 0.4843262 8.472094
## 0.9 0.3 14.07288 0.4758987 9.760572
## 0.9 0.4 14.42043 0.4845527 10.309539
## 0.9 0.5 14.62738 0.4918566 10.622684
## 0.9 0.6 14.78920 0.4978107 10.969734
## 0.9 0.7 14.91416 0.5040442 11.192740
## 0.9 0.8 15.05381 0.5086436 11.412504
## 0.9 0.9 15.19917 0.5118594 11.638340
## 0.9 1.0 15.36147 0.5136360 11.828589
## 1.0 0.0 15.17471 NaN 12.242520
## 1.0 0.1 11.60369 0.4838849 7.823994
## 1.0 0.2 13.24717 0.4844231 8.565415
## 1.0 0.3 14.49421 0.4757489 10.001137
## 1.0 0.4 14.95867 0.4828266 10.733226
## 1.0 0.5 15.22359 0.4895516 11.135659
## 1.0 0.6 15.42999 0.4952250 11.523292
## 1.0 0.7 15.58606 0.5014637 11.753596
## 1.0 0.8 15.75800 0.5060329 11.990365
## 1.0 0.9 15.91616 0.5098193 12.235461
## 1.0 1.0 16.09410 0.5117506 12.434283
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.7 and lambda = 0.1.
resamp <- resamples(list(PLS=plsFit, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
(resamp.s <- summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 3.431086 6.279454 8.683473 8.012505 9.946990 10.48475 0
## Ridge 2.932725 6.407812 8.665880 7.935830 9.924056 10.45745 0
## Lasso 5.791257 7.693374 7.903535 9.552588 9.632303 21.21770 1
## enet 2.993593 6.181506 8.157837 7.777732 9.796487 10.79813 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 4.127860 7.883648 12.72946 10.80216 13.06351 14.98578 0
## Ridge 3.517458 8.293763 12.22699 10.79390 13.11415 15.70450 0
## Lasso 7.657579 8.839116 12.01826 12.80621 14.38416 25.37189 1
## enet 3.616262 8.054933 11.81838 10.69309 13.31569 14.94121 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.0007971331 0.3795375 0.5019221 0.5271005 0.7601007 0.9084493 0
## Ridge 0.0047218766 0.3487174 0.5283737 0.5292391 0.7709032 0.9297472 0
## Lasso 0.0235890619 0.1246479 0.5805556 0.4554363 0.7421118 0.8656899 1
## enet 0.0042021862 0.3407759 0.5243361 0.5308887 0.7766468 0.9259750 0
#Evaluation of models using the test set
multiResample <- function(models, newdata, obs){
  # Collect test-set performance (RMSE, Rsquared, MAE) for each caret model
  res <- list()
  methods <- c()
  i <- 1
  for (model in models){
    pred <- predict(model, newdata=newdata)
    metrics <- postResample(pred=pred, obs=obs)
    res[[i]] <- metrics
    methods[[i]] <- model$method
    i <- i + 1
  }
  names(res) <- methods
  return(res)
}
models <- list(plsFit, ridgeFit, lassoFit, enetFit)
(resampleResult <- multiResample(models, X.test, y.test))
## $pls
## RMSE Rsquared MAE
## 15.4341377 0.2899935 11.6639869
##
## $ridge
## RMSE Rsquared MAE
## 15.0822363 0.3047575 11.4460367
##
## $lasso
## RMSE Rsquared MAE
## 12.3663838 0.3954258 9.1764050
##
## $enet
## RMSE Rsquared MAE
## 14.1795387 0.3415212 10.7570758
Would you recommend any of your models to replace the permeability laboratory experiment?
No, I would not recommend any of the models to replace the permeability laboratory experiment. The MAEs of all the models fall roughly between 9 and 12, meaning the predictions are off by about 9 to 12 units on average. Since most permeability values are below 12 (the median is 4.91), errors of that magnitude are too large for a model to substitute for the experiment.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
Start R and use these commands to load the data:
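The loading command is missing from the rendered output; presumably the standard data call from AppliedPredictiveModeling:
#Load the chemical manufacturing data (creates ChemicalManufacturingProcess)
data(ChemicalManufacturingProcess)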
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
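The imputation code is also missing; a sketch consistent with the preProcess printout below, assuming bagged-tree imputation over the predictor columns with the imputed matrix stored as `cmp` (the name used in the later train/test split):
#Impute missing predictor values with bagged trees via caret::preProcess
predictorsRaw <- ChemicalManufacturingProcess %>% select(-Yield)
(imputeModel <- preProcess(predictorsRaw, method='bagImpute'))
cmp <- predict(imputeModel, predictorsRaw)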
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
#Train and Test
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]
The elastic net model is tuned using cross-validation, with lambda ranging from 0 to 1 and fraction ranging from 0 to 1.
set.seed(1)
enetFit <- train(x=X.train, y=y.train, method='enet', metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
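The printout of enetFit is not included above, but the optimal tuning values and the corresponding resampled RMSE can be inspected directly; a sketch, assuming the fitted object from the previous step:
enetFit$bestTune                      #optimal fraction and lambda
min(enetFit$results$RMSE, na.rm=TRUE) #best resampled RMSE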
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
The test set RMSE is lower than the resampled RMSE on the training set, so the test set performance appears slightly better than the training set performance.
enetPred <- predict(enetFit, newdata=X.test)
(predResult <- postResample(pred=enetPred, obs=y.test))
## RMSE Rsquared MAE
## 1.0611572 0.6019381 0.8536509
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
In a linear model like the elastic net, the most important predictors can be identified by the magnitude of their coefficients. The elastic net zeroes out many of the predictors due to its lasso penalty: of the 15 predictors that retain non-zero coefficients below, 14 are ManufacturingProcess predictors and only one (BiologicalMaterial06) is a BiologicalMaterial predictor.
(coeffs <- predict.enet(enetFit$finalModel, s=enetFit$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 0.00000000 0.00000000 0.08872051
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 0.00000000 0.00000000 0.00000000
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.04275827 0.00000000 0.04748137
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.04828912 0.00000000 0.45351080
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.21934545 0.00000000 0.06315242
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 0.00000000 -0.22082836 0.00000000
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess22
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 0.80779340 0.00000000 0.12023091
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 0.00000000 -0.19025738 -0.12077693
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 0.00000000 0.10800365 0.00000000
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 0.00000000 0.02043653 0.00000000
## ManufacturingProcess44 ManufacturingProcess45
## 0.00000000 0.05192492
#Keep the non-zero coefficients and sort by absolute magnitude
coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17
## 0.80779340 0.45351080 0.22082836
## ManufacturingProcess13 ManufacturingProcess36 ManufacturingProcess37
## 0.21934545 0.19025738 0.12077693
## ManufacturingProcess34 ManufacturingProcess39 BiologicalMaterial06
## 0.12023091 0.10800365 0.08872051
## ManufacturingProcess15 ManufacturingProcess45 ManufacturingProcess07
## 0.06315242 0.05192492 0.04828912
## ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
## 0.04748137 0.04275827 0.02043653
It appears that the ManufacturingProcess predictors are more important. The varImp function can be used to rank variable importance.
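The call that produced the ranking below is not shown; presumably varImp on the fitted model, which for enet falls back to a model-free loess R-squared measure:
#Rank predictors by loess R-squared importance
varImp(enetFit)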
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 82.21
## ManufacturingProcess36 79.47
## BiologicalMaterial06 75.61
## BiologicalMaterial03 71.87
## ManufacturingProcess17 70.62
## BiologicalMaterial12 66.86
## ManufacturingProcess09 62.20
## ManufacturingProcess06 55.45
## BiologicalMaterial02 53.61
## ManufacturingProcess31 46.70
## ManufacturingProcess33 45.66
## BiologicalMaterial11 42.39
## BiologicalMaterial04 39.70
## ManufacturingProcess29 36.88
## ManufacturingProcess11 36.39
## ManufacturingProcess12 35.50
## BiologicalMaterial08 31.86
## BiologicalMaterial09 30.98
## BiologicalMaterial01 29.67
Eleven of the top 20 predictors in the list are ManufacturingProcess predictors, again suggesting that the process measurements matter more than the BiologicalMaterial measurements.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Elastic net is a linear regression model, so the coefficients describe how each predictor affects the response: predictors with positive coefficients increase the yield, while those with negative coefficients decrease it.
For the ManufacturingProcess predictors with positive coefficients, I would alter the process to increase their values:
#Positive coefficients
#Select the ManufacturingProcess coefficients, keeping the magnitude-sorted order
mp.names <- grep('ManufacturingProcess', names(coeffs.sorted), value=TRUE)
coeffs.mp <- coeffs[mp.names]
coeffs.mp[coeffs.mp>0]
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess34
## 0.80779340 0.45351080 0.12023091
## ManufacturingProcess39 ManufacturingProcess15 ManufacturingProcess45
## 0.10800365 0.06315242 0.05192492
## ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
## 0.04748137 0.04275827 0.02043653
For the ManufacturingProcess predictors with negative coefficients, I would alter the process to decrease their values:
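The code line for this output is missing; presumably the complement of the previous filter:
#Negative coefficients
coeffs.mp[coeffs.mp<0]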
## ManufacturingProcess17 ManufacturingProcess13 ManufacturingProcess36
## -0.22082836 -0.21934545 -0.19025738
## ManufacturingProcess37 ManufacturingProcess07
## -0.12077693 -0.04828912