DATA 624 - Homework # 7

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

remove.cols <- nearZeroVar(fingerprints)
X <- fingerprints[,-remove.cols]

length(remove.cols) %>% paste(' columns are removed. ', dim(X)[2], ' columns are left for modeling.' ,sep='') %>% print()

## [1] "719 columns are removed. 388 columns are left for modeling."

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R^2?

Below, the data is splitted into train/test set, using the createDataPartition function.

set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- X[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- X[-trainRow, ]
y.test <- permeability[-trainRow, ]

I use the train function to perform the pre-process and tuning together. The function first preprocess the training set by centering it and scaling it. Then the function uses 10-fold cross validation to try the ncomp parameter (number of components, i.e. latent variables) of the PLS model from 1 to 20.

set.seed(1)
plsFit <- train(x=X.train,
                y=y.train, 
                method='pls',
                metric='Rsquared',
                tuneLength=20,
                trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale')
                )

plsResult <- plsFit$results
plsFit

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.43441  0.3057483  10.026064
##    2     12.25160  0.4403062   8.723873
##    3     11.73659  0.4724559   8.888187
##    4     11.62749  0.4754973   8.803811
##    5     11.48531  0.4719510   8.551965
##    6     11.32229  0.4847234   8.355657
##    7     11.24192  0.4945511   8.449710
##    8     11.11175  0.5094459   8.540873
##    9     10.99359  0.5256837   8.441846
##   10     11.08075  0.5276792   8.420112
##   11     11.28732  0.5235339   8.570830
##   12     11.22365  0.5307880   8.617399
##   13     11.45069  0.5207469   8.850330
##   14     11.70270  0.5161730   9.115517
##   15     11.96240  0.4954513   9.269173
##   16     12.22588  0.4803732   9.467217
##   17     12.64388  0.4616823   9.853870
##   18     12.97678  0.4481705  10.176509
##   19     13.16823  0.4433375  10.285782
##   20     13.51347  0.4262324  10.524854
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.

plot(plsFit)

Using R^2 as the deciding metric, the CV found the optimal ncomp to be 12, with the maximum R^2 being 0.530788.

(d) Predict the response for the test set. What is the test set estimate of R^2?

The postResample function from the caret package can be use to find the R^2 in the test set, using the selected model.

plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)

##       RMSE   Rsquared        MAE 
## 12.4780591  0.3536607  8.7204595

Here, the R^2 is 0.3536607.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

I try to tune 3 additional models:

Ridge regression, parameter tuned: lambda (from 0 to 1 by 0.1)
Lasso, parameter tuned: fraction (from 0 to 0.5 by 0.05)
Elastic net, parameters tuned: fraction and lambda (2-D grid with each D from 0 to 1 by 0.1)

I ensure that all of the models have the same seed, so their CV sets are identical. This way, I can then use the resamples functions to compare all 4 models at once. The R^2 metrics are used in all cases.

set.seed(1)
ridgeFit <- train(x=X.train,
                  y=y.train,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
ridgeFit

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     24.02881  0.2627497  16.031438
##   0.1     11.83292  0.5081518   9.092937
##   0.2     11.58981  0.5330773   8.857077
##   0.3     11.79414  0.5390176   9.009380
##   0.4     12.15851  0.5405512   9.284284
##   0.5     12.62564  0.5397839   9.662863
##   0.6     13.17255  0.5384263  10.131791
##   0.7     13.76370  0.5366428  10.603431
##   0.8     14.40875  0.5347296  11.117149
##   0.9     15.08011  0.5328512  11.635077
##   1.0     15.78570  0.5309111  12.226047
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.4.

plot(ridgeFit)

set.seed(1)
lassoFit <- train(x=X.train,
                  y=y.train,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
lassoFit

## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.69625        NaN  12.538724
##   0.05      12.24969  0.4843311   9.116047
##   0.10      11.56814  0.4836731   8.205552
##   0.15      11.56574  0.4898894   8.221542
##   0.20      11.79666  0.4782172   8.428298
##   0.25      12.13109  0.4736244   8.810463
##   0.30      12.59829  0.4660510   9.277051
##   0.35      13.32724  0.4463834   9.791144
##   0.40      14.14116  0.4245198  10.327665
##   0.45      14.99053  0.4048185  10.809800
##   0.50      15.82414  0.3884961  11.289059
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.15.

plot(lassoFit)

set.seed(1)
enetFit <- train(x=X.train,
                 y=y.train,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
enetFit

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 121, 117, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.0     0.0       15.69625        NaN  12.538724
##   0.0     0.1       11.56814  0.4836731   8.205552
##   0.0     0.2       11.79666  0.4782172   8.428298
##   0.0     0.3       12.59829  0.4660510   9.277051
##   0.0     0.4       14.14116  0.4245198  10.327665
##   0.0     0.5       15.82414  0.3884961  11.289059
##   0.0     0.6       17.38793  0.3599041  12.116201
##   0.0     0.7       19.04314  0.3283464  13.066529
##   0.0     0.8       20.75340  0.2979376  14.082592
##   0.0     0.9       22.44389  0.2752090  15.144311
##   0.0     1.0       24.02881  0.2627497  16.031438
##   0.1     0.0       15.69625        NaN  12.538724
##   0.1     0.1       12.19254  0.4492064   8.605796
##   0.1     0.2       11.62409  0.4718928   8.408015
##   0.1     0.3       11.05631  0.5136729   8.157275
##   0.1     0.4       10.75970  0.5415845   7.982482
##   0.1     0.5       10.72237  0.5524289   8.027004
##   0.1     0.6       10.90467  0.5474572   8.242991
##   0.1     0.7       11.16185  0.5364623   8.478961
##   0.1     0.8       11.40062  0.5253274   8.705799
##   0.1     0.9       11.62258  0.5160225   8.907756
##   0.1     1.0       11.83292  0.5081518   9.092937
##   0.2     0.0       15.69625        NaN  12.538724
##   0.2     0.1       12.36775  0.4391669   8.736642
##   0.2     0.2       11.88404  0.4578189   8.496250
##   0.2     0.3       11.33947  0.4956607   8.294602
##   0.2     0.4       11.10163  0.5171062   8.176860
##   0.2     0.5       11.03577  0.5300755   8.196063
##   0.2     0.6       10.97786  0.5436227   8.211415
##   0.2     0.7       11.09535  0.5448354   8.376295
##   0.2     0.8       11.25502  0.5420105   8.553258
##   0.2     0.9       11.41336  0.5382461   8.711490
##   0.2     1.0       11.58981  0.5330773   8.857077
##   0.3     0.0       15.69625        NaN  12.538724
##   0.3     0.1       12.44461  0.4352870   8.808041
##   0.3     0.2       12.03609  0.4568303   8.550594
##   0.3     0.3       11.61866  0.4869751   8.449227
##   0.3     0.4       11.37541  0.5088590   8.332388
##   0.3     0.5       11.28847  0.5243304   8.347866
##   0.3     0.6       11.24502  0.5385152   8.422089
##   0.3     0.7       11.31849  0.5447888   8.550314
##   0.3     0.8       11.47051  0.5443912   8.708744
##   0.3     0.9       11.63416  0.5423013   8.871366
##   0.3     1.0       11.79414  0.5390176   9.009380
##   0.4     0.0       15.69625        NaN  12.538724
##   0.4     0.1       12.46317  0.4359377   8.815614
##   0.4     0.2       12.17932  0.4582006   8.558682
##   0.4     0.3       11.89215  0.4819287   8.612171
##   0.4     0.4       11.67436  0.5037862   8.520643
##   0.4     0.5       11.59841  0.5198893   8.557936
##   0.4     0.6       11.60297  0.5328453   8.669162
##   0.4     0.7       11.66832  0.5417862   8.806390
##   0.4     0.8       11.81300  0.5440424   8.944141
##   0.4     0.9       11.98605  0.5432023   9.106210
##   0.4     1.0       12.15851  0.5405512   9.284284
##   0.5     0.0       15.69625        NaN  12.538724
##   0.5     0.1       12.47802  0.4353104   8.808893
##   0.5     0.2       12.29626  0.4608017   8.532296
##   0.5     0.3       12.18053  0.4789307   8.782363
##   0.5     0.4       12.00240  0.5003569   8.761399
##   0.5     0.5       11.97035  0.5155913   8.817833
##   0.5     0.6       12.01916  0.5276820   8.946892
##   0.5     0.7       12.10691  0.5370474   9.115591
##   0.5     0.8       12.25483  0.5412436   9.281079
##   0.5     0.9       12.43242  0.5420301   9.467562
##   0.5     1.0       12.62564  0.5397839   9.662863
##   0.6     0.0       15.69625        NaN  12.538724
##   0.6     0.1       12.49724  0.4340396   8.795973
##   0.6     0.2       12.44437  0.4614854   8.533354
##   0.6     0.3       12.48723  0.4770042   8.961043
##   0.6     0.4       12.36283  0.4980203   9.034933
##   0.6     0.5       12.39486  0.5117181   9.110964
##   0.6     0.6       12.48451  0.5233819   9.265807
##   0.6     0.7       12.60272  0.5328759   9.470418
##   0.6     0.8       12.76612  0.5382022   9.680965
##   0.6     0.9       12.96331  0.5396780   9.914089
##   0.6     1.0       13.17255  0.5384263  10.131791
##   0.7     0.0       15.69625        NaN  12.538724
##   0.7     0.1       12.51610  0.4331791   8.774374
##   0.7     0.2       12.60467  0.4618883   8.545572
##   0.7     0.3       12.81294  0.4757621   9.138647
##   0.7     0.4       12.75434  0.4961645   9.305793
##   0.7     0.5       12.84907  0.5083508   9.419925
##   0.7     0.6       12.97825  0.5197860   9.615271
##   0.7     0.7       13.13597  0.5290128   9.863176
##   0.7     0.8       13.32407  0.5347350  10.118876
##   0.7     0.9       13.53749  0.5368695  10.365385
##   0.7     1.0       13.76370  0.5366428  10.603431
##   0.8     0.0       15.69625        NaN  12.538724
##   0.8     0.1       12.53631  0.4325517   8.743716
##   0.8     0.2       12.79329  0.4614391   8.592084
##   0.8     0.3       13.14827  0.4756570   9.313472
##   0.8     0.4       13.17859  0.4945652   9.581749
##   0.8     0.5       13.33467  0.5057595   9.748750
##   0.8     0.6       13.50693  0.5168581  10.012941
##   0.8     0.7       13.70688  0.5257715  10.303201
##   0.8     0.8       13.92630  0.5314981  10.591464
##   0.8     0.9       14.15853  0.5343323  10.854302
##   0.8     1.0       14.40875  0.5347296  11.117149
##   0.9     0.0       15.69625        NaN  12.538724
##   0.9     0.1       12.54923  0.4324305   8.704979
##   0.9     0.2       12.99183  0.4608736   8.653303
##   0.9     0.3       13.49642  0.4760287   9.524598
##   0.9     0.4       13.62767  0.4930172   9.890936
##   0.9     0.5       13.84355  0.5033749  10.134375
##   0.9     0.6       14.06161  0.5140701  10.465136
##   0.9     0.7       14.30666  0.5226414  10.792907
##   0.9     0.8       14.55437  0.5283736  11.079871
##   0.9     0.9       14.80822  0.5318552  11.350032
##   0.9     1.0       15.08011  0.5328512  11.635077
##   1.0     0.0       15.69625        NaN  12.538724
##   1.0     0.1       12.56799  0.4323515   8.662856
##   1.0     0.2       13.21252  0.4601775   8.741190
##   1.0     0.3       13.86660  0.4764490   9.822327
##   1.0     0.4       14.10716  0.4915158  10.277905
##   1.0     0.5       14.37915  0.5012562  10.580294
##   1.0     0.6       14.64847  0.5114856  10.959617
##   1.0     0.7       14.93773  0.5197105  11.318959
##   1.0     0.8       15.21322  0.5255184  11.609361
##   1.0     0.9       15.49077  0.5294580  11.915591
##   1.0     1.0       15.78570  0.5309111  12.226047
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.1.

plot(enetFit)

resamp <- resamples(list(PLS=plsFit, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
(resamp.s <- summary(resamp))

## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: PLS, Ridge, Lasso, enet 
## Number of resamples: 10 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean   3rd Qu.     Max. NA's
## PLS   5.178501 7.188415 8.333728 8.617399 10.302540 11.69573    0
## Ridge 6.802083 7.680192 9.047882 9.284284 11.031392 11.95240    0
## Lasso 6.660690 7.295599 7.918300 8.221542  8.329301 11.98442    0
## enet  6.320890 7.401605 7.656499 8.027004  8.889819 10.34718    0
## 
## RMSE 
##           Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS   6.535488  9.638799 11.53551 11.22365 13.31952 14.06551    0
## Ridge 9.902231 10.446352 11.48805 12.15851 13.75999 15.48795    0
## Lasso 8.975670 10.637842 11.16021 11.56574 12.78378 14.84629    0
## enet  8.566392  9.698955 10.65870 10.72237 11.60401 12.94550    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS   0.1578591 0.4540182 0.5665093 0.5307880 0.6136715 0.9158369    0
## Ridge 0.1831412 0.4019344 0.5990213 0.5405512 0.6504959 0.8222652    0
## Lasso 0.1801029 0.3056181 0.4640828 0.4898894 0.6740984 0.8337504    0
## enet  0.2255447 0.4188976 0.5370606 0.5524289 0.7058119 0.8508363    0

The model with the maximum R^2 appears to be the elastic net model, with R^2 = 0.5524289.

Below, I also evaluate the models using the test set:

plsPred <- predict(plsFit, newdata=X.test)
postResample(pred=plsPred, obs=y.test)

##       RMSE   Rsquared        MAE 
## 12.4780591  0.3536607  8.7204595

multiResample <- function(models, newdata, obs){
  res = list()
  methods = c()
  i = 1
  for (model in models){
    pred <- predict(model, newdata=newdata)
    metrics <- postResample(pred=pred, obs=obs)
    res[[i]] <- metrics
    methods[[i]] <- model$method
    i <- 1 + i
  }
  names(res) <- methods
  return(res)
}

models <- list(plsFit, ridgeFit, lassoFit, enetFit)
(resampleResult <- multiResample(models, X.test, y.test))

## $pls
##       RMSE   Rsquared        MAE 
## 12.4780591  0.3536607  8.7204595 
## 
## $ridge
##       RMSE   Rsquared        MAE 
## 12.3781577  0.4254869  8.9045062 
## 
## $lasso
##       RMSE   Rsquared        MAE 
## 10.7713633  0.4595602  8.5548981 
## 
## $enet
##       RMSE   Rsquared        MAE 
## 11.6712925  0.3935143  8.2445102

The evaluation on the test sets seems to suggest that the Lasso model is best, with R^2 = 0.4595602. Here we seem to have a dilemma: the 10-fold cross validations suggest that the elastic net model is the best, while the test set evaluation suggest that the Lasso model is the best. Here, I would choose to trust the cross validation result, because the cross validation result is closer approximation to the true distribution than the test set, which is equivalent to just one fold of the whole set.

Nonetheless, the scores for the Ridge, Lasso, and Enet are all higher (better performance) than the PLS.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend any of the models to replace the permeability laboratory experiment. The MAE of all of the models are roughly between 8 and 9, meaning that the model predictions are on average +/- 8 to 9 off. Looking at the histogram of the target variable permeability:

hist(permeability)

We can see that most of permeability are under 10. The model’s accuracy is not good enough to replace lab test.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

The preProcess function can be used to impute the imssing value. I choose to use the ‘bagImpute’ method, which impute the missing values through bagged tree model.

(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))

## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - bagged tree imputation (57)
##   - ignored (0)

cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Train/test splitting:

set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]

The elastic net model is tuned using 10-fold cross validation with parameters lambda ranging from 0 to 1, and fraction ranging from 0 to 1. The metric used to decide is the RMSE.

set.seed(1)
enetFit <- train(x=X.train,
                 y=y.train,
                 method='enet',
                 metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
plot(enetFit)

The best parameter combo is fraction = 0.5, lambda = 0.2, with the RMSE = 1.1920333.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

enetPred <- predict(enetFit, newdata=X.test)
(predResult <- postResample(pred=enetPred, obs=y.test))

##      RMSE  Rsquared       MAE 
## 1.0292796 0.6960366 0.8209627

The test set RMSE is 1.0292796. This is lower than the resampled performance metric (cross validated RMSE) on the training set. So the test set result appears to be better than the training set result.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The coefficients of the best-tuned elastic net model is below. We can see that the elastic net zero out some of the predictors, due to the lasso penalty.

(coeffs <- predict.enet(enetFit$finalModel, s=enetFit$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients)

##   BiologicalMaterial01   BiologicalMaterial02   BiologicalMaterial03 
##            0.000000000            0.015230205            0.110767648 
##   BiologicalMaterial04   BiologicalMaterial05   BiologicalMaterial06 
##            0.000000000            0.108886650            0.133816040 
##   BiologicalMaterial07   BiologicalMaterial08   BiologicalMaterial09 
##           -0.036324238            0.000000000            0.000000000 
##   BiologicalMaterial10   BiologicalMaterial11   BiologicalMaterial12 
##            0.000000000            0.000000000            0.000000000 
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 
##            0.000000000            0.000000000            0.000000000 
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 
##            0.054125690            0.000000000            0.163095669 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09 
##           -0.044244659            0.000000000            0.372412766 
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 
##            0.000000000            0.077331599            0.000000000 
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15 
##           -0.215257776            0.000000000            0.115667061 
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18 
##            0.000000000           -0.257187426            0.000000000 
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21 
##            0.033287979            0.000000000            0.000000000 
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
##            0.000000000            0.000000000           -0.001236635 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##            0.000000000            0.000000000            0.000000000 
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
##            0.000000000            0.000000000            0.000000000 
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33 
##            0.000000000            0.651786278            0.000000000 
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 
##            0.126158248            0.000000000           -0.260394128 
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39 
##           -0.163439748            0.000000000            0.046643930 
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42 
##            0.000000000            0.000000000            0.004305952 
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45 
##            0.056021733            0.086624481            0.040897862

We can compare the non-zero coefficients by taking their absolute value, and then sorting them:

coeffs.sorted <- abs(coeffs)
coeffs.sorted <- coeffs.sorted[coeffs.sorted>0]
(coeffs.sorted <- sort(coeffs.sorted, decreasing = T))

## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess36 
##            0.651786278            0.372412766            0.260394128 
## ManufacturingProcess17 ManufacturingProcess13 ManufacturingProcess37 
##            0.257187426            0.215257776            0.163439748 
## ManufacturingProcess06   BiologicalMaterial06 ManufacturingProcess34 
##            0.163095669            0.133816040            0.126158248 
## ManufacturingProcess15   BiologicalMaterial03   BiologicalMaterial05 
##            0.115667061            0.110767648            0.108886650 
## ManufacturingProcess44 ManufacturingProcess11 ManufacturingProcess43 
##            0.086624481            0.077331599            0.056021733 
## ManufacturingProcess04 ManufacturingProcess39 ManufacturingProcess07 
##            0.054125690            0.046643930            0.044244659 
## ManufacturingProcess45   BiologicalMaterial07 ManufacturingProcess19 
##            0.040897862            0.036324238            0.033287979 
##   BiologicalMaterial02 ManufacturingProcess42 ManufacturingProcess24 
##            0.015230205            0.004305952            0.001236635

We can conclude the following:

26 out of the 45 ManufacturingProcess predictors are zero’d out, while 7 out of the 12 BiologicalMaterial predictors are zero’d out.
In the remaining 24 predictors, 19 are ManufacturingProcess predictors and just 5 are BiologicalMaterial predictors
The top 7 highest absolute coefficients are all from the ManufacturingProcess predictors.

It appears that ManufacturingProcess are more important. Alternatively, varImp function can be used to rank the importance of predictors:

varImp(enetFit)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     84.76
## ManufacturingProcess36   73.43
## BiologicalMaterial03     70.69
## ManufacturingProcess13   68.69
## BiologicalMaterial02     64.16
## ManufacturingProcess31   60.12
## BiologicalMaterial12     60.11
## ManufacturingProcess17   57.19
## ManufacturingProcess09   55.42
## ManufacturingProcess33   55.09
## BiologicalMaterial04     51.35
## ManufacturingProcess06   46.88
## BiologicalMaterial11     46.75
## ManufacturingProcess29   46.26
## BiologicalMaterial01     41.20
## BiologicalMaterial08     38.75
## ManufacturingProcess26   28.05
## BiologicalMaterial09     26.02
## ManufacturingProcess11   25.57

Again, 11 out of the 20 in the list are ManufacturingProcess predictors, which makes it more important than BiologicalMaterial.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Elastic net is a linear regression model. The coefficients directly explain how the predictors affect the target. Positive coefficients improve the yield, while negative coefficients decrease the yield.

For the ManufacturingProcess having the positive coefficients, I would alter the process such that the predictor value increases. Below are the ManufacturingProcess having positive coefficients:

coeffs.mp <- coeffs.sorted[grep('ManufacturingProcess', names(coeffs.sorted))] %>% names() %>% coeffs[.]
coeffs.mp[coeffs.mp>0]

## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess06 
##            0.651786278            0.372412766            0.163095669 
## ManufacturingProcess34 ManufacturingProcess15 ManufacturingProcess44 
##            0.126158248            0.115667061            0.086624481 
## ManufacturingProcess11 ManufacturingProcess43 ManufacturingProcess04 
##            0.077331599            0.056021733            0.054125690 
## ManufacturingProcess39 ManufacturingProcess45 ManufacturingProcess19 
##            0.046643930            0.040897862            0.033287979 
## ManufacturingProcess42 
##            0.004305952

For the ManufacturingProcess having the negative coefficients, I would alter the process such that the predictor value decreases. Below are the ManufacturingProcess having negative coefficients:

coeffs.mp[coeffs.mp<0]

## ManufacturingProcess36 ManufacturingProcess17 ManufacturingProcess13 
##           -0.260394128           -0.257187426           -0.215257776 
## ManufacturingProcess37 ManufacturingProcess07 ManufacturingProcess24 
##           -0.163439748           -0.044244659           -0.001236635

DATA 624 - Homework # 7

Jun Yan

March 31, 2019