Questions

Question 1

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

There were originally 1107 predictors, but after filtering out those with near-zero variance only 388 predictors remain.

dim(fingerprints)
## [1]  165 1107
fingerprints.filtered <- fingerprints %>%
  .[, -caret::nearZeroVar(fingerprints)]

dim(fingerprints.filtered)
## [1] 165 388
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

I begin by splitting the data into a training and a test set. At this point I also discovered that log-transforming the response variable led to more consistent results, so I make that transformation here as well.

set.seed(123)
part <- caret::createDataPartition(permeability, p=0.8, list=FALSE)
training <- fingerprints.filtered %>%
  .[part, ] %>%
  as_tibble()
training['permeability'] <- log(permeability[part])
testing <- fingerprints.filtered %>%
  .[-part, ] %>%
  as_tibble()
testing['permeability'] <- log(permeability[-part])

Plotting the RMSE against the number of components, using 10-fold cross-validation, reveals that 10 components give the strongest predictive performance on the validation data.

library(pls)  # plsr(), RMSEP(), R2()
plsFit <- plsr(permeability ~ ., data=training, validation="CV", segments=10, ncomp=90)
plot(RMSEP(plsFit), legendpos='topright')
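
As an alternative, the same tuning could be done with caret, which reports the resampled \(R^2\) alongside RMSE for each number of components. This is only a sketch (not run here); the centering/scaling pre-processing and the tuneLength value are my own assumptions rather than part of the fit above.

# Sketch: tune PLS with caret so the resampled R^2 is reported directly.
set.seed(123)
plsTune <- caret::train(
  permeability ~ ., data = training,
  method = "pls",
  tuneLength = 20,
  trControl = caret::trainControl(method = "cv", number = 10),
  preProcess = c("center", "scale")
)
plsTune$results  # RMSE and Rsquared for each number of components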

A plot of the measured vs. predicted values shows fairly strong predictions. Additional plots (not shown) confirm that the residuals are randomly distributed and centered around 0, which supports the conclusion that this is a good model choice.

plot(plsFit, ncomp=10, asp=1, line=TRUE)

Refitting the model with only 10 components produces the following. The model captures approximately 82% of the variance in the response variable even though only 10 latent components are used.

model <- plsr(permeability ~ ., data=training, validation="CV", segments=10, ncomp=10)
summary(model)
## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.563    1.414    1.285    1.204    1.166    1.152    1.144
## adjCV        1.563    1.413    1.279    1.196    1.156    1.142    1.125
##        7 comps  8 comps  9 comps  10 comps
## CV       1.111    1.081    1.066     1.054
## adjCV    1.091    1.061    1.045     1.033
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X               25.30    41.82    48.74    54.87    62.95    66.56
## permeability    24.89    44.72    55.54    62.72    66.84    72.94
##               7 comps  8 comps  9 comps  10 comps
## X               69.46    72.44    74.40     76.38
## permeability    76.98    78.85    80.44     81.72

The \(R^2\) on the training data at 10 components is approximately 0.82.

R2(model, newdata=training)
## (Intercept)      1 comps      2 comps      3 comps      4 comps  
##  -2.220e-16    2.489e-01    4.472e-01    5.554e-01    6.272e-01  
##     5 comps      6 comps      7 comps      8 comps      9 comps  
##   6.684e-01    7.294e-01    7.698e-01    7.885e-01    8.044e-01  
##    10 comps  
##   8.172e-01
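
The resampled (cross-validated) estimate of \(R^2\) that the exercise asks for can also be read directly from the fitted object; a quick sketch using the model above:

# Cross-validated R^2 for each number of components (available because the
# model was fit with validation="CV").
R2(model, estimate = "CV")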
  4. Predict the response for the test set. What is the test set estimate of \(R^2\)?

The \(R^2\) on the testing data is only 0.48. This is concerning and suggests that I may be overfitting the training data despite performing 10-fold cross-validation. Comparing the \(R^2\) values across components, it appears that as few as 2 components may be enough to make a reasonably strong prediction.

R2(model, newdata=testing)
## (Intercept)      1 comps      2 comps      3 comps      4 comps  
##  -0.0004353    0.3218426    0.4753267    0.3512251    0.4178438  
##     5 comps      6 comps      7 comps      8 comps      9 comps  
##   0.3999964    0.4806289    0.5134123    0.4883341    0.5032752  
##    10 comps  
##   0.4801270

However, comparing RMSE values tells a different story: the test-set RMSE is actually lower than the training RMSE, which suggests the model may not be badly overfit.

ModelMetrics::rmse(training$permeability, predict(model, type='response'))
## [1] 1.344487
ModelMetrics::rmse(testing$permeability, predict(model, type='response', newdata=testing))
## [1] 1.178631
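
Note that predict() on a pls fit returns predictions for every number of components unless ncomp is specified, so the RMSE values above pool across component counts. A sketch (using the objects above) restricted to the 10-component fit:

# Restrict predictions to the 10-component model before computing RMSE.
pred.train <- predict(model, ncomp = 10)
pred.test  <- predict(model, newdata = testing, ncomp = 10)
ModelMetrics::rmse(training$permeability, as.numeric(pred.train))
ModelMetrics::rmse(testing$permeability, as.numeric(pred.test))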
  5. Try building other models discussed in this chapter. Do any have better predictive performance?

I decided to explore a simpler linear model. My strongest model used stepwise selection with the BIC penalty to prevent overfitting and to remove predictors that are not sufficiently useful.

lm.1 <- lm(permeability ~ ., data=training)
MASS::stepAIC(lm.1, trace=0, k=log(nrow(training)))
## 
## Call:
## lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 + 
##     X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 + 
##     X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 + 
##     X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 + 
##     X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 + 
##     X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 + 
##     X503 + X507 + X509 + X571 + X592, data = training)
## 
## Coefficients:
## (Intercept)           X2           X6          X11          X12  
##      1.2770       1.4310       0.9804      -5.9718       2.5714  
##         X15          X16          X35          X41          X87  
##     -2.4649       1.5297      -2.0619      -2.6922       6.7834  
##         X93          X94          X96         X102         X111  
##     -0.8837       3.8669      -5.6135      -3.3766       7.4344  
##        X121         X125         X126         X141         X150  
##     -3.3559       4.4876      -1.7797       0.5557      -4.2130  
##        X153         X156         X159         X221         X225  
##      2.9001      -2.1097      -2.2070       1.5527       3.7926  
##        X226         X230         X231         X238         X241  
##     -1.1395      -5.3491       2.4446       2.2171       4.9788  
##        X242         X247         X248         X257         X258  
##     -3.7532       6.5757      -8.4856       6.5018      -1.8631  
##        X260         X270         X272         X280         X297  
##     -7.9762       3.0475       2.8886       0.7403       1.5096  
##        X306         X314         X316         X319         X320  
##     -2.1863      -6.1460       2.4329       4.5613      -2.4846  
##        X329         X338         X345         X357         X374  
##     -1.5445       2.5435      -0.8583      -2.3279      -0.7184  
##        X376         X496         X503         X507         X509  
##     -0.8583       2.0995      -1.7605       3.0448       1.9632  
##        X571         X592  
##     -3.1022       1.8453
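
(Equivalently, the selected model could be captured directly with `lm.2 <- MASS::stepAIC(lm.1, trace=0, k=log(nrow(training)))`; the formula is written out below for transparency.)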
lm.2 <- lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 + 
    X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 + 
    X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 + 
    X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 + 
    X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 + 
    X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 + 
    X503 + X507 + X509 + X571 + X592, data = training)
summary(lm.2)
## 
## Call:
## lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 + 
##     X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 + 
##     X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 + 
##     X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 + 
##     X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 + 
##     X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 + 
##     X503 + X507 + X509 + X571 + X592, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12513 -0.23344 -0.02274  0.29784  1.22245 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2770     0.7335   1.741 0.085744 .  
## X2            1.4310     0.5416   2.642 0.009996 ** 
## X6            0.9804     0.1620   6.053 5.00e-08 ***
## X11          -5.9718     1.4960  -3.992 0.000150 ***
## X12           2.5714     0.8231   3.124 0.002526 ** 
## X15          -2.4649     0.6465  -3.813 0.000278 ***
## X16           1.5297     0.4665   3.279 0.001573 ** 
## X35          -2.0619     0.5336  -3.864 0.000233 ***
## X41          -2.6922     0.6106  -4.409 3.37e-05 ***
## X87           6.7834     1.9833   3.420 0.001009 ** 
## X93          -0.8837     0.1791  -4.934 4.63e-06 ***
## X94           3.8669     0.9158   4.222 6.63e-05 ***
## X96          -5.6135     1.6562  -3.389 0.001114 ** 
## X102         -3.3766     1.6418  -2.057 0.043151 *  
## X111          7.4344     1.8690   3.978 0.000157 ***
## X121         -3.3559     0.9862  -3.403 0.001067 ** 
## X125          4.4876     1.1620   3.862 0.000235 ***
## X126         -1.7797     0.6789  -2.621 0.010578 *  
## X141          0.5557     0.1648   3.372 0.001175 ** 
## X150         -4.2130     1.1199  -3.762 0.000330 ***
## X153          2.9001     1.0325   2.809 0.006320 ** 
## X156         -2.1097     0.6501  -3.245 0.001746 ** 
## X159         -2.2070     0.6494  -3.398 0.001082 ** 
## X221          1.5527     0.6054   2.565 0.012295 *  
## X225          3.7926     1.4693   2.581 0.011773 *  
## X226         -1.1395     0.5817  -1.959 0.053795 .  
## X230         -5.3491     1.6468  -3.248 0.001730 ** 
## X231          2.4446     0.6284   3.890 0.000213 ***
## X238          2.2171     0.6354   3.490 0.000809 ***
## X241          4.9788     1.5088   3.300 0.001474 ** 
## X242         -3.7532     0.6922  -5.423 6.70e-07 ***
## X247          6.5757     1.3199   4.982 3.84e-06 ***
## X248         -8.4856     2.1700  -3.910 0.000199 ***
## X257          6.5018     2.1964   2.960 0.004099 ** 
## X258         -1.8631     0.4365  -4.268 5.63e-05 ***
## X260         -7.9762     1.5185  -5.253 1.32e-06 ***
## X270          3.0475     0.8552   3.564 0.000636 ***
## X272          2.8886     0.6223   4.642 1.42e-05 ***
## X280          0.7403     0.3368   2.198 0.030996 *  
## X297          1.5096     0.5383   2.804 0.006396 ** 
## X306         -2.1863     0.4571  -4.783 8.28e-06 ***
## X314         -6.1460     1.5140  -4.059 0.000118 ***
## X316          2.4329     0.7081   3.436 0.000960 ***
## X319          4.5613     1.6914   2.697 0.008618 ** 
## X320         -2.4846     1.0655  -2.332 0.022363 *  
## X329         -1.5445     0.4399  -3.511 0.000755 ***
## X338          2.5435     0.5194   4.897 5.35e-06 ***
## X345         -0.8583     0.4238  -2.025 0.046367 *  
## X357         -2.3279     0.5312  -4.382 3.72e-05 ***
## X374         -0.7184     0.3081  -2.332 0.022356 *  
## X376         -0.8583     0.4520  -1.899 0.061359 .  
## X496          2.0995     0.5680   3.696 0.000411 ***
## X503         -1.7605     0.5733  -3.071 0.002961 ** 
## X507          3.0448     0.5831   5.222 1.50e-06 ***
## X509          1.9632     0.5152   3.811 0.000279 ***
## X571         -3.1022     0.6293  -4.930 4.71e-06 ***
## X592          1.8453     0.6435   2.868 0.005347 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6499 on 76 degrees of freedom
## Multiple R-squared:  0.8997, Adjusted R-squared:  0.8258 
## F-statistic: 12.18 on 56 and 76 DF,  p-value: < 2.2e-16

The final model contains far more terms than the 10 latent variables used in the PLS model developed previously. It also has a slightly better test-set RMSE.

ModelMetrics::rmse(testing$permeability, predict(lm.2, type='response', newdata=testing))
## [1] 1.142615
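
For a true penalized regression of the kind discussed in this chapter, an elastic net could also be tuned with caret. This is only a sketch (not run here); it assumes the glmnet package is available, and the tuning grid size is my own choice rather than part of the analysis above.

# Elastic net over the filtered fingerprints; caret tunes alpha and lambda
# by 10-fold cross-validation.
set.seed(123)
enetTune <- caret::train(
  x = as.matrix(training %>% select(-permeability)),
  y = training$permeability,
  method = "glmnet",
  tuneLength = 10,
  trControl = caret::trainControl(method = "cv", number = 10)
)
ModelMetrics::rmse(testing$permeability,
                   predict(enetTune, newdata = as.matrix(testing %>% select(-permeability))))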

  6. Would you recommend any of your models to replace the permeability laboratory experiment?

No. There is often a tough decision to be made about which predictors should be included in the final model. In this case, adding numerous extra features only slightly improved the model's predictive performance, and as we discovered when building the PLS model, the usefulness of additional components dropped quickly. I am concerned that the BIC model is overfitting the data by using so many additional predictors, and even if it is not, the large increase in predictors for only a small improvement in predictive ability is not a reasonable trade-off. With a test-set \(R^2\) of roughly 0.5, neither model is accurate enough to replace the laboratory experiment.

Question 2

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

The plot below shows which values are missing, where they occur, and their counts. While there appears to be some structure in the missingness, the total amount missing is small enough that the values can be imputed without much concern.

Chemical.Y <- ChemicalManufacturingProcess$Yield
Chemical.X <- ChemicalManufacturingProcess %>%
  select(-Yield)

VIM::aggr(Chemical.X, col=c('navyblue', 'yellow'),
          numbers=TRUE, sortVars=TRUE,
          labels=names(Chemical.X), cex.axis=.7,
          gap=3, ylab=c('Missing Data', 'Pattern'), combined=TRUE)
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies

## 
##  Variables sorted by number of missings: 
##                Variable Count
##  ManufacturingProcess03    15
##  ManufacturingProcess11    10
##  ManufacturingProcess10     9
##  ManufacturingProcess25     5
##  ManufacturingProcess26     5
##  ManufacturingProcess27     5
##  ManufacturingProcess28     5
##  ManufacturingProcess29     5
##  ManufacturingProcess30     5
##  ManufacturingProcess31     5
##  ManufacturingProcess33     5
##  ManufacturingProcess34     5
##  ManufacturingProcess35     5
##  ManufacturingProcess36     5
##  ManufacturingProcess02     3
##  ManufacturingProcess06     2
##  ManufacturingProcess01     1
##  ManufacturingProcess04     1
##  ManufacturingProcess05     1
##  ManufacturingProcess07     1
##  ManufacturingProcess08     1
##  ManufacturingProcess12     1
##  ManufacturingProcess14     1
##  ManufacturingProcess22     1
##  ManufacturingProcess23     1
##  ManufacturingProcess24     1
##  ManufacturingProcess40     1
##  ManufacturingProcess41     1
##    BiologicalMaterial01     0
##    BiologicalMaterial02     0
##    BiologicalMaterial03     0
##    BiologicalMaterial04     0
##    BiologicalMaterial05     0
##    BiologicalMaterial06     0
##    BiologicalMaterial07     0
##    BiologicalMaterial08     0
##    BiologicalMaterial09     0
##    BiologicalMaterial10     0
##    BiologicalMaterial11     0
##    BiologicalMaterial12     0
##  ManufacturingProcess09     0
##  ManufacturingProcess13     0
##  ManufacturingProcess15     0
##  ManufacturingProcess16     0
##  ManufacturingProcess17     0
##  ManufacturingProcess18     0
##  ManufacturingProcess19     0
##  ManufacturingProcess20     0
##  ManufacturingProcess21     0
##  ManufacturingProcess32     0
##  ManufacturingProcess37     0
##  ManufacturingProcess38     0
##  ManufacturingProcess39     0
##  ManufacturingProcess42     0
##  ManufacturingProcess43     0
##  ManufacturingProcess44     0
##  ManufacturingProcess45     0

I use the mice package to impute the missing data.

set.seed(123)
imputed.data <- mice::mice(Chemical.X, m=5, maxit=10, seed=123, printFlag=FALSE)
## Warning: Number of logged events: 1350
completed.data <- cbind(Chemical.Y, mice::complete(imputed.data, 1))  # use the first completed data set

completed.data %>%
  map_dbl(~sum(is.na(.))/nrow(completed.data)) 
##             Chemical.Y   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      0                      0 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                      0                      0                      0 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      0                      0                      0 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      0                      0 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      0                      0                      0 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      0                      0 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      0                      0                      0 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      0                      0                      0 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      0                      0                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      0                      0                      0 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      0                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      0                      0 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I begin by splitting the data into a training and a test set.

set.seed(123)

part <- caret::createDataPartition(completed.data$Chemical.Y, p=0.8, list=FALSE)
training <- completed.data %>%
  filter(row_number() %in% part)
testing <- completed.data %>%
  filter(!row_number() %in% part)

There are a large number of predictors (57) compared to the number of observations in the training set (144), so I will use the lasso to filter the predictors down to the most essential ones.

model <- lars::lars(x=as.matrix(training %>% select(-Chemical.Y)), y=training$Chemical.Y)
cv <- lars::cv.lars(x=as.matrix(training %>% select(-Chemical.Y)), y=training$Chemical.Y)

cv
## $index
##   [1] 0.00000000 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
##   [7] 0.06060606 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111
##  [13] 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
##  [19] 0.18181818 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323
##  [25] 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
##  [31] 0.30303030 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
##  [37] 0.36363636 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141
##  [43] 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
##  [49] 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
##  [55] 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.59595960
##  [61] 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
##  [67] 0.66666667 0.67676768 0.68686869 0.69696970 0.70707071 0.71717172
##  [73] 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
##  [79] 0.78787879 0.79797980 0.80808081 0.81818182 0.82828283 0.83838384
##  [85] 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.89898990
##  [91] 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
##  [97] 0.96969697 0.97979798 0.98989899 1.00000000
## 
## $cv
##   [1]  3.603045  3.119527  2.668574  2.285438  1.975633  1.753369  1.607615
##   [8]  1.498252  1.408962  1.355895  1.311203  1.274076  1.251260  1.237331
##  [15]  1.237349  1.247559  1.336748  1.443217  1.560738  1.651065  1.754257
##  [22]  1.871375  1.989457  2.104459  2.221986  2.372258  2.553525  2.782083
##  [29]  3.207805  3.671242  4.298886  5.095387  5.782114  6.402738  7.050858
##  [36]  7.625435  8.209090  8.856248  9.397578  9.924282 10.385342 10.843682
##  [43] 11.364685 11.824988 12.137381 12.535546 13.042626 13.566467 14.181873
##  [50] 14.771680 15.358330 15.881280 16.419840 16.930302 17.435201 17.974143
##  [57] 18.513420 19.066209 19.631284 20.205527 20.782147 21.299158 21.826010
##  [64] 22.416785 23.491031 24.593887 25.726342 26.902444 28.124980 29.378887
##  [71] 30.664016 31.983306 33.342370 34.558739 35.713193 36.888217 38.083795
##  [78] 39.299904 40.472599 41.533485 42.610324 43.620389 44.395138 45.127369
##  [85] 45.831436 46.547045 47.269272 48.000857 48.743346 49.496733 50.272849
##  [92] 51.126434 51.993956 52.871492 53.759050 54.636169 55.514132 56.401491
##  [99] 57.298263 58.204153
## 
## $cv.error
##   [1]  0.3258995  0.3242379  0.3042758  0.2862965  0.2717001  0.2601974
##   [7]  0.2538552  0.2501950  0.2395085  0.2311294  0.2234566  0.2171469
##  [13]  0.2095785  0.2012632  0.1944954  0.1900733  0.2069854  0.2594999
##  [19]  0.3391864  0.4073174  0.4890033  0.5869647  0.6909402  0.7931668
##  [25]  0.8788900  0.9667561  1.0602803  1.1767355  1.4591303  1.7509486
##  [31]  2.1784085  2.7864936  3.3183674  3.7793901  4.2575625  4.6688753
##  [37]  5.0999551  5.5505576  5.8988274  6.2362344  6.5254096  6.8184576
##  [43]  7.1697929  7.4759851  7.6440263  7.9347156  8.3272779  8.7319068
##  [49]  9.1704261  9.5815214  9.9869386 10.3842312 10.7940311 11.1776753
##  [55] 11.5540414 11.9597981 12.3707666 12.7902522 13.2174095 13.6516992
##  [61] 14.0869707 14.4685690 14.8559186 15.3003698 16.1933528 17.1148969
##  [67] 18.0647759 19.0562510 20.0918372 21.1572542 22.2524586 23.3779721
##  [73] 24.5347475 25.5543437 26.5132733 27.4902884 28.4853858 29.4985642
##  [79] 30.4695838 31.3320436 32.2068693 33.0809775 33.9302686 34.7425526
##  [85] 35.5297799 36.3295506 37.1412631 37.9651437 38.8012381 39.6494094
##  [91] 40.5104697 41.3880774 42.2767026 43.1759806 44.0858492 44.9862207
##  [97] 45.8880794 46.7998007 47.7213535 48.6527116
## 
## $mode
## [1] "fraction"
cv$index[which.min(cv$cv)]
## [1] 0.1313131

Refitting at the fraction that minimizes the cross-validated error produces the coefficients I will use in my final model. Of the original 57 predictors, only 17 are retained.

best.s <- cv$index[which.min(cv$cv)]  # fraction with the minimum CV error (~0.1313)
lasso.predict <- predict(model, s=best.s, type='coef', mode='fraction')
lasso.predict$coef
##   BiologicalMaterial01   BiologicalMaterial02   BiologicalMaterial03 
##             0.00000000             0.00000000             0.03935311 
##   BiologicalMaterial04   BiologicalMaterial05   BiologicalMaterial06 
##             0.00000000             0.07932232             0.00000000 
##   BiologicalMaterial07   BiologicalMaterial08   BiologicalMaterial09 
##            -0.04170001             0.00000000             0.00000000 
##   BiologicalMaterial10   BiologicalMaterial11   BiologicalMaterial12 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 
##             0.01826043             0.00000000             0.00000000 
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 
##             0.02710271             0.00000000             0.06316752 
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09 
##            -0.07561489             0.00000000             0.27818286 
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15 
##            -0.13177469             0.00000000             0.00000000 
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18 
##             0.00000000            -0.27462573             0.00000000 
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 
##             0.00000000            -0.02307966             0.00000000 
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 
##             0.00000000             0.31947682             0.00000000 
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33 
##             0.00000000             0.16196981             0.00000000 
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 
##             4.09843065             0.00000000             0.00000000 
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39 
##            -0.36370676             0.00000000             0.04432715 
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42 
##             0.00000000             0.00000000             0.00000000 
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45 
##             0.00000000             0.00000000             0.17194955

This model has an RMSE of about 26 on the test set. Note, however, that the matrix product below omits the intercept that lars fits on the centered data, so the absolute value of this RMSE appears to be inflated by a constant offset.

ModelMetrics::rmse(testing$Chemical.Y, as.matrix(testing %>% select(-Chemical.Y)) %*% lasso.predict$coef)
## [1] 26.12255
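
A sketch (using the objects above) of obtaining predictions through predict.lars(), which applies the centering and intercept internally and so should give an RMSE on the natural scale of Yield:

# Predictions via predict.lars(), which handles the intercept/centering.
best.s <- cv$index[which.min(cv$cv)]
test.fit <- predict(model,
                    newx = as.matrix(testing %>% select(-Chemical.Y)),
                    s = best.s, mode = 'fraction', type = 'fit')$fit
ModelMetrics::rmse(testing$Chemical.Y, test.fit)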
  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The training-set RMSE is nearly identical to the test-set RMSE. While the absolute values are inflated for the reason noted above, the close agreement between training and testing error speaks well of the model and suggests that it is not overfitting the data.

ModelMetrics::rmse(training$Chemical.Y, as.matrix(training %>% select(-Chemical.Y)) %*% lasso.predict$coef)
## [1] 26.18219
  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

There are 57 predictors in the data set, of which 17 were selected by the lasso: 3 of the 12 biological predictors and 14 of the 45 manufacturing process predictors. Proportionally, somewhat more manufacturing process predictors were selected than biological ones, but not overwhelmingly so; if 4 biological predictors had been selected instead of 3, the proportions would be nearly identical. It is therefore safe to say that neither group completely dominates the model.
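
A quick tally of the nonzero lasso coefficients by predictor group, as a sketch using the coefficient vector above:

# Count the selected predictors by group.
nonzero <- names(lasso.predict$coef)[lasso.predict$coef != 0]
table(ifelse(grepl('^Biological', nonzero), 'Biological', 'ManufacturingProcess'))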

  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Interestingly, the correlation plot shows that the correlation between the response and some of the selected predictors is not especially strong. It is likely that a number of the predictors work in tandem to provide predictive value, or that a stronger form of regularization is needed to remove additional predictors.

For simplicity, we can focus on Manufacturing Processes 9, 29, and 32 as being strongly positively correlated with the response, and Manufacturing Processes 13 and 17 as being strongly negatively correlated. Since the manufacturing process predictors can be controlled, future runs could aim to increase the former and decrease the latter, and then monitor whether yield improves as a result.

# Yield (column 1) plus the 17 predictors retained by the lasso
subset <- training %>%
  select(c(1, 4, 6, 8, 14, 17, 19, 20, 22, 26, 30, 36, 42, 45, 47, 50, 52, 58))
corrplot::corrplot(cor(subset), method='circle')