Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
There were originally 1107 predictors, but after filtering out the predictors with near-zero variance, only 388 predictors remain.
dim(fingerprints)
## [1] 165 1107
fingerprints.filtered <- fingerprints %>%
.[, -caret::nearZeroVar(fingerprints)]
dim(fingerprints.filtered)
## [1] 165 388
I begin by splitting the data into a training and test set. At this point I also discovered that log-transforming the response variable led to more consistent results, so I make that transformation here as well.
set.seed(123)
part <- caret::createDataPartition(permeability, p=0.8, list=FALSE)
training <- fingerprints.filtered %>%
.[part, ] %>%
as_tibble()
training['permeability'] <- log(permeability[part])
testing <- fingerprints.filtered %>%
.[-part, ] %>%
as_tibble()
testing['permeability'] <- log(permeability[-part])
Plotting the RMSE against the number of components reveals that 10 components give the strongest predictive value on the validation data. This was done using 10-fold cross-validation.
plsFit <- plsr(permeability ~ ., data=training, validation="CV", segments=10, ncomp=90)
plot(RMSEP(plsFit), legendpos='topright')
A plot of the measured vs. predicted values shows fairly strong predictions. Additional plots (not shown) confirm that the residuals are randomly distributed and centered around 0, which supports this choice of model.
plot(plsFit, ncomp=10, asp=1, line=TRUE)
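The residual plots are not reproduced here, but a minimal sketch of how they could be generated from the plsr fit is shown below; residuals() and fitted() on a plsr object return one slice per number of components, so the 10-component slice is indexed explicitly.
# Sketch of the residual diagnostics referenced above (plots not shown).
res10 <- residuals(plsFit)[, 1, 10]
plot(fitted(plsFit)[, 1, 10], res10, xlab='Fitted (10 comps)', ylab='Residual')
abline(h=0, lty=2)
hist(res10, main='Distribution of residuals', xlab='Residual')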
Refitting with only 10 components gives the following model. It captures approximately 82% of the variance in the response variable despite compressing the 388 predictors into only 10 components.
model <- plsr(permeability ~ ., data=training, validation="CV", segments=10, ncomp=10)
summary(model)
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.563 1.414 1.285 1.204 1.166 1.152 1.144
## adjCV 1.563 1.413 1.279 1.196 1.156 1.142 1.125
## 7 comps 8 comps 9 comps 10 comps
## CV 1.111 1.081 1.066 1.054
## adjCV 1.091 1.061 1.045 1.033
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 25.30 41.82 48.74 54.87 62.95 66.56
## permeability 24.89 44.72 55.54 62.72 66.84 72.94
## 7 comps 8 comps 9 comps 10 comps
## X 69.46 72.44 74.40 76.38
## permeability 76.98 78.85 80.44 81.72
The \(R^2\) on the training data is 0.82.
R2(model, newdata=training)
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## -2.220e-16 2.489e-01 4.472e-01 5.554e-01 6.272e-01
## 5 comps 6 comps 7 comps 8 comps 9 comps
## 6.684e-01 7.294e-01 7.698e-01 7.885e-01 8.044e-01
## 10 comps
## 8.172e-01
The \(R^2\) on the testing data is only 0.48. This is concerning and indicates that I am likely overfitting to the training data despite performing 10-fold cross-validation. Comparing the \(R^2\) results, it appears that 2 components may be enough to make a strong prediction.
R2(model, newdata=testing)
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## -0.0004353 0.3218426 0.4753267 0.3512251 0.4178438
## 5 comps 6 comps 7 comps 8 comps 9 comps
## 0.3999964 0.4806289 0.5134123 0.4883341 0.5032752
## 10 comps
## 0.4801270
However, comparing the RMSE values shows that the test-set RMSE is actually lower than the training RMSE, which suggests the model may not be overfit.
ModelMetrics::rmse(training$permeability, predict(model, type='response'))
## [1] 1.344487
ModelMetrics::rmse(testing$permeability, predict(model, type='response', newdata=testing))
## [1] 1.178631
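Note that predict() on a plsr object returns predictions for every stored number of components unless ncomp is specified, so as a sanity check the comparison could be restricted to a fixed number of components; a minimal sketch (results not shown):
# Sketch: restrict the predictions to a specific number of components via ncomp.
ModelMetrics::rmse(testing$permeability,
                   as.vector(predict(model, newdata=testing, ncomp=10)))
# The same call with ncomp=2 would check whether two components suffice.
ModelMetrics::rmse(testing$permeability,
                   as.vector(predict(model, newdata=testing, ncomp=2)))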
I decided to explore models selected with an information-criterion penalty. My strongest model used stepwise selection with a BIC penalty (stepAIC with k = log(n)) to prevent overfitting and to remove predictors that are not sufficiently useful.
lm.1 <- lm(permeability ~ ., data=training)
MASS::stepAIC(lm.1, trace=0, k=log(nrow(training)))
##
## Call:
## lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 +
## X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 +
## X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 +
## X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 +
## X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 +
## X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 +
## X503 + X507 + X509 + X571 + X592, data = training)
##
## Coefficients:
## (Intercept) X2 X6 X11 X12
## 1.2770 1.4310 0.9804 -5.9718 2.5714
## X15 X16 X35 X41 X87
## -2.4649 1.5297 -2.0619 -2.6922 6.7834
## X93 X94 X96 X102 X111
## -0.8837 3.8669 -5.6135 -3.3766 7.4344
## X121 X125 X126 X141 X150
## -3.3559 4.4876 -1.7797 0.5557 -4.2130
## X153 X156 X159 X221 X225
## 2.9001 -2.1097 -2.2070 1.5527 3.7926
## X226 X230 X231 X238 X241
## -1.1395 -5.3491 2.4446 2.2171 4.9788
## X242 X247 X248 X257 X258
## -3.7532 6.5757 -8.4856 6.5018 -1.8631
## X260 X270 X272 X280 X297
## -7.9762 3.0475 2.8886 0.7403 1.5096
## X306 X314 X316 X319 X320
## -2.1863 -6.1460 2.4329 4.5613 -2.4846
## X329 X338 X345 X357 X374
## -1.5445 2.5435 -0.8583 -2.3279 -0.7184
## X376 X496 X503 X507 X509
## -0.8583 2.0995 -1.7605 3.0448 1.9632
## X571 X592
## -3.1022 1.8453
lm.2 <- lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 +
X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 +
X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 +
X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 +
X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 +
X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 +
X503 + X507 + X509 + X571 + X592, data = training)
summary(lm.2)
##
## Call:
## lm(formula = permeability ~ X2 + X6 + X11 + X12 + X15 + X16 +
## X35 + X41 + X87 + X93 + X94 + X96 + X102 + X111 + X121 +
## X125 + X126 + X141 + X150 + X153 + X156 + X159 + X221 + X225 +
## X226 + X230 + X231 + X238 + X241 + X242 + X247 + X248 + X257 +
## X258 + X260 + X270 + X272 + X280 + X297 + X306 + X314 + X316 +
## X319 + X320 + X329 + X338 + X345 + X357 + X374 + X376 + X496 +
## X503 + X507 + X509 + X571 + X592, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12513 -0.23344 -0.02274 0.29784 1.22245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2770 0.7335 1.741 0.085744 .
## X2 1.4310 0.5416 2.642 0.009996 **
## X6 0.9804 0.1620 6.053 5.00e-08 ***
## X11 -5.9718 1.4960 -3.992 0.000150 ***
## X12 2.5714 0.8231 3.124 0.002526 **
## X15 -2.4649 0.6465 -3.813 0.000278 ***
## X16 1.5297 0.4665 3.279 0.001573 **
## X35 -2.0619 0.5336 -3.864 0.000233 ***
## X41 -2.6922 0.6106 -4.409 3.37e-05 ***
## X87 6.7834 1.9833 3.420 0.001009 **
## X93 -0.8837 0.1791 -4.934 4.63e-06 ***
## X94 3.8669 0.9158 4.222 6.63e-05 ***
## X96 -5.6135 1.6562 -3.389 0.001114 **
## X102 -3.3766 1.6418 -2.057 0.043151 *
## X111 7.4344 1.8690 3.978 0.000157 ***
## X121 -3.3559 0.9862 -3.403 0.001067 **
## X125 4.4876 1.1620 3.862 0.000235 ***
## X126 -1.7797 0.6789 -2.621 0.010578 *
## X141 0.5557 0.1648 3.372 0.001175 **
## X150 -4.2130 1.1199 -3.762 0.000330 ***
## X153 2.9001 1.0325 2.809 0.006320 **
## X156 -2.1097 0.6501 -3.245 0.001746 **
## X159 -2.2070 0.6494 -3.398 0.001082 **
## X221 1.5527 0.6054 2.565 0.012295 *
## X225 3.7926 1.4693 2.581 0.011773 *
## X226 -1.1395 0.5817 -1.959 0.053795 .
## X230 -5.3491 1.6468 -3.248 0.001730 **
## X231 2.4446 0.6284 3.890 0.000213 ***
## X238 2.2171 0.6354 3.490 0.000809 ***
## X241 4.9788 1.5088 3.300 0.001474 **
## X242 -3.7532 0.6922 -5.423 6.70e-07 ***
## X247 6.5757 1.3199 4.982 3.84e-06 ***
## X248 -8.4856 2.1700 -3.910 0.000199 ***
## X257 6.5018 2.1964 2.960 0.004099 **
## X258 -1.8631 0.4365 -4.268 5.63e-05 ***
## X260 -7.9762 1.5185 -5.253 1.32e-06 ***
## X270 3.0475 0.8552 3.564 0.000636 ***
## X272 2.8886 0.6223 4.642 1.42e-05 ***
## X280 0.7403 0.3368 2.198 0.030996 *
## X297 1.5096 0.5383 2.804 0.006396 **
## X306 -2.1863 0.4571 -4.783 8.28e-06 ***
## X314 -6.1460 1.5140 -4.059 0.000118 ***
## X316 2.4329 0.7081 3.436 0.000960 ***
## X319 4.5613 1.6914 2.697 0.008618 **
## X320 -2.4846 1.0655 -2.332 0.022363 *
## X329 -1.5445 0.4399 -3.511 0.000755 ***
## X338 2.5435 0.5194 4.897 5.35e-06 ***
## X345 -0.8583 0.4238 -2.025 0.046367 *
## X357 -2.3279 0.5312 -4.382 3.72e-05 ***
## X374 -0.7184 0.3081 -2.332 0.022356 *
## X376 -0.8583 0.4520 -1.899 0.061359 .
## X496 2.0995 0.5680 3.696 0.000411 ***
## X503 -1.7605 0.5733 -3.071 0.002961 **
## X507 3.0448 0.5831 5.222 1.50e-06 ***
## X509 1.9632 0.5152 3.811 0.000279 ***
## X571 -3.1022 0.6293 -4.930 4.71e-06 ***
## X592 1.8453 0.6435 2.868 0.005347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6499 on 76 degrees of freedom
## Multiple R-squared: 0.8997, Adjusted R-squared: 0.8258
## F-statistic: 12.18 on 56 and 76 DF, p-value: < 2.2e-16
The final model contains many more predictors than the PLS model developed previously. It also has a slightly better test RMSE.
ModelMetrics::rmse(testing$permeability, predict(lm.2, type='response', newdata=testing))
## [1] 1.142615
Would you recommend any of your models to replace the permeability laboratory experiment?
No. There is often a tough decision to be made about which predictors should be included in the final model. In this case, adding numerous extra features only slightly improved the model's predictive performance. As we discovered when creating the PLS model, the usefulness of the predictors dropped off quickly. I am concerned that the BIC model is overfitting the data by using so many additional predictors. Even if it is not, the large increase in predictors for such a small improvement in predictive ability is not a reasonable trade-off.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
The plot below shows the missing values, where they occur, and their counts. While there appears to be some correlation among the missing values, the total amount missing is small enough that they can be imputed without much concern.
Chemical.Y <- ChemicalManufacturingProcess$Yield
Chemical.X <- ChemicalManufacturingProcess %>%
select(-Yield)
VIM::aggr(Chemical.X, col=c('navyblue', 'yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(Chemical.X), cex.axis=.7,
gap=3, ylab=c('Missing Data', 'Pattern'), combined=TRUE)
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## ManufacturingProcess03 15
## ManufacturingProcess11 10
## ManufacturingProcess10 9
## ManufacturingProcess25 5
## ManufacturingProcess26 5
## ManufacturingProcess27 5
## ManufacturingProcess28 5
## ManufacturingProcess29 5
## ManufacturingProcess30 5
## ManufacturingProcess31 5
## ManufacturingProcess33 5
## ManufacturingProcess34 5
## ManufacturingProcess35 5
## ManufacturingProcess36 5
## ManufacturingProcess02 3
## ManufacturingProcess06 2
## ManufacturingProcess01 1
## ManufacturingProcess04 1
## ManufacturingProcess05 1
## ManufacturingProcess07 1
## ManufacturingProcess08 1
## ManufacturingProcess12 1
## ManufacturingProcess14 1
## ManufacturingProcess22 1
## ManufacturingProcess23 1
## ManufacturingProcess24 1
## ManufacturingProcess40 1
## ManufacturingProcess41 1
## BiologicalMaterial01 0
## BiologicalMaterial02 0
## BiologicalMaterial03 0
## BiologicalMaterial04 0
## BiologicalMaterial05 0
## BiologicalMaterial06 0
## BiologicalMaterial07 0
## BiologicalMaterial08 0
## BiologicalMaterial09 0
## BiologicalMaterial10 0
## BiologicalMaterial11 0
## BiologicalMaterial12 0
## ManufacturingProcess09 0
## ManufacturingProcess13 0
## ManufacturingProcess15 0
## ManufacturingProcess16 0
## ManufacturingProcess17 0
## ManufacturingProcess18 0
## ManufacturingProcess19 0
## ManufacturingProcess20 0
## ManufacturingProcess21 0
## ManufacturingProcess32 0
## ManufacturingProcess37 0
## ManufacturingProcess38 0
## ManufacturingProcess39 0
## ManufacturingProcess42 0
## ManufacturingProcess43 0
## ManufacturingProcess44 0
## ManufacturingProcess45 0
I am using the mice library to impute the missing data.
set.seed(123)
imputed.data <- mice::mice(Chemical.X, m=5, maxit=10, seed=123, printFlag=FALSE)
## Warning: Number of logged events: 1350
completed.data <- cbind(Chemical.Y, mice::complete(imputed.data, 1))
completed.data %>%
map_dbl(~sum(is.na(.))/nrow(completed.data))
## Chemical.Y BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 0 0
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 0 0 0
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 0 0 0
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 0 0
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 0 0 0
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 0 0
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 0 0 0
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 0 0 0
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 0 0 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 0 0 0
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 0 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 0 0
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
I begin by splitting the data into a training and test set.
set.seed(123)
part <- caret::createDataPartition(completed.data$Chemical.Y, p=0.8, list=FALSE)
training <- completed.data %>%
filter(row_number() %in% part)
testing <- completed.data %>%
filter(!row_number() %in% part)
There are a large number of predictors (57) compared to the number of observations in the training set (144), so I will use the lasso to filter the predictors down to the most essential ones.
model <- lars::lars(x=as.matrix(training %>% select(-Chemical.Y)), y=training$Chemical.Y)
cv <- lars::cv.lars(x=as.matrix(training %>% select(-Chemical.Y)), y=training$Chemical.Y)
cv
## $index
## [1] 0.00000000 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
## [7] 0.06060606 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111
## [13] 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
## [19] 0.18181818 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323
## [25] 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
## [31] 0.30303030 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
## [37] 0.36363636 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141
## [43] 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
## [49] 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
## [55] 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.59595960
## [61] 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
## [67] 0.66666667 0.67676768 0.68686869 0.69696970 0.70707071 0.71717172
## [73] 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
## [79] 0.78787879 0.79797980 0.80808081 0.81818182 0.82828283 0.83838384
## [85] 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.89898990
## [91] 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
## [97] 0.96969697 0.97979798 0.98989899 1.00000000
##
## $cv
## [1] 3.603045 3.119527 2.668574 2.285438 1.975633 1.753369 1.607615
## [8] 1.498252 1.408962 1.355895 1.311203 1.274076 1.251260 1.237331
## [15] 1.237349 1.247559 1.336748 1.443217 1.560738 1.651065 1.754257
## [22] 1.871375 1.989457 2.104459 2.221986 2.372258 2.553525 2.782083
## [29] 3.207805 3.671242 4.298886 5.095387 5.782114 6.402738 7.050858
## [36] 7.625435 8.209090 8.856248 9.397578 9.924282 10.385342 10.843682
## [43] 11.364685 11.824988 12.137381 12.535546 13.042626 13.566467 14.181873
## [50] 14.771680 15.358330 15.881280 16.419840 16.930302 17.435201 17.974143
## [57] 18.513420 19.066209 19.631284 20.205527 20.782147 21.299158 21.826010
## [64] 22.416785 23.491031 24.593887 25.726342 26.902444 28.124980 29.378887
## [71] 30.664016 31.983306 33.342370 34.558739 35.713193 36.888217 38.083795
## [78] 39.299904 40.472599 41.533485 42.610324 43.620389 44.395138 45.127369
## [85] 45.831436 46.547045 47.269272 48.000857 48.743346 49.496733 50.272849
## [92] 51.126434 51.993956 52.871492 53.759050 54.636169 55.514132 56.401491
## [99] 57.298263 58.204153
##
## $cv.error
## [1] 0.3258995 0.3242379 0.3042758 0.2862965 0.2717001 0.2601974
## [7] 0.2538552 0.2501950 0.2395085 0.2311294 0.2234566 0.2171469
## [13] 0.2095785 0.2012632 0.1944954 0.1900733 0.2069854 0.2594999
## [19] 0.3391864 0.4073174 0.4890033 0.5869647 0.6909402 0.7931668
## [25] 0.8788900 0.9667561 1.0602803 1.1767355 1.4591303 1.7509486
## [31] 2.1784085 2.7864936 3.3183674 3.7793901 4.2575625 4.6688753
## [37] 5.0999551 5.5505576 5.8988274 6.2362344 6.5254096 6.8184576
## [43] 7.1697929 7.4759851 7.6440263 7.9347156 8.3272779 8.7319068
## [49] 9.1704261 9.5815214 9.9869386 10.3842312 10.7940311 11.1776753
## [55] 11.5540414 11.9597981 12.3707666 12.7902522 13.2174095 13.6516992
## [61] 14.0869707 14.4685690 14.8559186 15.3003698 16.1933528 17.1148969
## [67] 18.0647759 19.0562510 20.0918372 21.1572542 22.2524586 23.3779721
## [73] 24.5347475 25.5543437 26.5132733 27.4902884 28.4853858 29.4985642
## [79] 30.4695838 31.3320436 32.2068693 33.0809775 33.9302686 34.7425526
## [85] 35.5297799 36.3295506 37.1412631 37.9651437 38.8012381 39.6494094
## [91] 40.5104697 41.3880774 42.2767026 43.1759806 44.0858492 44.9862207
## [97] 45.8880794 46.7998007 47.7213535 48.6527116
##
## $mode
## [1] "fraction"
cv$index[which.min(cv$cv)]
## [1] 0.1313131
Evaluating the model at the fraction that minimizes the cross-validation error produces the coefficients that I will use in my final model. Of the original 57 predictors, only 17 are included.
lasso.predict <- predict(model, s=.1313131, type='coef', mode='fraction')
lasso.predict$coef
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 0.00000000 0.00000000 0.03935311
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 0.00000000 0.07932232 0.00000000
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## -0.04170001 0.00000000 0.00000000
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.01826043 0.00000000 0.00000000
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.02710271 0.00000000 0.06316752
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.07561489 0.00000000 0.27818286
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.13177469 0.00000000 0.00000000
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 0.00000000 -0.27462573 0.00000000
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 0.00000000 -0.02307966 0.00000000
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 0.00000000 0.31947682 0.00000000
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 0.00000000 0.16196981 0.00000000
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 4.09843065 0.00000000 0.00000000
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## -0.36370676 0.00000000 0.04432715
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 0.00000000 0.00000000 0.00000000
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 0.00000000 0.00000000 0.17194955
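As an aside, the fraction need not be hard-coded; it can be pulled directly from the cross-validation object (a small sketch):
# Sketch: reuse the CV-selected fraction instead of hard-coding it.
best.s <- cv$index[which.min(cv$cv)]
predict(model, s=best.s, type='coef', mode='fraction')$coef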
On the test set, this model has an RMSE of about 26.
ModelMetrics::rmse(testing$Chemical.Y, as.matrix(testing %>% select(-Chemical.Y)) %*% lasso.predict$coef)
## [1] 26.12255
The test-set RMSE is nearly identical to the training-set RMSE, shown below. While the RMSE may not be great on an absolute scale, the close agreement between training and testing speaks well of the model and indicates that it is not overfitting the data.
ModelMetrics::rmse(training$Chemical.Y, as.matrix(training %>% select(-Chemical.Y)) %*% lasso.predict$coef)
## [1] 26.18219
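For reference, lars can also generate fitted values directly via predict(..., type='fit'), which includes the model intercept rather than relying on a bare matrix product; a minimal sketch using the same fraction (output not shown):
# Sketch: fitted values taken straight from the lars object (intercept included).
test.fit <- predict(model, newx=as.matrix(testing %>% select(-Chemical.Y)),
                    s=.1313131, type='fit', mode='fraction')
ModelMetrics::rmse(testing$Chemical.Y, test.fit$fit)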
There are 57 predictors in the data set, of which 17 have been selected by the lasso regression: 3 of the 12 biological materials and 14 of the 45 manufacturing processes. Proportionally, more manufacturing processes were selected than biological materials, but not overwhelmingly so; if 4 biological materials had been selected instead of 3, the proportions would be nearly identical. It is therefore safe to say that neither group completely dominates the regression.
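These counts can be confirmed directly from the coefficient vector (a quick sketch):
# Sketch: tally the non-zero lasso coefficients by predictor group.
selected <- names(lasso.predict$coef)[lasso.predict$coef != 0]
table(ifelse(grepl('^Biological', selected), 'Biological', 'Process'))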
Interestingly, the correlation plot shows that the correlations between the response variable and some of the selected predictors are not especially strong. It is likely that a number of the predictors work in tandem to provide strong predictive value, or that a stronger form of regularization would be needed to remove additional predictors.
For simplicity, we can focus on Manufacturing Processes 9, 29 and 32 as being strongly positively correlated with the response and Manufacturing Processes 13 and 17 as being strongly negatively correlated.
subset <- training %>%
select(c(1, 4, 6, 8, 14, 17, 19, 20, 22, 26, 30, 36, 42, 45, 47, 50, 52, 58))
corrplot::corrplot(cor(subset), method='circle')