Chapter 6: Linear Regression and Its Cousins

library(tidyr)
library(dplyr)
library(knitr)
library(utils)
library(ggplot2)
library(corrplot)
library(purrr)
library(mice)
library(psych)
library(gridExtra)
library(caret)
library(MASS)
library(pls)
library(lars)
library(elasticnet)
library(AppliedPredictiveModeling)

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Exercise 6.2

  1. Start R and use these commands to load the data. The matrix \(fingerprints\) contains the 1,107 binary molecular predictors for the 165 compounds, while \(permeability\) contains the permeability response.
data(permeability)

# Convert to data frames.
fp.full = data.frame(fingerprints)
pb.full = data.frame(permeability)

cat("Number of Samples = ", nrow(fp.full), "\nNumber of Predictors =", ncol(fp.full))
## Number of Samples =  165 
## Number of Predictors = 1107

The \(fingerprints\) data frame has 165 rows and 1,107 columns, so there are far more predictors (P = 1107) than samples (n = 165). Ordinary multiple linear regression therefore cannot be applied directly; techniques such as dimension reduction (e.g., PCR or PLS) or penalized regression must be used instead.
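As a quick illustration of why this matters (a sketch added here, not part of the original analysis), fitting an ordinary least-squares model to all 1,107 predictors yields a rank-deficient fit in which most coefficients cannot be estimated:

# Illustration only: with far more predictors than samples, lm() can estimate
# at most n coefficients; the rest are reported as NA (aliased).
ols.full = lm(pb.full[, 1] ~ ., data = fp.full)
sum(is.na(coef(ols.full)))  # count of coefficients lm() could not estimate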

  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
zv = nearZeroVar(fp.full)

# Filter out predictors that have near-zero variance.
fp = fp.full[, -zv]
# The response is a single column and is unaffected by predictor filtering.
pb = pb.full$permeability

cat(length(zv), "predictors have near-zero variance.",
    "After filtering,", ncol(fp), "predictors remain for modeling.")
## 719 predictors have near-zero variance. After filtering, 388 predictors remain for modeling.
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
set.seed(100)
# Note: this samples a permutation of the first 70% of row indices, so the
# first 115 compounds form the training set and the remaining 50 the test set.
index = sample(1:floor(0.7 * nrow(fp)))
trainingData = fp[index,]

trainingY = pb[index]

testData = fp[-index,]
testY = pb[-index]

# To check if preprocessing is needed we summarize the skewness of each predictor.
sk = psych::describe(trainingData) %>% dplyr::select(vars, skew)

ggplot(sk, aes(vars)) +
    geom_line(aes(y=skew)) +
    ggtitle("Skewness of Permeability predictor variables") +
    xlab("Variable Index") + ylab("Skew")

A skewness beyond roughly ±1.0 is generally taken to indicate strong skew, and the plot shows that many predictors are heavily right-skewed (along with a few skewed left). Since these predictors are binary fingerprints, the skew mostly reflects how rarely each substructure occurs; rather than transforming them, we center and scale the data when tuning the PLS model.

ctrl = trainControl(method = "cv", number = 10)

plsTune = train(trainingData, trainingY, method = "pls",
                tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsTune)

plsTune
## Partial Least Squares 
## 
## 115 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 104, 104, 103, 103, 104, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     11.52104  0.3799249   8.743790
##    2     10.78168  0.4502599   7.665984
##    3     11.06183  0.4127380   8.378497
##    4     11.55061  0.3969529   8.741814
##    5     11.45602  0.4370471   8.464968
##    6     11.63693  0.4382304   8.735028
##    7     11.91421  0.4431521   8.987942
##    8     12.00891  0.4305037   9.229677
##    9     12.24457  0.4224444   9.515594
##   10     12.52004  0.4258139   9.598684
##   11     13.06931  0.4091522   9.971984
##   12     13.32178  0.3927331  10.154233
##   13     13.54099  0.3888934  10.286393
##   14     13.85842  0.3639380  10.434880
##   15     13.96797  0.3495088  10.412449
##   16     13.95448  0.3641351  10.339549
##   17     14.10628  0.3616655  10.517164
##   18     14.11821  0.3528987  10.638628
##   19     14.25320  0.3489858  10.683669
##   20     14.50345  0.3378590  10.946146
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.

The optimal model uses only two latent variables; the corresponding resampled estimate of \(R^2\) is 0.450 and of RMSE is 10.78.
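The chosen tuning value and its resampled metrics can also be pulled directly from the train object rather than read off the table (a small convenience sketch):

# Extract the selected number of components and its cross-validated metrics.
plsTune$bestTune
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)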

  4. Predict the response for the test set. What is the test set estimate of \(R^2\)?
plsPred = predict(plsTune, testData)

plsResult = data.frame(obs = testY, pred = plsPred)
defaultSummary(plsResult)
##       RMSE   Rsquared        MAE 
## 13.2729127  0.5485723 10.2893589

The test set estimate of \(R^2\) is 0.549, which is higher than the resampled estimate of 0.450 from cross-validation, so by this measure the model performs somewhat better on the test set (although the test RMSE of 13.27 is worse than the resampled RMSE of 10.78).

  5. Try building other models discussed in this chapter. Do any have better predictive performance?
# Build a predictive model using Principal Component Analysis and the Robust Linear Model.
set.seed(100)

# The RLM algorithm fails to converge even after increasing the number of
# iterations to 40, and it reports that two variables -- X613 and X621 --
# have zero variance. These variables were therefore removed from the training set.

trainingData2 = trainingData %>% dplyr::select(-X613, -X621)
rlmPCA <- train(trainingData2, trainingY,
                method = "rlm", preProcess = "pca", trControl = ctrl, maxit = 40)

rlmPCA
## Robust Linear Model 
## 
## 115 samples
## 386 predictors
## 
## Pre-processing: principal component signal extraction (386),
##  centered (386), scaled (386) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 103, 103, 103, 105, 103, 106, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE      
##   FALSE      psi.huber     16.19835  0.3058460  12.356960
##   FALSE      psi.hampel    16.20411  0.3317913  12.733633
##   FALSE      psi.bisquare  16.63757  0.2962591  12.451415
##    TRUE      psi.huber     12.45713  0.2939160   9.091191
##    TRUE      psi.hampel    12.19295  0.3091146   9.037580
##    TRUE      psi.bisquare  13.31820  0.2491322   9.216890
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi
##  = psi.hampel.

Since the RLM with PCA failed to converge, its summary results were disregarded. Another model from this chapter, an elastic net, is sketched below.
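The sketch below shows how an elastic net could be tuned on the same training data with caret; its results are not reported here, so no performance claim is made.

# Sketch only: tune an elastic net over a grid of ridge (lambda) and
# lasso (fraction) penalties, reusing the cross-validation control from above.
# trainingData2 (with the two zero-variance columns removed) is used so that
# centering and scaling are well defined.
enetGrid = expand.grid(lambda = c(0, 0.01, 0.1),
                       fraction = seq(0.05, 1, length = 20))
set.seed(100)
enetTune = train(trainingData2, trainingY, method = "enet",
                 tuneGrid = enetGrid, trControl = ctrl,
                 preProc = c("center", "scale"))
# Compare enetTune's resampled RMSE and R^2 against the PLS model.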

  6. Would you recommend any of your models to replace the permeability laboratory experiment?
xyplot(testY ~ plsPred, type = c("p", "g"), xlab = "Predicted", ylab = "Observed")

On the basis of the test-set predictions and the modest \(R^2\) value, I would not recommend replacing the permeability laboratory experiment with any of these models.

Exercise 6.3

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)

# The matrix processPredictors contains the 57 predictors (12 describing
# the input biological material and 45 describing the process predictors)
# for the 176 manufacturing runs. yield contains the percent yield for each
# run.

chem.full = ChemicalManufacturingProcess
names(chem.full)
##  [1] "Yield"                  "BiologicalMaterial01"  
##  [3] "BiologicalMaterial02"   "BiologicalMaterial03"  
##  [5] "BiologicalMaterial04"   "BiologicalMaterial05"  
##  [7] "BiologicalMaterial06"   "BiologicalMaterial07"  
##  [9] "BiologicalMaterial08"   "BiologicalMaterial09"  
## [11] "BiologicalMaterial10"   "BiologicalMaterial11"  
## [13] "BiologicalMaterial12"   "ManufacturingProcess01"
## [15] "ManufacturingProcess02" "ManufacturingProcess03"
## [17] "ManufacturingProcess04" "ManufacturingProcess05"
## [19] "ManufacturingProcess06" "ManufacturingProcess07"
## [21] "ManufacturingProcess08" "ManufacturingProcess09"
## [23] "ManufacturingProcess10" "ManufacturingProcess11"
## [25] "ManufacturingProcess12" "ManufacturingProcess13"
## [27] "ManufacturingProcess14" "ManufacturingProcess15"
## [29] "ManufacturingProcess16" "ManufacturingProcess17"
## [31] "ManufacturingProcess18" "ManufacturingProcess19"
## [33] "ManufacturingProcess20" "ManufacturingProcess21"
## [35] "ManufacturingProcess22" "ManufacturingProcess23"
## [37] "ManufacturingProcess24" "ManufacturingProcess25"
## [39] "ManufacturingProcess26" "ManufacturingProcess27"
## [41] "ManufacturingProcess28" "ManufacturingProcess29"
## [43] "ManufacturingProcess30" "ManufacturingProcess31"
## [45] "ManufacturingProcess32" "ManufacturingProcess33"
## [47] "ManufacturingProcess34" "ManufacturingProcess35"
## [49] "ManufacturingProcess36" "ManufacturingProcess37"
## [51] "ManufacturingProcess38" "ManufacturingProcess39"
## [53] "ManufacturingProcess40" "ManufacturingProcess41"
## [55] "ManufacturingProcess42" "ManufacturingProcess43"
## [57] "ManufacturingProcess44" "ManufacturingProcess45"
cat("The data set has", nrow(chem.full), "rows and", sum(is.na(chem.full)), "missing values.")
## The data set has 176 rows and 106 missing values.
  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# The MICE library can be used for imputation of missing data.

imputed.1 <- mice(chem.full, m = 5, maxit = 5, seed = 500, printFlag = FALSE)
# Keep the second of the five completed (imputed) data sets.
chem = complete(imputed.1, 2)
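As a quick sanity check (not part of the original write-up), we can confirm that the completed data set contains no remaining missing values:

# The completed data set returned by complete() should have no NAs left.
stopifnot(sum(is.na(chem)) == 0)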
  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(100)
# As before, this takes the first 70% of rows (123 of 176) as the training set.
index = sample(1:floor(0.7 * nrow(chem)))

trainingData = chem[index,] %>% dplyr::select(-Yield)
trainingY = chem[index,]$Yield

testData = chem[-index,] %>% dplyr::select(-Yield)
testY = chem[-index,]$Yield

# For this problem we choose an ordinary least squares (OLS) model with PCA pre-processing.
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
lmFit1 <- train(x = trainingData, y = trainingY,
                method = "lm", preProcess = "pca", trControl = ctrl)
summary(lmFit1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41548 -0.50439 -0.08832  0.57300  2.43094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.52211    0.08584 472.082  < 2e-16 ***
## PC1         -0.30817    0.02801 -11.003  < 2e-16 ***
## PC2          0.27135    0.03623   7.490 3.52e-11 ***
## PC3          0.22704    0.03855   5.889 5.85e-08 ***
## PC4         -0.13608    0.04737  -2.873 0.005017 ** 
## PC5          0.28509    0.04912   5.804 8.50e-08 ***
## PC6         -0.08342    0.05143  -1.622 0.108126    
## PC7          0.11547    0.05273   2.190 0.030991 *  
## PC8         -0.01759    0.05635  -0.312 0.755584    
## PC9         -0.31350    0.05856  -5.353 5.99e-07 ***
## PC10        -0.15654    0.06224  -2.515 0.013588 *  
## PC11         0.13317    0.06459   2.062 0.041952 *  
## PC12        -0.15437    0.06912  -2.233 0.027871 *  
## PC13        -0.14907    0.07292  -2.044 0.043674 *  
## PC14        -0.21261    0.07305  -2.911 0.004494 ** 
## PC15        -0.22180    0.07699  -2.881 0.004900 ** 
## PC16        -0.02903    0.08325  -0.349 0.728071    
## PC17         0.14229    0.08937   1.592 0.114652    
## PC18        -0.09485    0.08957  -1.059 0.292305    
## PC19        -0.33653    0.09349  -3.600 0.000508 ***
## PC20        -0.19197    0.09762  -1.967 0.052148 .  
## PC21        -0.21414    0.09998  -2.142 0.034772 *  
## PC22        -0.26422    0.10736  -2.461 0.015663 *  
## PC23        -0.09398    0.11137  -0.844 0.400868    
## PC24        -0.01211    0.11540  -0.105 0.916672    
## PC25        -0.13917    0.12193  -1.141 0.256560    
## PC26         0.13988    0.12432   1.125 0.263356    
## PC27         0.27330    0.13063   2.092 0.039091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.952 on 95 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.735 
## F-statistic: 13.53 on 27 and 95 DF,  p-value: < 2.2e-16
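Note that summary(lmFit1) reports the in-sample linear-model fit. The resampled (cross-validated) performance, which is what the next part compares against, can be read from the train object itself (output not reproduced here):

# Resampled (10-fold CV) RMSE, R^2 and MAE for the PCA + OLS model.
lmFit1$results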
  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
lmPred = predict(lmFit1, testData)

lmResult = data.frame(obs = testY, pred = lmPred)
defaultSummary(lmResult)
##       RMSE   Rsquared        MAE 
## 5.43779339 0.04074351 3.25249500
p1 = xyplot(testY ~ lmPred, type = c("p", "g"), xlab = "Predicted", ylab = "Observed")
p2 = xyplot(resid(lmFit1) ~ lmPred, type = c("p", "g"), xlab = "Predicted", ylab = "Residuals")
grid.arrange(p1, p2, ncol=2, nrow=1)

On the test set the RMSE is 5.44 and \(R^2\) is only 0.04, far worse than the in-sample fit shown above (multiple \(R^2\) of 0.79), so this PCA-based linear model generalizes poorly to the held-out runs.

  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# To identify the top PCA predictors, we can examine the results of the PCA
# in the LM model
lmFit1$preProcess
## Created from 123 samples and 57 variables
## 
## Pre-processing:
##   - centered (57)
##   - ignored (0)
##   - principal component signal extraction (57)
##   - scaled (57)
## 
## PCA needed 27 components to capture 95 percent of the variance
rot = lmFit1$preProcess$rotation

# Show the loadings of the first ten predictors on the first four principal components:
rot[1:10, 1:4]
##                              PC1          PC2          PC3          PC4
## BiologicalMaterial01 -0.24414112  0.002841420 -0.093145051  0.041660633
## BiologicalMaterial02 -0.28533000  0.080126938 -0.072748512 -0.036361864
## BiologicalMaterial03 -0.19063359  0.130255258 -0.007997775 -0.180652612
## BiologicalMaterial04 -0.24749288 -0.004208954 -0.097306538 -0.008616674
## BiologicalMaterial05 -0.12355626 -0.067379849 -0.107109954  0.047886870
## BiologicalMaterial06 -0.26592377  0.085802466 -0.061300955 -0.043836489
## BiologicalMaterial07  0.02821910  0.035888228  0.088372949  0.072540870
## BiologicalMaterial08 -0.25615824  0.076885976 -0.028259858  0.041870187
## BiologicalMaterial09 -0.04167002  0.127086920  0.035417195 -0.184801313
## BiologicalMaterial10 -0.19329028 -0.044790465 -0.107697018  0.017421899
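To judge whether biological or process predictors dominate, one could also rank the predictors by the size of their loadings on the leading component and tally the two groups among the top entries (a sketch added here for illustration):

# Rank predictors by the absolute size of their PC1 loadings and count how
# many of the ten largest are biological vs. manufacturing-process variables.
pc1 = sort(abs(rot[, "PC1"]), decreasing = TRUE)
head(pc1, 10)
table(ifelse(grepl("^Biological", names(head(pc1, 10))), "Biological", "Process"))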
  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
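The write-up ends at this question. As a hedged sketch of how the relationships could be explored, caret's featurePlot can plot the yield against a handful of high-loading predictors; the predictor names below are illustrative choices, not results derived from the analysis above.

# Illustrative only: scatter plots of yield against a few candidate predictors,
# with a smoother, to see the direction and shape of each relationship.
topVars = c("BiologicalMaterial02", "BiologicalMaterial06", "ManufacturingProcess32")
featurePlot(x = chem[, topVars], y = chem$Yield, plot = "scatter",
            type = c("p", "smooth"))

Because the manufacturing-process predictors can be changed while the biological measurements cannot, any process variables showing a strong relationship with yield would be the natural levers to adjust in future runs, while the biological measurements could be used to screen the quality of incoming raw material.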