Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_7
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
## [1] "165 1107"
## [1] "165 1"
Note: no missing values found
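For reference, a quick check along these lines confirms it (a sketch; the original chunk isn't shown):
# count missing values in the predictors and the response
sum(is.na(fingerprints))
sum(is.na(permeability))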
Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
## [1] "165 388"
We have 388 predictors left for modeling.
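The filtering step itself isn't shown above; the fp and pm objects used below were presumably built along these lines (a minimal sketch):
library(caret)
# drop near-zero-variance fingerprint columns; keep the response as a plain vector
nzv <- nearZeroVar(fingerprints)
fp <- fingerprints[, -nzv]   # 165 x 388 after filtering
pm <- permeability[, 1]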
set.seed(424242)
# get training/test split
trainingRows <- createDataPartition(pm, p=0.8, list=FALSE)
# Build training datasets
trainX <- fp[trainingRows,]
trainY <- pm[trainingRows]
# put remaining rows into the test sets
testX <- fp[-trainingRows,]
testY <- pm[-trainingRows]
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Perm <- trainY
ctrl <- trainControl(method='cv', number=10)
# run CV to identify the number of latent components
plsTune <- train(trainX, trainY,
method="pls",
tuneLength=20,
trControl = ctrl)
plot(plsTune)
# Get the number of components which minimizes RMSE
(optimal_num_comp <- which.min(plsTune$results[,2]))
## [1] 7
## [1] 0.5797039
I found the optimal number of PLS latent components to be 7, with a corresponding resampled \(R^2\) of 0.5797039.
plsFit <- plsr(Perm ~., data=trainingData, ncomp=optimal_num_comp)
plsPred <- predict(plsFit, testX, ncomp=optimal_num_comp)
head(plsPred)
## , , 7 comps
##
## Perm
## 2 2.201710
## 3 18.503569
## 4 5.680499
## 8 5.720243
## 19 7.133854
## 22 28.728169
plsValues <- data.frame(obs = testY, pred=plsPred)
colnames(plsValues) = c('obs', 'pred')
defaultSummary(plsValues)
## RMSE Rsquared MAE
## 13.3211161 0.4067846 9.1956516
\(R^2\) for the test set is 0.4068, with an RMSE of 13.32.
# Ridge Model
set.seed(100)
ridgeGrid <- data.frame(.lambda=seq(0, 0.2, length=21))
ctrl <- trainControl(method='cv', number=10)
ridgeRegFit <- train(trainX, trainY,
method="ridge",
tuneGrid = ridgeGrid,
trControl = ctrl)
## Warning: model fit failed for Fold02: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.00 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
These warnings are expected: at lambda = 0 the ridge fit reduces to unpenalized least squares, which is ill-posed when a fold has more predictors (388) than training rows (~120).
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 14.04059 0.3155877 9.731223
## 0.01 15.05256 0.3259500 10.671451
## 0.02 13.20238 0.3886225 9.379421
## 0.03 12.51383 0.4167591 8.895310
## 0.04 12.18571 0.4374073 8.720544
## 0.05 11.83622 0.4527312 8.434725
## 0.06 11.66755 0.4629389 8.320563
## 0.07 11.56659 0.4721971 8.255973
## 0.08 11.46803 0.4796161 8.179014
## 0.09 11.35867 0.4860730 8.089975
## 0.10 11.30413 0.4910185 8.041902
## 0.11 11.26373 0.4952875 8.001179
## 0.12 11.23325 0.4990213 7.978158
## 0.13 11.21365 0.5023382 7.964948
## 0.14 11.22380 0.5053486 7.984859
## 0.15 11.21283 0.5082730 7.986088
## 0.16 11.16735 0.5106052 7.958787
## 0.17 11.17855 0.5123371 7.973737
## 0.18 11.16861 0.5145551 7.979657
## 0.19 11.17309 0.5162831 7.993145
## 0.20 11.19009 0.5177423 8.022084
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.16.
# Do prediction and get R^2
ridgePred <- predict(ridgeRegFit, newdata=testX)
ridgeValues <- data.frame(obs = testY, pred=ridgePred)
colnames(ridgeValues) = c('obs', 'pred')
defaultSummary(ridgeValues)
## RMSE Rsquared MAE
## 13.7684863 0.3937314 10.1500314
set.seed(100)
# Elastic net (lambda = 0 corresponds to the lasso)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1),
.fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enetTune <- train(trainX, trainY,
method="enet",
tuneGrid = enetGrid,
trControl = ctrl)
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold07: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
(These are the same expected lambda = 0 failures seen with the ridge fit above.)
## Length Class Mode
## call 4 -none- call
## actions 192 -none- list
## allset 388 -none- numeric
## beta.pure 74496 -none- numeric
## vn 388 -none- character
## mu 1 -none- numeric
## normx 388 -none- numeric
## meanx 388 -none- numeric
## lambda 1 -none- numeric
## L1norm 192 -none- numeric
## penalty 192 -none- numeric
## df 192 -none- numeric
## Cp 192 -none- numeric
## sigma2 1 -none- numeric
## xNames 388 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
lassoPred <- predict(enetTune, newdata=testX)
lassoValues <- data.frame(obs = testY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
defaultSummary(lassoValues)
## RMSE Rsquared MAE
## 14.0007919 0.3538517 9.7501498
I would NOT recommend replacing the laboratory experiments outright, but a model could be used to help prioritize the order of research. Then, based on business needs, if it made sense to stop early, we would be more likely to have already screened the most promising molecules. Among the models fit here, the best (PLS and ridge) explain only ~40% of the variability in permeability on the test set, which may or may not be sufficient to justify altering processes.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
data("ChemicalManufacturingProcess")
cmp <- as_tibble(ChemicalManufacturingProcess)
x_raw <- cmp[,2:58]
y_raw <- as.matrix(cmp$Yield)
print(paste(nrow(x_raw), ncol(x_raw)))
## [1] "176 57"
## [1] "176 1"
# Various NA plots to inspect data
knitr::kable(miss_var_summary(cmp) %>% filter(n_miss > 0),
caption = 'Missing Values',
format="html",
table.attr="style='width:50%;'") %>%
kableExtra::kable_styling()
| variable | n_miss | pct_miss |
|---|---|---|
| ManufacturingProcess03 | 15 | 8.5227273 |
| ManufacturingProcess11 | 10 | 5.6818182 |
| ManufacturingProcess10 | 9 | 5.1136364 |
| ManufacturingProcess25 | 5 | 2.8409091 |
| ManufacturingProcess26 | 5 | 2.8409091 |
| ManufacturingProcess27 | 5 | 2.8409091 |
| ManufacturingProcess28 | 5 | 2.8409091 |
| ManufacturingProcess29 | 5 | 2.8409091 |
| ManufacturingProcess30 | 5 | 2.8409091 |
| ManufacturingProcess31 | 5 | 2.8409091 |
| ManufacturingProcess33 | 5 | 2.8409091 |
| ManufacturingProcess34 | 5 | 2.8409091 |
| ManufacturingProcess35 | 5 | 2.8409091 |
| ManufacturingProcess36 | 5 | 2.8409091 |
| ManufacturingProcess02 | 3 | 1.7045455 |
| ManufacturingProcess06 | 2 | 1.1363636 |
| ManufacturingProcess01 | 1 | 0.5681818 |
| ManufacturingProcess04 | 1 | 0.5681818 |
| ManufacturingProcess05 | 1 | 0.5681818 |
| ManufacturingProcess07 | 1 | 0.5681818 |
| ManufacturingProcess08 | 1 | 0.5681818 |
| ManufacturingProcess12 | 1 | 0.5681818 |
| ManufacturingProcess14 | 1 | 0.5681818 |
| ManufacturingProcess22 | 1 | 0.5681818 |
| ManufacturingProcess23 | 1 | 0.5681818 |
| ManufacturingProcess24 | 1 | 0.5681818 |
| ManufacturingProcess40 | 1 | 0.5681818 |
| ManufacturingProcess41 | 1 | 0.5681818 |
Because we see some patterns in how features are missing, I'll use a KNN impute approach. Note that caret's knnImpute also normalizes the data by default, which is annoying, so I'm going to use knn.impute from bnstruct instead.
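The imputation chunk isn't shown; a minimal sketch of the bnstruct call (x_imputed is the imputed matrix used below):
library(bnstruct)
# k-NN imputation on the raw predictor matrix; unlike caret's knnImpute,
# this leaves the data on its original scale
x_imputed <- knn.impute(as.matrix(x_raw), k = 10)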
Let’s check for and drop any features with near zero variance
# Check for columns with little variance - candidate to drop
lowVariance <- nearZeroVar(x_imputed, names = TRUE)
head(lowVariance)
## [1] "BiologicalMaterial07"
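Only BiologicalMaterial07 is flagged, so we drop it before moving on (a one-line sketch; x_filtered is an assumed intermediate name):
# drop the near-zero-variance column(s)
x_filtered <- x_imputed[, setdiff(colnames(x_imputed), lowVariance)]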
Let's deal with outliers by replacing them with the MEDIAN of that feature:
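The replacement rule isn't shown in the original; here is a sketch assuming a simple z-score cutoff, where values more than 3 standard deviations from the column mean count as outliers:
# replace per-column outliers (|z| > 3) with that column's median
x_outliers <- apply(x_filtered, 2, function(col) {
  col[abs((col - mean(col)) / sd(col)) > 3] <- median(col)
  col
})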
Now, let's drop any columns with high pairwise correlations:
# Find and drop high correlation features
correlations <- cor(x_outliers)
highCorr <- findCorrelation(correlations, names=TRUE, cutoff=0.9)
(highCorr)
## [1] "BiologicalMaterial02" "ManufacturingProcess26" "BiologicalMaterial04"
## [4] "BiologicalMaterial12" "ManufacturingProcess11" "ManufacturingProcess27"
## [7] "ManufacturingProcess44" "ManufacturingProcess40"
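Dropping those eight columns leaves 48 predictors. The drop step isn't shown; here is a sketch of what produced the matrix printed below (x_final is an assumed name):
# remove the highly correlated columns and inspect the result
x_final <- x_outliers[, setdiff(colnames(x_outliers), highCorr)]
head(x_final)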
## BiologicalMaterial01 BiologicalMaterial03 BiologicalMaterial05
## [1,] 6.25 56.97 19.51
## [2,] 8.01 67.48 19.36
## [3,] 8.01 67.48 19.36
## [4,] 8.01 67.48 19.36
## [5,] 7.47 72.25 17.91
## [6,] 6.12 65.31 21.79
## BiologicalMaterial06 BiologicalMaterial08 BiologicalMaterial09
## [1,] 43.73 16.66 11.44
## [2,] 53.14 19.04 12.55
## [3,] 53.14 19.04 12.55
## [4,] 53.14 19.04 12.55
## [5,] 54.66 18.22 12.80
## [6,] 51.23 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 ManufacturingProcess01
## [1,] 3.46 138.09 11.1
## [2,] 3.46 153.67 0.0
## [3,] 3.46 153.67 0.0
## [4,] 3.46 153.67 0.0
## [5,] 3.05 147.61 10.7
## [6,] 3.78 151.88 12.0
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## [1,] 0 1.55 934
## [2,] 0 1.55 917
## [3,] 0 1.55 912
## [4,] 0 1.55 911
## [5,] 0 1.55 918
## [6,] 0 1.55 924
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## [1,] 1092.2 207.5 177
## [2,] 1032.2 210.0 177
## [3,] 1003.6 207.1 178
## [4,] 1014.6 213.3 177
## [5,] 1027.5 205.7 178
## [6,] 1016.8 208.9 178
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## [1,] 177 43.00 8.9
## [2,] 178 46.57 8.9
## [3,] 178 45.07 8.9
## [4,] 177 44.92 9.0
## [5,] 178 44.96 10.7
## [6,] 178 45.32 8.9
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## [1,] 0 35.5 4898
## [2,] 0 34.0 4869
## [3,] 0 34.8 4878
## [4,] 0 34.8 4897
## [5,] 0 34.6 4992
## [6,] 0 34.0 4985
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## [1,] 6108 4682 35.5
## [2,] 6095 4617 34.0
## [3,] 6087 4617 34.8
## [4,] 6102 4635 34.8
## [5,] 6233 4733 33.9
## [6,] 6222 4786 33.4
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## [1,] 4865 6049 4665
## [2,] 4867 6097 4621
## [3,] 4877 6078 4621
## [4,] 4872 6073 4611
## [5,] 4886 6102 4659
## [6,] 4862 6115 4696
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## [1,] 0.0 7 2
## [2,] 0.0 3 0
## [3,] 0.0 4 1
## [4,] 0.0 5 2
## [5,] -0.7 8 4
## [6,] -0.6 9 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess28
## [1,] 3 4873 10.7
## [2,] 3 4869 11.2
## [3,] 4 4897 11.1
## [4,] 5 4892 11.1
## [5,] 18 4930 11.3
## [6,] 1 4871 11.4
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## [1,] 21.0 9.9 69.1
## [2,] 21.4 9.9 68.7
## [3,] 21.3 9.4 69.3
## [4,] 21.3 9.4 69.3
## [5,] 21.6 9.0 69.4
## [6,] 21.7 10.1 68.2
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## [1,] 156 66 2.4
## [2,] 169 66 2.6
## [3,] 173 66 2.6
## [4,] 171 68 2.5
## [5,] 171 70 2.5
## [6,] 173 70 2.5
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## [1,] 486 0.019 0.5
## [2,] 508 0.019 2.0
## [3,] 509 0.018 0.7
## [4,] 496 0.018 1.2
## [5,] 468 0.017 0.2
## [6,] 490 0.018 0.4
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess41
## [1,] 3 7.2 0.00
## [2,] 2 7.2 0.15
## [3,] 2 7.2 0.00
## [4,] 2 7.2 0.00
## [5,] 2 7.3 0.00
## [6,] 2 7.2 0.00
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess45
## [1,] 11.6 3.0 2.4
## [2,] 11.1 0.9 2.2
## [3,] 12.0 1.0 2.3
## [4,] 10.6 1.1 2.1
## [5,] 11.0 1.1 2.1
## [6,] 11.5 2.2 2.0
Now, let's transform our features (center, scale, and Box-Cox) to put them on comparable scales and reduce skew:
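A sketch of the transformation step, assuming caret's preProcess (consistent with the summary printed below; x_transf is the transformed set used for the split):
# estimate Box-Cox / center / scale transforms, print the summary, and apply them
(trans <- preProcess(as.data.frame(x_final), method = c('BoxCox', 'center', 'scale')))
x_transf <- predict(trans, as.data.frame(x_final))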
## Created from 176 samples and 48 variables
##
## Pre-processing:
## - Box-Cox transformation (33)
## - centered (48)
## - ignored (0)
## - scaled (48)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -2.00000 0.00000 -0.08182 2.00000 2.00000
Time to split into training and test sets:
# get training/test split
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training datasets
trainX <- x_transf[trainingRows,]
trainY <- y_raw[trainingRows]
# put remaining rows into the test sets
testX <- x_transf[-trainingRows,]
testY <- y_raw[-trainingRows]
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Yield <- trainY
Build a model:
set.seed(100)
# Elastic net (lambda = 0 corresponds to the lasso)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1),
.fraction=seq(0.05, 1, length = 20))
ctrl <- trainControl(method='cv', number=10)
enetTune <- train(trainX, trainY,
method="enet",
tuneGrid = enetGrid,
trControl = ctrl)
plot(enetTune)
## Length Class Mode
## call 4 -none- call
## actions 77 -none- list
## allset 48 -none- numeric
## beta.pure 3696 -none- numeric
## vn 48 -none- character
## mu 1 -none- numeric
## normx 48 -none- numeric
## meanx 48 -none- numeric
## lambda 1 -none- numeric
## L1norm 77 -none- numeric
## penalty 77 -none- numeric
## df 77 -none- numeric
## Cp 77 -none- numeric
## sigma2 1 -none- numeric
## xNames 48 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
lassoPred <- predict(enetTune, newdata=trainX)
lassoValues <- data.frame(obs = trainY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
(train_values <- defaultSummary(lassoValues))
## RMSE Rsquared MAE
## 1.0072681 0.6971485 0.8157123
lassoPred <- predict(enetTune, newdata=testX)
lassoValues <- data.frame(obs = testY, pred=lassoPred)
colnames(lassoValues) = c('obs', 'pred')
(test_values <- defaultSummary(lassoValues))
## RMSE Rsquared MAE
## 1.1413167 0.6794059 0.9102490
With the training data, we had an RMSE of 1.007 and an \(R^2\) of 0.697. On the test data, the RMSE was 1.141 with an \(R^2\) of 0.679. These seem reasonable and quite good; as expected, the test RMSE is somewhat higher than the training RMSE.
We have more Manufacturing features in the top 20 (15) than Biological (5). That said, we started with far more Manufacturing features (45 vs. 12), so this could simply reflect the imbalance in predictor counts… or maybe the Manufacturing Processes really do have more impact than the base Biological material.
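The enetImp object used below isn't defined in the excerpt; presumably it came from caret's varImp, along these lines:
# model-based variable importance for the tuned elastic net
enetImp <- varImp(enetTune)
plot(enetImp, top = 20)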
# Get the top 20 feature names (dplyr's arrange() drops row names, so order by index instead)
imp <- enetImp[["importance"]]
features <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
cmp_imp <- as.data.frame(x_outliers)[features]
cmp_imp$Yield <- cmp$Yield
featurePlot(x = cmp_imp[features],
y = cmp_imp$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
# Show feature correlations/target by decreasing correlation
#stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
Since we cannot affect the Biological features, we can ignore those for now. For the Manufacturing features, however, we can look for those that show a relationship with yield and try steering the process toward the more favorable values. At this point we haven't established causation, merely correlation, so changing those values might not lead to improvements; still, they are good candidates for exploration. I would set up experiments that independently change each candidate Manufacturing feature, starting from the top of the list (by feature importance), to see whether it is a lever for improving yield. Note: just because a variable appears at the top of the feature importance doesn't mean it is the lever with the most impact if changed, but it's a good place to start and better than randomly adjusting manufacturing levers.