Regression Homework

Jack Wright

Exercises 6.1, 6.2

6.1

Spectroscopy is used to analyze chemical makeup of a sample material, Tecator Infratec Food and Feed Analyzer was used on 215 samples of meat across 100 frequencies. THis was used to determine the percent content of water, fat and protien of each sample. Establish a predictive relationship between IR spectrum and fat content.

a. Load the data

library(tidyverse)
library(caret)
data(tecator)

Lets look at the table absorp

absorp[1:10,1:10]
##          [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]
##  [1,] 2.61776 2.61814 2.61859 2.61912 2.61981 2.62071 2.62186 2.62334 2.62511
##  [2,] 2.83454 2.83871 2.84283 2.84705 2.85138 2.85587 2.86060 2.86566 2.87093
##  [3,] 2.58284 2.58458 2.58629 2.58808 2.58996 2.59192 2.59401 2.59627 2.59873
##  [4,] 2.82286 2.82460 2.82630 2.82814 2.83001 2.83192 2.83392 2.83606 2.83842
##  [5,] 2.78813 2.78989 2.79167 2.79350 2.79538 2.79746 2.79984 2.80254 2.80553
##  [6,] 3.00993 3.01540 3.02086 3.02634 3.03190 3.03756 3.04341 3.04955 3.05599
##  [7,] 2.98893 2.99435 2.99980 3.00540 3.01117 3.01716 3.02347 3.03014 3.03708
##  [8,] 2.52582 2.52665 2.52758 2.52844 2.52944 2.53057 2.53187 2.53339 2.53516
##  [9,] 3.27336 3.27996 3.28646 3.29300 3.29956 3.30627 3.31310 3.32006 3.32727
## [10,] 3.39805 3.40539 3.41271 3.42001 3.42735 3.43479 3.44245 3.45035 3.45838
##         [,10]
##  [1,] 2.62722
##  [2,] 2.87661
##  [3,] 2.60131
##  [4,] 2.84097
##  [5,] 2.80890
##  [6,] 3.06274
##  [7,] 3.04445
##  [8,] 2.53710
##  [9,] 3.33472
## [10,] 3.46665

These are the absorbance values for 215 samples

The table endpoints is the moisture, fat and protien content of these samples.

head(endpoints)
##      [,1] [,2] [,3]
## [1,] 60.5 22.5 16.7
## [2,] 46.0 40.1 13.5
## [3,] 71.0  8.4 20.5
## [4,] 72.8  5.9 20.7
## [5,] 58.3 25.5 15.5
## [6,] 44.0 42.7 13.7

b Since the frequencies lie in a systematic order, predictors have a high degree of correlation. This makes intuitive sense because a frequency peak will have tails on either side (sort of like a normal distribution.) Say there is a certain frequency of radiation that is predictive, it will have peaks at frequencies close to it as well. It would be easier to just describe the peak instead of the distribution for modeling.

Lets fit PCA to flatten the dimensionality

first lets make a dataframe with the relevant columns (we are interested in fat content)

fat<-endpoints[, 2]
df_raw<-as.data.frame(cbind(absorp, fat))
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
sample = sample.split(df_raw$fat, SplitRatio = .80)
train = subset(df_raw, sample ==TRUE)
test = subset(df_raw, sample == FALSE)
fat.pca<-prcomp(df_raw%>%select(-fat), center = TRUE, scale. = TRUE)
#summary(df_raw.pca)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.1.3
plot(fat.pca, type = 'l')

We can see from the plot that almost 100% of the variance is explained by just one principal component. This confirms our suspicion of the high colinearity between predictors.

c

split data into train and test and build the models described in the chapter. What are the optimal valeus for the tuning params?

endpoints= as.data.frame(endpoints)
absorp = as.data.frame(absorp)
# partition the data, this will flag our train and test data

partition <- createDataPartition(endpoints[, 2], p = .8, list= FALSE)

feature_train <- absorp[ partition,]
feature_test  <- absorp[-partition,]
target_train <- endpoints[ partition, 2]
target_test <- endpoints[-partition,2]

#set the training method and number of folds
ctrl <- trainControl(method = "repeatedcv", repeats = 5)

We will build the following models: * Simple Linear Model * PCA * PLS

Simple Linear Model

First lets build a simple linear model

set.seed(123)
lm.fat <- train(y = target_train, x=feature_train, method = "lm", trControl = ctrl)
lm.fat
## Linear Regression 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 156, 158, 156, 158, 157, 157, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.954846  0.8922923  2.311221
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

PCR

Now lets train a PCR model

pcr.fat<-train(y = target_train, x=feature_train, method = "pcr", trControl = ctrl)
pcr.fat
## Principal Component Analysis 
## 
## 174 samples
## 100 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 155, 158, 157, 154, 158, 157, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE     
##   1      10.936636  0.2566603  8.732860
##   2      10.814444  0.2801540  8.475095
##   3       8.069628  0.5989369  6.271323
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

PLS

Now lets train a PLS model.

pls.fat<-train(y = target_train, x=feature_train, method = "pls", preProcess= c('center','scale'), trControl = ctrl)
pls.fat
## Partial Least Squares 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 156, 155, 156, 157, 157, 156, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE     
##   1      10.931262  0.2490276  8.733955
##   2       7.827490  0.6271765  6.019896
##   3       5.116558  0.8228591  3.911564
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

## ENS

Now lets create an elastic (enet) model

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
fractionTune = c(seq(.005, .03,.005),seq(.04,.1,0.02),seq(0.2,1,0.2))
lambdaTune = c(0, .001, .01, .1, 1)

enet.fat<- train(y = target_train, x=feature_train, method = "enet",
                 preProcess= c('center','scale'),
                 trControl = ctrl)
enet.fat
## Elasticnet 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 157, 156, 155, 157, 157, 157, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE     
##   0e+00   0.050      2.311275  0.9685303  1.654082
##   0e+00   0.525      2.880209  0.9419125  1.774340
##   0e+00   1.000      3.381009  0.9220231  2.105895
##   1e-04   0.050      8.766670  0.5833081  7.089899
##   1e-04   0.525      2.855200  0.9584628  2.216961
##   1e-04   1.000      2.783060  0.9602594  2.164595
##   1e-01   0.050     10.470428  0.3238220  8.416098
##   1e-01   0.525      6.838385  0.7565883  5.386303
##   1e-01   1.000      5.119037  0.8512841  3.978153
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.
plot(enet.fat, plotType = 'level')

## Lasso

Now lets make a lasso model

lasso.fat<- train(y = target_train, x=feature_train, method = "lasso",
                 preProcess= c('center','scale'),
                 trControl = ctrl, tuneLength = 20)
lasso.fat
## The lasso 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 156, 156, 156, 157, 156, 157, ... 
## Resampling results across tuning parameters:
## 
##   fraction   RMSE      Rsquared   MAE     
##   0.1000000  2.530118  0.9567047  1.636784
##   0.1421053  2.546803  0.9524041  1.603619
##   0.1842105  2.559719  0.9493975  1.581926
##   0.2263158  2.613919  0.9462052  1.598780
##   0.2684211  2.688485  0.9434534  1.630950
##   0.3105263  2.754890  0.9402909  1.656477
##   0.3526316  2.825336  0.9375589  1.686945
##   0.3947368  2.885821  0.9355530  1.716846
##   0.4368421  2.963666  0.9327612  1.755945
##   0.4789474  3.024010  0.9303128  1.785856
##   0.5210526  3.062328  0.9288288  1.810026
##   0.5631579  3.110188  0.9270790  1.840851
##   0.6052632  3.162895  0.9250866  1.872901
##   0.6473684  3.212294  0.9232577  1.903904
##   0.6894737  3.255902  0.9218893  1.935048
##   0.7315789  3.311360  0.9197482  1.971730
##   0.7736842  3.382415  0.9164548  2.009767
##   0.8157895  3.461411  0.9126266  2.046980
##   0.8578947  3.546272  0.9085289  2.085319
##   0.9000000  3.636808  0.9042464  2.130438
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.

Ridge

Now lets make a ridge Regression

ridge.fat<- train(y = target_train, x=feature_train, method = "ridge",
                 preProcess= c('center','scale'),
                 trControl = ctrl,
                 tuneLength = 20)
ridge.fat
## Ridge Regression 
## 
## 174 samples
## 100 predictors
## 
## Pre-processing: centered (100), scaled (100) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 158, 157, 157, 157, 156, 157, ... 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE      Rsquared   MAE     
##   0.0000000000  3.565383  0.9067496  2.190629
##   0.0001000000  2.749704  0.9581747  2.146717
##   0.0001467799  2.770849  0.9575632  2.166931
##   0.0002154435  2.789954  0.9570080  2.185274
##   0.0003162278  2.807272  0.9565106  2.201707
##   0.0004641589  2.824334  0.9560317  2.218057
##   0.0006812921  2.844081  0.9554808  2.237957
##   0.0010000000  2.871023  0.9547045  2.268207
##   0.0014677993  2.911301  0.9534746  2.326249
##   0.0021544347  2.972109  0.9514945  2.406859
##   0.0031622777  3.059966  0.9484456  2.516202
##   0.0046415888  3.178035  0.9440882  2.644497
##   0.0068129207  3.324083  0.9383823  2.785476
##   0.0100000000  3.491251  0.9315358  2.928000
##   0.0146779927  3.672513  0.9238819  3.066239
##   0.0215443469  3.866980  0.9155918  3.196014
##   0.0316227766  4.084518  0.9063442  3.326368
##   0.0464158883  4.345442  0.8950827  3.484902
##   0.0681292069  4.673726  0.8799349  3.684336
##   0.1000000000  5.085839  0.8583363  3.958740
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
mat <- matrix(NA, nrow = 2, ncol = 3)
mat <- data.frame(mat, row.names = c("lambda", "fraction"))
colnames(mat) <- c("Ridge", "Lasso", "Enet")
mat[1,1] <- ridge.fat$bestTune[1]
mat[2,2] <- lasso.fat$bestTune[1]
mat[1,3] <- enet.fat$bestTune[2]
mat[2,3] <- enet.fat$bestTune[1]
mat
##          Ridge Lasso Enet
## lambda   1e-04    NA 0.00
## fraction    NA   0.1 0.05
library(ggpubr)
r<-ggplot(ridge.fat)+ggtitle('ridge')
l<-ggplot(lasso.fat)+ggtitle('lasso')
e<-ggplot(enet.fat)+ggtitle('enet')
ggarrange(r,l,e)

the lasso model is showing the lowest RMSE.

Now lets test the models on or test set.

pred.lm<-predict(lm.fat, feature_test)
pred.pls<-predict(pls.fat, feature_test)
pred.pcr<-predict(pcr.fat, feature_test)
pred.ridge <- predict(ridge.fat, feature_test)
pred.lasso <- predict(lasso.fat, feature_test)
pred.enet <- predict(enet.fat, feature_test)
lm.RMSE <- RMSE(pred.ridge, target_test)
pls.RMSE <- RMSE(pred.lasso, target_test) 
pcr.RMSE <- RMSE(pred.enet, target_test)
ridge.RMSE <- RMSE(pred.ridge, target_test)
lasso.RMSE <- RMSE(pred.lasso, target_test) 
enet.RMSE <- RMSE(pred.enet, target_test) 
c <- cbind(lm.RMSE,pls.RMSE,pcr.RMSE, ridge.RMSE,lasso.RMSE,enet.RMSE); rownames(c) <- "RMSE"
c
##       lm.RMSE pls.RMSE pcr.RMSE ridge.RMSE lasso.RMSE enet.RMSE
## RMSE 3.400684 3.803044 3.367648   3.400684   3.803044  3.367648

6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.2
data(permeability)

b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

zero_fingerprints<-nearZeroVar(fingerprints)
f_clean<-fingerprints[,-zero_fingerprints]

c

Split data into train/test set, process data and tune a PLS model.

How many latent variables are optimal?

What is the corresponding resampled R^2?

set.seed(123)
training_rows<-createDataPartition(permeability,
  p=.75,
  list = FALSE
)
train_feature<-f_clean[training_rows,]
train_target<-permeability[training_rows,]
test_feature<-f_clean[training_rows,]
test_target<-permeability[training_rows,]

Perform PLS

set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
plsTune<-train(y=log(train_target), x = train_feature,
               method = 'pls',
               metric = 'Rsquared',
               tuneGrid = expand.grid(ncomp = 1:15),
               trControl = ctrl)
plot(plsTune)

Since the lowest RMSE is at 9, it indicates the optimal number of components is nine

d

predict the response for the test set. what is the test set estimate of R^2

pred<-predict(plsTune, newdata=test_feature)
postResample(pred = pred, obs = test_target)
##       RMSE   Rsquared        MAE 
## 18.4875399  0.6523581 11.0062569

e

Predict with some other models

set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
enetTune<-train(y=log(train_target), x = train_feature,
               method = 'enet',
               preprocess = c('center','scale'),
               metric = 'Rsquared',
               tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
               trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(y = log(train_target), x = train_feature, method =
## "enet", : missing values found in aggregated results
plot(enetTune)

Now lets try a lasso fit

set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
lassoTune<-train(y=log(train_target), x = train_feature,
               method = 'lasso',
               preProcess = c('center','scale'),
               metric = 'Rsquared',
               tuneGrid=expand.grid(.fraction = seq(0, .5, by=0.05)),
               trControl = ctrl)
## Warning: model fit failed for Resample05: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample10: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample16: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample23: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(y = log(train_target), x = train_feature, method =
## "lasso", : missing values found in aggregated results
plot(lassoTune)

Lets compare the models

resamp<-resamples(list(PLS =plsTune, enet= enetTune, Lasso = lassoTune))
summary(resamp)
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: PLS, enet, Lasso 
## Number of resamples: 25 
## 
## MAE 
##            Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## PLS   0.6937625 0.7600335 0.8122873 0.8351074 0.8922488 1.081546    0
## enet  0.6755287 0.7798982 0.8484279 0.8380389 0.8753350 1.002401    0
## Lasso 0.7411175 0.8509494 0.9268667 0.9454309 0.9835806 1.429989    4
## 
## RMSE 
##            Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS   0.9008501 1.0168419 1.093672 1.089024 1.157039 1.325048    0
## enet  0.8615021 0.9949612 1.070893 1.076277 1.165211 1.313763    0
## Lasso 0.9561000 1.0871150 1.143622 1.180296 1.230222 1.727219    4
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS   0.3572540 0.5281833 0.5527050 0.5509718 0.5810646 0.6802082    0
## enet  0.4104164 0.4976425 0.5520461 0.5462438 0.5836227 0.6935267    0
## Lasso 0.4012249 0.4531927 0.5052610 0.5231025 0.5602233 0.6830847    4

the ENET seems to have the best Rsquared value. Lets try on the test set.

pls_pred<-predict(plsTune, newdata = test_feature)
postResample(pred = pls_pred, obs = test_target)
##       RMSE   Rsquared        MAE 
## 18.4875399  0.6523581 11.0062569

f

I would not reccomend using this method to replace the permiability experiment, because as you can see above the MAE (mean average error, which is in the units of the variable) is over 10, and the histogram below shows that almost all of the data lies under 10 units. In essence our margin of error spans this very clear dividing line. If there was some permiability threshold we wanted to test, we could use a classification model instead.

histogram(permeability)