Exercises 6.1, 6.2
Spectroscopy is used to analyze the chemical makeup of a sample material. A Tecator Infratec Food and Feed Analyzer was used on 215 samples of meat, recording absorbance at 100 frequencies, and these spectra were used to determine the percent content of water, fat, and protein in each sample. The goal is to establish a predictive relationship between the IR spectrum and fat content.
a. Load the data
library(tidyverse)
library(caret)
data(tecator)
Let's look at the absorp matrix.
absorp[1:10,1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 2.61776 2.61814 2.61859 2.61912 2.61981 2.62071 2.62186 2.62334 2.62511
## [2,] 2.83454 2.83871 2.84283 2.84705 2.85138 2.85587 2.86060 2.86566 2.87093
## [3,] 2.58284 2.58458 2.58629 2.58808 2.58996 2.59192 2.59401 2.59627 2.59873
## [4,] 2.82286 2.82460 2.82630 2.82814 2.83001 2.83192 2.83392 2.83606 2.83842
## [5,] 2.78813 2.78989 2.79167 2.79350 2.79538 2.79746 2.79984 2.80254 2.80553
## [6,] 3.00993 3.01540 3.02086 3.02634 3.03190 3.03756 3.04341 3.04955 3.05599
## [7,] 2.98893 2.99435 2.99980 3.00540 3.01117 3.01716 3.02347 3.03014 3.03708
## [8,] 2.52582 2.52665 2.52758 2.52844 2.52944 2.53057 2.53187 2.53339 2.53516
## [9,] 3.27336 3.27996 3.28646 3.29300 3.29956 3.30627 3.31310 3.32006 3.32727
## [10,] 3.39805 3.40539 3.41271 3.42001 3.42735 3.43479 3.44245 3.45035 3.45838
## [,10]
## [1,] 2.62722
## [2,] 2.87661
## [3,] 2.60131
## [4,] 2.84097
## [5,] 2.80890
## [6,] 3.06274
## [7,] 3.04445
## [8,] 2.53710
## [9,] 3.33472
## [10,] 3.46665
These are the absorbance values for the 215 samples.
The endpoints matrix contains the moisture, fat, and protein content of the same samples.
head(endpoints)
## [,1] [,2] [,3]
## [1,] 60.5 22.5 16.7
## [2,] 46.0 40.1 13.5
## [3,] 71.0 8.4 20.5
## [4,] 72.8 5.9 20.7
## [5,] 58.3 25.5 15.5
## [6,] 44.0 42.7 13.7
b
Since the frequencies lie in a systematic order, the predictors have a high degree of correlation. This makes intuitive sense: an absorbance peak has tails on either side (somewhat like a normal distribution), so if a certain frequency is predictive, the frequencies close to it will carry nearly the same information. For modeling, it would be easier to describe the peak itself than the whole distribution around it.
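As a quick check of this claim, we can look at the pairwise correlations between the absorbance columns directly (a small sketch using the absorp matrix loaded above):
# sketch: quantify the collinearity between neighbouring frequencies
pred_cor <- cor(absorp)
summary(pred_cor[upper.tri(pred_cor)])  # distribution of pairwise correlations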
Let's fit PCA to reduce the dimensionality.
First let's make a data frame with the relevant columns (we are interested in fat content).
fat<-endpoints[, 2]
df_raw<-as.data.frame(cbind(absorp, fat))
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
sample = sample.split(df_raw$fat, SplitRatio = .80)
train = subset(df_raw, sample ==TRUE)
test = subset(df_raw, sample == FALSE)
fat.pca<-prcomp(df_raw%>%select(-fat), center = TRUE, scale. = TRUE)
#summary(fat.pca)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.1.3
plot(fat.pca, type = 'l')
We can see from the scree plot that nearly all of the variance is explained by the first principal component. This confirms our suspicion of high collinearity between the predictors.
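To put a number on that claim, the cumulative variance can be read straight from the prcomp summary (a quick check, not shown in the original run):
# proportion of variance explained by the leading components
summary(fat.pca)$importance[, 1:5]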
c
Split the data into training and test sets, pre-process the data, and build the models described in the chapter. What are the optimal values of the tuning parameters?
endpoints= as.data.frame(endpoints)
absorp = as.data.frame(absorp)
# partition the data: createDataPartition returns the row indices of the training set
partition <- createDataPartition(endpoints[, 2], p = .8, list= FALSE)
feature_train <- absorp[ partition,]
feature_test <- absorp[-partition,]
target_train <- endpoints[ partition, 2]
target_test <- endpoints[-partition,2]
# set the resampling method: 10-fold cross-validation repeated 5 times
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
We will build the following models:
* Simple linear regression
* Principal component regression (PCR)
* Partial least squares (PLS)
* Elastic net (enet)
* Lasso
* Ridge regression
First, let's build a simple linear model.
set.seed(123)
lm.fat <- train(y = target_train, x=feature_train, method = "lm", trControl = ctrl)
lm.fat
## Linear Regression
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 156, 158, 156, 158, 157, 157, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.954846 0.8922923 2.311221
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Now let's train a principal component regression (PCR) model.
pcr.fat<-train(y = target_train, x=feature_train, method = "pcr", trControl = ctrl)
pcr.fat
## Principal Component Analysis
##
## 174 samples
## 100 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 155, 158, 157, 154, 158, 157, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 10.936636 0.2566603 8.732860
## 2 10.814444 0.2801540 8.475095
## 3 8.069628 0.5989369 6.271323
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
Now let's train a PLS model.
pls.fat<-train(y = target_train, x=feature_train, method = "pls", preProcess= c('center','scale'), trControl = ctrl)
pls.fat
## Partial Least Squares
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 156, 155, 156, 157, 157, 156, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 10.931262 0.2490276 8.733955
## 2 7.827490 0.6271765 6.019896
## 3 5.116558 0.8228591 3.911564
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
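Note that caret's default tuneLength of 3 means only one to three components were evaluated for the PCR and PLS fits above. A wider search may find a better ncomp; a sketch reusing the same training setup:
# sketch: widen the component search beyond caret's default of three
pls.fat.wide <- train(y = target_train, x = feature_train, method = "pls",
                      preProcess = c('center', 'scale'),
                      tuneLength = 20, trControl = ctrl)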
## Enet
Now let's create an elastic net (enet) model.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
# candidate tuning grids for fraction and lambda (note: not passed to train below,
# so caret's default enet grid is used)
fractionTune <- c(seq(.005, .03, .005), seq(.04, .1, 0.02), seq(0.2, 1, 0.2))
lambdaTune <- c(0, .001, .01, .1, 1)
enet.fat <- train(y = target_train, x = feature_train, method = "enet",
                  preProcess = c('center','scale'),
                  trControl = ctrl)
enet.fat
## Elasticnet
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 157, 156, 155, 157, 157, 157, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 2.311275 0.9685303 1.654082
## 0e+00 0.525 2.880209 0.9419125 1.774340
## 0e+00 1.000 3.381009 0.9220231 2.105895
## 1e-04 0.050 8.766670 0.5833081 7.089899
## 1e-04 0.525 2.855200 0.9584628 2.216961
## 1e-04 1.000 2.783060 0.9602594 2.164595
## 1e-01 0.050 10.470428 0.3238220 8.416098
## 1e-01 0.525 6.838385 0.7565883 5.386303
## 1e-01 1.000 5.119037 0.8512841 3.978153
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.
plot(enet.fat, plotType = 'level')
## Lasso
Now let's fit a lasso model.
lasso.fat<- train(y = target_train, x=feature_train, method = "lasso",
preProcess= c('center','scale'),
trControl = ctrl, tuneLength = 20)
lasso.fat
## The lasso
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 156, 156, 156, 157, 156, 157, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1000000 2.530118 0.9567047 1.636784
## 0.1421053 2.546803 0.9524041 1.603619
## 0.1842105 2.559719 0.9493975 1.581926
## 0.2263158 2.613919 0.9462052 1.598780
## 0.2684211 2.688485 0.9434534 1.630950
## 0.3105263 2.754890 0.9402909 1.656477
## 0.3526316 2.825336 0.9375589 1.686945
## 0.3947368 2.885821 0.9355530 1.716846
## 0.4368421 2.963666 0.9327612 1.755945
## 0.4789474 3.024010 0.9303128 1.785856
## 0.5210526 3.062328 0.9288288 1.810026
## 0.5631579 3.110188 0.9270790 1.840851
## 0.6052632 3.162895 0.9250866 1.872901
## 0.6473684 3.212294 0.9232577 1.903904
## 0.6894737 3.255902 0.9218893 1.935048
## 0.7315789 3.311360 0.9197482 1.971730
## 0.7736842 3.382415 0.9164548 2.009767
## 0.8157895 3.461411 0.9126266 2.046980
## 0.8578947 3.546272 0.9085289 2.085319
## 0.9000000 3.636808 0.9042464 2.130438
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
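The selected fraction of 0.1 sits at the lower edge of the grid that tuneLength = 20 generated, so an even smaller fraction might do better (the enet fit above already hinted at this with fraction = 0.05). A sketch of a finer search:
# sketch: search fractions below 0.1 explicitly
lasso.fine <- train(y = target_train, x = feature_train, method = "lasso",
                    preProcess = c('center', 'scale'), trControl = ctrl,
                    tuneGrid = expand.grid(fraction = seq(0.01, 0.15, by = 0.01)))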
Now let's fit a ridge regression model.
ridge.fat<- train(y = target_train, x=feature_train, method = "ridge",
preProcess= c('center','scale'),
trControl = ctrl,
tuneLength = 20)
ridge.fat
## Ridge Regression
##
## 174 samples
## 100 predictors
##
## Pre-processing: centered (100), scaled (100)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 158, 157, 157, 157, 156, 157, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0000000000 3.565383 0.9067496 2.190629
## 0.0001000000 2.749704 0.9581747 2.146717
## 0.0001467799 2.770849 0.9575632 2.166931
## 0.0002154435 2.789954 0.9570080 2.185274
## 0.0003162278 2.807272 0.9565106 2.201707
## 0.0004641589 2.824334 0.9560317 2.218057
## 0.0006812921 2.844081 0.9554808 2.237957
## 0.0010000000 2.871023 0.9547045 2.268207
## 0.0014677993 2.911301 0.9534746 2.326249
## 0.0021544347 2.972109 0.9514945 2.406859
## 0.0031622777 3.059966 0.9484456 2.516202
## 0.0046415888 3.178035 0.9440882 2.644497
## 0.0068129207 3.324083 0.9383823 2.785476
## 0.0100000000 3.491251 0.9315358 2.928000
## 0.0146779927 3.672513 0.9238819 3.066239
## 0.0215443469 3.866980 0.9155918 3.196014
## 0.0316227766 4.084518 0.9063442 3.326368
## 0.0464158883 4.345442 0.8950827 3.484902
## 0.0681292069 4.673726 0.8799349 3.684336
## 0.1000000000 5.085839 0.8583363 3.958740
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# collect the optimal tuning parameters for the penalized models
mat <- matrix(NA, nrow = 2, ncol = 3)
mat <- data.frame(mat, row.names = c("lambda", "fraction"))
colnames(mat) <- c("Ridge", "Lasso", "Enet")
mat[1,1] <- ridge.fat$bestTune[1]
mat[2,2] <- lasso.fat$bestTune[1]
mat[1,3] <- enet.fat$bestTune[2]
mat[2,3] <- enet.fat$bestTune[1]
mat
## Ridge Lasso Enet
## lambda 1e-04 NA 0.00
## fraction NA 0.1 0.05
library(ggpubr)
r<-ggplot(ridge.fat)+ggtitle('ridge')
l<-ggplot(lasso.fat)+ggtitle('lasso')
e<-ggplot(enet.fat)+ggtitle('enet')
ggarrange(r,l,e)
Comparing the resampling results, the elastic net shows the lowest cross-validated RMSE (about 2.31, at fraction = 0.05 and lambda = 0, which makes it effectively a lasso-type fit), with the lasso and ridge models close behind.
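To make that comparison explicit, the best resampled RMSE for each model can be pulled from the fitted train objects (a quick sketch):
# best cross-validated RMSE for each model
sapply(list(lm = lm.fat, pcr = pcr.fat, pls = pls.fat,
            ridge = ridge.fat, lasso = lasso.fat, enet = enet.fat),
       function(m) min(m$results$RMSE, na.rm = TRUE))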
Now let's evaluate the models on our test set.
pred.lm<-predict(lm.fat, feature_test)
pred.pls<-predict(pls.fat, feature_test)
pred.pcr<-predict(pcr.fat, feature_test)
pred.ridge <- predict(ridge.fat, feature_test)
pred.lasso <- predict(lasso.fat, feature_test)
pred.enet <- predict(enet.fat, feature_test)
lm.RMSE <- RMSE(pred.lm, target_test)
pls.RMSE <- RMSE(pred.pls, target_test)
pcr.RMSE <- RMSE(pred.pcr, target_test)
ridge.RMSE <- RMSE(pred.ridge, target_test)
lasso.RMSE <- RMSE(pred.lasso, target_test)
enet.RMSE <- RMSE(pred.enet, target_test)
rmse_tab <- cbind(lm.RMSE, pls.RMSE, pcr.RMSE, ridge.RMSE, lasso.RMSE, enet.RMSE)
rownames(rmse_tab) <- "RMSE"
rmse_tab
## lm.RMSE pls.RMSE pcr.RMSE ridge.RMSE lasso.RMSE enet.RMSE
## RMSE 3.400684 3.803044 3.367648 3.400684 3.803044 3.367648
Exercise 6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
a
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.2
data(permeability)
b
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
zero_fingerprints<-nearZeroVar(fingerprints)
f_clean<-fingerprints[,-zero_fingerprints]
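To answer the question directly, the column counts before and after filtering can be checked (a quick check, not shown in the original run):
# predictors before and after near-zero-variance filtering
c(before = ncol(fingerprints), after = ncol(f_clean))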
c
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal, and what is the corresponding resampled estimate of R^2?
set.seed(123)
training_rows<-createDataPartition(permeability,
p=.75,
list = FALSE
)
train_feature<-f_clean[training_rows,]
train_target<-permeability[training_rows,]
test_feature<-f_clean[-training_rows,]
test_target<-permeability[-training_rows,]
Now let's fit and tune the PLS model.
set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
plsTune<-train(y=log(train_target), x = train_feature,
method = 'pls',
metric = 'Rsquared',
tuneGrid = expand.grid(ncomp = 1:15),
trControl = ctrl)
plot(plsTune)
The resampled R^2 (the metric used for tuning here) peaks at around nine components, indicating that nine latent variables are optimal.
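The corresponding resampled R^2 can be read from the tuning results at the selected number of components (a quick sketch):
# resampled performance at the chosen ncomp
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)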
d
Predict the response for the test set. What is the test-set estimate of R^2?
pred<-predict(plsTune, newdata=test_feature)
postResample(pred = pred, obs = test_target)
## RMSE Rsquared MAE
## 18.4875399 0.6523581 11.0062569
e
Let's try building some other models discussed in the chapter.
set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
enetTune<-train(y=log(train_target), x = train_feature,
method = 'enet',
preProcess = c('center','scale'),
metric = 'Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),  # note: fraction = 0 yields constant predictions,
.lambda = seq(0, 1, by=0.1)),                        # which produces the missing-value warnings below
trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(y = log(train_target), x = train_feature, method =
## "enet", : missing values found in aggregated results
plot(enetTune)
Now let's try a lasso fit.
set.seed(123)
ctrl<-trainControl(method = 'LGOCV')
lassoTune<-train(y=log(train_target), x = train_feature,
method = 'lasso',
preProcess = c('center','scale'),
metric = 'Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, .5, by=0.05)),
trControl = ctrl)
## Warning: model fit failed for Resample05: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample10: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample16: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Resample23: fraction=0.5 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(y = log(train_target), x = train_feature, method =
## "lasso", : missing values found in aggregated results
plot(lassoTune)
Let's compare the models' resampling results.
resamp<-resamples(list(PLS =plsTune, enet= enetTune, Lasso = lassoTune))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, enet, Lasso
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.6937625 0.7600335 0.8122873 0.8351074 0.8922488 1.081546 0
## enet 0.6755287 0.7798982 0.8484279 0.8380389 0.8753350 1.002401 0
## Lasso 0.7411175 0.8509494 0.9268667 0.9454309 0.9835806 1.429989 4
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.9008501 1.0168419 1.093672 1.089024 1.157039 1.325048 0
## enet 0.8615021 0.9949612 1.070893 1.076277 1.165211 1.313763 0
## Lasso 0.9561000 1.0871150 1.143622 1.180296 1.230222 1.727219 4
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.3572540 0.5281833 0.5527050 0.5509718 0.5810646 0.6802082 0
## enet 0.4104164 0.4976425 0.5520461 0.5462438 0.5836227 0.6935267 0
## Lasso 0.4012249 0.4531927 0.5052610 0.5231025 0.5602233 0.6830847 4
PLS and the elastic net perform comparably across the resamples and both edge out the lasso: PLS has the highest mean R^2, while the elastic net has a slightly lower mean RMSE. Let's evaluate the PLS model on the test set.
pls_pred<-predict(plsTune, newdata = test_feature)
postResample(pred = pls_pred, obs = test_target)
## RMSE Rsquared MAE
## 18.4875399 0.6523581 11.0062569
f
I would not recommend using this model to replace the permeability experiment. As shown above, the MAE (mean absolute error, which is in the units of the response) is over 10, while the histogram below shows that almost all of the permeability values lie below 10 units, so our margin of error spans this very clear dividing line. If there were a specific permeability threshold we wanted to test against, a classification model could be used instead (a rough sketch of that alternative follows the histogram).
histogram(permeability)
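As a rough illustration of that alternative, the response could be binned at a hypothetical cutoff (10 units here is an assumption, purely for illustration) and a classifier trained on the same fingerprints:
# sketch: recast the problem as classification at a hypothetical 10-unit cutoff
perm_class <- factor(ifelse(permeability[, 1] > 10, "high", "low"))
set.seed(123)
class_fit <- train(y = perm_class[training_rows], x = train_feature,
                   method = "pls", metric = "Kappa",
                   trControl = trainControl(method = "LGOCV"))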