library(AppliedPredictiveModeling)
data(permeability)
There are 388 predictors left for modeling.
dim(fingerprints[,-nearZeroVar(fingerprints)])
## [1] 165 388
We will use the preProcess function to remove all highly correlated values and impute the median for missing values. ncomp=3 is optimal for this model with the lowest RMSE and an Rsquared of 0.36.
fp<-fingerprints[,-nearZeroVar(fingerprints)]
set.seed(6354)
partition <- createDataPartition(permeability, p=0.75, list = FALSE)
training_f <- fp[partition,]
testing_f <- fp[-partition,]
training_p <- permeability[partition,]
testing_p <- permeability[-partition,]
fit <-train(training_f, training_p, method="pls", preProcess=c( "corr", "medianImpute"))
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
fit
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: median imputation (388)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 125, 125, 125, 125, 125, 125, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.36370 0.2029552 10.961263
## 2 13.10083 0.3438063 9.231339
## 3 12.96620 0.3604807 9.369437
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(fit)
fit$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 14.36370 0.2029552 10.961263 1.918080 0.09980926 1.290841
## 2 2 13.10083 0.3438063 9.231339 1.926232 0.10733951 1.252370
## 3 3 12.96620 0.3604807 9.369437 1.522479 0.10059657 1.047166
Rsquared = 0.495
p <- predict(fit, newdata=testing_f)
postResample(pred = p, obs = testing_p)
## RMSE Rsquared MAE
## 11.7379154 0.4946969 8.8731725
###(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Neither the linear regressionmodel or pcr model used below offer better performance.
fitlm <-train(training_f, training_p, method="lm", preProcess=c( "corr", "medianImpute"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
fitlm$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 52.71606 0.08646621 34.08798 27.77341 0.06522266 16.68144
p <- predict(fitlm, newdata=testing_f)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
postResample(pred = p, obs = testing_p)
## RMSE Rsquared MAE
## 32.80868071 0.06333511 21.01243721
fitpcr <-train(training_f, training_p, method="pcr", preProcess=c( "corr", "medianImpute"))
fitpcr$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 14.67980 0.08779887 11.25151 1.600385 0.05716365 1.259675
## 2 2 14.53946 0.10749057 11.22314 1.671931 0.07180479 1.343948
## 3 3 14.06675 0.17290361 10.63392 1.912729 0.11051256 1.354806
p <- predict(fitpcr, newdata=testing_f)
postResample(pred = p, obs = testing_p)
## RMSE Rsquared MAE
## 14.5087914 0.2568094 10.3661885
No, the highest r squared of only 0.36 does not suggest it should replace the lab experiment.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Will use the prepocess function to impute the median for missing values.
chem<-preProcess(ChemicalManufacturingProcess,
method = c("medianImpute"))
Optimal ncomp for the PLS metric = 1.
set.seed(6354)
partition <- createDataPartition(ChemicalManufacturingProcess[,1] , p=0.75, list=F)
training <- ChemicalManufacturingProcess[partition,-1]
y_training<- ChemicalManufacturingProcess[partition,1]
testing<- ChemicalManufacturingProcess[-partition,-1]
y_testing<- ChemicalManufacturingProcess[-partition,1]
fit <-train(training, y_training, method="pls", preProcess=c( "corr", "medianImpute"))
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
fit
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## Pre-processing: median imputation (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.678556 0.1684016 1.378780
## 2 1.803978 0.1634518 1.409225
## 3 2.521718 0.1264553 1.596152
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(fit)
fit$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 1.678556 0.1684016 1.378780 0.1257655 0.06816547 0.1368997
## 2 2 1.803978 0.1634518 1.409225 0.6154606 0.07680899 0.1871861
## 3 3 2.521718 0.1264553 1.596152 1.9041071 0.09249665 0.5249521
The rquared of .114 which is less then the training set .168
p <- predict(fit, newdata=testing)
postResample(pred = p, obs = y_testing)
## RMSE Rsquared MAE
## 1.8689643 0.1148996 1.4249476
Process predictors severely dominate the list, making up all of the top 10 with ManufacturingProcess12 being by far the most important.
varImp(fit)
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess12 100.0000
## ManufacturingProcess20 4.1922
## ManufacturingProcess18 3.9803
## ManufacturingProcess26 2.7624
## ManufacturingProcess16 2.4947
## ManufacturingProcess15 1.8589
## ManufacturingProcess19 0.8588
## ManufacturingProcess25 0.7621
## ManufacturingProcess32 0.4995
## ManufacturingProcess05 0.3770
## ManufacturingProcess27 0.3343
## BiologicalMaterial02 0.2761
## BiologicalMaterial03 0.2708
## BiologicalMaterial06 0.2631
## ManufacturingProcess02 0.2373
## BiologicalMaterial11 0.2248
## ManufacturingProcess28 0.2203
## ManufacturingProcess35 0.2098
## ManufacturingProcess04 0.1899
## ManufacturingProcess14 0.1867
We see that Yield is megatively correlated with many of the top process fields and positively correlated with the Manufacturing process 12. Although manufacturing process 26 is only slightly correlated with yield, it is highly correlated with process 12. So it may be that in order to increase process 12 and increase in 26 is also necessary.
v<-ChemicalManufacturingProcess%>%drop_na()%>%dplyr::select(Yield,ManufacturingProcess12, ManufacturingProcess20,ManufacturingProcess18, ManufacturingProcess26,ManufacturingProcess16)
as.data.frame(cor(v))
## Yield ManufacturingProcess12
## Yield 1.00000000 0.39134855
## ManufacturingProcess12 0.39134855 1.00000000
## ManufacturingProcess20 -0.24455635 -0.40475972
## ManufacturingProcess18 -0.18662921 -0.17799305
## ManufacturingProcess26 0.02949117 0.03462301
## ManufacturingProcess16 -0.16354183 -0.42905315
## ManufacturingProcess20 ManufacturingProcess18
## Yield -0.2445564 -0.18662921
## ManufacturingProcess12 -0.4047597 -0.17799305
## ManufacturingProcess20 1.0000000 0.55734163
## ManufacturingProcess18 0.5573416 1.00000000
## ManufacturingProcess26 0.1095043 0.08148943
## ManufacturingProcess16 0.6325275 0.12601181
## ManufacturingProcess26 ManufacturingProcess16
## Yield 0.02949117 -0.16354183
## ManufacturingProcess12 0.03462301 -0.42905315
## ManufacturingProcess20 0.10950427 0.63252747
## ManufacturingProcess18 0.08148943 0.12601181
## ManufacturingProcess26 1.00000000 0.09012531
## ManufacturingProcess16 0.09012531 1.00000000