In Kuhn and Johnson, do problems 6.2 and 6.3. There are only two problems, but they consist of many parts. Please submit a link to your RPubs page and submit the .Rmd file as well.
## install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
##install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
dim(fingerprints)
## [1] 165 1107
fingerprints_filtered <- fingerprints[,-nearZeroVar(fingerprints)]
dim(fingerprints_filtered)
## [1] 165 388
There are 388 predictors left after applying the nearZeroVar function.
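As a quick sanity check, the number of discarded predictors can be read straight off nearZeroVar(); this is a small optional sketch (the object name nzv is just illustrative):

nzv <- nearZeroVar(fingerprints)   # indices of the near-zero-variance columns
length(nzv)                        # 1107 - 388 = 719 predictors are dropped
stopifnot(ncol(fingerprints) - length(nzv) == ncol(fingerprints_filtered))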
set.seed(1234)
trainIndex <- createDataPartition(permeability, p = .75, list = FALSE) #split data into 75 training / 25 test
x_train <- fingerprints_filtered[trainIndex, ]
x_test <- fingerprints_filtered[-trainIndex, ]
y_train <- permeability[trainIndex]
y_test <- permeability[-trainIndex]
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(caret)
set.seed(100)
pls_model <- train(
  x_train,
  y_train,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = trainControl(method = "cv")
)
pls_model
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 113, 112, 112, 113, 113, 112, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.42916 0.2711872 10.090545
## 2 12.49593 0.3811331 8.943130
## 3 12.28442 0.4178946 9.368835
## 4 12.31807 0.4031416 9.273239
## 5 11.75901 0.4431513 8.588036
## 6 11.70064 0.4623612 8.890948
## 7 11.80612 0.4562449 9.096516
## 8 11.84144 0.4592960 9.295636
## 9 11.90034 0.4574474 9.438925
## 10 12.03766 0.4503017 9.640827
## 11 12.28704 0.4417652 9.671329
## 12 12.21722 0.4467392 9.626235
## 13 12.38543 0.4300270 9.765697
## 14 12.30309 0.4477841 9.704548
## 15 12.53427 0.4403767 9.778292
## 16 12.90168 0.4297984 10.168555
## 17 13.05135 0.4192600 10.122761
## 18 13.07595 0.4248108 10.182406
## 19 13.22557 0.4181248 10.254585
## 20 13.39426 0.4179436 10.287528
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
summary(pls_model)
## Data: X dimension: 125 388
## Y dimension: 125 1
## Fit method: oscorespls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 22.87 34.97 39.58 45.32 51.66 57.55
## .outcome 27.94 44.42 54.94 61.69 66.91 70.44
Per the output, the optimal model uses ncomp = 6 (selected by the smallest cross-validated RMSE), which also gives the highest cross-validated R-squared, 0.4624.
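A tuning profile makes this choice easier to see; as an optional sketch, caret's plot method for train objects draws the cross-validated RMSE against the number of components:

plot(pls_model)       # cross-validated RMSE versus ncomp
pls_model$bestTune    # the ncomp value with the smallest cross-validated RMSE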
plspredict <- predict(pls_model, x_test)
postResample(pred=plspredict, obs = y_test)
## RMSE Rsquared MAE
## 11.2419636 0.5517762 8.2036003
The R-squared of the predictions on the test set is 0.55, which is higher than the cross-validated R-squared on the training set.
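As an optional visual check, an observed-versus-predicted plot shows how the test-set predictions track the measured permeability values:

plot(y_test, plspredict,
     xlab = "Observed permeability", ylab = "Predicted permeability",
     main = "PLS: test-set predictions")
abline(0, 1, lty = 2)   # points on the dashed line would be predicted perfectly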
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(x_train, y_train,
                     method = "ridge",
                     tuneGrid = ridgeGrid,
                     trControl = trainControl(method = "cv"),
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 113, 112, 112, 113, 113, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 1.474287e+16 0.3189323 6.556967e+15
## 0.007142857 1.394168e+01 0.4206754 1.036431e+01
## 0.014285714 1.330548e+01 0.4356740 1.004718e+01
## 0.021428571 1.292971e+01 0.4450035 9.800008e+00
## 0.028571429 1.270412e+01 0.4504323 9.646346e+00
## 0.035714286 1.252683e+01 0.4554367 9.534768e+00
## 0.042857143 1.242359e+01 0.4585581 9.482024e+00
## 0.050000000 1.233803e+01 0.4614329 9.443610e+00
## 0.057142857 1.230881e+01 0.4623214 9.436483e+00
## 0.064285714 1.221000e+01 0.4652737 9.394075e+00
## 0.071428571 1.216866e+01 0.4669886 9.375667e+00
## 0.078571429 1.213848e+01 0.4682778 9.361588e+00
## 0.085714286 1.211676e+01 0.4694780 9.355642e+00
## 0.092857143 1.210577e+01 0.4701863 9.352581e+00
## 0.100000000 1.209146e+01 0.4711019 9.355625e+00
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
predict_ridgefit<- predict(ridgeRegFit, x_test)
postResample(pred=predict_ridgefit, obs = y_test)
## RMSE Rsquared MAE
## 11.4386118 0.5685297 8.1866511
The test-set R-squared of the ridge model is about 0.57, which is slightly better than the PLS model's 0.55.
Based on the model outputs, I would favor the ridge regression model due to its slightly better test-set performance. However, the differences in R-squared across the models are minimal, suggesting that both approaches yield comparable predictive accuracy.
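One way to make the comparison explicit is a small side-by-side table of the test-set metrics, reusing the postResample() calls above (test_perf is just an illustrative name):

test_perf <- rbind(
  PLS   = postResample(pred = plspredict,       obs = y_test),
  Ridge = postResample(pred = predict_ridgefit, obs = y_test)
)
round(test_perf, 3)   # RMSE, R-squared, and MAE for each model on the test set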
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
dim(ChemicalManufacturingProcess)
## [1] 176 58
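A small percentage of cells in the predictor set contain missing values, which is what motivates the imputation step below. A quick optional check of the extent of missingness:

sum(is.na(ChemicalManufacturingProcess))   # total number of missing cells
sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE)[1:5]   # predictors with the most NAs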
#install.packages("RANN")
library(RANN)
set.seed(100)
preProcValues <- preProcess(ChemicalManufacturingProcess[, -ncol(ChemicalManufacturingProcess)], method = "knnImpute")
data_imputed <- predict(preProcValues, ChemicalManufacturingProcess)
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.75, list = FALSE) # split the imputed data into training (75%) and testing (25%) sets
trainData <- data_imputed[trainIndex, ]
testData <- data_imputed[-trainIndex, ]
pls_mod <- train(Yield ~ .,
                 data = trainData,
                 method = "pls",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv"))
pls_mod
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 119, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7690372 0.4477934 0.6181578
## 2 1.1331753 0.4855548 0.6912467
## 3 0.7664986 0.5719025 0.5705335
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The optimal value is ncomp = 3, with a cross-validated R-squared of 0.5719.
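Note that train() only evaluated ncomp = 1 to 3 here because caret's default tuneLength is 3; a wider grid could be explored, as in the optional sketch below (pls_mod_wide is just an illustrative name, and the results may or may not improve):

pls_mod_wide <- train(Yield ~ .,
                      data = trainData,
                      method = "pls",
                      preProcess = c("center", "scale"),
                      tuneLength = 20,   # search 1-20 components instead of the default 3
                      trControl = trainControl(method = "cv"))
plot(pls_mod_wide)   # cross-validated RMSE across the wider grid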
postResample(pred = predict(pls_mod,testData), obs =testData$Yield)
## RMSE Rsquared MAE
## 0.6950735 0.4943072 0.5814548
The test-set R-squared is 0.49, which is lower than the cross-validated R-squared on the training set.
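A compact way to see this drop is to line up the cross-validated metrics for the chosen model against the test-set metrics (cv_best and test_best are just illustrative names):

cv_best  <- pls_mod$results[pls_mod$results$ncomp == pls_mod$bestTune$ncomp,
                            c("RMSE", "Rsquared", "MAE")]
test_best <- postResample(pred = predict(pls_mod, testData), obs = testData$Yield)
rbind(CV = unlist(cv_best), Test = test_best)   # resampled versus held-out performance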
varImp_pcr <- varImp(pls_mod)
print(varImp_pcr)
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 85.64
## ManufacturingProcess17 85.54
## ManufacturingProcess13 84.50
## ManufacturingProcess36 83.84
## ManufacturingProcess06 66.89
## BiologicalMaterial02 59.43
## ManufacturingProcess33 59.17
## ManufacturingProcess11 59.00
## BiologicalMaterial06 58.04
## BiologicalMaterial08 57.77
## BiologicalMaterial03 57.63
## BiologicalMaterial12 51.54
## BiologicalMaterial11 51.14
## BiologicalMaterial01 50.48
## BiologicalMaterial04 49.32
## ManufacturingProcess37 47.46
## ManufacturingProcess12 45.50
## ManufacturingProcess28 44.51
## ManufacturingProcess02 40.89
plot(varImp_pcr, top = 10)
Above are the top 10 predictors in the model; the manufacturing process predictors dominate the list.
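To make the biological-versus-process comparison concrete, the importance scores can be tallied directly; an optional sketch using the varImp object above (imp and top10 are just illustrative names):

imp   <- varImp_pcr$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))   # process predictors dominate the top 10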
#install.packages("DataExplorer")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(DataExplorer)
data_imputed %>%
  select(Yield,
         ManufacturingProcess32,
         ManufacturingProcess09,
         ManufacturingProcess17,
         ManufacturingProcess13,
         ManufacturingProcess36,
         ManufacturingProcess06,
         BiologicalMaterial02,
         ManufacturingProcess33,
         ManufacturingProcess11,
         BiologicalMaterial06) %>%
  plot_correlation()
Overall, the correlation plot highlights several process variables that may negatively impact yield. It is worth investigating these variables further, as their combined effects may be more significant than their individual ones. Equally important are the processes that enhance yield, especially understanding how they interact with the others.
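The same relationships can be quantified numerically; as an optional sketch, the signed correlation of each top predictor with Yield can be computed on the imputed data (top_vars is just an illustrative name):

top_vars <- c("ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess17",
              "ManufacturingProcess13", "ManufacturingProcess36", "ManufacturingProcess06",
              "BiologicalMaterial02", "ManufacturingProcess33", "ManufacturingProcess11",
              "BiologicalMaterial06")
sort(sapply(top_vars, function(v) cor(data_imputed[[v]], data_imputed$Yield)))   # negative values pull yield down, positive values push it up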