data(permeability)
dim(fingerprints)
## [1] 165 1107
dim(permeability)
## [1] 165 1
class(fingerprints)
## [1] "matrix"
class(permeability)
## [1] "matrix"
Using the above command, two matrices are loaded. The matrix fingerprints contains 1,107 binary molecular predictors for 165 compounds, while permeability contains permeability response.
We will use the function nearZeroVar from caret package to check the predictors with low frequencies or near zero frequencies. These predictors will be removed in order to get unbiased results.
388 predictors out of 1,107 were left for modeling after removing the low variance predictors.
nzv <- nearZeroVar(fingerprints)
fingerprints <- fingerprints[,-nzv]
dim(fingerprints)
## [1] 165 388
## 388 predictors are left for modeling after droping zerovarianc predictors
We will use createDataPartition function from caret package to split the data into training and testing sets. Using general rule, we will split in 75-25 ration for our testing and trainig data.
Preprocess: We will use preProcess function from the caret package to process our data for training the model. Within that function, nzv identifies the numeric predictors with near zero variance and remove them rom the model, corr identifies highly correlated predictors and filter them out, center subtracts the mean of the predictor’s data from the predictor values, scale method divides values by the standard deviation of the predictor, BoxCox will be simply transforming the data, and medianimpute will impute the missing values.
Train- We will use traincontrol from caret package, to perform differnt types of cross validatiosn. there are different methods for that, we will use repeatedCV. After this preprocess and train control, we will tune our model on the training data.
train_r <- createDataPartition(permeability, p=0.75, list=F)
fp_train <- fingerprints[train_r,]
perm_train <- permeability[train_r,]
fp_test <- fingerprints[-train_r,]
perm_test <- permeability[-train_r]
p_pro <- c("nzv", "corr", "center","scale", "medianImpute")
t_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
pls_fit <-train(fp_train, perm_train, method="pls",tuneLength=10, preProcess=p_pro,trControl=t_ctrl)
pls_fit
## Partial Least Squares
##
## 125 samples
## 388 predictors
##
## Pre-processing: centered (103), scaled (103), median imputation
## (103), remove (285)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 113, 113, 112, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.72563 0.4747334 8.743535
## 2 11.20316 0.5334524 8.367818
## 3 11.42837 0.5103012 8.831527
## 4 11.58258 0.4972196 8.765828
## 5 11.61416 0.4936065 8.754583
## 6 11.76662 0.4898154 8.776972
## 7 12.24152 0.4662024 8.999535
## 8 12.68327 0.4508898 9.502555
## 9 13.04750 0.4382922 9.823628
## 10 13.20644 0.4353327 9.881539
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pls_fit)
print( sprintf( "%-10s: No. of optimal variables", which.min(pls_fit$results[,2])) )
## [1] "2 : No. of optimal variables"
print( sprintf( "%-10s: Corresponding r-squared", which.min(pls_fit$results[,3])) )
## [1] "10 : Corresponding r-squared"
After we fit our model, we will pedict the response for our tst set using predict function.
pls_pred <- predict(pls_fit, newdata=fp_test)
postResample(pred = pls_pred, obs = perm_test)
## RMSE Rsquared MAE
## 11.6628678 0.4115886 7.9859438
We tried to build two additional models besides PLS. Linear model and elatic net using similar preprocessing the data. We recieved warning messages loop in linear model. Research revealed that this might be due to using oomany predictors in our model. Although we removed near zero variance predictors and highly correlated predictors in the preprocessing, the lm was ## however not able to fit properly. This model might notbe a good fit. Elastic net model was built using the similar preporcessing on the data.
lm_fit = suppressWarnings(train(fp_train, perm_train, method="lm", preProcess= p_pro, trControl=t_ctrl))
# warning : - prediction from a rank-deficient fit may be misleading
## This might be due to using oomany predictors in our model. Although we removed near zero variance predictors and highly correlated predictors in the preprocessing, the lm was ## however not able to fit properly. This model might notbe a good fit.
# https://discuss.analyticsvidhya.com/t/what-does-the-warning-message-prediction-from-a-rank-deficient-fit-mean-in-logistic-regression/4282
lm_pred <- predict(lm_fit,newdata=fp_test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
postResample(pred = lm_pred, obs = perm_test)
## RMSE Rsquared MAE
## 34.63448647 0.06670543 21.13009032
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
enet_fit <- train(fp_train, perm_train, method="enet", tuneGrid = enetGrid,preProcess= p_pro, trControl=t_ctrl)
enet_pred <- predict(enet_fit,newdata=fp_test)
postResample(pred = enet_pred, obs = perm_test)
## RMSE Rsquared MAE
## 12.1794890 0.3604331 8.8611718
resamp = resamples(list(pls=pls_fit,enet=enet_fit, lm_fit) )
print(summary(resamp))
##
## Call:
## summary.resamples(object = resamp)
##
## Models: pls, enet, Model3
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 4.179468 6.906092 7.868679 8.367818 9.287634 15.05106 0
## enet 4.663404 7.200955 8.045945 8.302044 8.996797 12.85238 0
## Model3 7.289643 13.829058 19.275556 20.865127 24.170777 51.78726 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 5.343196 8.778338 10.34084 11.20316 13.67811 20.04514 0
## enet 6.523553 9.141823 11.07884 11.39359 13.32839 17.41499 0
## Model3 11.105107 19.729713 30.31001 34.29961 40.23166 109.10411 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## pls 0.0101119258 0.36748215 0.6014957 0.5334524 0.7237201 0.8587618
## enet 0.0420575402 0.34678604 0.5424824 0.5125444 0.7347710 0.8907623
## Model3 0.0002753923 0.05213484 0.2481770 0.2580875 0.3981954 0.7121657
## NA's
## pls 0
## enet 0
## Model3 0
#https://stackoverflow.com/questions/36482475/create-data-partition-into-training-testing-and-validation-split-in-r
#https://stackoverflow.com/questions/46572275/the-train-function-in-r-caret-package
#https://www.rdocumentation.org/packages/caret/versions/3.21/topics/postResample
#https://machinelearningmastery.com/pre-process-your-dataset-in-r/
Comparing between pls model and enet model, we can prefer enet model.
lationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro-cess. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# http://appliedpredictivemodeling.com/errata
processPredictors = as.matrix(ChemicalManufacturingProcess[,2:58])
yield = ChemicalManufacturingProcess[,1]
We will use bag impute to treat the missing values. The data will be processed in the process of training the model using the function preprocess and method knnimpute. We will treat the missing values and check here also.
processPredictors_m <- preProcess(processPredictors, method="medianImpute", verbose=T)
## Computing medians for 57 predictors... done
Using createDataPartition in the caret package, we will split the data and preprocess using the methods as above. nzv identifies the numeric predictors with near zero variance and remove them rom the model, corr identifies highly correlated predictors and filter them out, center subtracts the mean of the predictor’s data from the predictor values, scale method divides values by the standard deviation of the predictor, BoxCox will be simply transforming the data, and medianimpute will impute the missing values. We will fit partial least squares on training data.
train_r <- createDataPartition(yield, p=0.75, list=F)
pp_train <- ChemicalManufacturingProcess[train_r,-1]
y_train <- ChemicalManufacturingProcess[train_r,1]
pp_test <- ChemicalManufacturingProcess[-train_r,-1]
y_test <- ChemicalManufacturingProcess[-train_r,1]
p_pro <- c("nzv", "corr", "center","scale", "medianImpute")
tr_ctrl =trainControl(method = "boot", number=5)
pls_fit <-train(pp_train, y_train, method="pls", tuneLength = 10, preProcess=p_pro,trControl=t_ctrl)
pls_fit
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (49), scaled (49), median imputation (49),
## remove (8)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 119, 120, 120, 119, 120, 117, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.553330 0.4788648 1.197820
## 2 1.667977 0.5264871 1.182179
## 3 1.438762 0.5570214 1.104316
## 4 1.467156 0.5458387 1.117981
## 5 1.531955 0.5296412 1.139940
## 6 1.564005 0.5303612 1.172071
## 7 1.745984 0.5199421 1.247957
## 8 1.853568 0.5105140 1.275794
## 9 1.897786 0.5109370 1.286091
## 10 1.874342 0.5138742 1.274363
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(pls_fit)
pls_pred <- predict(pls_fit, pp_test)
postResample(pred = pls_pred, obs = y_test)
## RMSE Rsquared MAE
## 1.7144886 0.2357345 1.0800501
The dominant predictors will be identified using varImp function in caret package.
impv <- varImp(pls_fit)
## Warning: package 'pls' was built under R version 3.5.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
plot(impv, top=20)
We will create a list of top five dominant variables and will check the correlation between the variables itself and response variables. Exploring this will provide us better picture to pick bewteen the dominat ratio of biolgical vs manufacturing predictors. Negative correlations and positive correlations between the predictors and response will further provide us an overview for tuning the model and reducing the predictors as wqe have seen in our linear model, more predictors will make the models complex and it can impact the model fit. We explored the dominant models in step above, we got an inference that manufacturing predictors were dominant than bilogical.
imp_train <- pp_train %>%select(ManufacturingProcess32,ManufacturingProcess13,ManufacturingProcess36, ManufacturingProcess09, ManufacturingProcess17)
cor(imp_train)
## ManufacturingProcess32 ManufacturingProcess13
## ManufacturingProcess32 1.00000000 -0.1459677
## ManufacturingProcess13 -0.14596770 1.0000000
## ManufacturingProcess36 NA NA
## ManufacturingProcess09 0.06097087 -0.8254165
## ManufacturingProcess17 -0.07185669 0.8045073
## ManufacturingProcess36 ManufacturingProcess09
## ManufacturingProcess32 NA 0.06097087
## ManufacturingProcess13 NA -0.82541646
## ManufacturingProcess36 1 NA
## ManufacturingProcess09 NA 1.00000000
## ManufacturingProcess17 NA -0.76063541
## ManufacturingProcess17
## ManufacturingProcess32 -0.07185669
## ManufacturingProcess13 0.80450727
## ManufacturingProcess36 NA
## ManufacturingProcess09 -0.76063541
## ManufacturingProcess17 1.00000000
cor(imp_train, y_train)
## [,1]
## ManufacturingProcess32 0.6106951
## ManufacturingProcess13 -0.5605128
## ManufacturingProcess36 NA
## ManufacturingProcess09 0.5243719
## ManufacturingProcess17 -0.5164468