In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Exercises
library(MASS)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(stats)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
##
## R2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
dim(fingerprints) # check the dimensions of the fingerprints matrix
## [1] 165 1107
# dim() returns the dimensions of the matrix
The matrix has 165 rows and 1107 columns.
near_zero = nearZeroVar(fingerprints)
This function identifies the predictors with near-zero variance. A near-zero-variance predictor takes almost the same value for every compound, for example a column that is nearly all 1s or nearly all 0s, such as [1, 1, 1, 1, 1] or [0, 0, 0, 0, 0], so it carries little information for the model.
# drop the near-zero-variance predictors using negative indexing (-near_zero)
filtered_fingerprints = fingerprints[, -near_zero]
filtered_fingerprints is now the data set with the near-zero-variance predictors removed.
dim(filtered_fingerprints)
## [1] 165 388
After dropping the near-zero-variance predictors, we get a new matrix with 165 rows and 388 columns.
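The filtering idea can be sketched in base R on a toy matrix. This is a simplified approximation, not caret's implementation; the helper `is_near_zero_var` and the toy data are hypothetical, and the thresholds are assumed to mirror the documented defaults of `nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Simplified near-zero-variance check (an approximation, not caret's code).
# Assumed thresholds: frequency ratio above 95/5 and under 10% distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # A column with a single distinct value has zero variance
  if (length(counts) < 2) return(TRUE)
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common value
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as a percent of rows
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Toy binary "fingerprint" matrix (hypothetical data, 100 compounds)
toy <- cbind(
  balanced   = rep(c(0, 1), times = 50),   # 50 zeros, 50 ones
  mostly_one = c(rep(1, 99), 0),           # 99 ones, 1 zero
  constant   = rep(0, 100)                 # all zeros
)

flags <- apply(toy, 2, is_near_zero_var)
print(flags)

# Keep only the informative columns
toy_filtered <- toy[, !flags, drop = FALSE]
dim(toy_filtered)
```

Only the balanced column survives here, because the other two columns are dominated by a single value.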
b.) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
set.seed(312)
The filtered fingerprints are the predictors (the x, or independent, variables) and permeability is the response (the y, or dependent, variable). The response is the quantity we want to predict: how readily a molecule passes through a membrane.
# split the data with the createDataPartition function (covered in chapter 4)
# 80 percent goes to training and 20 percent to testing
train_index = createDataPartition(permeability, p = 0.80, list = FALSE)
# list = FALSE returns the row indices as a matrix instead of a list, which makes it easier to subset the data
X_train = filtered_fingerprints[train_index, ]
# subset the filtered fingerprint matrix to the rows selected for training
X_test = filtered_fingerprints[-train_index, ]
# the test set takes the rows that are not in the training set
# the training and test sets must not share rows; otherwise the test results would be too optimistic
y_train = permeability[train_index]
# response values for the rows in the training set
y_test = permeability[-train_index]
# response values for the rows NOT in the training set
An 80/20 split is a common choice for training and testing data. With 80 percent, the model has enough examples to learn the patterns in the data. The remaining 20 percent is used to evaluate how well the model performs on data it has not seen.
Next, we tune a PLS model on the training data using cross-validation (notes from chapters 4 and 6).
# 10-fold cross-validation, set through the training control parameters
pls_ctrl = trainControl(method = "cv", number = 10)
10-fold cross-validation splits the training data into 10 parts (folds). The model is trained on 9 folds and tested on the remaining fold. This process repeats ten times, so that each fold is used once for testing, and the averaged result gives a performance estimate that is less likely to reward over fitting.
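The fold mechanics can be sketched in base R. This is a minimal illustration on simulated data with a plain linear model, not the permeability data and not caret's internal code; the data frame `sim` and its columns are hypothetical.

```r
# Minimal k-fold cross-validation sketch (simulated data, hypothetical names)
set.seed(1)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 * sim$x1 - sim$x2 + rnorm(n)

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # assign each row to one of k folds

fold_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- sim[fold_id != i, ]   # fit on the other k - 1 folds
  held_out   <- sim[fold_id == i, ]   # evaluate on the fold that was left out
  fit  <- lm(y ~ x1 + x2, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$y - pred)^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # the resampled (cross-validated) RMSE estimate
```

The mean over folds is the kind of number reported in the RMSE column of a caret resampling table.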
pls_model = train(
x = X_train,
y = y_train,
method = "pls", # partial least squares
preProcess = c("center", "scale"), # center and scale the predictors before fitting
tuneLength = 10, # try 10 different numbers of components (1 to 10)
# the components are latent variables: combinations of the predictors that summarize their shared structure
trControl = pls_ctrl, # 10-fold cross-validation
metric = "Rsquared" # choose the model with the best R squared
)
print(pls_model)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 121, 120, 118, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.04093 0.3103755 9.939871
## 2 12.22827 0.4122874 8.705134
## 3 12.40015 0.4195067 9.143409
## 4 12.69684 0.4045341 9.540222
## 5 12.86993 0.4226621 9.579177
## 6 12.70483 0.4358627 9.580128
## 7 12.56317 0.4375749 9.495606
## 8 12.24705 0.4380541 9.520556
## 9 12.22510 0.4433568 9.367976
## 10 12.44215 0.4398578 9.481070
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
plot(pls_model)
opti_value1 = pls_model$bestTune
print(opti_value1 )
## ncomp
## 9 9
Looking at our results, the optimal number of latent components is 9. With nine components the model has the highest resampled R squared, 0.443, meaning that about 44 percent of the variance in permeability is explained on the held-out folds. The corresponding RMSE (root mean squared error), which measures how far the predictions typically fall from the actual values, is 12.23, and the MAE (mean absolute error), the average absolute difference between the predicted and actual values, is 9.37. The model with 2 components is nearly as good on RMSE (12.23) and has the lowest MAE (8.71), so the gain from the extra components is small.
We have to check how well our model does on data it has not seen before. This is necessary because a model can memorize the training data, which causes over fitting. An over-fit model fails to generalize and gives low accuracy on new data.
pls_preds = predict(pls_model, newdata = X_test)
# postResample compares the predictions with the observed response values
# it shows how well the model predicts the target values
postResample(pls_preds, y_test)
## RMSE Rsquared MAE
## 9.933804 0.642241 7.939443
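On the test set the R squared is 0.642, so the model explains about 64 percent of the variation in permeability, with an RMSE of 9.93 and an MAE of 7.94. These values are better than the cross-validated estimates (R squared 0.443, RMSE 12.23), but the test set holds only 32 compounds, so the estimate is noisy. Roughly 36 percent of the variation remains unexplained, which means the model can still be improved.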
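The three metrics reported by postResample can be reproduced by hand. The sketch below uses made-up numbers and assumes that R squared is computed as the squared correlation between observed and predicted values, which is caret's default for regression.

```r
# Base R sketch of RMSE, R squared and MAE on made-up numbers
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 37)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
r2   <- cor(obs, pred)^2             # squared correlation (assumed definition)

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.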
training_ctrl = trainControl(method = "cv", number = 10)
linear_model = train(
x = X_train,
y = y_train,
method = "lm", # linear regression
preProcess = c("center", "scale"),
trControl = training_ctrl
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
ridge_model = train(
x = X_train,
y = y_train,
method = "glmnet",
preProcess = c("center", "scale"),
# lambda controls how strong the penalty is on large coefficients
# tuneGrid tries 10 lambda values, evenly spaced between 0.0001 and 1
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)),
trControl = training_ctrl
)
A lambda close to zero gives a model similar to ordinary linear regression, which only minimizes the sum of squared residuals and can over fit. A lambda that is too large shrinks the coefficients so strongly that the model under fits. Remember that ridge regression pushes coefficients close to 0, but never exactly to zero. Next, we put all the models into a list.
models = list(
"PLS" = pls_model,
"Linear Regression" = linear_model,
"Ridge Regression" = ridge_model
)
# resamples() collects the resampling results so the models can be compared
results = resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: PLS, Linear Regression, Ridge Regression
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## PLS 5.062580 8.854054 9.616833 9.367976 10.606981 11.98775
## Linear Regression 10.166223 15.956758 21.732021 25.374879 33.940273 46.27533
## Ridge Regression 6.389101 8.204465 8.506049 8.668495 9.428291 11.30136
## NA's
## PLS 0
## Linear Regression 0
## Ridge Regression 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 7.440460 11.87682 12.56390 12.22510 13.84332 15.36944 0
## Linear Regression 12.979289 23.03389 32.33058 37.43822 44.73796 75.89702 0
## Ridge Regression 7.937061 10.32177 11.31130 11.81360 13.08203 16.98667 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## PLS 0.0988254928 0.36917912 0.42577892 0.4433568 0.5036574
## Linear Regression 0.0002819909 0.06131537 0.07648197 0.1462571 0.1737759
## Ridge Regression 0.1361200693 0.36065813 0.43165539 0.4788606 0.6115515
## Max. NA's
## PLS 0.7377751 0
## Linear Regression 0.6465223 0
## Ridge Regression 0.8107929 0
print(models)
## $PLS
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 121, 120, 118, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.04093 0.3103755 9.939871
## 2 12.22827 0.4122874 8.705134
## 3 12.40015 0.4195067 9.143409
## 4 12.69684 0.4045341 9.540222
## 5 12.86993 0.4226621 9.579177
## 6 12.70483 0.4358627 9.580128
## 7 12.56317 0.4375749 9.495606
## 8 12.24705 0.4380541 9.520556
## 9 12.22510 0.4433568 9.367976
## 10 12.44215 0.4398578 9.481070
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
##
## $`Linear Regression`
## Linear Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 120, 120, 119, 118, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 37.43822 0.1462571 25.37488
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## $`Ridge Regression`
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 118, 120, 119, 121, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 11.8136 0.4788606 8.668495
## 0.1112 11.8136 0.4788606 8.668495
## 0.2223 11.8136 0.4788606 8.668495
## 0.3334 11.8136 0.4788606 8.668495
## 0.4445 11.8136 0.4788606 8.668495
## 0.5556 11.8136 0.4788606 8.668495
## 0.6667 11.8136 0.4788606 8.668495
## 0.7778 11.8136 0.4788606 8.668495
## 0.8889 11.8136 0.4788606 8.668495
## 1.0000 11.8136 0.4788606 8.668495
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
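The role of lambda in ridge regression can be illustrated with the closed-form ridge solution in base R. This is a sketch on simulated data; the helper `ridge_coef` is hypothetical, and the penalty is assumed to be lambda times the sum of squared coefficients on standardized predictors, which is not the same lambda scaling that glmnet uses.

```r
# Closed-form ridge sketch on simulated data (not glmnet's algorithm or scaling)
set.seed(2)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # centered and scaled predictors
y <- as.vector(X %*% c(3, -2, 0, 0, 1) + rnorm(n))
y <- y - mean(y)                         # centered response, so no intercept is needed

# beta = (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0, 1, 10, 100, 1000)
coef_norm <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
data.frame(lambda = lambdas, coef_norm = coef_norm)

# With lambda = 0 the solution equals ordinary least squares
ols <- coef(lm(y ~ X - 1))
all.equal(as.vector(ridge_coef(X, y, 0)), unname(ols))
```

The coefficient norm falls as lambda grows, but the individual coefficients stay nonzero even at the largest lambda.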
bwplot(results)
Observing the plot, both the PLS and ridge regression models show lower
MAE and RMSE and higher R squared values than the linear regression
model. The mean resampled R squared is 44.3 percent for PLS, 14.6
percent for linear regression, and 47.9 percent for ridge regression.
Both PLS and ridge regression have a better chance of explaining the
changes in permeability, with ridge regression slightly ahead (mean
RMSE of 11.81 versus 12.23 for PLS).
A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
help(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
b.) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).
Imputation replaces missing values with estimates derived from the observed data.
# pre-process the data with k-nearest neighbours to fill in the missing values
preprocessed_data = preProcess(ChemicalManufacturingProcess, method = "knnImpute") # k-nearest-neighbour imputation; caret also centers and scales the columns, including Yield
# predict() applies the fitted imputation to the data
imputed_data = predict(preprocessed_data, newdata = ChemicalManufacturingProcess)
sum(is.na(imputed_data))
## [1] 0
dim(imputed_data)
## [1] 176 58
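The idea behind k-nearest-neighbour imputation can be sketched in base R. This is a simplified illustration, not caret's implementation: the helper `knn_impute` and the toy data are hypothetical, donors are assumed to be rows with no missing values, and distances are assumed to be Euclidean on the columns observed in the incomplete row.

```r
# Simplified k-nearest-neighbour imputation sketch (hypothetical helper)
knn_impute <- function(df, k = 5) {
  m <- scale(as.matrix(df))   # center and scale; missing values are skipped
  complete <- m[stats::complete.cases(m), , drop = FALSE]
  for (i in which(!stats::complete.cases(m))) {
    miss <- is.na(m[i, ])
    # distance from row i to every complete row, using only the observed columns
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, m[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(complete)))]
    # fill each missing cell with the mean of the k nearest donors
    m[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  as.data.frame(m)            # the result stays on the centered and scaled scale
}

set.seed(3)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b[c(4, 17)] <- NA
toy$c[9] <- NA

filled <- knn_impute(toy, k = 5)
sum(is.na(filled))
dim(filled)
```

As in the report, the imputed table has the same shape as the input and no remaining missing values.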
set.seed(909)
# Yield is the response variable that we want to predict
# we use an 80/20 split
split_index = createDataPartition(imputed_data$Yield, p = 0.8, list = FALSE)
train_imputed = imputed_data[split_index, ]
test_imputed = imputed_data[-split_index, ]
# using 10-fold cross-validation
training_ctrl2 = trainControl(method = "cv", number = 10)
pls_model3 = train(
Yield ~ .,
data = train_imputed,
method = "pls",
trControl = training_ctrl2,
metric = "RMSE",
preProcess = c("center", "scale"),
tuneLength = 20
)
print(pls_model3)
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 130, 130, 131, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8578181 0.4494503 0.6314903
## 2 1.2022620 0.4393228 0.6792114
## 3 0.8275585 0.5354153 0.5721578
## 4 0.7394728 0.5796356 0.5516950
## 5 0.8036234 0.5337296 0.5735966
## 6 0.8534380 0.5166319 0.5875801
## 7 0.9681134 0.4934850 0.6236933
## 8 0.9726749 0.4972598 0.6289777
## 9 1.1419420 0.4838038 0.6732001
## 10 1.2250624 0.4803668 0.6936546
## 11 1.2629881 0.4579983 0.7090685
## 12 1.2283690 0.4685935 0.7024693
## 13 1.1821069 0.4878418 0.6932177
## 14 1.1652476 0.5076029 0.6822836
## 15 1.1146200 0.5160249 0.6568116
## 16 1.0736225 0.5199259 0.6528180
## 17 0.9884735 0.5325340 0.6269468
## 18 0.9297065 0.5452256 0.6103556
## 19 0.8963245 0.5472305 0.6053765
## 20 0.9313099 0.5239763 0.6175505
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
# get the optimal value
opti_value = pls_model3$bestTune
print(opti_value)
## ncomp
## 4 4
# test the PLS model against the test data to see how well it explains unseen data
preds3 = predict(pls_model3, newdata = test_imputed)
# postResample gives the RMSE, R squared and MAE values
postResample(preds3, test_imputed$Yield)
## RMSE Rsquared MAE
## 0.7479883 0.5205417 0.6282878
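Our PLS model explains about 52.1 percent of the variance in Yield when it is given new, unseen data from our test set. The RMSE shows that predictions are typically off by about 0.75 units, and the mean absolute error is 0.63. Because the k-nearest-neighbour pre-processing step centered and scaled every column, these units are standard deviations of Yield rather than percent yield. The test-set results are close to the cross-validated estimates (RMSE 0.74, R squared 0.58), so the model does not appear to be strongly over fit.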
(e). Try building other models discussed in this chapter. Do any have better predictive performance?
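We will build a linear regression model and a PCR model. PCR (principal component regression) combines principal component analysis with linear regression: the predictors are first reduced to principal components, and the response is then regressed on those components.
# Train a linear regression model; note that this call uses the full imputed_data (176 rows), not train_imputed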
lm_model2 = train(
Yield ~ ., # Use all predictors to predict Yield
data = imputed_data ,
method = "lm", # Linear regression
trControl = training_ctrl2
)
print(lm_model2)
## Linear Regression
##
## 176 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 158, 160, 158, 158, 159, 158, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.272693 0.429542 0.7440433
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm_model2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10264 -0.30357 -0.03165 0.25409 1.06336
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0018495 0.0438022 -0.042 0.96639
## BiologicalMaterial01 0.0455980 0.1323190 0.345 0.73100
## BiologicalMaterial02 -0.2437687 0.2784942 -0.875 0.38317
## BiologicalMaterial03 0.4958561 0.5220036 0.950 0.34408
## BiologicalMaterial04 -0.1169545 0.4908971 -0.238 0.81210
## BiologicalMaterial05 0.1647417 0.1047952 1.572 0.11860
## BiologicalMaterial06 -0.1866881 0.6194332 -0.301 0.76365
## BiologicalMaterial07 -0.0957142 0.0569371 -1.681 0.09538 .
## BiologicalMaterial08 0.2355963 0.2421458 0.973 0.33255
## BiologicalMaterial09 -0.2515950 0.3129362 -0.804 0.42301
## BiologicalMaterial10 0.0509659 0.4325772 0.118 0.90641
## BiologicalMaterial11 -0.2346396 0.2135479 -1.099 0.27409
## BiologicalMaterial12 0.1827764 0.2671196 0.684 0.49515
## ManufacturingProcess01 0.0369664 0.0919907 0.402 0.68852
## ManufacturingProcess02 -0.0004011 0.1859956 -0.002 0.99828
## ManufacturingProcess03 -0.0602545 0.0625694 -0.963 0.33750
## ManufacturingProcess04 0.2369374 0.0981557 2.414 0.01731 *
## ManufacturingProcess05 0.0082087 0.0638537 0.129 0.89793
## ManufacturingProcess06 0.0388170 0.0632274 0.614 0.54044
## ManufacturingProcess07 -0.0465431 0.0574361 -0.810 0.41936
## ManufacturingProcess08 -0.0265382 0.0688750 -0.385 0.70070
## ManufacturingProcess09 0.2622360 0.1524794 1.720 0.08807 .
## ManufacturingProcess10 -0.0185401 0.2355355 -0.079 0.93739
## ManufacturingProcess11 0.0545496 0.2834653 0.192 0.84773
## ManufacturingProcess12 0.1119268 0.1010813 1.107 0.27040
## ManufacturingProcess13 -0.1459152 0.2147261 -0.680 0.49811
## ManufacturingProcess14 0.0168105 0.3155612 0.053 0.95760
## ManufacturingProcess15 0.0814451 0.2935170 0.277 0.78189
## ManufacturingProcess16 0.0095529 0.0633091 0.151 0.88032
## ManufacturingProcess17 -0.0677198 0.2038923 -0.332 0.74037
## ManufacturingProcess18 0.7401067 0.8953599 0.827 0.41012
## ManufacturingProcess19 -0.0543006 0.1925060 -0.282 0.77838
## ManufacturingProcess20 -0.7227078 0.8911237 -0.811 0.41898
## ManufacturingProcess21 NA NA NA NA
## ManufacturingProcess22 -0.0150200 0.0758028 -0.198 0.84327
## ManufacturingProcess23 -0.0313061 0.0749360 -0.418 0.67687
## ManufacturingProcess24 -0.0671870 0.0734892 -0.914 0.36244
## ManufacturingProcess25 -0.9620548 2.8063833 -0.343 0.73235
## ManufacturingProcess26 1.3221724 2.6639842 0.496 0.62059
## ManufacturingProcess27 -1.5296797 1.4847028 -1.030 0.30496
## ManufacturingProcess28 -0.2453565 0.0866782 -2.831 0.00546 **
## ManufacturingProcess29 1.2251111 0.7222629 1.696 0.09246 .
## ManufacturingProcess30 -0.2056283 0.2980058 -0.690 0.49153
## ManufacturingProcess31 0.2241897 0.3568015 0.628 0.53099
## ManufacturingProcess32 0.8777672 0.1928887 4.551 1.3e-05 ***
## ManufacturingProcess33 -0.4766695 0.1701810 -2.801 0.00595 **
## ManufacturingProcess34 -0.0238571 0.0813978 -0.293 0.76996
## ManufacturingProcess35 -0.0936055 0.1002698 -0.934 0.35243
## ManufacturingProcess36 0.1241460 0.1428852 0.869 0.38668
## ManufacturingProcess37 -0.1694232 0.0703383 -2.409 0.01754 *
## ManufacturingProcess38 -0.0856827 0.0853147 -1.004 0.31727
## ManufacturingProcess39 0.0473406 0.1064271 0.445 0.65726
## ManufacturingProcess40 0.0324079 0.1366341 0.237 0.81292
## ManufacturingProcess41 -0.0268241 0.1390033 -0.193 0.84731
## ManufacturingProcess42 0.1170493 0.2201179 0.532 0.59589
## ManufacturingProcess43 0.0909620 0.0565802 1.608 0.11056
## ManufacturingProcess44 -0.0998864 0.2068479 -0.483 0.63006
## ManufacturingProcess45 0.2214974 0.1209522 1.831 0.06956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5636 on 119 degrees of freedom
## Multiple R-squared: 0.784, Adjusted R-squared: 0.6823
## F-statistic: 7.712 on 56 and 119 DF, p-value: < 2.2e-16
# test the lm model against the test data; because lm_model2 was fit on all of imputed_data, it has already seen these rows, so the metrics below are optimistic
lm_preds2 = predict(lm_model2, newdata = test_imputed)
# postResample gives the RMSE, R squared and MAE values
postResample(lm_preds2, test_imputed$Yield)
## RMSE Rsquared MAE
## 0.5307185 0.7661373 0.4356418
# PCR Model
pcr_model = train(
Yield ~ .,
data = train_imputed,
method = "pcr",
trControl = training_ctrl2,
preProcess = c("center", "scale"),
tuneLength = 20
)
print(pcr_model)
## Principal Component Analysis
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 131, 130, 128, 130, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.9599903 0.2783475 0.7346390
## 2 0.9796250 0.3028973 0.7280088
## 3 1.0893876 0.3297326 0.7483017
## 4 1.1702638 0.3901733 0.7346735
## 5 1.1612427 0.3930433 0.7262085
## 6 1.1716890 0.3950469 0.7369601
## 7 1.1626013 0.3893585 0.7401618
## 8 1.1500363 0.3922661 0.7328569
## 9 1.1337147 0.4046351 0.6963287
## 10 1.1068249 0.4535474 0.6590726
## 11 1.0752391 0.4748928 0.6491893
## 12 1.0986376 0.4756759 0.6588284
## 13 1.0114436 0.5039221 0.6164314
## 14 0.9930940 0.5078058 0.6184654
## 15 0.9682991 0.5132088 0.6188779
## 16 0.9677342 0.5095159 0.6192046
## 17 0.7726571 0.5263169 0.5734064
## 18 0.8175566 0.5056910 0.5981806
## 19 0.8545303 0.5012867 0.6075666
## 20 0.7606981 0.5085721 0.5821039
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
preds4 = predict(pcr_model, newdata = test_imputed)
postResample(preds4, test_imputed$Yield)
## RMSE Rsquared MAE
## 0.7632398 0.5044555 0.6508389
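The two steps inside PCR (principal component analysis, then a linear regression on the component scores) can be sketched in base R. This is an illustration on simulated data, not the code path used by `method = "pcr"`; the number of components is a hypothetical fixed choice, where caret would tune it by cross-validation.

```r
# Base R sketch of PCR: PCA on the predictors, then lm on the scores
set.seed(4)
n <- 80
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n, sd = 0.1),   # x1 and x2 are highly correlated
           x2 = z + rnorm(n, sd = 0.1),
           x3 = rnorm(n))
y <- 2 * z + rnorm(n, sd = 0.5)

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # the components ignore y entirely
n_comp <- 2                                      # hypothetical choice; normally tuned
scores <- as.data.frame(pca$x[, seq_len(n_comp), drop = FALSE])
pcr_fit <- lm(y ~ ., data = cbind(y = y, scores))

# Predict new rows: project onto the same components, then apply the linear model
X_new <- X[1:5, , drop = FALSE]
new_scores <- as.data.frame(predict(pca, newdata = X_new)[, seq_len(n_comp), drop = FALSE])
new_pred <- predict(pcr_fit, newdata = new_scores)
new_pred
```

PLS differs in the first step: its components are built using the response as well, which is one reason it can need fewer components than PCR.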
models2 = list(
"Pricinpal Component Analysis" = pcr_model,
"Linear Regression" = lm_model2,
"Partial Least Squares" = pls_model3
)
results2 = resamples(models2)
summary(results2)
##
## Call:
## summary.resamples(object = results2)
##
## Models: Pricinpal Component Analysis, Linear Regression, Partial Least Squares
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu.
## Pricinpal Component Analysis 0.4347878 0.4854884 0.5332387 0.5821039 0.5618521
## Linear Regression 0.3799587 0.5806012 0.6699280 0.7440433 0.8399554
## Partial Least Squares 0.3893792 0.4539323 0.5618073 0.5516950 0.6162924
## Max. NA's
## Pricinpal Component Analysis 0.9198492 0
## Linear Regression 1.4836498 0
## Partial Least Squares 0.7955932 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu.
## Pricinpal Component Analysis 0.4857683 0.5917400 0.6462086 0.7606981 0.7042514
## Linear Regression 0.4622790 0.7001441 0.8360196 1.2726931 1.6102967
## Partial Least Squares 0.4699549 0.5362325 0.6770921 0.7394728 0.7761977
## Max. NA's
## Pricinpal Component Analysis 1.606723 0
## Linear Regression 3.829480 0
## Partial Least Squares 1.561645 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## Pricinpal Component Analysis 0.13185694 0.4786513 0.5447869 0.5085721 0.5654926
## Linear Regression 0.03005533 0.2108320 0.5320927 0.4295420 0.5799920
## Partial Least Squares 0.29896486 0.4234135 0.6131606 0.5796356 0.7303353
## Max. NA's
## Pricinpal Component Analysis 0.8213090 0
## Linear Regression 0.8061203 0
## Partial Least Squares 0.7985753 0
bwplot(results2)
Comparing the three models (PCR, PLS and linear regression), the PLS model has the best resampled predictive performance. It has the lowest mean RMSE (0.739), meaning its predictions are likely to be closest to the actual values, the lowest mean MAE (0.552), and the highest mean R squared, 0.58 or 58 percent. PLS is a good fit here because of multicollinearity: the imputed data has many predictors, and when many of them are correlated, ordinary linear regression becomes unstable. PCR also reduces the dimension of the predictor matrix, but its components are chosen without looking at the response, and its optimal model (20 components) does not perform as well as PLS with 4 components. The resampled PCR results were RMSE 0.761, R squared 0.509, and MAE 0.582. Linear regression was the weakest in resampling (RMSE 1.273, R squared 0.430); its test-set R squared of 0.766 is not comparable, because that model was fit on the full data set, which includes the test rows.
We will look at the variable importance in both models to understand how the predictors affect the Yield variable.
# varImp returns the variable importance (chapter 7)
pls_importance = varImp(pls_model3)
print(pls_importance)
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 88.15
## ManufacturingProcess36 84.46
## ManufacturingProcess13 84.45
## ManufacturingProcess17 74.80
## BiologicalMaterial02 65.52
## BiologicalMaterial06 64.18
## BiologicalMaterial03 60.95
## BiologicalMaterial08 59.55
## ManufacturingProcess11 57.37
## ManufacturingProcess33 55.82
## BiologicalMaterial12 55.28
## ManufacturingProcess06 54.66
## ManufacturingProcess12 54.65
## BiologicalMaterial04 52.57
## BiologicalMaterial11 52.20
## BiologicalMaterial01 51.54
## ManufacturingProcess28 45.64
## ManufacturingProcess10 42.20
## ManufacturingProcess34 41.55
# top = 10 plots the 10 most important variables
plot(pls_importance, top = 10, main = "PLS Model Top 10 Important Variables")
In the PLS model the top three important variables are
ManufacturingProcess32, ManufacturingProcess09, and
ManufacturingProcess36. These variables have the strongest relationship
with our response variable, Yield. Importance scores do not show the
direction of a relationship, so the company should first check whether
each variable raises or lowers yield, and then adjust these
manufacturing process variables accordingly to boost revenue.
Understanding the impact of these predictors will help control the
measurements of the manufacturing process, such as the concentration or
temperature of the chemical in the process.
pcr_importance = varImp(pcr_model)
print(pcr_importance)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 96.87
## BiologicalMaterial06 94.60
## BiologicalMaterial03 89.36
## ManufacturingProcess09 84.60
## ManufacturingProcess36 79.90
## ManufacturingProcess17 77.46
## BiologicalMaterial12 77.01
## BiologicalMaterial02 75.16
## ManufacturingProcess31 69.11
## ManufacturingProcess06 60.42
## BiologicalMaterial04 54.04
## BiologicalMaterial08 51.97
## BiologicalMaterial11 49.86
## ManufacturingProcess11 49.69
## BiologicalMaterial01 45.39
## ManufacturingProcess33 44.10
## ManufacturingProcess29 40.28
## BiologicalMaterial09 38.34
## ManufacturingProcess30 35.98
plot(pcr_importance, top = 10, main = "PCR Model Top 10 Important Variables")
For the PCR model the top three variables are ManufacturingProcess32 (100.00), ManufacturingProcess13 (96.87), and BiologicalMaterial06 (94.60). The output header shows that these scores are loess R-squared values, which measure each predictor's individual relationship with Yield rather than anything specific to the PCR fit. ManufacturingProcess32 ranks first for both PLS and PCR, which suggests it plays a large role in the yield of the production process. Since the company cannot control the biological material, it can screen the quality of the raw material before processing.
Plot the most important variable, with geom_smooth(method = "lm") to show the line of best fit.
# Plotting ManufacturingProcess32, the most important variable in the PLS model, against Yield
ggplot(imputed_data, aes(x = ManufacturingProcess32, y = Yield)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Impact of Manufacturing Process 32 on the Batch Yield of Company X")
## `geom_smooth()` using formula = 'y ~ x'