In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

Exercises

library(MASS)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.3
library(stats)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
## 
##     R2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
Question 6.2.

Developing a model to predict permeability (see Sect.1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a.) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

b.) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
dim(fingerprints) # check the dimensions of the fingerprint data
## [1]  165 1107
# dim() returns the dimensions of the matrix

This matrix has 165 rows (compounds) and 1,107 columns (binary predictors).

near_zero = nearZeroVar(fingerprints)

This function flags the predictors with near-zero variance. A near-zero-variance predictor takes essentially the same value for almost every compound, for example a column of all 1s like [1, 1, 1, 1, 1] or all 0s like [0, 0, 0, 0, 0].
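To make this concrete, here is a minimal sketch on a made-up toy matrix, assuming caret's default freqCut and uniqueCut settings:

# A toy example (made-up data) of what nearZeroVar flags
toy = cbind(constant = rep(1, 25),                    # zero variance: always flagged
            rare     = c(1, rep(0, 24)),              # 24 zeros to one 1: near-zero variance
            balanced = rep(c(0, 1), length.out = 25)) # roughly balanced: kept
nearZeroVar(toy) # returns the column indices of 'constant' and 'rare': 1 2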

# drop the near-zero-variance predictors by excluding the columns indexed in near_zero
filtered_fingerprints  =  fingerprints[, -near_zero]

filtered_fingerprints is now a data set containing only the predictors whose variance is not near zero.

dim(filtered_fingerprints)
## [1] 165 388

After dropping the predictors with near-zero variance we get a new matrix with 165 rows and 388 columns, so 388 predictors are left for modeling.

c.) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(312) 

Our filtered fingerprint matrix holds the predictors, also called the x/independent variables; permeability is the response, also called the y/dependent variable. This is the quantity we want the model to explain: how readily a molecule permeates a membrane.

# splitting the data with the createDataPartition function, as discussed in chapter 4
# we will split into 80 percent for training and 20 percent for testing
train_index  =  createDataPartition(permeability, p = 0.80, list = FALSE)
# list = FALSE returns a vector of row indices, which makes subsetting easier and avoids creating a list
X_train  =  filtered_fingerprints[train_index, ]
# subset the filtered fingerprint matrix to the training rows
X_test   =  filtered_fingerprints[-train_index, ]
# the test set takes the rows that are NOT in the training index
# the training and test sets must not overlap, otherwise the test results would be biased
y_train =  permeability[train_index]
# the response variable values matching the training rows
y_test   = permeability[-train_index]
# the response variable values NOT in the training rows

The 80/20 split is a common choice for training and testing: 80 percent gives the model enough data to learn the patterns, and the remaining 20 percent is used to evaluate how well the model performs on data it has never seen.
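As a quick sanity check on the split, a sketch like this confirms nothing was lost or duplicated (exact counts depend on the seed):

nrow(X_train) + nrow(X_test) == nrow(filtered_fingerprints) # should be TRUE
c(train = nrow(X_train), test = nrow(X_test))               # roughly 133 and 32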

Next we cross-validate and train a PLS model on the training data (notes from chapters 4 and 6).

 # 10-fold cross-validation training control parameter
pls_ctrl =  trainControl(method = "cv", number = 10)

10-fold cross-validation splits the training data into 10 parts, trains on 9 folds, and tests on the held-out fold. The process repeats ten times so that each fold serves as the test fold once, which gives a more stable performance estimate and helps guard against overfitting.
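Under the hood, caret builds folds much like caret's createFolds function does; a small sketch:

folds = createFolds(y_train, k = 10) # a list of 10 held-out index sets
sapply(folds, length)                # each fold holds roughly 13 of the 133 training rows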

pls_model  =  train(
  x = X_train,
  y = y_train,
  method = "pls",  # using the partial least squares 
  preProcess = c("center", "scale"),  # apply the preprocess before we train the data
  tuneLength = 10, # we will try 10 different number of components 
  # this will capture latent variables, hidden data that influence the data
  trControl = pls_ctrl, # 10 fold cross validation 
  metric = "Rsquared" # we will chose the model based on best r  square
)
print(pls_model)
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 121, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     13.04093  0.3103755  9.939871
##    2     12.22827  0.4122874  8.705134
##    3     12.40015  0.4195067  9.143409
##    4     12.69684  0.4045341  9.540222
##    5     12.86993  0.4226621  9.579177
##    6     12.70483  0.4358627  9.580128
##    7     12.56317  0.4375749  9.495606
##    8     12.24705  0.4380541  9.520556
##    9     12.22510  0.4433568  9.367976
##   10     12.44215  0.4398578  9.481070
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
plot(pls_model)

opti_value1 = pls_model$bestTune
print(opti_value1 )
##   ncomp
## 9     9

Looking at our results, the optimal model has 9 latent components. With 9 components we get the lowest resampled RMSE (root mean squared error) of about 12.23, which measures how far the predictions fall from the observed response on average; an R-squared of about 0.443, meaning roughly 44.3 percent of the variance in permeability is explained; and an MAE (mean absolute error) of about 9.37, the average absolute difference between the predicted and actual values.

We still have to check how the model performs on data it has never seen. A model can effectively memorize the training data, which causes overfitting: it fails to generalize and gives low accuracy on new data.

d.) Predict the response for the test set. What is the test set estimate of R2? The held-out predictors are used to check how well the model generalizes.
pls_preds =  predict(pls_model, newdata = X_test)
# postResample compares the model's predictions to the observed response values
# so we can see how well the model predicts the target
postResample(pls_preds, y_test) 
##     RMSE Rsquared      MAE 
## 9.933804 0.642241 7.939443

The test-set R-squared is about 0.642, so the model explains roughly 64.2 percent of the variance in permeability on data it never saw, noticeably better than the resampled estimate of 0.443. Even so, about a third of the variance in the response remains unexplained, so the model can still be improved.
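As a check, caret's postResample computes R-squared as the squared correlation between the predictions and the observed values, so the figure above can be reproduced by hand (a sketch):

cor(pls_preds, y_test)^2 # should match the Rsquared value above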

e.) Try building other models discussed in this chapter. Do any have better predictive performance? We will build a linear regression model and a ridge regression model.
training_ctrl = trainControl(method = "cv", number = 10)
linear_model  =  train(
  x = X_train, 
  y = y_train, 
  method = "lm", # linear regression 
  preProcess = c("center", "scale"), 
  trControl =  training_ctrl
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
ridge_model =  train(
  x = X_train, 
  y = y_train, 
  method = "glmnet", 
  preProcess = c("center", "scale"), 
  # lambda controls how strong the penalty is for large coefficients
  # tuneGrid tries 10 different lambda values between 0.0001 and 1
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)), 
  trControl = training_ctrl
)

A lambda close to zero gives a model similar to ordinary linear regression, which simply minimizes the sum of squared errors and can overfit; a lambda that is too large over-penalizes the coefficients and causes underfitting. Remember that ridge regression shrinks coefficients toward zero but never exactly to zero.
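A sketch of this shrinkage behavior, using glmnet's coefficient-path plot on the same training data (illustrative only):

# alpha = 0 gives pure ridge; every coefficient shrinks toward
# (but never exactly reaches) zero as lambda grows
ridge_path = glmnet(X_train, y_train, alpha = 0)
plot(ridge_path, xvar = "lambda") # coefficient paths against log(lambda)

With all three models trained, we put them into a list so we can compare their resampled performance.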

models  = list(
  "PLS" = pls_model,
  "Linear Regression" = linear_model,
  "Ridge Regression" = ridge_model
)
# the resamples function collects the cross-validation results so we can compare the models
results  =  resamples(models)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: PLS, Linear Regression, Ridge Regression 
## Number of resamples: 10 
## 
## MAE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## PLS                5.062580  8.854054  9.616833  9.367976 10.606981 11.98775
## Linear Regression 10.166223 15.956758 21.732021 25.374879 33.940273 46.27533
## Ridge Regression   6.389101  8.204465  8.506049  8.668495  9.428291 11.30136
##                   NA's
## PLS                  0
## Linear Regression    0
## Ridge Regression     0
## 
## RMSE 
##                        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS                7.440460 11.87682 12.56390 12.22510 13.84332 15.36944    0
## Linear Regression 12.979289 23.03389 32.33058 37.43822 44.73796 75.89702    0
## Ridge Regression   7.937061 10.32177 11.31130 11.81360 13.08203 16.98667    0
## 
## Rsquared 
##                           Min.    1st Qu.     Median      Mean   3rd Qu.
## PLS               0.0988254928 0.36917912 0.42577892 0.4433568 0.5036574
## Linear Regression 0.0002819909 0.06131537 0.07648197 0.1462571 0.1737759
## Ridge Regression  0.1361200693 0.36065813 0.43165539 0.4788606 0.6115515
##                        Max. NA's
## PLS               0.7377751    0
## Linear Regression 0.6465223    0
## Ridge Regression  0.8107929    0
print(models)
## $PLS
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 121, 120, 118, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     13.04093  0.3103755  9.939871
##    2     12.22827  0.4122874  8.705134
##    3     12.40015  0.4195067  9.143409
##    4     12.69684  0.4045341  9.540222
##    5     12.86993  0.4226621  9.579177
##    6     12.70483  0.4358627  9.580128
##    7     12.56317  0.4375749  9.495606
##    8     12.24705  0.4380541  9.520556
##    9     12.22510  0.4433568  9.367976
##   10     12.44215  0.4398578  9.481070
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 9.
## 
## $`Linear Regression`
## Linear Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 119, 118, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   37.43822  0.1462571  25.37488
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## $`Ridge Regression`
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 118, 120, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE     Rsquared   MAE     
##   0.0001  11.8136  0.4788606  8.668495
##   0.1112  11.8136  0.4788606  8.668495
##   0.2223  11.8136  0.4788606  8.668495
##   0.3334  11.8136  0.4788606  8.668495
##   0.4445  11.8136  0.4788606  8.668495
##   0.5556  11.8136  0.4788606  8.668495
##   0.6667  11.8136  0.4788606  8.668495
##   0.7778  11.8136  0.4788606  8.668495
##   0.8889  11.8136  0.4788606  8.668495
##   1.0000  11.8136  0.4788606  8.668495
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
f.) Would you recommend any of your models to replace the permeability laboratory experiment?
bwplot(results)

Looking at the box plots, both the PLS and ridge regression models show lower MAE and RMSE and higher R-squared values than the linear regression model. The mean resampled R-squared is about 0.443 for PLS, 0.146 for linear regression, and 0.479 for ridge regression. PLS and ridge regression therefore have a much better chance of explaining the variation in permeability; still, with resampled R-squared values below 0.5, neither model seems accurate enough yet to fully replace the permeability laboratory experiment, though ridge regression comes closest.
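The mean resampled values quoted above can be pulled straight from the resamples summary; a quick sketch:

# mean resampled R-squared for each model
summary(results)$statistics$Rsquared[, "Mean"]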

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
help(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b.) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect.3.8).

Imputation replaces the missing cells with estimated values so the full data set can be used for modeling.
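Before imputing, it helps to confirm how much is actually missing; a quick sketch:

sum(is.na(ChemicalManufacturingProcess))        # total number of missing cells
mean(is.na(ChemicalManufacturingProcess)) * 100 # as a percent of all cells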

# pre-process the data with k-nearest neighbours to fill in the missing values
preprocessed_data  = preProcess(ChemicalManufacturingProcess, method = "knnImpute") # knnImpute uses the k-nearest-neighbour approach to impute the data (and also centers and scales every variable)
# the predict function applies the imputation to the data
imputed_data =  predict(preprocessed_data, newdata = ChemicalManufacturingProcess)
sum(is.na(imputed_data))
## [1] 0
dim(imputed_data)
## [1] 176  58
c.) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
set.seed(909)
# Yield is the response variable that we want to predict
# we use a standard 80/20 split
split_index =  createDataPartition(imputed_data$Yield, p = 0.8, list = FALSE)
train_imputed  =  imputed_data[split_index, ]
test_imputed  =  imputed_data[-split_index, ]
# using 10-fold cross-validation
training_ctrl2  =  trainControl(method = "cv", number = 10)
pls_model3  = train(
  Yield ~ ., 
  data = train_imputed,
  method = "pls",
  trControl = training_ctrl2,
  metric = "RMSE",
  preProcess = c("center", "scale"),
  tuneLength = 20
)
print(pls_model3)
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 130, 130, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8578181  0.4494503  0.6314903
##    2     1.2022620  0.4393228  0.6792114
##    3     0.8275585  0.5354153  0.5721578
##    4     0.7394728  0.5796356  0.5516950
##    5     0.8036234  0.5337296  0.5735966
##    6     0.8534380  0.5166319  0.5875801
##    7     0.9681134  0.4934850  0.6236933
##    8     0.9726749  0.4972598  0.6289777
##    9     1.1419420  0.4838038  0.6732001
##   10     1.2250624  0.4803668  0.6936546
##   11     1.2629881  0.4579983  0.7090685
##   12     1.2283690  0.4685935  0.7024693
##   13     1.1821069  0.4878418  0.6932177
##   14     1.1652476  0.5076029  0.6822836
##   15     1.1146200  0.5160249  0.6568116
##   16     1.0736225  0.5199259  0.6528180
##   17     0.9884735  0.5325340  0.6269468
##   18     0.9297065  0.5452256  0.6103556
##   19     0.8963245  0.5472305  0.6053765
##   20     0.9313099  0.5239763  0.6175505
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
# get the optimal value
opti_value = pls_model3$bestTune
print(opti_value)
##   ncomp
## 4     4
d.) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# test the PLS model against the held-out test data to see how well it generalizes
preds3 =  predict(pls_model3, newdata = test_imputed)
# postResample reports the RMSE, R-squared, and MAE
postResample(preds3, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 0.7479883 0.5205417 0.6282878

Our PLS model explains about 52 percent of the variance in Yield on the unseen test data (R-squared = 0.521), reasonably close to the resampled training estimate of 0.580. The test RMSE of about 0.748 is nearly identical to the resampled RMSE of 0.739, which suggests the model is not overfitting, and the MAE of about 0.628 is the average absolute error in the predictions. Note that these values are in standardized units, since the knnImpute pre-processing centered and scaled the data, including Yield.
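To put the two sets of numbers side by side, here is a small sketch that pulls the resampled row for the chosen number of components and binds it to the test metrics:

# resampled (cross-validation) vs. test-set performance for the final PLS model
resampled = pls_model3$results[pls_model3$results$ncomp == pls_model3$bestTune$ncomp,
                               c("RMSE", "Rsquared", "MAE")]
rbind(resampled = unlist(resampled),
      test      = postResample(preds3, test_imputed$Yield))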

e.) Try building other models discussed in this chapter. Do any have better predictive performance?

We will build a linear regression model and a principal component regression (PCR) model. PCR combines principal component analysis with linear regression: it first compresses the predictors into components, then fits a linear model on those components.

# Train a linear regression model
lm_model2  =  train(
  Yield ~ .,  # Use all predictors to predict Yield
  data = imputed_data,  # note: this fits on the full data set (including the test rows), so its test-set metrics below are optimistic
  method = "lm",  # Linear regression
  trControl = training_ctrl2
)
print(lm_model2)
## Linear Regression 
## 
## 176 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 158, 160, 158, 158, 159, 158, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE      
##   1.272693  0.429542  0.7440433
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm_model2)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10264 -0.30357 -0.03165  0.25409  1.06336 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.0018495  0.0438022  -0.042  0.96639    
## BiologicalMaterial01    0.0455980  0.1323190   0.345  0.73100    
## BiologicalMaterial02   -0.2437687  0.2784942  -0.875  0.38317    
## BiologicalMaterial03    0.4958561  0.5220036   0.950  0.34408    
## BiologicalMaterial04   -0.1169545  0.4908971  -0.238  0.81210    
## BiologicalMaterial05    0.1647417  0.1047952   1.572  0.11860    
## BiologicalMaterial06   -0.1866881  0.6194332  -0.301  0.76365    
## BiologicalMaterial07   -0.0957142  0.0569371  -1.681  0.09538 .  
## BiologicalMaterial08    0.2355963  0.2421458   0.973  0.33255    
## BiologicalMaterial09   -0.2515950  0.3129362  -0.804  0.42301    
## BiologicalMaterial10    0.0509659  0.4325772   0.118  0.90641    
## BiologicalMaterial11   -0.2346396  0.2135479  -1.099  0.27409    
## BiologicalMaterial12    0.1827764  0.2671196   0.684  0.49515    
## ManufacturingProcess01  0.0369664  0.0919907   0.402  0.68852    
## ManufacturingProcess02 -0.0004011  0.1859956  -0.002  0.99828    
## ManufacturingProcess03 -0.0602545  0.0625694  -0.963  0.33750    
## ManufacturingProcess04  0.2369374  0.0981557   2.414  0.01731 *  
## ManufacturingProcess05  0.0082087  0.0638537   0.129  0.89793    
## ManufacturingProcess06  0.0388170  0.0632274   0.614  0.54044    
## ManufacturingProcess07 -0.0465431  0.0574361  -0.810  0.41936    
## ManufacturingProcess08 -0.0265382  0.0688750  -0.385  0.70070    
## ManufacturingProcess09  0.2622360  0.1524794   1.720  0.08807 .  
## ManufacturingProcess10 -0.0185401  0.2355355  -0.079  0.93739    
## ManufacturingProcess11  0.0545496  0.2834653   0.192  0.84773    
## ManufacturingProcess12  0.1119268  0.1010813   1.107  0.27040    
## ManufacturingProcess13 -0.1459152  0.2147261  -0.680  0.49811    
## ManufacturingProcess14  0.0168105  0.3155612   0.053  0.95760    
## ManufacturingProcess15  0.0814451  0.2935170   0.277  0.78189    
## ManufacturingProcess16  0.0095529  0.0633091   0.151  0.88032    
## ManufacturingProcess17 -0.0677198  0.2038923  -0.332  0.74037    
## ManufacturingProcess18  0.7401067  0.8953599   0.827  0.41012    
## ManufacturingProcess19 -0.0543006  0.1925060  -0.282  0.77838    
## ManufacturingProcess20 -0.7227078  0.8911237  -0.811  0.41898    
## ManufacturingProcess21         NA         NA      NA       NA    
## ManufacturingProcess22 -0.0150200  0.0758028  -0.198  0.84327    
## ManufacturingProcess23 -0.0313061  0.0749360  -0.418  0.67687    
## ManufacturingProcess24 -0.0671870  0.0734892  -0.914  0.36244    
## ManufacturingProcess25 -0.9620548  2.8063833  -0.343  0.73235    
## ManufacturingProcess26  1.3221724  2.6639842   0.496  0.62059    
## ManufacturingProcess27 -1.5296797  1.4847028  -1.030  0.30496    
## ManufacturingProcess28 -0.2453565  0.0866782  -2.831  0.00546 ** 
## ManufacturingProcess29  1.2251111  0.7222629   1.696  0.09246 .  
## ManufacturingProcess30 -0.2056283  0.2980058  -0.690  0.49153    
## ManufacturingProcess31  0.2241897  0.3568015   0.628  0.53099    
## ManufacturingProcess32  0.8777672  0.1928887   4.551  1.3e-05 ***
## ManufacturingProcess33 -0.4766695  0.1701810  -2.801  0.00595 ** 
## ManufacturingProcess34 -0.0238571  0.0813978  -0.293  0.76996    
## ManufacturingProcess35 -0.0936055  0.1002698  -0.934  0.35243    
## ManufacturingProcess36  0.1241460  0.1428852   0.869  0.38668    
## ManufacturingProcess37 -0.1694232  0.0703383  -2.409  0.01754 *  
## ManufacturingProcess38 -0.0856827  0.0853147  -1.004  0.31727    
## ManufacturingProcess39  0.0473406  0.1064271   0.445  0.65726    
## ManufacturingProcess40  0.0324079  0.1366341   0.237  0.81292    
## ManufacturingProcess41 -0.0268241  0.1390033  -0.193  0.84731    
## ManufacturingProcess42  0.1170493  0.2201179   0.532  0.59589    
## ManufacturingProcess43  0.0909620  0.0565802   1.608  0.11056    
## ManufacturingProcess44 -0.0998864  0.2068479  -0.483  0.63006    
## ManufacturingProcess45  0.2214974  0.1209522   1.831  0.06956 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5636 on 119 degrees of freedom
## Multiple R-squared:  0.784,  Adjusted R-squared:  0.6823 
## F-statistic: 7.712 on 56 and 119 DF,  p-value: < 2.2e-16
# test the lm model against the test data to see how well it explains the data
lm_preds2  =  predict(lm_model2, newdata = test_imputed)
# postResample reports the RMSE, R-squared, and MAE
postResample(lm_preds2, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 0.5307185 0.7661373 0.4356418
# PCR Model 
pcr_model  =  train(
  Yield ~ ., 
  data = train_imputed,
  method = "pcr",
  trControl = training_ctrl2,
  preProcess = c("center", "scale"),
  tuneLength = 20
)
print(pcr_model)
## Principal Component Analysis 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 131, 130, 128, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.9599903  0.2783475  0.7346390
##    2     0.9796250  0.3028973  0.7280088
##    3     1.0893876  0.3297326  0.7483017
##    4     1.1702638  0.3901733  0.7346735
##    5     1.1612427  0.3930433  0.7262085
##    6     1.1716890  0.3950469  0.7369601
##    7     1.1626013  0.3893585  0.7401618
##    8     1.1500363  0.3922661  0.7328569
##    9     1.1337147  0.4046351  0.6963287
##   10     1.1068249  0.4535474  0.6590726
##   11     1.0752391  0.4748928  0.6491893
##   12     1.0986376  0.4756759  0.6588284
##   13     1.0114436  0.5039221  0.6164314
##   14     0.9930940  0.5078058  0.6184654
##   15     0.9682991  0.5132088  0.6188779
##   16     0.9677342  0.5095159  0.6192046
##   17     0.7726571  0.5263169  0.5734064
##   18     0.8175566  0.5056910  0.5981806
##   19     0.8545303  0.5012867  0.6075666
##   20     0.7606981  0.5085721  0.5821039
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
preds4  =  predict(pcr_model, newdata = test_imputed)
postResample(preds4, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 0.7632398 0.5044555 0.6508389
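PCR is essentially PCA followed by ordinary least squares on the leading scores. A minimal by-hand sketch of the same idea (illustrative only; caret additionally tunes the number of components by cross-validation):

# PCA on the predictors, then a linear model on the first 20 component scores
pcs = prcomp(train_imputed[, names(train_imputed) != "Yield"], center = TRUE, scale. = TRUE)
pcr_by_hand = lm(train_imputed$Yield ~ pcs$x[, 1:20])
summary(pcr_by_hand)$r.squared # in-sample fit of the 20-component model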
models2  = list(
  "Pricinpal Component Analysis" = pcr_model,
  "Linear Regression" = lm_model2,
  "Partial Least Squares" = pls_model3
)
results2  =  resamples(models2)
summary(results2)
## 
## Call:
## summary.resamples(object = results2)
## 
## Models: Principal Component Analysis, Linear Regression, Partial Least Squares 
## Number of resamples: 10 
## 
## MAE 
##                                   Min.   1st Qu.    Median      Mean   3rd Qu.
## Principal Component Analysis 0.4347878 0.4854884 0.5332387 0.5821039 0.5618521
## Linear Regression            0.3799587 0.5806012 0.6699280 0.7440433 0.8399554
## Partial Least Squares        0.3893792 0.4539323 0.5618073 0.5516950 0.6162924
##                                   Max. NA's
## Principal Component Analysis 0.9198492    0
## Linear Regression            1.4836498    0
## Partial Least Squares        0.7955932    0
## 
## RMSE 
##                                   Min.   1st Qu.    Median      Mean   3rd Qu.
## Principal Component Analysis 0.4857683 0.5917400 0.6462086 0.7606981 0.7042514
## Linear Regression            0.4622790 0.7001441 0.8360196 1.2726931 1.6102967
## Partial Least Squares        0.4699549 0.5362325 0.6770921 0.7394728 0.7761977
##                                  Max. NA's
## Principal Component Analysis 1.606723    0
## Linear Regression            3.829480    0
## Partial Least Squares        1.561645    0
## 
## Rsquared 
##                                    Min.   1st Qu.    Median      Mean   3rd Qu.
## Principal Component Analysis 0.13185694 0.4786513 0.5447869 0.5085721 0.5654926
## Linear Regression            0.03005533 0.2108320 0.5320927 0.4295420 0.5799920
## Partial Least Squares        0.29896486 0.4234135 0.6131606 0.5796356 0.7303353
##                                   Max. NA's
## Principal Component Analysis 0.8213090    0
## Linear Regression            0.8061203    0
## Partial Least Squares        0.7985753    0
bwplot(results2)

Comparing all three models (PCR, PLS, and the linear model), the PLS model has the best resampled predictive performance. It has the lowest mean RMSE (0.739), meaning its predictions land closest to the actual values, a low MAE, and the highest mean R-squared (0.580, about 58 percent). PLS is a good fit here because of multicollinearity: the imputed data has many predictors, and when many of them are related it becomes difficult for ordinary regression to estimate their effects. PCR also does a good job of reducing the dimension of the predictor matrix, but its optimal model (20 components) does not perform as well as PLS, with a resampled RMSE of 0.761, R-squared of 0.509, and MAE of 0.582. The linear regression model looks strong on the test set, but recall that it was fit on the full data set, so its test metrics are inflated.
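For completeness, the held-out test metrics of all three models can be collected in one table (a sketch; all of the prediction objects were created above):

rbind(PLS = postResample(preds3,    test_imputed$Yield),
      PCR = postResample(preds4,    test_imputed$Yield),
      LM  = postResample(lm_preds2, test_imputed$Yield))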

f.) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

We will look at the variable importance in both models to understand which predictors most affect the Yield variable.

# varImp returns the variable importance (chapter 7)
pls_importance  =  varImp(pls_model3)
print(pls_importance)
## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   88.15
## ManufacturingProcess36   84.46
## ManufacturingProcess13   84.45
## ManufacturingProcess17   74.80
## BiologicalMaterial02     65.52
## BiologicalMaterial06     64.18
## BiologicalMaterial03     60.95
## BiologicalMaterial08     59.55
## ManufacturingProcess11   57.37
## ManufacturingProcess33   55.82
## BiologicalMaterial12     55.28
## ManufacturingProcess06   54.66
## ManufacturingProcess12   54.65
## BiologicalMaterial04     52.57
## BiologicalMaterial11     52.20
## BiologicalMaterial01     51.54
## ManufacturingProcess28   45.64
## ManufacturingProcess10   42.20
## ManufacturingProcess34   41.55
# top = 10 plots the 10 most important variables
plot(pls_importance, top = 10, main = "PLS Model Top 10 Important Variables ")

In the PLS model the top three most important variables are ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess36, which shows a strong association between these predictors and the Yield response. Because process predictors can be changed, the company could adjust these steps, such as the concentration or temperature of the chemicals in the process, to try to boost yield and, in turn, revenue.
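varImp ranks the strength of each association but not its direction, so a quick correlation check (a sketch) shows which way each top predictor moves with yield:

top3 = c("ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess36")
cor(imputed_data[, top3], imputed_data$Yield) # a positive sign means the predictor rises and falls with Yield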

pcr_importance  =  varImp(pcr_model)
print(pcr_importance)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   96.87
## BiologicalMaterial06     94.60
## BiologicalMaterial03     89.36
## ManufacturingProcess09   84.60
## ManufacturingProcess36   79.90
## ManufacturingProcess17   77.46
## BiologicalMaterial12     77.01
## BiologicalMaterial02     75.16
## ManufacturingProcess31   69.11
## ManufacturingProcess06   60.42
## BiologicalMaterial04     54.04
## BiologicalMaterial08     51.97
## BiologicalMaterial11     49.86
## ManufacturingProcess11   49.69
## BiologicalMaterial01     45.39
## ManufacturingProcess33   44.10
## ManufacturingProcess29   40.28
## BiologicalMaterial09     38.34
## ManufacturingProcess30   35.98
plot(pcr_importance, top = 10, main = "PCR Model Top 10 Important Variables")

In the PCR model the top three predictors are ManufacturingProcess32 (100.0), ManufacturingProcess13 (96.9), and BiologicalMaterial06. ManufacturingProcess32 has the strongest influence in both the PLS and PCR models, meaning it plays a large role in producing the yield the company is looking for. Since the biological material cannot be changed, the company can instead screen the quality of the raw material before processing and use those measurements to decide how each batch is run.

Plot the most important variable, using geom_smooth(method = "lm") to show the line of best fit.

# Plotting ManufacturingProcess32, the most important variable from the PLS top-ten list
ggplot(imputed_data, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = " Impct of Manufacturing Process 32  on the Batch Yield of Company X")
## `geom_smooth()` using formula = 'y ~ x'
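The same plot idea applies to the top biological predictor. A sketch for BiologicalMaterial06, which ranked high in both importance lists, supports the raw-material screening point above:

# Plotting the top biological predictor against Yield
ggplot(imputed_data, aes(x = BiologicalMaterial06, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Impact of Biological Material 06 on the Batch Yield of Company X")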