All libraries needed for the Homework

library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## -- Attaching packages -------------------------------------------- fpp3 1.0.0 --
## v tibble      3.2.1     v tsibble     1.1.5
## v dplyr       1.1.2     v tsibbledata 0.4.1
## v tidyr       1.3.0     v feasts      0.3.2
## v lubridate   1.9.2     v fable       0.3.4
## v ggplot2     3.5.1     v fabletools  0.4.2
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v forcats 1.0.0     v readr   2.1.4
## v purrr   1.0.1     v stringr 1.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag()        masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(tsibble)
library(pracma)
## Warning: package 'pracma' was built under R version 4.3.3
## 
## Attaching package: 'pracma'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:pracma':
## 
##     sigmoid
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate
library(psych)
## Warning: package 'psych' was built under R version 4.3.1
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:pracma':
## 
##     logit, polar
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(imputeTS)
## Warning: package 'imputeTS' was built under R version 4.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.1
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE

6.2 - Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a). Start R and use these commands to load the data: library(AppliedPredictiveModeling) data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

(b). The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

(c). Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

(d). Predict the response for the test set. What is the test set estimate of R2?

(e). Try building other models discussed in this chapter. Do any have better predictive performance?

(f). Would you recommend any of your models to replace the permeability laboratory experiment?

(a). Start R and use these commands to load the data: library(AppliedPredictiveModeling) data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

#using the  AppliedPredictiveModeling library
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
#Permeability
data(permeability)
(b). The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
#retrieving total number of predictors
dim(fingerprints)
## [1]  165 1107
#number of predictors left for modeling
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints)
## [1] 165 388

The data started with 1,107 predictors. Of these, nearZeroVar flagged 719 as having low frequencies, so 388 predictors are left for modeling.
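
To make the filtering step more concrete, here is a rough base-R sketch of the kind of check nearZeroVar performs. It assumes caret's documented default cutoffs (freqCut = 95/5 and uniqueCut = 10); the toy matrix and the helper names freq_ratio and pct_unique are hypothetical and are not part of caret.

```r
# Toy binary "fingerprint" matrix (hypothetical data, not the permeability set)
set.seed(1)
x <- cbind(
  common = rbinom(100, 1, 0.5),   # balanced 0/1 column
  rare   = c(1, rep(0, 99)),      # substructure present in one molecule only
  const  = rep(0, 100)            # substructure never present
)

# Frequency ratio: count of the most common value over the second most common
freq_ratio <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)   # a constant column has no second value
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values relative to the number of rows
pct_unique <- function(v) 100 * length(unique(v)) / length(v)

ratios <- apply(x, 2, freq_ratio)
uniq   <- apply(x, 2, pct_unique)

# Assumed cutoffs, mirroring caret's defaults: freqCut = 95/5, uniqueCut = 10
flagged <- which(ratios > 95 / 5 & uniq < 10)

# Drop flagged columns only when there are any:
# x[, -integer(0)] would select zero columns instead of keeping everything
x_kept <- if (length(flagged) > 0) x[, -flagged, drop = FALSE] else x
colnames(x_kept)
```

The guard on length(flagged) matters when no column is flagged, because negative indexing with an empty vector returns an empty matrix.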

(c). Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
# setting the seed
set.seed(9865)

# Training index: about 80% of the rows are used for training.
index <- createDataPartition(permeability, p = .8, list = FALSE)
train_permeability <- permeability[index, ]
train_fingerprints <- fingerprints[index, ]

# Test set: the remaining 20% of the rows are held out for testing.
test_permeability <- permeability[-index, ]
test_fingerprints <- fingerprints[-index, ]

# Cross-validation to assess the performance of the model on held-out folds.
# Note: number = 11 gives 11-fold cross-validation (see the output below), not the usual 10-fold.
cross_fold <- trainControl(method = "cv", number = 11)

pls_model <- train(train_fingerprints, train_permeability, method = "pls", metric = "Rsquared",
             tuneLength = 20, trControl = cross_fold, preProc = c("center", "scale"))


#retrieve results from the model
pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (11 fold) 
## Summary of sample sizes: 121, 121, 120, 121, 121, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.80059  0.3968619  9.829903
##    2     11.28795  0.5286479  7.746570
##    3     10.93059  0.5286472  8.081095
##    4     11.16608  0.5075637  8.244737
##    5     10.95860  0.5268477  7.805466
##    6     11.17479  0.5002639  7.987233
##    7     11.04093  0.5062815  8.098310
##    8     10.97525  0.5101272  8.062825
##    9     10.72349  0.5267382  7.942325
##   10     10.70995  0.5330382  7.699028
##   11     10.58912  0.5463168  7.671060
##   12     10.46075  0.5593953  7.654298
##   13     10.38991  0.5699466  7.695095
##   14     10.42986  0.5692514  7.697504
##   15     10.56412  0.5599195  7.822036
##   16     10.85302  0.5476701  8.051517
##   17     11.35347  0.5260383  8.331617
##   18     11.81763  0.5058061  8.511071
##   19     12.22391  0.4892186  8.814179
##   20     12.61268  0.4693128  9.097800
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 13.
#plotting the PLS model
plot(pls_model) 

From the results above, the optimal number of latent variables is 13 and the corresponding resampled estimate of R2 is 0.5699.
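
The "Summary of sample sizes" line in the output (121, 121, 120, ...) follows from splitting 133 training rows into 11 folds. The sketch below reproduces those sizes with a plain random fold assignment in base R. This is a simplification: caret builds its folds with some stratification on the outcome, so the actual row assignments will differ, and the variable names here are my own.

```r
set.seed(9865)
n <- 133   # training rows reported in the PLS output
k <- 11    # number of folds set in trainControl(number = 11)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the model is fit on the remaining rows
held_out_sizes <- as.integer(table(fold_id))
train_sizes <- n - held_out_sizes

sort(unique(train_sizes))   # 120 121
```

Because 133 = 11 * 12 + 1, one fold holds out 13 rows and the other ten hold out 12, which gives training sizes of 120 and 121.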

(d). Predict the response for the test set. What is the test set estimate of R2?
#predicting the response for the test set
fingerprints_predict <- predict(pls_model, test_fingerprints)
#getting test set estimate for R2
postResample(fingerprints_predict , test_permeability)
##       RMSE   Rsquared        MAE 
## 15.2527094  0.2226841 12.2233882

The test set estimate of R2 is 0.223, well below the resampled estimate of 0.570 obtained during tuning.
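
For reference, the three numbers returned by postResample for a regression problem can be computed by hand. The sketch below uses made-up observed and predicted values, not the permeability data, and assumes caret's default definition of R2 as the squared correlation between predictions and observations.

```r
# Hypothetical observed and predicted values
obs  <- c(1.5, 12.0, 30.2, 4.4, 45.1)
pred <- c(3.0, 10.5, 22.0, 8.0, 38.9)

metrics <- c(
  RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
  Rsquared = cor(pred, obs)^2,             # squared Pearson correlation
  MAE      = mean(abs(pred - obs))         # mean absolute error
)
metrics
```

Because R2 here is a squared correlation, it measures how well predictions track the observations but not whether they are on the right scale, so it is worth reading alongside RMSE and MAE.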

(e). Try building other models discussed in this chapter. Do any have better predictive performance?

The other models from this chapter that I will use are Ridge Regression, Elastic Net and lasso.

# Ridge Regression
set.seed(4)
ridge_model <- train(x = train_fingerprints,
                     y = train_permeability,
                     method = 'ridge',
                     metric = 'Rsquared',
                     tuneGrid = data.frame(.lambda = seq(0, 1, by = 0.1)),
                     trControl = trainControl(method = 'cv'),
                     preProcess = c('center', 'scale'))

# plotting the model
plot(ridge_model)

# retrieving the test set RMSE, R2 and MAE
ridge_predict <- predict(ridge_model, test_fingerprints)
ridge_metrics <- postResample(pred = ridge_predict, obs = test_permeability)
ridge_metrics
##       RMSE   Rsquared        MAE 
## 14.1351966  0.3326506 10.7680534
# Elastic Net
set.seed(4)
enet_model <- train(x = train_fingerprints,
                    y = train_permeability,
                    method = 'enet',
                    metric = 'Rsquared',
                    tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.1),
                                           .lambda = seq(0, 1, by = 0.1)),
                    trControl = trainControl(method = 'cv'),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = train_fingerprints, y = train_permeability, :
## missing values found in aggregated results
# plotting the model
plot(enet_model)

# retrieving the test set RMSE, R2 and MAE
enet_predict <- predict(enet_model, test_fingerprints)
enet_metrics <- postResample(pred = enet_predict, obs = test_permeability)
enet_metrics
##      RMSE  Rsquared       MAE 
## 13.550031  0.324134 10.038379
# Lasso
set.seed(1)
lasso_model <- train(x = train_fingerprints,
                     y = train_permeability,
                     method = 'lasso',
                     metric = 'Rsquared',
                     tuneGrid = data.frame(.fraction = seq(0, 0.5, by = 0.05)),
                     trControl = trainControl(method = 'cv'),
                     preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = train_fingerprints, y = train_permeability, :
## missing values found in aggregated results
# plotting the model
plot(lasso_model)

# retrieving the test set RMSE, R2 and MAE
lasso_predict <- predict(lasso_model, test_fingerprints)
lasso_metrics <- postResample(pred = lasso_predict, obs = test_permeability)
lasso_metrics
##       RMSE   Rsquared        MAE 
## 15.5065038  0.1836616 11.8176158

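Based on the test set metrics, the Elastic Net model has the lowest RMSE (13.55) and the lowest MAE (10.04) of the four models, while Ridge Regression has a marginally higher R2 (0.333 versus 0.324). Both clearly outperform the PLS model (RMSE 15.25, R2 0.223); the lasso does not (RMSE 15.51, R2 0.184). The lower the MAE and RMSE, the better the predictive performance, so the Elastic Net model works best overall.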
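
To compare the four models side by side, the test set metrics printed above can be collected into one table and ranked. This is only a convenience sketch in base R: the numbers are copied from the postResample outputs in this report, and the object names are my own.

```r
# Test set metrics copied from the postResample outputs above
results <- data.frame(
  model    = c("PLS", "Ridge", "Elastic Net", "Lasso"),
  RMSE     = c(15.2527094, 14.1351966, 13.550031, 15.5065038),
  Rsquared = c(0.2226841, 0.3326506, 0.324134, 0.1836616),
  MAE      = c(12.2233882, 10.7680534, 10.038379, 11.8176158)
)

# Rank the models from lowest to highest test RMSE
results[order(results$RMSE), ]

# The best model depends on the metric chosen
best_rmse <- results$model[which.min(results$RMSE)]
best_mae  <- results$model[which.min(results$MAE)]
best_r2   <- results$model[which.max(results$Rsquared)]
c(RMSE = best_rmse, MAE = best_mae, Rsquared = best_r2)
```

Elastic Net wins on RMSE and MAE, while Ridge Regression wins narrowly on R2.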

(f). Would you recommend any of your models to replace the permeability laboratory experiment?

I would recommend the Elastic Net model because it has the best predictive performance, i.e., the lowest MAE and the lowest RMSE of all the models.

6.3 - A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a). Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturingProcess) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b). A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

(c). Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

(d). Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

(e). Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

(f). Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

(a). Start R and use these commands to load the data

#load the AppliedPredictiveModeling library
library(AppliedPredictiveModeling)
#Manufacturing Processes
data(ChemicalManufacturingProcess)
(b). A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
#retrieving the number of missing values
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
#utilizing the "bagImpute" function to fill in these missing values.
fill <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(fill, ChemicalManufacturingProcess)

There were 106 missing values in total. The bagged trees method ("bagImpute") was used to impute them.
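
As a simpler illustration of the same idea, the sketch below counts missing cells and fills them with column medians in base R. Median imputation is a stand-in chosen for brevity, not the bagged trees method used above, and the toy data frame and the helper impute_median are hypothetical.

```r
# Hypothetical toy data with a few missing cells
toy <- data.frame(
  Yield    = c(38.0, 42.4, 41.4, 39.1, 40.2),
  ProcessA = c(10.1, NA, 9.8, 10.4, NA),
  ProcessB = c(1.2, 1.5, NA, 1.1, 1.3)
)

sum(is.na(toy))       # total number of missing cells
colSums(is.na(toy))   # missing cells per column

# Median imputation: replace each NA with the median of its column
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}
toy_filled <- as.data.frame(lapply(toy, impute_median))

sum(is.na(toy_filled))   # no missing cells remain
```

Checking colSums(is.na(...)) before imputing shows which predictors carry most of the missing cells, which helps judge how much the imputation could influence the model.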

(c). Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

The model I chose from this chapter is the Elastic Net model.

# pre-processing: filter out near-zero-variance predictors
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(4)
# Training index: about 80% of the rows are used for training.
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
train_chemical_process <- Chemical[index, ]
# Test set: the remaining 20% of the rows are held out for testing.
test_chemical_process <- Chemical[-index, ]


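# Tuning the Elastic Net model
# Note: this call passes the full Chemical data rather than train_chemical_process,
# which is why the summary below reports 176 samples.
enet_model <- train(Yield ~ ., Chemical, method = "enet",
                    tuneGrid = expand.grid(fraction = seq(0.1, 0.9, by = 0.1),
                                           lambda = seq(0.01, 0.1, by = 0.01)),
                    trControl = trainControl(method = "cv", number = 10),
                    preProc = c("center", "scale"))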
                                  

#retrieving the optimal value for the performance metric
enet_model
## Elasticnet 
## 
## 176 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 159, 157, 160, 157, 158, 158, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.01    0.1       1.316975  0.6019086  1.0760464
##   0.01    0.2       1.161001  0.6072977  0.9443368
##   0.01    0.3       1.115329  0.6424503  0.9128508
##   0.01    0.4       1.256898  0.5829407  0.9810714
##   0.01    0.5       1.533995  0.5761261  1.0493047
##   0.01    0.6       1.836513  0.5215431  1.1536244
##   0.01    0.7       1.948444  0.5163828  1.1903216
##   0.01    0.8       1.992865  0.5192833  1.2048963
##   0.01    0.9       2.001436  0.5160788  1.2104575
##   0.02    0.1       1.376686  0.5962553  1.1237506
##   0.02    0.2       1.179724  0.6029620  0.9651424
##   0.02    0.3       1.117122  0.6448564  0.9091019
##   0.02    0.4       1.137214  0.6273521  0.9267966
##   0.02    0.5       1.372629  0.5625950  1.0253075
##   0.02    0.6       1.614600  0.5630321  1.0821050
##   0.02    0.7       1.745722  0.5584073  1.1236004
##   0.02    0.8       1.812302  0.5640438  1.1388188
##   0.02    0.9       1.841691  0.5634063  1.1489656
##   0.03    0.1       1.411223  0.5914304  1.1508178
##   0.03    0.2       1.187743  0.6095602  0.9785188
##   0.03    0.3       1.125932  0.6385686  0.9165670
##   0.03    0.4       1.121846  0.6360250  0.9162427
##   0.03    0.5       1.272835  0.5835891  0.9886596
##   0.03    0.6       1.506165  0.5565417  1.0610965
##   0.03    0.7       1.727109  0.5276106  1.1263097
##   0.03    0.8       1.772254  0.5355759  1.1428405
##   0.03    0.9       1.801513  0.5414893  1.1517055
##   0.04    0.1       1.435269  0.5874408  1.1698199
##   0.04    0.2       1.198073  0.6134869  0.9862020
##   0.04    0.3       1.133606  0.6312434  0.9236687
##   0.04    0.4       1.115993  0.6403728  0.9096211
##   0.04    0.5       1.199852  0.6031642  0.9535365
##   0.04    0.6       1.420348  0.5656088  1.0382785
##   0.04    0.7       1.699327  0.5251323  1.1197602
##   0.04    0.8       1.751740  0.5254019  1.1380754
##   0.04    0.9       1.783042  0.5286679  1.1496392
##   0.05    0.1       1.453060  0.5840997  1.1837320
##   0.05    0.2       1.210467  0.6141504  0.9953762
##   0.05    0.3       1.140219  0.6258097  0.9293724
##   0.05    0.4       1.116012  0.6411159  0.9092313
##   0.05    0.5       1.157088  0.6179392  0.9368883
##   0.05    0.6       1.354265  0.5766527  1.0173467
##   0.05    0.7       1.659358  0.5268506  1.1102165
##   0.05    0.8       1.731289  0.5228214  1.1319998
##   0.05    0.9       1.759322  0.5250305  1.1437467
##   0.06    0.1       1.466882  0.5812903  1.1943340
##   0.06    0.2       1.222842  0.6133521  1.0051808
##   0.06    0.3       1.149595  0.6183270  0.9366474
##   0.06    0.4       1.117413  0.6409397  0.9100061
##   0.06    0.5       1.131494  0.6304926  0.9223921
##   0.06    0.6       1.303914  0.5867035  0.9992326
##   0.06    0.7       1.600538  0.5317128  1.0953266
##   0.06    0.8       1.712254  0.5223323  1.1272494
##   0.06    0.9       1.736239  0.5238930  1.1377361
##   0.07    0.1       1.478004  0.5789105  1.2027369
##   0.07    0.2       1.234253  0.6122094  1.0136973
##   0.07    0.3       1.158544  0.6117060  0.9429374
##   0.07    0.4       1.118617  0.6406096  0.9107760
##   0.07    0.5       1.123974  0.6350864  0.9159305
##   0.07    0.6       1.261993  0.5955604  0.9836681
##   0.07    0.7       1.547203  0.5377127  1.0805730
##   0.07    0.8       1.695925  0.5230274  1.1236267
##   0.07    0.9       1.713076  0.5241957  1.1316048
##   0.08    0.1       1.487186  0.5768869  1.2096016
##   0.08    0.2       1.244216  0.6112365  1.0208965
##   0.08    0.3       1.166754  0.6059270  0.9493631
##   0.08    0.4       1.121062  0.6391375  0.9117815
##   0.08    0.5       1.121910  0.6362429  0.9136953
##   0.08    0.6       1.226471  0.6041170  0.9690794
##   0.08    0.7       1.502625  0.5434707  1.0677599
##   0.08    0.8       1.679079  0.5243156  1.1200591
##   0.08    0.9       1.691483  0.5250656  1.1261856
##   0.09    0.1       1.494916  0.5751256  1.2153784
##   0.09    0.2       1.253086  0.6102868  1.0272216
##   0.09    0.3       1.172192  0.6028435  0.9548515
##   0.09    0.4       1.121756  0.6386978  0.9118041
##   0.09    0.5       1.120633  0.6370130  0.9118837
##   0.09    0.6       1.198794  0.6113090  0.9564443
##   0.09    0.7       1.466982  0.5486201  1.0569053
##   0.09    0.8       1.662749  0.5263574  1.1158370
##   0.09    0.9       1.673324  0.5262916  1.1220543
##   0.10    0.1       1.501510  0.5735907  1.2204103
##   0.10    0.2       1.260956  0.6094407  1.0331926
##   0.10    0.3       1.176570  0.6005218  0.9597081
##   0.10    0.4       1.121695  0.6386951  0.9107952
##   0.10    0.5       1.120174  0.6374852  0.9110819
##   0.10    0.6       1.179466  0.6165245  0.9469254
##   0.10    0.7       1.435287  0.5537491  1.0470693
##   0.10    0.8       1.633775  0.5294872  1.1078720
##   0.10    0.9       1.658501  0.5278059  1.1186859
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.01.
#plotting the Elastic Net model
plot(enet_model)

The optimal tuning values are fraction = 0.3 and lambda = 0.01. They give the smallest resampled RMSE, 1.115, with a corresponding R2 of 0.642.
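
The tuning grid and the selection rule can be checked with a few lines of base R. The grid below is the same one passed to expand.grid above; the four candidate rows are copied from the resampling table, and the object names are my own.

```r
# The tuning grid used above: 9 fractions x 10 lambdas
grid <- expand.grid(fraction = seq(0.1, 0.9, by = 0.1),
                    lambda   = seq(0.01, 0.1, by = 0.01))
nrow(grid)   # 90

# The four lowest-RMSE rows, copied from the resampling table above
top <- data.frame(
  lambda   = c(0.01, 0.02, 0.04, 0.05),
  fraction = c(0.3, 0.3, 0.4, 0.4),
  RMSE     = c(1.115329, 1.117122, 1.115993, 1.116012),
  Rsquared = c(0.6424503, 0.6448564, 0.6403728, 0.6411159)
)

top[which.min(top$RMSE), ]       # the row train() selected (metric = RMSE)
top[which.max(top$Rsquared), ]   # the row a largest-R2 rule would pick among these
```

The two rules disagree slightly here (lambda = 0.01 versus lambda = 0.02 at fraction = 0.3), which shows how close the leading candidates are.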

(d). Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enet_prediction <- predict(enet_model, test_chemical_process[,-1] )
(predResult <- postResample(enet_prediction, obs=test_chemical_process[,1]))
##      RMSE  Rsquared       MAE 
## 1.0964894 0.6932596 0.9263345

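On the test set the model gives RMSE = 1.096 and R2 = 0.693, slightly better than the resampled estimates from tuning (RMSE = 1.115, R2 = 0.642). Note that the model summary reports 176 samples, so the model was fit on all runs, including the test rows, which makes this test estimate optimistic.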
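
A test set estimate is only an honest check when the test rows are kept out of the fit. The base-R sketch below shows that pattern on simulated data, with lm standing in for the elastic net and a simple random split standing in for createDataPartition; all names and values are hypothetical.

```r
set.seed(4)
n <- 176

# Simulated stand-in data (not the chemical manufacturing data)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Yield <- 40 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)

# Simple random 80/20 split
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
train_set  <- toy[train_rows, ]
test_set   <- toy[-train_rows, ]

# Fit on the training rows only
fit <- lm(Yield ~ ., data = train_set)

# Evaluate on rows the model has not seen
pred      <- predict(fit, newdata = test_set)
test_rmse <- sqrt(mean((pred - test_set$Yield)^2))
test_r2   <- cor(pred, test_set$Yield)^2

c(train_n = nrow(train_set), test_n = nrow(test_set),
  RMSE = test_rmse, Rsquared = test_r2)
```

With caret the equivalent change would be to pass the training data frame to train() in place of the full data.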

(e). Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

I will use the varImp function to rank the importance of the predictors.

varImp(enet_model)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   90.02
## BiologicalMaterial06     84.56
## ManufacturingProcess36   75.99
## ManufacturingProcess17   74.88
## BiologicalMaterial03     73.53
## ManufacturingProcess09   70.37
## BiologicalMaterial12     67.98
## BiologicalMaterial02     65.33
## ManufacturingProcess31   60.32
## ManufacturingProcess06   58.10
## ManufacturingProcess33   49.70
## BiologicalMaterial11     48.11
## BiologicalMaterial04     47.13
## ManufacturingProcess11   42.19
## BiologicalMaterial08     41.88
## BiologicalMaterial01     39.14
## ManufacturingProcess30   33.14
## ManufacturingProcess12   32.62
## BiologicalMaterial09     32.42

The most important variables in the trained model are ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06 and ManufacturingProcess36. Process predictors hold a slight majority of the top 20 (11 process predictors versus 9 biological predictors) and take four of the top five positions.
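
The split between process and biological predictors can be tallied directly from the names in the varImp output. The vector below is copied from that output, in order of importance; the object names are my own.

```r
# Top 20 predictors, copied from the varImp output in order of importance
top20 <- c(
  "ManufacturingProcess32", "ManufacturingProcess13", "BiologicalMaterial06",
  "ManufacturingProcess36", "ManufacturingProcess17", "BiologicalMaterial03",
  "ManufacturingProcess09", "BiologicalMaterial12", "BiologicalMaterial02",
  "ManufacturingProcess31", "ManufacturingProcess06", "ManufacturingProcess33",
  "BiologicalMaterial11", "BiologicalMaterial04", "ManufacturingProcess11",
  "BiologicalMaterial08", "BiologicalMaterial01", "ManufacturingProcess30",
  "ManufacturingProcess12", "BiologicalMaterial09"
)

# Classify each predictor by its name prefix
kind <- ifelse(grepl("^Manufacturing", top20), "process", "biological")

table(kind)            # counts across the top 20
table(head(kind, 5))   # counts across the top 5
```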

(f). Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

I will create a correlation plot of the top predictors and the response variable in order to explore their relationships.

top_10_important <- varImp(enet_model)$importance %>%
  arrange(-Overall) %>%
  head(10)
Chemical %>%
  select(c("Yield", row.names(top_10_important))) %>%
  cor() %>%
  corrplot()

From the correlation plot, we can see that “ManufacturingProcess32” has one of the highest correlations with Yield, while “ManufacturingProcess13”, “ManufacturingProcess36” and “ManufacturingProcess17” are negatively correlated with Yield. This is useful information because the goal is to maximize Yield. Process predictors can be changed, so future runs could raise the settings that are positively correlated with Yield and lower those that are negatively correlated, keeping in mind that a correlation alone does not show that changing a setting will change the yield. Biological predictors cannot be changed, but those that are strongly correlated with Yield can be used to assess the quality of the raw materials before processing.
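
The signs and sizes read off a correlation plot can also be listed numerically. The sketch below does this in base R on simulated data; the column names ProcessUp, ProcessDown and Noise are hypothetical and only illustrate a positive, a negative and an unrelated predictor.

```r
set.seed(1)
n <- 100

# Simulated stand-in data: one helpful, one harmful and one unrelated predictor
toy <- data.frame(ProcessUp = rnorm(n), ProcessDown = rnorm(n), Noise = rnorm(n))
toy$Yield <- 40 + 1.0 * toy$ProcessUp - 1.0 * toy$ProcessDown + rnorm(n, sd = 0.5)

# Correlation of each predictor with the response
predictors <- setdiff(names(toy), "Yield")
cors <- sapply(toy[predictors], function(v) cor(v, toy$Yield))

# Sort by absolute strength while keeping the sign
cors[order(-abs(cors))]
```

A sorted, signed list like this makes it easier to see which adjustable settings are candidates to increase and which to decrease.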