library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(DataExplorer)
library(RANN)

## Warning: package 'RANN' was built under R version 4.2.3

library(pls)

## Warning: package 'pls' was built under R version 4.2.3

## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings

library(caret)
library(caTools)

## Warning: package 'caTools' was built under R version 4.2.3

library(Amelia)

## Warning: package 'Amelia' was built under R version 4.2.3

## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

Homework 7: Linear Regression

Instructions

In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

dim(fingerprints)

## [1]  165 1107

head(permeability)

##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

#719 low frequency instances
low_frequency <- nearZeroVar(fingerprints)

#remove low frequency columns using baser df[row,columns]
predictors <- fingerprints[,-low_frequency]

#388 predictors remaining
#dim(predictors)

After applying the nerZeroVar function and filtering out the low frequency predictors, we are left with 388 out of the original 1,107.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

We will split the data into 80% train and 20% test.

set.seed(624)

#70 30 split
split1<- sample(c(rep(0, 0.7 * nrow(permeability)), 
                  rep(1, 0.3 * nrow(permeability))))

#training split1[0] = 115 observations, test split1[1] = 49 observations 
#table(split1)

X_train <- predictors[split1 == 0,]
X_test <- predictors[split1 == 1,]

y_train <- permeability[split1 == 0]
y_test <- permeability[split1 == 1]

#PLS model 
plsTune <- train(X_train, y_train, 
                method='pls', metric='Rsquared',
                tuneLength=20, 
                trControl=trainControl(method='cv'),
                preProc=c('center', 'scale')
                )
plsTune

## Partial Least Squares 
## 
## 116 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 105, 105, 104, 104, 104, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.98974  0.3897237  10.091790
##    2     11.71406  0.5029046   8.701257
##    3     11.83257  0.5312902   9.214913
##    4     11.42981  0.5632994   9.013731
##    5     11.03805  0.5914051   8.691079
##    6     10.59821  0.6316227   8.231370
##    7     10.26537  0.6443277   7.986676
##    8     10.46235  0.6304089   8.231908
##    9     10.73985  0.6017092   8.546290
##   10     10.83152  0.5891046   8.531458
##   11     10.99323  0.5805043   8.418635
##   12     11.13834  0.5746015   8.593098
##   13     11.37617  0.5655660   8.816645
##   14     11.58639  0.5586207   9.061223
##   15     11.77675  0.5451381   9.186252
##   16     12.04135  0.5270379   9.397151
##   17     12.31818  0.5175956   9.491485
##   18     12.67136  0.4995409   9.714319
##   19     12.76746  0.4976270   9.845285
##   20     12.90804  0.4913546  10.027778
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 7.

plsTune$results %>% 
  dplyr::filter(ncomp == 7)

##   ncomp     RMSE  Rsquared      MAE  RMSESD RsquaredSD    MAESD
## 1     7 10.26537 0.6443277 7.986676 2.36905  0.1930055 1.832539

The best tune was found at ncomp = 7 with an \(R^2\) value of 0.6443277.

(d) Predict the response for the test set. What is the test set estimate of R2?

#generate prediction using model and testing data
plsPred <- predict(plsTune, newdata=X_test)

#evaluation metrics
postResample(pred=plsPred, obs=y_test)

##      RMSE  Rsquared       MAE 
## 15.504629  0.298245 10.263821

The predictions on the test set yield an R2 of 0.298, which is lower than the training set R2.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

We will try building a ridge regression and elastic net model, which use penalization to reduce RMSE.

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

set.seed(100)
ridgeRegFit <- train(X_train, y_train,
method = "ridge",
## Fit the model over many penalty values
tuneGrid = ridgeGrid,
trControl = trainControl(method = "cv", number = 10),
## put the predictors on the same scale
preProc = c("center", "scale"))

ridgeRegFit

## Ridge Regression 
## 
## 116 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 104, 104, 104, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE      
##   0.000000000  13.42918  0.4350687   9.813896
##   0.007142857  13.39973  0.4347176  10.069807
##   0.014285714  12.79387  0.4595642   9.561699
##   0.021428571  12.57599  0.4709797   9.342162
##   0.028571429  12.36206  0.4822585   9.147202
##   0.035714286  12.24330  0.4896934   9.015106
##   0.042857143  12.16753  0.4945919   8.913440
##   0.050000000  12.08335  0.5004080   8.823110
##   0.057142857  12.02912  0.5045467   8.754285
##   0.064285714  11.97670  0.5086675   8.706204
##   0.071428571  11.92931  0.5120632   8.667938
##   0.078571429  11.89081  0.5151343   8.648102
##   0.085714286  11.87584  0.5175576   8.633921
##   0.092857143  11.84389  0.5198216   8.618253
##   0.100000000  11.82397  0.5223366   8.610166
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The ridge regression used an optimal penalty of lambda 0.1 which yielded a RMSE of 11.82.

set.seed(100)
enetTune <- train(X_train, y_train,
method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale"))

enetTune

## Elasticnet 
## 
## 116 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 104, 104, 104, 104, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.89447  0.4958656  10.098686
##   0.00    0.10      11.55344  0.5511241   8.709817
##   0.00    0.15      11.16136  0.5697351   8.310974
##   0.00    0.20      11.17919  0.5598754   8.223731
##   0.00    0.25      11.37140  0.5416571   8.265924
##   0.00    0.30      11.50593  0.5310362   8.390119
##   0.00    0.35      11.57897  0.5266347   8.490916
##   0.00    0.40      11.72109  0.5176938   8.560236
##   0.00    0.45      11.89539  0.5095340   8.657046
##   0.00    0.50      11.95153  0.5100184   8.684189
##   0.00    0.55      12.03194  0.5056331   8.765216
##   0.00    0.60      12.17967  0.4984503   8.896229
##   0.00    0.65      12.28237  0.4938369   9.005594
##   0.00    0.70      12.38252  0.4892529   9.099680
##   0.00    0.75      12.51270  0.4824588   9.189189
##   0.00    0.80      12.65010  0.4744044   9.264844
##   0.00    0.85      12.79513  0.4656540   9.346416
##   0.00    0.90      12.98995  0.4559565   9.475609
##   0.00    0.95      13.20659  0.4458025   9.637678
##   0.00    1.00      13.42918  0.4350687   9.813896
##   0.01    0.05      13.18297  0.4866818  10.007414
##   0.01    0.10      13.97323  0.5239471  10.743653
##   0.01    0.15      15.53625  0.5068019  12.194767
##   0.01    0.20      17.33769  0.4942749  13.633990
##   0.01    0.25      19.06313  0.4976596  15.010541
##   0.01    0.30      21.07059  0.4910393  16.488132
##   0.01    0.35      23.16569  0.4866973  17.963082
##   0.01    0.40      25.37782  0.4806392  19.491249
##   0.01    0.45      27.56591  0.4746575  20.960403
##   0.01    0.50      29.82133  0.4701056  22.499166
##   0.01    0.55      31.97964  0.4663268  23.982901
##   0.01    0.60      34.21426  0.4607319  25.533774
##   0.01    0.65      36.47001  0.4540935  27.076710
##   0.01    0.70      38.75128  0.4484932  28.653110
##   0.01    0.75      41.13697  0.4440274  30.321401
##   0.01    0.80      43.71666  0.4383655  32.075752
##   0.01    0.85      46.29433  0.4325196  33.818743
##   0.01    0.90      48.84458  0.4265639  35.524996
##   0.01    0.95      51.37126  0.4224633  37.199978
##   0.01    1.00      53.86021  0.4211344  38.861576
##   0.10    0.05      12.86317  0.4525790  10.229388
##   0.10    0.10      11.03874  0.5455046   8.245289
##   0.10    0.15      10.70773  0.5701959   7.791600
##   0.10    0.20      10.87715  0.5630199   7.894887
##   0.10    0.25      11.05600  0.5518010   8.015183
##   0.10    0.30      11.23285  0.5388144   8.176591
##   0.10    0.35      11.27574  0.5362680   8.188161
##   0.10    0.40      11.18027  0.5424088   8.156111
##   0.10    0.45      11.07856  0.5482089   8.104849
##   0.10    0.50      11.06643  0.5487583   8.109480
##   0.10    0.55      11.10437  0.5473447   8.131812
##   0.10    0.60      11.14986  0.5456959   8.155057
##   0.10    0.65      11.19748  0.5446580   8.168515
##   0.10    0.70      11.28283  0.5420062   8.207598
##   0.10    0.75      11.37487  0.5389174   8.267045
##   0.10    0.80      11.47199  0.5346039   8.326146
##   0.10    0.85      11.56486  0.5308932   8.391362
##   0.10    0.90      11.67084  0.5268765   8.465233
##   0.10    0.95      11.75356  0.5243030   8.536928
##   0.10    1.00      11.82397  0.5223366   8.610166
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.1.

The elastic net used an optimal penalty of lambda 0.1 and fraction 0.10, which yielded a RMSE of 11.82 and R2 of 0.52.

It seems that the elastic net model has superior predictive performance than any of the previous models discussed in this exercise.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

enetpredict <- predict(enetTune, X_test)

postResample(pred=enetpredict, obs = y_test)

##       RMSE   Rsquared        MAE 
## 14.2854921  0.3158383  9.4547206

With an R2 of about 0.31 for the predictions on our test set from our best model, I don’t feel confident that we could replace the laboratory experiments with any of these models.

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

dim(ChemicalManufacturingProcess)

## [1] 176  58

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

On the missing values plot below, we can observe that roughly 28 variables have missing values that range from 0.57% to 8.52% of values missing.

library(AppliedPredictiveModeling)
missmap(ChemicalManufacturingProcess)

#using knn imputation
(CHEM_knn_impute <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute')))

## Created from 152 samples and 58 variables
## 
## Pre-processing:
##   - centered (58)
##   - ignored (0)
##   - 5 nearest neighbor imputation (58)
##   - scaled (58)

CHEM_dataframe <- predict(CHEM_knn_impute, ChemicalManufacturingProcess)
summary(CHEM_dataframe)

##      Yield         BiologicalMaterial01 BiologicalMaterial02
##  Min.   :-2.6692   Min.   :-2.5653      Min.   :-2.1858     
##  1st Qu.:-0.7716   1st Qu.:-0.6078      1st Qu.:-0.7457     
##  Median :-0.1119   Median :-0.1491      Median :-0.1484     
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000     
##  3rd Qu.: 0.7035   3rd Qu.: 0.6423      3rd Qu.: 0.7557     
##  Max.   : 3.3394   Max.   : 3.3597      Max.   : 2.2459     
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :-2.6830      Min.   :-1.6731      Min.   :-2.90576    
##  1st Qu.:-0.6811      1st Qu.:-0.6222      1st Qu.:-0.73944    
##  Median :-0.1212      Median :-0.1405      Median :-0.05891    
##  Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
##  3rd Qu.: 0.6804      3rd Qu.: 0.4907      3rd Qu.: 0.70568    
##  Max.   : 2.6355      Max.   : 6.0523      Max.   : 3.38985    
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :-2.2184      Min.   :-0.1313      Min.   :-2.38535    
##  1st Qu.:-0.7622      1st Qu.:-0.1313      1st Qu.:-0.64225    
##  Median :-0.1202      Median :-0.1313      Median : 0.02249    
##  Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.00000    
##  3rd Qu.: 0.6499      3rd Qu.:-0.1313      3rd Qu.: 0.56906    
##  Max.   : 2.7948      Max.   : 7.5723      Max.   : 2.43034    
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :-3.39629     Min.   :-1.7202      Min.   :-2.3116     
##  1st Qu.:-0.59627     1st Qu.:-0.5685      1st Qu.:-0.6505     
##  Median :-0.03627     Median :-0.1513      Median :-0.1811     
##  Mean   : 0.00000     Mean   : 0.0000      Mean   : 0.0000     
##  3rd Qu.: 0.67428     3rd Qu.: 0.3161      3rd Qu.: 0.5491     
##  Max.   : 2.96246     Max.   : 6.7920      Max.   : 2.4431     
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :-2.3914      Min.   :-6.149703      Min.   :-1.969253     
##  1st Qu.:-0.6074      1st Qu.:-0.223563      1st Qu.: 0.308956     
##  Median :-0.1033      Median : 0.105667      Median : 0.509627     
##  Mean   : 0.0000      Mean   : 0.001224      Mean   : 0.009518     
##  3rd Qu.: 0.7112      3rd Qu.: 0.503487      3rd Qu.: 0.568648     
##  Max.   : 2.5986      Max.   : 1.587202      Max.   : 0.686690     
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :-3.10582       Min.   :-3.323233      Min.   :-2.577803     
##  1st Qu.:-0.42705       1st Qu.:-0.613828      1st Qu.:-0.487046     
##  Median : 0.37658       Median : 0.342432      Median :-0.086583     
##  Mean   : 0.04123       Mean   : 0.003213      Mean   :-0.002534     
##  3rd Qu.: 0.46587       3rd Qu.: 0.661186      3rd Qu.: 0.230347     
##  Max.   : 2.69818       Max.   : 2.254953      Max.   : 5.686954     
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :-1.630631      Min.   :-0.9580199     Min.   :-1.111973     
##  1st Qu.:-0.630408      1st Qu.:-0.9580199     1st Qu.:-1.111973     
##  Median :-0.222910      Median :-0.9580199     Median : 0.894164     
##  Mean   :-0.006574      Mean   :-0.0009072     Mean   :-0.001759     
##  3rd Qu.: 0.480950      3rd Qu.: 1.0378549     3rd Qu.: 0.894164     
##  Max.   : 7.408415      Max.   : 1.0378549     Max.   : 0.894164     
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :-4.37787       Min.   :-2.18999       Min.   :-2.63442      
##  1st Qu.:-0.49799       1st Qu.:-0.62482       1st Qu.:-0.53867      
##  Median : 0.04519       Median :-0.10310       Median : 0.02020      
##  Mean   : 0.00000       Mean   : 0.02156       Mean   : 0.03163      
##  3rd Qu.: 0.55281       3rd Qu.: 0.54906       3rd Qu.: 0.71878      
##  Max.   : 2.39252       Max.   : 3.15768       Max.   : 2.95425      
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :-0.480694      Min.   :-2.37172       Min.   :-2.803712     
##  1st Qu.:-0.480694      1st Qu.:-0.59881       1st Qu.:-0.488202     
##  Median :-0.480694      Median : 0.09066       Median : 0.029921     
##  Mean   :-0.002731      Mean   : 0.00000       Mean   :-0.004071     
##  3rd Qu.:-0.480694      3rd Qu.: 0.68163       3rd Qu.: 0.520534     
##  Max.   : 2.068439      Max.   : 4.03046       Max.   : 3.688885     
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :-2.3137        Min.   :-12.98219      Min.   :-2.43850      
##  1st Qu.:-0.4960        1st Qu.: -0.01436      1st Qu.:-0.67597      
##  Median :-0.1273        Median :  0.06312      Median : 0.04507      
##  Mean   : 0.0000        Mean   :  0.00000      Mean   : 0.00000      
##  3rd Qu.: 0.3786        3rd Qu.:  0.15126      3rd Qu.: 0.60587      
##  Max.   : 3.3283        Max.   :  0.81376      Max.   : 4.53150      
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :-13.08836      Min.   :-3.0321        Min.   :-13.05542     
##  1st Qu.:  0.00903      1st Qu.:-0.6022        1st Qu.: -0.01063     
##  Median :  0.06890      Median :-0.1360        Median :  0.07318     
##  Mean   :  0.00000      Mean   : 0.0000        Mean   :  0.00000     
##  3rd Qu.:  0.14237      3rd Qu.: 0.4838        3rd Qu.:  0.15197     
##  Max.   :  0.43899      Max.   : 2.5846        Max.   :  0.58033     
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-2.1018        Min.   :-1.6230324     Min.   :-1.814768     
##  1st Qu.:-0.5599        1st Qu.:-0.7223009     1st Qu.:-0.611797     
##  Median :-0.1745        Median :-0.1218132     Median :-0.010311     
##  Mean   : 0.0000        Mean   : 0.0003314     Mean   : 0.004726     
##  3rd Qu.: 0.2110        3rd Qu.: 0.7789183     3rd Qu.: 0.591175     
##  Max.   : 4.8365        Max.   : 1.9798937     Max.   : 1.794146     
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   :-1.523304      Min.   :-12.927496     Min.   :-12.940454    
##  1st Qu.:-0.833580      1st Qu.:  0.002208     1st Qu.:  0.008505    
##  Median :-0.143857      Median :  0.069146     Median :  0.063251    
##  Mean   : 0.005061      Mean   :  0.000105     Mean   :  0.000404    
##  3rd Qu.: 0.890729      3rd Qu.:  0.128720     3rd Qu.:  0.115417    
##  Max.   : 2.442608      Max.   :  0.433287     Max.   :  0.312785    
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :-12.888994     Min.   :-1.25583       Min.   :-12.026718    
##  1st Qu.:  0.000681     1st Qu.:-1.25583       1st Qu.: -0.186978    
##  Median :  0.066362     Median : 0.72551       Median : -0.066778    
##  Mean   :  0.001121     Mean   :-0.02868       Mean   : -0.003263    
##  3rd Qu.:  0.131337     3rd Qu.: 0.78266       3rd Qu.:  0.233723    
##  Max.   :  0.416660     Max.   : 0.93507       Max.   :  1.195326    
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   :-9.38589       Min.   :-12.632749     Min.   :-2.86552      
##  1st Qu.:-0.37026       1st Qu.: -0.015263     1st Qu.:-0.64216      
##  Median : 0.03954       Median :  0.110732     Median :-0.08632      
##  Mean   : 0.01114       Mean   : -0.001722     Mean   : 0.00000      
##  3rd Qu.: 0.55179       3rd Qu.:  0.218728     3rd Qu.: 0.65480      
##  Max.   : 2.08855       Max.   :  0.416720     Max.   : 2.69287      
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :-3.037737      Min.   :-3.558813      Min.   :-3.01270      
##  1st Qu.:-0.621676      1st Qu.: 0.118269      1st Qu.:-0.51725      
##  Median : 0.183677      Median : 0.118269      Median :-0.05513      
##  Mean   :-0.005764      Mean   :-0.009176      Mean   :-0.02362      
##  3rd Qu.: 0.586354      3rd Qu.: 0.118269      3rd Qu.: 0.49941      
##  Max.   : 2.599738      Max.   : 1.956810      Max.   : 2.44032      
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :-2.944307      Min.   :-2.27741       Min.   :-3.9024       
##  1st Qu.:-0.655777      1st Qu.:-0.70467       1st Qu.:-0.8225       
##  Median :-0.083645      Median :-0.03064       Median : 0.7175       
##  Mean   :-0.008228      Mean   : 0.00000       Mean   : 0.0000       
##  3rd Qu.: 0.488487      3rd Qu.: 0.64339       3rd Qu.: 0.7175       
##  Max.   : 2.777017      Max.   : 2.89017       Max.   : 0.7175       
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :-4.5508        Min.   :-0.4626528     Min.   :-0.440588     
##  1st Qu.: 0.1653        1st Qu.:-0.4626528     1st Qu.:-0.440588     
##  Median : 0.2317        Median :-0.4626528     Median :-0.440588     
##  Mean   : 0.0000        Mean   : 0.0003392     Mean   :-0.000392     
##  3rd Qu.: 0.2982        3rd Qu.:-0.4626528     3rd Qu.:-0.440588     
##  Max.   : 0.4310        Max.   : 2.1490969     Max.   : 3.275213     
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   :-5.77163       Min.   :-1.0506        Min.   :-5.60583      
##  1st Qu.: 0.09979       1st Qu.:-0.3594        1st Qu.:-0.01588      
##  Median : 0.20280       Median :-0.1290        Median : 0.29467      
##  Mean   : 0.00000       Mean   : 0.0000        Mean   : 0.00000      
##  3rd Qu.: 0.25430       3rd Qu.: 0.1303        3rd Qu.: 0.29467      
##  Max.   : 0.46031       Max.   :11.6224        Max.   : 0.91578      
##  ManufacturingProcess45
##  Min.   :-5.25447      
##  1st Qu.:-0.09356      
##  Median : 0.15220      
##  Mean   : 0.00000      
##  3rd Qu.: 0.39796      
##  Max.   : 1.13523

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Based on the previous exercise and seeing that the elastic net model had better predictive performance, I will use this model to train the data.

We will split the data into 80% train and 20% test.

dim(CHEM_dataframe)

## [1] 176  58

CHEM_dataframe2 <- CHEM_dataframe[, -nearZeroVar(CHEM_dataframe)]
dim(CHEM_dataframe2)

## [1] 176  57

set.seed(555)

select_train <- createDataPartition(CHEM_dataframe2$Yield, times = 1, p = .80, list = FALSE)

train_x2 <- CHEM_dataframe2[select_train, ][, -c(1)] 
test_x2 <- CHEM_dataframe2[-select_train, ][, -c(1)] 
train_y2 <- CHEM_dataframe2[select_train, ]$Yield
test_y2 <- CHEM_dataframe2[-select_train, ]$Yield

(P_fit2 <- train(x = train_x2, y = train_y2,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 25, 
                trControl = trainControl(method = "cv", number=10), 
                preProcess = c('center', 'scale')
          ))

## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 130, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7338982  0.4317348  0.6175189
##    2     0.7722558  0.4684310  0.5812528
##    3     0.8134501  0.5194408  0.5823906
##    4     0.9540440  0.5193093  0.6360902
##    5     1.0721588  0.5241915  0.6733250
##    6     1.1436786  0.5185259  0.6912503
##    7     1.2458840  0.5085213  0.7365470
##    8     1.3875640  0.4976078  0.7926451
##    9     1.4661458  0.4849316  0.8268420
##   10     1.5879129  0.4644183  0.8643309
##   11     1.6495018  0.4598224  0.8756271
##   12     1.7627131  0.4554407  0.8976769
##   13     1.7487823  0.4650528  0.8931955
##   14     1.6777520  0.4656269  0.8684202
##   15     1.6377169  0.4705341  0.8508950
##   16     1.6194606  0.4693782  0.8400870
##   17     1.5915410  0.4743053  0.8205206
##   18     1.5932854  0.4752533  0.8122477
##   19     1.6520158  0.4751787  0.8387607
##   20     1.6905995  0.4728315  0.8552220
##   21     1.7800104  0.4698293  0.8932960
##   22     1.8218081  0.4683798  0.9101214
##   23     1.9167962  0.4630918  0.9435827
##   24     1.9756004  0.4611352  0.9652468
##   25     2.0143881  0.4592907  0.9780208
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.

plot(P_fit2)

The optimal value for the performance metric yields a RMSE of 1.07 and R2 of 0.524.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

P_predict2 <- predict(P_fit2, newdata=test_x2)
(postResample(pred=P_predict2, obs=test_y2))

##      RMSE  Rsquared       MAE 
## 0.6344353 0.6718573 0.5241329

The predictions on the test set yield a RMSE of 0.634, which is better than the one on the training set and R2 of 0.524, which is slightly lower than the one from the training set. Overall it seems that the model has performed very well on the test set.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

In this model, the most important predictor at the top of the list is “ManufacturingProcess32”. However, there is not a predictor type that dominates the list, we have both biological and process predictors similarly as important in the list.

plot(varImp(P_fit2, scale = FALSE), top=20, scales = list(y = list(cex = 0.8)))

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

correlation <- cor(select(CHEM_dataframe2, 'ManufacturingProcess32','ManufacturingProcess36','ManufacturingProcess13','Yield'))
corrplot::corrplot(correlation, method='square', type="upper")

As we can see on the correlation plots above, some of the most important predictor variables have strong positive or negative relationships with the response variable. However, there are a few variables that do not seem to have strong correlations with the response and yet were identified as important for the model. Knowing which variables are positively or negatively correlated to the yield can help improve it by making the necessary adjustments in the manufacturing process to increase yield.

DATA 624 HW7

Bharani Nittala

April 2, 2023

Homework 7: Linear Regression

Instructions

Exercise 6.2

Exercise 6.3