6.2 Developing a model to predict permeability

A. Start R and use these commands to load in the data

## Let's call in the data
library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3

data("permeability")

B.

## Was just playing around with the nearZeroVar function to understand what it was doing.
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

clean <- nearZeroVar(fingerprints,saveMetrics = TRUE)

dim(fingerprints)

## [1]  165 1107

clean2 <- nearZeroVar(fingerprints)

## Filter out these columns from the dataset.

clean_finger <- fingerprints[,-clean2]

dim(clean_finger)

## [1] 165 388

Using the nearZeroVar function, we removed 719 near-zero-variance predictors, leaving 388 of the original 1,107.
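To make the filter less of a black box, here is a minimal base-R sketch of the two metrics that nearZeroVar reports: the frequency ratio (most common value over second most common) and the percentage of unique values. The function name nzv_sketch and the toy vectors are hypothetical. The cutoffs (95/5 and 10) are assumed to match caret's documented defaults, and the flagging rule is a simplification of what caret does.

```r
# Hypothetical helper: flags a predictor as near-zero variance.
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) < 2                 # only one distinct value
  freq_ratio <- if (zero_var) NA_real_ else counts[[1]] / counts[[2]]
  pct_unique <- 100 * length(counts) / length(x)
  flagged <- zero_var || (freq_ratio > freq_cut && pct_unique <= unique_cut)
  list(freq_ratio = freq_ratio, pct_unique = pct_unique, flagged = flagged)
}

# Toy binary fingerprints for 165 compounds (made-up data)
sparse_bit   <- c(rep(0, 160), rep(1, 5))   # 160:5 split, ratio 32
balanced_bit <- c(rep(0, 90), rep(1, 75))   # 90:75 split, ratio 1.2

print(nzv_sketch(sparse_bit)$flagged)
print(nzv_sketch(balanced_bit)$flagged)
```

A fingerprint that is almost always 0 carries very little information for a linear model, which is why so many of the binary fingerprint columns are dropped.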

C. Split the data into training and testing

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

## We split the data, stratifying on the response (permeability)
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- clean_finger[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- clean_finger[-trainRow, ]
y.test <- permeability[-trainRow, ]
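As an aside on what a stratified split does, the following base-R sketch bins a numeric response by quantiles and samples 80% from each bin, so the training set covers the whole range of the response. This is only an approximation of the idea behind createDataPartition, not its actual implementation. The simulated response y and all object names are hypothetical.

```r
set.seed(1)
y <- rexp(165, rate = 0.1)   # hypothetical skewed response, 165 samples

# Bin the response into quintiles
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.2)),
            include.lowest = TRUE)

# Sample 80% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
}), use.names = FALSE)

print(length(train_idx))
print(table(bins[train_idx]))
```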

## We can fit a PLS model now
# ctrl sets up 10-fold cross-validation
set.seed(2)
ctrl <- trainControl(method = "cv",number = 10)
plsTune <- train(X.train,y.train,method = "pls",tuneLength = 20,trControl = ctrl,preProcess = c("center","scale"))

## Seems like it needed 10 components to explain 70 % of the variance?
plsTune

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 119, 119, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.93602  0.3991379  9.756731
##    2     11.50435  0.5334462  8.353969
##    3     11.54192  0.5022321  8.860602
##    4     11.44849  0.5160740  8.876045
##    5     10.93868  0.5649522  8.205146
##    6     10.82624  0.5673513  8.156738
##    7     10.58600  0.5688317  8.189013
##    8     10.64118  0.5533584  8.247715
##    9     10.38141  0.5618200  8.013176
##   10     10.23861  0.5822621  7.622160
##   11     10.41218  0.5713843  7.710015
##   12     10.36078  0.5813480  7.669539
##   13     10.38942  0.5834779  7.696167
##   14     10.52169  0.5768715  7.760384
##   15     10.91159  0.5578152  7.942479
##   16     11.15889  0.5553447  8.241611
##   17     11.59726  0.5442666  8.496309
##   18     11.70003  0.5385997  8.665714
##   19     11.75514  0.5341747  8.694274
##   20     11.86050  0.5361340  8.799761
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.

summary(plsTune)

## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 10
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           21.09    34.70    40.33    45.02    51.49    58.92    62.07
## .outcome    34.57    51.73    59.70    66.80    72.46    76.41    79.79
##           8 comps  9 comps  10 comps
## X           66.69    69.49     72.48
## .outcome    81.40    83.47     84.54

plot(plsTune)

Looking at the resampling results and the plot of the PLS model, the cross-validated RMSE is lowest at 10 components (RMSE = 10.24). Beyond 10 components the RMSE rises again, which suggests the extra components begin to over-fit the training data rather than capture useful structure. From the summary, the first 10 components explain 72.48% of the predictor variance and 84.54% of the training-set outcome variance. So we use 10 components to predict on our testing data. The corresponding resampled estimate of R2 is 0.58.
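The selection can be reproduced by hand from the resampling table. The sketch below copies the cross-validated RMSE column from the plsTune output and picks the minimum. It also shows a simpler alternative: the smallest number of components whose RMSE is within a tolerance of the best. The 2% tolerance is an arbitrary assumption for illustration. caret offers related rules through the selectionFunction argument of trainControl, which I have not used here.

```r
# Cross-validated RMSE for ncomp = 1..20, copied from the plsTune output
cv_rmse <- c(12.93602, 11.50435, 11.54192, 11.44849, 10.93868,
             10.82624, 10.58600, 10.64118, 10.38141, 10.23861,
             10.41218, 10.36078, 10.38942, 10.52169, 10.91159,
             11.15889, 11.59726, 11.70003, 11.75514, 11.86050)

# Rule used by train() by default: smallest RMSE
best_ncomp <- which.min(cv_rmse)

# Alternative (assumed 2% tolerance): simplest model close to the best
tol <- 0.02
simplest_ncomp <- min(which(cv_rmse <= (1 + tol) * min(cv_rmse)))

print(c(best = best_ncomp, simplest = simplest_ncomp))
```

With these numbers the tolerance rule would pick 9 components instead of 10, a reminder that the optimum is fairly flat in this region.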

D. Predict the Response for the test set

We can use the postResample function from the caret library to find the test-set R2 value.

## predict() on a train object uses the tuned final model (ncomp = 10)
plsPred <- predict(plsTune,newdata = X.test)
postResample(pred = plsPred,obs = y.test)

##       RMSE   Rsquared        MAE 
## 14.6758044  0.3274115 10.9299035

The R-squared on the test set (0.33) is noticeably lower than the resampled estimate from cross-validation (0.58).
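For reference, the three numbers that postResample reports for a regression problem can be computed by hand. The sketch below assumes the usual definitions: RMSE is the root mean squared error, R-squared is the squared correlation between predictions and observations, and MAE is the mean absolute error. The obs and pred vectors are made-up toy values, not the permeability data.

```r
obs  <- c(1, 2, 3, 4, 5)             # toy observed values
pred <- c(1.5, 1.5, 3.5, 3.5, 5.5)   # toy predictions, each off by 0.5

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared correlation
mae  <- mean(abs(pred - obs))        # mean absolute error

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

Because R-squared here is a squared correlation, it measures how well the predictions track the observations but not whether they are on the right scale, so it is worth reading it alongside RMSE.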

E. Use Other Models

PCR model

The documentation for train lists a variety of models we can fit. We will first train a principal component regression (PCR) model.

set.seed(2)
ctrl <- trainControl(method = "cv",number = 10)
pcrTune <- train(X.train,y.train,method = "pcr",tuneLength = 20,trControl = ctrl,preProcess = c("center","scale"))

pcrTune

## Principal Component Analysis 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 119, 119, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.71806  0.1129303  11.294285
##    2     14.78535  0.1070360  11.374555
##    3     13.31043  0.3684400   9.938426
##    4     12.57922  0.4418256   9.704423
##    5     12.01188  0.5076261   8.843804
##    6     12.00945  0.5164286   8.882011
##    7     12.03297  0.4953780   8.871224
##    8     11.97685  0.4975877   8.761882
##    9     12.04399  0.4671626   8.847345
##   10     12.00752  0.4715141   8.802659
##   11     11.93147  0.4829882   8.770246
##   12     11.69978  0.4952314   8.775275
##   13     11.77682  0.4865078   8.834785
##   14     11.89691  0.4750947   8.965196
##   15     12.05555  0.4602524   9.083852
##   16     11.97265  0.4713711   8.925006
##   17     12.16393  0.4672501   9.128545
##   18     11.96329  0.4878165   9.101439
##   19     12.13790  0.4765539   9.216944
##   20     11.86884  0.5036443   8.929720
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 12.

plot(pcrTune)

The lowest cross-validated RMSE (11.70) occurs at 12 components.

## predict() on a train object uses the tuned final model (ncomp = 12)
pcrpred <- predict(pcrTune,newdata = X.test)
postResample(pred = pcrpred,obs = y.test)

##       RMSE   Rsquared        MAE 
## 13.2452661  0.2860616 10.0268331

The test-set R-squared (0.29) is smaller than that of the PLS model (0.33), although the PCR test-set RMSE (13.25) is lower than the PLS RMSE (14.68).
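One way to see why PCR can need more components than PLS is to write PCR out by hand: the components come from a PCA of the predictors alone, and the response is only used afterwards in an ordinary regression on the scores. The sketch below uses simulated data and hypothetical names (X, y, new_X, k). It is an illustration of the idea, not what train(method = "pcr") runs internally.

```r
set.seed(7)
n <- 100
X <- matrix(rnorm(n * 6), n, 6)
colnames(X) <- paste0("x", 1:6)
y <- as.vector(X %*% c(2, -1, 0.5, 0, 0, 0) + rnorm(n))

# Step 1: PCA on centered and scaled predictors (the response is not used)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: regress the response on the first k component scores
k <- 3
scores <- pca$x[, 1:k, drop = FALSE]
fit <- lm(y ~ scores)

# Step 3: project new samples onto the same components and predict
new_X <- matrix(rnorm(5 * 6), 5, 6)
colnames(new_X) <- colnames(X)
new_scores <- predict(pca, newdata = new_X)[, 1:k, drop = FALSE]
preds <- as.vector(cbind(1, new_scores) %*% coef(fit))

print(preds)
```

PLS differs in step 1: its components are chosen to covary with the response, which is consistent with PLS reaching a higher resampled R-squared here with fewer components.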

Ridge Regression

## Set 10 different lambda values
ridgegrid <- data.frame(.lambda = seq(0,.1,length = 10))

## Build the model..
set.seed(3)
ridgeregfit <- train(X.train,y.train,method = "ridge",tuneGrid = ridgegrid,trControl = ctrl,preProc = c("center","scale"))

## Warning: model fit failed for Fold03: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold10: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

ridgeregfit

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 121, 119, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE      
##   0.00000000  13.56105  0.4791827   9.976741
##   0.01111111  13.84156  0.4256120   9.970553
##   0.02222222  37.33681  0.4708809  20.447780
##   0.03333333  10.64119  0.5328469   7.683918
##   0.04444444  10.45913  0.5458059   7.539779
##   0.05555556  10.36459  0.5533689   7.493452
##   0.06666667  10.30773  0.5586993   7.480233
##   0.07777778  10.27000  0.5631410   7.479825
##   0.08888889  10.22660  0.5683446   7.463670
##   0.10000000  10.18854  0.5730327   7.467546
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

plot(ridgeregfit)

A penalty (lambda) of 0.1 gives the lowest cross-validated RMSE (10.19). Because 0.1 is the largest value in the grid, a wider grid might find an even better penalty.
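To illustrate what the penalty does, the sketch below computes ridge coefficients in closed form, (X'X + lambda I)^-1 X'y, on simulated data and shows that the overall size of the coefficient vector shrinks as lambda grows. The data, the function name ridge_coef, and the lambda values are all hypothetical. The elasticnet package behind method = "ridge" may scale lambda differently, so the numbers are not comparable to the grid above.

```r
set.seed(42)
n <- 50
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # centered, scaled predictors
y <- as.vector(X %*% c(3, -2, 0, 0, 1) + rnorm(n))
y <- y - mean(y)                                # centered response

# Closed-form ridge solution (no intercept, since everything is centered)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0, 1, 10, 100)
coef_norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))

print(round(coef_norms, 3))
```

With lambda = 0 this reduces to ordinary least squares. That case can be unstable with 388 correlated predictors and only 133 samples, which is one plausible reason the lambda = 0 fits failed in some folds.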

ridpred <- predict(ridgeregfit,X.test)
postResample(pred = ridpred,obs = y.test)

##       RMSE   Rsquared        MAE 
## 15.0822363  0.3047575 11.4460367

We get an R-squared of 0.305, which is lower than that of the PLS model (0.327).

Elastic Net (Lasso) Model

## We use the train function from the textbook..
enetGrid <- expand.grid(.lambda = c(0,0.01,0.1),.fraction = seq(.05,1,length = 15))
set.seed(4)
enetTune <- train(X.train,y.train,method = "enet",tuneGrid = enetGrid,trControl = ctrl,preProc = c("center","scale"))

## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold08: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

enetTune

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 118, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction   RMSE       Rsquared   MAE       
##   0.00    0.0500000   13.60367  0.3411740   10.252519
##   0.00    0.1178571   12.99927  0.3661877    9.260713
##   0.00    0.1857143   12.91312  0.3681914    8.981251
##   0.00    0.2535714   12.76254  0.3801867    8.960384
##   0.00    0.3214286   12.70837  0.3834288    9.029600
##   0.00    0.3892857   12.52627  0.3945208    8.990799
##   0.00    0.4571429   12.34712  0.4088136    8.928673
##   0.00    0.5250000   12.21493  0.4204528    8.926430
##   0.00    0.5928571   12.07260  0.4360365    8.809660
##   0.00    0.6607143   11.91441  0.4546768    8.669985
##   0.00    0.7285714   11.86337  0.4627041    8.589323
##   0.00    0.7964286   11.87481  0.4674083    8.579541
##   0.00    0.8642857   11.83754  0.4737802    8.563303
##   0.00    0.9321429   11.77899  0.4814910    8.485069
##   0.00    1.0000000   11.77030  0.4854085    8.445759
##   0.01    0.0500000   23.68575  0.4890447   13.731837
##   0.01    0.1178571   41.34191  0.4721048   21.307597
##   0.01    0.1857143   58.79682  0.5045713   29.475617
##   0.01    0.2535714   75.98616  0.5138722   37.437992
##   0.01    0.3214286   93.36600  0.5141966   45.440916
##   0.01    0.3892857  110.58352  0.5236589   53.410206
##   0.01    0.4571429  127.72537  0.5312118   61.728306
##   0.01    0.5250000  144.86517  0.5389867   70.066756
##   0.01    0.5928571  162.07390  0.5416979   78.421830
##   0.01    0.6607143  179.43726  0.5343540   86.879803
##   0.01    0.7285714  196.78044  0.5284694   95.296718
##   0.01    0.7964286  213.98151  0.5240815  103.669590
##   0.01    0.8642857  230.82681  0.5221129  112.002646
##   0.01    0.9321429  247.71380  0.5179338  120.368478
##   0.01    1.0000000  264.62364  0.5126777  128.760065
##   0.10    0.0500000   11.99639  0.4938224    9.098122
##   0.10    0.1178571   11.61431  0.4676538    8.191088
##   0.10    0.1857143   11.77134  0.4637975    8.340578
##   0.10    0.2535714   11.45928  0.4922367    8.205628
##   0.10    0.3214286   11.12820  0.5207526    8.021765
##   0.10    0.3892857   11.00917  0.5356437    7.888500
##   0.10    0.4571429   10.83820  0.5559069    7.688091
##   0.10    0.5250000   10.73832  0.5698699    7.562825
##   0.10    0.5928571   10.63047  0.5819268    7.466724
##   0.10    0.6607143   10.56720  0.5907506    7.456845
##   0.10    0.7285714   10.56277  0.5949968    7.481558
##   0.10    0.7964286   10.57709  0.5961651    7.508011
##   0.10    0.8642857   10.60347  0.5961122    7.530941
##   0.10    0.9321429   10.60164  0.5975929    7.531608
##   0.10    1.0000000   10.58250  0.6003715    7.523100
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.7285714 and lambda = 0.1.

lambda = 0.1 and fraction = 0.73 were chosen as the final values because they give the smallest cross-validated RMSE (10.56).

plot(enetTune)

lassopred <- predict(enetTune,X.test)
postResample(pred = lassopred,obs = y.test)

##      RMSE  Rsquared       MAE 
## 14.276328  0.337738 10.832378

The test-set R-squared of 0.34 is slightly better than that of the PLS model (0.33).

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

tablee <- bind_rows(
  pls = postResample(pred = plsPred,obs = y.test),
  pcr = postResample(pred = pcrpred,obs = y.test),
  ridge = postResample(pred = ridpred,obs = y.test),
  lasso = postResample(pred = lassopred,obs = y.test)
)

tablee$id <- c("pls","pcr","ridge","lasso")

tablee

## # A tibble: 4 × 4
##    RMSE Rsquared   MAE id   
##   <dbl>    <dbl> <dbl> <chr>
## 1  14.7    0.327  10.9 pls  
## 2  13.2    0.286  10.0 pcr  
## 3  15.1    0.305  11.4 ridge
## 4  14.3    0.338  10.8 lasso

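The elastic net (labelled lasso in the table) has the highest test-set R-squared (0.338), while PCR has the lowest test-set RMSE (13.2), so no single model is best on every metric. The differences are small and may partly reflect the values I chose for the tuning grids.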

F.

In this case, I would probably choose the elastic net model (enet, labelled lasso above), since its test-set metrics are slightly better than those of the partial least squares model.

6.3 A chemical manufacturing process

A.

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

B.

## we have to find the columns with missing values

na_counts <- colSums(is.na(ChemicalManufacturingProcess))

cols_w_na <- names(na_counts[na_counts > 0])

cols_w_na

##  [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
##  [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
##  [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"

All 28 columns with missing values are ManufacturingProcess predictors.

## Check each column and impute it 

trans <- preProcess(ChemicalManufacturingProcess,method = "knnImpute")

We use the preProcess function with method = "knnImpute", following section 3.9 from the textbook.

## This is the only way I have found that I can view the knn imputation.. 
imp <- predict(trans,newdata = ChemicalManufacturingProcess)

head(imp)

##        Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -1.1792673           -0.2261036           -1.5140979          -2.68303622
## 2  1.2263678            2.2391498            1.3089960          -0.05623504
## 3  1.0042258            2.2391498            1.3089960          -0.05623504
## 4  0.6737219            2.2391498            1.3089960          -0.05623504
## 5  1.2534583            1.4827653            1.8939391           1.13594780
## 6  1.8386128           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1           -0.1313107            -1.233131           -3.3962895
## 2           -0.1313107             2.282619           -0.7227225
## 3           -0.1313107             2.282619           -0.7227225
## 4           -0.1313107             2.282619           -0.7227225
## 5           -0.1313107             1.071310           -0.1205678
## 6           -0.1313107             1.189487           -1.7343424
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1            1.1005296            -1.838655           -1.7709224
## 2            1.1005296             1.393395            1.0989855
## 3            1.1005296             1.393395            1.0989855
## 4            1.1005296             1.393395            1.0989855
## 5            0.4162193             0.136256            1.0989855
## 6            1.6346255             1.022062            0.7240877
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1              0.2154105              0.5662872              0.3765810
## 2             -6.1497028             -1.9692525              0.1979962
## 3             -6.1497028             -1.9692525              0.1087038
## 4             -6.1497028             -1.9692525              0.4658734
## 5             -0.2784345             -1.9692525              0.1087038
## 6              0.4348971             -1.9692525              0.5551658
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1              0.5655598            -0.44593467             -0.5414997
## 2             -2.3669726             0.99933318              0.9625383
## 3             -3.1638563             0.06246417             -0.1117745
## 4             -3.3232331             0.42279841              2.1850322
## 5             -2.2075958             0.84537219             -0.6304083
## 6             -1.2513352             0.49486525              0.5550403
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1             -0.1596700             -0.3095182             -1.7201524
## 2             -0.9580199              0.8941637              0.5883746
## 3              1.0378549              0.8941637             -0.3815947
## 4             -0.9580199             -1.1119728             -0.4785917
## 5              1.0378549              0.8941637             -0.4527258
## 6              1.0378549              0.8941637             -0.2199332
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1            -0.07700901            -0.09157342             -0.4806937
## 2             0.52297397             1.08204765             -0.4806937
## 3             0.31428424             0.55112383             -0.4806937
## 4            -0.02483658             0.80261406             -0.4806937
## 5            -0.39004361             0.10403009             -0.4806937
## 6             0.28819802             1.41736795             -0.4806937
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1             0.97711512              0.8093999              1.1846438
## 2            -0.50030980              0.2775205              0.9617071
## 3             0.28765016              0.4425865              0.8245152
## 4             0.28765016              0.7910592              1.0817499
## 5             0.09066017              2.5334227              3.3282665
## 6            -0.50030980              2.4050380              3.1396277
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1              0.3303945              0.9263296              0.1505348
## 2              0.1455765             -0.2753953              0.1559773
## 3              0.1455765              0.3655246              0.1831898
## 4              0.1967569              0.3655246              0.1695836
## 5              0.4754056             -0.3555103              0.2076811
## 6              0.6261033             -0.7560852              0.1423710
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1              0.4563798              0.3109942              0.2109804
## 2              1.5095063              0.1849230              0.2109804
## 3              1.0926437              0.1849230              0.2109804
## 4              0.9829430              0.1562704              0.2109804
## 5              1.6192070              0.2938027             -0.6884239
## 6              1.9044287              0.3998171             -0.5599376
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1             0.05833309              0.8317688              0.8907291
## 2            -0.72230090             -1.8147683             -1.0060115
## 3            -0.42205706             -1.2132826             -0.8335805
## 4            -0.12181322             -0.6117969             -0.6611496
## 5             0.77891831              0.5911745              1.5804530
## 6             1.07916216             -1.2132826             -1.3508734
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1              0.1200183              0.1256347              0.3460352
## 2              0.1093082              0.1966227              0.1906613
## 3              0.1842786              0.2159831              0.2104362
## 4              0.1708910              0.2052273              0.1906613
## 5              0.2726365              0.2912733              0.3432102
## 6              0.1146633              0.2417969              0.3516852
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1              0.7826636              0.5943242              0.7566948
## 2              0.8779201              0.8347250              0.7566948
## 3              0.8588688              0.7746248              0.2444430
## 4              0.8588688              0.7746248              0.2444430
## 5              0.8969714              0.9549255             -0.1653585
## 6              0.9160227              1.0150257              0.9615956
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1             -0.1952552             -0.4568829              0.9890307
## 2             -0.2672523              1.9517531              0.9890307
## 3             -0.1592567              2.6928719              0.9890307
## 4             -0.1592567              2.3223125              1.7943843
## 5             -0.1412574              2.3223125              2.5997378
## 6             -0.3572486              2.6928719              2.5997378
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1             -1.7202722            -0.88694718             -0.6557774
## 2              1.9568096             1.14638329             -0.6557774
## 3              1.9568096             1.23880740             -1.8000420
## 4              0.1182687             0.03729394             -1.8000420
## 5              0.1182687            -2.55058120             -2.9443066
## 6              0.1182687            -0.51725073             -1.8000420
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1             -1.1540243              0.7174727              0.2317270
## 2              2.2161351             -0.8224687              0.2317270
## 3             -0.7046697             -0.8224687              0.2317270
## 4              0.4187168             -0.8224687              0.2317270
## 5             -1.8280562             -0.8224687              0.2981503
## 6             -1.3787016             -0.8224687              0.2317270
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1             0.05969714            -0.06900773             0.20279570
## 2             2.14909691             2.34626280            -0.05472265
## 3            -0.46265281            -0.44058781             0.40881037
## 4            -0.46265281            -0.44058781            -0.31224099
## 5            -0.46265281            -0.44058781            -0.10622632
## 6            -0.46265281            -0.44058781             0.15129203
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1             2.40564734            -0.01588055             0.64371849
## 2            -0.01374656             0.29467248             0.15220242
## 3             0.10146268            -0.01588055             0.39796046
## 4             0.21667191            -0.01588055            -0.09355562
## 5             0.21667191            -0.32643359            -0.09355562
## 6             1.48397347            -0.01588055            -0.33931365

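All columns, including Yield, were centered and scaled, because knnImpute standardizes the data before finding neighbors. Only the missing values were then filled in by KNN. The default number of neighbors in preProcess is k = 5.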
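The following base-R sketch shows the general idea behind KNN imputation: standardize the columns, then for each incomplete row find the k complete rows that are closest on the observed columns and average their values for the missing columns. The function name knn_impute_sketch and the toy matrix are hypothetical. caret's own implementation may differ in details such as which rows are eligible neighbors.

```r
# Hypothetical helper: KNN imputation on a numeric matrix.
knn_impute_sketch <- function(x, k = 5) {
  z <- scale(x)   # center and scale; NAs are ignored in the column stats
  complete <- z[stats::complete.cases(z), , drop = FALSE]
  for (i in which(!stats::complete.cases(z))) {
    miss <- is.na(z[i, ])
    # Euclidean distance to each complete row, using observed columns only
    diffs <- t(complete[, !miss, drop = FALSE]) - z[i, !miss]
    d <- sqrt(colSums(diffs^2))
    nearest <- order(d)[seq_len(min(k, nrow(complete)))]
    # Fill the gaps with the neighbors' column means
    z[i, miss] <- colMeans(complete[nearest, miss, drop = FALSE])
  }
  z
}

set.seed(1)
toy <- matrix(rnorm(40), 10, 4)   # made-up data
toy[2, 3] <- NA
filled <- knn_impute_sketch(toy, k = 5)

print(filled[2, ])
```

Note that the returned values stay on the standardized scale, which matches what head(imp) shows for the real data.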

C. Split the data into training and testing

## Separate the predictors from the response (Yield)
impnoY <- imp %>%
  select(-Yield)
set.seed(1)
trainRow <- createDataPartition(imp$Yield, p=0.8, list=FALSE)
imp.train <- impnoY[trainRow, ]
Yield.train <- imp[trainRow,]$Yield
imp.test <- impnoY[-trainRow, ]
Yield.test <- imp[-trainRow,]$Yield

## Develop a PLS model using 10-fold cross-validation
set.seed(3)
ctrl2 <- trainControl(method = "cv",number = 10)
plsmod <- train(imp.train,Yield.train,method = "pls",tuneLength = 20,trControl = ctrl2,preProcess = c("center","scale"))

plsmod

## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 131, 129, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7902236  0.4521356  0.6413296
##    2     1.0661419  0.4709495  0.6854578
##    3     0.6743168  0.5977954  0.5516494
##    4     0.8948797  0.5245414  0.6340117
##    5     1.2800750  0.5114029  0.7342358
##    6     1.4121377  0.5057793  0.7686054
##    7     1.6033925  0.5022937  0.8108227
##    8     1.6618180  0.4955572  0.8340363
##    9     1.8458902  0.4665501  0.8987356
##   10     1.9277890  0.4627515  0.9191841
##   11     1.9822065  0.4628795  0.9380517
##   12     2.0347138  0.4675141  0.9575761
##   13     2.0832809  0.4631592  0.9758059
##   14     2.0805139  0.4708037  0.9744500
##   15     2.0655596  0.4771334  0.9672732
##   16     2.0444887  0.4856160  0.9592943
##   17     2.0464037  0.4788930  0.9602063
##   18     2.0713150  0.4697270  0.9684118
##   19     2.1199838  0.4619142  0.9846577
##   20     2.1533898  0.4595382  0.9942218
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

plot(plsmod)

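The cross-validated RMSE is lowest (0.674) and the R-squared is highest (0.598) at 3 components. Beyond the 3rd component the RMSE rises and the R-squared falls. These values are on the standardized Yield scale, since Yield was centered and scaled during imputation.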

D. Predict the response for the test set

plspred2 <- predict(plsmod,newdata = imp.test)
postResample(pred = plspred2,obs = Yield.test)

##      RMSE  Rsquared       MAE 
## 0.5800874 0.6054297 0.4399309

The test-set RMSE (0.580) and R-squared (0.605) are slightly better than the resampled estimates from the training set (0.674 and 0.598).

E.

## We use the varImp function 
plsImp <- varImp(plsmod,scale = FALSE)

## Warning: package 'pls' was built under R version 4.3.3

## 
## Attaching package: 'pls'

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

plot(plsImp,top = 20)

Most of the top-ranked predictors are manufacturing process variables, so I would say the process predictors dominate the importance list for predicting yield.

F.

Because the top predictors are manufacturing processes, I believe we should examine how those processes are run (for example, ManufacturingProcess32 and ManufacturingProcess36) and replicate the favorable settings in future runs to achieve a greater yield. BiologicalMaterial02 is also worth a closer look, since this material appears to have a comparatively large influence on yield.

Homework # 7

Al Haque

2024-03-27