## Let's call in the data
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("permeability")
## Was just playing around with the nearZeroVar function to understand what it was doing.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
clean <- nearZeroVar(fingerprints,saveMetrics = TRUE)
dim(fingerprints)
## [1] 165 1107
clean2 <- nearZeroVar(fingerprints)
## Filter out these columns from the dataset.
clean_finger <- fingerprints[,-clean2]
dim(clean_finger)
## [1] 165 388
Using the nearZeroVar function we removed 719 of the 1,107 predictors (those with near-zero variance), leaving 388.
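To make the filter less of a black box, here is a minimal base-R sketch of the two metrics that `nearZeroVar` reports with `saveMetrics = TRUE`: the frequency ratio and the percentage of unique values. The helper `nzv_metrics` and the toy columns are hypothetical. The cutoffs mirror caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`), but this is an approximation and not caret's actual implementation.

```r
# Hypothetical helper (not part of caret): approximates the nearZeroVar metrics.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) < 2
  # Count of the most common value divided by the count of the second most common
  freq_ratio <- if (zero_var) 0 else as.numeric(counts[1]) / as.numeric(counts[2])
  # Number of distinct values as a percentage of the number of samples
  pct_unique <- 100 * length(counts) / length(x)
  flagged <- zero_var || (freq_ratio > freq_cut && pct_unique <= unique_cut)
  list(freq_ratio = freq_ratio, pct_unique = pct_unique, flagged = flagged)
}

# Toy fingerprint-like columns with 165 rows (made-up data)
sparse_bit <- c(rep(0, 160), rep(1, 5))         # almost always 0
balanced_bit <- rep(c(0, 1), length.out = 165)  # roughly half 0, half 1

sparse_result <- nzv_metrics(sparse_bit)
balanced_result <- nzv_metrics(balanced_bit)
print(sparse_result$flagged)
print(balanced_result$flagged)
```

A column that is nearly constant, like `sparse_bit`, carries little information and can break resampling when a fold happens to contain only one value, which is the usual motivation for filtering it out.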
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
## We split the data based on the response, which in this case is permeability
set.seed(1)
trainRow <- createDataPartition(permeability, p=0.8, list=FALSE)
X.train <- clean_finger[trainRow, ]
y.train <- permeability[trainRow, ]
X.test <- clean_finger[-trainRow, ]
y.test <- permeability[-trainRow, ]
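For a numeric response, the caret documentation describes `createDataPartition` as sampling within groups formed from percentiles of the response, so that the training and test sets have similar response distributions. The base-R sketch below illustrates that idea only. It uses quartile groups and simulated data as assumptions and is not caret's code.

```r
set.seed(1)
y <- rexp(165)  # stand-in for a skewed response; NOT the permeability data

# Bin the response into quartile groups (an assumption chosen for simplicity)
groups <- cut(y, breaks = quantile(y, probs = seq(0, 1, by = 0.25)),
              include.lowest = TRUE)

# Sample 80% of the row indices inside each group
train_idx <- sort(unlist(lapply(split(seq_along(y), groups), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
}), use.names = FALSE))
test_idx <- setdiff(seq_along(y), train_idx)

print(length(train_idx))
print(table(groups[train_idx]))
```

Sampling within groups keeps the rare high-response samples from landing entirely in one of the two sets.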
## We can fit a PLS model now.
# ctrl sets up 10-fold cross-validation
set.seed(2)
ctrl <- trainControl(method = "cv",number = 10)
plsTune <- train(X.train,y.train,method = "pls",tuneLength = 20,trControl = ctrl,preProcess = c("center","scale"))
## It needed 10 components, which explain about 72% of the predictor variance (see summary below)
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 119, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.93602 0.3991379 9.756731
## 2 11.50435 0.5334462 8.353969
## 3 11.54192 0.5022321 8.860602
## 4 11.44849 0.5160740 8.876045
## 5 10.93868 0.5649522 8.205146
## 6 10.82624 0.5673513 8.156738
## 7 10.58600 0.5688317 8.189013
## 8 10.64118 0.5533584 8.247715
## 9 10.38141 0.5618200 8.013176
## 10 10.23861 0.5822621 7.622160
## 11 10.41218 0.5713843 7.710015
## 12 10.36078 0.5813480 7.669539
## 13 10.38942 0.5834779 7.696167
## 14 10.52169 0.5768715 7.760384
## 15 10.91159 0.5578152 7.942479
## 16 11.15889 0.5553447 8.241611
## 17 11.59726 0.5442666 8.496309
## 18 11.70003 0.5385997 8.665714
## 19 11.75514 0.5341747 8.694274
## 20 11.86050 0.5361340 8.799761
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
summary(plsTune)
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 10
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 21.09 34.70 40.33 45.02 51.49 58.92 62.07
## .outcome 34.57 51.73 59.70 66.80 72.46 76.41 79.79
## 8 comps 9 comps 10 comps
## X 66.69 69.49 72.48
## .outcome 81.40 83.47 84.54
plot(plsTune)
Looking at the printed results and the plot of the PLS model, the cross-validated RMSE is lowest at 10 components. Beyond 10 components the RMSE rises again, which suggests the model begins to overfit rather than capture more real structure. The training summary shows that these 10 components explain about 72% of the predictor variance and 85% of the outcome variance. So we use the 10-component model to predict on our test data. The corresponding resampled estimate of R2 is 0.58.
We can use the postResample function from the caret library to compute the test-set R-squared.
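As a quick check on the selection rule, the sketch below copies the cross-validated RMSE and R-squared columns from the printed `plsTune` output into a data frame and finds the best row by each metric. The object names `cv` and `best_by_rmse` are made up for illustration; only the numbers come from the output above.

```r
# Resampling results copied from the printed plsTune table
cv <- data.frame(
  ncomp = 1:20,
  RMSE = c(12.93602, 11.50435, 11.54192, 11.44849, 10.93868, 10.82624,
           10.58600, 10.64118, 10.38141, 10.23861, 10.41218, 10.36078,
           10.38942, 10.52169, 10.91159, 11.15889, 11.59726, 11.70003,
           11.75514, 11.86050),
  Rsquared = c(0.3991379, 0.5334462, 0.5022321, 0.5160740, 0.5649522,
               0.5673513, 0.5688317, 0.5533584, 0.5618200, 0.5822621,
               0.5713843, 0.5813480, 0.5834779, 0.5768715, 0.5578152,
               0.5553447, 0.5442666, 0.5385997, 0.5341747, 0.5361340)
)

best_by_rmse <- cv[which.min(cv$RMSE), ]  # the rule caret reports using
best_by_r2 <- cv[which.max(cv$Rsquared), ]
print(best_by_rmse)
print(best_by_r2)
```

The two criteria do not agree exactly: the smallest RMSE is at 10 components, while the largest R-squared is at 13 components, although the difference in R-squared between them is tiny.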
plsPred <- predict(plsTune,newdata = X.test,ncomp = 10)
postResample(pred = plsPred,obs = y.test)
## RMSE Rsquared MAE
## 14.6758044 0.3274115 10.9299035
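The test-set R-squared (0.33) is noticeably lower than the cross-validated estimate (0.58), and the test-set RMSE (14.68) is higher than the cross-validated RMSE (10.24).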
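For reference, the three numbers that `postResample` reports can be reproduced by hand. The sketch below is a base-R approximation with a hypothetical helper name (`resample_metrics`) and toy vectors. It relies on the fact that caret's R-squared here is the squared correlation between observed and predicted values.

```r
# Hypothetical helper: same three metrics as caret::postResample, written in base R
resample_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((obs - pred)^2)),   # root mean squared error
    Rsquared = cor(obs, pred)^2,         # squared Pearson correlation
    MAE = mean(abs(obs - pred)))         # mean absolute error
}

# Toy values (not from the permeability data)
pred <- c(1, 2, 3, 4)
obs <- c(1.5, 1.5, 3.5, 4.5)
metrics <- resample_metrics(pred, obs)
print(metrics)
```

Because R-squared is a squared correlation, a model can have a decent R-squared and still a large RMSE if its predictions are systematically offset, so it helps to read both columns together.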
Next we try a PCR model. The documentation for train shows that it supports a variety of methods, so we only need to change the method argument.
set.seed(2)
ctrl <- trainControl(method = "cv",number = 10)
pcrTune <- train(X.train,y.train,method = "pcr",tuneLength = 20,trControl = ctrl,preProcess = c("center","scale"))
pcrTune
## Principal Component Analysis
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 119, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.71806 0.1129303 11.294285
## 2 14.78535 0.1070360 11.374555
## 3 13.31043 0.3684400 9.938426
## 4 12.57922 0.4418256 9.704423
## 5 12.01188 0.5076261 8.843804
## 6 12.00945 0.5164286 8.882011
## 7 12.03297 0.4953780 8.871224
## 8 11.97685 0.4975877 8.761882
## 9 12.04399 0.4671626 8.847345
## 10 12.00752 0.4715141 8.802659
## 11 11.93147 0.4829882 8.770246
## 12 11.69978 0.4952314 8.775275
## 13 11.77682 0.4865078 8.834785
## 14 11.89691 0.4750947 8.965196
## 15 12.05555 0.4602524 9.083852
## 16 11.97265 0.4713711 8.925006
## 17 12.16393 0.4672501 9.128545
## 18 11.96329 0.4878165 9.101439
## 19 12.13790 0.4765539 9.216944
## 20 11.86884 0.5036443 8.929720
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 12.
plot(pcrTune)
The lowest cross-validated RMSE (11.70) occurs at 12 components.
pcrpred <- predict(pcrTune,newdata = X.test,ncomp = 1:12)
postResample(pred = pcrpred,obs = y.test)
## RMSE Rsquared MAE
## 13.2452661 0.2860616 10.0268331
The test-set R-squared (0.29) is lower than for the PLS model (0.33), although the test-set RMSE (13.25) is lower than the PLS value (14.68).
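One difference worth keeping in mind is that PCR builds its components from the predictors alone, without looking at the response, and only then regresses the response on the component scores. The sketch below shows those two steps in base R on simulated data. It is a simplified illustration and not what `train(method = "pcr")` runs internally.

```r
set.seed(42)
n <- 100
X <- matrix(rnorm(n * 6), nrow = n)  # simulated predictors (assumption)
y <- as.numeric(X %*% c(1, -1, 0.5, 0, 0, 2) + rnorm(n))

# Step 1: principal components of the centered and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: ordinary regression of y on the first k component scores
pcr_r2 <- sapply(1:6, function(k) {
  scores <- pca$x[, 1:k, drop = FALSE]
  summary(lm(y ~ scores))$r.squared
})
print(round(pcr_r2, 3))
```

The training R-squared can only go up as components are added, which is why the number of components has to be chosen by resampling rather than by the training fit.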
## Set 10 different lambda values
ridgegrid <- data.frame(.lambda = seq(0,.1,length = 10))
## Build the model..
set.seed(3)
ridgeregfit <- train(X.train,y.train,method = "ridge",tuneGrid = ridgegrid,trControl = ctrl,preProc = c("center","scale"))
## Warning: model fit failed for Fold03: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold10: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ridgeregfit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 121, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 13.56105 0.4791827 9.976741
## 0.01111111 13.84156 0.4256120 9.970553
## 0.02222222 37.33681 0.4708809 20.447780
## 0.03333333 10.64119 0.5328469 7.683918
## 0.04444444 10.45913 0.5458059 7.539779
## 0.05555556 10.36459 0.5533689 7.493452
## 0.06666667 10.30773 0.5586993 7.480233
## 0.07777778 10.27000 0.5631410 7.479825
## 0.08888889 10.22660 0.5683446 7.463670
## 0.10000000 10.18854 0.5730327 7.467546
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(ridgeregfit)
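With a penalty (lambda) of 0.1 we get the lowest cross-validated RMSE (10.19). Since 0.1 is the upper end of the grid and the RMSE is still decreasing there, a wider grid might find a better value.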
ridpred <- predict(ridgeregfit,X.test)
postResample(pred = ridpred,obs = y.test)
## RMSE Rsquared MAE
## 15.0822363 0.3047575 11.4460367
We get a test-set R-squared of 0.305, which is lower than the PLS value (0.327).
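The fits that failed at lambda = 0 are plausibly related to having more predictors (388) than samples in each fold, in which case the unpenalized least-squares problem has no unique solution. This is an interpretation and not a confirmed diagnosis of the warning. The sketch below uses the textbook ridge formula on simulated data to show the rank problem and how the penalty shrinks the coefficients. The `elasticnet` package may scale lambda differently, so the values are not comparable to the grid above.

```r
set.seed(3)
n <- 20
p <- 30  # more predictors than samples, as in the fingerprint data
X <- scale(matrix(rnorm(n * p), nrow = n))  # simulated, centered and scaled
y <- rnorm(n)
y <- y - mean(y)

xtx <- crossprod(X)       # t(X) %*% X, a p x p matrix
rank_xtx <- qr(xtx)$rank  # below p, so xtx cannot be inverted when lambda = 0

# Textbook ridge solution: (X'X + lambda * I)^-1 X'y
ridge_coef <- function(lambda) {
  solve(xtx + lambda * diag(p), crossprod(X, y))
}

norm_small <- sqrt(sum(ridge_coef(0.1)^2))
norm_large <- sqrt(sum(ridge_coef(10)^2))
print(c(rank = rank_xtx, p = p))
print(c(norm_small = norm_small, norm_large = norm_large))
```

Adding any positive lambda to the diagonal makes the matrix invertible, and a larger lambda pulls the coefficient vector closer to zero.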
## We use the train function from the textbook..
enetGrid <- expand.grid(.lambda = c(0,0.01,0.1),.fraction = seq(.05,1,length = 15))
set.seed(4)
enetTune <- train(X.train,y.train,method = "enet",tuneGrid = enetGrid,trControl = ctrl,preProc = c("center","scale"))
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold08: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
enetTune
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 118, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.0500000 13.60367 0.3411740 10.252519
## 0.00 0.1178571 12.99927 0.3661877 9.260713
## 0.00 0.1857143 12.91312 0.3681914 8.981251
## 0.00 0.2535714 12.76254 0.3801867 8.960384
## 0.00 0.3214286 12.70837 0.3834288 9.029600
## 0.00 0.3892857 12.52627 0.3945208 8.990799
## 0.00 0.4571429 12.34712 0.4088136 8.928673
## 0.00 0.5250000 12.21493 0.4204528 8.926430
## 0.00 0.5928571 12.07260 0.4360365 8.809660
## 0.00 0.6607143 11.91441 0.4546768 8.669985
## 0.00 0.7285714 11.86337 0.4627041 8.589323
## 0.00 0.7964286 11.87481 0.4674083 8.579541
## 0.00 0.8642857 11.83754 0.4737802 8.563303
## 0.00 0.9321429 11.77899 0.4814910 8.485069
## 0.00 1.0000000 11.77030 0.4854085 8.445759
## 0.01 0.0500000 23.68575 0.4890447 13.731837
## 0.01 0.1178571 41.34191 0.4721048 21.307597
## 0.01 0.1857143 58.79682 0.5045713 29.475617
## 0.01 0.2535714 75.98616 0.5138722 37.437992
## 0.01 0.3214286 93.36600 0.5141966 45.440916
## 0.01 0.3892857 110.58352 0.5236589 53.410206
## 0.01 0.4571429 127.72537 0.5312118 61.728306
## 0.01 0.5250000 144.86517 0.5389867 70.066756
## 0.01 0.5928571 162.07390 0.5416979 78.421830
## 0.01 0.6607143 179.43726 0.5343540 86.879803
## 0.01 0.7285714 196.78044 0.5284694 95.296718
## 0.01 0.7964286 213.98151 0.5240815 103.669590
## 0.01 0.8642857 230.82681 0.5221129 112.002646
## 0.01 0.9321429 247.71380 0.5179338 120.368478
## 0.01 1.0000000 264.62364 0.5126777 128.760065
## 0.10 0.0500000 11.99639 0.4938224 9.098122
## 0.10 0.1178571 11.61431 0.4676538 8.191088
## 0.10 0.1857143 11.77134 0.4637975 8.340578
## 0.10 0.2535714 11.45928 0.4922367 8.205628
## 0.10 0.3214286 11.12820 0.5207526 8.021765
## 0.10 0.3892857 11.00917 0.5356437 7.888500
## 0.10 0.4571429 10.83820 0.5559069 7.688091
## 0.10 0.5250000 10.73832 0.5698699 7.562825
## 0.10 0.5928571 10.63047 0.5819268 7.466724
## 0.10 0.6607143 10.56720 0.5907506 7.456845
## 0.10 0.7285714 10.56277 0.5949968 7.481558
## 0.10 0.7964286 10.57709 0.5961651 7.508011
## 0.10 0.8642857 10.60347 0.5961122 7.530941
## 0.10 0.9321429 10.60164 0.5975929 7.531608
## 0.10 1.0000000 10.58250 0.6003715 7.523100
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.7285714 and lambda = 0.1.
lambda = 0.1 and fraction = 0.73 were chosen as the final values because they give the smallest cross-validated RMSE (10.56).
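It can help to see exactly which candidates the tuning grid contains. The sketch below rebuilds the same grid in base R (without the leading dots in the column names, purely for readability) and counts the combinations. In the `elasticnet` package, lambda = 0 corresponds to the lasso, so only one third of the grid is a pure lasso fit.

```r
# Same candidate values as enetGrid; column names simplified for illustration
enet_grid <- expand.grid(lambda = c(0, 0.01, 0.1),
                         fraction = seq(0.05, 1, length.out = 15))

n_candidates <- nrow(enet_grid)  # 3 lambda values x 15 fractions
fractions <- sort(unique(enet_grid$fraction))
n_lasso <- sum(enet_grid$lambda == 0)  # rows that are pure lasso fits

print(n_candidates)
print(round(fractions, 7))
print(n_lasso)
```

With 10-fold cross-validation, each of the 45 candidates is evaluated on 10 resamples, and the selected fraction of 0.7285714 is the 11th of the 15 fraction values.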
plot(enetTune)
lassopred <- predict(enetTune,X.test)
postResample(pred = lassopred,obs = y.test)
## RMSE Rsquared MAE
## 14.276328 0.337738 10.832378
The elastic net did slightly better on the test set, with an R-squared of 0.34.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tablee <- bind_rows(pls = postResample(pred = plsPred,obs = y.test),pcr = postResample(pred = pcrpred,obs = y.test),ridge = postResample(pred = ridpred,obs = y.test),lasso = postResample(pred = lassopred,obs = y.test))
tablee$id = c("pls","pcr","ridge","lasso")
tablee
## # A tibble: 4 × 4
## RMSE Rsquared MAE id
## <dbl> <dbl> <dbl> <chr>
## 1 14.7 0.327 10.9 pls
## 2 13.2 0.286 10.0 pcr
## 3 15.1 0.305 11.4 ridge
## 4 14.3 0.338 10.8 lasso
The elastic net (labelled lasso in the table) appears to have slightly better predictive performance by R-squared, but this may be due to the values I chose for the tuning grid. Note that with lambda = 0.1 the selected model is an elastic net rather than a pure lasso.
In this case, I would probably choose the enet model, since it has the highest test-set R-squared (0.338 versus 0.327 for partial least squares), although PCR has the lowest test-set RMSE (13.2).
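Since the ranking depends on which metric is used, here is a small base-R sketch that rebuilds the comparison from the test-set numbers printed above and picks the best model by each metric. The data frame name `test_metrics` is made up, and the fourth row is labelled `enet` here rather than `lasso`.

```r
# Test-set metrics copied from the postResample outputs above
test_metrics <- data.frame(
  model = c("pls", "pcr", "ridge", "enet"),
  RMSE = c(14.6758044, 13.2452661, 15.0822363, 14.276328),
  Rsquared = c(0.3274115, 0.2860616, 0.3047575, 0.337738),
  MAE = c(10.9299035, 10.0268331, 11.4460367, 10.832378),
  stringsAsFactors = FALSE
)

best_rmse <- test_metrics$model[which.min(test_metrics$RMSE)]
best_r2 <- test_metrics$model[which.max(test_metrics$Rsquared)]
best_mae <- test_metrics$model[which.min(test_metrics$MAE)]
print(test_metrics[order(test_metrics$RMSE), ])
print(c(best_rmse = best_rmse, best_r2 = best_r2, best_mae = best_mae))
```

With only 32 test samples, differences of this size between models are small, so the choice between them is a judgment call.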
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
## we have to find the columns with missing values
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
cols_w_na <- names(na_counts[na_counts > 0])
cols_w_na
## [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
## [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
## [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"
It appears the missing values are in the ManufacturingProcess columns
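Beyond listing the affected columns, it is often useful to see how much is missing in each. The sketch below does this on a small made-up data frame. The names `toy` and `na_summary` are hypothetical, and the same steps should apply to `ChemicalManufacturingProcess`.

```r
# Made-up data frame with a few missing values
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 1, 2),
                  c = 1:4)

na_counts <- colSums(is.na(toy))
na_summary <- data.frame(
  column = names(na_counts),
  n_missing = as.integer(na_counts),
  pct_missing = 100 * as.numeric(na_counts) / nrow(toy),
  stringsAsFactors = FALSE
)

# Keep only columns with missing values, most affected first
na_summary <- na_summary[na_summary$n_missing > 0, ]
na_summary <- na_summary[order(-na_summary$n_missing), ]
print(na_summary)
```

Columns with only a small share of missing values are reasonable candidates for imputation, while a column that is mostly missing may be better dropped.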
## Check each column and impute it
trans <- preProcess(ChemicalManufacturingProcess,method = "knnImpute")
We use the preProcess function with method = "knnImpute", following section 3.9 of the textbook.
## This is the only way I have found that I can view the knn imputation..
imp <- predict(trans,newdata = ChemicalManufacturingProcess)
head(imp)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -1.1792673 -0.2261036 -1.5140979 -2.68303622
## 2 1.2263678 2.2391498 1.3089960 -0.05623504
## 3 1.0042258 2.2391498 1.3089960 -0.05623504
## 4 0.6737219 2.2391498 1.3089960 -0.05623504
## 5 1.2534583 1.4827653 1.8939391 1.13594780
## 6 1.8386128 -0.4081962 0.6620886 -0.59859075
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 0.2201765 0.4941942 -1.3828880
## 2 1.2964386 0.4128555 1.1290767
## 3 1.2964386 0.4128555 1.1290767
## 4 1.2964386 0.4128555 1.1290767
## 5 0.9414412 -0.3734185 1.5348350
## 6 1.5894524 1.7305423 0.6192092
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 -0.1313107 -1.233131 -3.3962895
## 2 -0.1313107 2.282619 -0.7227225
## 3 -0.1313107 2.282619 -0.7227225
## 4 -0.1313107 2.282619 -0.7227225
## 5 -0.1313107 1.071310 -0.1205678
## 6 -0.1313107 1.189487 -1.7343424
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 1.1005296 -1.838655 -1.7709224
## 2 1.1005296 1.393395 1.0989855
## 3 1.1005296 1.393395 1.0989855
## 4 1.1005296 1.393395 1.0989855
## 5 0.4162193 0.136256 1.0989855
## 6 1.6346255 1.022062 0.7240877
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 0.2154105 0.5662872 0.3765810
## 2 -6.1497028 -1.9692525 0.1979962
## 3 -6.1497028 -1.9692525 0.1087038
## 4 -6.1497028 -1.9692525 0.4658734
## 5 -0.2784345 -1.9692525 0.1087038
## 6 0.4348971 -1.9692525 0.5551658
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 0.5655598 -0.44593467 -0.5414997
## 2 -2.3669726 0.99933318 0.9625383
## 3 -3.1638563 0.06246417 -0.1117745
## 4 -3.3232331 0.42279841 2.1850322
## 5 -2.2075958 0.84537219 -0.6304083
## 6 -1.2513352 0.49486525 0.5550403
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 -0.1596700 -0.3095182 -1.7201524
## 2 -0.9580199 0.8941637 0.5883746
## 3 1.0378549 0.8941637 -0.3815947
## 4 -0.9580199 -1.1119728 -0.4785917
## 5 1.0378549 0.8941637 -0.4527258
## 6 1.0378549 0.8941637 -0.2199332
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 -0.07700901 -0.09157342 -0.4806937
## 2 0.52297397 1.08204765 -0.4806937
## 3 0.31428424 0.55112383 -0.4806937
## 4 -0.02483658 0.80261406 -0.4806937
## 5 -0.39004361 0.10403009 -0.4806937
## 6 0.28819802 1.41736795 -0.4806937
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 0.97711512 0.8093999 1.1846438
## 2 -0.50030980 0.2775205 0.9617071
## 3 0.28765016 0.4425865 0.8245152
## 4 0.28765016 0.7910592 1.0817499
## 5 0.09066017 2.5334227 3.3282665
## 6 -0.50030980 2.4050380 3.1396277
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 0.3303945 0.9263296 0.1505348
## 2 0.1455765 -0.2753953 0.1559773
## 3 0.1455765 0.3655246 0.1831898
## 4 0.1967569 0.3655246 0.1695836
## 5 0.4754056 -0.3555103 0.2076811
## 6 0.6261033 -0.7560852 0.1423710
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 0.4563798 0.3109942 0.2109804
## 2 1.5095063 0.1849230 0.2109804
## 3 1.0926437 0.1849230 0.2109804
## 4 0.9829430 0.1562704 0.2109804
## 5 1.6192070 0.2938027 -0.6884239
## 6 1.9044287 0.3998171 -0.5599376
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 0.05833309 0.8317688 0.8907291
## 2 -0.72230090 -1.8147683 -1.0060115
## 3 -0.42205706 -1.2132826 -0.8335805
## 4 -0.12181322 -0.6117969 -0.6611496
## 5 0.77891831 0.5911745 1.5804530
## 6 1.07916216 -1.2132826 -1.3508734
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 0.1200183 0.1256347 0.3460352
## 2 0.1093082 0.1966227 0.1906613
## 3 0.1842786 0.2159831 0.2104362
## 4 0.1708910 0.2052273 0.1906613
## 5 0.2726365 0.2912733 0.3432102
## 6 0.1146633 0.2417969 0.3516852
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 0.7826636 0.5943242 0.7566948
## 2 0.8779201 0.8347250 0.7566948
## 3 0.8588688 0.7746248 0.2444430
## 4 0.8588688 0.7746248 0.2444430
## 5 0.8969714 0.9549255 -0.1653585
## 6 0.9160227 1.0150257 0.9615956
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 -0.1952552 -0.4568829 0.9890307
## 2 -0.2672523 1.9517531 0.9890307
## 3 -0.1592567 2.6928719 0.9890307
## 4 -0.1592567 2.3223125 1.7943843
## 5 -0.1412574 2.3223125 2.5997378
## 6 -0.3572486 2.6928719 2.5997378
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 -1.7202722 -0.88694718 -0.6557774
## 2 1.9568096 1.14638329 -0.6557774
## 3 1.9568096 1.23880740 -1.8000420
## 4 0.1182687 0.03729394 -1.8000420
## 5 0.1182687 -2.55058120 -2.9443066
## 6 0.1182687 -0.51725073 -1.8000420
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 -1.1540243 0.7174727 0.2317270
## 2 2.2161351 -0.8224687 0.2317270
## 3 -0.7046697 -0.8224687 0.2317270
## 4 0.4187168 -0.8224687 0.2317270
## 5 -1.8280562 -0.8224687 0.2981503
## 6 -1.3787016 -0.8224687 0.2317270
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 0.05969714 -0.06900773 0.20279570
## 2 2.14909691 2.34626280 -0.05472265
## 3 -0.46265281 -0.44058781 0.40881037
## 4 -0.46265281 -0.44058781 -0.31224099
## 5 -0.46265281 -0.44058781 -0.10622632
## 6 -0.46265281 -0.44058781 0.15129203
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 2.40564734 -0.01588055 0.64371849
## 2 -0.01374656 0.29467248 0.15220242
## 3 0.10146268 -0.01588055 0.39796046
## 4 0.21667191 -0.01588055 -0.09355562
## 5 0.21667191 -0.32643359 -0.09355562
## 6 1.48397347 -0.01588055 -0.33931365
All columns were centered and scaled, since knnImpute in preProcess standardizes the data before imputing, and the missing values were then imputed with KNN. The default is k = 5.
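To show what nearest-neighbor imputation does in principle, here is a simplified base-R sketch: standardize the columns, find the k complete rows closest to the incomplete row using only the columns it has observed, and average their values for the missing column. The function `knn_impute_row` and the simulated matrix are hypothetical. caret's own implementation differs in its details.

```r
# Hypothetical helper: impute the missing entries of one row from its k nearest complete rows
knn_impute_row <- function(data, row, k = 5) {
  target <- data[row, ]
  miss <- is.na(target)
  donors <- data[complete.cases(data), , drop = FALSE]
  # Euclidean distance to each donor, using only the columns observed in the target row
  diffs <- sweep(donors[, !miss, drop = FALSE], 2, target[!miss])
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(min(k, nrow(donors)))]
  # Replace each missing entry by the mean of the nearest donors' values
  target[miss] <- colMeans(donors[nearest, miss, drop = FALSE])
  target
}

set.seed(7)
raw <- matrix(rnorm(40), nrow = 10, ncol = 4)  # simulated data
raw[3, 2] <- NA
scaled <- scale(raw)  # scale() skips NA values when computing means and sds

filled <- knn_impute_row(scaled, row = 3, k = 5)
print(filled)
```

Standardizing first matters because the distance would otherwise be dominated by whichever column has the largest units.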
## Separate the predictors from the response (Yield)
impnoY <- imp %>%
select(-Yield)
set.seed(1)
trainRow <- createDataPartition(imp$Yield, p=0.8, list=FALSE)
imp.train <- impnoY[trainRow, ]
Yield.train <- imp[trainRow,]$Yield
imp.test <- impnoY[-trainRow, ]
Yield.test <- imp[-trainRow,]$Yield
## Develop a PLS model with 10-fold cross-validation (repeated cross-validation is another option)
set.seed(3)
ctrl2 <- trainControl(method = "cv",number = 10)
plsmod <- train(imp.train,Yield.train,method = "pls",tuneLength = 20,trControl = ctrl2,preProcess = c("center","scale"))
plsmod
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 131, 129, 130, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7902236 0.4521356 0.6413296
## 2 1.0661419 0.4709495 0.6854578
## 3 0.6743168 0.5977954 0.5516494
## 4 0.8948797 0.5245414 0.6340117
## 5 1.2800750 0.5114029 0.7342358
## 6 1.4121377 0.5057793 0.7686054
## 7 1.6033925 0.5022937 0.8108227
## 8 1.6618180 0.4955572 0.8340363
## 9 1.8458902 0.4665501 0.8987356
## 10 1.9277890 0.4627515 0.9191841
## 11 1.9822065 0.4628795 0.9380517
## 12 2.0347138 0.4675141 0.9575761
## 13 2.0832809 0.4631592 0.9758059
## 14 2.0805139 0.4708037 0.9744500
## 15 2.0655596 0.4771334 0.9672732
## 16 2.0444887 0.4856160 0.9592943
## 17 2.0464037 0.4788930 0.9602063
## 18 2.0713150 0.4697270 0.9684118
## 19 2.1199838 0.4619142 0.9846577
## 20 2.1533898 0.4595382 0.9942218
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(plsmod)
After the 3rd component the cross-validated RMSE increases and the R-squared falls, so 3 components were selected.
plspred2 <- predict(plsmod,newdata = imp.test)
postResample(pred = plspred2,obs = Yield.test)
## RMSE Rsquared MAE
## 0.5800874 0.6054297 0.4399309
The model did slightly better on the test set (RMSE 0.58, R-squared 0.61) than the cross-validated estimates from the training set suggested (RMSE 0.67, R-squared 0.60).
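One caveat when reading these numbers: `preProcess` was applied to the whole data frame, so `Yield` itself was centered and scaled (visible in the `head(imp)` output), and the RMSE is therefore in standard-deviation units of yield. The sketch below uses simulated values to show that multiplying a standardized RMSE by the standard deviation used for scaling gives the RMSE in the original units.

```r
set.seed(5)
yield <- rnorm(50, mean = 40, sd = 2)  # simulated stand-in, not the real Yield
pred <- yield + rnorm(50, sd = 1)      # simulated predictions

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

mu <- mean(yield)
s <- sd(yield)

# RMSE computed on the standardized scale, as in the analysis above
rmse_std <- rmse((yield - mu) / s, (pred - mu) / s)
# RMSE computed directly in the original units
rmse_orig <- rmse(yield, pred)

print(c(rmse_std = rmse_std, back_transformed = rmse_std * s, rmse_orig = rmse_orig))
```

R-squared is unaffected by this rescaling, so only the RMSE and MAE need converting before they are reported in yield units.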
## We use the varImp function
plsImp <- varImp(plsmod,scale = FALSE)
## Warning: package 'pls' was built under R version 4.3.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
plot(plsImp,top = 20)
Manufacturing process predictors are given the greatest importance in predicting yield, so I would say the process predictors dominate the list.
Since the top predictors are manufacturing processes, I believe we should look at the procedures behind, for example, ManufacturingProcess32 and ManufacturingProcess36 and emulate the favorable settings in future runs so that we can get a greater yield. It is also worth looking at BiologicalMaterial02, since this material appears to have a greater hand in the yield than the other biological materials.