library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3
data(permeability)
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints_filtered)
## [1] 165 388
388 predictors are left for modeling.
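As a rough illustration of what the near-zero variance filter looks for, here is a minimal base R sketch. It assumes caret's documented defaults (a frequency ratio cutoff of 95/5 and a unique-value cutoff of 10 percent). The helper `is_near_zero_var` and the toy matrix are hypothetical and are not caret's actual implementation.

```r
# Hypothetical helper: flags a column when its most common value dominates
# the second most common one AND it has few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)   # a constant column has zero variance
  freq_ratio <- counts[[1]] / counts[[2]]
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Toy binary "fingerprints" (made-up data)
toy <- cbind(
  rare   = c(rep(0, 98), 1, 1),   # 98 zeros vs 2 ones: ratio 49 > 19
  common = rep(c(0, 1), 50),      # balanced column, kept
  const  = rep(1, 100)            # constant column, dropped
)
flags <- apply(toy, 2, is_near_zero_var)
toy_filtered <- toy[, !flags, drop = FALSE]
colnames(toy_filtered)
```

Indexing with a logical mask also behaves sensibly when nothing is flagged, whereas negative indexing with an empty integer vector would drop every column.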
set.seed(1234)
train_selection <- createDataPartition(permeability, p = .8, list = FALSE)
train_x <- fingerprints_filtered[train_selection, ]
test_x <- fingerprints_filtered[-train_selection, ]
train_y <- permeability[train_selection, ]
test_y <- permeability[-train_selection, ]
PLS <- train(x = train_x, y = train_y, method = "pls", metric = "Rsquared",
             tuneLength = 10, trControl = trainControl(method = "cv"),
             preProcess = c("center", "scale"))
PLS_result <- PLS$results
PLS
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 120, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.76548 0.3949848 9.782056
## 2 11.08419 0.5662412 7.827891
## 3 11.02409 0.5704308 8.285652
## 4 11.16727 0.5556966 8.423255
## 5 11.09218 0.5573860 8.252132
## 6 10.96320 0.5653475 8.090110
## 7 11.06634 0.5592164 8.455477
## 8 11.21343 0.5559115 8.601945
## 9 11.24153 0.5635851 8.559953
## 10 11.63386 0.5463035 8.841897
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.
The final value used for the model was ncomp = 3, with a cross-validated Rsquared of 0.5704308.
fingerprint_predict <- predict(PLS, test_x)
postResample(fingerprint_predict, test_y)
## RMSE Rsquared MAE
## 14.033165 0.138877 9.495030
The test set estimate of Rsquared is 0.138877, well below the cross-validated value.
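To make the three reported metrics concrete, here is a minimal base R sketch of how they can be computed by hand. It assumes Rsquared is the squared Pearson correlation between predictions and observations, which is my understanding of caret's default; the vectors are made up.

```r
pred <- c(2.1, 3.9, 6.2, 7.8, 10.5)   # made-up predictions
obs  <- c(2.0, 4.0, 6.0, 8.0, 10.0)   # made-up observations

rmse     <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsquared <- cor(pred, obs)^2             # squared correlation (assumed definition)
mae      <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = rsquared, MAE = mae)
```

Because this Rsquared is correlation-based, it can stay high even when predictions are systematically off, so it is worth reading RMSE alongside it.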
set.seed(5678)
ridge_train <- train(x = train_x, y = train_y, method = "ridge", metric = "Rsquared",
                     tuneGrid = data.frame(.lambda = seq(0, 1, by = 0.1)),
                     trControl = trainControl(method = "cv"),
                     preProcess = c("center", "scale"))
## Warning: model fit failed for Fold10: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
ridge_train
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 119, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 13.49126 0.4365961 9.679907
## 0.1 11.50897 0.5413120 8.507836
## 0.2 11.69481 0.5387242 8.743099
## 0.3 12.07012 0.5403318 9.074433
## 0.4 12.55911 0.5419691 9.475287
## 0.5 13.13064 0.5434704 9.925781
## 0.6 13.77109 0.5440621 10.459382
## 0.7 14.46238 0.5446213 11.039901
## 0.8 15.19632 0.5449756 11.644364
## 0.9 15.96540 0.5450875 12.263755
## 1.0 16.75702 0.5452463 12.888947
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
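Ridge regression's optimal lambda is 1, with a cross-validated Rsquared of 0.5452463, which is slightly lower than PLS (0.5704308). Note that the RMSE at lambda = 1 (16.75702) is far higher than at lambda = 0.1 (11.50897), so selecting on Rsquared alone favors a heavily shrunken model here.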
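The resampling table above shows that the lambda with the largest Rsquared is not the lambda with the smallest RMSE. A minimal sketch, with the values copied by hand from that table into a data frame (the name `ridge_cv` is hypothetical):

```r
# Cross-validation results transcribed from the ridge output above
ridge_cv <- data.frame(
  lambda   = seq(0, 1, by = 0.1),
  RMSE     = c(13.49126, 11.50897, 11.69481, 12.07012, 12.55911, 13.13064,
               13.77109, 14.46238, 15.19632, 15.96540, 16.75702),
  Rsquared = c(0.4365961, 0.5413120, 0.5387242, 0.5403318, 0.5419691,
               0.5434704, 0.5440621, 0.5446213, 0.5449756, 0.5450875,
               0.5452463)
)

best_by_rsq  <- ridge_cv[which.max(ridge_cv$Rsquared), ]   # lambda = 1.0
best_by_rmse <- ridge_cv[which.min(ridge_cv$RMSE), ]       # lambda = 0.1
rbind(best_by_rsq, best_by_rmse)
```

Passing metric = "RMSE" to train() would make caret select on RMSE instead; whether that gives a better test set result here has not been checked.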
set.seed(789)
lars_train <- train(x = train_x, y = train_y, method = "lars", metric = "Rsquared",
                    tuneLength = 10, trControl = trainControl(method = "cv"),
                    preProcess = c("center", "scale"))
lars_train
## Least Angle Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 119, 120, 119, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.0500000 11.20680 0.5827230 7.887273
## 0.1555556 11.36698 0.5654115 8.342067
## 0.2611111 11.99233 0.5360780 8.742474
## 0.3666667 12.39335 0.5121641 8.931486
## 0.4722222 12.65635 0.4940958 9.037095
## 0.5777778 13.04394 0.4761195 9.296571
## 0.6833333 13.13094 0.4696574 9.386220
## 0.7888889 13.33400 0.4756693 9.433763
## 0.8944444 13.68104 0.4711790 9.704378
## 1.0000000 14.39200 0.4531616 10.183606
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
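The final value used for the model was fraction = 0.05, with a cross-validated Rsquared of 0.5827230, which is better than both PLS (0.5704308) and ridge regression (0.5452463).
The LARS model has the highest Rsquared of the three. Its RMSE (11.20680) is far lower than that of the selected ridge model (16.75702) but slightly higher than that of PLS (11.02409).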
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
We have 106 NAs across different variables.
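To see which variables the missing values fall in, the NAs can be counted per column. A minimal sketch on a made-up data frame (the column names are hypothetical); the same call pattern, colSums(is.na(...)), should apply to ChemicalManufacturingProcess.

```r
# Toy data with a few missing entries (made-up values)
toy <- data.frame(
  Yield    = c(40.1, 39.5, 41.2, 38.8),
  ProcessA = c(1.5, NA, 1.6, NA),
  ProcessB = c(NA, 200, 210, 205)
)

na_per_column <- colSums(is.na(toy))   # NA count for every column
na_per_column[na_per_column > 0]       # keep only columns that have NAs
```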
chemical_missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
chemical_imputed <- predict(chemical_missing, ChemicalManufacturingProcess)
Bag imputation fits a bagged tree model for each predictor, using all the other predictors as inputs, and fills in the missing values with that model's predictions.
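The idea behind bagged-tree imputation can be sketched for a single column with rpart, which ships with standard R installs. This is a simplified illustration on made-up data, not caret's actual implementation: trees are fit on bootstrap samples of the rows where the column is known, and their predictions for the missing rows are averaged. The names `toy`, `known`, `missing`, and `n_bags` are hypothetical.

```r
library(rpart)

set.seed(1)
n <- 60
toy <- data.frame(a = rnorm(n), b = rnorm(n))
toy$target <- 2 * toy$a - toy$b + rnorm(n, sd = 0.2)
toy$target[c(5, 17, 42)] <- NA              # knock out three values

known   <- toy[!is.na(toy$target), ]        # rows usable for fitting
missing <- toy[is.na(toy$target), ]         # rows to impute

n_bags <- 10
preds <- sapply(seq_len(n_bags), function(i) {
  boot <- known[sample(nrow(known), replace = TRUE), ]   # bootstrap sample
  fit  <- rpart(target ~ a + b, data = boot)             # one tree per bag
  predict(fit, newdata = missing)
})

# Average the per-tree predictions and fill in the gaps
toy$target[is.na(toy$target)] <- rowMeans(preds)
```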
set.seed(1357)
index <- createDataPartition(chemical_imputed$Yield, p = .8, list = FALSE)
train_chemical <- chemical_imputed[index, ]
test_chemical <- chemical_imputed[-index, ]
lm_model <- lm(Yield ~ ., chemical_imputed)
summary(lm_model)
##
## Call:
## lm(formula = Yield ~ ., data = chemical_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14184 -0.52928 -0.05022 0.48100 2.04082
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.443e+02 1.304e+02 1.107 0.27074
## BiologicalMaterial01 2.374e-01 3.351e-01 0.708 0.48011
## BiologicalMaterial02 -1.166e-01 1.254e-01 -0.929 0.35470
## BiologicalMaterial03 2.389e-01 2.326e-01 1.027 0.30653
## BiologicalMaterial04 -1.709e-01 5.229e-01 -0.327 0.74433
## BiologicalMaterial05 1.617e-01 1.041e-01 1.553 0.12310
## BiologicalMaterial06 -8.433e-02 2.970e-01 -0.284 0.77695
## BiologicalMaterial07 -1.460e+00 9.556e-01 -1.528 0.12928
## BiologicalMaterial08 6.953e-01 6.429e-01 1.081 0.28167
## BiologicalMaterial09 -1.360e+00 1.354e+00 -1.004 0.31724
## BiologicalMaterial10 1.998e-01 1.370e+00 0.146 0.88429
## BiologicalMaterial11 -8.302e-02 8.057e-02 -1.030 0.30494
## BiologicalMaterial12 3.457e-01 6.254e-01 0.553 0.58145
## ManufacturingProcess01 6.930e-02 9.338e-02 0.742 0.45949
## ManufacturingProcess02 2.978e-03 4.579e-02 0.065 0.94825
## ManufacturingProcess03 -4.229e+00 5.106e+00 -0.828 0.40923
## ManufacturingProcess04 6.193e-02 2.925e-02 2.117 0.03634 *
## ManufacturingProcess05 1.017e-03 3.804e-03 0.267 0.78974
## ManufacturingProcess06 3.489e-02 4.229e-02 0.825 0.41100
## ManufacturingProcess07 -1.776e-01 2.084e-01 -0.852 0.39580
## ManufacturingProcess08 -1.350e-01 2.506e-01 -0.539 0.59115
## ManufacturingProcess09 3.529e-01 1.777e-01 1.987 0.04927 *
## ManufacturingProcess10 -2.360e-02 5.568e-01 -0.042 0.96627
## ManufacturingProcess11 4.518e-01 7.323e-01 0.617 0.53840
## ManufacturingProcess12 8.256e-05 1.067e-04 0.774 0.44043
## ManufacturingProcess13 -1.906e-01 3.701e-01 -0.515 0.60746
## ManufacturingProcess14 -4.991e-04 1.072e-02 -0.047 0.96294
## ManufacturingProcess15 3.285e-03 8.596e-03 0.382 0.70299
## ManufacturingProcess16 -8.787e-05 3.346e-04 -0.263 0.79332
## ManufacturingProcess17 -1.532e-01 2.905e-01 -0.527 0.59887
## ManufacturingProcess18 4.710e-03 4.312e-03 1.092 0.27695
## ManufacturingProcess19 -1.100e-03 7.254e-03 -0.152 0.87975
## ManufacturingProcess20 -4.969e-03 4.574e-03 -1.086 0.27956
## ManufacturingProcess21 NA NA NA NA
## ManufacturingProcess22 -1.243e-02 4.115e-02 -0.302 0.76317
## ManufacturingProcess23 -4.783e-02 8.166e-02 -0.586 0.55916
## ManufacturingProcess24 -2.537e-02 2.302e-02 -1.102 0.27260
## ManufacturingProcess25 -1.906e-03 2.283e-03 -0.835 0.40538
## ManufacturingProcess26 2.883e-03 6.057e-03 0.476 0.63502
## ManufacturingProcess27 -6.446e-03 6.794e-03 -0.949 0.34462
## ManufacturingProcess28 -8.247e-02 3.070e-02 -2.687 0.00825 **
## ManufacturingProcess29 1.188e+00 9.069e-01 1.310 0.19263
## ManufacturingProcess30 -6.555e-01 5.903e-01 -1.110 0.26907
## ManufacturingProcess31 6.274e-02 1.155e-01 0.543 0.58790
## ManufacturingProcess32 3.237e-01 6.827e-02 4.741 5.96e-06 ***
## ManufacturingProcess33 -3.868e-01 1.274e-01 -3.036 0.00295 **
## ManufacturingProcess34 -1.251e+00 2.735e+00 -0.458 0.64813
## ManufacturingProcess35 -2.107e-02 1.724e-02 -1.222 0.22395
## ManufacturingProcess36 3.452e+02 3.079e+02 1.121 0.26449
## ManufacturingProcess37 -8.166e-01 2.889e-01 -2.826 0.00552 **
## ManufacturingProcess38 -2.175e-01 2.378e-01 -0.914 0.36232
## ManufacturingProcess39 8.136e-02 1.283e-01 0.634 0.52708
## ManufacturingProcess40 1.080e+00 6.466e+00 0.167 0.86761
## ManufacturingProcess41 -3.648e-01 4.690e+00 -0.078 0.93813
## ManufacturingProcess42 5.257e-02 2.062e-01 0.255 0.79923
## ManufacturingProcess43 1.957e-01 1.159e-01 1.689 0.09393 .
## ManufacturingProcess44 -8.385e-01 1.181e+00 -0.710 0.47906
## ManufacturingProcess45 9.891e-01 5.347e-01 1.850 0.06679 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 119 degrees of freedom
## Multiple R-squared: 0.7914, Adjusted R-squared: 0.6932
## F-statistic: 8.06 on 56 and 119 DF, p-value: < 2.2e-16
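The linear regression has a multiple R-squared of 0.7914 (adjusted R-squared 0.6932). One coefficient, ManufacturingProcess21, is not estimated because of singularities.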
lm_predict <- predict(lm_model, test_chemical[ ,-1])
## Warning in predict.lm(lm_model, test_chemical[, -1]): prediction from a
## rank-deficient fit may be misleading
postResample(lm_predict, test_chemical[ ,1])
## RMSE Rsquared MAE
## 0.8443454 0.8338351 0.6445676
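The Rsquared on the test set is 0.8338, slightly higher than the 0.7914 from the fit. Note that lm_model was fit on all of chemical_imputed rather than on train_chemical, so the test rows were part of the fitting data and this is not a true held-out estimate.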
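Since lm_model above is fit with data = chemical_imputed, the test rows are included in the fit. A minimal sketch of the held-out workflow on made-up data, fitting on the training rows only and scoring on the rest (all names here are hypothetical):

```r
set.seed(42)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

train_idx <- sample(n, size = 0.8 * n)   # 80/20 split
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

fit_train <- lm(y ~ ., data = toy_train)            # fit on training rows only
pred_test <- predict(fit_train, newdata = toy_test)
test_rsq  <- cor(pred_test, toy_test$y)^2           # held-out Rsquared
test_rsq
```

In this document's terms that would correspond to fitting with data = train_chemical; the resulting numbers would differ from those shown above and have not been computed here.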
Based on the model above, ManufacturingProcess32, 33, 37, and 28 have the smallest p-values (all below 0.01) and are therefore the most significant predictors.
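Rather than reading the p-values off the printed summary, the coefficient table can be sorted. A minimal sketch using the built-in mtcars data as a stand-in, since the pattern is the same for any lm fit:

```r
fit   <- lm(mpg ~ ., data = mtcars)
coefs <- summary(fit)$coefficients                    # estimate, SE, t, p-value
coefs <- coefs[rownames(coefs) != "(Intercept)", ]    # drop the intercept row

ranked <- coefs[order(coefs[, "Pr(>|t|)"]), ]         # smallest p-value first
head(ranked, 4)
```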
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
correlation <- cor(select(chemical_imputed, 'ManufacturingProcess32','ManufacturingProcess28','ManufacturingProcess33','ManufacturingProcess37','Yield'))
corrplot(correlation)
The strongest correlation in the plot, a positive one, appears to be between
ManufacturingProcess32 and ManufacturingProcess33; some negative correlations
are visible as well. To improve yield, the company would want to focus on the
process predictors, ManufacturingProcess32 in particular: it has the smallest
p-value and a positive coefficient in the linear model, so raising it could
potentially increase yield.
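The correlations with the response can also be read as numbers instead of from the plot. A minimal sketch on the built-in mtcars data as a stand-in; the chosen columns are arbitrary.

```r
predictors <- c("wt", "hp", "qsec", "drat")

# Correlation of each predictor with the response, as a named vector
cors <- cor(mtcars[, predictors], mtcars$mpg)[, 1]

# Strongest relationships first, regardless of sign
cors[order(abs(cors), decreasing = TRUE)]
```

A strong correlation with the response suggests where to look first, but it does not by itself show that changing that process setting would change the yield.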