#Data 624: HW 7 Kuhn and Johnson - problems 6.2 and 6.3.
#6.2: Developing a model to predict permeability
##6.2(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
###Exploring the data
Dimensions - Fingerprints
dim(fingerprints)
## [1] 165 1107
First rows - Permeability
head(permeability)
## permeability
## 1 12.520
## 2 1.120
## 3 19.405
## 4 1.730
## 5 1.680
## 6 0.510
##6.2(b) How many predictors are left for modeling?
###Loading library
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
###Creating a variable for predictors that have near-zero variance
near_zero <- nearZeroVar(fingerprints)
###Checking how many predictors there are
ncol(fingerprints)
## [1] 1107
###Checking how many predictors have near-zero variance
length(near_zero)
## [1] 719
###Filtering to create new filtered fingerprints data set
fingerprints2 <- fingerprints[, -near_zero]
###Checking how many predictors left after filtering
ncol(fingerprints2)
## [1] 388
There are 388 predictors left after filtering out the 719 near-zero-variance predictors.
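To make the filtering rule concrete, here is a minimal base-R sketch of the kind of check nearZeroVar performs. It assumes caret's documented defaults (a frequency-ratio cutoff of 95/5 and a unique-value cutoff of 10 percent). The function nzv_sketch and the toy matrix are made up for illustration, and this is a simplification rather than caret's actual implementation.

```r
# Simplified illustration of a near-zero-variance rule (not caret's code).
# A column is flagged when its most common value dominates the second most
# common one AND it has few distinct values relative to the sample size.
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) < 2) return(TRUE)            # constant column
    freq_ratio <- counts[[1]] / counts[[2]]         # most vs. second most common
    pct_unique <- 100 * length(counts) / length(col)
    freq_ratio > freq_cut && pct_unique < unique_cut
  })
  which(unname(flagged))
}

# Hypothetical toy fingerprints: one balanced, one rare, one constant column
toy <- cbind(
  balanced = rep(c(0, 1), each = 50),
  rare     = c(rep(0, 98), rep(1, 2)),
  constant = rep(0, 100)
)
nzv_sketch(toy)   # column indices that would be dropped
```

On the toy matrix the balanced column is kept, while the rare and constant columns are flagged, which mirrors why so many sparse fingerprint columns get removed here.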
##6.2(c) How many latent variables are optimal and what is the corresponding resampled estimate of R2?
###Splitting data into training and test set
set.seed(100)
train_data <- createDataPartition(permeability, p = 0.80, list = FALSE)
head(train_data)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 4
## [4,] 6
## [5,] 7
## [6,] 8
X_train <- fingerprints2[train_data, ]
X_test <- fingerprints2[-train_data, ]
Y_train <- permeability[train_data]
Y_test <- permeability[-train_data]
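For a numeric response, createDataPartition samples within groups of the outcome based on percentiles, so the training and test sets cover a similar range of permeability values. The base-R sketch below illustrates that idea under simplifying assumptions: stratified_split and the simulated y_toy are hypothetical, and the details will not match caret exactly.

```r
# Simplified stratified split: bin y by quantiles, then sample a fraction p
# of the rows inside each bin (illustration only, not caret's code).
stratified_split <- function(y, p = 0.8, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  }))
  sort(unname(idx))
}

set.seed(100)
y_toy <- rexp(165, rate = 1 / 12)        # skewed stand-in for permeability
train_idx <- stratified_split(y_toy)
length(train_idx) / length(y_toy)        # share of rows in the training set
```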
###Setting up resampling (10-fold cross-validation)
ctrl <- trainControl(method = "cv", number = 10)
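As a reminder of what 10-fold cross-validation does behind the scenes, here is a small base-R sketch. It uses a simple linear model on simulated data instead of PLS on the fingerprints; cv_rmse and toy are hypothetical names, and caret's fold construction differs in detail.

```r
# Minimal k-fold cross-validation: hold out each fold once, fit on the rest,
# and average the held-out RMSE values.
cv_rmse <- function(data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- vapply(seq_len(k), function(f) {
    fit  <- lm(y ~ x, data = data[folds != f, ])
    pred <- predict(fit, newdata = data[folds == f, ])
    sqrt(mean((data$y[folds == f] - pred)^2))
  }, numeric(1))
  mean(fold_rmse)
}

set.seed(100)
toy <- data.frame(x = runif(133))              # 133 rows, like the training set
toy$y <- 2 * toy$x + rnorm(133, sd = 0.5)
cv_estimate <- cv_rmse(toy)
cv_estimate                                    # resampled RMSE estimate
```

With 133 rows and 10 folds, each model is fit on roughly 119 or 120 rows, which is consistent with the sample sizes caret reports.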
###Tuning a PLS (Partial Least Squares)
set.seed(100)
pls_tune <- train(X_train, Y_train,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale"))
pls_tune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.85492 0.3362303 9.721289
## 2 11.82217 0.4436244 8.092500
## 3 11.71991 0.4718842 8.673662
## 4 11.56038 0.4864682 8.683859
## 5 11.40392 0.4845769 8.561044
## 6 11.08213 0.4956908 8.266156
## 7 11.21951 0.4861156 8.526250
## 8 11.25938 0.4897819 8.413176
## 9 11.37410 0.4889105 8.555701
## 10 11.55558 0.4772198 8.541903
## 11 11.77598 0.4596393 8.781978
## 12 11.90984 0.4588892 8.970828
## 13 12.38027 0.4339820 9.260477
## 14 12.59802 0.4194958 9.689526
## 15 12.94009 0.4009358 9.912681
## 16 13.27633 0.3805674 10.162833
## 17 13.48894 0.3741538 10.413769
## 18 13.60309 0.3690599 10.585876
## 19 13.60980 0.3653960 10.578503
## 20 13.66044 0.3613265 10.627887
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
plot(pls_tune)
The optimal number of latent variables is 6, which gives the lowest cross-validated RMSE (11.08). The corresponding resampled estimate of R2 is 0.4957, or about 0.50.
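The selection can be reproduced by hand from the resampling table. The sketch below re-enters the RMSE and Rsquared columns printed above and picks the row with the smallest RMSE, which is the rule caret reports using. The 5 percent tolerance rule at the end is my own illustrative alternative for choosing a simpler model, not something used in this analysis.

```r
# Resampled PLS results copied from the output above
pls_cv <- data.frame(
  ncomp = 1:20,
  RMSE = c(12.85492, 11.82217, 11.71991, 11.56038, 11.40392, 11.08213,
           11.21951, 11.25938, 11.37410, 11.55558, 11.77598, 11.90984,
           12.38027, 12.59802, 12.94009, 13.27633, 13.48894, 13.60309,
           13.60980, 13.66044),
  Rsquared = c(0.3362303, 0.4436244, 0.4718842, 0.4864682, 0.4845769,
               0.4956908, 0.4861156, 0.4897819, 0.4889105, 0.4772198,
               0.4596393, 0.4588892, 0.4339820, 0.4194958, 0.4009358,
               0.3805674, 0.3741538, 0.3690599, 0.3653960, 0.3613265)
)

# Rule used by caret here: smallest resampled RMSE
best <- pls_cv[which.min(pls_cv$RMSE), ]
best

# Illustrative alternative: simplest model within 5% of the best RMSE
within_tol <- pls_cv$RMSE <= 1.05 * min(pls_cv$RMSE)
parsimonious <- min(pls_cv$ncomp[within_tol])
parsimonious
```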
##6.2(d) “Predict the response for the test set. What is the test set estimate of R2?”
pls_prediction <- predict(pls_tune, X_test)
postResample(pred = pls_prediction, obs = Y_test)
## RMSE Rsquared MAE
## 11.773628 0.471816 8.672468
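For comparison, the cross-validated results at ncomp = 6 were RMSE 11.08213, Rsquared 0.4956908, MAE 8.266156.
The resampled and test-set estimates are close to each other. The test-set estimate of R2 is 0.47, compared with 0.50 from cross-validation, and the test RMSE (11.77) is only slightly higher than the resampled RMSE (11.08). Since the test performance is in line with what cross-validation predicted, I see no sign of overfitting.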
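For reference, the three numbers reported by postResample can be computed by hand. The sketch below is a base-R equivalent as I understand it (RMSE, squared correlation between predictions and observations, and mean absolute error); post_resample_sketch and the toy vectors are hypothetical.

```r
# Hand-rolled versions of the three summary metrics
post_resample_sketch <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,        # squared correlation, not 1 - SSE/SST
    MAE      = mean(abs(pred - obs)))
}

obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)
metrics <- post_resample_sketch(pred, obs)
metrics
```

Because Rsquared here is a squared correlation, it measures how well predictions track the observations and can look good even when RMSE is large, so it helps to read both together.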
##6.2(e) Try building other models discussed in this chapter. Do any have better predictive performance?
set.seed(100)
pcr_tune <- train(X_train, Y_train,
method = "pcr",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale"))
pcr_tune
## Principal Component Analysis
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.69636 0.1110173 11.495094
## 2 14.69643 0.1107412 11.545796
## 3 13.01748 0.3397017 9.946278
## 4 12.72074 0.3437014 9.441999
## 5 12.39655 0.3706625 8.972824
## 6 12.35651 0.3718052 8.944906
## 7 12.34481 0.3715405 8.857335
## 8 12.06702 0.4175700 8.418313
## 9 12.07646 0.4285779 8.537146
## 10 12.05970 0.4396163 8.716194
## 11 12.09726 0.4384438 8.748660
## 12 12.24633 0.4225441 8.758359
## 13 12.28194 0.4149305 8.778917
## 14 12.27708 0.4148031 8.796068
## 15 12.27271 0.4150780 8.817452
## 16 12.24219 0.4252113 8.806884
## 17 12.12211 0.4293419 8.706982
## 18 12.15276 0.4423403 8.738848
## 19 12.07120 0.4392150 8.726134
## 20 11.66452 0.4631430 8.472242
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
plot(pcr_tune)
pcr_prediction <- predict(pcr_tune, X_test)
postResample(pred = pcr_prediction, obs = Y_test)
## RMSE Rsquared MAE
## 12.4227765 0.4114892 9.4308226
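For comparison, the cross-validated results at ncomp = 20 were RMSE 11.66452, Rsquared 0.4631430, MAE 8.472242.
This model did not do as well as PLS. Its test-set R2 is 0.41 versus 0.47 for PLS, and it needed 20 components where PLS needed only 6. PCR builds its components without looking at y, so the leading components capture variance in the predictors that is not necessarily related to permeability, and more of them are needed.
Because 20 is the largest value in the tuning grid, checking whether a longer tune length pushes the selected number of components even higher: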
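The difference between the two methods can be illustrated on simulated data. In the sketch below, two highly correlated predictors dominate the variance of X while a third, independent predictor carries the signal. The first principal component follows the dominant variance, whereas the first PLS direction (proportional to X'y for centered data) follows the response. All names and the simulation setup are assumptions made for illustration.

```r
set.seed(1)
n <- 200
z <- rnorm(n)
X <- scale(cbind(
  x1 = z + rnorm(n, sd = 0.1),   # x1 and x2 share most of their variance
  x2 = z + rnorm(n, sd = 0.1),
  x3 = rnorm(n)                  # independent predictor that drives y
))
y <- X[, "x3"] + rnorm(n, sd = 0.3)

# First PCR component: direction of maximum variance in X (ignores y)
pc1 <- prcomp(X)$x[, 1]

# First PLS component: weights proportional to the covariance of X with y
w <- crossprod(X, y - mean(y))
w <- w / sqrt(sum(w^2))
pls1 <- drop(X %*% w)

r2 <- c(pcr = cor(pc1, y)^2, pls = cor(pls1, y)^2)
r2
```

A single PLS component captures most of the signal in this setup, while PCR would need to reach a later component to find it.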
set.seed(100)
pcr_tune2 <- train(X_train, Y_train,
method = "pcr",
tuneLength = 50,
trControl = ctrl,
preProc = c("center", "scale"))
pcr_tune2
## Principal Component Analysis
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 14.69636 0.1110173 11.495094
## 2 14.69643 0.1107412 11.545796
## 3 13.01748 0.3397017 9.946278
## 4 12.72074 0.3437014 9.441999
## 5 12.39655 0.3706625 8.972824
## 6 12.35651 0.3718052 8.944906
## 7 12.34481 0.3715405 8.857335
## 8 12.06702 0.4175700 8.418313
## 9 12.07646 0.4285779 8.537146
## 10 12.05970 0.4396163 8.716194
## 11 12.09726 0.4384438 8.748660
## 12 12.24633 0.4225441 8.758359
## 13 12.28194 0.4149305 8.778917
## 14 12.27708 0.4148031 8.796068
## 15 12.27271 0.4150780 8.817452
## 16 12.24219 0.4252113 8.806884
## 17 12.12211 0.4293419 8.706982
## 18 12.15276 0.4423403 8.738848
## 19 12.07120 0.4392150 8.726134
## 20 11.66452 0.4631430 8.472242
## 21 11.83560 0.4600169 8.846798
## 22 11.89061 0.4590446 8.858161
## 23 11.90705 0.4502853 8.795143
## 24 11.89356 0.4520845 8.782471
## 25 11.76709 0.4646409 8.647535
## 26 11.95438 0.4516535 8.833433
## 27 11.50474 0.4790646 8.472424
## 28 11.36204 0.4846198 8.494791
## 29 11.38455 0.4852023 8.508798
## 30 11.54396 0.4763228 8.613945
## 31 11.55421 0.4684521 8.699147
## 32 11.15536 0.4940905 8.420403
## 33 11.04658 0.5043588 8.414902
## 34 10.96711 0.5085231 8.370172
## 35 10.97881 0.5096122 8.386033
## 36 11.22650 0.4943005 8.552872
## 37 11.19429 0.5030008 8.460189
## 38 11.20168 0.5044827 8.407628
## 39 10.89835 0.5213800 8.110102
## 40 10.84547 0.5248868 8.069804
## 41 11.00378 0.5071243 8.275885
## 42 11.24342 0.4924778 8.552700
## 43 11.27461 0.4900943 8.616688
## 44 11.34000 0.4904779 8.671189
## 45 11.49380 0.4855984 8.663002
## 46 11.58815 0.4843147 8.679616
## 47 11.75285 0.4740361 8.758917
## 48 11.70395 0.4753790 8.706313
## 49 11.74894 0.4720810 8.701926
## 50 11.76481 0.4687003 8.851439
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 40.
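It does go up: with tuneLength = 50 the selected number of components rises to 40 (RMSE 10.85, R2 0.52). PLS reaches comparable resampled performance (RMSE 11.08, R2 0.50) with only 6 components because it uses y when building its components, so I still prefer the PLS model in this case.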
###Ridge Model
ridgeGrid <- expand.grid(lambda = seq(0.001, 1, length = 20))
set.seed(100)
ridge_tune <- train(X_train, Y_train,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale"))
ridge_tune
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100000 421.85554 0.1085090 272.873407
## 0.05357895 12.62825 0.4189466 9.454504
## 0.10615789 12.05244 0.4600003 8.985801
## 0.15873684 11.88822 0.4790265 8.890134
## 0.21131579 11.87008 0.4903643 8.904842
## 0.26389474 11.93611 0.4979491 8.972070
## 0.31647368 12.05556 0.5031442 9.058626
## 0.36905263 12.21199 0.5070957 9.169629
## 0.42163158 12.39878 0.5101101 9.351295
## 0.47421053 12.61096 0.5125018 9.562080
## 0.52678947 12.84595 0.5144065 9.787019
## 0.57936842 13.09832 0.5159385 10.014813
## 0.63194737 13.36932 0.5171686 10.247180
## 0.68452632 13.65409 0.5181938 10.488730
## 0.73710526 13.95471 0.5190714 10.738073
## 0.78968421 14.26324 0.5196712 10.982823
## 0.84226316 14.58465 0.5202015 11.232557
## 0.89484211 14.91564 0.5206326 11.505148
## 0.94742105 15.25572 0.5209462 11.783635
## 1.00000000 15.60323 0.5211916 12.068218
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2113158.
plot(ridge_tune)
ridge_prediction <- predict(ridge_tune, X_test)
postResample(pred = ridge_prediction, obs = Y_test)
## RMSE Rsquared MAE
## 12.0135032 0.5235396 8.5961594
###Lasso Model
glmnetGrid <- expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 20))
set.seed(100)
lasso_tune <- train(X_train, Y_train,
method = "glmnet",
tuneGrid = glmnetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
lasso_tune
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100000 12.21709 0.4299361 8.942562
## 0.05357895 12.21709 0.4299361 8.942562
## 0.10615789 12.06935 0.4378085 8.815047
## 0.15873684 11.61041 0.4656562 8.390984
## 0.21131579 11.39613 0.4800470 8.142893
## 0.26389474 11.27804 0.4875442 8.064225
## 0.31647368 11.24578 0.4876755 8.014870
## 0.36905263 11.19600 0.4908586 7.989136
## 0.42163158 11.17822 0.4919982 7.965935
## 0.47421053 11.14488 0.4950772 7.888418
## 0.52678947 11.12161 0.4973859 7.830690
## 0.57936842 11.11436 0.4981588 7.795229
## 0.63194737 11.14309 0.4968463 7.788788
## 0.68452632 11.18116 0.4952827 7.790848
## 0.73710526 11.20972 0.4952494 7.796949
## 0.78968421 11.24824 0.4946068 7.804676
## 0.84226316 11.29505 0.4930287 7.822094
## 0.89484211 11.34232 0.4913345 7.841994
## 0.94742105 11.38565 0.4903604 7.861293
## 1.00000000 11.42119 0.4902964 7.863809
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5793684.
plot(lasso_tune)
lasso_prediction <- predict(lasso_tune, X_test)
postResample(pred = lasso_prediction, obs = Y_test)
## RMSE Rsquared MAE
## 11.8572660 0.4637412 8.4616371
###Elastic Net Model
enetGrid <- expand.grid(alpha = seq(0.1, 0.80, length = 5), lambda = seq(0.001, 1, length = 20))
set.seed(100)
enet_tune <- train(X_train, Y_train,
method = "glmnet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
enet_tune
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.100 0.00100000 11.43210 0.4722095 8.199503
## 0.100 0.05357895 11.43210 0.4722095 8.199503
## 0.100 0.10615789 11.43210 0.4722095 8.199503
## 0.100 0.15873684 11.43210 0.4722095 8.199503
## 0.100 0.21131579 11.43210 0.4722095 8.199503
## 0.100 0.26389474 11.43210 0.4722095 8.199503
## 0.100 0.31647368 11.43210 0.4722095 8.199503
## 0.100 0.36905263 11.43210 0.4722095 8.199503
## 0.100 0.42163158 11.43210 0.4722095 8.199503
## 0.100 0.47421053 11.43210 0.4722095 8.199503
## 0.100 0.52678947 11.43210 0.4722095 8.199503
## 0.100 0.57936842 11.43210 0.4722095 8.199503
## 0.100 0.63194737 11.43210 0.4722095 8.199503
## 0.100 0.68452632 11.43210 0.4722095 8.199503
## 0.100 0.73710526 11.43210 0.4722095 8.199503
## 0.100 0.78968421 11.43210 0.4722095 8.199503
## 0.100 0.84226316 11.43210 0.4722095 8.199503
## 0.100 0.89484211 11.42824 0.4724875 8.199392
## 0.100 0.94742105 11.41643 0.4733132 8.193682
## 0.100 1.00000000 11.39075 0.4746886 8.164284
## 0.275 0.00100000 11.74481 0.4533784 8.514164
## 0.275 0.05357895 11.74481 0.4533784 8.514164
## 0.275 0.10615789 11.74481 0.4533784 8.514164
## 0.275 0.15873684 11.74481 0.4533784 8.514164
## 0.275 0.21131579 11.74481 0.4533784 8.514164
## 0.275 0.26389474 11.74481 0.4533784 8.514164
## 0.275 0.31647368 11.74267 0.4534946 8.512868
## 0.275 0.36905263 11.67235 0.4573581 8.454410
## 0.275 0.42163158 11.51670 0.4675921 8.289002
## 0.275 0.47421053 11.38890 0.4763297 8.145190
## 0.275 0.52678947 11.32318 0.4812475 8.053562
## 0.275 0.57936842 11.29304 0.4837829 7.990588
## 0.275 0.63194737 11.26683 0.4859662 7.946107
## 0.275 0.68452632 11.24269 0.4875761 7.918856
## 0.275 0.73710526 11.22379 0.4886842 7.905719
## 0.275 0.78968421 11.19919 0.4903021 7.886597
## 0.275 0.84226316 11.17045 0.4923623 7.866079
## 0.275 0.89484211 11.14064 0.4946254 7.844978
## 0.275 0.94742105 11.11162 0.4967938 7.829251
## 0.275 1.00000000 11.08585 0.4986491 7.820341
## 0.450 0.00100000 11.84661 0.4474280 8.639628
## 0.450 0.05357895 11.84661 0.4474280 8.639628
## 0.450 0.10615789 11.84661 0.4474280 8.639628
## 0.450 0.15873684 11.84661 0.4474280 8.639628
## 0.450 0.21131579 11.82330 0.4487202 8.618697
## 0.450 0.26389474 11.57111 0.4637140 8.363749
## 0.450 0.31647368 11.42322 0.4742652 8.197355
## 0.450 0.36905263 11.34498 0.4809515 8.066326
## 0.450 0.42163158 11.29052 0.4845973 8.002124
## 0.450 0.47421053 11.24403 0.4877677 7.973511
## 0.450 0.52678947 11.20008 0.4907607 7.940339
## 0.450 0.57936842 11.16419 0.4935389 7.910367
## 0.450 0.63194737 11.13170 0.4953902 7.880784
## 0.450 0.68452632 11.10414 0.4972038 7.864232
## 0.450 0.73710526 11.07184 0.4995633 7.841399
## 0.450 0.78968421 11.04176 0.5017072 7.811803
## 0.450 0.84226316 11.02093 0.5032174 7.794738
## 0.450 0.89484211 11.00914 0.5045763 7.783117
## 0.450 0.94742105 11.00375 0.5054265 7.764687
## 0.450 1.00000000 11.00150 0.5059166 7.738962
## 0.625 0.00100000 11.92042 0.4434555 8.725955
## 0.625 0.05357895 11.92042 0.4434555 8.725955
## 0.625 0.10615789 11.92042 0.4434555 8.725955
## 0.625 0.15873684 11.87656 0.4456902 8.691825
## 0.625 0.21131579 11.52602 0.4673424 8.325183
## 0.625 0.26389474 11.38942 0.4779188 8.170552
## 0.625 0.31647368 11.29412 0.4846900 8.050341
## 0.625 0.36905263 11.22700 0.4892367 7.990826
## 0.625 0.42163158 11.18251 0.4925249 7.955462
## 0.625 0.47421053 11.16303 0.4929859 7.928071
## 0.625 0.52678947 11.12698 0.4953648 7.904856
## 0.625 0.57936842 11.08325 0.4986268 7.871817
## 0.625 0.63194737 11.05813 0.5007434 7.847826
## 0.625 0.68452632 11.04616 0.5021441 7.816898
## 0.625 0.73710526 11.03357 0.5037805 7.769634
## 0.625 0.78968421 11.02607 0.5051979 7.733287
## 0.625 0.84226316 11.02108 0.5060274 7.709230
## 0.625 0.89484211 11.01927 0.5064586 7.693913
## 0.625 0.94742105 11.02224 0.5069367 7.676267
## 0.625 1.00000000 11.03757 0.5066237 7.667829
## 0.800 0.00100000 11.98373 0.4402705 8.781762
## 0.800 0.05357895 11.98373 0.4402705 8.781762
## 0.800 0.10615789 11.98373 0.4402705 8.781762
## 0.800 0.15873684 11.60605 0.4627230 8.430325
## 0.800 0.21131579 11.40537 0.4769057 8.205472
## 0.800 0.26389474 11.28280 0.4855796 8.059906
## 0.800 0.31647368 11.20558 0.4915130 7.997411
## 0.800 0.36905263 11.19535 0.4913605 7.983860
## 0.800 0.42163158 11.15766 0.4934563 7.946361
## 0.800 0.47421053 11.11693 0.4966345 7.924234
## 0.800 0.52678947 11.09716 0.4986381 7.888670
## 0.800 0.57936842 11.06768 0.5014476 7.806330
## 0.800 0.63194737 11.04342 0.5040800 7.741690
## 0.800 0.68452632 11.03275 0.5052358 7.712522
## 0.800 0.73710526 11.03241 0.5057306 7.690563
## 0.800 0.78968421 11.05545 0.5049038 7.683817
## 0.800 0.84226316 11.08710 0.5037192 7.689270
## 0.800 0.89484211 11.11201 0.5034363 7.697575
## 0.800 0.94742105 11.13789 0.5033287 7.702531
## 0.800 1.00000000 11.17337 0.5026044 7.707972
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.45 and lambda = 1.
plot(enet_tune)
enet_prediction <- predict(enet_tune, X_test)
postResample(pred = enet_prediction, obs = Y_test)
## RMSE Rsquared MAE
## 11.7079433 0.4763978 8.2892487
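To compare the models side by side, the test-set results printed above can be collected into one table. The values are copied from the postResample outputs in this section; the PCR row is the 20-component model, since the 40-component fit was not evaluated on the test set.

```r
# Test-set metrics copied from the postResample outputs above
test_results <- data.frame(
  model    = c("PLS", "PCR", "Ridge", "Lasso", "Elastic Net"),
  RMSE     = c(11.773628, 12.4227765, 12.0135032, 11.8572660, 11.7079433),
  Rsquared = c(0.471816, 0.4114892, 0.5235396, 0.4637412, 0.4763978),
  MAE      = c(8.672468, 9.4308226, 8.5961594, 8.4616371, 8.2892487)
)

# Sort from lowest to highest test RMSE
test_results[order(test_results$RMSE), ]
```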
Looking at the test-set Rsquared values, the Ridge model did the best with the highest value (0.52), meaning it explains about 52% of the variance in the test-set permeability values. Ridge shrinks coefficients but keeps every predictor in the model, so its doing best suggests that for this dataset it is more effective to keep the predictors than to eliminate them. By test-set RMSE, however, the Elastic Net (11.71) and PLS (11.77) are slightly ahead of Ridge (12.01), so overall the models performed very similarly.
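The point about Ridge keeping predictors while the Lasso drops them can be shown with a small base-R sketch. It uses the closed-form ridge solution and a basic coordinate-descent lasso on simulated data. The functions ridge_cf, lasso_cd, and soft are hypothetical, and the lambda values are on different scales from each other and from the ones caret tunes, so only the qualitative behavior carries over.

```r
# Soft-thresholding operator used by the lasso update
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Lasso via coordinate descent for (1/(2n)) * RSS + lambda * sum(|b|)
lasso_cd <- function(X, y, lambda, iters = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  col_ss <- colSums(X^2) / n
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      r_j <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
      b[j] <- soft(sum(X[, j] * r_j) / n, lambda) / col_ss[j]
    }
  }
  b
}

# Ridge closed form: (X'X + lambda * I)^(-1) X'y
ridge_cf <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

set.seed(42)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)   # only two predictors matter
y <- y - mean(y)

b_ridge <- ridge_cf(X, y, lambda = 10)
b_lasso <- lasso_cd(X, y, lambda = 0.5)
c(ridge_nonzero = sum(b_ridge != 0), lasso_nonzero = sum(b_lasso != 0))
```

Ridge shrinks all ten coefficients toward zero without removing any, while the lasso sets the coefficients of uninformative predictors to exactly zero.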
##6.2(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend any of these models. They explain only about half of the variance in permeability, which in this context is not good enough to replace lab experiments. A model like this might work as a screening tool, but I would not trust it for any actual predictive purposes.
#6.3: The relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield
##6.3(a): Start R and use these commands to load the data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.7 14.7 14.7 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
colSums(is.na(ChemicalManufacturingProcess))
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
##6.3(b): A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values
predictors_only <- ChemicalManufacturingProcess[, -1]
imputer <- preProcess(predictors_only, method = "knnImpute")  # knnImpute also centers and scales the predictors
predictors_impt <- predict(imputer, predictors_only)
sum(is.na(predictors_impt))
## [1] 0
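To show what k-nearest-neighbor imputation is doing, here is a simplified base-R sketch: the data are centered and scaled, and each missing cell is filled with the average of that column over the k closest complete rows, with distance measured on the columns the incomplete row does have. The function knn_impute_sketch and the toy data are hypothetical, and caret's implementation differs in its details.

```r
# Simplified kNN imputation on centered and scaled data (illustration only)
knn_impute_sketch <- function(df, k = 5) {
  X <- scale(as.matrix(df))                      # scale() ignores NAs here
  complete <- X[complete.cases(X), , drop = FALSE]
  for (i in which(!complete.cases(X))) {
    miss <- is.na(X[i, ])
    # Euclidean distance to every complete row, using observed columns only
    d <- sqrt(colSums((t(complete[, !miss, drop = FALSE]) - X[i, !miss])^2))
    nn <- order(d)[seq_len(min(k, length(d)))]
    X[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  X
}

set.seed(7)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b[c(3, 10)] <- NA
toy$c[5] <- NA

imputed <- knn_impute_sketch(toy)
sum(is.na(imputed))                              # remaining missing cells
```

As with predictors_impt above, the result is on the centered and scaled scale, not the original units.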