Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Load the fingerprint predictors and the permeability target variable, and confirm that the number of observations is the same for both.
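The loading chunk is not echoed in this output; a minimal sketch of how the data could be loaded and checked, assuming the permeability data set from the AppliedPredictiveModeling package, is:

library(AppliedPredictiveModeling)
data(permeability)   # loads the 'fingerprints' and 'permeability' matrices
nrow(fingerprints)
nrow(permeability)
ncol(fingerprints)
ncol(permeability)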
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.3
## [1] 165
## [1] 165
## [1] 1107
## [1] 1
There are 165 observations, 1,107 fingerprint predictors, and 1 target variable.
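Next, filter out the sparse fingerprints. The filtering chunk itself is not shown; a sketch using caret::nearZeroVar (the name fingerprints_filtered is my own) could look like:

library(caret)
# drop fingerprints with near-zero variance
lowvar <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[, -lowvar]
ncol(fingerprints_filtered)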
## Loading required package: lattice
## Loading required package: ggplot2
## [1] 388
After removing the low-frequency fingerprints we are left with 388 predictors. nearZeroVar flags variables with zero variance, or with very few unique values relative to the number of samples.
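The split-and-tune chunk is also hidden. Below is a sketch of how plsmodel, train_X/train_Y, test_X/test_Y, and cv_cntrl (names used later in this document) could have been created; the 75/25 split and 10-fold cross-validation are assumptions on my part:

set.seed(1)
inTrain <- createDataPartition(permeability[, 1], p = 0.75, list = FALSE)
train_X <- fingerprints_filtered[inTrain, ]
train_Y <- permeability[inTrain, 1]
test_X <- fingerprints_filtered[-inTrain, ]
test_Y <- permeability[-inTrain, 1]
# 10-fold cross-validation to tune the number of PLS components
cv_cntrl <- trainControl(method = "cv", number = 10)
plsmodel <- train(train_X, train_Y, method = "pls", tuneLength = 20, trControl = cv_cntrl)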
## Warning: package 'pls' was built under R version 4.0.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
plsmodel$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 13.55228 0.2889906 10.254144 1.970636 0.2661762 1.788984
## 2 2 12.31313 0.4143172 8.752912 2.427410 0.2249563 1.800905
## 3 3 11.96318 0.4350040 8.954915 2.523226 0.2328929 1.708100
## 4 4 12.25871 0.4192916 9.427148 2.072124 0.1812841 1.433524
## 5 5 12.03421 0.4585351 9.103037 2.433273 0.1901322 1.613929
## 6 6 11.86129 0.4711070 8.938106 2.211652 0.1614947 1.895656
## 7 7 11.61248 0.4945463 8.982042 1.887481 0.1305865 1.526076
## 8 8 11.34235 0.5103917 8.972807 1.720665 0.1430526 1.395402
## 9 9 11.27453 0.5145793 8.778225 1.652229 0.1404113 1.087510
## 10 10 11.22447 0.5227839 8.628987 1.967255 0.1421539 1.349183
## 11 11 11.53618 0.5061332 8.963147 2.034791 0.1384892 1.417809
## 12 12 11.75938 0.4929538 9.040591 2.249614 0.1500689 1.602299
## 13 13 12.12157 0.4754017 9.256365 2.268663 0.1518130 1.686388
## 14 14 12.31703 0.4664117 9.444963 2.389078 0.1484254 1.732642
## 15 15 12.59417 0.4428841 9.671427 2.873839 0.1694875 2.060419
## 16 16 13.15339 0.4029050 9.902577 2.949224 0.1895255 2.065107
## 17 17 13.38994 0.3919127 10.051180 3.398321 0.2004601 2.397669
## 18 18 13.52542 0.3865039 10.180677 3.527901 0.2049565 2.592259
## 19 19 13.72755 0.3689978 10.269883 3.421158 0.2163252 2.389133
## 20 20 13.96971 0.3538878 10.368577 3.380498 0.2212317 2.443328
optimalcomp <- which.min(plsmodel$results[,'RMSE'])
The optimal number of components is:
optimalcomp
## [1] 10
The cross-validated RMSE and R-squared for the optimal number of components are:
optimal_rmse <- plsmodel$results[optimalcomp, 'RMSE']
optimal_r2 <- plsmodel$results[optimalcomp, 'Rsquared']
optimal_rmse
## [1] 11.22447
optimal_r2
## [1] 0.5227839
# Build a DF
trainingData <- as.data.frame(train_X)
trainingData$Perm <- train_Y
plsFit <- plsr(Perm ~., data=trainingData, ncomp=optimalcomp)
plsPred <- predict(plsFit, test_X, ncomp=optimalcomp)
head(plsPred)
## , , 10 comps
##
## Perm
## 3 17.733048
## 4 -4.657418
## 12 4.323120
## 13 2.679132
## 21 14.398584
## 26 23.222364
postResample(pred = plsPred, obs = test_Y)
## RMSE Rsquared MAE
## 12.485482 0.483895 9.965692
I will also fit ridge, lasso, and elastic net models using the glmnet package.
When alpha = 0, the lasso component drops out and only the ridge penalty is active.
When alpha = 1, the ridge component drops out and only the lasso penalty is active.
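For reference, glmnet solves the penalized least-squares problem

$$\min_{\beta_0,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{T}\beta\right)^{2}+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right],$$

so alpha = 0 gives pure ridge, alpha = 1 gives pure lasso, and cv.glmnet selects lambda by cross-validation.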
Let's start with ridge.
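The ridge chunk is not echoed; a sketch mirroring the lasso call shown below (using lambda.1se as the selected penalty, which is an assumption) is:

library(glmnet)
ridge.fit <- cv.glmnet(train_X, train_Y, type.measure = "mse", alpha = 0, family = "gaussian")
ridge.pred <- predict(ridge.fit, s = ridge.fit$lambda.1se, newx = test_X)
postResample(pred = ridge.pred, obs = test_Y)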
## Warning: package 'glmnet' was built under R version 4.0.3
## Loading required package: Matrix
## Loaded glmnet 4.0-2
## RMSE Rsquared MAE
## 11.3731538 0.6094073 9.0394251
Next, we will try the lasso.
lasso.fit <- cv.glmnet(train_X, train_Y, type.measure = "mse", alpha= 1 , family="gaussian" )
lasso.pred <- predict(lasso.fit, s=lasso.fit$lambda.1se, newx = test_X)
postResample(pred = lasso.pred, obs = test_Y)
## RMSE Rsquared MAE
## 12.003534 0.448561 9.330812
Next, the elastic net. I tried different values of alpha between 0 and 1; the optimal value was found to be about 0.5, with an R-squared of roughly 0.5.
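The alpha search itself is not shown; a minimal sketch of how it could be done (for a strictly fair comparison the fold assignments should be fixed with foldid) is:

alphas <- seq(0, 1, by = 0.1)
cv_err <- sapply(alphas, function(a) {
  fit <- cv.glmnet(train_X, train_Y, type.measure = "mse", alpha = a, family = "gaussian")
  min(fit$cvm)   # best cross-validated MSE for this alpha
})
alphas[which.min(cv_err)]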
elasticnet.fit <- cv.glmnet(train_X, train_Y, type.measure = "mse", alpha= 0.5 , family="gaussian" )
elasticnet.pred <- predict(elasticnet.fit, s=elasticnet.fit$lambda.1se, newx = test_X)
postResample(pred = elasticnet.pred, obs = test_Y)
## RMSE Rsquared MAE
## 12.3168372 0.6202373 9.6740307
Comparing the test-set results, the ridge regression model gives the lowest RMSE (about 11.4) with an R-squared of about 0.61, so among these models ridge performs best overall.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Let's look at a sample of the data frame. There are 58 columns, and the target variable is Yield.
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
ncol(ChemicalManufacturingProcess)
## [1] 58
Let's do some preprocessing and clean the data. First, we check whether there are any missing values.
Total number of observations:
nrow(ChemicalManufacturingProcess)
## [1] 176
There are 176 rows in this dataset. Of those, 24 rows contain NAs, and there are 106 NA occurrences in total.
Total number of NAs
length(which(is.na(ChemicalManufacturingProcess)))
## [1] 106
Total number of rows with NA
length(which(!complete.cases(ChemicalManufacturingProcess)))
## [1] 24
Let's impute the missing data using bagged-tree imputation.
# bagged-tree imputation on the predictors (Yield, column 1, is excluded)
impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
imputed <- predict(impute, ChemicalManufacturingProcess[,-c(1)])
Next, let's continue preprocessing and remove uninformative predictors.
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
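The chunk that flags low-variance predictors is not shown; a sketch using caret::nearZeroVar, producing the imputed_lowvar data frame used below, could be:

lowvar <- nearZeroVar(imputed, names = TRUE)
lowvar
imputed_lowvar <- imputed[, !(names(imputed) %in% lowvar)]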
## [1] "BiologicalMaterial07"
BiologicalMaterial07 has near-zero variance, so we remove it. Next, let's reduce multicollinearity by removing highly correlated pairs of variables.
correlations <- cor(imputed_lowvar)
highCorr <- findCorrelation(correlations, names=TRUE, cutoff=0.9)
(highCorr)
## [1] "BiologicalMaterial02" "BiologicalMaterial04" "BiologicalMaterial12"
## [4] "ManufacturingProcess29" "ManufacturingProcess42" "ManufacturingProcess27"
## [7] "ManufacturingProcess25" "ManufacturingProcess31" "ManufacturingProcess18"
## [10] "ManufacturingProcess40"
highCorr <- findCorrelation(correlations, cutoff=0.9)
imputed <- imputed_lowvar[,-highCorr]
Splitting the dataset into training and test data sets.
# Train/test split, 25% held out for testing
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.75, list=FALSE)
trainData <- imputed[trainRow, ]
trainData_X <- imputed[trainRow, ]
trainData$Yield <- ChemicalManufacturingProcess[trainRow, ]$Yield
testData <- imputed[-trainRow, ]
testData_X <- imputed[-trainRow, ]
testData$Yield <- ChemicalManufacturingProcess[-trainRow, ]$Yield
Fitting a PLS model and assessing its cross-validated performance.
plsmodel<- train(trainData_X, trainData$Yield,
method="pls",
tuneLength=20,
trControl = cv_cntrl)
plot(plsmodel)
plsmodel$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 1.627390 0.2012777 1.311968 0.2542385 0.2201262 0.2308777
## 2 2 1.621468 0.2050440 1.305967 0.2493284 0.2191528 0.2281001
## 3 3 2.301524 0.1669201 1.588843 2.2200509 0.2054500 0.9613340
## 4 4 5.223241 0.1775648 2.630489 10.9359216 0.1823202 4.0247246
## 5 5 8.510862 0.1782906 3.612193 14.7463940 0.1851440 5.1450703
## 6 6 7.583476 0.2029144 3.199323 13.0670376 0.2048655 4.1587077
## 7 7 5.830914 0.2908409 2.516072 9.7857047 0.2248824 2.9892242
## 8 8 4.922830 0.2980774 2.136187 7.7220214 0.2308815 2.2661297
## 9 9 4.857704 0.3890628 2.220807 7.6416609 0.2654233 2.6278505
## 10 10 4.113983 0.4732698 1.841474 6.4016219 0.2620997 1.9637427
## 11 11 3.569480 0.4835020 1.661748 5.3442612 0.2567280 1.6282935
## 12 12 3.395403 0.5266141 1.685316 4.9416988 0.2816553 1.7339094
## 13 13 3.403486 0.5321589 1.579610 5.1800569 0.2877901 1.5884963
## 14 14 2.906176 0.5558700 1.448003 4.3328315 0.2742609 1.3675248
## 15 15 2.605877 0.5557473 1.381274 3.5726329 0.2752953 1.2004795
## 16 16 2.975488 0.5578095 1.508461 4.3445101 0.2884171 1.4684815
## 17 17 3.095662 0.5529388 1.540630 4.6574418 0.2880221 1.5173538
## 18 18 3.635812 0.5320253 1.709054 5.6021668 0.2861593 1.8298733
## 19 19 3.745813 0.5365047 1.748883 5.8128208 0.2846875 1.9114805
## 20 20 3.843954 0.5323276 1.799170 5.9539847 0.2924349 2.0146958
optimalcomp <- which.min(plsmodel$results[,'RMSE'])
optimal_rmse <- plsmodel$results[optimalcomp, 'RMSE']
optimal_r2 <- plsmodel$results[optimalcomp, 'Rsquared']
optimal_rmse
## [1] 1.621468
optimal_r2
## [1] 0.205044
From the PLS model we get a fairly low R-squared (about 0.21), indicating the model captures only about 20% of the variability in the response.
plsFit <- plsr(Yield ~., data=trainData, ncomp=optimalcomp)
plsPred <- predict(plsFit, testData_X, ncomp=optimalcomp)
postResample(pred = plsPred, obs = testData$Yield)
## RMSE Rsquared MAE
## 1.97180761 0.06492146 1.60536298
The test-set R-squared is very poor, so the PLS model is not a good fit here. We may need to try some other models, such as ridge, lasso, and elastic net.
Let's try ridge first.
ridge.fit <- cv.glmnet(as.matrix(trainData_X), trainData$Yield, type.measure = "mse", alpha= 0 , family="gaussian" )
ridge.pred <- predict(ridge.fit, s=ridge.fit$lambda.1se, newx = as.matrix(testData_X))
postResample(pred = ridge.pred, obs = testData$Yield)
## RMSE Rsquared MAE
## 1.7643785 0.5897654 1.4074020
Next, let's try the lasso.
lasso.fit <- cv.glmnet(as.matrix(trainData_X), trainData$Yield, type.measure = "mse", alpha=1 , family="gaussian" )
lasso.pred <- predict(lasso.fit, s=lasso.fit$lambda.1se, newx = as.matrix(testData_X))
postResample(pred = lasso.pred, obs = testData$Yield)
## RMSE Rsquared MAE
## 1.3091127 0.6179137 1.0385829
Finally, we will try the elastic net.
elasticnet.fit <- cv.glmnet(as.matrix(trainData_X), trainData$Yield, type.measure = "mse", alpha= 0.5 , family="gaussian" )
elasticnet.pred <- predict(elasticnet.fit, s=elasticnet.fit$lambda.1se, newx = as.matrix(testData_X))
postResample(pred = elasticnet.pred, obs = testData$Yield)
## RMSE Rsquared MAE
## 1.461456 0.587580 1.177213
Of the models tried, the lasso gives the lowest test-set RMSE and the highest R-squared, so it is the model carried forward to the variable-importance analysis.

#### e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
# glmnet's coefficient-based importance at the chosen lambda
lassoImp <- caret::getModelInfo("glmnet")$glmnet$varImp(lasso.fit, lambda = lasso.fit$lambda.1se, scale = FALSE)
# keep only predictors with a non-negligible importance score
index <- which(lassoImp$Overall > .001)
impvariable <- as.data.frame(lassoImp[index, 0])   # keeps the predictor names as row names
impvariable$overall <- lassoImp[index, 1]
impvariable
## overall
## BiologicalMaterial03 0.030316362
## ManufacturingProcess04 0.008431423
## ManufacturingProcess06 0.017527652
## ManufacturingProcess09 0.290011611
## ManufacturingProcess11 0.076299732
## ManufacturingProcess13 0.190783345
## ManufacturingProcess17 0.084000774
## ManufacturingProcess32 0.167682810
## ManufacturingProcess34 2.480230512
## ManufacturingProcess36 15.270745914
## ManufacturingProcess37 0.127481831
## ManufacturingProcess39 0.033635041
## ManufacturingProcess45 0.208210924
Looking at the important variables, the manufacturing process predictors clearly dominate: 12 of the 13 predictors retained by the lasso are process measurements, and only one (BiologicalMaterial03) is biological.
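As a quick check (not part of the original output), one could tally the retained predictors by type:

# TRUE = manufacturing process predictor, FALSE = biological predictor
table(grepl("^Manufacturing", rownames(impvariable)))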
#### f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
None of these variables shows a strong, direct linear relationship with the response, but it is clear that the biological predictors carry little importance relative to the process predictors. Since the dominant predictors are process measurements that can be adjusted, monitoring and tuning the most important ones in future runs is a practical way to try to improve yield.
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess34'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 2.6015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## zero-width neighborhood. make span bigger
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess09'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess13'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess32'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess45'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)
featurePlot(x = ChemicalManufacturingProcess['ManufacturingProcess17'],
y = ChemicalManufacturingProcess$Yield,
plot = "scatter",
type = c("p", 'smooth'),
span = .5,
pch = 20)