Data624 Assignment 7

suppressMessages(suppressWarnings(library(fpp2)))
suppressMessages(suppressWarnings(library(readxl)))
suppressMessages(suppressWarnings(library(seasonal)))
suppressMessages(suppressWarnings(library(rdatamarket)))
suppressMessages(suppressWarnings(library(tseries)))
suppressMessages(suppressWarnings(library(AppliedPredictiveModeling)))
suppressMessages(suppressWarnings(library(fma)))
suppressMessages(suppressWarnings(library(corrplot)))
suppressMessages(suppressWarnings(library(caret)))
suppressMessages(suppressWarnings(library(pls)))
suppressMessages(suppressWarnings(library(glmnet)))
suppressMessages(suppressWarnings(library(missForest)))
set.seed(3)

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

data(permeability)
#permeability
summary(permeability)

##   permeability  
##  Min.   : 0.06  
##  1st Qu.: 1.55  
##  Median : 4.91  
##  Mean   :12.24  
##  3rd Qu.:15.47  
##  Max.   :55.60

The matrix fingerprints contains the 1,107 binary molecular predic-tors for the 165 compounds, while permeability contains permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nz <- nearZeroVar(fingerprints)
length(nz)

## [1] 719

# Filter predictors
fp <- fingerprints[, -nz]

719 predictors can be dropped because they have near zero variance. So now there are 388 predictors.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

cp <- cor(fp)
#cp

Lets remove the predictors with high correlation(.9)

cp_high <- findCorrelation(cp, cutoff = .9)
fp <- fp[, -cp]

Now lets do the train test split and build the model

# train and test split

fp_train <- fp[1:124, ]
fp_test <- fp[1:124, ]

permeability_train <- permeability[1:124, ]
permeability_test <- permeability[1:124, ]

PLS Model

pls_model <- train(fp_train, permeability_train,
                method = "pls",
                tuneLength = 10,
                trControl = trainControl(method = "cv"))


# Plot PLS Model

plot(pls_model, main="RMSE Error vs Components")

pls_model$results[pls_model$results$ncomp == 8, 'Rsquared']

## [1] 0.4928981

pls_model$results[pls_model$results$ncomp == 8, 'RMSE']

## [1] 11.50874

pls_model

## Partial Least Squares 
## 
## 124 samples
## 387 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 111, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.53374  0.2972899  10.277374
##    2     11.62831  0.4346302   8.471991
##    3     11.56626  0.4597054   8.868359
##    4     12.25383  0.4306803   9.592441
##    5     11.86676  0.4589821   8.836670
##    6     11.90645  0.4643249   8.993007
##    7     11.56075  0.4839172   9.052520
##    8     11.50874  0.4928981   8.957240
##    9     11.75846  0.4741684   9.017795
##   10     12.20983  0.4497406   9.398011
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.

R^2 value is 0.4928981 and RMSE is 11.50874 with n = 8

(d) Predict the response for the test set. What is the test set estimate of R2?

# Predict the permeability based on test set of fingerprints

permeability_predict = predict(pls_model, fp_test)

# Print out predictions and outcomes

plot(permeability_predict, permeability_test, main="Observed vs Predicted Permeability from PLS Model (n=5)", xlab="Predicted Permeability", ylab="Observed Permeability")

abline(0, 1, col='red')
text(0, 30, paste("R^2 = ", round(cor(permeability_test, permeability_predict)^2, 2)))
text(0, 27, paste("RMSE = ", round(sqrt(sum((permeability_test - permeability_predict)^2)), 2)))

The R^2 value is .79 which is really good.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Lets use glmnet to build a new model.

glm_model <- glmnet(fp_train, permeability_train, family="gaussian", alpha=0.5, lambda=0.001)
permeability_glm_predict <- predict(glm_model, fp_test)

plot(permeability_glm_predict, permeability_test, main="Observed vs Predicted Permeability from GLM Model", xlab="Predicted Permeability", ylab="Observed Permeability")

abline(0, 1, col='red')
text(0, 30, paste("R^2 = ", round(cor(permeability_test, permeability_glm_predict)^2, 2)))
text(0, 27, paste("RMSE = ", round(sqrt(sum((permeability_test - permeability_glm_predict)^2)), 2)))

6.3 . A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

data(chemicalManufacturing)

## Warning in data(chemicalManufacturing): data set 'chemicalManufacturing'
## not found

Name of the data is ChemicalManufacturingProcess

data(ChemicalManufacturingProcess)
#ChemicalManufacturingProcess
summary(ChemicalManufacturingProcess)

##      Yield       BiologicalMaterial01 BiologicalMaterial02
##  Min.   :35.25   Min.   :4.580        Min.   :46.87       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68       
##  Median :39.97   Median :6.305        Median :55.09       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75       
##                                                           
##  BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
##  Min.   :56.97        Min.   : 9.38        Min.   :13.24       
##  1st Qu.:64.98        1st Qu.:11.24        1st Qu.:17.23       
##  Median :67.22        Median :12.10        Median :18.49       
##  Mean   :67.70        Mean   :12.35        Mean   :18.60       
##  3rd Qu.:70.43        3rd Qu.:13.22        3rd Qu.:19.90       
##  Max.   :78.25        Max.   :23.09        Max.   :24.85       
##                                                                
##  BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
##  Min.   :40.60        Min.   :100.0        Min.   :15.88       
##  1st Qu.:46.05        1st Qu.:100.0        1st Qu.:17.06       
##  Median :48.46        Median :100.0        Median :17.51       
##  Mean   :48.91        Mean   :100.0        Mean   :17.49       
##  3rd Qu.:51.34        3rd Qu.:100.0        3rd Qu.:17.88       
##  Max.   :59.38        Max.   :100.8        Max.   :19.14       
##                                                                
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##  Min.   :11.44        Min.   :1.770        Min.   :135.8       
##  1st Qu.:12.60        1st Qu.:2.460        1st Qu.:143.8       
##  Median :12.84        Median :2.710        Median :146.1       
##  Mean   :12.85        Mean   :2.801        Mean   :147.0       
##  3rd Qu.:13.13        3rd Qu.:2.990        3rd Qu.:149.6       
##  Max.   :14.08        Max.   :6.870        Max.   :158.7       
##                                                                
##  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
##  Min.   :18.35        Min.   : 0.00          Min.   : 0.00         
##  1st Qu.:19.73        1st Qu.:10.80          1st Qu.:19.30         
##  Median :20.12        Median :11.40          Median :21.00         
##  Mean   :20.20        Mean   :11.21          Mean   :16.68         
##  3rd Qu.:20.75        3rd Qu.:12.15          3rd Qu.:21.50         
##  Max.   :22.21        Max.   :14.10          Max.   :22.50         
##                       NA's   :1              NA's   :3             
##  ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
##  Min.   :1.47           Min.   :911.0          Min.   : 923.0        
##  1st Qu.:1.53           1st Qu.:928.0          1st Qu.: 986.8        
##  Median :1.54           Median :934.0          Median : 999.2        
##  Mean   :1.54           Mean   :931.9          Mean   :1001.7        
##  3rd Qu.:1.55           3rd Qu.:936.0          3rd Qu.:1008.9        
##  Max.   :1.60           Max.   :946.0          Max.   :1175.3        
##  NA's   :15             NA's   :1              NA's   :1             
##  ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
##  Min.   :203.0          Min.   :177.0          Min.   :177.0         
##  1st Qu.:205.7          1st Qu.:177.0          1st Qu.:177.0         
##  Median :206.8          Median :177.0          Median :178.0         
##  Mean   :207.4          Mean   :177.5          Mean   :177.6         
##  3rd Qu.:208.7          3rd Qu.:178.0          3rd Qu.:178.0         
##  Max.   :227.4          Max.   :178.0          Max.   :178.0         
##  NA's   :2              NA's   :1              NA's   :1             
##  ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
##  Min.   :38.89          Min.   : 7.500         Min.   : 7.500        
##  1st Qu.:44.89          1st Qu.: 8.700         1st Qu.: 9.000        
##  Median :45.73          Median : 9.100         Median : 9.400        
##  Mean   :45.66          Mean   : 9.179         Mean   : 9.386        
##  3rd Qu.:46.52          3rd Qu.: 9.550         3rd Qu.: 9.900        
##  Max.   :49.36          Max.   :11.600         Max.   :11.500        
##                         NA's   :9              NA's   :10            
##  ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
##  Min.   :   0.0         Min.   :32.10          Min.   :4701          
##  1st Qu.:   0.0         1st Qu.:33.90          1st Qu.:4828          
##  Median :   0.0         Median :34.60          Median :4856          
##  Mean   : 857.8         Mean   :34.51          Mean   :4854          
##  3rd Qu.:   0.0         3rd Qu.:35.20          3rd Qu.:4882          
##  Max.   :4549.0         Max.   :38.60          Max.   :5055          
##  NA's   :1                                     NA's   :1             
##  ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
##  Min.   :5904           Min.   :   0           Min.   :31.30         
##  1st Qu.:6010           1st Qu.:4561           1st Qu.:33.50         
##  Median :6032           Median :4588           Median :34.40         
##  Mean   :6039           Mean   :4566           Mean   :34.34         
##  3rd Qu.:6061           3rd Qu.:4619           3rd Qu.:35.10         
##  Max.   :6233           Max.   :4852           Max.   :40.00         
##                                                                      
##  ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
##  Min.   :   0           Min.   :5890           Min.   :   0          
##  1st Qu.:4813           1st Qu.:6001           1st Qu.:4553          
##  Median :4835           Median :6022           Median :4582          
##  Mean   :4810           Mean   :6028           Mean   :4556          
##  3rd Qu.:4862           3rd Qu.:6050           3rd Qu.:4610          
##  Max.   :4971           Max.   :6146           Max.   :4759          
##                                                                      
##  ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
##  Min.   :-1.8000        Min.   : 0.000         Min.   :0.000         
##  1st Qu.:-0.6000        1st Qu.: 3.000         1st Qu.:2.000         
##  Median :-0.3000        Median : 5.000         Median :3.000         
##  Mean   :-0.1642        Mean   : 5.406         Mean   :3.017         
##  3rd Qu.: 0.0000        3rd Qu.: 8.000         3rd Qu.:4.000         
##  Max.   : 3.6000        Max.   :12.000         Max.   :6.000         
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
##  Min.   : 0.000         Min.   :   0           Min.   :   0          
##  1st Qu.: 4.000         1st Qu.:4832           1st Qu.:6020          
##  Median : 8.000         Median :4855           Median :6047          
##  Mean   : 8.834         Mean   :4828           Mean   :6016          
##  3rd Qu.:14.000         3rd Qu.:4877           3rd Qu.:6070          
##  Max.   :23.000         Max.   :4990           Max.   :6161          
##  NA's   :1              NA's   :5              NA's   :5             
##  ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
##  Min.   :   0           Min.   : 0.000         Min.   : 0.00         
##  1st Qu.:4560           1st Qu.: 0.000         1st Qu.:19.70         
##  Median :4587           Median :10.400         Median :19.90         
##  Mean   :4563           Mean   : 6.592         Mean   :20.01         
##  3rd Qu.:4609           3rd Qu.:10.750         3rd Qu.:20.40         
##  Max.   :4710           Max.   :11.500         Max.   :22.00         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
##  Min.   : 0.000         Min.   : 0.00          Min.   :143.0         
##  1st Qu.: 8.800         1st Qu.:70.10          1st Qu.:155.0         
##  Median : 9.100         Median :70.80          Median :158.0         
##  Mean   : 9.161         Mean   :70.18          Mean   :158.5         
##  3rd Qu.: 9.700         3rd Qu.:71.40          3rd Qu.:162.0         
##  Max.   :11.200         Max.   :72.50          Max.   :173.0         
##  NA's   :5              NA's   :5                                    
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##  Min.   :56.00          Min.   :2.300          Min.   :463.0         
##  1st Qu.:62.00          1st Qu.:2.500          1st Qu.:490.0         
##  Median :64.00          Median :2.500          Median :495.0         
##  Mean   :63.54          Mean   :2.494          Mean   :495.6         
##  3rd Qu.:65.00          3rd Qu.:2.500          3rd Qu.:501.5         
##  Max.   :70.00          Max.   :2.600          Max.   :522.0         
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##  Min.   :0.01700        Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.01900        1st Qu.:0.700          1st Qu.:2.000         
##  Median :0.02000        Median :1.000          Median :3.000         
##  Mean   :0.01957        Mean   :1.014          Mean   :2.534         
##  3rd Qu.:0.02000        3rd Qu.:1.300          3rd Qu.:3.000         
##  Max.   :0.02200        Max.   :2.300          Max.   :3.000         
##  NA's   :5                                                           
##  ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
##  Min.   :0.000          Min.   :0.00000        Min.   :0.00000       
##  1st Qu.:7.100          1st Qu.:0.00000        1st Qu.:0.00000       
##  Median :7.200          Median :0.00000        Median :0.00000       
##  Mean   :6.851          Mean   :0.01771        Mean   :0.02371       
##  3rd Qu.:7.300          3rd Qu.:0.00000        3rd Qu.:0.00000       
##  Max.   :7.500          Max.   :0.10000        Max.   :0.20000       
##                         NA's   :1              NA's   :1             
##  ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
##  Min.   : 0.00          Min.   : 0.0000        Min.   :0.000         
##  1st Qu.:11.40          1st Qu.: 0.6000        1st Qu.:1.800         
##  Median :11.60          Median : 0.8000        Median :1.900         
##  Mean   :11.21          Mean   : 0.9119        Mean   :1.805         
##  3rd Qu.:11.70          3rd Qu.: 1.0250        3rd Qu.:1.900         
##  Max.   :12.10          Max.   :11.0000        Max.   :2.100         
##                                                                      
##  ManufacturingProcess45
##  Min.   :0.000         
##  1st Qu.:2.100         
##  Median :2.200         
##  Mean   :2.138         
##  3rd Qu.:2.300         
##  Max.   :2.600         
##

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We see some missing values from the summary and will impute them using missForest.

cmfp = missForest(ChemicalManufacturingProcess)

##   missForest iteration 1 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## done!
##   missForest iteration 2 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## done!
##   missForest iteration 3 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
## mtry, : The response has five or fewer unique values. Are you sure you want
## to do regression?

## done!

cmfp = cmfp$ximp

Next we will separate out target variable yield from rest of the predictors

cmfp_data = cmfp[,2:58]
target = cmfp[,1]

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

In this problem lets try 75/ 25 split.

train = createDataPartition(target, p=0.75 )
predictor_train = cmfp_data[train$Resample1,]
target_train = target[train$Resample]
predictor_test = cmfp_data[-train$Resample1,]
target_test = target[-train$Resample1]

PLS model

pls_model <- train(predictor_train, target_train,
                   method = "pls",
                   tuneLength = 20,
                   trControl = trainControl(method = "cv"),
                   preProc = c("center", "scale"))

# Plot PLS Model

plot(pls_model, main="RMSE Error vs Components")

pls_model$results[pls_model$results$ncomp == 1, 'Rsquared']

## [1] 0.4637491

pls_model$results[pls_model$results$ncomp == 1, 'RMSE']

## [1] 1.412183

pls_model

## Partial Least Squares 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 119, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.412183  0.4637491  1.140819
##    2     1.445793  0.4904245  1.082553
##    3     1.424673  0.5536017  1.066961
##    4     1.680173  0.5324892  1.166260
##    5     1.981922  0.5092919  1.258992
##    6     1.947023  0.5133619  1.245893
##    7     1.864152  0.5114982  1.215801
##    8     2.024598  0.5033420  1.264425
##    9     2.209687  0.4920212  1.336585
##   10     2.442016  0.4898081  1.424506
##   11     2.550068  0.4902477  1.465960
##   12     2.737215  0.4880870  1.526794
##   13     2.802030  0.4834814  1.548499
##   14     2.885685  0.4781477  1.577304
##   15     2.919612  0.4735137  1.585403
##   16     2.858094  0.4806946  1.566027
##   17     2.924027  0.4780045  1.595327
##   18     3.024529  0.4726409  1.630956
##   19     3.167027  0.4761935  1.667615
##   20     3.325339  0.4785661  1.710573
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

R^2 value is 0.4637491 and RMSE is 1.412183 with n = 1

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

pred_test = predict(pls_model, predictor_test)
#pred_train = predict(plsTune, predictor_training)
pred_cmpf <- data.frame(obs = target_test, pred = pred_test)
defaultSummary(pred_cmpf)

##      RMSE  Rsquared       MAE 
## 1.7951022 0.3258719 1.2250007

R^2 = 0.3258719 and RMSE = 1.7951022, which is not better then the above metric.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

plot(varImp(pls_model))

v_imp <- varImp(pls_model)
v_imp

## pls variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   88.61
## ManufacturingProcess36   88.38
## ManufacturingProcess13   88.07
## BiologicalMaterial06     76.64
## BiologicalMaterial02     75.84
## ManufacturingProcess17   75.61
## ManufacturingProcess33   75.12
## ManufacturingProcess06   72.72
## BiologicalMaterial03     71.89
## BiologicalMaterial08     62.22
## BiologicalMaterial04     62.10
## ManufacturingProcess12   60.34
## BiologicalMaterial12     59.91
## BiologicalMaterial11     59.27
## ManufacturingProcess11   58.71
## BiologicalMaterial01     54.83
## ManufacturingProcess30   39.32
## ManufacturingProcess10   37.88
## ManufacturingProcess15   35.09

We can see that top predictors are manufacturing processes. So manufatucring processes have much higher impact on target.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Lets consider the top 3 predictors and plot them against our target variable.

1. ManufacturingProcess32

2. ManufacturingProcess09

3. ManufacturingProcess36

#ManufacturingProcess32
plot(cmfp_data$ManufacturingProcess32, target)
abline(lm(target~cmfp_data$ManufacturingProcess32),col="red",lwd=1.5)

#ManufacturingProcess09
plot(cmfp_data$ManufacturingProcess09, target)
abline(lm(target~cmfp_data$ManufacturingProcess09),col="red",lwd=1.5)

#ManufacturingProcess36
plot(cmfp_data$ManufacturingProcess36, target)
abline(lm(target~cmfp_data$ManufacturingProcess36),col="red",lwd=1.5)

ManufacturingProcess32 and ManufacturingProcess09 has a positive corelation with the target and ManufacturingProcess36 have negative corelation with the target. This means we could adjust the variables to potentially increase the yield.

Data624 Assignment7

Ritesh Lohiya

April 4, 2019

Data624 Assignment 7

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predic-tors for the 165 compounds, while permeability contains permeability response.

719 predictors can be dropped because they have near zero variance. So now there are 388 predictors.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Lets remove the predictors with high correlation(.9)

Now lets do the train test split and build the model

PLS Model

R^2 value is 0.4928981 and RMSE is 11.50874 with n = 8

(d) Predict the response for the test set. What is the test set estimate of R2?

The R^2 value is .79 which is really good.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Lets use glmnet to build a new model.

(a) Start R and use these commands to load the data:

Name of the data is ChemicalManufacturingProcess

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We see some missing values from the summary and will impute them using missForest.

Next we will separate out target variable yield from rest of the predictors

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

In this problem lets try 75/ 25 split.

PLS model

R^2 value is 0.4637491 and RMSE is 1.412183 with n = 1

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

R^2 = 0.3258719 and RMSE = 1.7951022, which is not better then the above metric.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

We can see that top predictors are manufacturing processes. So manufatucring processes have much higher impact on target.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Lets consider the top 3 predictors and plot them against our target variable.

1. ManufacturingProcess32

2. ManufacturingProcess09

3. ManufacturingProcess36

ManufacturingProcess32 and ManufacturingProcess09 has a positive corelation with the target and ManufacturingProcess36 have negative corelation with the target. This means we could adjust the variables to potentially increase the yield.

Data624 Assignment7

Ritesh Lohiya

April 4, 2019

Data624 Assignment 7

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

The matrix fingerprints contains the 1,107 binary molecular predic-tors for the 165 compounds, while permeability contains permeability response.

719 predictors can be dropped because they have near zero variance. So now there are 388 predictors.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

Lets remove the predictors with high correlation(.9)

Now lets do the train test split and build the model

PLS Model

R^2 value is 0.4928981 and RMSE is 11.50874 with n = 8

(d) Predict the response for the test set. What is the test set estimate of R2?

The R^2 value is .79 which is really good.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Lets use glmnet to build a new model.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would recommend GLM Net model because it has a very high R^2.

(a) Start R and use these commands to load the data:

Name of the data is ChemicalManufacturingProcess

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We see some missing values from the summary and will impute them using missForest.

Next we will separate out target variable yield from rest of the predictors

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

In this problem lets try 75/ 25 split.

PLS model

R^2 value is 0.4637491 and RMSE is 1.412183 with n = 1

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

R^2 = 0.3258719 and RMSE = 1.7951022, which is not better then the above metric.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

We can see that top predictors are manufacturing processes. So manufatucring processes have much higher impact on target.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Lets consider the top 3 predictors and plot them against our target variable.

1. ManufacturingProcess32

2. ManufacturingProcess09

3. ManufacturingProcess36

ManufacturingProcess32 and ManufacturingProcess09 has a positive corelation with the target and ManufacturingProcess36 have negative corelation with the target. This means we could adjust the variables to potentially increase the yield.