BrianSingh_data624

6.2 Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3

data(permeability)

b. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

library(caret)

## Warning: package 'caret' was built under R version 4.2.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.2.3

fingerprints_filtered <- fingerprints[, -nearZeroVar(fingerprints)]
                                     
dim(fingerprints_filtered)

## [1] 165 388

388 predictors are left for modeling.
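To make the filtering rule concrete, here is a minimal base R sketch of the two criteria that nearZeroVar applies by default (frequency ratio above 95/5 and fewer than 10% unique values). The toy matrix, the column names, and the helper nzv_flags are hypothetical, and the helper is a simplification, not caret's implementation.

```r
# Toy binary "fingerprint" matrix: 100 molecules, 3 hypothetical substructures
set.seed(1)
n <- 100
toy <- cbind(
  common = rbinom(n, 1, 0.5),            # present in roughly half the molecules
  rare   = c(rep(1, 2), rep(0, n - 2)),  # present in only 2 molecules
  absent = rep(0, n)                     # never present (zero variance)
)

# Hypothetical helper: flag a column when the most common value dominates
# the second most common one AND the column has few distinct values,
# or when the column has a single distinct value.
nzv_flags <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  (freq_ratio > freq_cut && pct_unique < unique_cut) || length(counts) == 1
}

flagged <- apply(toy, 2, nzv_flags)
flagged

# Keep only the columns that were not flagged
toy_filtered <- toy[, !flagged, drop = FALSE]
dim(toy_filtered)
```

Filtering with a logical mask, as in the last step, also behaves sensibly when nothing is flagged. By contrast, negative indexing with an empty index vector would drop every column.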

c. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

set.seed(1234)

train_selection <- createDataPartition(permeability, p = .8, list = FALSE)

train_x <- fingerprints_filtered[train_selection, ]
test_x <- fingerprints_filtered[-train_selection, ]
train_y <- permeability[train_selection, ]
test_y <- permeability[-train_selection, ]

PLS <- train(
  x = train_x, y = train_y,
  method = "pls",
  metric = "Rsquared",
  tuneLength = 10,
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)
PLS_result <- PLS$results
PLS

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 120, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.76548  0.3949848  9.782056
##    2     11.08419  0.5662412  7.827891
##    3     11.02409  0.5704308  8.285652
##    4     11.16727  0.5556966  8.423255
##    5     11.09218  0.5573860  8.252132
##    6     10.96320  0.5653475  8.090110
##    7     11.06634  0.5592164  8.455477
##    8     11.21343  0.5559115  8.601945
##    9     11.24153  0.5635851  8.559953
##   10     11.63386  0.5463035  8.841897
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

Three latent variables are optimal (ncomp = 3), with a resampled Rsquared of 0.5704308 in the table above.

d. Predict the response for the test set. What is the test set estimate of R2?

fingerprint_predict <- predict(PLS, test_x)

postResample(fingerprint_predict, test_y)

##      RMSE  Rsquared       MAE 
## 14.033165  0.138877  9.495030

The test set estimate of Rsquared is 0.138877, well below the resampled estimate of 0.5704308.
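For reference, the three numbers that postResample reports can be reproduced by hand. The sketch below uses made-up observed and predicted values (not the permeability data) and computes Rsquared as the squared correlation between predictions and observations, which is how caret defines it for regression.

```r
# Hypothetical observed and predicted values, for illustration only
obs  <- c(1.2, 3.4, 2.2, 5.9, 4.1)
pred <- c(1.0, 3.0, 2.9, 5.1, 4.6)

metrics <- c(
  RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
  Rsquared = cor(pred, obs)^2,            # squared correlation
  MAE      = mean(abs(pred - obs))        # mean absolute error
)
metrics
```

Because Rsquared here is a squared correlation, it measures how well predictions track the observations, not how close they are. A model can have a fair Rsquared and still a large RMSE.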

e. Try building other models discussed in this chapter. Do any have better predictive performance?

set.seed(5678)

ridge_train <- train(
  x = train_x, y = train_y,
  method = "ridge",
  metric = "Rsquared",
  tuneGrid = data.frame(.lambda = seq(0, 1, by = 0.1)),
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)

## Warning: model fit failed for Fold10: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

ridge_train

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 120, 119, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     13.49126  0.4365961   9.679907
##   0.1     11.50897  0.5413120   8.507836
##   0.2     11.69481  0.5387242   8.743099
##   0.3     12.07012  0.5403318   9.074433
##   0.4     12.55911  0.5419691   9.475287
##   0.5     13.13064  0.5434704   9.925781
##   0.6     13.77109  0.5440621  10.459382
##   0.7     14.46238  0.5446213  11.039901
##   0.8     15.19632  0.5449756  11.644364
##   0.9     15.96540  0.5450875  12.263755
##   1.0     16.75702  0.5452463  12.888947
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.

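Selecting on Rsquared gives lambda = 1 for ridge regression, with a resampled Rsquared of 0.5452463. That is slightly lower than the PLS value of 0.5704308, so ridge does not outperform PLS here. Rsquared is nearly flat for lambda between 0.1 and 1, while RMSE is lowest at lambda = 0.1 (11.50897) and rises to 16.75702 at lambda = 1.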

set.seed(789)
lars_train <- train(
  x = train_x, y = train_y,
  method = "lars",
  metric = "Rsquared",
  tuneLength = 10,
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)
lars_train

## Least Angle Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 120, 119, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   fraction   RMSE      Rsquared   MAE      
##   0.0500000  11.20680  0.5827230   7.887273
##   0.1555556  11.36698  0.5654115   8.342067
##   0.2611111  11.99233  0.5360780   8.742474
##   0.3666667  12.39335  0.5121641   8.931486
##   0.4722222  12.65635  0.4940958   9.037095
##   0.5777778  13.04394  0.4761195   9.296571
##   0.6833333  13.13094  0.4696574   9.386220
##   0.7888889  13.33400  0.4756693   9.433763
##   0.8944444  13.68104  0.4711790   9.704378
##   1.0000000  14.39200  0.4531616  10.183606
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.

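The final value used for the LARS model was fraction = 0.05, with a resampled Rsquared of 0.5827230. This is the highest of the three models (PLS 0.5704308, ridge 0.5452463), although the differences are small and each model was resampled with a different seed, so the folds are not identical.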
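To compare the three fits side by side, the selected rows of the printed resampling tables above can be collected into one small table. The values below are copied from those tables; the data frame name cv_summary is just a label chosen for this sketch.

```r
# Selected tuning value for each model, copied from the printed caret output
cv_summary <- data.frame(
  model    = c("PLS (ncomp = 3)", "Ridge (lambda = 1)", "LARS (fraction = 0.05)"),
  RMSE     = c(11.02409, 16.75702, 11.20680),
  Rsquared = c(0.5704308, 0.5452463, 0.5827230)
)

# Rank by resampled Rsquared, highest first
cv_summary[order(-cv_summary$Rsquared), ]
```

The table also shows a side effect of tuning on Rsquared alone: the selected ridge model has a much larger RMSE than the other two, even though its Rsquared is close to theirs.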

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

b. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

summary(ChemicalManufacturingProcess)

##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
##

sum(is.na(ChemicalManufacturingProcess))

## [1] 106

We have 106 NAs across different variables.

chemical_missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
chemical_imputed <- predict(chemical_missing, ChemicalManufacturingProcess)

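Bag imputation (bagImpute) fits a bagged tree model for each predictor, using the other columns as inputs, and replaces each missing value with that model's prediction. Because preProcess was applied to the whole data frame, Yield is among the columns available to the imputation models.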
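The idea behind bagged imputation can be sketched in base R. The example below is a simplification and not caret's implementation: it swaps in linear models for the bagged trees that bagImpute uses, and it runs on a simulated toy data frame whose columns (x1, x2, x3) are hypothetical.

```r
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)

# Knock out 10 values of x3, remembering the truth for a later check
miss  <- sample(n, 10)
truth <- toy$x3[miss]
toy$x3[miss] <- NA

# Bagging: fit one model per bootstrap sample of the complete rows,
# predict the missing cells each time, then average the predictions
complete <- toy[!is.na(toy$x3), ]
n_bags <- 25
preds <- replicate(n_bags, {
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  fit  <- lm(x3 ~ x1 + x2, data = boot)
  predict(fit, newdata = toy[miss, ])
})
toy$x3[miss] <- rowMeans(preds)

sum(is.na(toy))          # no missing values remain
cor(toy$x3[miss], truth) # imputed values track the removed ones
```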

c. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

set.seed(1357)

index <- createDataPartition(chemical_imputed$Yield, p = .8, list = FALSE)

train_chemical <- chemical_imputed[index, ]
test_chemical <- chemical_imputed[-index, ]

lm_model <- lm(Yield ~ ., chemical_imputed)
summary(lm_model)

## 
## Call:
## lm(formula = Yield ~ ., data = chemical_imputed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14184 -0.52928 -0.05022  0.48100  2.04082 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.443e+02  1.304e+02   1.107  0.27074    
## BiologicalMaterial01    2.374e-01  3.351e-01   0.708  0.48011    
## BiologicalMaterial02   -1.166e-01  1.254e-01  -0.929  0.35470    
## BiologicalMaterial03    2.389e-01  2.326e-01   1.027  0.30653    
## BiologicalMaterial04   -1.709e-01  5.229e-01  -0.327  0.74433    
## BiologicalMaterial05    1.617e-01  1.041e-01   1.553  0.12310    
## BiologicalMaterial06   -8.433e-02  2.970e-01  -0.284  0.77695    
## BiologicalMaterial07   -1.460e+00  9.556e-01  -1.528  0.12928    
## BiologicalMaterial08    6.953e-01  6.429e-01   1.081  0.28167    
## BiologicalMaterial09   -1.360e+00  1.354e+00  -1.004  0.31724    
## BiologicalMaterial10    1.998e-01  1.370e+00   0.146  0.88429    
## BiologicalMaterial11   -8.302e-02  8.057e-02  -1.030  0.30494    
## BiologicalMaterial12    3.457e-01  6.254e-01   0.553  0.58145    
## ManufacturingProcess01  6.930e-02  9.338e-02   0.742  0.45949    
## ManufacturingProcess02  2.978e-03  4.579e-02   0.065  0.94825    
## ManufacturingProcess03 -4.229e+00  5.106e+00  -0.828  0.40923    
## ManufacturingProcess04  6.193e-02  2.925e-02   2.117  0.03634 *  
## ManufacturingProcess05  1.017e-03  3.804e-03   0.267  0.78974    
## ManufacturingProcess06  3.489e-02  4.229e-02   0.825  0.41100    
## ManufacturingProcess07 -1.776e-01  2.084e-01  -0.852  0.39580    
## ManufacturingProcess08 -1.350e-01  2.506e-01  -0.539  0.59115    
## ManufacturingProcess09  3.529e-01  1.777e-01   1.987  0.04927 *  
## ManufacturingProcess10 -2.360e-02  5.568e-01  -0.042  0.96627    
## ManufacturingProcess11  4.518e-01  7.323e-01   0.617  0.53840    
## ManufacturingProcess12  8.256e-05  1.067e-04   0.774  0.44043    
## ManufacturingProcess13 -1.906e-01  3.701e-01  -0.515  0.60746    
## ManufacturingProcess14 -4.991e-04  1.072e-02  -0.047  0.96294    
## ManufacturingProcess15  3.285e-03  8.596e-03   0.382  0.70299    
## ManufacturingProcess16 -8.787e-05  3.346e-04  -0.263  0.79332    
## ManufacturingProcess17 -1.532e-01  2.905e-01  -0.527  0.59887    
## ManufacturingProcess18  4.710e-03  4.312e-03   1.092  0.27695    
## ManufacturingProcess19 -1.100e-03  7.254e-03  -0.152  0.87975    
## ManufacturingProcess20 -4.969e-03  4.574e-03  -1.086  0.27956    
## ManufacturingProcess21         NA         NA      NA       NA    
## ManufacturingProcess22 -1.243e-02  4.115e-02  -0.302  0.76317    
## ManufacturingProcess23 -4.783e-02  8.166e-02  -0.586  0.55916    
## ManufacturingProcess24 -2.537e-02  2.302e-02  -1.102  0.27260    
## ManufacturingProcess25 -1.906e-03  2.283e-03  -0.835  0.40538    
## ManufacturingProcess26  2.883e-03  6.057e-03   0.476  0.63502    
## ManufacturingProcess27 -6.446e-03  6.794e-03  -0.949  0.34462    
## ManufacturingProcess28 -8.247e-02  3.070e-02  -2.687  0.00825 ** 
## ManufacturingProcess29  1.188e+00  9.069e-01   1.310  0.19263    
## ManufacturingProcess30 -6.555e-01  5.903e-01  -1.110  0.26907    
## ManufacturingProcess31  6.274e-02  1.155e-01   0.543  0.58790    
## ManufacturingProcess32  3.237e-01  6.827e-02   4.741 5.96e-06 ***
## ManufacturingProcess33 -3.868e-01  1.274e-01  -3.036  0.00295 ** 
## ManufacturingProcess34 -1.251e+00  2.735e+00  -0.458  0.64813    
## ManufacturingProcess35 -2.107e-02  1.724e-02  -1.222  0.22395    
## ManufacturingProcess36  3.452e+02  3.079e+02   1.121  0.26449    
## ManufacturingProcess37 -8.166e-01  2.889e-01  -2.826  0.00552 ** 
## ManufacturingProcess38 -2.175e-01  2.378e-01  -0.914  0.36232    
## ManufacturingProcess39  8.136e-02  1.283e-01   0.634  0.52708    
## ManufacturingProcess40  1.080e+00  6.466e+00   0.167  0.86761    
## ManufacturingProcess41 -3.648e-01  4.690e+00  -0.078  0.93813    
## ManufacturingProcess42  5.257e-02  2.062e-01   0.255  0.79923    
## ManufacturingProcess43  1.957e-01  1.159e-01   1.689  0.09393 .  
## ManufacturingProcess44 -8.385e-01  1.181e+00  -0.710  0.47906    
## ManufacturingProcess45  9.891e-01  5.347e-01   1.850  0.06679 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.022 on 119 degrees of freedom
## Multiple R-squared:  0.7914, Adjusted R-squared:  0.6932 
## F-statistic:  8.06 on 56 and 119 DF,  p-value: < 2.2e-16

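The linear regression has a multiple R-squared of 0.7914 (adjusted 0.6932). Note that lm_model was fit on the full chemical_imputed data rather than on train_chemical, so this is an in-sample fit, not a resampled estimate from the training set.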
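A resampled estimate would come from fitting on the training rows only and cross-validating within them. The following base R sketch shows that workflow on the built-in mtcars data, used here purely as a stand-in for the chemical data. The choice of predictors, the 5 folds, and all object names are assumptions made for illustration.

```r
set.seed(1357)
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

# 80/20 split into training and test rows
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train_dat <- dat[train_idx, ]
test_dat  <- dat[-train_idx, ]

# 5-fold cross-validation using the training rows only
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train_dat)))
cv_r2 <- sapply(seq_len(k), function(i) {
  fit  <- lm(mpg ~ ., data = train_dat[fold != i, ])
  pred <- predict(fit, newdata = train_dat[fold == i, ])
  cor(pred, train_dat$mpg[fold == i])^2
})
mean(cv_r2)  # resampled Rsquared on the training set

# Final model is fit on the training rows and scored once on the test rows
final_fit <- lm(mpg ~ ., data = train_dat)
test_pred <- predict(final_fit, newdata = test_dat)
test_r2   <- cor(test_pred, test_dat$mpg)^2
test_r2
```

With this arrangement the test rows never influence the fitted coefficients, so the test Rsquared can be compared fairly with the resampled value.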

d. Predict the response for the test set.What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

lm_predict <- predict(lm_model, test_chemical[ ,-1])

## Warning in predict.lm(lm_model, test_chemical[, -1]): prediction from a
## rank-deficient fit may be misleading

postResample(lm_predict, test_chemical[ ,1])

##      RMSE  Rsquared       MAE 
## 0.8443454 0.8338351 0.6445676

The Rsquared on the test set is 0.8338351, slightly higher than the in-sample value of 0.7914. Because the model was fit on all rows, including the test rows, this test estimate is optimistic.

e. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

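Based on the p-values in the model above, ManufacturingProcess32, ManufacturingProcess33, ManufacturingProcess37, and ManufacturingProcess28 are the most significant predictors. All four are process predictors, so the process predictors dominate the list; no biological predictor is significant at the 0.05 level.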
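Rather than reading the ranking off the printed summary, the coefficient table can be sorted directly. This sketch uses the built-in mtcars data and an arbitrary set of predictors as a stand-in. The same steps should apply to any lm fit, but the object names here are hypothetical.

```r
fit <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_mat <- coef(summary(fit))
coef_mat <- coef_mat[rownames(coef_mat) != "(Intercept)", , drop = FALSE]

# Sort predictors by p-value, smallest first
ranked <- coef_mat[order(coef_mat[, "Pr(>|t|)"]), , drop = FALSE]
ranked[, c("Estimate", "t value", "Pr(>|t|)")]
```

Sorting by p-value is equivalent to sorting by absolute t statistic in the other direction. The absolute t statistic is also what caret's varImp reports for linear models.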

f. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.2.3

## corrplot 0.92 loaded

correlation <- cor(select(chemical_imputed, 'ManufacturingProcess32','ManufacturingProcess28','ManufacturingProcess33','ManufacturingProcess37','Yield'))
corrplot(correlation)

The strongest correlation in the plot, which is positive, appears to be between ManufacturingProcess32 and ManufacturingProcess33, and some negative correlations appear as well. ManufacturingProcess32 has the smallest p-value in the model and a positive coefficient, and process predictors can be changed during manufacturing, so the company could focus on that setting first; raising it may increase yield in future runs.
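A numeric companion to the correlation plot is to correlate each top predictor with the response and sort by strength. The sketch below does this on the built-in mtcars data as a stand-in, with mpg playing the role of Yield. The vector top is a hypothetical list of predictors chosen for illustration.

```r
# Hypothetical "top predictors" and a response, taken from mtcars
top <- c("wt", "hp", "disp")

# Correlation of each predictor with the response
r <- sapply(mtcars[top], function(x) cor(x, mtcars$mpg))

# Strongest relationship first, regardless of sign
r[order(-abs(r))]
```

The sign of each correlation indicates the direction in which a controllable predictor would need to move. Since a correlation alone does not establish cause, any change is best confirmed with a trial run.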

BrianSingh_data624_HW7

2024-04-12

f. Would you recommend any of your models to replace the permeability laboratory experiment?
