Applied Predictive Modeling.
Instructions
Do problems 6.2 and 6.3 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
Exercises
6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.
How many predictors are left for modeling?
library(caret)
# Convert the fingerprint matrix to a data frame and identify the
# near-zero-variance predictors by name.
fingerprints.df <- data.frame(fingerprints)
fingerprints.ZeroVar <- nearZeroVar(fingerprints.df, names = TRUE)
# Number of predictors remaining after removing the near-zero-variance ones.
predictors <- dim(fingerprints.df)[2] - length(fingerprints.ZeroVar)
After filtering out the near-zero-variance predictors, 388 predictors remain for modeling.
(c)
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?
To pre-process the data, each predictor is centered and scaled:
\[x_i' = \frac{x_i - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the predictor's mean and standard deviation. This centering and scaling is applied during training through the preProc argument of the plsTune call below.
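As a small illustration (with a hypothetical vector v, not taken from the exercise data), centering and scaling by hand matches base R's scale():
v <- c(2, 4, 6, 8)                                # illustrative data
all.equal(as.numeric(scale(v)), (v - mean(v)) / sd(v))
# [1] TRUE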
# First, we need to pre-process the data by discarding predictors
# with Near Zero Variance.
reduced <- names(fingerprints.df) %in% fingerprints.ZeroVar
fingerprints.reduced.df <- fingerprints.df[ , !reduced]
# Now, I will split the data as follows:
# Training 75%
# Test 25%
set.seed(123)
n <- nrow(fingerprints.reduced.df)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
fingerprints.train <- fingerprints.reduced.df[trainIndex ,]
fingerprints.test <- fingerprints.reduced.df[-trainIndex ,]
# Defining Control for a ten-fold Cross Validation
ctrl <- trainControl(method = "cv", number = 10)
# Defining training sets in order to tune a PLS model
x <- fingerprints.reduced.df[trainIndex ,]
y <- permeability[trainIndex]
# Tuning a PLS model
set.seed(100)
plsTune <- train(x, y,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
The cross-validation profiles for a Partial Least Squares (PLS) model.
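The profile plot can be regenerated from the tuned object with caret's plot method:
plot(plsTune)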
From the above graph, the lowest RMSE is obtained with 8 components: RMSE \(= 10.635\), with a corresponding resampled \(R^2 = 0.49\).
The table below shows the resampled results (RMSE, \(R^2\), MAE) for each number of components.
## Partial Least Squares
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.98173 0.3565306 8.987053
## 2 10.98268 0.4927106 7.765846
## 3 11.17758 0.4596512 8.562117
## 4 11.35684 0.4462066 8.618259
## 5 10.97281 0.4789562 8.257085
## 6 10.99786 0.4678063 8.246938
## 7 10.84200 0.4783988 8.117766
## 8 10.63489 0.4948970 8.005125
## 9 10.88187 0.4879433 7.900238
## 10 10.89131 0.4734362 7.939967
## 11 11.04987 0.4544378 8.049652
## 12 10.97178 0.4547096 8.131693
## 13 10.98286 0.4520488 8.121929
## 14 11.20020 0.4325586 8.250437
## 15 11.38889 0.4227432 8.391429
## 16 11.34637 0.4255055 8.422120
## 17 11.29886 0.4185157 8.299045
## 18 11.43835 0.4114370 8.345948
## 19 11.40326 0.4152484 8.127962
## 20 11.54095 0.4051354 8.087660
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
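The selected tuning parameter and its resampled performance can also be read directly from the train object, for example:
plsTune$bestTune
subset(plsTune$results, ncomp == plsTune$bestTune$ncomp)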
(d)
Predict the response for the test set. What is the test set estimate of \(R^2\)?
fingerprints.train$Permeability <- permeability[trainIndex]
# PLS Fit and Predictions with up to 8 components.
library(pls)
plsFit <- plsr(Permeability ~ ., data = fingerprints.train)
predictPLS <- predict(plsFit, fingerprints.test, ncomp = 1:8)
# Creating a data frame from previous results.
predictPLS.df <- data.frame(predictPLS)
# Calculating R-squared on the testing data is a little tricky,
# as you have to remember what your baseline is.
# Your baseline projection is a mean of your training data.
get_R2 <- function(c){
# Defining the correct baseline data
y <- permeability[-trainIndex]
# y_bar is the mean of the observed data
y_bar <- mean(fingerprints.train$Permeability)
# Defining the predicted data for the model with c components
f <- predictPLS.df[,c]
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
# Obtaining R2 values for predicted R2.
R2_Predicted <- ""
for (i in 1:8){
R2_Predicted[i] <- get_R2(i)
}
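As a cross-check, caret's postResample computes the same test-set metrics; note that it defines \(R^2\) as the squared correlation between observed and predicted values, so it can differ slightly from the \(1 - SS_{residual}/SS_{total}\) definition used above. A minimal sketch for the eight-component predictions:
postResample(pred = predictPLS.df[, 8], obs = permeability[-trainIndex])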
Let’s compare the values; the Components column indicates the number of PLS components used.
| Components | R2_Predicted | R2_Train |
|---|---|---|
| 1 | 0.396103 | 0.3565306 |
| 2 | 0.4591568 | 0.4927106 |
| 3 | 0.3993341 | 0.4596512 |
| 4 | 0.4114313 | 0.4462066 |
| 5 | 0.3813263 | 0.4789562 |
| 6 | 0.4171841 | 0.4678063 |
| 7 | 0.4884075 | 0.4783988 |
| 8 | 0.4944606 | 0.4948970 |
As the last row shows, with 8 components the test-set and resampled \(R^2\) values are very close.
(e)
Try building other models discussed in this chapter. Do any have better predictive performance?
Ridge Model
# Tuning a Ridge model
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(x, y,
method = "ridge",
## Fit the model over many penalty values
tuneGrid = ridgeGrid,
trControl = ctrl,
## put the predictors on the same scale
preProc = c("center", "scale"))
## Ridge Regression
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 11.61404 0.4251840 7.979825
## 0.007142857 106.73370 0.2961840 85.376218
## 0.014285714 20.47443 0.3572114 13.821656
## 0.021428571 29.52876 0.3978892 19.833103
## 0.028571429 11.34962 0.4222886 7.914513
## 0.035714286 11.21141 0.4327152 7.859329
## 0.042857143 11.13188 0.4397708 7.834692
## 0.050000000 11.08863 0.4447829 7.846635
## 0.057142857 11.04668 0.4494877 7.857929
## 0.064285714 11.01545 0.4533167 7.864283
## 0.071428571 11.01397 0.4559008 7.888531
## 0.078571429 11.00020 0.4588167 7.900352
## 0.085714286 10.99370 0.4610162 7.915624
## 0.092857143 10.99070 0.4633150 7.930794
## 0.100000000 10.99585 0.4647631 7.947210
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.09285714.
The cross-validation profiles for a Ridge model.
From the above, the RMSE profile settles into a nearly straight line from \(\lambda = 0.0286\) onward, with the lowest RMSE obtained at \(\lambda = 0.0929\).
Let’s calculate the \(R^2\).
get_R2enet <- function(f){
# Defining the correct baseline data
y <- permeability[-trainIndex]
# y_bar is the mean of the observed data
y_bar <- mean(fingerprints.train$Permeability)
# f is the vector of predicted values (e.g. ridgePred$fit), passed in as the function argument
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
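The object ridgePred used below is not created in any chunk shown above; a minimal sketch of how it could be built with the elasticnet package, assuming the CV-selected \(\lambda = 0.0929\) and the test predictors from part (c):
library(elasticnet)
# Ridge fit at the CV-selected penalty; s = 1 (full fraction) keeps all coefficients.
ridgeModel <- enet(x = as.matrix(x), y = y, lambda = 0.09285714)
ridgePred <- predict(ridgeModel, newx = as.matrix(fingerprints.test),
                     s = 1, mode = "fraction", type = "fit")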
get_R2enet(ridgePred$fit)
#[1] 0.4547481
In this case the test-set \(R^2 = 0.4547481\), slightly lower than the resampled \(R^2 = 0.4633150\) obtained on the training data.
Comparing the ridge test-set \(R^2 = 0.4547481\) with the PLS test-set \(R^2 = 0.4944606\), the PLS model explains the variance in the data slightly better.
Enet Model
# Tuning a Enet model
y <- permeability[trainIndex]
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x, y,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Elasticnet
##
## 124 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 111, 112, 111, 111, 112, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 10.711923 0.5501899 7.977930
## 0.00 0.10 10.407607 0.5348454 7.329535
## 0.00 0.15 10.295620 0.5294794 7.182214
## 0.00 0.20 10.133571 0.5290360 7.157684
## 0.00 0.25 9.975111 0.5347711 7.042807
## 0.00 0.30 9.918872 0.5355194 7.022232
## 0.00 0.35 9.940631 0.5345060 7.032634
## 0.00 0.40 10.034343 0.5262229 7.108625
## 0.00 0.45 10.186233 0.5137433 7.200506
## 0.00 0.50 10.340369 0.4990464 7.276011
## 0.00 0.55 10.374904 0.4898779 7.263962
## 0.00 0.60 10.471261 0.4808300 7.294505
## 0.00 0.65 10.557048 0.4755929 7.365999
## 0.00 0.70 10.627163 0.4734722 7.425724
## 0.00 0.75 10.711507 0.4706411 7.446881
## 0.00 0.80 10.827727 0.4658134 7.511314
## 0.00 0.85 10.963902 0.4595376 7.600576
## 0.00 0.90 11.175606 0.4476629 7.713156
## 0.00 0.95 11.387338 0.4359785 7.838597
## 0.00 1.00 11.614037 0.4251840 7.979825
## 0.01 0.05 21.084119 0.5250403 14.377453
## 0.01 0.10 35.676932 0.5084231 23.179511
## 0.01 0.15 48.522321 0.5247744 31.705414
## 0.01 0.20 56.776652 0.5238109 37.386250
## 0.01 0.25 61.961964 0.5290693 40.911665
## 0.01 0.30 67.576149 0.5331302 44.482246
## 0.01 0.35 73.473727 0.5277792 48.251206
## 0.01 0.40 79.395211 0.5208918 52.229652
## 0.01 0.45 85.595075 0.5109801 56.246767
## 0.01 0.50 92.652097 0.4950344 60.610062
## 0.01 0.55 99.974890 0.4773596 65.040782
## 0.01 0.60 107.460371 0.4574953 69.546294
## 0.01 0.65 114.992047 0.4400249 74.046791
## 0.01 0.70 122.549989 0.4248916 78.523706
## 0.01 0.75 130.087702 0.4116705 82.970361
## 0.01 0.80 137.608462 0.4041988 88.120819
## 0.01 0.85 145.249561 0.3982796 93.463457
## 0.01 0.90 152.542654 0.3918093 98.526807
## 0.01 0.95 159.833900 0.3869120 103.578796
## 0.01 1.00 167.096488 0.3865823 108.627985
## 0.10 0.05 11.253149 0.5129544 8.586781
## 0.10 0.10 10.685980 0.5162941 7.531721
## 0.10 0.15 10.737138 0.4906719 7.420195
## 0.10 0.20 10.469545 0.4946123 7.367835
## 0.10 0.25 10.193889 0.5078322 7.229669
## 0.10 0.30 10.096661 0.5141351 7.229300
## 0.10 0.35 10.099017 0.5161346 7.214265
## 0.10 0.40 10.078473 0.5179258 7.131271
## 0.10 0.45 10.084795 0.5170338 7.130899
## 0.10 0.50 10.163397 0.5109624 7.195639
## 0.10 0.55 10.277115 0.5034110 7.282561
## 0.10 0.60 10.383843 0.4972600 7.344549
## 0.10 0.65 10.472132 0.4926200 7.411040
## 0.10 0.70 10.550363 0.4887880 7.466813
## 0.10 0.75 10.618902 0.4852205 7.531970
## 0.10 0.80 10.701186 0.4806883 7.615791
## 0.10 0.85 10.797671 0.4757570 7.716131
## 0.10 0.90 10.881163 0.4712766 7.813281
## 0.10 0.95 10.938739 0.4680404 7.885414
## 0.10 1.00 10.995850 0.4647631 7.947210
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.
The cross-validation profiles for an elastic net (enet) model.
enetModel <- enet(x = as.matrix(x), y = y, lambda = 0, normalize = TRUE)
# Note: s is the lasso fraction; the CV-selected value above was 0.3,
# but s = 0.1 is used here for the test-set predictions.
enetPred <- predict(enetModel, newx = as.matrix(fingerprints.test),
                    s = .1, mode = "fraction",
                    type = "fit")
get_R2enet(enetPred$fit)
#[1] 0.4296827
In this case the test-set \(R^2 = 0.4296827\), noticeably lower than the resampled \(R^2 = 0.5355194\) obtained on the training data.
Comparing the elastic net test-set \(R^2 = 0.4296827\) with the PLS test-set \(R^2 = 0.4944606\), the PLS model again explains the variance in the data better.
Do any have better predictive performance?
The two models above show that the PLS model has the better predictive performance in terms of test-set \(R^2\).
(f)
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer:
Based on the above results, I would not recommend any of these models to replace the permeability laboratory experiment. Too much of the variability is still left unexplained by the models, as can be seen from the relatively low \(R^2\) values.
6.3.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a)
Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Let’s visualize the missing values to get a better understanding of the missing data.
Since each affected predictor has only a small number of missing values, I believe it is better to replace the missing entries with the mean of that respective predictor. I believe this to be the best approach because:
Using complete cases only would dramatically reduce the data set, with the downside of losing potentially valuable information currently present in the data.
Replacing just a few records per predictor with the mean barely affects the data set, since only about 1% of the data is missing.
# Procedure to replace NAs with the mean of the respective predictor.
for (i in 1:dim(ChemicalManufacturingProcess)[2]){
  totalNA <- sum(is.na(ChemicalManufacturingProcess[i]))
  if (totalNA > 0){
    print(i)  # report which column contained missing values
    meanColumn <- mean(ChemicalManufacturingProcess[,i], na.rm = TRUE)
    ChemicalManufacturingProcess[i][is.na(ChemicalManufacturingProcess[i])] <- meanColumn
  }
}
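Alternatively, instead of the loop above, caret's preProcess can impute missing values directly; a minimal sketch using k-nearest-neighbours imputation (note that knnImpute also centers and scales the predictors; the object names here are illustrative):
library(caret)
imputeTrans <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
CMP.imputed <- predict(imputeTrans, ChemicalManufacturingProcess)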
(c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
To pre-process the data, each column is centered and scaled:
\[x_i' = \frac{x_i - \mu}{\sigma}\]
where \(\mu\) and \(\sigma\) are the column's mean and standard deviation.
We can achieve this by employing the preProcess function from the caret library. This function has the ability to transform, center, scale, or impute values.
# Function to pre-process data.
library(caret)
trans <- preProcess(ChemicalManufacturingProcess,
method = c("center", "scale"))
# Need to obtain new transformed values
CMP.trans <- predict(trans, ChemicalManufacturingProcess)
Now, I will split the data into 75 % training data and 25 % test.
# Now, I will split the data as follows:
# Training 75%
# Test 25%
set.seed(123)
n <- nrow(CMP.trans)
trainIndex <- sample(1:n, size = round(0.75*n), replace=FALSE)
CMPtrain <- CMP.trans[trainIndex ,]
CMPtest <- CMP.trans[-trainIndex ,]
Let’s find correlations:
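The chunk that computed these correlations is not shown; a minimal sketch using caret's findCorrelation with a 0.90 cutoff (assuming the correlations are taken over the transformed predictors in CMP.trans, excluding the Yield column):
highCorr <- findCorrelation(cor(CMP.trans[, -1]), cutoff = 0.90)
highCorr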
The list below gives the column numbers of predictors that are highly correlated (correlation \(> 0.90\)) with at least one other predictor.
## [1] 3 5 13 42 55 38 40 44 31 53
PLS Model
From the above list, it is important to note that several predictors are highly correlated. Hence, I will select PLS as my starting model.
# Defining Control for a ten-fold Cross Validation
ctrl <- trainControl(method = "cv", number = 10)
# Defining training sets in order to tune a PLS model
x <- CMPtrain[, -1]
y <- CMPtrain[ , 1]
# Tuning a PLS model
set.seed(100)
plsTune <- train(x, y,
method = "pls",
## The default tuning grid evaluates
## components 1... tuneLength
tuneLength = 20,
trControl = ctrl
#preProc = c("center", "scale")
)
## Partial Least Squares
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 117, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7658526 0.4518044 0.6171320
## 2 0.6757649 0.5505490 0.5537815
## 3 0.6546293 0.5779411 0.5459406
## 4 0.6638579 0.5715837 0.5544711
## 5 0.6835387 0.5547950 0.5811289
## 6 0.6887232 0.5601403 0.5809894
## 7 0.7001701 0.5479305 0.5886709
## 8 0.7214515 0.5421792 0.5996710
## 9 0.7553794 0.5253173 0.6218221
## 10 0.7957133 0.4972207 0.6439174
## 11 0.8457952 0.4598843 0.6746375
## 12 0.8836218 0.4424776 0.6936333
## 13 0.9010614 0.4409993 0.7013384
## 14 0.9127460 0.4408725 0.7020893
## 15 0.9280000 0.4367019 0.7101100
## 16 0.9499455 0.4306951 0.7116352
## 17 0.9802895 0.4241682 0.7105741
## 18 0.9990792 0.4208851 0.7097576
## 19 1.0276590 0.4158179 0.7120654
## 20 1.0779890 0.4015857 0.7296648
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The cross-validation profiles for a Partial Least Squares (PLS) model.
From the above results, 3 components give the lowest RMSE, with a corresponding resampled \(R^2 = 0.5779411\).
(d)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the re-sampled performance metric on the training set?
# PLS Fit and Predictions with up to 8 components.
library(pls)
plsFit <- plsr(Yield ~ ., data = CMPtrain)
predictPLS <- predict(plsFit, CMPtest, ncomp = 1:8)
# Creating a data frame from previous results.
predictPLS.df <- data.frame(predictPLS)
Let’s calculate the test \(R^2\).
# Calculating R-squared on the testing data is a little tricky,
# as you have to remember what your baseline is.
# Your baseline projection is a mean of your training data.
get_R2_3 <- function(c){
# Defining the correct baseline data
y <- CMPtest[1]
# y_bar is the mean of the observed data
y_bar <- mean(CMPtrain[,1])
# Defining the predicted data for the model with c components
f <- predictPLS.df[,c]
# then the variability of the data set can be measured
# using three sums of squares formulas:
# The total sum of squares (proportional to the variance of the data):
SS.total <- sum((y - y_bar)^2)
# The regression sum of squares, also called the explained sum of squares:
SS.regression <- sum((f - y_bar)^2)
# The sum of squares of residuals, also called the residual sum of squares:
SS.residual <- sum((y - f)^2)
# The most general definition of the R Squared coefficient of determination is:
R2 <- 1 - SS.residual/SS.total
R2 <- round(R2,7)
return(R2)
}
# Obtaining R2 values for predicted R2.
R2_Predicted <- ""
for (i in 1:8){
R2_Predicted[i] <- get_R2_3(i)
}
Let’s compare the values; the Components column indicates the number of PLS components used.
| Components | R2_Predicted | R2_Train |
|---|---|---|
| 1 | 0.493486 | 0.4518044 |
| 2 | 0.6625767 | 0.5505490 |
| 3 | 0.6962113 | 0.5779411 |
| 4 | 0.6896897 | 0.5715837 |
| 5 | 0.6733863 | 0.5547950 |
| 6 | 0.6271297 | 0.5601403 |
| 7 | 0.6354934 | 0.5479305 |
| 8 | 0.6446168 | 0.5421792 |
From the above table, it is interesting to note the sizeable gap between the resampled \(R^2\) from training and the \(R^2\) obtained on the test set: the model appears to perform better on the test data. Focusing on the 3-component model, the predicted \(R^2 = 0.6962113\) versus the training (resampled) \(R^2 = 0.5779411\).
(e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
A diagram depicting the structure of a PLS model. PLS finds components that simultaneously summarize variation of the predictors while being optimally correlated with the outcome.
Since the PLS components are constructed as linear combinations of the original predictors, and PLS finds components that maximally summarize the variation of the predictors while requiring those components to have maximum correlation with the response, PLS strikes a compromise between dimension reduction of the predictor space and a predictive relationship with the response. In that sense every predictor contributes to the model; however, because 12 predictors describe the input biological material and 45 describe the manufacturing process, the process predictors dominate the list.
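This can also be checked directly with caret's varImp on the plsTune object from part (c); a minimal sketch:
plsImp <- varImp(plsTune)
plot(plsImp, top = 20)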
(f)
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
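The plots referred to below are not reproduced here; a sketch of how they could be generated with caret's featurePlot, where the two predictor names are illustrative stand-ins for the top predictors identified by varImp:
topPredictors <- c("ManufacturingProcess32", "ManufacturingProcess13")  # illustrative choice
featurePlot(x = CMP.trans[, topPredictors],
            y = CMP.trans$Yield,
            plot = "scatter")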
From the above plots, we can see meaningful correlations between the manufacturing process predictors and the yield; thus, by focusing on those manufacturing process steps, the yield of future runs of the manufacturing process could be improved.
References
Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. New York: Springer. http://appliedpredictivemodeling.com/.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.