HW7 DATA624

Author

Semyon Toybis

Assignment

We are required to complete questions 6.2 and 6.3 from chapter 6 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
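
The code below assumes the following packages are attached; the original setup chunk is not shown, so this is a sketch of the likely set:

library(AppliedPredictiveModeling)  # permeability and ChemicalManufacturingProcess data
library(caret)                      # nearZeroVar, createDataPartition, train, postResample, varImp
library(dplyr)                      # select, filter
library(ggplot2)                    # predicted vs. actual plots
library(GGally)                     # ggpairs
library(VIM)                        # kNN imputation
# caret also calls on the pls, elasticnet, and glmnet packages for the models tuned below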

6.2

A

Below I load the permeability data set, which contains the predictor matrix “fingerprints” and the response matrix “permeability.”

data("permeability")

The fingerprints matrix contains 1,107 binary molecular predictors for 165 compounds, and permeability contains the permeability response for each compound.

nrow(fingerprints)
[1] 165
ncol(fingerprints)
[1] 1107
nrow(permeability)
[1] 165

B

We are tasked with filtering out predictors that have low frequencies, since many of the molecular substructures occur in only a few compounds and the fingerprint predictors are therefore sparse.

length(nearZeroVar(fingerprints))
[1] 719

We can see from the above that 719 of the 1107 columns in fingerprints have near zero variance based on the default thresholds of the function: a predictor is flagged when the ratio of the frequency of its most common value to that of its second most common value exceeds 95/5 and the percentage of unique values out of the total number of data points is below 10%. For example, here is what these statistics look like for the first ten variables:

head(nearZeroVar(fingerprints, saveMetrics = TRUE), n=10)
    freqRatio percentUnique zeroVar   nzv
X1   3.342105     1.2121212   FALSE FALSE
X2   3.459459     1.2121212   FALSE FALSE
X3   3.714286     1.2121212   FALSE FALSE
X4   3.714286     1.2121212   FALSE FALSE
X5   3.714286     1.2121212   FALSE FALSE
X6   1.229730     1.2121212   FALSE FALSE
X7   0.000000     0.6060606    TRUE  TRUE
X8   0.000000     0.6060606    TRUE  TRUE
X9  54.000000     1.2121212   FALSE  TRUE
X10 40.250000     1.2121212   FALSE  TRUE
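
To see how these statistics are computed, the freqRatio and percentUnique for a single column (X9, flagged above as near-zero variance) can be reproduced by hand; a minimal sketch:

x9_counts <- sort(table(fingerprints[, "X9"]), decreasing = TRUE)
x9_counts[1] / x9_counts[2]                      # frequency ratio: most common vs. second most common value
length(unique(fingerprints[, "X9"])) / nrow(fingerprints) * 100  # percentage of unique values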

Below, I filter out the predictors with near zero variance. This leaves 388 predictors for modeling.

nzv_fingerprint <- nearZeroVar(fingerprints)
fingerprints_filtered <- fingerprints[,-nzv_fingerprint]
ncol(fingerprints_filtered)
[1] 388
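
As a quick verification that the filtering worked, re-running nearZeroVar on the filtered matrix should flag nothing:

length(nearZeroVar(fingerprints_filtered))   # expected to return 0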

C

In this section we are tasked with splitting the data into training and test sets, pre-processing the data, and tuning a PLS model.

First, I combine the 388 predictors from above and the permeability matrix into one matrix.

modeling_data <- as.data.frame(cbind(fingerprints_filtered, permeability))
modeling_data <- modeling_data |> select(permeability, everything())

Next, I split the data into training and test sets via an 80%/20% split.

set.seed(10)
trainIndex <- createDataPartition(modeling_data$permeability, p = 0.8, list = FALSE)
trainData <- modeling_data[trainIndex,]
testData <- modeling_data[-trainIndex,]
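
As a sanity check on the split (createDataPartition keeps roughly 80% of the 165 rows for training):

nrow(trainData)   # roughly 80% of the 165 compounds
nrow(testData)    # the remaining ~20%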

Next, I tune a PLS model on the training set, centering and scaling the predictors via the preProcess argument of train.

set.seed(100)
fitControl <- trainControl(method = 'CV', number = 10)

plsFit <- train(permeability ~ ., data = trainData,
                method = 'pls',
                tuneLength = 20,
                preProcess = c('center','scale'),
                trControl = fitControl)

plsFit
Partial Least Squares 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE     
   1     13.09303  0.3751150  9.955400
   2     11.93669  0.4749213  8.426133
   3     11.42380  0.5213586  8.536021
   4     11.15017  0.5479451  8.634503
   5     11.08069  0.5518608  8.428915
   6     11.00353  0.5662611  8.441985
   7     10.90816  0.5692303  8.401248
   8     10.97174  0.5612605  8.490543
   9     11.19584  0.5590258  8.533019
  10     11.32924  0.5622410  8.720479
  11     11.81544  0.5372975  9.012913
  12     12.18829  0.5127903  9.470607
  13     12.55119  0.4971668  9.533206
  14     12.83874  0.4776451  9.752862
  15     12.90284  0.4737939  9.623939
  16     13.25394  0.4558385  9.863059
  17     13.29940  0.4590803  9.754005
  18     13.47374  0.4547626  9.851230
  19     13.37292  0.4628417  9.691664
  20     13.63080  0.4468999  9.829598

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 7.

Based on the above, the optimal number of latent variables is 7.

plot(plsFit)

summary(plsFit)
Data:   X dimension: 133 388 
    Y dimension: 133 1
Fit method: oscorespls
Number of components considered: 7
TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X           20.32    35.18    40.50    45.08    49.40    57.76    62.36
.outcome    36.47    52.82    62.12    69.35    75.31    77.53    79.50

The average R squared from the re-sampling process is:

mean(plsFit$resample$Rsquared)
[1] 0.5692303

D

Now I use the model to predict the responses for the test set:

predictions <- predict(plsFit, newdata = testData)

The test set estimate of R squared is:

postResample(pred = predictions, obs = testData$permeability)
      RMSE   Rsquared        MAE 
13.7668596  0.1969428 10.3091999 

This R squared is low, suggesting the model does not predict permeability well on the test set.
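
As an aside, caret's postResample computes R squared as the squared correlation between predicted and observed values, so the test-set figure above can be reproduced directly from the objects already defined:

cor(predictions, testData$permeability)^2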

results <- data.frame(actual = testData$permeability, predicted = predictions)

ggplot(results, aes(x = actual, y = predicted)) + geom_point()+ 
  labs(title = "Predicted vs. Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()
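
A 45-degree reference line makes it easier to judge how far the predictions fall from the actual values; one optional variant of the plot above:

ggplot(results, aes(x = actual, y = predicted)) + geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(title = "Predicted vs. Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()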

E

Next, I try training and predicting with a ridge regression model.

ridgeGrid <- data.frame(.lambda = seq(0.01, 0.2, length.out = 20))

set.seed(100)
ridgeFit <- train(permeability ~., data=trainData,
                  method = 'ridge',
                  tuneGrid =  ridgeGrid,
                  trControl = fitControl,
                  preProcess = c('center','scale'))

ridgeFit
Ridge Regression 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE     
  0.01    13.37438  0.4507224  9.446142
  0.02    12.93139  0.4788009  9.229718
  0.03    12.63829  0.4931569  9.149707
  0.04    12.50602  0.5018970  9.101594
  0.05    12.39371  0.5091283  9.060586
  0.06    12.31765  0.5148716  9.030316
  0.07    12.23757  0.5204122  8.991373
  0.08    12.19520  0.5245518  8.972502
  0.09    12.15088  0.5285588  8.956560
  0.10    12.12592  0.5319177  8.951619
  0.11    12.09949  0.5351551  8.949193
  0.12    12.08405  0.5379754  8.956057
  0.13    12.07515  0.5406405  8.965868
  0.14    12.07091  0.5429374  8.977723
  0.15    12.08750  0.5442211  9.001394
  0.16    12.07566  0.5472016  9.004834
  0.17    12.07000  0.5494009  9.023342
  0.18    12.09274  0.5502725  9.046279
  0.19    12.08507  0.5530325  9.056521
  0.20    12.10620  0.5545031  9.088196

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.17.

The optimal lambda value is 0.17.

plot(ridgeFit)

Next, I use the ridge regression model to predict the test-set values:

predictionsRidge <- predict(ridgeFit, newdata = testData)

postResample(pred = predictionsRidge, obs = testData$permeability)
      RMSE   Rsquared        MAE 
14.2419221  0.2340469 10.1707615 

The test-set R squared is slightly higher than for the PLS model, which may suggest a somewhat better fit, though the RMSE is also higher and the R squared remains low, so this is still not a very accurate model.
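
For a side-by-side view of the two models on the test set (reusing the prediction vectors defined above):

rbind(
  PLS   = postResample(pred = predictions, obs = testData$permeability),
  Ridge = postResample(pred = predictionsRidge, obs = testData$permeability)
)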

resultsRidge <- data.frame(actual = testData$permeability, predicted = predictionsRidge)

ggplot(resultsRidge, aes(x = actual, y = predicted)) + geom_point()+ 
  labs(title = "Predicted vs. Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()

F

Both of the models above fit the test data poorly, so I would not recommend either of them as a replacement for the permeability laboratory experiment.

6.3

A

Below, I load the data:

data("ChemicalManufacturingProcess")

B

There are missing values in the data set:

sum(is.na(ChemicalManufacturingProcess))
[1] 106

Below I impute missing values using the k-nearest neighbors approach via the kNN function from the VIM package. The function appends indicator columns to the right of the original data that flag whether each value was imputed; I drop these columns from the data frame.

chem_data <- kNN(ChemicalManufacturingProcess)
chem_data <- chem_data[,1:ncol(ChemicalManufacturingProcess)]
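
A quick check confirms the imputation removed all missing values (as an aside, kNN can also be called with imp_var = FALSE to suppress the indicator columns):

sum(is.na(chem_data))   # expected to return 0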

C

Next, I split the data into training and test sets via an 80%/20% split.

set.seed(10)
chem_trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

chem_trainData <- chem_data[chem_trainIndex,]
chem_testData <- chem_data[-chem_trainIndex,]

Now I will tune a lasso regression model to the data, for two reasons: first, I have not yet used this model type, and second, the lasso can shrink the coefficients of predictors that are not relevant to the response all the way to zero, whereas ridge regression only shrinks them toward zero. I will center and scale the data as part of the model tuning.

set.seed(150)
chem_fitControl <- trainControl(method = 'CV', number = 10)

chem_fit <- train(Yield ~ ., data = chem_trainData,
                method = 'glmnet',
                tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length.out = 10)),
                preProcess = c('center','scale'),
                trControl = chem_fitControl)

chem_fit
glmnet 

144 samples
 57 predictor

Pre-processing: centered (57), scaled (57) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 131, 131, 129, 130, 130, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE      
  0.001   8.277536  0.4467059  3.4016057
  0.112   1.247332  0.5877934  0.9888373
  0.223   1.383448  0.5398664  1.0446488
  0.334   1.296413  0.5726856  1.0369149
  0.445   1.341709  0.5893853  1.0745626
  0.556   1.414385  0.5801993  1.1338832
  0.667   1.496858  0.5630656  1.2002924
  0.778   1.586499  0.5342690  1.2712958
  0.889   1.666759  0.5028704  1.3429808
  1.000   1.740322  0.4633349  1.4112292

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.112.

plot(chem_fit)

The optimal lambda value was found to be 0.112.
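
Because the lasso can shrink coefficients exactly to zero, the number of predictors it dropped at the chosen lambda can be counted from the underlying glmnet fit; a small sketch:

lasso_coefs <- as.vector(coef(chem_fit$finalModel, s = chem_fit$bestTune$lambda))
sum(lasso_coefs[-1] == 0)   # predictors with coefficients shrunk exactly to zero (intercept excluded)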

D

Next, I try predicting the test set:

chem_predictions <- predict(chem_fit, newdata = chem_testData)

postResample(pred = chem_predictions, obs = chem_testData$Yield)
     RMSE  Rsquared       MAE 
1.1114012 0.5332645 0.9400514 

The R squared is 0.53 and the RMSE is 1.11, which is slightly lower than the cross-validated RMSE of about 1.25 on the training set at a lambda of 0.112.

E

Below are the predictors ranked in order of importance via varImp, which scales the importance scores to be between 0 and 100.

varImp(chem_fit)
glmnet variable importance

  only 20 most important variables shown (out of 57)

                        Overall
ManufacturingProcess32 100.0000
ManufacturingProcess09  59.3054
ManufacturingProcess13  26.4793
ManufacturingProcess17  25.1969
ManufacturingProcess45  19.0773
ManufacturingProcess15  14.4978
ManufacturingProcess06  13.6682
ManufacturingProcess37  12.5137
ManufacturingProcess36  10.8643
BiologicalMaterial03     8.8566
BiologicalMaterial05     8.4810
ManufacturingProcess34   5.8699
ManufacturingProcess07   0.9961
BiologicalMaterial07     0.7014
ManufacturingProcess39   0.3192
ManufacturingProcess10   0.0000
BiologicalMaterial01     0.0000
ManufacturingProcess11   0.0000
ManufacturingProcess40   0.0000
BiologicalMaterial06     0.0000

plot(varImp(chem_fit), top = 20)

The manufacturing process predictors dominate the list.

F

Below I explore the relationship between the top predictors and the response variable (yield):

top_predictors <- varImp(chem_fit)
top_predictors <- top_predictors$importance
top_predictors <- top_predictors |> filter(Overall > 0)

top_predictor_names <- rownames(top_predictors)

relationship_df <- chem_data |> select(Yield, all_of(top_predictor_names))

ggpairs(relationship_df)

plot(varImp(chem_fit), top = 20)

From the above plot, we can see that ManufacturingProcess32 has the greatest impact on Yield; in the ggpairs plot, it shows a statistically significant positive correlation of 0.608 with Yield. The importance of ManufacturingProcess32 can provide insight for factory management, helping them optimize that process and allocate resources to it at the expense of less important processes, such as Processes 24 and 19.
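
The 0.608 correlation reported by ggpairs for the top predictor can also be computed directly (the exact value depends on the imputed data):

cor(chem_data$Yield, chem_data$ManufacturingProcess32)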