library(ggplot2)
library(dplyr)
library(caret)
library(RANN)

Homework-7 Linear Regression and Its Cousins

Kuhn, Max. Applied Predictive Modeling (p. 138).

Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Part A

  1. Start R and use these commands to load the data:
  • > library(AppliedPredictiveModeling)

  • > data(chemicalManufacturingProcess)

    The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.

    library(AppliedPredictiveModeling)
    data("ChemicalManufacturingProcess")

Part B

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

    naCols <- names(which(sapply(ChemicalManufacturingProcess, anyNA)))
    naCols
    ##  [1] "ManufacturingProcess01" "ManufacturingProcess02"
    ##  [3] "ManufacturingProcess03" "ManufacturingProcess04"
    ##  [5] "ManufacturingProcess05" "ManufacturingProcess06"
    ##  [7] "ManufacturingProcess07" "ManufacturingProcess08"
    ##  [9] "ManufacturingProcess10" "ManufacturingProcess11"
    ## [11] "ManufacturingProcess12" "ManufacturingProcess14"
    ## [13] "ManufacturingProcess22" "ManufacturingProcess23"
    ## [15] "ManufacturingProcess24" "ManufacturingProcess25"
    ## [17] "ManufacturingProcess26" "ManufacturingProcess27"
    ## [19] "ManufacturingProcess28" "ManufacturingProcess29"
    ## [21] "ManufacturingProcess30" "ManufacturingProcess31"
    ## [23] "ManufacturingProcess33" "ManufacturingProcess34"
    ## [25] "ManufacturingProcess35" "ManufacturingProcess36"
    ## [27] "ManufacturingProcess40" "ManufacturingProcess41"
    print(paste("There are", length(naCols), "columns with missing values"))
    ## [1] "There are 28 columns with missing values"
    preP <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
    #NOTE: preProcess will center and scale the data by default as well
    df <- predict(preP, ChemicalManufacturingProcess)
    # Restore the response variable values to original
    df$Yield = ChemicalManufacturingProcess$Yield

Part C

  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

    library(e1071)
    skewVals <- apply(df, 2, skewness)
    # order by skewness magnitude in descending order
    skewOrd <- round(skewVals[order(-abs(skewVals))], 4)
    skewOrd %>% head(10)
    ## ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess18 
    ##               -12.8576               -12.8177               -12.7361 
    ## ManufacturingProcess27 ManufacturingProcess20 ManufacturingProcess16 
    ##               -12.7047               -12.6383               -12.4202 
    ## ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess43 
    ##               -11.9676               -10.2175                 9.0549 
    ##   BiologicalMaterial07 
    ##                 7.3987


    There is some significant skewness in the predictors. The plots below show one of the predictors with the largest skewness score. For this predictor, most of the values are packed together, although there are some readings with zero value, which appear as outliers. However, zero value may indicate the absence of a particular manufacturing process or a biological attribute. This fact by itself maybe influential in the modelling. Therefore, without any domain knowledge, it would not be advisable to deal with zero values and outliers in any way. Instead Box-Cox transformation will be used to minimize skewness.


    hist(ChemicalManufacturingProcess$ManufacturingProcess26)

    boxplot(ChemicalManufacturingProcess$ManufacturingProcess26)

    #Repeat preprocessing done before, but now including the "BoxCox" transformation.
    preP <- preProcess(ChemicalManufacturingProcess, method = c("BoxCox", "knnImpute"))
    df <- predict(preP, ChemicalManufacturingProcess)
    # Restore the response variable values to original
    df$Yield = ChemicalManufacturingProcess$Yield
    
    # Split the data into a training and a test set
    trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
    df.train <- df[trainRows,]
    df.test <- df[-trainRows,]
    # Tune PLS model
    set.seed(123)
    colYield <-
      which(colnames(ChemicalManufacturingProcess) == "Yield")
    ctrl <- trainControl(method = "cv", number = 5)
    mdl <- train(
      x = df.train[, -colYield],
      y = df.train$Yield,
      method = "pls",
      tuneLength = 20,
      trControl = ctrl
    )
    mdl
    ## Partial Least Squares 
    ## 
    ## 144 samples
    ##  57 predictor
    ## 
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold) 
    ## Summary of sample sizes: 116, 116, 115, 116, 113 
    ## Resampling results across tuning parameters:
    ## 
    ##   ncomp  RMSE      Rsquared   MAE      
    ##    1     1.408300  0.4332553  1.1575874
    ##    2     1.272670  0.5348657  1.0309490
    ##    3     1.218631  0.5801732  0.9804358
    ##    4     1.218133  0.5811390  0.9789612
    ##    5     1.217444  0.5836795  0.9894163
    ##    6     1.231287  0.5770110  0.9963243
    ##    7     1.249034  0.5678888  1.0163617
    ##    8     1.264068  0.5593044  1.0387946
    ##    9     1.266819  0.5608885  1.0390328
    ##   10     1.298033  0.5426360  1.0626458
    ##   11     1.338906  0.5172047  1.0997018
    ##   12     1.377222  0.4975651  1.1177735
    ##   13     1.409151  0.4810685  1.1324220
    ##   14     1.425269  0.4747577  1.1327053
    ##   15     1.444032  0.4690502  1.1322622
    ##   16     1.467374  0.4649384  1.1377789
    ##   17     1.483788  0.4609287  1.1496691
    ##   18     1.519606  0.4454053  1.1669062
    ##   19     1.550066  0.4336239  1.1804848
    ##   20     1.583255  0.4199894  1.1950141
    ## 
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was ncomp = 5.
    mdl.ncomp <- mdl[["bestTune"]][["ncomp"]]
    print(
      paste(
        "The optimal value (for the number of PLS components) with respect to RMSE performance metric is",
        mdl.ncomp
      )
    )
    ## [1] "The optimal value (for the number of PLS components) with respect to RMSE performance metric is 5"
    plot(mdl)

Part D

  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

    postResample(predict(mdl, newdata = df.test, ncomp = mdl.ncomp), df.test$Yield)
    ##      RMSE  Rsquared       MAE 
    ## 1.1721978 0.5953868 0.9704013
    getTrainPerf(mdl)
    ##   TrainRMSE TrainRsquared  TrainMAE method
    ## 1  1.217444     0.5836795 0.9894163    pls
    plot(mdl$finalModel)
    abline(a = 0, b = 1)

    plot(residuals(mdl, ncomp = mdl.ncomp))
    abline(h = 0)


    The RMSE on the test set turned out to be actually slightly lower than the RMSE of the resampled training set. Also, the model seems to be fitting well, with residuals looking to be scattered randomly and uniformly around zero value.

Part E

  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

    mdl.vip <- varImp(mdl)
    ## Warning: package 'pls' was built under R version 3.6.3
    ## 
    ## Attaching package: 'pls'
    ## The following object is masked from 'package:caret':
    ## 
    ##     R2
    ## The following object is masked from 'package:stats':
    ## 
    ##     loadings
    plot(mdl.vip, top = 20)

    mdl.vip
    ## pls variable importance
    ## 
    ##   only 20 most important variables shown (out of 57)
    ## 
    ##                        Overall
    ## ManufacturingProcess32  100.00
    ## ManufacturingProcess13   97.38
    ## ManufacturingProcess17   94.37
    ## ManufacturingProcess09   83.46
    ## ManufacturingProcess36   82.34
    ## ManufacturingProcess12   63.47
    ## ManufacturingProcess06   62.12
    ## BiologicalMaterial06     59.44
    ## BiologicalMaterial02     59.17
    ## ManufacturingProcess11   56.21
    ## ManufacturingProcess33   54.56
    ## BiologicalMaterial12     54.53
    ## BiologicalMaterial03     54.49
    ## BiologicalMaterial11     54.01
    ## BiologicalMaterial08     53.73
    ## BiologicalMaterial04     53.16
    ## ManufacturingProcess34   51.80
    ## ManufacturingProcess04   49.83
    ## ManufacturingProcess28   47.84
    ## BiologicalMaterial01     46.91


    Based on the graph above, for the importance of predictors in this PLS model, it appears that the top 5 most significant predictors have to do with manufacturing process. Also, among the top 20, more than half are process predictors as oppose to the biological ones.

Part F

  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

    mdl.coef <- coef(mdl$finalModel)
    mdl.coef.rn <- rownames(mdl.coef)
    top5.coef <- data.frame(
      PREDICTOR = c(
        "ManufacturingProcess32",
        "ManufacturingProcess13",
        "ManufacturingProcess09",
        "ManufacturingProcess17",
        "ManufacturingProcess36"
      ),
      COEF = c(mdl.coef[mdl.coef.rn == "ManufacturingProcess32"],
               mdl.coef[mdl.coef.rn == "ManufacturingProcess13"],
               mdl.coef[mdl.coef.rn == "ManufacturingProcess09"],
               mdl.coef[mdl.coef.rn == "ManufacturingProcess17"],
               mdl.coef[mdl.coef.rn == "ManufacturingProcess36"])
    )
    top5.coef
    ##                PREDICTOR       COEF
    ## 1 ManufacturingProcess32  0.4885703
    ## 2 ManufacturingProcess13 -0.3525023
    ## 3 ManufacturingProcess09  0.2889084
    ## 4 ManufacturingProcess17 -0.2784638
    ## 5 ManufacturingProcess36 -0.2732666
    plot(x = df$Yield, y = df$ManufacturingProcess32)

    plot(x = df$Yield, y = df$ManufacturingProcess36)


    Among the top 5 most influential predictors, some have positive estimated coefficients, while others have negative. Two plots above illustrate positive and negative correlation with the response variable. To improve the yield in the future runs, the manufacturing processes with positive coefficients should be boosted or increased, while the processes with negative coefficients should be reduced.