A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cd<-ChemicalManufacturingProcess
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :35.25 Min. :4.580 Min. :46.87
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68
## Median :39.97 Median :6.305 Median :55.09
## Mean :40.18 Mean :6.411 Mean :55.69
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74
## Max. :46.34 Max. :8.810 Max. :64.75
##
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :56.97 Min. : 9.38 Min. :13.24
## 1st Qu.:64.98 1st Qu.:11.24 1st Qu.:17.23
## Median :67.22 Median :12.10 Median :18.49
## Mean :67.70 Mean :12.35 Mean :18.60
## 3rd Qu.:70.43 3rd Qu.:13.22 3rd Qu.:19.90
## Max. :78.25 Max. :23.09 Max. :24.85
##
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :48.46 Median :100.0 Median :17.51
## Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :59.38 Max. :100.8 Max. :19.14
##
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :11.44 Min. :1.770 Min. :135.8
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8
## Median :12.84 Median :2.710 Median :146.1
## Mean :12.85 Mean :2.801 Mean :147.0
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6
## Max. :14.08 Max. :6.870 Max. :158.7
##
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :18.35 Min. : 0.00 Min. : 0.00
## 1st Qu.:19.73 1st Qu.:10.80 1st Qu.:19.30
## Median :20.12 Median :11.40 Median :21.00
## Mean :20.20 Mean :11.21 Mean :16.68
## 3rd Qu.:20.75 3rd Qu.:12.15 3rd Qu.:21.50
## Max. :22.21 Max. :14.10 Max. :22.50
## NA's :1 NA's :3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :1.47 Min. :911.0 Min. : 923.0
## 1st Qu.:1.53 1st Qu.:928.0 1st Qu.: 986.8
## Median :1.54 Median :934.0 Median : 999.2
## Mean :1.54 Mean :931.9 Mean :1001.7
## 3rd Qu.:1.55 3rd Qu.:936.0 3rd Qu.:1008.9
## Max. :1.60 Max. :946.0 Max. :1175.3
## NA's :15 NA's :1 NA's :1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median :206.8 Median :177.0 Median :178.0
## Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :227.4 Max. :178.0 Max. :178.0
## NA's :2 NA's :1 NA's :1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :38.89 Min. : 7.500 Min. : 7.500
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000
## Median :45.73 Median : 9.100 Median : 9.400
## Mean :45.66 Mean : 9.179 Mean : 9.386
## 3rd Qu.:46.52 3rd Qu.: 9.550 3rd Qu.: 9.900
## Max. :49.36 Max. :11.600 Max. :11.500
## NA's :9 NA's :10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. : 0.0 Min. :32.10 Min. :4701
## 1st Qu.: 0.0 1st Qu.:33.90 1st Qu.:4828
## Median : 0.0 Median :34.60 Median :4856
## Mean : 857.8 Mean :34.51 Mean :4854
## 3rd Qu.: 0.0 3rd Qu.:35.20 3rd Qu.:4882
## Max. :4549.0 Max. :38.60 Max. :5055
## NA's :1 NA's :1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :5904 Min. : 0 Min. :31.30
## 1st Qu.:6010 1st Qu.:4561 1st Qu.:33.50
## Median :6032 Median :4588 Median :34.40
## Mean :6039 Mean :4566 Mean :34.34
## 3rd Qu.:6061 3rd Qu.:4619 3rd Qu.:35.10
## Max. :6233 Max. :4852 Max. :40.00
##
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :4835 Median :6022 Median :4582
## Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :4971 Max. :6146 Max. :4759
##
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-1.8000 Min. : 0.000 Min. :0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000
## Median :-0.3000 Median : 5.000 Median :3.000
## Mean :-0.1642 Mean : 5.406 Mean :3.017
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000
## Max. : 3.6000 Max. :12.000 Max. :6.000
## NA's :1 NA's :1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 4.000 1st Qu.:4832 1st Qu.:6020
## Median : 8.000 Median :4855 Median :6047
## Mean : 8.834 Mean :4828 Mean :6016
## 3rd Qu.:14.000 3rd Qu.:4877 3rd Qu.:6070
## Max. :23.000 Max. :4990 Max. :6161
## NA's :1 NA's :5 NA's :5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:4560 1st Qu.: 0.000 1st Qu.:19.70
## Median :4587 Median :10.400 Median :19.90
## Mean :4563 Mean : 6.592 Mean :20.01
## 3rd Qu.:4609 3rd Qu.:10.750 3rd Qu.:20.40
## Max. :4710 Max. :11.500 Max. :22.00
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median : 9.100 Median :70.80 Median :158.0
## Mean : 9.161 Mean :70.18 Mean :158.5
## 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :11.200 Max. :72.50 Max. :173.0
## NA's :5 NA's :5
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :56.00 Min. :2.300 Min. :463.0
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0
## Median :64.00 Median :2.500 Median :495.0
## Mean :63.54 Mean :2.494 Mean :495.6
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.5
## Max. :70.00 Max. :2.600 Max. :522.0
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :0.01700 Min. :0.000 Min. :0.000
## 1st Qu.:0.01900 1st Qu.:0.700 1st Qu.:2.000
## Median :0.02000 Median :1.000 Median :3.000
## Mean :0.01957 Mean :1.014 Mean :2.534
## 3rd Qu.:0.02000 3rd Qu.:1.300 3rd Qu.:3.000
## Max. :0.02200 Max. :2.300 Max. :3.000
## NA's :5
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:7.100 1st Qu.:0.00000 1st Qu.:0.00000
## Median :7.200 Median :0.00000 Median :0.00000
## Mean :6.851 Mean :0.01771 Mean :0.02371
## 3rd Qu.:7.300 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.500 Max. :0.10000 Max. :0.20000
## NA's :1 NA's :1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :11.60 Median : 0.8000 Median :1.900
## Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :12.10 Max. :11.0000 Max. :2.100
##
## ManufacturingProcess45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
##
Let's examine the distributions of the predictors.
library(DataExplorer)
plot_histogram(cd)
The data frame contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs. Yield contains the percent yield for each run.
plot_missing(cd)
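Our missing data plot shows that the target variable is complete. ManufacturingProcess03 is missing 15 of its 176 entries (8.52 percent). Two other process predictors are missing 9 and 10 entries (about 5 to 6 percent), and every remaining predictor with missing values is missing less than 3 percent of its entries. With so little missing data, imputing the missing values is a reasonable choice.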
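The percentages shown in the missing-data plot can also be computed directly. The following is a minimal base R sketch on a made-up data frame; missing_pct and toy are illustrative names, not part of DataExplorer. The same function could be applied to cd.

```r
# Percentage of missing entries per column, largest first.
# `missing_pct` and `toy` are illustrative names, not from any package.
missing_pct <- function(df) {
  pct <- 100 * colMeans(is.na(df))
  sort(pct[pct > 0], decreasing = TRUE)
}

toy <- data.frame(
  y  = 1:8,
  x1 = c(NA, 2:8),      # 1 of 8 entries missing
  x2 = c(NA, NA, 3:8)   # 2 of 8 entries missing
)
pct <- missing_pct(toy)
print(pct)

# The figure quoted for ManufacturingProcess03: 15 NA's out of 176 runs
mp03_pct <- round(100 * 15 / 176, 2)
print(mp03_pct)
```

Columns without missing values are dropped from the result, so the output only lists the variables that need attention.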
The impute package is not available on CRAN, so we install it from Bioconductor using BiocManager. We use its impute.knn function to fill in the missing values with k nearest neighbors (k = 10 by default). impute.knn works row-wise: for each row with missing values (here, a manufacturing run), it finds the k nearest rows using Euclidean distance computed over the columns that are not missing in that row. If a candidate neighbor is itself missing some of those coordinates, the distance is averaged over the coordinates that are available. The missing entries are then filled with the average of the neighbors' values. Because we pass the whole data frame, Yield takes part in the distance calculation. The method assumes that no row or column is mostly missing, which holds here.
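To make the idea concrete, here is a small base R sketch of row-wise nearest-neighbor imputation on a toy matrix. It is a simplified illustration written for this note (knn_impute_sketch is a hypothetical helper), not the code that impute.knn actually runs.

```r
# Simplified row-wise kNN imputation (illustrative only, not impute.knn itself).
knn_impute_sketch <- function(x, k = 2) {
  out <- x
  n <- nrow(x)
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss <- is.na(x[i, ])
    others <- setdiff(seq_len(n), i)
    # Distance to every other row, averaged over coordinates observed in both rows
    d <- sapply(others, function(j) {
      ok <- !miss & !is.na(x[j, ])
      if (!any(ok)) return(Inf)
      sqrt(mean((x[i, ok] - x[j, ok])^2))
    })
    nn <- others[order(d)[seq_len(min(k, length(others)))]]
    for (col in which(miss)) {
      vals <- x[nn, col]
      # Fall back to the column mean if every neighbor is missing this entry
      out[i, col] <- if (all(is.na(vals))) {
        mean(x[, col], na.rm = TRUE)
      } else {
        mean(vals, na.rm = TRUE)
      }
    }
  }
  out
}

toy <- rbind(c(1.0, 2.0, 3.0),
             c(1.1, 2.1, NA),
             c(0.9, 1.9, 2.8),
             c(10, 20, 30))
filled <- knn_impute_sketch(toy, k = 2)
print(filled)
```

In the toy matrix, the second row is closest to rows one and three, so its missing entry is filled with the average of their third-column values rather than being pulled toward the distant fourth row.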
Some other methods of imputation include using the mean or median of each variable to fill in the NA’s.
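As a point of comparison, median imputation takes only a few lines of base R. This is a sketch on made-up data; impute_median is an illustrative helper name and is not used elsewhere in this analysis.

```r
# Replace NA's in each numeric column with that column's median.
# `impute_median` is an illustrative helper, not from any package.
impute_median <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

toy <- data.frame(a = c(1, NA, 3, 100), b = c(NA, 2, 2, 4))
filled <- impute_median(toy)
print(filled)
```

Unlike the nearest-neighbor approach, every missing entry in a column receives the same value, regardless of what the rest of the row looks like.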
http://www.bioconductor.org/packages/release/bioc/html/impute.html
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("impute")
library(impute)
cd2 <- impute.knn(as.matrix(cd))
cd2 <- as.data.frame(cd2$data)
plot_missing(cd2)
We are no longer missing data. We can see that impute has worked correctly. We display new summary statistics.
summary(cd2)
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :35.25 Min. :4.580 Min. :46.87
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68
## Median :39.97 Median :6.305 Median :55.09
## Mean :40.18 Mean :6.411 Mean :55.69
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74
## Max. :46.34 Max. :8.810 Max. :64.75
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :56.97 Min. : 9.38 Min. :13.24
## 1st Qu.:64.98 1st Qu.:11.24 1st Qu.:17.23
## Median :67.22 Median :12.10 Median :18.49
## Mean :67.70 Mean :12.35 Mean :18.60
## 3rd Qu.:70.43 3rd Qu.:13.22 3rd Qu.:19.90
## Max. :78.25 Max. :23.09 Max. :24.85
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :40.60 Min. :100.0 Min. :15.88
## 1st Qu.:46.05 1st Qu.:100.0 1st Qu.:17.06
## Median :48.46 Median :100.0 Median :17.51
## Mean :48.91 Mean :100.0 Mean :17.49
## 3rd Qu.:51.34 3rd Qu.:100.0 3rd Qu.:17.88
## Max. :59.38 Max. :100.8 Max. :19.14
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :11.44 Min. :1.770 Min. :135.8
## 1st Qu.:12.60 1st Qu.:2.460 1st Qu.:143.8
## Median :12.84 Median :2.710 Median :146.1
## Mean :12.85 Mean :2.801 Mean :147.0
## 3rd Qu.:13.13 3rd Qu.:2.990 3rd Qu.:149.6
## Max. :14.08 Max. :6.870 Max. :158.7
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :18.35 Min. : 0.00 Min. : 0.00
## 1st Qu.:19.73 1st Qu.:10.78 1st Qu.:19.17
## Median :20.12 Median :11.40 Median :21.00
## Mean :20.20 Mean :11.20 Mean :16.66
## 3rd Qu.:20.75 3rd Qu.:12.12 3rd Qu.:21.50
## Max. :22.21 Max. :14.10 Max. :22.50
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :1.470 Min. :911.0 Min. : 923.0
## 1st Qu.:1.530 1st Qu.:928.0 1st Qu.: 986.8
## Median :1.544 Median :934.0 Median : 999.4
## Mean :1.540 Mean :931.8 Mean :1001.8
## 3rd Qu.:1.550 3rd Qu.:936.0 3rd Qu.:1009.2
## Max. :1.600 Max. :946.0 Max. :1175.3
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :203.0 Min. :177.0 Min. :177.0
## 1st Qu.:205.7 1st Qu.:177.0 1st Qu.:177.0
## Median :206.8 Median :177.0 Median :178.0
## Mean :207.4 Mean :177.5 Mean :177.6
## 3rd Qu.:208.7 3rd Qu.:178.0 3rd Qu.:178.0
## Max. :227.4 Max. :178.0 Max. :178.0
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :38.89 Min. : 7.500 Min. : 7.500
## 1st Qu.:44.89 1st Qu.: 8.700 1st Qu.: 9.000
## Median :45.73 Median : 9.100 Median : 9.400
## Mean :45.66 Mean : 9.186 Mean : 9.396
## 3rd Qu.:46.52 3rd Qu.: 9.525 3rd Qu.: 9.900
## Max. :49.36 Max. :11.600 Max. :11.500
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. : 0.0 Min. :32.10 Min. :4701
## 1st Qu.: 0.0 1st Qu.:33.90 1st Qu.:4827
## Median : 0.0 Median :34.60 Median :4856
## Mean : 852.9 Mean :34.51 Mean :4854
## 3rd Qu.: 0.0 3rd Qu.:35.20 3rd Qu.:4882
## Max. :4549.0 Max. :38.60 Max. :5055
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :5904 Min. : 0 Min. :31.30
## 1st Qu.:6010 1st Qu.:4561 1st Qu.:33.50
## Median :6032 Median :4588 Median :34.40
## Mean :6039 Mean :4566 Mean :34.34
## 3rd Qu.:6061 3rd Qu.:4619 3rd Qu.:35.10
## Max. :6233 Max. :4852 Max. :40.00
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. : 0 Min. :5890 Min. : 0
## 1st Qu.:4813 1st Qu.:6001 1st Qu.:4553
## Median :4835 Median :6022 Median :4582
## Mean :4810 Mean :6028 Mean :4556
## 3rd Qu.:4862 3rd Qu.:6050 3rd Qu.:4610
## Max. :4971 Max. :6146 Max. :4759
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-1.8000 Min. : 0.000 Min. :0.000
## 1st Qu.:-0.6000 1st Qu.: 3.000 1st Qu.:2.000
## Median :-0.3000 Median : 5.000 Median :3.000
## Mean :-0.1642 Mean : 5.406 Mean :3.011
## 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.:4.000
## Max. : 3.6000 Max. :12.000 Max. :6.000
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 4.000 1st Qu.:4831 1st Qu.:6020
## Median : 8.000 Median :4854 Median :6046
## Mean : 8.823 Mean :4825 Mean :6013
## 3rd Qu.:14.000 3rd Qu.:4876 3rd Qu.:6069
## Max. :23.000 Max. :4990 Max. :6161
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.:4561 1st Qu.: 0.000 1st Qu.:19.7
## Median :4588 Median :10.400 Median :19.9
## Mean :4561 Mean : 6.444 Mean :20.0
## 3rd Qu.:4609 3rd Qu.:10.700 3rd Qu.:20.4
## Max. :4710 Max. :11.500 Max. :22.0
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. : 0.000 Min. : 0.00 Min. :143.0
## 1st Qu.: 8.800 1st Qu.:70.10 1st Qu.:155.0
## Median : 9.200 Median :70.80 Median :158.0
## Mean : 9.167 Mean :70.16 Mean :158.5
## 3rd Qu.: 9.700 3rd Qu.:71.40 3rd Qu.:162.0
## Max. :11.200 Max. :72.50 Max. :173.0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :56.00 Min. :2.300 Min. :463.0
## 1st Qu.:62.00 1st Qu.:2.500 1st Qu.:490.0
## Median :64.00 Median :2.500 Median :495.5
## Mean :63.49 Mean :2.493 Mean :495.7
## 3rd Qu.:65.00 3rd Qu.:2.500 3rd Qu.:501.2
## Max. :70.00 Max. :2.600 Max. :522.0
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :0.01700 Min. :0.000 Min. :0.000
## 1st Qu.:0.01900 1st Qu.:0.700 1st Qu.:2.000
## Median :0.02000 Median :1.000 Median :3.000
## Mean :0.01959 Mean :1.014 Mean :2.534
## 3rd Qu.:0.02000 3rd Qu.:1.300 3rd Qu.:3.000
## Max. :0.02200 Max. :2.300 Max. :3.000
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:7.100 1st Qu.:0.00000 1st Qu.:0.00000
## Median :7.200 Median :0.00000 Median :0.00000
## Mean :6.851 Mean :0.01761 Mean :0.02358
## 3rd Qu.:7.300 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.500 Max. :0.10000 Max. :0.20000
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. : 0.00 Min. : 0.0000 Min. :0.000
## 1st Qu.:11.40 1st Qu.: 0.6000 1st Qu.:1.800
## Median :11.60 Median : 0.8000 Median :1.900
## Mean :11.21 Mean : 0.9119 Mean :1.805
## 3rd Qu.:11.70 3rd Qu.: 1.0250 3rd Qu.:1.900
## Max. :12.10 Max. :11.0000 Max. :2.100
## ManufacturingProcess45
## Min. :0.000
## 1st Qu.:2.100
## Median :2.200
## Mean :2.138
## 3rd Qu.:2.300
## Max. :2.600
Let's look at the correlations among the predictors and the response.
#correlation matrix and visualization
correlation_matrix <- round(cor(cd2),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.4), angle=90),
axis.text.y=element_text(size=rel(0.4)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
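We can see several predictors that are highly correlated with each other. We can use caret's findCorrelation function to apply a correlation threshold: for each pair of variables whose absolute pairwise correlation exceeds the cutoff, it flags one of the two for removal. Let's use a cutoff of 0.7. We are essentially being proactive about avoiding multicollinearity. Note that the correlation matrix passed to findCorrelation includes Yield; it was not flagged, so it remains the first column of the reduced data.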
library(caret)
cd3 = cor(cd2)
hc = findCorrelation(cd3, cutoff=0.7) # put any value as the "cutoff"
hc = sort(hc)
reduced_Data = cd2[,-c(hc)]
names(reduced_Data)
## [1] "Yield" "BiologicalMaterial03"
## [3] "BiologicalMaterial05" "BiologicalMaterial07"
## [5] "BiologicalMaterial09" "BiologicalMaterial10"
## [7] "ManufacturingProcess01" "ManufacturingProcess02"
## [9] "ManufacturingProcess03" "ManufacturingProcess04"
## [11] "ManufacturingProcess05" "ManufacturingProcess06"
## [13] "ManufacturingProcess07" "ManufacturingProcess08"
## [15] "ManufacturingProcess10" "ManufacturingProcess11"
## [17] "ManufacturingProcess12" "ManufacturingProcess16"
## [19] "ManufacturingProcess17" "ManufacturingProcess19"
## [21] "ManufacturingProcess20" "ManufacturingProcess21"
## [23] "ManufacturingProcess22" "ManufacturingProcess23"
## [25] "ManufacturingProcess24" "ManufacturingProcess25"
## [27] "ManufacturingProcess28" "ManufacturingProcess34"
## [29] "ManufacturingProcess35" "ManufacturingProcess36"
## [31] "ManufacturingProcess37" "ManufacturingProcess38"
## [33] "ManufacturingProcess39" "ManufacturingProcess41"
## [35] "ManufacturingProcess43" "ManufacturingProcess45"
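The filtering step can be sketched in base R as a greedy loop: find the most correlated pair, drop the member of the pair that is on average more correlated with everything else, and repeat until no pair exceeds the cutoff. This is a simplified illustration of the idea (drop_correlated is a hypothetical helper), not caret's exact implementation of findCorrelation.

```r
# Greedy correlation filter (simplified sketch, not caret::findCorrelation).
drop_correlated <- function(x, cutoff = 0.7) {
  cm <- abs(cor(x))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    # Most correlated remaining pair
    idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    pair <- colnames(cm)[idx]
    # Drop the member with the larger mean absolute correlation
    mean_cor <- rowMeans(cm[pair, , drop = FALSE])
    loser <- pair[which.max(mean_cor)]
    dropped <- c(dropped, loser)
    keep <- setdiff(colnames(cm), loser)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b = a + rnorm(100, sd = 0.1),  # near-duplicate of a
                  c = rnorm(100))                # unrelated column
dropped <- drop_correlated(toy, cutoff = 0.7)
print(dropped)
reduced_toy <- toy[, setdiff(names(toy), dropped), drop = FALSE]
print(names(reduced_toy))
```

Only one of the two near-duplicate columns is removed, and the unrelated column is always kept.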
#reduced data
set.seed(20)
train_row_partition <- createDataPartition(reduced_Data$Yield, p=0.8, list=FALSE)
X_train <- reduced_Data[train_row_partition, -1]
y_train <- reduced_Data[train_row_partition, 1]
X_test <- reduced_Data[-train_row_partition, -1]
y_test <- reduced_Data[-train_row_partition, 1]
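For a numeric outcome, createDataPartition samples within groups based on quantiles of the outcome, so that the training and test sets have similar yield distributions. A rough base R sketch of that idea follows, using the built-in mtcars data as a stand-in. The helper name stratified_split and the choice of four groups are assumptions made for illustration, and the rows selected will not match caret's.

```r
# Stratified 80/20 split on a numeric outcome (illustrative sketch only).
stratified_split <- function(y, p = 0.8, groups = 4, seed = 20) {
  set.seed(seed)
  # Bin the outcome by its quantiles
  breaks <- quantile(y, probs = seq(0, 1, length.out = groups + 1))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a fraction p of the row indices inside each bin
  picked <- lapply(split(seq_along(y), bins), function(idx) {
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_split(mtcars$mpg, p = 0.8)
train_toy <- mtcars[train_idx, ]
test_toy  <- mtcars[-train_idx, ]
print(c(train = nrow(train_toy), test = nrow(test_toy)))
```

Sampling within bins keeps very low and very high outcome values represented in both sets, which matters with only 176 runs.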
Fit an Initial Model
We will fit a partial least squares model using caret's train function. We set method to "pls" and tuneLength to 20, so that models with 1 to 20 components are evaluated and the one with the lowest cross-validated RMSE is selected. We build the model on the features that remained after dropping highly correlated variables, and the predictors are centered and scaled first. We also use 10-fold cross-validation. At a high level, this means that the training data are partitioned into k = 10 roughly equal-sized folds; each fold is held out once to validate a model fit on the other nine, and the results are averaged across folds.
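The resampling scheme described above can be sketched in base R. The example uses a plain linear model on the built-in mtcars data as a stand-in for PLS; kfold_rmse is an illustrative helper, and the folds are assigned by simple random shuffling, which is an assumption rather than a copy of caret's fold construction.

```r
# k-fold cross-validated RMSE for a linear model (illustrative sketch).
kfold_rmse <- function(formula, data, k = 10, seed = 1) {
  set.seed(seed)
  # Assign each row to one of k folds at random
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  rmse <- sapply(seq_len(k), function(f) {
    fit  <- lm(formula, data = data[folds != f, , drop = FALSE])
    held <- data[folds == f, , drop = FALSE]
    sqrt(mean((held[[response]] - predict(fit, newdata = held))^2))
  })
  mean(rmse)
}

cv_estimate <- kfold_rmse(mpg ~ wt + hp, mtcars, k = 10)
print(cv_estimate)
```

Every row is used for validation exactly once, so the averaged RMSE is an estimate of out-of-sample error rather than of fit to the training data.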
pls_tunned <- train(X_train, y_train, method = "pls",tuneLength = 20, trControl=trainControl(method='cv'), preProc = c("center", "scale"))
pls_tunned;
## Partial Least Squares
##
## 144 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 131, 128, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.560007 0.3891078 1.232125
## 2 1.882913 0.4172155 1.292349
## 3 1.727661 0.4592812 1.251877
## 4 2.191698 0.4292503 1.400555
## 5 2.336186 0.4337308 1.432827
## 6 2.339050 0.4369019 1.428144
## 7 2.378060 0.4476128 1.421328
## 8 2.351002 0.4544916 1.406405
## 9 2.335555 0.4576866 1.391024
## 10 2.375508 0.4549352 1.400848
## 11 2.441100 0.4528699 1.419688
## 12 2.484580 0.4490261 1.432653
## 13 2.484220 0.4533374 1.433991
## 14 2.493490 0.4543693 1.437662
## 15 2.468935 0.4584692 1.428240
## 16 2.442578 0.4614583 1.420099
## 17 2.437957 0.4630998 1.418461
## 18 2.442169 0.4634011 1.419379
## 19 2.446810 0.4635929 1.420389
## 20 2.449727 0.4643199 1.421220
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
plot(pls_tunned)
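The plot shows the cross-validated RMSE for each number of components. RMSE is lowest at ncomp = 1 (1.56), which is the value train selected. In terms of R squared, ncomp = 20 has the highest cross-validated value (0.464), but the gain over ncomp = 3 (0.459) is negligible and the RMSE is considerably worse (2.45 versus 1.73).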
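The rule that train reports, choosing the smallest RMSE, can be reproduced from the resampling table. The vector below is copied from the printed output above; cv_rmse is just an illustrative name.

```r
# Cross-validated RMSE for ncomp = 1..20, copied from the train output above
cv_rmse <- c(1.560007, 1.882913, 1.727661, 2.191698, 2.336186,
             2.339050, 2.378060, 2.351002, 2.335555, 2.375508,
             2.441100, 2.484580, 2.484220, 2.493490, 2.468935,
             2.442578, 2.437957, 2.442169, 2.446810, 2.449727)

# Number of components with the smallest RMSE
best_ncomp <- which.min(cv_rmse)
print(best_ncomp)

# The three best component counts by RMSE, best first
top3 <- head(order(cv_rmse), 3)
print(top3)
```

On the fitted object itself, the same table is stored in pls_tunned$results and the selected value in pls_tunned$bestTune.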
pred_pls <- predict(pls_tunned, newdata=X_test)
predResult <- postResample(pred=pred_pls, obs=y_test)
predResult
## RMSE Rsquared MAE
## 1.4657792 0.3327395 1.1385856
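The test-set RMSE (1.47) is close to the cross-validated RMSE on the training data (1.56), while the test-set R squared (0.33) is somewhat lower than the cross-validated value (0.39). Overall, performance on the held-out data is similar to what resampling suggested.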
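The three numbers reported by postResample are simple to compute by hand. The sketch below uses made-up predictions; regression_metrics is an illustrative helper, and computing R squared as the squared correlation between predictions and observations reflects my understanding of caret's convention.

```r
# RMSE, R squared (squared correlation) and MAE for a set of predictions.
# `regression_metrics` is an illustrative helper, not a caret function.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(obs - pred)))
}

obs  <- c(40, 41, 39, 42)          # made-up yields
pred <- c(40.5, 40.5, 39.5, 41)    # made-up predictions
m <- regression_metrics(pred, obs)
print(m)
```

Because RMSE squares the errors before averaging, the single error of 1.0 in the toy data weighs more heavily in RMSE than in MAE.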
pls_tunned$finalModel$coefficients
## , , 1 comps
##
## .outcome
## BiologicalMaterial03 0.247356455
## BiologicalMaterial05 0.088394378
## BiologicalMaterial07 -0.063928707
## BiologicalMaterial09 0.040444529
## BiologicalMaterial10 0.133123649
## ManufacturingProcess01 -0.059898507
## ManufacturingProcess02 -0.133579881
## ManufacturingProcess03 -0.045630016
## ManufacturingProcess04 -0.152606990
## ManufacturingProcess05 0.050681758
## ManufacturingProcess06 0.220434418
## ManufacturingProcess07 -0.027636631
## ManufacturingProcess08 0.008692913
## ManufacturingProcess10 0.098791122
## ManufacturingProcess11 0.163991396
## ManufacturingProcess12 0.164484979
## ManufacturingProcess16 -0.016371809
## ManufacturingProcess17 -0.208290130
## ManufacturingProcess19 0.104785161
## ManufacturingProcess20 -0.030279376
## ManufacturingProcess21 -0.006562046
## ManufacturingProcess22 0.004510278
## ManufacturingProcess23 -0.042809642
## ManufacturingProcess24 -0.106829029
## ManufacturingProcess25 0.003305285
## ManufacturingProcess28 0.160523715
## ManufacturingProcess34 0.088526155
## ManufacturingProcess35 -0.086230402
## ManufacturingProcess36 -0.278795312
## ManufacturingProcess37 -0.089267175
## ManufacturingProcess38 -0.062296854
## ManufacturingProcess39 0.005792439
## ManufacturingProcess41 -0.019524610
## ManufacturingProcess43 0.122890032
## ManufacturingProcess45 -0.005554439
key_features <- varImp(pls_tunned)
plot(key_features, top=20)
ManufacturingProcess36 is the most important predictor, followed by BiologicalMaterial03. Overall, the importance ranking is dominated by manufacturing process predictors.
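Since the final model has a single component and the predictors were centered and scaled, ranking the printed coefficients by absolute size gives a quick cross-check of the importance plot. The vector below is a subset of the coefficients copied from the output above (rounded to four decimals); I am assuming, not asserting, that this ordering matches what varImp reports.

```r
# Subset of the one-component PLS coefficients, copied from the output above
coefs <- c(BiologicalMaterial03   =  0.2474,
           ManufacturingProcess06 =  0.2204,
           ManufacturingProcess08 =  0.0087,
           ManufacturingProcess11 =  0.1640,
           ManufacturingProcess12 =  0.1645,
           ManufacturingProcess17 = -0.2083,
           ManufacturingProcess19 =  0.1048,
           ManufacturingProcess36 = -0.2788)

# Order by absolute size, largest first, keeping the sign for interpretation
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
print(ranked)
```

Keeping the sign alongside the magnitude is useful here: a negative coefficient means that higher values of that predictor go with lower predicted yield.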
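We cannot change the biological measurements, but we can use them to assess the quality of the raw material before it goes into the process. Given the importance of BiologicalMaterial03, we could explore screening or selecting raw material on that measurement; its positive coefficient suggests that higher values go with higher yield. ManufacturingProcess36 is the most important predictor, and its negative coefficient suggests that lower values go with higher yield. I suggest using experimental design to compare that particular process predictor with the other manufacturing process predictors. We also want to understand why a predictor such as ManufacturingProcess19 is not as important as ManufacturingProcess36.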