1 QUALITY PREDICTION OF A MANUFACTURING PROCESS OUTPUT

1.1 Overview

This is part of LBB (Learning by Building) for Data Science course. In this document, we will leverage some statistical techniques and machine learning model (focusing on regression model) to predict quality rating from manufacturing output. The source data set is using real-world data collected from a manufacturing process which consists of Temperature, Pressure, Temperature x Pressure, Material Fusion Metric, Material Transformation Metric as well as Quality Rating. By understanding the relationships between process parameters (temperature, pressure) and product quality, manufacturers can optimize their processes for higher-quality output.

1.2 Preparation

In this section, we will read the data and prepare all the required libraries.

#load the required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.1
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
#read data and inspect
manuf <- read.csv("data_input/manufacturing_for_polynomyal_reg.csv")
summary(manuf)
##  Temperature...C. Pressure..kPa.   Temperature.x.Pressure
##  Min.   :100.0    Min.   : 5.003   Min.   : 513.7        
##  1st Qu.:150.9    1st Qu.: 9.693   1st Qu.:1798.2        
##  Median :198.6    Median :14.833   Median :2678.3        
##  Mean   :200.0    Mean   :14.816   Mean   :2955.3        
##  3rd Qu.:251.4    3rd Qu.:19.750   3rd Qu.:3929.1        
##  Max.   :300.0    Max.   :24.999   Max.   :7365.0        
##  Material.Fusion.Metric Material.Transformation.Metric Quality.Rating  
##  Min.   : 10157         Min.   :  999946               Min.   :  1.00  
##  1st Qu.: 27627         1st Qu.: 3433810               1st Qu.: 99.94  
##  Median : 44611         Median : 7833390               Median :100.00  
##  Mean   : 48127         Mean   :10036454               Mean   : 96.26  
##  3rd Qu.: 67805         3rd Qu.:15882514               3rd Qu.:100.00  
##  Max.   :103756         Max.   :26997826               Max.   :100.00
glimpse(manuf)
## Rows: 3,957
## Columns: 6
## $ Temperature...C.               <dbl> 209.7627, 243.0379, 220.5527, 208.9766,…
## $ Pressure..kPa.                 <dbl> 8.050855, 15.812068, 7.843130, 23.78608…
## $ Temperature.x.Pressure         <dbl> 1688.769, 3842.931, 1729.823, 4970.737,…
## $ Material.Fusion.Metric         <dbl> 44522.22, 63020.76, 49125.95, 57128.88,…
## $ Material.Transformation.Metric <dbl> 9229576, 14355367, 10728389, 9125702, 6…
## $ Quality.Rating                 <dbl> 99.99997, 99.98570, 99.99976, 99.99997,…

Before going to feature selection, it is best practice to inspect the data type and change it as necessary. However, our data are all numerik and the data types are already correct.

1.3 Checking correlations

In this step, we will check correlations between target and each feature to determine which one goes to our model. >> +1 value means strong positive correlation >> -1 value means strong negative correlation

ggcorr(data = manuf, label = T, hjust = 1, layout.exp = 3)

Based on the correlation matrix, Material Transformation Metric have the strongest negative correlation with our target, followed by Material Fusion Metric and Temperature. Though, the correlation is not really strong. Let’s make a plot to visualize the relationship.

# Scatter plot of Temperature vs Quality Rating
ggplot(manuf, aes(x = Quality.Rating, y = Temperature...C.)) +
  geom_point() +
  labs(title = "Scatter Plot of Temperature vs Quality Rating", x = "Quality Rating", y = "Temperature (C)")

# Scatter plot of Material Transformation Metric vs Quality Rating
ggplot(manuf, aes(x = Material.Transformation.Metric, y = Quality.Rating)) +
  geom_point() +
  labs(title = "Material Transformation Metric vs Quality Rating", x = "Material Transformation Metric", y = "Quality Rating")

1.4 Feature selection

Based on the plot, we can see the correlation is a non-linear. Let’s try to build a few models based on this. 1. Linear regression model with all features 2. Linear regression with only Material Transformation Metric and Temperature 3. Polynomial regression model (7 degree for temperature)

#Model with all features
model_manuf_all <- lm(Quality.Rating ~ .,
   manuf)


#Model with 2 strongest correlated features
model_manuf_line <- lm(formula = Quality.Rating ~ Material.Transformation.Metric + Temperature...C.,
                data = manuf)

#Polynomial model with 7 degree for Temperature
model_manuf_poly <- lm(Quality.Rating ~ poly(Temperature...C., 7) + Material.Transformation.Metric, data = manuf)

#summary of all the models
summary(model_manuf_all)
## 
## Call:
## lm(formula = Quality.Rating ~ ., data = manuf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.876  -3.557  -0.606   4.744  14.719 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7.142e+01  2.699e+00  26.460  < 2e-16 ***
## Temperature...C.                2.435e-01  2.407e-02  10.118  < 2e-16 ***
## Pressure..kPa.                 -5.946e-01  1.201e-01  -4.949 7.77e-07 ***
## Temperature.x.Pressure          5.206e-04  4.363e-04   1.193    0.233    
## Material.Fusion.Metric          6.947e-04  1.057e-04   6.572 5.61e-11 ***
## Material.Transformation.Metric -4.986e-06  1.908e-07 -26.126  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.138 on 3951 degrees of freedom
## Multiple R-squared:  0.5059, Adjusted R-squared:  0.5053 
## F-statistic: 809.2 on 5 and 3951 DF,  p-value: < 2.2e-16
summary(model_manuf_line)
## 
## Call:
## lm(formula = Quality.Rating ~ Material.Transformation.Metric + 
##     Temperature...C., data = manuf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.517  -3.907  -0.721   4.894  13.769 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.777e+01  1.345e+00   42.95   <2e-16 ***
## Material.Transformation.Metric -3.849e-06  8.067e-08  -47.71   <2e-16 ***
## Temperature...C.                3.855e-01  1.055e-02   36.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.186 on 3954 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.5001 
## F-statistic:  1980 on 2 and 3954 DF,  p-value: < 2.2e-16
summary(model_manuf_poly)
## 
## Call:
## lm(formula = Quality.Rating ~ poly(Temperature...C., 7) + Material.Transformation.Metric, 
##     data = manuf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5545  -0.9062   0.0650   0.8917   3.3055 
## 
## Coefficients:
##                                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                     9.243e+02  1.260e+03    0.734    0.463    
## poly(Temperature...C., 7)1      3.792e+04  5.827e+04    0.651    0.515    
## poly(Temperature...C., 7)2      8.953e+03  1.424e+04    0.629    0.530    
## poly(Temperature...C., 7)3      4.121e+02  1.208e+03    0.341    0.733    
## poly(Temperature...C., 7)4     -3.169e+02  1.372e+00 -230.915   <2e-16 ***
## poly(Temperature...C., 7)5     -2.452e+02  1.372e+00 -178.675   <2e-16 ***
## poly(Temperature...C., 7)6     -1.782e+02  1.372e+00 -129.864   <2e-16 ***
## poly(Temperature...C., 7)7     -1.180e+02  1.373e+00  -85.937   <2e-16 ***
## Material.Transformation.Metric -8.250e-05  1.255e-04   -0.657    0.511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.372 on 3948 degrees of freedom
## Multiple R-squared:  0.9889, Adjusted R-squared:  0.9888 
## F-statistic: 4.384e+04 on 8 and 3948 DF,  p-value: < 2.2e-16

Lets make comparison of metrics from all 3 models

# Compare Adjusted R-squared
adj_r2_all <- summary(model_manuf_all)$adj.r.squared
adj_r2_line <- summary(model_manuf_line)$adj.r.squared
adj_r2_poly <- summary(model_manuf_poly)$adj.r.squared

# Compare AIC
aic_all <- AIC(model_manuf_all)
aic_line <- AIC(model_manuf_line)
aic_poly <- AIC(model_manuf_poly)


comparison <- data.frame(
  Model = c("All Features", "2 Strongest Correlated Features", "Polynomial Model"),
  Adjusted_R2 = c(adj_r2_all, adj_r2_line, adj_r2_poly),
  AIC = c(aic_all, aic_line, aic_poly))

comparison
##                             Model Adjusted_R2      AIC
## 1                    All Features   0.5052994 28746.83
## 2 2 Strongest Correlated Features   0.5001337 28784.93
## 3                Polynomial Model   0.9888456 13744.40

Takeaways from the comparison : - Polynomial model have the highest Adjusted Rsquare, lowest AIC, possibly better model

1.5 Prediction

Now we will try to make prediction from all three models, however since we don’t have any new data, let’s predict using the existing manuf data and return back the prediction to the dataset.

#predictions
manuf$predictions_all <- predict(model_manuf_all, newdata = manuf)
manuf$predictions_line <- predict(model_manuf_line, newdata = manuf)
manuf$predictions_poly <- predict(model_manuf_poly, newdata = manuf)

head(manuf)
##   Temperature...C. Pressure..kPa. Temperature.x.Pressure Material.Fusion.Metric
## 1         209.7627       8.050855               1688.769               44522.22
## 2         243.0379      15.812068               3842.931               63020.76
## 3         220.5527       7.843130               1729.823               49125.95
## 4         208.9766      23.786089               4970.737               57128.88
## 5         184.7310      15.797812               2918.345               38068.20
## 6         229.1788       8.498306               1947.632               53136.69
##   Material.Transformation.Metric Quality.Rating predictions_all
## 1                        9229576       99.99997       103.51076
## 2                       14355367       99.98570        95.41645
## 3                       10728389       99.99976       102.00896
## 4                        9125702       99.99997       104.94808
## 5                        6303792      100.00000       103.55201
## 6                       12037072       99.99879       100.09510
##   predictions_line predictions_poly
## 1        103.11612        101.15352
## 2         96.21670         98.43024
## 3        101.50739        100.47024
## 4        103.21285        101.20953
## 5        104.72629         99.79855
## 6         99.79621         99.50022

1.6 Model evaluation

Let’s evaluate the model one more time using RMSE and MAPE

#calculating Mean Absolute Error (MAE)
mae_all <-  MAE(manuf$predictions_all, manuf$Quality.Rating)
mae_line <- MAE(manuf$predictions_line, manuf$Quality.Rating)
mae_poly <- MAE(manuf$predictions_poly, manuf$Quality.Rating)

#calculating Mean Absolute Percentage Error (MAPE)
mape_all <-  MAPE(manuf$predictions_all, manuf$Quality.Rating)
mape_line <- MAPE(manuf$predictions_line, manuf$Quality.Rating)
mape_poly <- MAPE(manuf$predictions_poly, manuf$Quality.Rating)

#comparison table
comparison <- data.frame(
  Model = c("All Features", "2 Strongest Correlated Features", "Polynomial Model"),
  Adjusted_R2 = c(adj_r2_all, adj_r2_line, adj_r2_poly),
  AIC = c(aic_all, aic_line, aic_poly),
  MAE = c(mae_all, mae_line, mae_poly),
  MAPE = c(mape_all, mape_line, mape_poly))

comparison
##                             Model Adjusted_R2      AIC      MAE       MAPE
## 1                    All Features   0.5052994 28746.83 5.682157 0.17429750
## 2 2 Strongest Correlated Features   0.5001337 28784.93 5.748292 0.17538334
## 3                Polynomial Model   0.9888456 13744.40 1.022557 0.02649997

Based on the comparison we can see that Polynomial Model is better compared to the other two models since it has the highest adj.R-square (means better accuracy) and lowest MAE and MAPE means very little error on the prediction result.