This is part of LBB (Learning by Building) for Data Science course. In this document, we will leverage some statistical techniques and machine learning model (focusing on regression model) to predict quality rating from manufacturing output. The source data set is using real-world data collected from a manufacturing process which consists of Temperature, Pressure, Temperature x Pressure, Material Fusion Metric, Material Transformation Metric as well as Quality Rating. By understanding the relationships between process parameters (temperature, pressure) and product quality, manufacturers can optimize their processes for higher-quality output.
In this section, we will read the data and prepare all the required libraries.
#load the required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.1
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
#read data and inspect
manuf <- read.csv("data_input/manufacturing_for_polynomyal_reg.csv")
summary(manuf)
## Temperature...C. Pressure..kPa. Temperature.x.Pressure
## Min. :100.0 Min. : 5.003 Min. : 513.7
## 1st Qu.:150.9 1st Qu.: 9.693 1st Qu.:1798.2
## Median :198.6 Median :14.833 Median :2678.3
## Mean :200.0 Mean :14.816 Mean :2955.3
## 3rd Qu.:251.4 3rd Qu.:19.750 3rd Qu.:3929.1
## Max. :300.0 Max. :24.999 Max. :7365.0
## Material.Fusion.Metric Material.Transformation.Metric Quality.Rating
## Min. : 10157 Min. : 999946 Min. : 1.00
## 1st Qu.: 27627 1st Qu.: 3433810 1st Qu.: 99.94
## Median : 44611 Median : 7833390 Median :100.00
## Mean : 48127 Mean :10036454 Mean : 96.26
## 3rd Qu.: 67805 3rd Qu.:15882514 3rd Qu.:100.00
## Max. :103756 Max. :26997826 Max. :100.00
glimpse(manuf)
## Rows: 3,957
## Columns: 6
## $ Temperature...C. <dbl> 209.7627, 243.0379, 220.5527, 208.9766,…
## $ Pressure..kPa. <dbl> 8.050855, 15.812068, 7.843130, 23.78608…
## $ Temperature.x.Pressure <dbl> 1688.769, 3842.931, 1729.823, 4970.737,…
## $ Material.Fusion.Metric <dbl> 44522.22, 63020.76, 49125.95, 57128.88,…
## $ Material.Transformation.Metric <dbl> 9229576, 14355367, 10728389, 9125702, 6…
## $ Quality.Rating <dbl> 99.99997, 99.98570, 99.99976, 99.99997,…
Before going to feature selection, it is best practice to inspect the data type and change it as necessary. However, our data are all numerik and the data types are already correct.
In this step, we will check correlations between target and each feature to determine which one goes to our model. >> +1 value means strong positive correlation >> -1 value means strong negative correlation
ggcorr(data = manuf, label = T, hjust = 1, layout.exp = 3)
Based on the correlation matrix,
Material Transformation Metric have the strongest negative
correlation with our target, followed by
Material Fusion Metric and Temperature.
Though, the correlation is not really strong. Let’s make a plot to
visualize the relationship.
# Scatter plot of Temperature vs Quality Rating
ggplot(manuf, aes(x = Quality.Rating, y = Temperature...C.)) +
geom_point() +
labs(title = "Scatter Plot of Temperature vs Quality Rating", x = "Quality Rating", y = "Temperature (C)")
# Scatter plot of Material Transformation Metric vs Quality Rating
ggplot(manuf, aes(x = Material.Transformation.Metric, y = Quality.Rating)) +
geom_point() +
labs(title = "Material Transformation Metric vs Quality Rating", x = "Material Transformation Metric", y = "Quality Rating")
Based on the plot, we can see the correlation is a non-linear. Let’s
try to build a few models based on this. 1. Linear regression model with
all features 2. Linear regression with only
Material Transformation Metric and Temperature
3. Polynomial regression model (7 degree for temperature)
#Model with all features
model_manuf_all <- lm(Quality.Rating ~ .,
manuf)
#Model with 2 strongest correlated features
model_manuf_line <- lm(formula = Quality.Rating ~ Material.Transformation.Metric + Temperature...C.,
data = manuf)
#Polynomial model with 7 degree for Temperature
model_manuf_poly <- lm(Quality.Rating ~ poly(Temperature...C., 7) + Material.Transformation.Metric, data = manuf)
#summary of all the models
summary(model_manuf_all)
##
## Call:
## lm(formula = Quality.Rating ~ ., data = manuf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.876 -3.557 -0.606 4.744 14.719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.142e+01 2.699e+00 26.460 < 2e-16 ***
## Temperature...C. 2.435e-01 2.407e-02 10.118 < 2e-16 ***
## Pressure..kPa. -5.946e-01 1.201e-01 -4.949 7.77e-07 ***
## Temperature.x.Pressure 5.206e-04 4.363e-04 1.193 0.233
## Material.Fusion.Metric 6.947e-04 1.057e-04 6.572 5.61e-11 ***
## Material.Transformation.Metric -4.986e-06 1.908e-07 -26.126 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.138 on 3951 degrees of freedom
## Multiple R-squared: 0.5059, Adjusted R-squared: 0.5053
## F-statistic: 809.2 on 5 and 3951 DF, p-value: < 2.2e-16
summary(model_manuf_line)
##
## Call:
## lm(formula = Quality.Rating ~ Material.Transformation.Metric +
## Temperature...C., data = manuf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.517 -3.907 -0.721 4.894 13.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.777e+01 1.345e+00 42.95 <2e-16 ***
## Material.Transformation.Metric -3.849e-06 8.067e-08 -47.71 <2e-16 ***
## Temperature...C. 3.855e-01 1.055e-02 36.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.186 on 3954 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.5001
## F-statistic: 1980 on 2 and 3954 DF, p-value: < 2.2e-16
summary(model_manuf_poly)
##
## Call:
## lm(formula = Quality.Rating ~ poly(Temperature...C., 7) + Material.Transformation.Metric,
## data = manuf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5545 -0.9062 0.0650 0.8917 3.3055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.243e+02 1.260e+03 0.734 0.463
## poly(Temperature...C., 7)1 3.792e+04 5.827e+04 0.651 0.515
## poly(Temperature...C., 7)2 8.953e+03 1.424e+04 0.629 0.530
## poly(Temperature...C., 7)3 4.121e+02 1.208e+03 0.341 0.733
## poly(Temperature...C., 7)4 -3.169e+02 1.372e+00 -230.915 <2e-16 ***
## poly(Temperature...C., 7)5 -2.452e+02 1.372e+00 -178.675 <2e-16 ***
## poly(Temperature...C., 7)6 -1.782e+02 1.372e+00 -129.864 <2e-16 ***
## poly(Temperature...C., 7)7 -1.180e+02 1.373e+00 -85.937 <2e-16 ***
## Material.Transformation.Metric -8.250e-05 1.255e-04 -0.657 0.511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.372 on 3948 degrees of freedom
## Multiple R-squared: 0.9889, Adjusted R-squared: 0.9888
## F-statistic: 4.384e+04 on 8 and 3948 DF, p-value: < 2.2e-16
Lets make comparison of metrics from all 3 models
# Compare Adjusted R-squared
adj_r2_all <- summary(model_manuf_all)$adj.r.squared
adj_r2_line <- summary(model_manuf_line)$adj.r.squared
adj_r2_poly <- summary(model_manuf_poly)$adj.r.squared
# Compare AIC
aic_all <- AIC(model_manuf_all)
aic_line <- AIC(model_manuf_line)
aic_poly <- AIC(model_manuf_poly)
comparison <- data.frame(
Model = c("All Features", "2 Strongest Correlated Features", "Polynomial Model"),
Adjusted_R2 = c(adj_r2_all, adj_r2_line, adj_r2_poly),
AIC = c(aic_all, aic_line, aic_poly))
comparison
## Model Adjusted_R2 AIC
## 1 All Features 0.5052994 28746.83
## 2 2 Strongest Correlated Features 0.5001337 28784.93
## 3 Polynomial Model 0.9888456 13744.40
Takeaways from the comparison : - Polynomial model have the highest Adjusted Rsquare, lowest AIC, possibly better model
Now we will try to make prediction from all three models, however
since we don’t have any new data, let’s predict using the existing
manuf data and return back the prediction to the
dataset.
#predictions
manuf$predictions_all <- predict(model_manuf_all, newdata = manuf)
manuf$predictions_line <- predict(model_manuf_line, newdata = manuf)
manuf$predictions_poly <- predict(model_manuf_poly, newdata = manuf)
head(manuf)
## Temperature...C. Pressure..kPa. Temperature.x.Pressure Material.Fusion.Metric
## 1 209.7627 8.050855 1688.769 44522.22
## 2 243.0379 15.812068 3842.931 63020.76
## 3 220.5527 7.843130 1729.823 49125.95
## 4 208.9766 23.786089 4970.737 57128.88
## 5 184.7310 15.797812 2918.345 38068.20
## 6 229.1788 8.498306 1947.632 53136.69
## Material.Transformation.Metric Quality.Rating predictions_all
## 1 9229576 99.99997 103.51076
## 2 14355367 99.98570 95.41645
## 3 10728389 99.99976 102.00896
## 4 9125702 99.99997 104.94808
## 5 6303792 100.00000 103.55201
## 6 12037072 99.99879 100.09510
## predictions_line predictions_poly
## 1 103.11612 101.15352
## 2 96.21670 98.43024
## 3 101.50739 100.47024
## 4 103.21285 101.20953
## 5 104.72629 99.79855
## 6 99.79621 99.50022
Let’s evaluate the model one more time using RMSE and MAPE
#calculating Mean Absolute Error (MAE)
mae_all <- MAE(manuf$predictions_all, manuf$Quality.Rating)
mae_line <- MAE(manuf$predictions_line, manuf$Quality.Rating)
mae_poly <- MAE(manuf$predictions_poly, manuf$Quality.Rating)
#calculating Mean Absolute Percentage Error (MAPE)
mape_all <- MAPE(manuf$predictions_all, manuf$Quality.Rating)
mape_line <- MAPE(manuf$predictions_line, manuf$Quality.Rating)
mape_poly <- MAPE(manuf$predictions_poly, manuf$Quality.Rating)
#comparison table
comparison <- data.frame(
Model = c("All Features", "2 Strongest Correlated Features", "Polynomial Model"),
Adjusted_R2 = c(adj_r2_all, adj_r2_line, adj_r2_poly),
AIC = c(aic_all, aic_line, aic_poly),
MAE = c(mae_all, mae_line, mae_poly),
MAPE = c(mape_all, mape_line, mape_poly))
comparison
## Model Adjusted_R2 AIC MAE MAPE
## 1 All Features 0.5052994 28746.83 5.682157 0.17429750
## 2 2 Strongest Correlated Features 0.5001337 28784.93 5.748292 0.17538334
## 3 Polynomial Model 0.9888456 13744.40 1.022557 0.02649997
Based on the comparison we can see that Polynomial Model is better compared to the other two models since it has the highest adj.R-square (means better accuracy) and lowest MAE and MAPE means very little error on the prediction result.