library(tidyverse)
library(caret)
library(GGally)
library(lmtest)
library(car)
library(rsample)
library(class)
library(lime)
library(ggplot2)
library(reshape2)
A was a product development assistant at a company focusing on concrete making. To support her team, A was assigned to build a prediction model to find out whether a certain mixture would be of good quality.
A will use multiple prediction approaches to generate the best model.
Before creating the model, A needed to ensure that her data was ready.
Her data consisted of the following variables:
id: Id of each cement mixture
cement: The amount of cement (kg) in a cubic metre of mixture
slag: The amount of blast furnace slag (kg) in a cubic metre of mixture
flyash: The amount of fly ash (kg) in a cubic metre of mixture
water: The amount of water (kg) in a cubic metre of mixture
super_plast: The amount of superplasticizer (kg) in a cubic metre of mixture
coarse_agg: The amount of coarse aggregate (kg) in a cubic metre of mixture
fine_agg: The amount of fine aggregate (kg) in a cubic metre of mixture
age: the number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in MPa unit
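The missing-value check that produced the output below is not shown; a minimal sketch, assuming the raw data is stored in a data frame named concrete (the object name is an assumption):
anyNA(concrete)            # TRUE if any cell is missing
colSums(is.na(concrete))   # missing-value count per column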
#> [1] FALSE
#> id cement slag flyash water super_plast
#> 0 0 0 0 0 0
#> coarse_agg fine_agg age strength
#> 0 0 0 0
A identified no missing values in her dataset, hence she continued to data exploration.
Before starting, A needed to understand how her data was distributed. She ran the summary() function to extract key statistical information from her dataset. To help her analysis, she also plotted the variables as histograms and boxplots for a rough picture of each variable's distribution.
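The exploration chunk itself is not shown; a minimal sketch, assuming the data frame is named concrete and the long-format object is named melt (to match the boxplot code used further below):
summary(concrete)
melt <- reshape2::melt(concrete, id.vars = "id")   # reshape to long format for faceted plots
ggplot(melt, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, show.legend = FALSE) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(title = "Concrete Variable Distribution")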
#> id cement slag flyash
#> S1 : 1 Min. :102.0 Min. : 0.00 Min. : 0.00
#> S10 : 1 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
#> S100 : 1 Median :275.1 Median : 20.00 Median : 0.00
#> S101 : 1 Mean :280.9 Mean : 73.18 Mean : 54.03
#> S102 : 1 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
#> S103 : 1 Max. :540.0 Max. :359.40 Max. :200.10
#> (Other):819
#> water super_plast coarse_agg fine_agg
#> Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
#> 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
#> Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
#> Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
#> 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
#> Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
#>
#> age strength
#> Min. : 1.00 Min. : 2.33
#> 1st Qu.: 7.00 1st Qu.:23.64
#> Median : 28.00 Median :34.57
#> Mean : 45.14 Mean :35.79
#> 3rd Qu.: 56.00 3rd Qu.:45.94
#> Max. :365.00 Max. :82.60
#>
ggplot(data = melt, mapping = aes(x = value, fill = variable)) +
  geom_boxplot(fill = "#CC0000", color = "orange") +
  facet_wrap(variable ~ ., ncol = 3) +
  labs(title = "Concrete Variable Variance") +
  coord_flip()
From the boxplot set above, A did not find any material outliers, since the extreme points for each variable were few compared to the total number of observations. With this consideration, she included all observations in her modelling process.
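The variance figures below were printed one variable at a time with var(); a compact equivalent (a sketch, assuming the data frame name concrete) computes them for every numeric column at once:
sapply(concrete %>% select(-id), var)   # per-column variance, id excluded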
#> [1] 10851.54
#> [1] 7414.7
#> [1] 4081.476
#> [1] 447.7841
#> [1] 35.69299
#> [1] 5881.473
#> [1] 6565.186
#> [1] 3850.662
#> [1] 279.3822
From the analysis above, A identified that the variables vary widely, with the smallest variance belonging to super_plast (35.69) and the largest to cement (10,851.54). This wide range of variances implies that the variables carry different weights, which could distort the modelling process: one variable may overpower the others simply because it is measured in larger units.
To prevent this, A scaled her variables. Since the variables did not share an exact maximum range, she used z-score standardization. She then assigned the scaled data to the concrete_scale object.
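A minimal sketch of the standardization step, assuming the raw data frame is named concrete; scale() centres each column on its mean and divides by its standard deviation:
concrete_scale <- concrete %>%
  select(-id) %>%
  scale() %>%
  as.data.frame()
summary(concrete_scale)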
#> cement slag flyash water
#> Min. :-1.71772 Min. :-0.8498 Min. :-0.8458 Min. :-2.8030
#> 1st Qu.:-0.82783 1st Qu.:-0.8498 1st Qu.:-0.8458 1st Qu.:-0.7663
#> Median :-0.05602 Median :-0.6176 Median :-0.8458 Median : 0.1363
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.66299 3rd Qu.: 0.7911 3rd Qu.: 1.0044 3rd Qu.: 0.5144
#> Max. : 2.48692 Max. : 3.3240 Max. : 2.2863 Max. : 3.1135
#> super_plast coarse_agg fine_agg age
#> Min. :-1.04874 Min. :-2.24044 Min. :-2.24166 Min. :-0.7114
#> 1st Qu.:-1.04874 1st Qu.:-0.53228 1st Qu.:-0.51382 1st Qu.:-0.6147
#> Median : 0.03924 Median :-0.06287 Median : 0.05514 Median :-0.2763
#> Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.64181 3rd Qu.: 0.72471 3rd Qu.: 0.63150 3rd Qu.: 0.1750
#> Max. : 4.34095 Max. : 2.24510 Max. : 2.67776 Max. : 5.1545
#> strength
#> Min. :-2.00171
#> 1st Qu.:-0.72678
#> Median :-0.07287
#> Mean : 0.00000
#> 3rd Qu.: 0.60737
#> Max. : 2.80064
From the summary above, her variables were already within a similar range and ready to be used for modelling.
From the plot above, A found that the strength of a mixture is most strongly correlated with the amount of cement and the amount of superplasticizer per cubic metre.
Some variables did have a small correlation coefficient (closer to zero). However, based on her domain knowledge, A believed that all of her variables were important in determining mixture strength. Since correlation is not always equal to causation anyway, she kept all variables despite the low correlation scores.
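The pairwise tests below come from cor.test(); the calls are a sketch, and ggcorr() is one assumed way the correlation plot referred to above may have been drawn (GGally is loaded at the top):
ggcorr(concrete_scale, label = TRUE)                     # assumption: overview correlation plot
cor.test(concrete_scale$strength, concrete_scale$age)    # first of the Pearson tests printed below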
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$age
#> t = 10.43, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2799728 0.4006103
#> sample estimates:
#> cor
#> 0.3416984
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$fine_agg
#> t = -5.452, df = 823, p-value = 0.00000006586
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.2517513 -0.1199778
#> sample estimates:
#> cor
#> -0.1867042
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$coarse_agg
#> t = -3.929, df = 823, p-value = 0.00009244
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.20207473 -0.06806607
#> sample estimates:
#> cor
#> -0.135691
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$super_plast
#> t = 11.127, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3007983 0.4195290
#> sample estimates:
#> cor
#> 0.3616289
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$water
#> t = -8.4715, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.3447994 -0.2191910
#> sample estimates:
#> cor
#> -0.2832092
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$flyash
#> t = -2.9894, df = 823, p-value = 0.002879
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.17068925 -0.03563822
#> sample estimates:
#> cor
#> -0.1036414
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$slag
#> t = 3.7032, df = 823, p-value = 0.0002271
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.06029336 0.19457687
#> sample estimates:
#> cor
#> 0.1280218
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$cement
#> t = 16.387, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4427419 0.5457857
#> sample estimates:
#> cor
#> 0.4960081
All predictors had a statistically significant correlation with the mixture strength.
Significantly and positively correlated with the mixture strength (adding more of these components is statistically associated with a stronger mixture):
The number of resting days before the compressive strength measurement - age
Superplasticizer - super_plast
Blast furnace slag - slag
Cement - cement
Meanwhile, A would need to add less of the components below to enhance the mixture's robustness (significantly and negatively correlated):
Fly ash - flyash
Water - water
Coarse aggregate - coarse_agg
Fine aggregate - fine_agg
Since the linearity assumption was met, A predicted the mixture strength using a linear regression model.
To avoid overfitting, A split her data into 85% training data to generate the model and the remaining 15% to test it.
The linear regression model was generated with the lm() function.
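A sketch of the split and the fit: concrete_train and concrete_rm1 match names referenced elsewhere in the text, while the index name idx, concrete_test and the seed are assumptions:
set.seed(921)
idx <- initial_split(concrete_scale, prop = 0.85, strata = "strength")
concrete_train <- training(idx)
concrete_test <- testing(idx)
concrete_rm1 <- lm(strength ~ ., data = concrete_train)   # regression on all predictors
summary(concrete_rm1)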
#>
#> Call:
#> lm(formula = strength ~ ., data = concrete_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.7342 -0.3855 0.0429 0.4347 1.9508
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.000957 0.023355 -0.041 0.96733
#> cement 0.731969 0.064524 11.344 < 0.0000000000000002 ***
#> slag 0.500705 0.062495 8.012 0.00000000000000477 ***
#> flyash 0.327664 0.058368 5.614 0.00000002861842464 ***
#> water -0.179943 0.060390 -2.980 0.00299 **
#> super_plast 0.111408 0.039421 2.826 0.00485 **
#> coarse_agg 0.084681 0.051561 1.642 0.10097
#> fine_agg 0.087791 0.061728 1.422 0.15541
#> age 0.469512 0.025564 18.366 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6193 on 695 degrees of freedom
#> Multiple R-squared: 0.624, Adjusted R-squared: 0.6197
#> F-statistic: 144.2 on 8 and 695 DF, p-value: < 0.00000000000000022
From the summary above, the adjusted R-squared score is 0.6197, meaning the model is able to explain only 61.97% of the variance in the target variable.
Next, A ran step-wise feature selection to check whether a reduced model would give a better AIC, starting from the furthest end of the model (the full set of predictors):
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.000957 0.731969 0.500705 0.327664 -0.179943 0.111408
#> coarse_agg fine_agg age
#> 0.084681 0.087791 0.469512
Using this method, no variables were omitted.
#>
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water +
#> flyash, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement age super_plast slag water
#> -0.001107 0.647274 0.469854 0.091880 0.417023 -0.262098
#> flyash
#> 0.258126
Using this method, two variables were removed: fine_agg and coarse_agg. A assigned this new model, which had the better AIC score, to the concrete_rm3 object.
#>
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water +
#> flyash, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement age super_plast slag water
#> -0.001107 0.647274 0.469854 0.091880 0.417023 -0.262098
#> flyash
#> 0.258126
This method eliminated the same two variables as the forward-selection method.
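A sketch of how the three step-wise searches could be run with step(); apart from concrete_rm3 (referenced above), the object names and the mapping of directions to outputs are assumptions:
concrete_none <- lm(strength ~ 1, data = concrete_train)   # intercept-only lower bound
concrete_all <- lm(strength ~ ., data = concrete_train)    # full-model upper bound
step(concrete_all, direction = "backward", trace = 0)
concrete_rm3 <- step(concrete_none, direction = "forward",
                     scope = list(lower = formula(concrete_none), upper = formula(concrete_all)),
                     trace = 0)
step(concrete_none, direction = "both",
     scope = list(lower = formula(concrete_none), upper = formula(concrete_all)),
     trace = 0)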
Adjusted R-squared for concrete_rm1 (all variables included)
#> [1] 0.6196796
Adjusted R-squared for concrete_rm3 (two variables omitted to generate regression model with better AIC)
#> [1] 0.619292
From the comparison above, the first model explained the target variable slightly better than the one obtained after feature selection, with 61.97% explanatory power.
For testing purposes, A created the training and test sets using the original (pre-scaled) data:
set.seed(921)
idns <- initial_split(concrete_pre, prop = 0.85, strata = "strength")
concrete_trains <- training(idns)
concrete_tests <- testing(idns)
#> [1] 364.0723
MAE generated from data train: 364.07
#> [1] 398.639
MAE generated from data test: 398.64
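An MAE figure like the ones above can be computed with caret's MAE() helper; this is only a sketch, as the exact model object and any rescaling the author used for these numbers are not shown (concrete_rm1 is used here as a placeholder):
MAE(predict(concrete_rm1, concrete_trains), concrete_trains$strength)   # MAE on the training data
MAE(predict(concrete_rm1, concrete_tests), concrete_tests$strength)     # MAE on the test data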
To avoid overfitting, A again split her data into 85% training data to generate the model and the remaining 15% to test it.
The linear regression model was generated with the lm() function.
#> [1] 0.6196796
From the summary above, the adjusted R-squared score is 0.6197, meaning the model is able to explain only 61.97% of the target variable - essentially the same score as the scaled model (concrete_rm1).
#> [1] 8.314682
MAE generated from data train: 8.31
#> [1] 8.60933
MAE generated from data test: 8.61
To avoid overfitting, A split her data once more into 85% training data to generate the model and the remaining 15% to test it.
set.seed(921)
idsq <- initial_split(mm_concrete, prop = 0.85, strata = "strength")
mm_train <- training(idsq)
mm_test <- testing(idsq)
Variable Normalization
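A minimal sketch of the column-wise min-max normalization, assuming mm_train and mm_test contain only numeric columns; the helper name normalize and the object mm_testn are assumptions, while mm_trainn matches the object used in the step-wise output further below:
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))   # map each column onto the 0-1 range
}
mm_trainn <- as.data.frame(lapply(mm_train, normalize))
mm_testn <- as.data.frame(lapply(mm_test, normalize))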
The linear regression model was generated with the lm() function.
#> [1] 0.6196796
From the summary above, the adjusted R-squared score is 61.97%, meaning the model is able to explain only 61.97% of the target variable - essentially the same as the two regression models generated above.
Putting the normalized variables back on their original scale:
denormalize_train <- function(x){
  # rescale x back using the overall min/max of the training set
  return(
    ((x - min(x)) / (max(x) - min(x))) *
      (max(mm_train) - min(mm_train)) +
      min(mm_train)
  )
}
denormalize_test <- function(x){
  # rescale x back using the overall min/max of the test set
  return(
    ((x - min(x)) / (max(x) - min(x))) *
      (max(mm_test) - min(mm_test)) +
      min(mm_test)
  )
}
#> [1] 49.03312
MAE generated from data train: 49.03
#> [1] 49.08752
MAE generated from data test: 49.09
From the tuning processes above, the regression model performed better when the variables were normalized with min-max normalization - where each variable's range is mapped onto a fixed 0-1 scale - than with z-score standardization.
To make sure this was already the optimum, A proceeded with step-wise selection.
#> Start: AIC=-6617.19
#> strength ~ cement + slag + flyash + water + super_plast + coarse_agg +
#> fine_agg + age
#>
#> Df Sum of Sq RSS AIC
#> <none> 0.056800 -6617.2
#> - fine_agg 1 0.0001653 0.056966 -6617.1
#> - coarse_agg 1 0.0002204 0.057021 -6616.5
#> - super_plast 1 0.0006528 0.057453 -6611.1
#> - water 1 0.0007256 0.057526 -6610.3
#> - flyash 1 0.0025756 0.059376 -6588.0
#> - slag 1 0.0052461 0.062047 -6557.0
#> - cement 1 0.0105175 0.067318 -6499.6
#> - age 1 0.0275678 0.084368 -6340.7
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
Step-wise feature selection suggested no variable removal. Hence, A proceeded directly to the next tuning step: the log-transformation model (concrete_rm5log) evaluated below.
#> [1] 0.8041318
From the summary above, the adjusted R-squared score is 80.41%, meaning the model is able to explain 80.41% of the target variable - considerably better than any of the models above.
#> [1] 32.99998
MAE generated from data train: 33.00
#> [1] 6.24852
MAE generated from data test: 6.25
A checked the normality of her model's residuals to ensure that the errors from her predictions were normally distributed. She used shapiro.test() and later visualized the residuals in a histogram.
For Log Transformation
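The test output below comes from shapiro.test() applied to the residuals, assuming the fitted log-transformation model is stored as concrete_rm5log (matching the output):
shapiro.test(concrete_rm5log$residuals)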
#>
#> Shapiro-Wilk normality test
#>
#> data: concrete_rm5log$residuals
#> W = 0.98736, p-value = 0.000009023
ggplot(data = as.data.frame(concrete_rm5log$residuals), aes(x = concrete_rm5log$residuals)) +
  geom_histogram(fill = "#CC0000", color = "orange") +
  labs(title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")
Assumption of normality: the Shapiro-Wilk test formally rejects strict normality (p < 0.05); based on the residual histogram, A nevertheless treated the assumption as met.
A checked whether her model's error variance formed any particular pattern (rather than being scattered randomly). She used the Breusch-Pagan hypothesis test and later visualized the errors in a scatterplot.
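The test below comes from bptest() in the lmtest package (loaded at the top), applied to the same model object:
bptest(concrete_rm5log)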
#>
#> studentized Breusch-Pagan test
#>
#> data: concrete_rm5log
#> BP = 63.314, df = 8, p-value = 0.0000000001039
Assumption of homoscedasticity: the Breusch-Pagan p-value is below 0.05, which formally indicates heteroscedasticity; A nevertheless proceeded with the model.
A checked whether her independent variables were correlated among themselves (rather than only with the concrete strength). She used the Variance Inflation Factor (VIF) and ensured that no variable had a VIF score greater than 10.
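The VIF scores below come from vif() in the car package (loaded at the top):
vif(concrete_rm5log)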
#> cement slag flyash water super_plast coarse_agg
#> 2.731454 2.971403 2.989579 3.584346 3.524065 2.526502
#> fine_agg age
#> 2.806878 1.031586
Assumption of no multicollinearity: met (all VIF scores are below 10).
model_type.lm <- function(x){
  # tell lime that an lm object is a regression model
  return('regression')
}
predict_model.lm <- function(x, newdata, type = "response") {
  # prediction hook for lime: return the lm predictions as a data frame
  res <- predict(x, newdata, type = "response") %>% as.data.frame()
  return(res)
}
#> [1] "lm"
#> [1] "regression"
From the LIME analysis above, A concluded that, with up to 80.41% explainability and an average error of 33.0 (train) and 6.25 (test), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):
age
cement
water
super plast
slag
Since the performance had not met expectations yet, A tried an alternative method. Random forest is a brute-force ensemble method built from many varied decision trees; although often described as a classification method, it can also handle regression cases such as this one.
A reused her train & test objects from previous model: mm_train (85%) and mm_test (15%). The observations were randomly sampled in a way that the strength variable in both sets as similar as possible
Before going to
A used the K-fold cross validation method to resample her trees. K-fold Cross-Validation was picked because it generally results in a less biased estimate of the model performance compared to other resampling methods.
The k picked was 5, to minimize bias in her Random Forest training.
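A minimal sketch of the training call with caret; the seed and control object name are assumptions, while concrete_rf matches the model object used later:
set.seed(921)
ctrl <- trainControl(method = "cv", number = 5)                # 5-fold cross-validation
concrete_rf <- train(strength ~ ., data = mm_train,
                     method = "rf", trControl = ctrl)          # random forest regression
concrete_rf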
Model Summary:
#> Random Forest
#>
#> 704 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 564, 563, 562, 564, 563
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 6.134173 0.8842844 4.602431
#> 5 5.687851 0.8894546 4.077606
#> 8 5.750056 0.8861399 4.044689
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.
The final model out of the 5 folds:
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 5
#>
#> Mean of squared residuals: 28.2658
#> % Var explained: 89.95
Model Visualization:
From the model above, the residual error is smallest when 5 predictors are sampled at each split (mtry = 5). The most significant variables were extracted with the function below:
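The importance table below is produced by caret's varImp() on the fitted model:
varImp(concrete_rf)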
#> rf variable importance
#>
#> Overall
#> age 100.000
#> cement 81.349
#> water 31.145
#> super_plast 12.931
#> slag 7.354
#> fine_agg 4.551
#> coarse_agg 1.074
#> flyash 0.000
The most important variables were:
Age
Cement
Water
Super plast
Slag
actualtrain <- mm_train$strength
predictedtrain <- unname(predict(concrete_rf, mm_train))
# manual R-squared: 1 - SSE/SST
R2train <- 1 - (sum((actualtrain - predictedtrain)^2) / sum((actualtrain - mean(actualtrain))^2))
R2train
#> [1] 0.9782211
#> [1] 1.737291
R-squared of the RF model: 97.82%
MAE generated from data train: 1.74
#> [1] 3.047134
From the LIME analysis above, A concluded that, with up to 97.82% explainability and an average error of 1.74 (train) and 3.05 (test), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):
age
cement
water
super plast
slag
The major variables were largely the same as in the LIME analysis of concrete_rm5log, as well as in the manual importance analysis of the random forest model.
Is your goal achieved?: No - neither modelling process has yet exceeded the required explainability of >90% on new data, even though the MAE has already been minimized to about 3.
Can the problem be solved by machine learning?: Potentially, yes.
What model did you use and how is the performance?: Among the regression models, performance was best when the predictors were log-transformed; however, the Random Forest algorithm performed best overall.
Z-Score Standardization Linear Regression Model (concrete_rm1)
Adjusted R Square: 61.97%
MAE train : 364.07
MAE test : 398.639
Linear Regression Model (concrete_rm4)
Adjusted R Square: 61.967%
MAE train : 8.314
MAE test : 8.609
Min-Max Standardization Linear Regression Model (concrete_rm5)
Adjusted R Square: 61.967%
MAE train : 49.033
MAE test : 49.087
Log-Transformation Regression Model (concrete_rm5log)
Adjusted R Square: 80.404%
MAE train : 32.998
MAE test : 6.248
New Data Rsquare : 78%
New Data MAE : 6.03
Random Forest (5-fold Cross-Validation) (concrete_rf)
Adjusted R Square: 97.822%
MAE train : 1.742
MAE test : 3.047
New Data Rsquare : 90%
New Data MAE : 3.76