Modelling Context & Purpose

A was a product development assistant at a company focused on concrete manufacturing. To support her team, A was assigned to build a prediction model to determine the quality of a given mixture, measured by its compressive strength.

A used multiple prediction approaches to find the best-performing model.

Data Preparation

Before creating the model, A needed to ensure that her data was ready.

Read Data

Her data consisted of the following variables:

id: ID of each cement mixture

cement: The amount of cement (kg) per cubic metre of mixture

slag: The amount of blast furnace slag (kg) per cubic metre of mixture

flyash: The amount of fly ash (kg) per cubic metre of mixture

water: The amount of water (kg) per cubic metre of mixture

super_plast: The amount of superplasticizer (kg) per cubic metre of mixture

coarse_agg: The amount of coarse aggregate (kg) per cubic metre of mixture

fine_agg: The amount of fine aggregate (kg) per cubic metre of mixture

age: The number of resting days before the compressive strength measurement

strength: Concrete compressive strength measured in MPa

Data Inspection

Checking missing values
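A minimal sketch of the check, assuming the data was loaded into a data frame named concrete:

anyNA(concrete)           # TRUE if any cell is missing
colSums(is.na(concrete))  # missing-value count per column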

#> [1] FALSE
#>          id      cement        slag      flyash       water super_plast 
#>           0           0           0           0           0           0 
#>  coarse_agg    fine_agg         age    strength 
#>           0           0           0           0

A identified no missing values in her dataset, so she continued to data exploration.

Variable exploration

Variable Data Distribution

Before starting, A needed to understand how her data was distributed. She ran the summary() function to extract key statistics from her dataset. To support her analysis, she also plotted each variable as a histogram and a boxplot to get a rough picture of its distribution.
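A sketch of these inspection steps, again assuming the concrete data frame (the plotting loop is illustrative, not the author's exact code):

summary(concrete)

# histogram and boxplot for every variable except the id
num_vars <- setdiff(names(concrete), "id")
for (v in num_vars) {
  hist(concrete[[v]], main = v, xlab = v)
  boxplot(concrete[[v]], main = v, horizontal = TRUE)
}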

#>        id          cement           slag            flyash      
#>  S1     :  1   Min.   :102.0   Min.   :  0.00   Min.   :  0.00  
#>  S10    :  1   1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00  
#>  S100   :  1   Median :275.1   Median : 20.00   Median :  0.00  
#>  S101   :  1   Mean   :280.9   Mean   : 73.18   Mean   : 54.03  
#>  S102   :  1   3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20  
#>  S103   :  1   Max.   :540.0   Max.   :359.40   Max.   :200.10  
#>  (Other):819                                                    
#>      water        super_plast       coarse_agg        fine_agg    
#>  Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0  
#>  1st Qu.:164.9   1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0  
#>  Median :184.0   Median : 6.500   Median : 968.0   Median :780.1  
#>  Mean   :181.1   Mean   : 6.266   Mean   : 972.8   Mean   :775.6  
#>  3rd Qu.:192.0   3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8  
#>  Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6  
#>                                                                   
#>       age            strength    
#>  Min.   :  1.00   Min.   : 2.33  
#>  1st Qu.:  7.00   1st Qu.:23.64  
#>  Median : 28.00   Median :34.57  
#>  Mean   : 45.14   Mean   :35.79  
#>  3rd Qu.: 56.00   3rd Qu.:45.94  
#>  Max.   :365.00   Max.   :82.60  
#> 

From the boxplots above, A did not find any material outliers: the extreme points of each variable made up only a small share of the total observations. With this consideration, she included all observations in her modelling process.

Variance Analysis
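The nine values below were presumably produced by one var() call per variable, in column order; a compact equivalent is shown as well:

var(concrete$cement)   # repeated likewise for slag, flyash, water,
                       # super_plast, coarse_agg, fine_agg, age, strength

# compact equivalent: variance of every column except the id
sapply(concrete[ , -1], var)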

#> [1] 10851.54
#> [1] 7414.7
#> [1] 4081.476
#> [1] 447.7841
#> [1] 35.69299
#> [1] 5881.473
#> [1] 6565.186
#> [1] 3850.662
#> [1] 279.3822

From the above analysis, A identified that the variances differ widely, from 35.69 for super_plast up to 10851.54 for cement. This wide range implies that the variables carry very different weights, which could distort the modelling process: one variable could overpower the others simply because it is measured in larger units.

Variable Pre-processing

To prevent this, A scaled her variables. Since her variables had no fixed maximum range, she used z-score standardization to scale them. She then assigned the scaled data to the concrete_scale object.
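A minimal sketch of the standardization, dropping the non-numeric id column first:

# z-score standardization: (x - mean(x)) / sd(x), column by column
concrete_scale <- as.data.frame(scale(concrete[ , -1]))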

#>      cement              slag             flyash            water        
#>  Min.   :-1.71772   Min.   :-0.8498   Min.   :-0.8458   Min.   :-2.8030  
#>  1st Qu.:-0.82783   1st Qu.:-0.8498   1st Qu.:-0.8458   1st Qu.:-0.7663  
#>  Median :-0.05602   Median :-0.6176   Median :-0.8458   Median : 0.1363  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.66299   3rd Qu.: 0.7911   3rd Qu.: 1.0044   3rd Qu.: 0.5144  
#>  Max.   : 2.48692   Max.   : 3.3240   Max.   : 2.2863   Max.   : 3.1135  
#>   super_plast         coarse_agg          fine_agg             age         
#>  Min.   :-1.04874   Min.   :-2.24044   Min.   :-2.24166   Min.   :-0.7114  
#>  1st Qu.:-1.04874   1st Qu.:-0.53228   1st Qu.:-0.51382   1st Qu.:-0.6147  
#>  Median : 0.03924   Median :-0.06287   Median : 0.05514   Median :-0.2763  
#>  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.64181   3rd Qu.: 0.72471   3rd Qu.: 0.63150   3rd Qu.: 0.1750  
#>  Max.   : 4.34095   Max.   : 2.24510   Max.   : 2.67776   Max.   : 5.1545  
#>     strength       
#>  Min.   :-2.00171  
#>  1st Qu.:-0.72678  
#>  Median :-0.07287  
#>  Mean   : 0.00000  
#>  3rd Qu.: 0.60737  
#>  Max.   : 2.80064

The summary above shows that her variables now shared a similar range and were ready for modelling.

Variable Correlation Analysis

Finding the correlation of strength with each predictor
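A sketch of the computation behind the plot; the corrplot call is an assumption about how the figure was drawn:

# correlation of strength with every variable
cor(concrete_scale)[ , "strength"]

# visualize the full correlation matrix
library(corrplot)
corrplot(cor(concrete_scale))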

From the plot above, A found that mixture strength correlated most strongly with the amount of cement and the amount of superplasticizer per cubic metre.

Some variables had a small correlation coefficient (near zero). However, based on her domain knowledge, A believed that all of her variables mattered for determining mixture strength. Since correlation does not imply causation anyway, she kept all variables despite the low correlation scores.

Confirming the significance
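A sketch of one of the tests below; the same call is repeated for each predictor:

cor.test(concrete_scale$strength, concrete_scale$age)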
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$age
#> t = 10.43, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.2799728 0.4006103
#> sample estimates:
#>       cor 
#> 0.3416984
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$fine_agg
#> t = -5.452, df = 823, p-value = 0.00000006586
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.2517513 -0.1199778
#> sample estimates:
#>        cor 
#> -0.1867042
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$coarse_agg
#> t = -3.929, df = 823, p-value = 0.00009244
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.20207473 -0.06806607
#> sample estimates:
#>       cor 
#> -0.135691
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$super_plast
#> t = 11.127, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.3007983 0.4195290
#> sample estimates:
#>       cor 
#> 0.3616289
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$water
#> t = -8.4715, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.3447994 -0.2191910
#> sample estimates:
#>        cor 
#> -0.2832092
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$flyash
#> t = -2.9894, df = 823, p-value = 0.002879
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.17068925 -0.03563822
#> sample estimates:
#>        cor 
#> -0.1036414
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$slag
#> t = 3.7032, df = 823, p-value = 0.0002271
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.06029336 0.19457687
#> sample estimates:
#>       cor 
#> 0.1280218
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  concrete_scale$strength and concrete_scale$cement
#> t = 16.387, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.4427419 0.5457857
#> sample estimates:
#>       cor 
#> 0.4960081

All predictors had a statistically significant correlation with mixture strength.

Key Information

Significantly and positively correlated with mixture strength (statistically, higher values of these variables are associated with a stronger mixture):

  • The number of resting days before the compressive strength measurement - Age

  • Superplasticizer - super_plast

  • Blast furnace slag - slag

  • Cement - cement

Meanwhile, A would need to use less of the components below, which were negatively correlated with strength, to enhance the mixture's robustness:

  • Fly ash - flyash

  • Water - water

  • Coarse Aggregate - coarse_agg

  • Fine Aggregate - fine_agg

Modelling Process

1. Linear Regression Model

Since the linearity assumption was met, A predicted mixture strength using a linear regression model.

Cross Validation

To avoid overfitting, A split her data into 85% training data to generate the model and 15% test data to evaluate it.
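A minimal split sketch; the seed and the use of round() are assumptions, while the object names follow the model summary below:

set.seed(100)  # assumed seed for reproducibility
idx <- sample(nrow(concrete_scale), size = round(0.85 * nrow(concrete_scale)))
concrete_train <- concrete_scale[idx, ]
concrete_test  <- concrete_scale[-idx, ]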

Model Generation

The linear regression model was generated with the lm() function.
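Consistent with the Call line in the summary below:

concrete_rm1 <- lm(strength ~ ., data = concrete_train)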

Model 1st Evaluation

#> 
#> Call:
#> lm(formula = strength ~ ., data = concrete_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.7342 -0.3855  0.0429  0.4347  1.9508 
#> 
#> Coefficients:
#>              Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) -0.000957   0.023355  -0.041              0.96733    
#> cement       0.731969   0.064524  11.344 < 0.0000000000000002 ***
#> slag         0.500705   0.062495   8.012  0.00000000000000477 ***
#> flyash       0.327664   0.058368   5.614  0.00000002861842464 ***
#> water       -0.179943   0.060390  -2.980              0.00299 ** 
#> super_plast  0.111408   0.039421   2.826              0.00485 ** 
#> coarse_agg   0.084681   0.051561   1.642              0.10097    
#> fine_agg     0.087791   0.061728   1.422              0.15541    
#> age          0.469512   0.025564  18.366 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6193 on 695 degrees of freedom
#> Multiple R-squared:  0.624,  Adjusted R-squared:  0.6197 
#> F-statistic: 144.2 on 8 and 695 DF,  p-value: < 0.00000000000000022

From the summary above, the adjusted R-squared is 0.6197, meaning the model explains only 61.97% of the variance in the target variable.

Model Tuning with Step-Wise Regression

Generating the far end of the search space for step-wise selection: an intercept-only model that predicts strength from nothing but its own mean.
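A sketch of the boundary model and the three step() searches used in the subsections below; the object name concrete_none is an assumption:

# intercept-only model: the lower bound of the search
concrete_none <- lm(strength ~ 1, data = concrete_train)

# 1. backward elimination: start from the full model, drop variables
step(concrete_rm1, direction = "backward")

# 2. forward selection: start from the empty model, add variables
step(concrete_none, direction = "forward",
     scope = list(lower = concrete_none, upper = concrete_rm1))

# 3. both directions
step(concrete_none, direction = "both",
     scope = list(lower = concrete_none, upper = concrete_rm1))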

1. Backward Elimination Step-Wise

#> 
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
#>     coarse_agg + fine_agg + age, data = concrete_train)
#> 
#> Coefficients:
#> (Intercept)       cement         slag       flyash        water  super_plast  
#>   -0.000957     0.731969     0.500705     0.327664    -0.179943     0.111408  
#>  coarse_agg     fine_agg          age  
#>    0.084681     0.087791     0.469512

Using this method, no variables were omitted.

2. Forward Selection Step-Wise

#> 
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water + 
#>     flyash, data = concrete_train)
#> 
#> Coefficients:
#> (Intercept)       cement          age  super_plast         slag        water  
#>   -0.001107     0.647274     0.469854     0.091880     0.417023    -0.262098  
#>      flyash  
#>    0.258126

Using this method, two variables were removed: fine_agg and coarse_agg. A assigned this new model, which had a better AIC score, to the concrete_rm3 object.

3. Both

#> 
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water + 
#>     flyash, data = concrete_train)
#> 
#> Coefficients:
#> (Intercept)       cement          age  super_plast         slag        water  
#>   -0.001107     0.647274     0.469854     0.091880     0.417023    -0.262098  
#>      flyash  
#>    0.258126

This method eliminated the same two variables as the forward-selection method.

4. Model 2nd Evaluation

a. Comparing Adjusted R-squared

Adjusted R-squared for concrete_rm1 (all variables included)
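A sketch of how such a bare value is typically extracted:

summary(concrete_rm1)$adj.r.squared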

#> [1] 0.6196796

Adjusted R-squared for concrete_rm3 (two variables omitted to generate regression model with better AIC)

#> [1] 0.619292

From the comparison above, the first model explained the target variable slightly better than the model produced by feature selection, with 61.97% explanatory power.

b. Computing MAE from Applying the Model to the Test Data
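A sketch of the computation, assuming concrete_rm3 and the test split above (any back-transformation of the scaled predictions is not shown in the source):

pred <- predict(concrete_rm3, newdata = concrete_test)
mean(abs(pred - concrete_test$strength))  # mean absolute error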
#> [1] 398.639

MAE generated from the test data: 398.64

Model Tuning 2: Using Normal Variable

Cross Validation

As before, A split her data into an 85% training set to generate the model and a 15% test set to evaluate it.

Model Generation

The linear regression model was generated with the lm() function.

Adjusted R-Square from Data Train

#> [1] 0.6196796

From the summary above, the adjusted R-squared is 0.6197, meaning the model explains 61.97% of the variance in the target, effectively the same score as the scaled model (concrete_rm1).

Model Tuning 3: Variable Min-Max Normalization

Cross Validation

As before, A split her data into an 85% training set to generate the model and a 15% test set to evaluate it.

Variable Normalization
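A minimal sketch; the helper name normalize and the object name mm are assumptions:

# min-max normalization: map each column onto [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
mm <- as.data.frame(lapply(concrete[ , -1], normalize))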

Model Generation

The linear regression model was generated with the lm() function.

Adjusted R-Square from Data Train

#> [1] 0.6196796

From the summary above, the adjusted R-squared is again 61.97%, identical to the two regression models above; this is expected, since linearly rescaling the variables does not change how well a linear model fits the data.

Computing MAE from Applying the Model to the Test Data

#> [1] 49.08752

MAE generated from the test data: 49.09

From the above tuning processes, the regression model appeared to perform better when the variables were min-max normalized, i.e., when each variable's range was mapped onto [0, 1] with its furthest point set as the maximum. Note, though, that MAE is scale-dependent, so error scores computed on differently scaled targets should be compared with care.

To check whether this was already optimal, A proceeded with step-wise selection.

Re-tuning the model: Step-wise
1. Backward Elimination Step-Wise
#> Start:  AIC=-6617.19
#> strength ~ cement + slag + flyash + water + super_plast + coarse_agg + 
#>     fine_agg + age
#> 
#>               Df Sum of Sq      RSS     AIC
#> <none>                     0.056800 -6617.2
#> - fine_agg     1 0.0001653 0.056966 -6617.1
#> - coarse_agg   1 0.0002204 0.057021 -6616.5
#> - super_plast  1 0.0006528 0.057453 -6611.1
#> - water        1 0.0007256 0.057526 -6610.3
#> - flyash       1 0.0025756 0.059376 -6588.0
#> - slag         1 0.0052461 0.062047 -6557.0
#> - cement       1 0.0105175 0.067318 -6499.6
#> - age          1 0.0275678 0.084368 -6340.7
#> 
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
#>     coarse_agg + fine_agg + age, data = mm_trainn)
#> 
#> Coefficients:
#> (Intercept)       cement         slag       flyash        water  super_plast  
#>    -0.01999      0.11745      0.09719      0.08573     -0.14213      0.31169  
#>  coarse_agg     fine_agg          age  
#>     0.01846      0.01811      0.12647
2. Forward Selection Step-Wise
#> 
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
#>     coarse_agg + fine_agg + age, data = mm_trainn)
#> 
#> Coefficients:
#> (Intercept)       cement         slag       flyash        water  super_plast  
#>    -0.01999      0.11745      0.09719      0.08573     -0.14213      0.31169  
#>  coarse_agg     fine_agg          age  
#>     0.01846      0.01811      0.12647
3. Both
#> 
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
#>     coarse_agg + fine_agg + age, data = mm_trainn)
#> 
#> Coefficients:
#> (Intercept)       cement         slag       flyash        water  super_plast  
#>    -0.01999      0.11745      0.09719      0.08573     -0.14213      0.31169  
#>  coarse_agg     fine_agg          age  
#>     0.01846      0.01811      0.12647

Step-wise feature selection suggested no variable removal, so A proceeded directly to the next tuning.

Model Tuning 4: Log-Transforming the Variables
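A sketch of one possible transformation; since slag, flyash, and super_plast contain zeros, log1p (i.e., log(1 + x)) is assumed here instead of a plain log, and the train/test split is omitted for brevity:

# log-transform the predictors only; log1p handles the zero values safely
preds <- setdiff(names(concrete), c("id", "strength"))
concrete_log <- concrete[ , c(preds, "strength")]
concrete_log[preds] <- lapply(concrete_log[preds], log1p)

concrete_rm5log <- lm(strength ~ ., data = concrete_log)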

Adjusted R-Square from Data Train

#> [1] 0.8041318

From the summary above, the adjusted R-squared is 80.41%, meaning the model explains 80.41% of the variance in the target variable, considerably better than any of the models above.

Assumption Checking

a. Normality

A checked the normality of her model's residuals to ensure that the prediction errors were normally distributed. She used shapiro.test() and later visualized the errors in a histogram.

For Log Transformation
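A sketch matching the call shown in the output below:

shapiro.test(concrete_rm5log$residuals)
hist(concrete_rm5log$residuals, main = "Residual distribution")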

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  concrete_rm5log$residuals
#> W = 0.98736, p-value = 0.000009023

Assumption of normality: not met (the p-value is below 0.05, so the null hypothesis that the residuals are normally distributed is rejected).

b. Homoscedasticity

A checked whether her model's error variance formed no particular pattern (i.e., was scattered randomly). She used the Breusch-Pagan hypothesis test and later visualized the errors in a scatterplot.
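A sketch of the test, assuming the lmtest package:

library(lmtest)  # provides bptest()
bptest(concrete_rm5log)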

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  concrete_rm5log
#> BP = 63.314, df = 8, p-value = 0.0000000001039

Assumption of homoscedasticity: not met (the p-value is below 0.05, so the null hypothesis of constant error variance is rejected).

c. Multicollinearity

A checked whether her independent variables were correlated with one another (rather than only with the concrete strength). She used the Variance Inflation Factor (VIF) and ensured that no variable had a VIF score greater than 10.
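A sketch of the check, assuming the car package:

library(car)  # provides vif()
vif(concrete_rm5log)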

#>      cement        slag      flyash       water super_plast  coarse_agg 
#>    2.731454    2.971403    2.989579    3.584346    3.524065    2.526502 
#>    fine_agg         age 
#>    2.806878    1.031586

Assumption of no multicollinearity: met

LIME Model Interpretation

LIME Visualization & Interpretation

From the above LIME analysis, A concluded that, with 80.41% explainability and an average error of 32.99 (train data) and 6.25 (test data), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):

  • age

  • cement

  • water

  • super plast

  • slag

2. Random Forest

Since the performance had not met expectations yet, A tried an alternative method. Random forest is an ensemble method that aggregates many varied decision trees; although commonly associated with classification, it also handles regression problems such as this one.

Setting Up the Train & Test Object

A reused her train and test objects from the previous model: mm_train (85%) and mm_test (15%). The observations were randomly sampled so that the distribution of the strength variable in both sets was as similar as possible.

Setting Up the Random Forest Model

Before training the model, A set up the resampling scheme. She used the K-fold cross-validation method to resample her trees; K-fold cross-validation generally yields a less biased estimate of model performance than other resampling methods.

She picked k = 5 to keep the bias in her random forest training low.
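A sketch of the caret setup under these settings; the seed and object names are assumptions, though concrete_rf matches the narrative:

library(caret)

set.seed(100)  # assumed seed
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

concrete_rf <- train(strength ~ ., data = mm_train,
                     method = "rf", trControl = ctrl)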

RF Model Inspection

Model Summary:

#> Random Forest 
#> 
#> 704 samples
#>   8 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 564, 563, 562, 564, 563 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE      Rsquared   MAE     
#>   2     6.134173  0.8842844  4.602431
#>   5     5.687851  0.8894546  4.077606
#>   8     5.750056  0.8861399  4.044689
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.

The final model selected by the 5-fold cross-validation:

#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: regression
#>                      Number of trees: 500
#> No. of variables tried at each split: 5
#> 
#>           Mean of squared residuals: 28.2658
#>                     % Var explained: 89.95

Model Visualization:

From the model above, the smallest residual error was achieved with mtry = 5, i.e., five predictors tried at each split. The most significant variables were extracted with the function below:
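Presumably caret's importance helper, whose print method matches the "rf variable importance" header below:

varImp(concrete_rf)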

#> rf variable importance
#> 
#>             Overall
#> age         100.000
#> cement       81.349
#> water        31.145
#> super_plast  12.931
#> slag          7.354
#> fine_agg      4.551
#> coarse_agg    1.074
#> flyash        0.000

The most important variables were:

  • Age

  • Cement

  • Water

  • Super plast

  • Slag

Computing MAE from Applying the Model to the Test Data
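A sketch matching the output below:

rf_pred <- predict(concrete_rf, newdata = mm_test)
mean(abs(rf_pred - mm_test$strength))  # mean absolute error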

#> [1] 3.047134
  • MAE generated from the test data: 3.05

LIME Model Interpretation

LIME Visualization
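A sketch of the LIME workflow for the caret model, assuming the lime package (the object names and the number of observations explained are assumptions):

library(lime)

# build an explainer from the training predictors and the fitted model
explainer <- lime(mm_train[ , names(mm_train) != "strength"], model = concrete_rf)

# explain a few test observations using the top 5 features each
explanation <- explain(mm_test[1:4, names(mm_test) != "strength"],
                       explainer, n_features = 5)
plot_features(explanation)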

From the above LIME analysis, A concluded that, with 97.82% explainability and an average error of 1.74 (train data) and 3.05 (test data), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):

  • age

  • cement

  • water

  • super plast

  • slag

The major variables closely matched the LIME analysis of concrete_rm5log, as well as the manual importance analysis of the random forest model.

Summary

  • Is your goal achieved? No: neither modelling process reached the required explainability of >90%, even though the MAE was already minimized to about 3.

  • Can the problem be solved by machine learning? Potentially, yes.

  • What model did you use, and how did it perform? Among the regression models, the log-transformed one showed the best performance; however, the random forest algorithm performed best overall.

Z-Score Standardization Linear Regression Model (concrete_rm1)

  • Adjusted R Square: 61.97%

  • MAE train : 364.07

  • MAE test : 398.639

Linear Regression Model (concrete_rm4)

  • Adjusted R Square: 61.967%

  • MAE train : 8.314

  • MAE test : 8.609

Min-Max Standardization Linear Regression Model (concrete_rm5)

  • Adjusted R Square: 61.967%

  • MAE train : 49.033

  • MAE test : 49.087

Log-Transformation Regression Model (concrete_rm5log)

  • Adjusted R Square: 80.404%

  • MAE train : 32.998

  • MAE test : 6.248

  • New Data Rsquare : 78%

  • New Data MAE : 6.03

Random Forest (5-fold Cross-Validation) (concrete_rf)

  • Adjusted R Square: 97.822%

  • MAE train : 1.742

  • MAE test : 3.047

  • New Data Rsquare : 90%

  • New Data MAE : 3.76