library(tidyverse)
library(caret)
library(GGally)
library(lmtest)
library(car)
library(rsample)
library(class)
library(lime)
library(ggplot2)
library(reshape2)
A was a product development assistant at a company focusing on concrete making. To support her team, A was assigned to build a prediction model to find out whether a certain mixture would be of good quality.
A will use multiple prediction approaches to generate the best model.
Before creating the model, A needed to ensure that her data was ready.
Her data consisted of the following variables:
id: Id of each cement mixture
cement: The amount of cement (kg) in a cubic metre of mixture
slag: The amount of blast furnace slag (kg) in a cubic metre of mixture
flyash: The amount of fly ash (kg) in a cubic metre of mixture
water: The amount of water (kg) in a cubic metre of mixture
super_plast: The amount of superplasticizer (kg) in a cubic metre of mixture
coarse_agg: The amount of coarse aggregate (kg) in a cubic metre of mixture
fine_agg: The amount of fine aggregate (kg) in a cubic metre of mixture
age: the number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in MPa unit
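The missing-value check that produced the output below is not shown; a minimal sketch, assuming the raw data is stored in a data frame named concrete (the object name is an assumption):
anyNA(concrete)            # TRUE if any cell is missing
colSums(is.na(concrete))   # missing-value count per column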
#> [1] FALSE
#> id cement slag flyash water super_plast
#> 0 0 0 0 0 0
#> coarse_agg fine_agg age strength
#> 0 0 0 0
A identified no missing values in her dataset, hence she continued to data exploration.
Before starting, A needed to understand how her data was distributed. She ran the summary() function to extract key statistical information from her dataset. To help her analysis, she also plotted the variables as histograms and boxplots for a rough picture of each variable's distribution.
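The exploration chunk itself is not shown; a minimal sketch, assuming the data frame is named concrete and the long-format object is named melt (to match the boxplot code used further below):
summary(concrete)
melt <- reshape2::melt(concrete, id.vars = "id")   # reshape to long format for faceted plots
ggplot(melt, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, show.legend = FALSE) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(title = "Concrete Variable Distribution")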
#> id cement slag flyash
#> S1 : 1 Min. :102.0 Min. : 0.00 Min. : 0.00
#> S10 : 1 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
#> S100 : 1 Median :275.1 Median : 20.00 Median : 0.00
#> S101 : 1 Mean :280.9 Mean : 73.18 Mean : 54.03
#> S102 : 1 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
#> S103 : 1 Max. :540.0 Max. :359.40 Max. :200.10
#> (Other):819
#> water super_plast coarse_agg fine_agg
#> Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
#> 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
#> Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
#> Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
#> 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
#> Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
#>
#> age strength
#> Min. : 1.00 Min. : 2.33
#> 1st Qu.: 7.00 1st Qu.:23.64
#> Median : 28.00 Median :34.57
#> Mean : 45.14 Mean :35.79
#> 3rd Qu.: 56.00 3rd Qu.:45.94
#> Max. :365.00 Max. :82.60
#>
ggplot(data = melt, mapping = aes(x = value, fill = variable)) +
  geom_boxplot(fill = "#CC0000", color = "orange") +
  facet_wrap(variable ~ ., ncol = 3) +
  labs(title = "Concrete Variable Variance") +
  coord_flip()
From the boxplot set above, A did not find any material outliers, since the extreme points for each variable were few compared to the total number of observations. With this consideration, she included all observations in her modelling process.
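The variance figures below were printed one variable at a time with var(); a compact equivalent (a sketch, assuming the data frame name concrete) computes them for every numeric column at once:
sapply(concrete %>% select(-id), var)   # per-column variance, id excluded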
#> [1] 10851.54
#> [1] 7414.7
#> [1] 4081.476
#> [1] 447.7841
#> [1] 35.69299
#> [1] 5881.473
#> [1] 6565.186
#> [1] 3850.662
#> [1] 279.3822
From the analysis above, A identified that the variables vary widely, with the smallest variance belonging to super_plast (35.69) and the largest to cement (10,851.54). This wide range of variances implies that the variables carry different weights, which could distort the modelling process: one variable may overpower the others simply because it is measured in larger units.
To prevent this, A scaled her variables. Since the variables did not share an exact maximum range, she used z-score standardization. She then assigned the scaled data to the concrete_scale object.
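A minimal sketch of the standardization step, assuming the raw data frame is named concrete; scale() centres each column on its mean and divides by its standard deviation:
concrete_scale <- concrete %>%
  select(-id) %>%
  scale() %>%
  as.data.frame()
summary(concrete_scale)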
#> cement slag flyash water
#> Min. :-1.71772 Min. :-0.8498 Min. :-0.8458 Min. :-2.8030
#> 1st Qu.:-0.82783 1st Qu.:-0.8498 1st Qu.:-0.8458 1st Qu.:-0.7663
#> Median :-0.05602 Median :-0.6176 Median :-0.8458 Median : 0.1363
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.66299 3rd Qu.: 0.7911 3rd Qu.: 1.0044 3rd Qu.: 0.5144
#> Max. : 2.48692 Max. : 3.3240 Max. : 2.2863 Max. : 3.1135
#> super_plast coarse_agg fine_agg age
#> Min. :-1.04874 Min. :-2.24044 Min. :-2.24166 Min. :-0.7114
#> 1st Qu.:-1.04874 1st Qu.:-0.53228 1st Qu.:-0.51382 1st Qu.:-0.6147
#> Median : 0.03924 Median :-0.06287 Median : 0.05514 Median :-0.2763
#> Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.64181 3rd Qu.: 0.72471 3rd Qu.: 0.63150 3rd Qu.: 0.1750
#> Max. : 4.34095 Max. : 2.24510 Max. : 2.67776 Max. : 5.1545
#> strength
#> Min. :-2.00171
#> 1st Qu.:-0.72678
#> Median :-0.07287
#> Mean : 0.00000
#> 3rd Qu.: 0.60737
#> Max. : 2.80064
From the summary above, her variables were already within a similar range and ready to be used for modelling.
From the plot above, A found that the strength of a mixture is most strongly correlated with the amount of cement and the amount of superplasticizer per cubic metre.
Some variables did have a small correlation coefficient (closer to zero). However, based on her domain knowledge, A believed that all of her variables were important in determining mixture strength. Since correlation is not always equal to causation anyway, she kept all variables despite the low correlation scores.
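The pairwise tests below come from cor.test(); the calls are a sketch, and ggcorr() is one assumed way the correlation plot referred to above may have been drawn (GGally is loaded at the top):
ggcorr(concrete_scale, label = TRUE)                     # assumption: overview correlation plot
cor.test(concrete_scale$strength, concrete_scale$age)    # first of the Pearson tests printed below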
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$age
#> t = 10.43, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2799728 0.4006103
#> sample estimates:
#> cor
#> 0.3416984
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$fine_agg
#> t = -5.452, df = 823, p-value = 0.00000006586
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.2517513 -0.1199778
#> sample estimates:
#> cor
#> -0.1867042
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$coarse_agg
#> t = -3.929, df = 823, p-value = 0.00009244
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.20207473 -0.06806607
#> sample estimates:
#> cor
#> -0.135691
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$super_plast
#> t = 11.127, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3007983 0.4195290
#> sample estimates:
#> cor
#> 0.3616289
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$water
#> t = -8.4715, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.3447994 -0.2191910
#> sample estimates:
#> cor
#> -0.2832092
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$flyash
#> t = -2.9894, df = 823, p-value = 0.002879
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.17068925 -0.03563822
#> sample estimates:
#> cor
#> -0.1036414
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$slag
#> t = 3.7032, df = 823, p-value = 0.0002271
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.06029336 0.19457687
#> sample estimates:
#> cor
#> 0.1280218
#>
#> Pearson's product-moment correlation
#>
#> data: concrete_scale$strength and concrete_scale$cement
#> t = 16.387, df = 823, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4427419 0.5457857
#> sample estimates:
#> cor
#> 0.4960081
All predictors had a statistically significant correlation with the mixture strength.
Significantly and positively correlated with the mixture strength (adding more of these components is statistically associated with a stronger mixture):
The number of resting days before the compressive strength measurement - age
Superplasticizer - super_plast
Blast furnace slag - slag
Cement - cement
Meanwhile, A would need to add less of the components below to enhance the mixture's robustness (significantly and negatively correlated):
Fly ash - flyash
Water - water
Coarse aggregate - coarse_agg
Fine aggregate - fine_agg
Since the linearity assumption was met, A predicted the mixture strength using a linear regression model.
To avoid overfitting, A split her data into 85% training data to generate the model and the remaining 15% to test it.
The linear regression model was generated with the lm() function.
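A sketch of the split and the fit: concrete_train and concrete_rm1 match names referenced elsewhere in the text, while the index name idx, concrete_test and the seed are assumptions:
set.seed(921)
idx <- initial_split(concrete_scale, prop = 0.85, strata = "strength")
concrete_train <- training(idx)
concrete_test <- testing(idx)
concrete_rm1 <- lm(strength ~ ., data = concrete_train)   # regression on all predictors
summary(concrete_rm1)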
#>
#> Call:
#> lm(formula = strength ~ ., data = concrete_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.7342 -0.3855 0.0429 0.4347 1.9508
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.000957 0.023355 -0.041 0.96733
#> cement 0.731969 0.064524 11.344 < 0.0000000000000002 ***
#> slag 0.500705 0.062495 8.012 0.00000000000000477 ***
#> flyash 0.327664 0.058368 5.614 0.00000002861842464 ***
#> water -0.179943 0.060390 -2.980 0.00299 **
#> super_plast 0.111408 0.039421 2.826 0.00485 **
#> coarse_agg 0.084681 0.051561 1.642 0.10097
#> fine_agg 0.087791 0.061728 1.422 0.15541
#> age 0.469512 0.025564 18.366 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6193 on 695 degrees of freedom
#> Multiple R-squared: 0.624, Adjusted R-squared: 0.6197
#> F-statistic: 144.2 on 8 and 695 DF, p-value: < 0.00000000000000022
From the summary above, the adjusted R-squared score is 0.6197, meaning the model is able to explain only 61.97% of the variance in the target variable.
Next, A ran step-wise feature selection to check whether a reduced model would give a better AIC, starting from the furthest end of the model (the full set of predictors):
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.000957 0.731969 0.500705 0.327664 -0.179943 0.111408
#> coarse_agg fine_agg age
#> 0.084681 0.087791 0.469512
Using this method, no variables were omitted.
#>
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water +
#> flyash, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement age super_plast slag water
#> -0.001107 0.647274 0.469854 0.091880 0.417023 -0.262098
#> flyash
#> 0.258126
Using this method, two variables were removed: fine_agg and coarse_agg. A assigned this new model, which had the better AIC score, to the concrete_rm3 object.
#>
#> Call:
#> lm(formula = strength ~ cement + age + super_plast + slag + water +
#> flyash, data = concrete_train)
#>
#> Coefficients:
#> (Intercept) cement age super_plast slag water
#> -0.001107 0.647274 0.469854 0.091880 0.417023 -0.262098
#> flyash
#> 0.258126
This method eliminated the same two variables as the forward-selection method.
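A sketch of how the three step-wise searches could be run with step(); apart from concrete_rm3 (referenced above), the object names and the mapping of directions to outputs are assumptions:
concrete_none <- lm(strength ~ 1, data = concrete_train)   # intercept-only lower bound
concrete_all <- lm(strength ~ ., data = concrete_train)    # full-model upper bound
step(concrete_all, direction = "backward", trace = 0)
concrete_rm3 <- step(concrete_none, direction = "forward",
                     scope = list(lower = formula(concrete_none), upper = formula(concrete_all)),
                     trace = 0)
step(concrete_none, direction = "both",
     scope = list(lower = formula(concrete_none), upper = formula(concrete_all)),
     trace = 0)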
Adjusted R-squared for concrete_rm1 (all variables included)
#> [1] 0.6196796
Adjusted R-squared for concrete_rm3 (two variables omitted to generate regression model with better AIC)
#> [1] 0.619292
From the comparison above, the first model explained the target variable slightly better than the one obtained after feature selection, with 61.97% explanatory power.
For testing purposes, A created the training and test sets using the original (pre-scaled) data:
set.seed(921)
idns <- initial_split(concrete_pre, prop = 0.85, strata = "strength")
concrete_trains <- training(idns)
concrete_tests <- testing(idns)
#> [1] 364.0723
MAE generated from data train: 364.07
#> [1] 398.639
MAE generated from data test: 398.64
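An MAE figure like the ones above can be computed with caret's MAE() helper; this is only a sketch, as the exact model object and any rescaling the author used for these numbers are not shown (concrete_rm1 is used here as a placeholder):
MAE(predict(concrete_rm1, concrete_trains), concrete_trains$strength)   # MAE on the training data
MAE(predict(concrete_rm1, concrete_tests), concrete_tests$strength)     # MAE on the test data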
To avoid overfitting, A again split her data into 85% training data to generate the model and the remaining 15% to test it.
The linear regression model was generated with the lm() function.
#> [1] 0.6196796
From the summary above, the adjusted R-squared score is 0.6197, meaning the model is able to explain only 61.97% of the target variable - essentially the same score as the scaled model (concrete_rm1).
#> [1] 8.314682
MAE generated from data train: 8.31
#> [1] 8.60933
MAE generated from data test: 8.61
To avoid overfitting, A split her data once more into 85% training data to generate the model and the remaining 15% to test it.
set.seed(921)
idsq <- initial_split(mm_concrete, prop = 0.85, strata = "strength")
mm_train <- training(idsq)
mm_test <- testing(idsq)
Variable Normalization
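A minimal sketch of the column-wise min-max normalization, assuming mm_train and mm_test contain only numeric columns; the helper name normalize and the object mm_testn are assumptions, while mm_trainn matches the object used in the step-wise output further below:
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))   # map each column onto the 0-1 range
}
mm_trainn <- as.data.frame(lapply(mm_train, normalize))
mm_testn <- as.data.frame(lapply(mm_test, normalize))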
The linear regression model was generated with the lm() function.
#> [1] 0.6196796
From the summary above, the adjusted R-squared score is 61.97%, meaning the model is able to explain only 61.97% of the target variable - essentially the same as the two regression models generated above.
Putting the normalized variables back on their original scale:
denormalize_train <- function(x){
  # rescale x back using the overall min/max of the training set
  return(
    ((x - min(x)) / (max(x) - min(x))) *
      (max(mm_train) - min(mm_train)) +
      min(mm_train)
  )
}
denormalize_test <- function(x){
  # rescale x back using the overall min/max of the test set
  return(
    ((x - min(x)) / (max(x) - min(x))) *
      (max(mm_test) - min(mm_test)) +
      min(mm_test)
  )
}
#> [1] 49.03312
MAE generated from data train: 49.03
#> [1] 49.08752
MAE generated from data test: 49.09
From the tuning processes above, the regression model performed better when the variables were normalized with min-max normalization - where each variable's range is mapped onto a fixed 0-1 scale - than with z-score standardization.
To make sure this was already the optimum, A proceeded with step-wise selection.
#> Start: AIC=-6617.19
#> strength ~ cement + slag + flyash + water + super_plast + coarse_agg +
#> fine_agg + age
#>
#> Df Sum of Sq RSS AIC
#> <none> 0.056800 -6617.2
#> - fine_agg 1 0.0001653 0.056966 -6617.1
#> - coarse_agg 1 0.0002204 0.057021 -6616.5
#> - super_plast 1 0.0006528 0.057453 -6611.1
#> - water 1 0.0007256 0.057526 -6610.3
#> - flyash 1 0.0025756 0.059376 -6588.0
#> - slag 1 0.0052461 0.062047 -6557.0
#> - cement 1 0.0105175 0.067318 -6499.6
#> - age 1 0.0275678 0.084368 -6340.7
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
#>
#> Call:
#> lm(formula = strength ~ cement + slag + flyash + water + super_plast +
#> coarse_agg + fine_agg + age, data = mm_trainn)
#>
#> Coefficients:
#> (Intercept) cement slag flyash water super_plast
#> -0.01999 0.11745 0.09719 0.08573 -0.14213 0.31169
#> coarse_agg fine_agg age
#> 0.01846 0.01811 0.12647
Step-wise feature selection suggested no variable removal. Hence, A proceeded directly to the next tuning step: the log-transformation model (concrete_rm5log) evaluated below.
#> [1] 0.8041318
From the summary above, the adjusted R-squared score is 80.41%, meaning the model is able to explain 80.41% of the target variable - considerably better than any of the models above.
#> [1] 32.99998
MAE generated from data train: 33.00
#> [1] 6.24852
MAE generated from data test: 6.25
A checked the normality of her model's residuals to ensure that the errors from her predictions were normally distributed. She used shapiro.test() and later visualized the residuals in a histogram.
For Log Transformation
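The test output below comes from shapiro.test() applied to the residuals, assuming the fitted log-transformation model is stored as concrete_rm5log (matching the output):
shapiro.test(concrete_rm5log$residuals)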
#>
#> Shapiro-Wilk normality test
#>
#> data: concrete_rm5log$residuals
#> W = 0.98736, p-value = 0.000009023
ggplot(data = as.data.frame(concrete_rm5log$residuals), aes(x = concrete_rm5log$residuals)) +
  geom_histogram(fill = "#CC0000", color = "orange") +
  labs(title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")
Assumption of normality: the Shapiro-Wilk test formally rejects strict normality (p < 0.05); based on the residual histogram, A nevertheless treated the assumption as met.
A checked whether her model's error variance formed any particular pattern (rather than being scattered randomly). She used the Breusch-Pagan hypothesis test and later visualized the errors in a scatterplot.
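The test below comes from bptest() in the lmtest package (loaded at the top), applied to the same model object:
bptest(concrete_rm5log)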
#>
#> studentized Breusch-Pagan test
#>
#> data: concrete_rm5log
#> BP = 63.314, df = 8, p-value = 0.0000000001039
Assumption of homoscedasticity: the Breusch-Pagan p-value is below 0.05, which formally indicates heteroscedasticity; A nevertheless proceeded with the model.
A checked whether her independent variables were correlated among themselves (rather than only with the concrete strength). She used the Variance Inflation Factor (VIF) and ensured that no variable had a VIF score greater than 10.
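The VIF scores below come from vif() in the car package (loaded at the top):
vif(concrete_rm5log)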
#> cement slag flyash water super_plast coarse_agg
#> 2.731454 2.971403 2.989579 3.584346 3.524065 2.526502
#> fine_agg age
#> 2.806878 1.031586
Assumption of no multicollinearity: met (all VIF scores are below 10).
model_type.lm <- function(x){
  # tell lime that an lm object is a regression model
  return('regression')
}
predict_model.lm <- function(x, newdata, type = "response") {
  # prediction hook for lime: return the lm predictions as a data frame
  res <- predict(x, newdata, type = "response") %>% as.data.frame()
  return(res)
}
#> [1] "lm"
#> [1] "regression"
From the LIME analysis above, A concluded that, with up to 80.41% explainability and an average error of 33.0 (train) and 6.25 (test), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):
age
cement
water
super plast
slag
Since the performance had not met expectations yet, A tried an alternative method. Random forest is a brute-force ensemble method built from many varied decision trees; although often described as a classification method, it can also handle regression cases such as this one.
A reused her train & test objects from previous model: mm_train (85%) and mm_test (15%). The observations were randomly sampled in a way that the strength variable in both sets as similar as possible
Before going to
A used the K-fold cross validation method to resample her trees. K-fold Cross-Validation was picked because it generally results in a less biased estimate of the model performance compared to other resampling methods.
The k picked was 5, to minimize bias in her Random Forest training.
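A minimal sketch of the training call with caret; the seed and control object name are assumptions, while concrete_rf matches the model object used later:
set.seed(921)
ctrl <- trainControl(method = "cv", number = 5)                # 5-fold cross-validation
concrete_rf <- train(strength ~ ., data = mm_train,
                     method = "rf", trControl = ctrl)          # random forest regression
concrete_rf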
Model Summary:
#> Random Forest
#>
#> 704 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 564, 563, 562, 564, 563
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 6.134173 0.8842844 4.602431
#> 5 5.687851 0.8894546 4.077606
#> 8 5.750056 0.8861399 4.044689
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.
The final model out of the 5 folds:
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 5
#>
#> Mean of squared residuals: 28.2658
#> % Var explained: 89.95
Model Visualization:
From the model above, the residual error is smallest when 5 predictors are sampled at each split (mtry = 5). The most significant variables were extracted with the function below:
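The importance table below is produced by caret's varImp() on the fitted model:
varImp(concrete_rf)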
#> rf variable importance
#>
#> Overall
#> age 100.000
#> cement 81.349
#> water 31.145
#> super_plast 12.931
#> slag 7.354
#> fine_agg 4.551
#> coarse_agg 1.074
#> flyash 0.000
The most important variables were:
Age
Cement
Water
Super plast
Slag
actualtrain <- mm_train$strength
predictedtrain <- unname(predict(concrete_rf, mm_train))
# manual R-squared: 1 - SSE/SST
R2train <- 1 - (sum((actualtrain - predictedtrain)^2) / sum((actualtrain - mean(actualtrain))^2))
R2train
#> [1] 0.9782211
#> [1] 1.737291
R-squared of the RF model: 97.82%
MAE generated from data train: 1.74
#> [1] 3.047134
From the LIME analysis above, A concluded that, with up to 97.82% explainability and an average error of 1.74 (train) and 3.05 (test), the most relevant/significant variables for determining concrete strength were the following five (in no particular order):
age
cement
water
super plast
slag
The major variables were largely the same as in the LIME analysis of concrete_rm5log, as well as in the manual importance analysis of the random forest model.
Is your goal achieved?: No - neither modelling process has yet exceeded the required explainability of >90% on new data, even though the MAE has already been minimized to about 3.
Can the problem be solved by machine learning?: Potentially, yes.
What model did you use and how is the performance?: Among the regression models, performance was best when the predictors were log-transformed; however, the Random Forest algorithm performed best overall.
Z-Score Standardization Linear Regression Model (concrete_rm1)
Adjusted R Square: 61.97%
MAE train : 364.07
MAE test : 398.639
Linear Regression Model (concrete_rm4)
Adjusted R Square: 61.967%
MAE train : 8.314
MAE test : 8.609
Min-Max Standardization Linear Regression Model (concrete_rm5)
Adjusted R Square: 61.967%
MAE train : 49.033
MAE test : 49.087
Log-Transformation Regression Model (concrete_rm5log)
Adjusted R Square: 80.404%
MAE train : 32.998
MAE test : 6.248
New Data Rsquare : 78%
New Data MAE : 6.03
Random Forest (5-fold Cross-Validation) (concrete_rf)
Adjusted R Square: 97.822%
MAE train : 1.742
MAE test : 3.047
New Data Rsquare : 90%
New Data MAE : 3.76