1 Introduction

This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on predicting the compressive strength of concrete structures.

The conventional process of testing the compressive strength of concrete involves casting several cubes for the respective grade and observing the strength of the concrete over a period of time ranging from 7 to 28 days.

Various combinations of the components of concrete are selected, cubes are cast for each combination, and their test strength at 7, 14, and 28 days is noted down. This is a time-consuming and rather tedious process.

This project aims to predict the compressive strength of concrete with maximum accuracy and the lowest error (evaluation metric: MAE) for various quantities of constituent components as the input. Concrete cubes exhibit different compressive strengths depending on whether or not they are cured. Curing is the process of maintaining moisture to ensure uninterrupted hydration of the concrete.

Concrete strength increases if the cubes are cured periodically. The rate of strength gain is described below.

Time : % of total strength achieved

  • 1 day : 16%
  • 3 days : 40%
  • 7 days : 65%
  • 14 days : 90%
  • 28 days : 99%

At 28 days, concrete achieves 99% of its strength; thus, strength measurements are usually taken at 28 days.

The goal of this case is to predict the compressive strength based on the mixture properties.

2 Data Preparation

2.2 Read Data
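The original code chunk is not echoed, so the sketch below is only an assumption of how the data could be read; the file name data_train.csv is hypothetical, and dplyr is assumed for the glimpse() shown below.

library(dplyr)

# Sketch: read the training data and coerce the mixture id to a factor,
# matching the <fct> type in the glimpse() output below
concrete <- read.csv("data_train.csv", stringsAsFactors = FALSE) %>%
  mutate(id = as.factor(id))
glimpse(concrete)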

#> Observations: 825
#> Variables: 10
#> $ id          <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
#> $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
#> $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
#> $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
#> $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
#> $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
#> $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
#> $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...

The observation data consists of the following variables:

  • id : ID of each cement mixture,
  • cement : the amount of cement (kg) in a m3 mixture,
  • slag : the amount of blast furnace slag (kg) in a m3 mixture,
  • flyash : the amount of fly ash (kg) in a m3 mixture,
  • water : the amount of water (kg) in a m3 mixture,
  • super_plast : the amount of superplasticizer (kg) in a m3 mixture,
  • coarse_agg : the amount of coarse aggregate (kg) in a m3 mixture,
  • fine_agg : the amount of fine aggregate (kg) in a m3 mixture,
  • age : the number of resting days before the compressive strength measurement,
  • strength : concrete compressive strength measurement, in MPa.

2.3 Data Preprocess

Removing unused column id :
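A one-line sketch of this step (assuming dplyr):

# Drop the id column; it carries no predictive information
concrete <- concrete %>% select(-id)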

Take a peek at our dataset :
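Presumably via base R:

# Five-number summary plus mean for every remaining column
summary(concrete)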

#>      cement           slag            flyash           water      
#>  Min.   :102.0   Min.   :  0.00   Min.   :  0.00   Min.   :121.8  
#>  1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:164.9  
#>  Median :275.1   Median : 20.00   Median :  0.00   Median :184.0  
#>  Mean   :280.9   Mean   : 73.18   Mean   : 54.03   Mean   :181.1  
#>  3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20   3rd Qu.:192.0  
#>  Max.   :540.0   Max.   :359.40   Max.   :200.10   Max.   :247.0  
#>   super_plast       coarse_agg        fine_agg          age        
#>  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
#>  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0   1st Qu.:  7.00  
#>  Median : 6.500   Median : 968.0   Median :780.1   Median : 28.00  
#>  Mean   : 6.266   Mean   : 972.8   Mean   :775.6   Mean   : 45.14  
#>  3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8   3rd Qu.: 56.00  
#>  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
#>     strength    
#>  Min.   : 2.33  
#>  1st Qu.:23.64  
#>  Median :34.57  
#>  Mean   :35.79  
#>  3rd Qu.:45.94  
#>  Max.   :82.60

All of the predictor variables we have are numeric. We can scale the features (if needed) in the cross-validation section later.

NA value checking :
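A sketch of a check that would produce the output below:

# TRUE would indicate at least one missing value anywhere in the data
anyNA(concrete)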

#> [1] FALSE

Outliers checking :

Boxplot (figure): per-variable boxplots used to spot outliers.
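The figure itself is not reproduced here; a sketch of how such a panel of boxplots could be drawn (assuming ggplot2 and tidyr):

library(ggplot2)
library(tidyr)

# One boxplot per variable, each on its own scale
concrete %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = "", y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  labs(x = NULL, y = NULL)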

Based on the boxplot above, we find outliers in several columns: slag, water, super_plast, fine_agg, age, and strength. We could treat these columns by removing or imputing the outliers. However, for the super_plast and age columns specifically, we have to leave the data as it is, because I observed the same pattern in the submission dataset; keeping these values lets the model handle the same conditions in the new data.

Outliers collecting :

The next thing we should do is remove the outliers from slag, water, and fine_agg (top outliers only, because 594 as fine_agg is flagged as an outlier here, yet the same value appears in the submission data) and also from the strength column (in total only 2.55% of the observations in our dataset).
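A sketch of one way to collect and drop these outliers, using the boxplot whisker fences (the exact rule used originally is not shown):

# Whisker fences from boxplot statistics; values beyond them are outliers
upper_fence <- function(x) boxplot.stats(x)$stats[5]
lower_fence <- function(x) boxplot.stats(x)$stats[1]

concrete <- concrete %>%
  filter(
    slag     <= upper_fence(slag),
    water    >= lower_fence(water), water <= upper_fence(water),
    fine_agg <= upper_fence(fine_agg),   # top outliers only
    strength >= lower_fence(strength), strength <= upper_fence(strength)
  )
nrow(concrete)   # rows remaining (804 in the original run, below)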

#> [1] 804

2.4 Exploratory Data Analysis

There are some questions that we should try to answer in this case by exploring the relation between the target and the features (the correlation checks are sketched after this list):

  • Is strength positively correlated with age? The correlation between strength and age is 0.347. As mentioned in the introduction, the number of resting days before the measurement affects the concrete’s compressive strength.
  • Do strength and cement have a strong correlation? The correlation between strength and cement is 0.49. In theory, the compressive strength of concrete is strongly influenced by the ratio of cement to water, so it makes sense that cement and water are variables that influence the strength of concrete.
  • Does super_plast have a linear correlation with strength? A superplasticizer is also known as a water reducer. The amount of superplasticizer has a positive correlation with strength (0.35, as seen in the chart above) and a negative correlation with water.
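A minimal sketch of how these correlations can be computed:

# Pairwise Pearson correlations between the target and key predictors
cor(concrete$strength, concrete$age)          # ~0.35
cor(concrete$strength, concrete$cement)       # ~0.49
cor(concrete$strength, concrete$super_plast)  # ~0.35
cor(concrete$water,    concrete$super_plast)  # negative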

3 Modelling

3.1 Cross Val

The next step is to split the data into a training dataset (to build the model) and a testing dataset (to validate the model). There are no specific rules for this split. An 80:20 proportion of training vs. testing data is the most commonly used, also referred to as the [Pareto principle][1]. Traditionally we use 5-fold cross-validation to verify our algorithm, so 80:20 would be fine. In practice, though, the right proportion of training and testing data depends on the real situation and the quantity of the dataset.

[1] : https://en.wikipedia.org/wiki/Pareto_principle

Specifically for this case, we already have data-train (804 observations) and data-submission (205 observations). Therefore we use a 75:25 ratio (around 600:200 observations) to split data-train into a training dataset and a testing/validation dataset, in order to mirror the proportion of the data-submission that we will predict later.
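A sketch of the 75:25 split, assuming the rsample package (base R sampling would work equally well):

library(rsample)

set.seed(100)                                    # assumed seed, for reproducibility
splitted <- initial_split(concrete, prop = 0.75)
train    <- training(splitted)
test     <- testing(splitted)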

3.2 Model 1 : Linear Regression

Build Model
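A sketch of the fit; the value printed below appears to be the model's R-squared:

# Linear regression on all predictors
model_lm <- lm(strength ~ ., data = train)
summary(model_lm)$r.squared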

#> [1] 0.6170792

Based on the model summary, the super_plast variable does not have a significant effect on the target variable strength. This does not mean that super_plast has no influence on concrete compressive strength in other cases or projects (datasets). It could be that the effect of super_plast is already captured by other variables such as water, since, as we already know, a superplasticizer also acts as a water reducer.

Predict

Evaluate
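A sketch of the MAE calculation for the training and testing data (computed manually here; the original may use a helper package):

# Mean absolute error
mae <- function(actual, predicted) mean(abs(actual - predicted))

mae(train$strength, predict(model_lm, train))   # in-sample MAE
mae(test$strength,  predict(model_lm, test))    # out-of-sample MAE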

#> [1] 8.12385
#> [1] 8.581061

Based on the R-squared (0.617, adjusted 0.6052) and MAE (8.12 for the in-sample/training data and 8.58 for the out-of-sample/testing data), it can be concluded that the Linear Regression model's performance is not good enough. We can tune it later with other model approaches.

3.3 Model 2 : Random Forest

3.3.1 Model 2A (5-fold & 5 repeats)

Build Model
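A sketch of the caret training call for a 5-fold, 5-repeat setup; the seed and the use of the default tuning grid are assumptions:

library(caret)

set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
model_rf <- train(
  strength ~ ., data = train,
  method = "rf", importance = TRUE,
  trControl = ctrl
)
model_rf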

#> Random Forest 
#> 
#> 604 samples
#>   8 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 544, 544, 543, 543, 544, 544, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE      Rsquared   MAE     
#>   2     5.996248  0.8892703  4.629198
#>   5     5.436498  0.8942179  4.077684
#>   8     5.477170  0.8904525  4.092211
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.

Out-of-Bag (OOB) Error

#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry, importance = ..1) 
#>                Type of random forest: regression
#>                      Number of trees: 500
#> No. of variables tried at each split: 5
#> 
#>           Mean of squared residuals: 28.80675
#>                     % Var explained: 89.27

Predict

#>       20       21       29       36       43       46 
#> 39.52133 50.68752 41.44782 14.60070 11.18817 37.41862

Evaluate

#> [1] 2.235684
#> [1] 2.528387

To get the value of R-squared we can look at the % Var explained in the summary of our final model, or calculate it manually as below.
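A sketch of the manual calculation on the testing data:

# R-squared = 1 - SSE / SST
pred_rf <- predict(model_rf, test)
1 - sum((test$strength - pred_rf)^2) /
    sum((test$strength - mean(test$strength))^2)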

#> [1] 0.9453077

3.3.2 Model 2B (10-fold & 10 repeats)

Build Model

#> Random Forest 
#> 
#> 604 samples
#>   8 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 544, 544, 543, 543, 544, 544, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE      Rsquared   MAE     
#>   2     5.996248  0.8892703  4.629198
#>   5     5.436498  0.8942179  4.077684
#>   8     5.477170  0.8904525  4.092211
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.

Out-of-Bag (OOB) Error

#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry, importance = ..1) 
#>                Type of random forest: regression
#>                      Number of trees: 500
#> No. of variables tried at each split: 5
#> 
#>           Mean of squared residuals: 28.80675
#>                     % Var explained: 89.27

Predict

#>       20       21       29       36       43       46 
#> 39.52133 50.68752 41.44782 14.60070 11.18817 37.41862

Evaluate

#> [1] 2.235684
#> [1] 2.528387
#> [1] 0.9453077

Based on the Random Forest model, we get an R-squared of 0.8927 (from % Var explained; 0.9453 from the manual calculation on the testing data) and an MAE of 2.24 for the training data and 2.53 for the testing data. For this case, there is no difference in the performance metrics even when using a different combination of repeats and number of folds.

3.5 Model 3 : Tidymodels using the randomForest engine

MODEL FITTING USING PARSNIP

#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 5
#>   trees = 650
#>   min_n = 1
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 5
#>   trees = 650
#>   min_n = 1
#> 
#> Computational engine: randomForest

To fit our model, we have two options (both sketched after the list below):

  • Formula interface
  • X-Y interface
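A sketch of the specification and both fits, with the parameter values taken from the printed spec above (the training data object name is an assumption):

library(parsnip)

# Specification matching the printout above
spec_rf <- rand_forest(mtry = 5, trees = 650, min_n = 1) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

# Formula interface
model_rf_tidy <- fit(spec_rf, strength ~ ., data = train)

# X-Y interface (equivalent)
model_rf_tidy_xy <- fit_xy(spec_rf,
                           x = select(train, -strength),
                           y = train$strength)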
#> parsnip model object
#> 
#> Fit in:  3.4s
#> Call:
#>  randomForest(x = as.data.frame(x), y = y, ntree = ~650, mtry = ~5,      nodesize = ~1) 
#>                Type of random forest: regression
#>                      Number of trees: 650
#> No. of variables tried at each split: 5
#> 
#>           Mean of squared residuals: 0.09477608
#>                     % Var explained: 90.51

Predict

Evaluate

Back Transform
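The very small residual MSE printed above (0.095) suggests the outcome was square-root transformed during pre-processing, so predictions must be squared to return to the MPa scale. A sketch under that assumption (test_baked is a hypothetical pre-processed test set):

# Back-transform: undo the assumed sqrt transformation of the outcome
pred_sqrt     <- predict(model_rf_tidy, new_data = test_baked)$.pred
pred_strength <- pred_sqrt ^ 2   # back to the original MPa scale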

Evaluation Metrics

#> [1] "data.frame"

3.6 Model 4 : Tidymodels using the ranger engine

MODEL FITTING USING PARSNIP

#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 5
#>   trees = 650
#>   min_n = 1
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 5
#>   trees = 650
#>   min_n = 1
#> 
#> Engine-Specific Arguments:
#>   seed = 100
#> 
#> Computational engine: ranger

using fit()

using fit_xy()
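A sketch of the ranger specification and both fitting interfaces; the engine-specific seed matches the printout above:

# Same model, ranger engine, with a reproducibility seed
spec_ranger <- rand_forest(mtry = 5, trees = 650, min_n = 1) %>%
  set_engine("ranger", seed = 100) %>%
  set_mode("regression")

# using fit()
model_ranger <- fit(spec_ranger, strength ~ ., data = train)

# using fit_xy()
model_ranger_xy <- fit_xy(spec_ranger,
                          x = select(train, -strength),
                          y = train$strength)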

#> parsnip model object
#> 
#> Fit in:  1.6s
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~650,      min.node.size = ~1, seed = ~100, num.threads = 1, verbose = FALSE) 
#> 
#> Type:                             Regression 
#> Number of trees:                  650 
#> Sample size:                      684 
#> Number of independent variables:  8 
#> Mtry:                             5 
#> Target node size:                 1 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       0.09295988 
#> R squared (OOB):                  0.9070401

Predict

Evaluate

Evaluation Metrics

#> [1] "data.frame"

4 Conclusion

4.1 Choosing the best model

In this project I tuned a number of model approaches to improve performance:

  1. Linear Regression Model
  2. Random Forest Model
  3. Tidy Models “randomForest”
  4. Tidy Models “ranger”

The following is a comparison of the performance of the models that I have tuned. combined1 uses a 75:25 proportion of training and validation data; combined2 uses an 85:15 proportion.

This does not mean that 85:15 is a better training/validation proportion than 75:25, because, as I have mentioned before, everything depends on the conditions and goals to be achieved. In this case I want to build the best model to predict the strength values of the submission data that has been provided.

4.2 Tidymodels using randomForest engine

Based on several modeling experiments and tuning runs, I decided on the Tidymodels randomForest as the best tuned model for this case. I also tried some combinations of the available tidymodels function parameters, as follows (a recipe sketch follows the list):

  • mtry : the number of predictors that will be randomly sampled at each split when creating the tree models.
  • trees : the number of trees contained in the ensemble.
  • min_n : the minimum number of data points in a node that are required for the node to be split further.
  • data pre-process :
    • step_center() : centering numeric data
    • step_scale() : scaling numeric data
    • step_corr() : high correlation filter
    • step_sqrt() : square root transformation
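A sketch of how these steps could be combined into a recipe (assuming the recipes package; the exact step order and selections used originally are not shown):

library(recipes)

rec <- recipe(strength ~ ., data = train) %>%
  step_sqrt(all_numeric()) %>%        # square root transformation (incl. outcome)
  step_corr(all_predictors()) %>%     # drop highly correlated predictors
  step_center(all_predictors()) %>%   # centering numeric data
  step_scale(all_predictors()) %>%    # scaling numeric data
  prep()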

The following are the results of the combinations that were tried:

(mtry, trees, data pre-process) : R-Squared(%) / MAE

  • 5, 500 corr : 88.92 / 3.77
  • 4, 500 corr : 89.17 / 3.75
  • 4, 650 corr : 89.21 / 3.79
  • 5, 650 corr : 88.93 / 3.73

  • 5, 650 sqrt : 88.85 / 3.70
  • 4, 650 sqrt : 88.91 / 3.78
  • 4, 500 sqrt : 89.00 / 3.75
  • 5, 500 sqrt : 88.82 / 3.71

  • 5, 650 corr + sqrt : 89.07 / 3.69

  • 5, 650 corr + sqrt :
    • min_n = 1 : 89.16 / 3.58
    • min_n = 5 : 88.76 / 3.74
    • min_n = 10 : 88.19 / 3.98
    • min_n = 15 : 87.32 / 4.30

min_n = 1 :

  • 5, 500 corr + sqrt : 89.05 / 3.58
  • 5, 600 corr + sqrt : 89.14 / 3.60
  • 5, 1000 corr + sqrt : 89.12 / 3.60

train:validation proportion :

  • 85 : 15 : 91.92 / 3.39
  • 80 : 20 : 90.22 / 3.49

4.3 Predict Submission Data

#> Observations: 205
#> Variables: 10
#> $ id          <fct> S826, S827, S828, S829, S830, S831, S832, S833, S834, S...
#> $ cement      <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5,...
#> $ slag        <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237...
#> $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ water       <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0,...
#> $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ coarse_agg  <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0,...
#> $ fine_agg    <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0,...
#> $ age         <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, ...
#> $ strength    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> [1]  5.523025 76.601212
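Finally, a sketch of scoring the submission data with the chosen model; the object names, and the assumption that the same recipe and back-transform apply, are hypothetical:

# Pre-process the submission data with the same trained recipe,
# predict, back-transform, and write the file for submission
submission_baked <- bake(rec, new_data = select(data_submission, -id, -strength))

data_submission %>%
  mutate(strength = predict(model_rf_tidy, new_data = submission_baked)$.pred ^ 2) %>%
  select(id, strength) %>%
  write.csv("submission.csv", row.names = FALSE)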

After I submitted, the results were 91% for R-squared and 3.5 for MAE as the model performance metrics.