1. Overview and Objective

This subject was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is based on his 1998 research on predicting the compressive strength of concrete structures.

The objective is to predict the compressive strength of concrete with maximum accuracy and minimum error for various mixtures of materials as input. Concrete cubes exhibit different compressive strengths depending on whether they are cured or not. Curing is the process of maintaining moisture to ensure uninterrupted hydration of the concrete.

The concrete strength increases if the concrete cubes are cured periodically. The rate of increase in strength is shown below.

Time      % of Total Strength Achieved
1 day     16%
3 days    40%
7 days    65%
14 days   90%
28 days   99%

At 28 days, concrete achieves 99% of its strength; thus strength measurements are usually taken at 28 days.

Again, the goal of this subject is to predict the compressive strength based on the mixture of materials.

2. Data Preparation

Here are the data pre-processing steps that have to be performed before we proceed: set up libraries, read the data, take a glimpse of the data and check the first/last rows if needed, inspect the data structure and summaries, remove unnecessary columns, and check for NA values.

After that we will check whether feature scaling is necessary or not. Finally, we will detect outliers and decide what to do with them.

Set Up Libraries

First, we set up the libraries needed to process the data.
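A minimal setup chunk might look like the following; the package list and load order are inferred from the attach messages below.

```r
# Packages used throughout this analysis (inferred from the attach messages)
library(tidyverse)     # data wrangling and plotting
library(MLmetrics)     # MAE, RMSE and other evaluation metrics
library(caret)         # cross-validation and model training
library(tidymodels)    # rsample, recipes, parsnip, yardstick
library(ranger)        # fast random forest engine
library(randomForest)  # classic random forest implementation
library(lime)          # local model interpretation
```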

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
## 
##     MAE, RMSE
## The following object is masked from 'package:purrr':
## 
##     lift
## -- Attaching packages ------------------------------------------------------------------------------- tidymodels 0.1.0 --
## v broom     0.5.4     v rsample   0.0.5
## v dials     0.0.4     v tune      0.0.1
## v infer     0.5.1     v workflows 0.1.0
## v parsnip   0.0.5     v yardstick 0.0.5
## v recipes   0.1.9
## -- Conflicts ---------------------------------------------------------------------------------- tidymodels_conflicts() --
## x scales::discard()      masks purrr::discard()
## x dplyr::filter()        masks stats::filter()
## x recipes::fixed()       masks stringr::fixed()
## x dplyr::lag()           masks stats::lag()
## x caret::lift()          masks purrr::lift()
## x dials::margin()        masks ggplot2::margin()
## x yardstick::precision() masks caret::precision()
## x yardstick::recall()    masks caret::recall()
## x yardstick::spec()      masks readr::spec()
## x recipes::step()        masks stats::step()
## x recipes::yj_trans()    masks scales::yj_trans()
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
## 
##     importance
## The following object is masked from 'package:dials':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
## 
##     explain

Read Data
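A minimal read step might look like this; the file name data-train.csv is an assumption, and only the resulting structure shown below comes from the report.

```r
# Read the training data; the file name is an assumption for illustration
data <- read_csv("data-train.csv") %>%
  mutate(id = as.factor(id), age = as.integer(age))

# Quick look at the structure
glimpse(data)
```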

## Observations: 825
## Variables: 10
## $ id          <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...

The data observations consist of the variables below:

id : ID of each cement mixture
cement : the amount of cement (kg) in a m3 mixture
slag : the amount of blast furnace slag (kg) in a m3 mixture
flyash : the amount of fly ash (kg) in a m3 mixture
water : the amount of water (kg) in a m3 mixture
super_plast : the amount of superplasticizer (kg) in a m3 mixture
coarse_agg : the amount of coarse aggregate (kg) in a m3 mixture
fine_agg : the amount of fine aggregate (kg) in a m3 mixture
age : the number of resting days before the compressive strength measurement
strength : concrete compressive strength measurement in MPa

Removing unnecessary column

Removing the id column and taking a short glance at the data.
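A sketch of that step:

```r
# Drop the id column -- it is only an identifier with no predictive value
data <- data %>% select(-id)

head(data)
```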

Data Summary and Structure

##      cement           slag            flyash           water      
##  Min.   :102.0   Min.   :  0.00   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :275.1   Median : 20.00   Median :  0.00   Median :184.0  
##  Mean   :280.9   Mean   : 73.18   Mean   : 54.03   Mean   :181.1  
##  3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.40   Max.   :200.10   Max.   :247.0  
##   super_plast       coarse_agg        fine_agg          age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0   1st Qu.:  7.00  
##  Median : 6.500   Median : 968.0   Median :780.1   Median : 28.00  
##  Mean   : 6.266   Mean   : 972.8   Mean   :775.6   Mean   : 45.14  
##  3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##     strength    
##  Min.   : 2.33  
##  1st Qu.:23.64  
##  Median :34.57  
##  Mean   :35.79  
##  3rd Qu.:45.94  
##  Max.   :82.60
## Observations: 825
## Variables: 9
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...

All of the predictor variables are already in numerical format. We can apply feature scaling, if necessary, in the cross-validation section to put all variables on the same scale.

Check missing values
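One way to check, assuming the cleaned data frame is called data:

```r
# TRUE would indicate at least one missing value anywhere in the data
anyNA(data)
```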

## [1] FALSE

Check Outliers
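The per-variable checks below were done with boxplots; a minimal sketch of one such check (here for cement) might be:

```r
# Boxplot-based outlier check for a single variable, e.g. cement;
# boxplot.stats() lists the values that fall outside the whiskers
boxplot(data$cement, main = "cement")
boxplot.stats(data$cement)$out
```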

  1. cement variable

We have no outliers for cement.

  2. slag variable

We have one outlier value.

  3. flyash variable

We have no outliers for flyash.

  4. water variable

We have three outliers, two on top and one at the bottom.

  5. super_plast variable

We have two outliers, both on top.

  6. coarse_agg variable

We have no outliers for coarse_agg.

  7. fine_agg variable

We have two outliers for fine_agg, one on top and one at the bottom.

  8. age variable

We have three outliers for age, all on top.

  9. strength variable

We have some outliers for strength.

Based on the boxplots of the dataset above, we find outliers in several columns: slag, water, super_plast, fine_agg, age, and strength. We can treat these columns either by removing or imputing the outliers. Before doing so, we also have to check the outliers in the submission test data, so that both datasets remain comparable.

## Observations: 205
## Variables: 9
## $ cement      <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5,...
## $ slag        <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237...
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water       <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0,...
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg  <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0,...
## $ fine_agg    <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0,...
## $ age         <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, ...
## $ strength    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

  1. slag variable in data test submission

  2. water variable in data test submission

  3. super_plast variable in data test submission

  4. fine_agg variable in data test submission

  5. age variable in data test submission

Specifically for the super_plast and age columns, we will not do any removal or imputation, because the same patterns appear in the submission test data. Keeping them allows us to predict under the same conditions in the new test set.

Detail of the outlier manipulation step:

The next step is to remove outliers from slag, water, fine_agg (top outliers only, because 594 kg of fine_agg is flagged as an outlier here but the same value appears in the submission data) and from the strength column. In total this removes only 2.55% of the observations in the dataset.
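A sketch of the filtering, assuming the cut-offs are the usual boxplot whiskers (the exact thresholds used in the report are not shown):

```r
# Whisker-style cut-offs (Q3 + 1.5*IQR and Q1 - 1.5*IQR)
upper_whisker <- function(x) quantile(x, 0.75) + 1.5 * IQR(x)
lower_whisker <- function(x) quantile(x, 0.25) - 1.5 * IQR(x)

data_clean <- data %>%
  filter(slag     <= upper_whisker(slag),
         water    <= upper_whisker(water), water >= lower_whisker(water),
         fine_agg <= upper_whisker(fine_agg),   # bottom kept: 594 also occurs in submission data
         strength <= upper_whisker(strength), strength >= lower_whisker(strength))

nrow(data_clean)   # 804 observations remain in the report
```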

## [1] 804

3. Data Exploration

Correlation Matrix
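A minimal sketch of the correlation plots; the report references both ggcorr and ggpairs from GGally:

```r
library(GGally)

# Correlation matrix of all numeric variables
ggcorr(data_clean, label = TRUE, label_round = 2)

# Scatterplot matrix with pairwise correlation coefficients
ggpairs(data_clean)
```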

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

There are some questions we have to answer in this case by exploring the relation between the target and the features:

Is strength positively correlated with age? The correlation between strength and age is 0.347: positive, but not strong. As mentioned in the introduction, the number of resting days before the measurement affects the concrete's compressive strength.

Do strength and cement have a strong correlation? The correlation between strength and cement is 0.49 (ggpairs) or 0.5 (ggcorr). In theory, the compressive strength of concrete is strongly influenced by the ratio of cement to water, so it makes sense that cement and water together influence the strength, rather than cement alone without water.

Does super_plast have a linear correlation with strength? Superplasticizer is also known as a water reducer. The amount of superplasticizer has a positive correlation with strength and a negative correlation with water. The correlation values can be seen in the chart above: 0.35 (ggpairs) and 0.4 (ggcorr).

4. Modelling

Cross Validation

For the next step, we divide the data into a training dataset (to develop models) and a testing dataset (to validate them). There are no specific rules for the split proportion. An 80:20 split of training vs testing data is the most commonly used, sometimes referred to as the [Pareto principle][1]. Traditionally we use 5-fold cross-validation to verify our algorithm, so 80:20 would be fine. In practice, the actual proportion depends on the real situation and the size of the dataset.

[1] : https://en.wikipedia.org/wiki/Pareto_principle

Specifically for this subject, we already have a training set (804 observations after outlier removal) and a submission test set (205 observations). Therefore we use a 75:25 split (around 600:200 observations) to divide the training set into training and testing/validation data, so that the validation set has roughly the same proportion as the submission data we will predict later.
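A sketch of the split using rsample; the seed and the exact call are not shown in the report, so row counts may differ slightly:

```r
set.seed(100)   # seed value chosen for illustration only

splitted <- initial_split(data_clean, prop = 0.75)
train    <- training(splitted)
test     <- testing(splitted)

dim(train)
dim(test)
```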

Model 1 | Linear Regression

Model Development
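A sketch of the fit, matching the call shown in the summary below:

```r
# Ordinary least squares with all predictors
model_lm <- lm(strength ~ ., data = train)
summary(model_lm)
```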

## 
## Call:
## lm(formula = strength ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5366  -6.5395   0.9119   7.3042  27.9477 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 16.657271  36.091891   0.462             0.644589    
## cement       0.105495   0.011465   9.201 < 0.0000000000000002 ***
## slag         0.090794   0.013244   6.856      0.0000000000176 ***
## flyash       0.075124   0.016319   4.603      0.0000050712854 ***
## water       -0.217980   0.057817  -3.770             0.000179 ***
## super_plast  0.249187   0.132828   1.876             0.061135 .  
## coarse_agg   0.008304   0.012317   0.674             0.500451    
## fine_agg     0.004161   0.014343   0.290             0.771831    
## age          0.121650   0.007089  17.161 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 603 degrees of freedom
## Multiple R-squared:  0.612,  Adjusted R-squared:  0.6068 
## F-statistic: 118.9 on 8 and 603 DF,  p-value: < 0.00000000000000022

## [1] 0.6115207

Predict

Evaluate
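A sketch of the evaluation, computing the Mean Absolute Error on both the testing and training data (object names follow the earlier sketches):

```r
# Predictions on the hold-out (test) and training data
pred_lm_test  <- predict(model_lm, newdata = test)
pred_lm_train <- predict(model_lm, newdata = train)

# Mean Absolute Error for each set
MAE(pred_lm_test,  test$strength)
MAE(pred_lm_train, train$strength)
```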

## [1] 8.301161
## [1] 7.701927

Based on the adjusted R-squared of about 0.61 and an MAE of around 8 on both the training (in-sample) and testing (out-of-sample) data, we conclude that the Linear Regression model does not perform well enough. We will compare it with other model approaches; in this case we will use Random Forest, plus trials with frameworks and engines such as tidymodels and ranger.

Model 2 | Random Forest
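A sketch of the caret training for the two resampling schemes summarized below (5-fold CV repeated 5 times and 10-fold CV repeated 10 times); seeds and tuning grids are assumptions:

```r
# 5-fold cross-validation repeated 5 times
ctrl_5r5 <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
rf_5r5   <- train(strength ~ ., data = train, method = "rf",
                  trControl = ctrl_5r5, importance = TRUE)

# 10-fold cross-validation repeated 10 times
ctrl_10r10 <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
rf_10r10   <- train(strength ~ ., data = train, method = "rf",
                    trControl = ctrl_10r10, importance = TRUE)

rf_5r5
rf_10r10
```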

## Random Forest 
## 
## 612 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 488, 489, 490, 490, 491, 491, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     6.094318  0.8879792  4.613207
##   5     5.453978  0.8957533  3.977163
##   8     5.525872  0.8899515  4.007702
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
## Random Forest 
## 
## 612 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 551, 549, 552, 550, 551, 550, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     5.805998  0.8983617  4.391536
##   5     5.207266  0.9050365  3.798351
##   8     5.280290  0.9001829  3.832484
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

We have an R-squared of 0.8964 and an MAE of 3.87.

Out-of-Bag (OOB) Error

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = ..1) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 25.45268
##                     % Var explained: 90.62
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = ..1) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 25.69966
##                     % Var explained: 90.53

Predict

##        3        7       25       28       29       31 
## 40.88507 39.68581 52.51470 44.97232 41.45121 42.59622
##        3        7       25       28       29       31 
## 40.74696 39.97782 52.76630 43.86445 41.59919 42.67330

Evaluate

## [1] 1.668849
## [1] 1.666181
## [1] 4.252441
## [1] 4.275926

To get the value of R-squared, we can look at the % Var explained in the summary of our final model, or calculate it manually as below.
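A sketch of the manual calculation, 1 - SSE/SST on the test set (model object names are the ones assumed earlier):

```r
# Manual R-squared: 1 - SSE / SST
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

r_squared(test$strength, predict(rf_5r5,   newdata = test))
r_squared(test$strength, predict(rf_10r10, newdata = test))
```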

## [1] 0.8746272
## [1] 0.8731011

This model is slightly overfit, and the R-squared on the test/validation data is slightly below expectation.

Model 3 | Tidy Models

Data preprocessing using Recipes
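A sketch of a preprocessing recipe. The exact steps are not shown in the report, but the later "Back Transform" step and the small OOB error suggest the variables (including the target) were rescaled, so centering and scaling are assumed here:

```r
# Recipe: center and scale every numeric column (assumed steps), prepped on train
rec <- recipe(strength ~ ., data = train) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()

train_prep <- juice(rec)                   # preprocessed training data
test_prep  <- bake(rec, new_data = test)   # same transformation applied to test
```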

3a. Tidy model using Random Forest Engine

Model Fitting using PARSNIP
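A sketch of the parsnip specification matching the printout below:

```r
# Random forest specification; hyperparameters taken from the printout below
model_spec_rf <- rand_forest(mtry = 2, trees = 1123, min_n = 3) %>%
  set_mode("regression") %>%
  set_engine("randomForest")

model_spec_rf
```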

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 1123
##   min_n = 3
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 1123
##   min_n = 3
## 
## Computational engine: randomForest

To fit the model, we have two options, both sketched below:

  1. Formula interface

  2. X-Y interface
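A sketch of both interfaces (data object names follow the recipe sketch above):

```r
# 1. Formula interface
model_rf_fit <- model_spec_rf %>%
  fit(strength ~ ., data = train_prep)

# 2. X-Y interface
model_rf_fit_xy <- model_spec_rf %>%
  fit_xy(x = select(train_prep, -strength), y = train_prep$strength)

model_rf_fit
```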

## parsnip model object
## 
## Fit time:  2.4s 
## 
## Call:
##  randomForest(x = as.data.frame(x), y = y, ntree = ~1123, mtry = ~2,      nodesize = ~3) 
##                Type of random forest: regression
##                      Number of trees: 1123
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.1114947
##                     % Var explained: 88.83

Predict

Evaluate

Back Transform

Evaluation Metrics

## [1] "data.frame"

With this model we do not get satisfactory results.

3b. Tidy model using Ranger Engine

Parsnip
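A sketch of the ranger specification and fit matching the printouts below:

```r
# Random forest specification with the ranger engine; values from the printout
model_spec_ranger <- rand_forest(mtry = 5, trees = 555, min_n = 5) %>%
  set_mode("regression") %>%
  set_engine("ranger", seed = 555)

model_ranger_fit <- model_spec_ranger %>%
  fit(strength ~ ., data = train_prep)

model_ranger_fit
```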

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 5
##   trees = 555
##   min_n = 5
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 5
##   trees = 555
##   min_n = 5
## 
## Engine-Specific Arguments:
##   seed = 555
## 
## Computational engine: ranger
## parsnip model object
## 
## Fit time:  801ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~555,      min.node.size = ~5, seed = ~555, num.threads = 1, verbose = FALSE) 
## 
## Type:                             Regression 
## Number of trees:                  555 
## Sample size:                      612 
## Number of independent variables:  8 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.09082269 
## R squared (OOB):                  0.9091773

using fit_xy()

## parsnip model object
## 
## Fit time:  750ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~555,      min.node.size = ~5, seed = ~555, num.threads = 1, verbose = FALSE) 
## 
## Type:                             Regression 
## Number of trees:                  555 
## Sample size:                      612 
## Number of independent variables:  8 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.09082269 
## R squared (OOB):                  0.9091773

Predict

Evaluate

Evaluation Metrics

## [1] "data.frame"

5. Conclusion

Best Model

Random Forest

We have an R-squared of 0.8964 (almost 90%) and an MAE of 3.87 with the Random Forest model (10-fold CV, repeated 10 times).

## [1] 4.275926

We will apply this model to predict the strength variable in the submission data.

Data joining with Id

## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   x = col_double()
## )

Notes: The goal is achieved, with results only slightly below expectation for R-squared on the validation data. Machine learning can solve the problem of predicting concrete strength, with Random Forest as the best model. The performance is measured by an MAE (Mean Absolute Error) below 4.0 and an R-squared of about 0.90 in prediction and almost 0.90 in validation. Based on the predictions, the potential best mixtures for concrete strength are found in IDs S858 and S859, which is the business implication of this work.

6. LIME Interpretation

Building explainer with train data

Explaining selected samples
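A sketch of the LIME workflow, assuming the caret Random Forest model (rf_10r10 from the earlier sketch) and two features per explanation, as stated in the conclusion below:

```r
# Build the explainer on the training predictors
explainer <- lime(select(train, -strength), rf_10r10)

# Explain a few selected test observations using the 2 most important features
explanation <- explain(select(test, -strength)[1:4, ], explainer, n_features = 2)

plot_features(explanation)
```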

LIME Conclusion: With LIME we do not need to scale back the data, because the explanations already contain the original, unscaled values. We use 2 features to find the 2 materials that contribute most to strength. With LIME we can break the prediction down and see which components contribute the most, and which mixtures look the most promising.

The difference in interpretation: with the black-box model, we get accuracy and optimize by minimizing errors and fitting the model; with LIME, we can break down which variables contribute, and how their correlations and combinations impact the target in the model.

From the four observations in the plot, the explanation fit tells us that cement and super_plast are the most influential materials contributing to strength, along with age.