Domain Specific Knowledge

Research into properties of concrete

Data Exploration

Distribution of Variables

Examining the univariate distribution of the variables

Summary of Data Distribution

##      Cement      BlastFurnaceSlag     FlyAsh           Water      
##  Min.   :102.0   Min.   :  0.0    Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0    1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0    Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9    Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9    3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4    Max.   :200.10   Max.   :247.0  
##  Superplasticizer CoarseAggregate  FineAggregate        Age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##  ConcreteStrength
##  Min.   : 2.33   
##  1st Qu.:23.71   
##  Median :34.45   
##  Mean   :35.82   
##  3rd Qu.:46.13   
##  Max.   :82.60

Univariate Distributions

Intercorrelation of Predictor variables

Correlation Heat Map

  • Cement has the strongest correlation with strength.
  • FlyAsh has the weakest correlation with strength, but at -0.106 it is perhaps not insignificant.
  • The strongest correlation between predictors is -0.66, between Superplasticizer and Water.
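
The correlations read off the heat map can be computed with base R's `cor()`. The sketch below is illustrative only: the data frame `toy` holds a few made-up rows rather than the real data set, and its column names are assumed to match those in the summary table.

```r
# Hypothetical toy rows; the real analysis would use the full concrete data frame.
toy <- data.frame(
  Cement           = c(540, 332, 198, 266, 380, 475),
  Water            = c(162, 228, 192, 228, 228, 181),
  Superplasticizer = c(2.5, 0.0, 0.0, 0.0, 0.0, 8.9),
  Age              = c(28, 270, 360, 90, 28, 7),
  ConcreteStrength = c(79.99, 40.27, 44.30, 47.03, 36.45, 42.60)
)

# Pearson correlation matrix of every column against every other.
corr <- cor(toy)

# Correlation of each predictor with the response, largest magnitude first.
with_strength <- corr[colnames(corr) != "ConcreteStrength", "ConcreteStrength"]
print(sort(abs(with_strength), decreasing = TRUE))

# Strongest correlation among the predictors only (response row and column dropped).
pred <- corr[names(with_strength), names(with_strength)]
pred[upper.tri(pred, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
idx <- which(abs(pred) == max(abs(pred), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(pred)[idx[1, "row"]], colnames(pred)[idx[1, "col"]])
print(strongest)
```

Because the rows are invented, the numbers printed here will not match the values quoted in the bullets; only the procedure carries over.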

Let's explore the interrelationships of the most correlated variables.

Domain research suggests that three relationships are relevant: water to cement, coarse to fine aggregate, and total aggregate to cement.
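
One way to make these relationships available to a model is to encode them as ratio features. The following is a minimal sketch under stated assumptions: the data frame `mix` and its values are hypothetical, and the new column names are invented for illustration. The variable importance tables in this report list terms such as `Cement:Water`, which in an R formula denote interactions (products) rather than ratios, so the ratios here are an alternative encoding, not necessarily what the models used.

```r
# Hypothetical mix designs; quantities are illustrative only.
mix <- data.frame(
  Cement          = c(300, 400),
  Water           = c(180, 160),
  CoarseAggregate = c(1000, 950),
  FineAggregate   = c(800, 750)
)

# Ratios suggested by the domain research.
mix$WaterCementRatio     <- mix$Water / mix$Cement
mix$CoarseFineRatio      <- mix$CoarseAggregate / mix$FineAggregate
mix$AggregateCementRatio <- (mix$CoarseAggregate + mix$FineAggregate) / mix$Cement

print(mix)
```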

Model Fitting

When fitting our model we can consider the predictive value of the interrelated variables.

We used the caret package to perform cross-validated training and testing of various models. We set up a training control object configured for 5-fold cross-validation repeated 3 times.
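
In caret this resampling scheme is typically requested with `trainControl(method = "repeatedcv", number = 5, repeats = 3)`. To show what that scheme does, here is a base R sketch of the same idea. It is not the report's code: the data frame `toy`, the coefficients used to simulate it, and the seed are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the fold assignment is reproducible

# Hypothetical stand-in data: strength depends on cement, water and age plus noise.
n <- 200
toy <- data.frame(
  Cement = runif(n, 100, 540),
  Water  = runif(n, 120, 250),
  Age    = sample(c(1, 3, 7, 28, 56, 90), n, replace = TRUE)
)
toy$ConcreteStrength <- 0.1 * toy$Cement - 0.15 * toy$Water +
  8 * log(toy$Age) + rnorm(n, sd = 5)

repeats   <- 3
folds     <- 5
fold_rmse <- numeric(0)

for (r in seq_len(repeats)) {
  # Randomly assign every row to one of the folds, afresh for each repeat.
  fold_id <- sample(rep(seq_len(folds), length.out = n))
  for (k in seq_len(folds)) {
    train <- toy[fold_id != k, ]   # fit on four folds
    test  <- toy[fold_id == k, ]   # evaluate on the held-out fold
    fit   <- lm(ConcreteStrength ~ Cement + Water + Age, data = train)
    pred  <- predict(fit, newdata = test)
    fold_rmse <- c(fold_rmse, sqrt(mean((test$ConcreteStrength - pred)^2)))
  }
}

length(fold_rmse)  # 15 held-out estimates: 3 repeats x 5 folds
mean(fold_rmse)    # averaged resampling estimate of RMSE
```

Repeating the split reduces the dependence of the estimate on any single random partition of the rows.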

Linear Model

## RMSE: 9.79
## R-squared: 0.66
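
The two metrics reported for each model, RMSE and R-squared, can be computed as sketched below. The helper names `calc_rmse` and `calc_r2` and the example vectors are hypothetical. R-squared is taken here as the squared correlation between observed and predicted values, which is one common definition; the report does not state which definition was used.

```r
# Root mean squared error: typical size of a prediction error,
# expressed in the units of the response.
calc_rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared as the squared Pearson correlation between observed and predicted.
calc_r2 <- function(obs, pred) cor(obs, pred)^2

# Hypothetical observed and predicted strengths, for illustration only.
obs  <- c(20, 35, 50, 65)
pred <- c(22, 33, 52, 63)

calc_rmse(obs, pred)  # 2, since every error is +2 or -2
calc_r2(obs, pred)
```

RMSE is on the scale of concrete strength, so lower is better, while R-squared lies between 0 and 1, so higher is better.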

Variable importances

## lm variable importance
## 
##                          Overall
## Age                      100.000
## Cement                    84.877
## `Water:Age`               80.779
## BlastFurnaceSlag          60.830
## `Water:Superplasticizer`  34.919
## Superplasticizer          29.981
## `Water:FineAggregate`     24.496
## FineAggregate             22.504
## Water                     15.397
## `Cement:FlyAsh`           11.817
## FlyAsh                     7.422
## CoarseAggregate            0.000

Linear Model Variable Importances

GLM Model

## RMSE: 9.88
## R-squared: 0.65

Variable importances

## glm variable importance
## 
##                           Overall
## Cement                   100.0000
## Age                       94.5583
## `Water:Age`               75.0802
## BlastFurnaceSlag          65.3983
## FlyAsh                    32.0394
## `Water:Superplasticizer`  25.0380
## Water                     23.2030
## Superplasticizer          19.3032
## CoarseAggregate            0.1565
## FineAggregate              0.0000

Generalized Linear Model Variable Importances

GAM Model

## RMSE: 5.67
## R-squared: 0.89

Random Forest Model

## RMSE: 4.89
## R-squared: 0.92

Variable importances

## ranger variable importance
## 
##                   Overall
## Age              100.0000
## Cement            81.2197
## Water             28.4932
## BlastFurnaceSlag  12.7374
## Superplasticizer  12.3864
## FineAggregate      4.0920
## CoarseAggregate    0.6761
## FlyAsh             0.0000

Random Forest Model Variable Importances

Gradient Boosted Machine (GBM) Model

## RMSE: 4.1
## R-squared: 0.94

Variable importances

## gbm variable importance
## 
##                               Overall
## Water:Age                     100.000
## Cement                         74.993
## Water                          43.239
## Age                            24.298
## BlastFurnaceSlag               24.178
## Cement:CoarseAggregate         22.545
## CoarseAggregate:FineAggregate  13.579
## Superplasticizer               11.620
## Water:Superplasticizer          8.878
## FineAggregate                   6.223
## CoarseAggregate                 4.373
## FlyAsh                          2.590
## Cement:Water                    0.000

Gradient Boosted Machine Model Variable Importances

XG Boost Model

## RMSE: 4.19
## R-squared: 0.94

Variable importances

## xgbTree variable importance
## 
##                               Overall
## Water:Age                     100.000
## Cement                         85.066
## Age                            72.045
## Water                          45.502
## Cement:CoarseAggregate         39.804
## Superplasticizer               17.583
## CoarseAggregate:FineAggregate  14.463
## BlastFurnaceSlag               13.721
## Water:Superplasticizer          7.250
## FineAggregate                   4.301
## CoarseAggregate                 1.452
## Cement:Water                    1.166
## FlyAsh                          0.000

XG Boost Model Variable Importances

Get Predictions

Linear Model Predictions Plot

GLM Predictions Plot

GAM Predictions Plot

GBM Predictions Plot

Random Forest Predictions Plot

XG Boost Predictions Plot

Model Performances Compared

Model name                           Method  rsquared   RMSE
Linear Model                         lm      0.6571239  9.793557
Generalized Linear Model (GLM)       glm     0.6515166  9.880447
Generalized Additive Model (GAM)     gam     0.8859791  5.673155
Gradient Boosted Machine             gbm     0.9399769  4.128572
Random Forest                        rf      0.9180005  5.068819
Extreme Gradient Boosting (XGBoost)  xgb     0.9370400  4.246084
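
To make the ranking explicit, the figures from the comparison table above can be placed in a data frame and sorted. This is a small sketch: the object names `results` and `ranked` are hypothetical, and the numbers are copied from the table rather than recomputed.

```r
# Values copied from the comparison table above.
results <- data.frame(
  method   = c("lm", "glm", "gam", "gbm", "rf", "xgb"),
  rsquared = c(0.6571239, 0.6515166, 0.8859791, 0.9399769, 0.9180005, 0.9370400),
  RMSE     = c(9.793557, 9.880447, 5.673155, 4.128572, 5.068819, 4.246084),
  stringsAsFactors = FALSE
)

# Rank models from lowest to highest RMSE.
ranked <- results[order(results$RMSE), ]
print(ranked, row.names = FALSE)

ranked$method[1]  # "gbm"
```

On these figures the two boosted models lead, followed by the random forest and the GAM, with the linear model and GLM well behind.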

Random Forest Model Variable Importances

XG Boost Model Variable Importances

Closing Discussion

  • Age
  • Water content
  • The interaction between Water and Age
  • Cement content

These factors are by far the most important for predicting concrete strength.

Age has strong predictive power, yet our data suggest that age measurements were sometimes taken at wide time intervals. Given this, and the unreasonable effectiveness of data, we see collecting more frequent age measurements as the single biggest improvement that could be made to the predictive accuracy of our models.