We opted for a Concrete Compressive Strength problem from the UC Irvine Machine Learning Repository:
https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength
We picked a regression problem because the first three assignments used classification problems. After looking through ~50 datasets we selected this one because it was interesting to learn about something new, and we weren't prepared to spend the significantly longer time a sufficient domain dive into any of the bioinformatics problems would have required.
Here we load the data and provide simpler names for the predictors and target variable.
library(tidyverse)
library(readxl)
library(randomForest)
library(caret)
library(pdp)
library(iml)
df0 <- read_excel("~/Documents/D622/HW4/Concrete_Data.xls")
colnames(df0) <- c("Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age", "MPa")
Cement and Water are combined to make a paste, which is then mixed with CoarseAggregate (gravel) and FineAggregate (sand); the paste binds the aggregates together and hardens into concrete.
BlastFurnaceSlag, a by-product of the steel industry, contributes to a lighter-colored, smoother, slower-hardening, more workable concrete.
FlyAsh, a by-product of coal power plants, contributes silica to concrete and can replace a portion of the cement for slower-acting hardening, which is important for large-scale projects.
Once poured, concrete needs to cure over time to harden fully. The first 28 days of Age cover the initial hydration, where the cement and water combine chemically to create calcium silicate hydrates. Slag or fly ash in the mixture undergoes the same reaction but more slowly: most of the hardening occurs in the first year, though in ideal conditions concrete will continue to harden for years.
Superplasticizer is added to concrete to make it easier to pour without needing as much water, allowing for a lower water-to-cement ratio and therefore a harder concrete.
We are measuring concrete compressive strength in megapascals, or MPa.
glimpse(df0)
## Rows: 1,030
## Columns: 9
## $ Cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 266.0, 380.0, 380.…
## $ BlastFurnaceSlag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 114.0, 95.0, 95.0, 114…
## $ FlyAsh <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 228, 228, 192…
## $ Superplasticizer <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ CoarseAggregate <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 93…
## $ FineAggregate <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 670.0, 594.0, 594.…
## $ Age <dbl> 28, 28, 270, 365, 360, 90, 365, 28, 28, 28, 90, 28, 2…
## $ MPa <dbl> 79.986111, 61.887366, 40.269535, 41.052780, 44.296075…
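Age takes only a handful of discrete test ages, which is worth knowing before treating it as continuous. A quick check of our own (output not shown):
# Which curing ages were actually tested?
sort(unique(df0$Age))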
Here is some optional, additional background on concrete.
The water-to-cement ratio impacts the resulting concrete. A low w/c ratio like 0.4 will be stronger, with fewer, smaller pores; a high w/c ratio like 0.6 will have greater porosity, weakening the concrete.
Cement is calcium oxide and silicates; water combines with it in a heat-producing chemical reaction that results in the durable, binding calcium silicate hydrates (CSH) as well as the weaker, chemically reactive calcium hydroxide. The benefit of adding slag and fly ash to the mixture is that they can react with that weak calcium hydroxide to make more of the durable, binding calcium silicate hydrates.
Consistent temperature and hydration contribute to proper curing, which means improper curing of some data points could introduce noise into our data. The chemical reaction that produces calcium silicate hydrate also produces heat (called the heat of hydration), and in a large infrastructure project like a dam, that heat could cause the curing concrete to develop a temperature differential between the hot inside and cool outside, resulting in cracks. Using the slower reaction of slag, or replacing some of the fast-acting cement with fly ash, slows the hardening reaction and reduces the heat of hydration to where it can dissipate without cracking the concrete.
Note, slag produces a lighter-colored concrete, which is valued in modern architecture. And fly ash, rich in silica and alumina, in addition to slowing the heat of hydration, can replace some of the cement, further reducing the heat of hydration.
The aggregates provide bulk to the concrete. Fine aggregate (sand) increases the ability of the mixture to be pumped, poured and worked into the final shape, but too much sand in the mixture decreases strength.
Coarse aggregate (gravel) helps with load distribution and with locking the concrete mixture together, but if the gravel is too large or not evenly graded, it can lead to weaker concrete due to uneven stress distribution. Poorly graded coarse aggregates could be uniformly graded, where they are all the same size, or gap-graded, where they are missing a band of sizes. An even gradation of sizes increases packing efficiency, whereas with poor grading the amount of cement paste between the aggregates won't be as uniform.
So besides strength we are also looking at cost, workability, coloration and smoothness.
Also note, we don't have any information about the quality of the coarse aggregates, so we are assuming they are evenly graded and consistent across all of our samples. The ratio of sand to rocks may also be important as a constructed variable; see the sketch after this section.
Lastly, superplasticizers are high-range water-reducing admixtures (HRWRAs) that improve workability without needing to increase water content. This makes it possible to pour high-strength, low water-to-cement-ratio concrete, which is important for complex structural elements like buildings with lots of steel rebar that the concrete has to flow around and adhere to.
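As a quick, exploratory sketch of the two constructed variables floated above (the names WCRatio and RockSandRatio are our own; output not shown), we can look at their raw correlations with strength:
# Our own exploratory check: correlate the constructed ratios with MPa.
df0 %>%
  mutate(WCRatio = Water / Cement,
         RockSandRatio = CoarseAggregate / FineAggregate) %>%
  summarise(cor_wc = cor(WCRatio, MPa),
            cor_rock_sand = cor(RockSandRatio, MPa))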
Note some of the concrete doesn't have slag, ash or plasticizers. All units, except Age (days) and MPa (megapascals of compressive strength, or hardness), are in kilograms per cubic meter of mixture. Note the aggregates make up the greatest bulk of the concrete.
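A quick count of our own (output not shown) of how many mixtures omit each optional ingredient entirely:
# How many mixtures contain none of each optional ingredient?
colSums(df0[, c("BlastFurnaceSlag", "FlyAsh", "Superplasticizer")] == 0)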
summary(df0)
## Cement BlastFurnaceSlag FlyAsh Water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.27 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## Superplasticizer CoarseAggregate FineAggregate Age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.350 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.203 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.160 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## MPa
## Min. : 2.332
## 1st Qu.:23.707
## Median :34.443
## Mean :35.818
## 3rd Qu.:46.136
## Max. :82.599
There are no missing values.
colSums(is.na(df0))
## Cement BlastFurnaceSlag FlyAsh Water
## 0 0 0 0
## Superplasticizer CoarseAggregate FineAggregate Age
## 0 0 0 0
## MPa
## 0
Interesting - more cement means more strength. More superplasticizer also means more strength (MPa), but that's probably because it allows less water to be used while the mixture can still flow through pipes and be worked into place. This is supported by the negative correlation between superplasticizer and water.
The top three positive correlations with strength: cement, which makes the concrete mixture stronger; superplasticizer, which makes it stronger probably because it allows for less water; and age, because the concrete needs to hydrate to form those strong calcium silicate hydrates.
The top three negative correlations: fly ash and cement, because fly ash can be used in place of cement for slower binding reactions; superplasticizer and water, because superplasticizer replaces water; and water and fine aggregate, because more sand allows the concrete mixture to flow better, so you wouldn't need as much water for the same workability.
# Correlation matrix plot
correlations <- cor(df0[, sapply(df0, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
Our Variance Inflation Factors (VIF) indicate no severe multicollinearity in our data: all values fall below the common threshold of 10. If Cement and Water were strongly multicollinear we would want to replace them with the water-to-cement ratio identified as critical in our domain dive, which we may still do later to compare results.
lm_model <- lm(MPa ~ ., data = df0)
vif_values <- car::vif(lm_model)
vif_values
## Cement BlastFurnaceSlag FlyAsh Water
## 7.488657 7.276529 6.171455 7.004663
## Superplasticizer CoarseAggregate FineAggregate Age
## 2.965297 5.076044 7.005346 1.118357
The residuals range from -28.653 to 34.446 with a median of 0.704, which is good because it is close to zero. We didn't normalize our data first, so it's hard to judge whether the residual range is wide, but it seems wide compared to the magnitude of the coefficients, so maybe there are non-linear relationships at play, or different classes of concrete produced by different ratios in the recipe. This could point to a clustering exercise.
CoarseAggregate and FineAggregate are barely significant, so we can check whether they show importance when analyzed non-linearly, or we could rerun with them combined.
The model explains 61.55% of the variance in MPa. That's a decent fit, but relatively low for a physical phenomenon, which strongly suggests this isn't a linear problem.
For completeness, we mention that the p-value of the F-statistic is virtually zero, so the model is significant.
summary(lm_model)
##
## Call:
## lm(formula = MPa ~ ., data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.653 -6.303 0.704 6.562 34.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.163756 26.588421 -0.871 0.383851
## Cement 0.119785 0.008489 14.110 < 2e-16 ***
## BlastFurnaceSlag 0.103847 0.010136 10.245 < 2e-16 ***
## FlyAsh 0.087943 0.012585 6.988 5.03e-12 ***
## Water -0.150298 0.040179 -3.741 0.000194 ***
## Superplasticizer 0.290687 0.093460 3.110 0.001921 **
## CoarseAggregate 0.018030 0.009394 1.919 0.055227 .
## FineAggregate 0.020154 0.010703 1.883 0.059968 .
## Age 0.114226 0.005427 21.046 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.4 on 1021 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.6125
## F-statistic: 204.3 on 8 and 1021 DF, p-value: < 2.2e-16
We check the Q-Q plot and the residuals look nearly perfectly normally distributed.
The Residuals vs. Fitted plot looks a little like a football: big in the middle, pointy at the ends. This seems to say that at the low and high ends of hardness the relationships are roughly linear, but in the middle, exactly how the ratios come together to produce hardness is non-linear.
Also, maybe some of the additives can only help so much, so doubling an additive that helps won't do anything if the concrete mixture is already fully saturated with it; see the quick check after the diagnostic plots.
plot(lm_model)
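As a quick, exploratory look at that saturation idea (our own sketch; the loess smoother is an arbitrary choice, output not shown):
# Does superplasticizer's benefit level off at higher doses?
ggplot(df0, aes(Superplasticizer, MPa)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess")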
Taking the log of MPa made the model worse, dropping the R-squared from 61.55% to 55.62%, and we should have anticipated that because the residuals from the untransformed model were already normally distributed.
lm_model_log <- lm(log(MPa) ~ ., data = df0)
summary(lm_model_log)
##
## Call:
## lm(formula = log(MPa) ~ ., data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75973 -0.22786 0.08013 0.27953 0.87255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7495785 0.9426556 1.856 0.06374 .
## Cement 0.0037070 0.0003010 12.316 < 2e-16 ***
## BlastFurnaceSlag 0.0030350 0.0003594 8.445 < 2e-16 ***
## FlyAsh 0.0031324 0.0004462 7.020 4.03e-12 ***
## Water -0.0040572 0.0014245 -2.848 0.00449 **
## Superplasticizer 0.0096986 0.0033135 2.927 0.00350 **
## CoarseAggregate 0.0004878 0.0003331 1.465 0.14331
## FineAggregate 0.0003758 0.0003794 0.990 0.32227
## Age 0.0037939 0.0001924 19.717 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3687 on 1021 degrees of freedom
## Multiple R-squared: 0.5562, Adjusted R-squared: 0.5527
## F-statistic: 159.9 on 8 and 1021 DF, p-value: < 2.2e-16
Here I tried adding the water-to-cement ratio and three versions of a rocks-to-sand interaction; only the ratio version is shown below.
So, the p-value of the rocks/sand ratio was 0.000118, which means it's not just luck. It makes sense, too: efficient packing requires an even gradient of aggregate sizes, and if there's too much coarse aggregate the compressive strength goes down (the ratio's coefficient is -52.64, t = -3.864) because you need enough sand to fill the gaps between the bigger rocks. But it's curious, no, that rocks/sand works but sand/rocks doesn't?
The water-to-cement ratio didn't work, but we can try a non-linear algorithm with and without the w/c ratio to see if it makes a difference; a sketch follows the regression output.
lm_model_interactions <- lm(MPa ~ . + I(Water / Cement) + I(CoarseAggregate / FineAggregate), data = df0)
summary(lm_model_interactions)
##
## Call:
## lm(formula = MPa ~ . + I(Water/Cement) + I(CoarseAggregate/FineAggregate),
## data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.412 -6.088 0.706 6.870 33.387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.678861 30.633648 1.426 0.154219
## Cement 0.113393 0.011213 10.113 < 2e-16 ***
## BlastFurnaceSlag 0.107115 0.010075 10.632 < 2e-16 ***
## FlyAsh 0.085587 0.012478 6.859 1.20e-11 ***
## Water -0.127100 0.041016 -3.099 0.001996 **
## Superplasticizer 0.324821 0.093033 3.491 0.000501 ***
## CoarseAggregate 0.087708 0.020682 4.241 2.43e-05 ***
## FineAggregate -0.066855 0.024924 -2.682 0.007429 **
## Age 0.116149 0.005438 21.358 < 2e-16 ***
## I(Water/Cement) -4.149785 2.808787 -1.477 0.139870
## I(CoarseAggregate/FineAggregate) -52.638459 13.621819 -3.864 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.3 on 1019 degrees of freedom
## Multiple R-squared: 0.6238, Adjusted R-squared: 0.6201
## F-statistic: 169 on 10 and 1019 DF, p-value: < 2.2e-16
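Following up the idea above, here is a minimal sketch of our own (not one of the tuned runs below) of feeding the w/c ratio to a non-linear learner:
# Hedged sketch: does adding the w/c ratio help a random forest?
set.seed(175328)
rf_wc <- randomForest(MPa ~ . + I(Water / Cement), data = df0)
tail(rf_wc$rsq, 1)  # compare with the plain-predictor OOB R^2 below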
So we tried random forest, and we're getting a significantly better R2 (the variance explained by our Random Forest model) at 90.97%. This gives us more confidence that concrete compressive strength is a non-linear process; however, we still need to check that our model isn't overfit.
set.seed(175328)
rf_model <- randomForest(MPa ~ ., data = df0)
tail(rf_model$rsq, 1)
## [1] 0.9097463
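The rsq reported by randomForest is already an out-of-bag estimate, but as an extra overfitting check here is a hedged sketch of an 80/20 holdout evaluation (the split fraction and seed are our own choices; output not shown):
# Hold out 20% of rows and compute R^2 on the unseen data.
set.seed(175328)
idx <- sample(nrow(df0), size = floor(0.8 * nrow(df0)))
rf_holdout <- randomForest(MPa ~ ., data = df0[idx, ])
preds <- predict(rf_holdout, newdata = df0[-idx, ])
1 - sum((df0$MPa[-idx] - preds)^2) /
  sum((df0$MPa[-idx] - mean(df0$MPa[-idx]))^2)  # holdout R^2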
While our Random Forest model predicts the hardness of the different cement mixtures much better than the linear regression model, we're having a hard time demonstrating those non-linear relationships in the data. What we can see below is that age drives the hardening process and that the amount of cement is the primary driver of how hard the concrete gets. This makes sense because cement mixed with water is what binds the aggregates together, whereas slag and ash influence the characteristics of the concrete.
importance(rf_model)
## IncNodePurity
## Cement 57576.61
## BlastFurnaceSlag 16939.18
## FlyAsh 13947.27
## Water 34636.06
## Superplasticizer 25293.90
## CoarseAggregate 17900.76
## FineAggregate 19024.02
## Age 86024.57
varImpPlot(rf_model)
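IncNodePurity can favor predictors with many distinct values, so as a hedged alternative here is a sketch of our own using permutation importance (%IncMSE), which requires refitting with importance = TRUE (output not shown):
# Permutation importance: how much does MSE rise when a predictor is shuffled?
set.seed(175328)
rf_perm <- randomForest(MPa ~ ., data = df0, importance = TRUE)
importance(rf_perm, type = 1)  # type 1 = %IncMSE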
Here we use 10-fold cross-validation to check for stability in the Random Forest model and that it’s not overfit.
We find that 92.08% of our variance is accounted for by this model.
Note that the ideal mtry was 5; this means that at each split, five of the eight predictors were allowed to be considered. However, only mtry values of 2, 5 and 8 were tried, so we'll look at values between 2 and 6 next for the ideal mtry.
set.seed(175328)
rf_cv <- train(MPa ~ ., data = df0, method = "rf", trControl = trainControl(method = "cv", number = 10))
print(rf_cv)
## Random Forest
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.212192 0.9169188 3.837329
## 5 4.737032 0.9208419 3.316702
## 8 4.810363 0.9174042 3.333973
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
We're showing the ideal number of predictors to consider at each split in the random forest is five, essentially tied with four. This means that it's more than just simple two-predictor interactions that are driving hardness.
This is consistent with our domain dive, where the interplay of water, cement, plasticizers, an even gradient of aggregates, and the addition of silicates and alumina from fly ash and slag all work together to produce hardness.
# Define a grid of mtry values to test
tune_grid <- expand.grid(mtry = c(2, 3, 4, 5, 6))
# Perform 10-fold cross-validation with the specified mtry values
rf_cv <- train(
MPa ~ .,
data = df0,
method = "rf",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = tune_grid
)
# Print the results
print(rf_cv)
## Random Forest
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 926, 927, 926, 927, 928, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.133948 0.9221144 3.800211
## 3 4.724437 0.9279820 3.398484
## 4 4.605343 0.9287047 3.261051
## 5 4.603553 0.9271294 3.241732
## 6 4.629922 0.9256030 3.248778
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
And here's the plot of RMSE, with the lowest value at an mtry of 5, essentially tied with 4.
plot(rf_cv)
It's not strictly necessary, but since we had so much trouble implementing SVM for our previous, category-heavy problem, we attempted it here.
It performed perfectly adequately, and easily, compared to the difficulties SVM had with the prior dataset. This is probably because all of the predictors in this dataset are numeric, which is what SVM is naturally suited for.
However, the best Rsquared, 88.89%, does not exceed what we achieved with Random Forest at 92.08%.
I deleted it for brevity, but when I ran the same model after first scaling the data, the Rsquared was lower, at 72.36%.
tune_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.01, 0.1, 1))
train_control <- trainControl(method = "cv", number = 10)
set.seed(175328)
svm_model <- train(
MPa ~ .,
data = df0,
method = "svmRadial",
trControl = train_control,
tuneGrid = tune_grid)
print(svm_model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results across tuning parameters:
##
## C sigma RMSE Rsquared MAE
## 0.1 0.01 11.270178 0.6240059 9.117620
## 0.1 0.10 8.814860 0.7419568 6.807329
## 0.1 1.00 12.142112 0.5966044 9.169449
## 1.0 0.01 8.430162 0.7485941 6.570599
## 1.0 0.10 6.573446 0.8466288 4.817187
## 1.0 1.00 6.933983 0.8389037 4.833237
## 10.0 0.01 7.590229 0.7936221 5.744932
## 10.0 0.10 5.556466 0.8888635 3.985473
## 10.0 1.00 6.081161 0.8683396 4.050270
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 10.
Here's the visualization of the parameter tuning. We can see the lowest cross-validated RMSE is with a cost of 10 and a sigma of 0.1.
plot(svm_model)
It looks like SVM is great for numerical prediction but hard to interpret. An RBF-kernel SVM has no native feature-importance function, though caret can fall back to a model-free filter measure; see the sketch after the plot. However, we can produce these Partial Dependence Plots, which start to show some of the two-term non-linear relationships.
Here we can visualize how high levels of cement and low levels of water are required to produce the hardest concrete. The only way to use that little water would be with high sand, high plasticizer and high slag, which all improve workability.
pdp_multiple <- partial(svm_model, pred.var = c("Water", "Cement"), train = df0)
plotPartial(pdp_multiple, main = "Partial Dependence of Water and Cement")
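As mentioned above, caret can still rank predictors for the SVM: when a model exposes no native importance, varImp() falls back to a model-free filter that scores each predictor against MPa. A hedged sketch of our own (output not shown):
# caret falls back to filterVarImp() when a model has no native importance.
svm_imp <- varImp(svm_model)
plot(svm_imp, main = "Filter-based importance for the SVM")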
Here we ran XGBoost on the problem. Initially we used a tune grid to fit the model with two-fold cross-validation, then extracted the best-performing parameters using print(xgb_model$bestTune) to run the model one more time with 10-fold cross-validation.
So far this is the best performance, with an Rsquared of 94.36%, beating out the next highest, Random Forest, at 92.08%.
#tune_grid <- expand.grid(
# nrounds = c(100, 200), # Number of boosting rounds
# max_depth = c(3, 6, 9), # Maximum tree depth
# eta = c(0.01, 0.1, 0.3), # Learning rate
# gamma = c(0, 1), # Minimum loss reduction
# colsample_bytree = c(0.7, 1), # Subsample ratio of columns
# min_child_weight = c(1, 5), # Minimum sum of weights in a child node
# subsample = c(0.7, 1) # Subsample ratio of the training data
#)
tune_grid <- expand.grid(
nrounds = 200, # Number of boosting rounds
max_depth = 6, # Maximum tree depth
eta = 0.1, # Learning rate
gamma = 0, # Minimum loss reduction
colsample_bytree = 1, # Subsample ratio of columns
min_child_weight = 5, # Minimum sum of weights in a child node
subsample = 0.7 # Subsample ratio of the training data
)
#train_control <- trainControl(method = "cv", number = 2)
train_control <- trainControl(method = "cv", number = 10)
set.seed(175328)
xgb_model <- train(
MPa ~ .,
data = df0,
method = "xgbTree",
trControl = train_control,
tuneGrid = tune_grid
)
#print(xgb_model$bestTune)
print(xgb_model)
## eXtreme Gradient Boosting
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.891582 0.943551 2.490643
##
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning parameter 'min_child_weight' was held constant at a value of 5
## Tuning parameter 'subsample' was held constant at a value of 0.7
Here we don't see much new. Again, feature importance doesn't tell us as much about two-term non-linear relationships as partial dependence plots do.
importance <- varImp(xgb_model)
plot(importance, main = "Feature Importance")
Here we used a partial dependence plot to compare Age against BlastFurnaceSlag.
Slag provides molecules that convert more of the weak calcium hydroxide into durable calcium silicate hydrate, but that process occurs primarily over the first 365 days, instead of the first 28 days as with cement and water.
And so we should see concrete with a higher slag content be harder at ages past the initial 28 days.
And we have confirmation: among the mixtures with a high amount of slag, we only see the hardest concrete at ages closest to the year mark.
pdp_multiple <- partial(xgb_model, pred.var = c("Age", "BlastFurnaceSlag"), train = df0)
plotPartial(pdp_multiple, main = "Partial Dependence of Age and Blast Furnace Slag")
We included a SHAP chart as it's something we'd like to get better at using to interpret non-linear models, but it's still a little elusive. The bars are the SHAP (phi) values, each predictor's contribution to this single prediction, while the number printed next to each predictor is that row's actual feature value, not a SHAP value. So Cement is the number one predictor, with a phi bar reaching roughly 22.5, and its 540 is simply the first mixture's cement content in kg/m3; likewise CoarseAggregate (rocks) shows its dosage of 1040 even though its phi value is only roughly 3.5.
One takeaway is that since BlastFurnaceSlag has a negative phi, its presence might seem to predict lower hardness. But that's because slag requires longer aging to produce calcium silicate hydrate: at shorter ages, slag, which replaces other components in the mixture like cement, does make for less hardness, but only at those earlier ages.
# Create a predictor object
predictor <- Predictor$new(xgb_model, data = df0, y = df0$MPa)
# Calculate SHAP values
shap <- Shapley$new(predictor, x.interest = df0[1, ]) # Example for the first instance
plot(shap)
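The Shapley plot above explains a single mixture; for a global view, here is a hedged sketch of our own using iml's permutation-based feature importance (output not shown):
# Global importance: permute each predictor and measure the rise in MAE.
imp <- FeatureImp$new(predictor, loss = "mae")
plot(imp)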
Here we used an unsupervised technique to tap into the full range of what we learned this semester.
From our domain dive we know that there are different types of concrete like “portland concrete” and architectural concrete.
Our attempt here is to cluster the data into different “recipes” for different types of concrete and get some summary statistics on their makeup.
Since hardness is not part of the recipe we’ll take that out of the data for purposes of clustering.
I’m so excited - this came out way better than I imagined!
Cluster 3 is clearly Architectural concrete. It has the high slag content that makes for the light coloration that works well in modern glass-and-concrete houses, the high workability to be made into intricate shapes, and the smooth finish for beautiful surfaces.
Cluster 2 is clearly Portland concrete. This is your basic, general-purpose concrete, identified because it has the highest cement content and very low ash and slag.
Cluster 1 is Fly Ash concrete, identified because it has the highest FlyAsh content. Fly ash concrete is used for large infrastructure projects like dams because it reduces the heat of hydration, replacing some of the cement (which reacts and produces heat quickly) with fly ash (which reacts and produces heat slowly). Because fly ash concrete has the lowest rate of heating due to hydration, it's the least likely to build up a high temperature differential and then crack from uneven thermal expansion. Hence it's the best for dams and large infrastructure projects.
df_scaled <- as.data.frame(scale(df0))
clustering_data <- df_scaled[, -which(names(df_scaled) == "MPa")]
set.seed(175328)
kmeans_result <- kmeans(clustering_data, centers = 3)
df0$cluster <- kmeans_result$cluster
t(aggregate(df0[, -which(names(df0) == "cluster")], by = list(Cluster = df0$cluster), FUN = mean))
## [,1] [,2] [,3]
## Cluster 1.00000 2.0000000 3.000000
## Cement 266.78062 363.2333333 234.517121
## BlastFurnaceSlag 33.04587 11.3386364 177.911970
## FlyAsh 107.27959 4.1780303 24.048030
## Water 167.85069 195.6094697 188.453182
## Superplasticizer 10.45958 0.4221591 5.204182
## CoarseAggregate 971.80952 1002.9625000 950.348788
## FineAggregate 807.42420 751.0003788 746.924848
## Age 35.31651 75.1742424 35.721212
## MPa 37.15376 33.6639015 35.775949
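We fixed k = 3 based on the domain dive; as a hedged sanity check of our own (a sketch, not run above), an elbow plot of the within-cluster sum of squares is one way to validate that choice:
# Elbow check: total within-cluster SS for k = 1..8.
wss <- sapply(1:8, function(k) {
  set.seed(175328)
  kmeans(clustering_data, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")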
We are trying to predict the hardness, or compressive strength, of concrete based on its age and mixture of ingredients and their ratios to each other.
Secondarily we want to understand the interactions between the mixture components and how they influence hardness.
As an extension of this work, a concrete company could combine cost information and engineering requirements for a specific project to try to optimize a recipe for concrete balancing cost, workability and suitability for a project.
Our data was sourced from the UC Irvine Machine Learning Repository and contains 1,030 records of specific concrete mixtures, with ingredients measured in kilograms per cubic meter; each mixture was aged a given number of days and its hardness measured in megapascals of compressive force resisted before breaking.
Our data was clean with no missing values. Variance inflation factors (VIF) revealed no multicollinearity.
A correlation matrix supported some of the initial insights obtained from our domain dive.
A range of predictive models were used to explore the best fit for the data.
Linear Regression produced the worst predictive model, explaining 61.55% of the variation in the data. This strongly suggested the process for concrete hardening is not linear.
However, we did show that our residuals were normal, so there was no need to transform our data.
We also showed that while the water-to-cement ratio, by itself, did not contribute meaningfully to the linear regression prediction, a gravel-to-sand ratio did. Note that gravel and sand are used interchangeably with coarse and fine aggregates in this analysis.
Random Forest provided a significant improvement, with 92.08% of the variance in the data explained. This confirmed the non-linear nature of the problem. Note we used 10-fold cross-validation to check for stability in all models.
The optimal number of predictors considered at each split in the forest was five, essentially tied with four, indicating that there are meaningful interactions among several predictors, including age.
Support Vector Machines slightly underperformed Random Forest, with 88.89% of the variation in the data explained. However, we got to visualize a partial dependence plot of water and cement showing that the hardest concrete is only achievable with high amounts of cement and low water.
eXtreme Gradient Boosting outperformed them all, explaining 94.36% of the variance.
We looked at a partial dependence plot of age and slag to confirm that, since slag's contribution to the hardening process takes place over a 365-day period instead of cement's 28-day period, the hardest high-slag mixtures would all have been tested after nearly 365 days of aging, and this was observed in the partial dependence plot.
We also tentatively explored SHapley Additive exPlanations (SHAP) since we are expanding our toolbelt to interpret machine learning models that are more complex than statistics-based models.
Lastly we looked at clustering, not as a way to predict the hardness of a particular concrete mixture after a certain number of days of curing, but to understand how all of these different mixtures might be grouped into different recipes. We found three clusters that tied neatly to Portland cement (the standard cement used, say, for sidewalks), architectural cement, which uses a high amount of slag, and fly ash cement, which is used in large-scale industrial projects like dams because of its low heat of hydration.
The purpose of all of this analysis is to develop confidence that:
We can predict the hardness of concrete based on its particular mixture and amount of curing.
We can understand the relationships and ratios between the ingredients that influence the characteristics of the concrete, and support that understanding through partial dependence plots.
We can identify signature recipe types.
And, in doing so, potentially make our own recipes to meet business and client requirements.
The XGBoost model has the best predictive value for concrete hardness.
Through partial dependence plots, we learned that the hardest concrete has a low water-to-cement ratio, and that concrete mixtures with a high proportion of slag take nearly a year to achieve their highest compressive strengths.
There are three main types of concrete recipes that fit well with what we learned in our domain dive: portland concrete, architectural concrete and ash concrete.
Knowing these things we should be able to customize concrete mixtures for our clients to fit their budget and construction requirements.