We opted for a Concrete Compressive Strength problem from the UC Irvine Machine Learning Repository:
https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength
We picked a regression problem because the first three assignments used classification problems. After looking through ~50 datasets we selected this one because it was interesting to learn about something new, and we weren't prepared to spend the significantly longer time a sufficient domain dive into any of the bioinformatics problems would have required.
Here we load the data and provide simpler names for the predictors and target variable.
library(tidyverse)
library(readxl)
library(randomForest)
library(caret)
library(pdp)
library(iml)
df0 <- read_excel("~/Documents/D622/HW4/Concrete_Data.xls")
colnames(df0) <- c("Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age", "MPa")
Cement and Water are combined to make a paste, which is then mixed with CoarseAggregate (gravel) and FineAggregate (sand); the paste binds the aggregates together and hardens into concrete.
BlastFurnaceSlag, a by-product of the steel industry, contributes to a lighter-colored, smoother, slower-hardening, more workable concrete.
FlyAsh, a by-product of coal power plants, contributes silica to concrete and can replace a portion of the cement for slower-acting hardening, which is important for large-scale projects.
Once poured, concrete needs to cure over time to harden fully. The first 28 days of Age cover the initial hydration, where the cement and water combine chemically to create calcium silicate hydrates. Slag or fly ash in the mixture undergoes the same reaction but more slowly: most of the hardening occurs in the first year, though in ideal conditions concrete will continue to harden for years.
Superplasticizer is added to concrete to make it easier to pour without needing as much water, allowing for a lower water-to-cement ratio and therefore a harder concrete.
We are measuring concrete compressive strength in megapascals, or MPa.
glimpse(df0)
## Rows: 1,030
## Columns: 9
## $ Cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 266.0, 380.0, 380.…
## $ BlastFurnaceSlag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 114.0, 95.0, 95.0, 114…
## $ FlyAsh <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 228, 228, 192…
## $ Superplasticizer <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ CoarseAggregate <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 93…
## $ FineAggregate <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 670.0, 594.0, 594.…
## $ Age <dbl> 28, 28, 270, 365, 360, 90, 365, 28, 28, 28, 90, 28, 2…
## $ MPa <dbl> 79.986111, 61.887366, 40.269535, 41.052780, 44.296075…
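Age takes only a handful of discrete test ages, which is worth knowing before treating it as continuous. A quick check of our own (output not shown):
# Which curing ages were actually tested?
sort(unique(df0$Age))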
Here is some optional, additional background on concrete.
The water-to-cement ratio impacts the resulting concrete. A low w/c ratio like 0.4 will be stronger, with fewer, smaller pores; a high w/c ratio like 0.6 will have greater porosity, weakening the concrete.
Cement is calcium oxide and silicates; water combines with it in a heat-producing chemical reaction that results in the durable, binding calcium silicate hydrates (CSH) as well as the weaker, chemically reactive calcium hydroxide. The benefit of adding slag and fly ash to the mixture is that they can react with that weak calcium hydroxide to make more of the durable, binding calcium silicate hydrates.
Consistent temperature and hydration contribute to proper curing, which means improper curing of some data points could introduce noise into our data. The chemical reaction that produces calcium silicate hydrate also produces heat (called the heat of hydration), and in a large infrastructure project like a dam, that heat could cause the curing concrete to develop a temperature differential between the hot inside and cool outside, resulting in cracks. Using the slower reaction of slag, or replacing some of the fast-acting cement with fly ash, slows the hardening reaction and reduces the heat of hydration to where it can dissipate without cracking the concrete.
Note, slag produces a lighter-colored concrete, which is valued in modern architecture. And fly ash, rich in silica and alumina, in addition to slowing the heat of hydration, can replace some of the cement, further reducing the heat of hydration.
The aggregates provide bulk to the concrete. Fine aggregate (sand) increases the ability of the mixture to be pumped, poured and worked into the final shape, but too much sand in the mixture decreases strength.
Coarse aggregate (gravel) helps with load distribution and with locking the concrete mixture together, but if the gravel is too large or not evenly graded, it can lead to weaker concrete due to uneven stress distribution. Poorly graded coarse aggregates could be uniformly graded, where they are all the same size, or gap-graded, where they are missing a band of sizes. An even gradation of sizes increases packing efficiency, whereas with poor grading the amount of cement paste between the aggregates won't be as uniform.
So besides strength we are also looking at cost, workability, coloration and smoothness.
Also note, we don't have any information about the quality of the coarse aggregates, so we are assuming they are evenly graded and consistent across all of our samples. The ratio of sand to rocks may also be important as a constructed variable; see the sketch after this section.
Lastly, superplasticizers are high-range water-reducing admixtures (HRWRAs) that improve workability without needing to increase water content. This makes it possible to pour high-strength, low water-to-cement-ratio concrete, which is important for complex structural elements like buildings with lots of steel rebar that the concrete has to flow around and adhere to.
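As a quick, exploratory sketch of the two constructed variables floated above (the names WCRatio and RockSandRatio are our own; output not shown), we can look at their raw correlations with strength:
# Our own exploratory check: correlate the constructed ratios with MPa.
df0 %>%
  mutate(WCRatio = Water / Cement,
         RockSandRatio = CoarseAggregate / FineAggregate) %>%
  summarise(cor_wc = cor(WCRatio, MPa),
            cor_rock_sand = cor(RockSandRatio, MPa))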
Note some of the concrete doesn't have slag, ash or plasticizers. All units, except Age (days) and MPa (megapascals of compressive strength, or hardness), are in kilograms per cubic meter of mixture. Note the aggregates make up the greatest bulk of the concrete.
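A quick count of our own (output not shown) of how many mixtures omit each optional ingredient entirely:
# How many mixtures contain none of each optional ingredient?
colSums(df0[, c("BlastFurnaceSlag", "FlyAsh", "Superplasticizer")] == 0)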
summary(df0)
## Cement BlastFurnaceSlag FlyAsh Water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.27 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## Superplasticizer CoarseAggregate FineAggregate Age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.350 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.203 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.160 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## MPa
## Min. : 2.332
## 1st Qu.:23.707
## Median :34.443
## Mean :35.818
## 3rd Qu.:46.136
## Max. :82.599
There are no missing values.
colSums(is.na(df0))
## Cement BlastFurnaceSlag FlyAsh Water
## 0 0 0 0
## Superplasticizer CoarseAggregate FineAggregate Age
## 0 0 0 0
## MPa
## 0
Interesting - more cement means more strength. More superplasticizer also means more strength (MPa), but that's probably because it allows less water to be used while the mixture can still flow through pipes and be worked into place. This is supported by the negative correlation between superplasticizer and water.
The top three positive correlations with strength: cement, which makes the concrete mixture stronger; superplasticizer, which makes it stronger probably because it allows for less water; and age, because the concrete needs to hydrate to form those strong calcium silicate hydrates.
The top three negative correlations: fly ash and cement, because fly ash can be used in place of cement for slower binding reactions; superplasticizer and water, because superplasticizer replaces water; and water and fine aggregate, because more sand allows the concrete mixture to flow better, so you wouldn't need as much water for the same workability.
# Correlation matrix plot
correlations <- cor(df0[, sapply(df0, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
Our Variance Inflation Factors (VIF) indicate no severe multicollinearity in our data: all values fall below the common threshold of 10. If Cement and Water were strongly multicollinear we would want to replace them with the water-to-cement ratio identified as critical in our domain dive, which we may still do later to compare results.
lm_model <- lm(MPa ~ ., data = df0)
vif_values <- car::vif(lm_model)
vif_values
## Cement BlastFurnaceSlag FlyAsh Water
## 7.488657 7.276529 6.171455 7.004663
## Superplasticizer CoarseAggregate FineAggregate Age
## 2.965297 5.076044 7.005346 1.118357
The residuals range from -28.653 to 34.446 with a median of 0.704, which is good because it is close to zero. We didn't normalize our data first, so it's hard to judge whether the residual range is wide, but it seems wide compared to the magnitude of the coefficients, so maybe there are non-linear relationships at play, or different classes of concrete produced by different ratios in the recipe. This could point to a clustering exercise.
CoarseAggregate and FineAggregate are barely significant, so we can check whether they show importance when analyzed non-linearly, or we could rerun with them combined.
The model explains 61.55% of the variance in MPa. That's a decent fit, but relatively low for a physical phenomenon, which strongly suggests this isn't a linear problem.
For completeness, we mention that the p-value of the F-statistic is virtually zero, so the model is significant.
summary(lm_model)
##
## Call:
## lm(formula = MPa ~ ., data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.653 -6.303 0.704 6.562 34.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.163756 26.588421 -0.871 0.383851
## Cement 0.119785 0.008489 14.110 < 2e-16 ***
## BlastFurnaceSlag 0.103847 0.010136 10.245 < 2e-16 ***
## FlyAsh 0.087943 0.012585 6.988 5.03e-12 ***
## Water -0.150298 0.040179 -3.741 0.000194 ***
## Superplasticizer 0.290687 0.093460 3.110 0.001921 **
## CoarseAggregate 0.018030 0.009394 1.919 0.055227 .
## FineAggregate 0.020154 0.010703 1.883 0.059968 .
## Age 0.114226 0.005427 21.046 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.4 on 1021 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.6125
## F-statistic: 204.3 on 8 and 1021 DF, p-value: < 2.2e-16
We check the Q-Q plot and the residuals look nearly perfectly normally distributed.
The Residuals vs. Fitted plot looks a little like a football: big in the middle, pointy at the ends. This seems to say that at the low and high ends of hardness the relationships are roughly linear, but in the middle, exactly how the ratios come together to produce hardness is non-linear.
Also, maybe some of the additives can only help so much, so doubling an additive that helps won't do anything if the concrete mixture is already fully saturated with it; see the quick check after the diagnostic plots.
plot(lm_model)
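As a quick, exploratory look at that saturation idea (our own sketch; the loess smoother is an arbitrary choice, output not shown):
# Does superplasticizer's benefit level off at higher doses?
ggplot(df0, aes(Superplasticizer, MPa)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess")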
Taking the log of MPa made the model worse, dropping the R-squared from 61.55% to 55.62%, and we should have anticipated that because the residuals from the untransformed model were already normally distributed.
lm_model_log <- lm(log(MPa) ~ ., data = df0)
summary(lm_model_log)
##
## Call:
## lm(formula = log(MPa) ~ ., data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75973 -0.22786 0.08013 0.27953 0.87255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7495785 0.9426556 1.856 0.06374 .
## Cement 0.0037070 0.0003010 12.316 < 2e-16 ***
## BlastFurnaceSlag 0.0030350 0.0003594 8.445 < 2e-16 ***
## FlyAsh 0.0031324 0.0004462 7.020 4.03e-12 ***
## Water -0.0040572 0.0014245 -2.848 0.00449 **
## Superplasticizer 0.0096986 0.0033135 2.927 0.00350 **
## CoarseAggregate 0.0004878 0.0003331 1.465 0.14331
## FineAggregate 0.0003758 0.0003794 0.990 0.32227
## Age 0.0037939 0.0001924 19.717 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3687 on 1021 degrees of freedom
## Multiple R-squared: 0.5562, Adjusted R-squared: 0.5527
## F-statistic: 159.9 on 8 and 1021 DF, p-value: < 2.2e-16
Here I tried adding the water-to-cement ratio and three versions of a rocks-to-sand interaction; only the ratio version is shown below.
So, the p-value of the rocks/sand ratio was 0.000118, which means it's not just luck. It makes sense, too: efficient packing requires an even gradient of aggregate sizes, and if there's too much coarse aggregate the compressive strength goes down (the ratio's coefficient is -52.64, t = -3.864) because you need enough sand to fill the gaps between the bigger rocks. But it's curious, no, that rocks/sand works but sand/rocks doesn't?
The water-to-cement ratio didn't work, but we can try a non-linear algorithm with and without the w/c ratio to see if it makes a difference; a sketch follows the regression output.
lm_model_interactions <- lm(MPa ~ . + I(Water / Cement) + I(CoarseAggregate / FineAggregate), data = df0)
summary(lm_model_interactions)
##
## Call:
## lm(formula = MPa ~ . + I(Water/Cement) + I(CoarseAggregate/FineAggregate),
## data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.412 -6.088 0.706 6.870 33.387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.678861 30.633648 1.426 0.154219
## Cement 0.113393 0.011213 10.113 < 2e-16 ***
## BlastFurnaceSlag 0.107115 0.010075 10.632 < 2e-16 ***
## FlyAsh 0.085587 0.012478 6.859 1.20e-11 ***
## Water -0.127100 0.041016 -3.099 0.001996 **
## Superplasticizer 0.324821 0.093033 3.491 0.000501 ***
## CoarseAggregate 0.087708 0.020682 4.241 2.43e-05 ***
## FineAggregate -0.066855 0.024924 -2.682 0.007429 **
## Age 0.116149 0.005438 21.358 < 2e-16 ***
## I(Water/Cement) -4.149785 2.808787 -1.477 0.139870
## I(CoarseAggregate/FineAggregate) -52.638459 13.621819 -3.864 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.3 on 1019 degrees of freedom
## Multiple R-squared: 0.6238, Adjusted R-squared: 0.6201
## F-statistic: 169 on 10 and 1019 DF, p-value: < 2.2e-16
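Following up the idea above, here is a minimal sketch of our own (not one of the tuned runs below) of feeding the w/c ratio to a non-linear learner:
# Hedged sketch: does adding the w/c ratio help a random forest?
set.seed(175328)
rf_wc <- randomForest(MPa ~ . + I(Water / Cement), data = df0)
tail(rf_wc$rsq, 1)  # compare with the plain-predictor OOB R^2 below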
So we tried random forest, and we're getting a significantly better R2 (the variance explained by our Random Forest model) at 90.97%. This gives us more confidence that concrete compressive strength is a non-linear process; however, we still need to check that our model isn't overfit.
set.seed(175328)
rf_model <- randomForest(MPa ~ ., data = df0)
tail(rf_model$rsq, 1)
## [1] 0.9097463
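The rsq reported by randomForest is already an out-of-bag estimate, but as an extra overfitting check here is a hedged sketch of an 80/20 holdout evaluation (the split fraction and seed are our own choices; output not shown):
# Hold out 20% of rows and compute R^2 on the unseen data.
set.seed(175328)
idx <- sample(nrow(df0), size = floor(0.8 * nrow(df0)))
rf_holdout <- randomForest(MPa ~ ., data = df0[idx, ])
preds <- predict(rf_holdout, newdata = df0[-idx, ])
1 - sum((df0$MPa[-idx] - preds)^2) /
  sum((df0$MPa[-idx] - mean(df0$MPa[-idx]))^2)  # holdout R^2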
While our Random Forest model predicts the hardness of the different cement mixtures much better than the linear regression model, we're having a hard time demonstrating those non-linear relationships in the data. What we can see below is that age drives the hardening process and that the amount of cement is the primary driver of how hard the concrete gets. This makes sense because cement mixed with water is what binds the aggregates together, whereas slag and ash influence the characteristics of the concrete.
importance(rf_model)
## IncNodePurity
## Cement 57576.61
## BlastFurnaceSlag 16939.18
## FlyAsh 13947.27
## Water 34636.06
## Superplasticizer 25293.90
## CoarseAggregate 17900.76
## FineAggregate 19024.02
## Age 86024.57
varImpPlot(rf_model)
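IncNodePurity can favor predictors with many distinct values, so as a hedged alternative here is a sketch of our own using permutation importance (%IncMSE), which requires refitting with importance = TRUE (output not shown):
# Permutation importance: how much does MSE rise when a predictor is shuffled?
set.seed(175328)
rf_perm <- randomForest(MPa ~ ., data = df0, importance = TRUE)
importance(rf_perm, type = 1)  # type 1 = %IncMSE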
Here we use 10-fold cross-validation to check for stability in the Random Forest model and that it’s not overfit.
We find that 92.08% of our variance is accounted for by this model.
Note that the ideal mtry was 5; this means that at each split, five of the eight predictors were allowed to be considered. However, only mtry values of 2, 5 and 8 were tried, so we'll look at values between 2 and 6 next for the ideal mtry.
set.seed(175328)
rf_cv <- train(MPa ~ ., data = df0, method = "rf", trControl = trainControl(method = "cv", number = 10))
print(rf_cv)
## Random Forest
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.212192 0.9169188 3.837329
## 5 4.737032 0.9208419 3.316702
## 8 4.810363 0.9174042 3.333973
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
We're showing the ideal number of predictors to consider at each split in the random forest is five, essentially tied with four. This means that it's more than just simple two-predictor interactions that are driving hardness.
This is consistent with our domain dive, where the interplay of water, cement, plasticizers, an even gradient of aggregates, and the addition of silicates and alumina from fly ash and slag all work together to produce hardness.
# Define a grid of mtry values to test
tune_grid <- expand.grid(mtry = c(2, 3, 4, 5, 6))
# Perform 10-fold cross-validation with the specified mtry values
rf_cv <- train(
MPa ~ .,
data = df0,
method = "rf",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = tune_grid
)
# Print the results
print(rf_cv)
## Random Forest
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 926, 927, 926, 927, 928, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.133948 0.9221144 3.800211
## 3 4.724437 0.9279820 3.398484
## 4 4.605343 0.9287047 3.261051
## 5 4.603553 0.9271294 3.241732
## 6 4.629922 0.9256030 3.248778
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
And here's the plot of RMSE, with the lowest value at an mtry of 5, essentially tied with 4.
plot(rf_cv)
It's not strictly necessary, but since we had so much trouble implementing SVM for our previous, category-heavy problem, we attempted it here.
It performed perfectly adequately, and easily, compared to the difficulties SVM had with the prior dataset. This is probably because all of the predictors in this dataset are numeric, which is what SVM is naturally suited for.
However, the best Rsquared, 88.89%, does not exceed what we achieved with Random Forest at 92.08%.
I deleted it for brevity, but when I ran the same model after first scaling the data, the Rsquared was lower, at 72.36%.
tune_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.01, 0.1, 1))
train_control <- trainControl(method = "cv", number = 10)
set.seed(175328)
svm_model <- train(
MPa ~ .,
data = df0,
method = "svmRadial",
trControl = train_control,
tuneGrid = tune_grid)
print(svm_model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results across tuning parameters:
##
## C sigma RMSE Rsquared MAE
## 0.1 0.01 11.270178 0.6240059 9.117620
## 0.1 0.10 8.814860 0.7419568 6.807329
## 0.1 1.00 12.142112 0.5966044 9.169449
## 1.0 0.01 8.430162 0.7485941 6.570599
## 1.0 0.10 6.573446 0.8466288 4.817187
## 1.0 1.00 6.933983 0.8389037 4.833237
## 10.0 0.01 7.590229 0.7936221 5.744932
## 10.0 0.10 5.556466 0.8888635 3.985473
## 10.0 1.00 6.081161 0.8683396 4.050270
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 10.
Here's the visualization of the parameter tuning. We can see the lowest cross-validated RMSE is with a cost of 10 and a sigma of 0.1.
plot(svm_model)
It looks like SVM is great for numerical prediction but hard to interpret. An RBF-kernel SVM has no native feature-importance function, though caret can fall back to a model-free filter measure; see the sketch after the plot. However, we can produce these Partial Dependence Plots, which start to show some of the two-term non-linear relationships.
Here we can visualize how high levels of cement and low levels of water are required to produce the hardest concrete. The only way to use that little water would be with high sand, high plasticizer and high slag, which all improve workability.
pdp_multiple <- partial(svm_model, pred.var = c("Water", "Cement"), train = df0)
plotPartial(pdp_multiple, main = "Partial Dependence of Water and Cement")
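As mentioned above, caret can still rank predictors for the SVM: when a model exposes no native importance, varImp() falls back to a model-free filter that scores each predictor against MPa. A hedged sketch of our own (output not shown):
# caret falls back to filterVarImp() when a model has no native importance.
svm_imp <- varImp(svm_model)
plot(svm_imp, main = "Filter-based importance for the SVM")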
Here we ran XGBoost on the problem. Initially we used a tune grid to fit the model with two-fold cross-validation, then extracted the best-performing parameters using print(xgb_model$bestTune) to run the model one more time with 10-fold cross-validation.
So far this is the best performance, with an Rsquared of 94.36%, beating out the next highest, Random Forest, at 92.08%.
#tune_grid <- expand.grid(
# nrounds = c(100, 200), # Number of boosting rounds
# max_depth = c(3, 6, 9), # Maximum tree depth
# eta = c(0.01, 0.1, 0.3), # Learning rate
# gamma = c(0, 1), # Minimum loss reduction
# colsample_bytree = c(0.7, 1), # Subsample ratio of columns
# min_child_weight = c(1, 5), # Minimum sum of weights in a child node
# subsample = c(0.7, 1) # Subsample ratio of the training data
#)
tune_grid <- expand.grid(
nrounds = 200, # Number of boosting rounds
max_depth = 6, # Maximum tree depth
eta = 0.1, # Learning rate
gamma = 0, # Minimum loss reduction
colsample_bytree = 1, # Subsample ratio of columns
min_child_weight = 5, # Minimum sum of weights in a child node
subsample = 0.7 # Subsample ratio of the training data
)
#train_control <- trainControl(method = "cv", number = 2)
train_control <- trainControl(method = "cv", number = 10)
set.seed(175328)
xgb_model <- train(
MPa ~ .,
data = df0,
method = "xgbTree",
trControl = train_control,
tuneGrid = tune_grid
)
#print(xgb_model$bestTune)
print(xgb_model)
## eXtreme Gradient Boosting
##
## 1030 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 926, 928, 927, 926, 928, 929, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.891582 0.943551 2.490643
##
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning parameter 'min_child_weight' was held constant at a value of 5
## Tuning parameter 'subsample' was held constant at a value of 0.7
Here we don't see much new. Again, feature importance doesn't tell us as much about two-term non-linear relationships as partial dependence plots do.
importance <- varImp(xgb_model)
plot(importance, main = "Feature Importance")
Here we used a partial dependence plot to compare Age against BlastFurnaceSlag.
Slag provides molecules that convert more of the weak calcium hydroxide into durable calcium silicate hydrate, but that process occurs primarily over the first 365 days, instead of the first 28 days as with cement and water.
And so we should see concrete with a higher slag content be harder at ages past the initial 28 days.
And we have confirmation: among the mixtures with a high amount of slag, we only see the hardest concrete at ages closest to the year mark.
pdp_multiple <- partial(xgb_model, pred.var = c("Age", "BlastFurnaceSlag"), train = df0)
plotPartial(pdp_multiple, main = "Partial Dependence of Age and Blast Furnace Slag")
We included a SHAP chart as it's something we'd like to get better at using to interpret non-linear models, but it's still a little elusive. The bars are the SHAP (phi) values, each predictor's contribution to this single prediction, while the number printed next to each predictor is that row's actual feature value, not a SHAP value. So Cement is the number one predictor, with a phi bar reaching roughly 22.5, and its 540 is simply the first mixture's cement content in kg/m3; likewise CoarseAggregate (rocks) shows its dosage of 1040 even though its phi value is only roughly 3.5.
One takeaway is that since BlastFurnaceSlag has a negative phi, its presence might seem to predict lower hardness. But that's because slag requires longer aging to produce calcium silicate hydrate: at shorter ages, slag, which replaces other components in the mixture like cement, does make for less hardness, but only at those earlier ages.
# Create a predictor object
predictor <- Predictor$new(xgb_model, data = df0, y = df0$MPa)
# Calculate SHAP values
shap <- Shapley$new(predictor, x.interest = df0[1, ]) # Example for the first instance
plot(shap)
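The Shapley plot above explains a single mixture; for a global view, here is a hedged sketch of our own using iml's permutation-based feature importance (output not shown):
# Global importance: permute each predictor and measure the rise in MAE.
imp <- FeatureImp$new(predictor, loss = "mae")
plot(imp)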
Here we used an unsupervised technique to tap into the full range of what we learned this semester.
From our domain dive we know that there are different types of concrete like “portland concrete” and architectural concrete.
Our attempt here is to cluster the data into different “recipes” for different types of concrete and get some summary statistics on their makeup.
Since hardness is not part of the recipe we’ll take that out of the data for purposes of clustering.
I’m so excited - this came out way better than I imagined!
Cluster 3 is clearly Architectural concrete. It has the high slag content that makes for the light coloration that works well in modern glass-and-concrete houses, the high workability to be made into intricate shapes, and the smooth finish for beautiful surfaces.
Cluster 2 is clearly Portland concrete. This is your basic, general-purpose concrete, identified because it has the highest cement content and very low ash and slag.
Cluster 1 is Fly Ash concrete, identified because it has the highest FlyAsh content. Fly ash concrete is used for large infrastructure projects like dams because it reduces the heat of hydration, replacing some of the cement (which reacts and produces heat quickly) with fly ash (which reacts and produces heat slowly). Because fly ash concrete has the lowest rate of heating due to hydration, it's the least likely to build up a high temperature differential and then crack from uneven thermal expansion. Hence it's the best for dams and large infrastructure projects.
df_scaled <- as.data.frame(scale(df0))
clustering_data <- df_scaled[, -which(names(df_scaled) == "MPa")]
set.seed(175328)
kmeans_result <- kmeans(clustering_data, centers = 3)
df0$cluster <- kmeans_result$cluster
t(aggregate(df0[, -which(names(df0) == "cluster")], by = list(Cluster = df0$cluster), FUN = mean))
## [,1] [,2] [,3]
## Cluster 1.00000 2.0000000 3.000000
## Cement 266.78062 363.2333333 234.517121
## BlastFurnaceSlag 33.04587 11.3386364 177.911970
## FlyAsh 107.27959 4.1780303 24.048030
## Water 167.85069 195.6094697 188.453182
## Superplasticizer 10.45958 0.4221591 5.204182
## CoarseAggregate 971.80952 1002.9625000 950.348788
## FineAggregate 807.42420 751.0003788 746.924848
## Age 35.31651 75.1742424 35.721212
## MPa 37.15376 33.6639015 35.775949
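We fixed k = 3 based on the domain dive; as a hedged sanity check of our own (a sketch, not run above), an elbow plot of the within-cluster sum of squares is one way to validate that choice:
# Elbow check: total within-cluster SS for k = 1..8.
wss <- sapply(1:8, function(k) {
  set.seed(175328)
  kmeans(clustering_data, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")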
We are trying to predict the hardness, or compressive strength, of concrete based on its age and mixture of ingredients and their ratios to each other.
Secondarily we want to understand the interactions between the mixture components and how they influence hardness.
As an extension of this work, a concrete company could combine cost information and engineering requirements for a specific project to try to optimize a recipe for concrete balancing cost, workability and suitability for a project.
Our data was sourced from the UC Irvine Machine Learning Repository and contains 1,030 records of specific concrete mixtures, with ingredients measured in kilograms per cubic meter; each mixture was aged a given number of days and its hardness measured in megapascals of compressive force resisted before breaking.
Our data was clean with no missing values. Variance inflation factors (VIF) revealed no multicollinearity.
A correlation matrix supported some of the initial insights obtained from our domain dive.
A range of predictive models were used to explore the best fit for the data.
Linear Regression produced the worst predictive model, explaining 61.55% of the variation in the data. This strongly suggested the process for concrete hardening is not linear.
However, we did show that our residuals were normal, so there was no need to transform our data.
We also showed that while the water-to-cement ratio, by itself, did not contribute meaningfully to the linear regression prediction, a gravel-to-sand ratio did. Note that gravel and sand are used interchangeably with coarse and fine aggregates in this analysis.
Random Forest provided a significant improvement, with 92.08% of the variance in the data explained. This confirmed the non-linear nature of the problem. Note we used 10-fold cross-validation to check for stability in all models.
The optimal number of predictors considered at each split in the forest was five, essentially tied with four, indicating that there are meaningful interactions among several predictors, including age.
Support Vector Machines slightly underperformed Random Forest, with 88.89% of the variation in the data explained. However, we got to visualize a partial dependence plot of water and cement showing that the hardest concrete is only achievable with high amounts of cement and low water.
eXtreme Gradient Boosting outperformed them all, explaining 94.36% of the variance.
We looked at a partial dependence plot of age and slag to confirm that, since slag's contribution to the hardening process takes place over a 365-day period instead of cement's 28-day period, the hardest high-slag mixtures would all have been tested after nearly 365 days of aging, and this was observed in the partial dependence plot.
We also tentatively explored SHapley Additive exPlanations (SHAP) since we are expanding our toolbelt to interpret machine learning models that are more complex than statistics-based models.
Lastly we looked at clustering, not as a way to predict the hardness of a particular concrete mixture after a certain number of days of curing, but to understand how all of these different mixtures might be grouped into different recipes. We found three clusters that tied neatly to Portland cement (the standard cement used, say, for sidewalks), architectural cement, which uses a high amount of slag, and fly ash cement, which is used in large-scale industrial projects like dams because of its low heat of hydration.
The purpose of all of this analysis is to develop confidence that:
We can predict the hardness of concrete based on its particular mixture and amount of curing.
We can understand the relationships and ratios between the ingredients that influence the characteristics of the concrete, and support that understanding through partial dependence plots.
We can identify signature recipe types.
And, in doing so, potentially make our own recipes to meet business and client requirements.
The XGBoost model has the best predictive value for concrete hardness.
Through partial dependence plots, we learned that the hardest concrete has a low water-to-cement ratio, and that concrete mixtures with a high proportion of slag take nearly a year to achieve their highest compressive strengths.
There are three main types of concrete recipes that fit well with what we learned in our domain dive: portland concrete, architectural concrete and ash concrete.
Knowing these things we should be able to customize concrete mixtures for our clients to fit their budget and construction requirements.