Concrete is the most important material in civil engineering. It has revolutionized the way construction is being carried out in the last century. The ability to make concrete into any shape and size has led to the many intriguing structures. The characteristics of concrete are dependent on the materials it is made up of and the proportions of the materials. The main materials that make up concrete are water, cement, and aggregates of varying proportions. The composition of these materials determine its compressive strength (Chopra et al., 2014).
The compressive strength of concrete refers to its ability to carry the load on its surface without any crack or deflection. Compressive strength of concrete is used to determine its quality and it is the most important concrete property. Therefore, the ability to predict concrete compression strength is of immense importance. The required compression strength of concrete can vary from 17 MPa for residential purposes to 70 MPa for some commercial con
In this study, multiple regression analysis was carried out for predicting compression strength of concrete using eight variables cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age. The performance of the developed model was tested using test data.
We then wanted to understand if given the ingredients of concrete, could we accurately predict if the resultant strength of that concrete would meet industry standards (4000 PSI). So, we developed a logistic-regression model and assessed it’s accuracy with training and testing subsets.
The data used for this project was obtained from UCI repository. https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
library(tidyverse,warn.conflicts = F) # includes 'magrittr' for piping and 'dplyr' for munging
library(lubridate,warn.conflicts = F) # for munging time-series data
library(plotly,warn.conflicts = F) # for making ggplot visuals interactive via ggplotly()
library(tidytext,warn.conflicts = F) # for ordering columns by value in each faceted plot
library(car, warn.conflicts = F) # avPlots
library(MASS, warn.conflicts = F) # boxcox plots
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
options(scipen=999) # turn off scientific notation
source: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
# Load data
df <- read.csv("C:\\Users\\workd\\Desktop\\concrete.csv")
#
colnames(df)
## [1] "rownames" "cement" "blast_furnace_slag"
## [4] "fly_ash" "water" "superplasticizer"
## [7] "coarse_aggregate" "fine_aggregate" "age"
## [10] "compressive_strength"
# Rename columns for simplicity
df <- df %>%
rename(
Cement = cement,
Slag = blast_furnace_slag,
FlyAsh = fly_ash,
Water = water,
SuperPlast = superplasticizer,
CourseAgg = coarse_aggregate,
FineAgg = fine_aggregate,
Age = age,
ConcreteStr = compressive_strength
)
# Check and remove duplicate rows
paste('Rows with duplicated data (raw):', sum(duplicated(df)))
## [1] "Rows with duplicated data (raw): 0"
df <- df[!duplicated(df), ]
paste('Rows with duplicated data after fix:', sum(duplicated(df)))
## [1] "Rows with duplicated data after fix: 0"
# Check for missing data
paste('Rows with missing data:', sum(is.na(df)))
## [1] "Rows with missing data: 0"
# Print dataframe dimensions
paste('Dimensions of dataframe:', dim(df))
## [1] "Dimensions of dataframe: 1030" "Dimensions of dataframe: 10"
# Attach dataframe for easier column access
attach(df)
# Inspect data structure
glimpse(df)
## Rows: 1,030
## Columns: 10
## $ rownames <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 266.0, 380.0, 380.0, 26~
## $ Slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 114.0, 95.0, 95.0, 114.0, 0~
## $ FlyAsh <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 228, 228, 192, 192~
## $ SuperPlast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0~
## $ CourseAgg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, ~
## $ FineAgg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 670.0, 594.0, 594.0, 67~
## $ Age <int> 28, 28, 270, 365, 360, 90, 365, 28, 28, 28, 90, 28, 270, 9~
## $ ConcreteStr <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 47.03, 43.70, 36.45, 45~
Below, we plot all variables as histograms. Some preliminary observations include:
Many samples have no slag, superplasticizer, nor fly ash, suggesting that they are unnecessary additives potentially used to strengthen concrete
Concrete strength seems right-skewed. Most samples have between 20-40 MPA in strength, yet some reach a very high level of strength
df %>%
pivot_longer(cols = df %>%
colnames()) %>%
ggplot() +
geom_histogram(aes(value), size = .5, bins = 25) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Observations from independent variables plotted against concrete strength:
all independent variables except cement do not correlate well with concrete strength
some variables show a very wide range of concrete strength with the same value (FlyAsh = 0, for example). This may indicate interactivity between independent variables. For example, FlyAsh alone may not strengthen concrete, but when combined with Slag and Superplasticizer, it may (hypothetical example). We will dive into that further in later steps.
# scatter plots, all vars against strength
df %>%
pivot_longer(cols = df %>%
dplyr::select(-c(ConcreteStr)) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value), size = .5) +
geom_smooth(aes(x = ConcreteStr, y = value),method='lm') +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
Below, we use pairs to examine multi-collinearity among independent variables. We don’t see anything worth removing.
pairs(df[1:4])
pairs(df[5:8])
Some of the variables like FlyAsh, SuperPlast, and Slag have a wide-range of concrete strength values when equal to 0.
Therefore, a variable like FlyAsh likely has a greater effect on other independent variables rather than on Concrete Strength directly.
We believe this emphasizes the need for interactive terms between independent variables in our model. Below, we investigate each of those three variables and how they relate to other independent variables.
# slag binary term
df %>%
mutate(Slag_cat = if_else(Slag == 0, 'Zero', 'Non-Zero')) %>%
pivot_longer(cols = df %>%
dplyr::select(-c('ConcreteStr')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value, color = Slag_cat), size = 1) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
# Slag by all x vars
df %>%
pivot_longer(cols = df %>%
dplyr::select(-c('Slag','ConcreteStr')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = Slag, y = value, color = ConcreteStr), size = 2, alpha = .5) +
#geom_smooth(aes(x = FlyAsh, y = value),method='lm') +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank()) +
ggplot2::scale_color_continuous(type = 'viridis')
# Plot independent vars against Concrete Str and color by FlyAsh
df %>%
mutate(FlyAsh_cat = if_else(FlyAsh == 0, 'Zero', 'Non-Zero')) %>%
mutate(SuperPlast_cat = if_else(SuperPlast == 0, 'Zero', 'Non-Zero')) %>%
mutate(Slag_cat = if_else(Slag == 0, 'Zero', 'Non-Zero')) %>%
pivot_longer(cols = df %>%
dplyr::select(-c('ConcreteStr','Slag')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value, color = Slag), size = 2) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
FlyAsh
Below, we create a binary variable that categorizes FlyAsh as zero or non-zero. We then plot all other independent variables against concrete strength, and color the points by our engineered variable FlyAsh_cat (binary: zero, or non-zero).
When interpretting each plot, start at the Y-axis and read across from left-to-right. If the color changes, then it may indicate an effect that FlyAsh has on Concrete Strength for a fixed quantity of the other independent variable.
Using this methodology, in later steps we will investigate potential interactive terms between FlyAsh and Age and FlyAsh and Water.
df %>%
mutate(FlyAsh_cat = if_else(FlyAsh == 0, 'Zero', 'Non-Zero')) %>%
mutate(SuperPlast_cat = if_else(SuperPlast == 0, 'Zero', 'Non-Zero')) %>%
mutate(Slag_cat = if_else(Slag == 0, 'Zero', 'Non-Zero')) %>%
pivot_longer(cols = df %>%
dplyr::select(-c('ConcreteStr')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value, color = FlyAsh_cat), size = 1) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
# Plot independent vars against Concrete Str and color by FlyAsh
df %>%
mutate(FlyAsh_cat = if_else(FlyAsh == 0, 'Zero', 'Non-Zero')) %>%
mutate(SuperPlast_cat = if_else(SuperPlast == 0, 'Zero', 'Non-Zero')) %>%
mutate(Slag_cat = if_else(Slag == 0, 'Zero', 'Non-Zero')) %>%
pivot_longer(cols = df %>%
dplyr::select(-c('ConcreteStr','FlyAsh')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value, color = FlyAsh), size = 2) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
SuperPlast
df %>%
mutate(FlyAsh_cat = if_else(FlyAsh == 0, 'Zero', 'Non-Zero')) %>%
mutate(SuperPlast_cat = if_else(SuperPlast == 0, 'Zero', 'Non-Zero')) %>%
mutate(Slag_cat = if_else(Slag == 0, 'Zero', 'Non-Zero')) %>%
pivot_longer(cols = df %>%
dplyr::select(-c('ConcreteStr')) %>%
colnames()) %>%
ggplot() +
geom_point(aes(x = ConcreteStr, y = value, color = SuperPlast_cat), size = 1) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
In our first model, we have no interactive terms. Adjusted R^2 is roughly .6.
model0 <- lm(ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast + CourseAgg + FineAgg + Age)
summary(model0)
##
## Call:
## lm(formula = ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast +
## CourseAgg + FineAgg + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.654 -6.302 0.703 6.569 34.450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.331214 26.585504 -0.878 0.380372
## Cement 0.119804 0.008489 14.113 < 0.0000000000000002 ***
## Slag 0.103866 0.010136 10.247 < 0.0000000000000002 ***
## FlyAsh 0.087934 0.012583 6.988 0.00000000000502 ***
## Water -0.149918 0.040177 -3.731 0.000201 ***
## SuperPlast 0.292225 0.093424 3.128 0.001810 **
## CourseAgg 0.018086 0.009392 1.926 0.054425 .
## FineAgg 0.020190 0.010702 1.887 0.059491 .
## Age 0.114222 0.005427 21.046 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.4 on 1021 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.6125
## F-statistic: 204.3 on 8 and 1021 DF, p-value: < 0.00000000000000022
All the variables have significant p-values except coarse aggregate and fine aggregate. Also, from the added variable plot these two variables are not significant in estimating the concrete compression strength as their slopes are almost zero.
The slope for superplasticity is also almost 0, but the visuals above suggest superplasticity has an impact on other independent variables. Therefore, we will keep it and explore adding it in interactive terms with other independent variables.
avPlots(model0)
Next, we’ll remove CourseAgg and FineAgg then add in some interactive terms estimated from the visual relationship between dependent variables in the above steps.
model1 <- lm(ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast + Age +
FlyAsh:Age + Slag:Age + SuperPlast:Age + SuperPlast:Slag + SuperPlast:CourseAgg)
summary(model1)
##
## Call:
## lm(formula = ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast +
## Age + FlyAsh:Age + Slag:Age + SuperPlast:Age + SuperPlast:Slag +
## SuperPlast:CourseAgg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.712 -5.750 0.197 6.133 35.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.43060036 4.10721384 4.000 0.000067805386495 ***
## Cement 0.11179927 0.00382881 29.199 < 0.0000000000000002 ***
## Slag 0.06703302 0.00593673 11.291 < 0.0000000000000002 ***
## FlyAsh 0.03711830 0.00876344 4.236 0.000024859964239 ***
## Water -0.14785733 0.01992723 -7.420 0.000000000000247 ***
## SuperPlast -2.05053613 0.43533241 -4.710 0.000002816561120 ***
## Age 0.06442521 0.00634794 10.149 < 0.0000000000000002 ***
## FlyAsh:Age 0.00105559 0.00016098 6.557 0.000000000086924 ***
## Slag:Age 0.00025384 0.00005807 4.371 0.000013634834882 ***
## SuperPlast:Age 0.01386550 0.00151351 9.161 < 0.0000000000000002 ***
## Slag:SuperPlast 0.00253235 0.00061679 4.106 0.000043538098736 ***
## SuperPlast:CourseAgg 0.00181418 0.00047097 3.852 0.000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.07 on 1018 degrees of freedom
## Multiple R-squared: 0.7084, Adjusted R-squared: 0.7053
## F-statistic: 224.8 on 11 and 1018 DF, p-value: < 0.00000000000000022
Based on the diagnostic plots, the model seems reasonable.
par(mfrow = c(2,2))
plot(model1)
plot(model1$fitted.values, ConcreteStr)
par(mfrow = c(2,5))
rstand <- rstandard(model1)
plot(rstand ~ Cement)
plot(rstand ~ Slag)
plot(rstand ~ FlyAsh)
plot(rstand ~ Water)
plot(rstand ~ SuperPlast)
plot(rstand ~ Age)
plot(rstand ~ FlyAsh:Age)
plot(rstand ~ SuperPlast:Age)
plot(rstand ~ Slag:SuperPlast)
plot(rstand ~ model1$fitted.values) # fitted v standard resid
plot(ConcreteStr~ model1$fitted.values) # fitted v actual
abline(lm(ConcreteStr ~ model1$fitted.values))
Based on the inverse response plot and boxcox plot, no transformation seems necessary for the independent variable.
inverseResponsePlot(model1)
boxcox(model1, lambda = seq( 0,1,by =0.01))
avPlots(model1)
Based on the marginal model plots, our model fits well. Specifically, the ‘data’ and ‘model’ lines trace closely for the different variables.
#marginal model plot
mmps(model1,layout=c(2,2))
To demonstrate the impact of these interactive terms, we will perform an anova test between full and partial models.
The low p-value after excluding the interactive term between SuperPlast:CourseAgg is well-below any reasonable alpha-value, indicating that it is significant to the model.
model1_partial <- lm(ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast + Age +
FlyAsh:Age + Slag:Age + SuperPlast:Age + SuperPlast:Slag) # Removed SuperPlast:CourseAgg
anova(model1_partial, model1)
Compressive strength is important as it is the main criteria used to determine whether a given concrete mixture will meet the needs of a specific job. The threshold for compressive strength of a concrete pass is 4000 psi = 27.58 Mpa.
https://cor-tuf.com/everything-you-need-to-know-about-concrete-strength/
threshold <- 27.58 #4000 psi to MPA
df_expanded <- df %>%
# create binary variable above or below the concrete strength threshold
mutate(Binary_ConcreteStr = ifelse(ConcreteStr>threshold,1,0)) %>%
# create a unique identifier to split the train-test sets using anti-join
mutate(rownumber = row_number())
# inspect contents
head(df_expanded)
# split 80% of df_expanded into train set
df_train <- sample_frac(df_expanded, size = 0.8)
# create test set using 25% of df_expanded not in train set
df_test <- df_expanded %>% anti_join(df_train, by = 'rownumber')
# create model using the training dataset,
# prevent information leakage by excluding the testing dataset
model1_log <- glm(Binary_ConcreteStr ~ Cement + Slag + FlyAsh + Water + SuperPlast +
Age + FlyAsh:Age + Slag:Age + SuperPlast:Age + SuperPlast:Slag +SuperPlast:CourseAgg,
data = df_train, family = binomial)
summary(model1_log)
##
## Call:
## glm(formula = Binary_ConcreteStr ~ Cement + Slag + FlyAsh + Water +
## SuperPlast + Age + FlyAsh:Age + Slag:Age + SuperPlast:Age +
## SuperPlast:Slag + SuperPlast:CourseAgg, family = binomial,
## data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6381 -0.3430 0.0300 0.3731 2.2650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.4150656 1.8524279 -3.463 0.000534 ***
## Cement 0.0259249 0.0024807 10.451 < 0.0000000000000002 ***
## Slag 0.0073884 0.0029681 2.489 0.012800 *
## FlyAsh 0.0021416 0.0054069 0.396 0.692043
## Water -0.0193218 0.0089992 -2.147 0.031789 *
## SuperPlast -0.5172861 0.2338731 -2.212 0.026979 *
## Age 0.0291569 0.0056994 5.116 0.000000312 ***
## FlyAsh:Age 0.0004525 0.0001865 2.426 0.015279 *
## Slag:Age 0.0004895 0.0001088 4.501 0.000006776 ***
## SuperPlast:Age 0.0061277 0.0025376 2.415 0.015746 *
## Slag:SuperPlast 0.0002204 0.0003080 0.716 0.474281
## SuperPlast:CourseAgg 0.0005293 0.0002481 2.134 0.032865 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1062.72 on 823 degrees of freedom
## Residual deviance: 466.85 on 812 degrees of freedom
## AIC: 490.85
##
## Number of Fisher Scoring iterations: 8
D <- 555.78
Do <- 1297.93
R_square <- 1 - (D/Do)
print(paste("The R^2 vaue is ",round(R_square,2)))
## [1] "The R^2 vaue is 0.57"
library(class) # import for knn function
## Warning: package 'class' was built under R version 4.1.3
temp <- class::knn(train = df_train, test = df_test, cl = df_train$Binary_ConcreteStr)
print(paste('The alternative R^2 value is close at:',
round(sum(as.integer(temp)-1)/length(temp),2)))
## [1] "The alternative R^2 value is close at: 0.66"
concrete_passed_data <- length(which(df_expanded$Binary_ConcreteStr == 1))
print(paste(round(concrete_passed_data),
"concrete samples passed the threshold of 4000 psi (27.58 MPa)"))
## [1] "680 concrete samples passed the threshold of 4000 psi (27.58 MPa)"
concrete_passed_model <- predict(model1_log, df_test, type = "response")
count_concrete_passed_model <- length(which(concrete_passed_model > 0.5))
print(paste("The model predicted that",round(count_concrete_passed_model),
"concrete samples passed the threshold of 4000 psi (27.58 MPa)"))
## [1] "The model predicted that 144 concrete samples passed the threshold of 4000 psi (27.58 MPa)"
fit_passed.pred <- rep("Didn't Pass", nrow(df_test))
fit_passed.pred[concrete_passed_model > 0.5] = "Passed"
confusion <- table(fit_passed.pred, df_test$Binary_ConcreteStr)
print(confusion)
##
## fit_passed.pred 0 1
## Didn't Pass 52 10
## Passed 13 131
accurancy <- round((c(confusion)[1] + c(confusion)[4])/(sum(c(confusion)))*100,0)
print(paste("The training accuracy of the model is",accurancy,"%"))
## [1] "The training accuracy of the model is 89 %"
Problem statement and how we addressed it:
We wanted to understand if given the ingredients of concrete, could we accurately predict if the resultant compressive strength of that concrete would meet industry standards (4000 PSI).
First, we performed EDA to refine a multi-linear regression model. Then, we took our model and our engineered term of above or below 4000 PSI to train a subset of our data and assess the accuracy against a testing subset.
Our final model came out at 87% accurate. Below, you can find our interpretation of this value.
Interesting insights and limitations:
Some variables of concrete are not ‘necessary’ but are rather additives that can strengthen concrete by enhancing the effects of more primary ingredients like cement. These types of elements (like Superplasticity, Slag, and Fly Ash) have strong interactivity with other concrete ingredients to improve the overall strength. Once we added interactive terms to our multi-linear regression for these ancillary ingredients with more primary ingredients, the model R2 value improved by 15%
Some variables have a direct effect on the strength of concrete without any interactive term (or added ingredients). For example, cement content correlates closely with concrete compressive strength. This is evident in the low p-value from the model summary, and a simple scatter plot between the two variables.
Our logistic regression model had an accuracy of 87%. In other words, if someone has the ingredients to make concrete and plugs those values into our model, our model will predict whether the resultant compressive strength is above or below 4000 PSI. 87% of the time, our model will accurately predict if the concrete strength is above or below that threshold. These types of logistic regression models are likely widely used in the real-world. If concrete unexpectedly fails, the consequences can be severe.
While decent, our model could likely be improved. We estimate that more time would be needed to determine exact interactive terms between variables. In this model, we managed to capture a few obvious ones from some of our diagnostic plots and EDA. Given how those interactive terms improved our model accuracy, more refined ones may further improve this model.
Chopra, P., Sharma, R.K. and Kumar, M. (2014) “Predicting compressive strength of concrete for varying workability using regression models,” International Journal Of Engineering & Applied Sciences, 6(4), pp. 10–10. Available at: https://doi.org/10.24107/ijeas.251233.
Peng, X., Zhuang, Z. and Yang, Q. (2022) “Predictive modeling of compressive strength for concrete at super early age,” Materials, 15(14), p. 4914. Available at: https://doi.org/10.3390/ma15144914.