Concrete Compressive Strength

Author

Onesmus Kabui

Introduction

Data set overview

This data set relates the compressive strength of concrete to its mixture proportions. The feature variables are cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age; the target variable is compressive strength, measured in megapascals (MPa). We will use these 8 variables to build models that predict compressive strength. Predicting concrete compressive strength matters for material optimization, efficiency and construction safety. The data come from the UCI Machine Learning Repository.

library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Warning: package 'caret' was built under R version 4.4.3
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(ggplot2)
#load dataset
setwd("C:/Users/user/Desktop/datasets")
concrete_data<-read_excel("Concrete_Data.xls")
glimpse(concrete_data)
Rows: 1,030
Columns: 9
$ `Cement (component 1)(kg in a m^3 mixture)`             <dbl> 540.0, 540.0, …
$ `Blast Furnace Slag (component 2)(kg in a m^3 mixture)` <dbl> 0.0, 0.0, 142.…
$ `Fly Ash (component 3)(kg in a m^3 mixture)`            <dbl> 0, 0, 0, 0, 0,…
$ `Water  (component 4)(kg in a m^3 mixture)`             <dbl> 162, 162, 228,…
$ `Superplasticizer (component 5)(kg in a m^3 mixture)`   <dbl> 2.5, 2.5, 0.0,…
$ `Coarse Aggregate  (component 6)(kg in a m^3 mixture)`  <dbl> 1040.0, 1055.0…
$ `Fine Aggregate (component 7)(kg in a m^3 mixture)`     <dbl> 676.0, 676.0, …
$ `Age (day)`                                             <dbl> 28, 28, 270, 3…
$ `Concrete compressive strength(MPa, megapascals)`       <dbl> 79.986111, 61.…
concrete_data %>% 
  colnames()#check column names
[1] "Cement (component 1)(kg in a m^3 mixture)"            
[2] "Blast Furnace Slag (component 2)(kg in a m^3 mixture)"
[3] "Fly Ash (component 3)(kg in a m^3 mixture)"           
[4] "Water  (component 4)(kg in a m^3 mixture)"            
[5] "Superplasticizer (component 5)(kg in a m^3 mixture)"  
[6] "Coarse Aggregate  (component 6)(kg in a m^3 mixture)" 
[7] "Fine Aggregate (component 7)(kg in a m^3 mixture)"    
[8] "Age (day)"                                            
[9] "Concrete compressive strength(MPa, megapascals)"      
# rename columns to short snake_case names; `Age (day)` keeps its original name
concrete_data <- concrete_data %>% 
  rename(cement = `Cement (component 1)(kg in a m^3 mixture)`,
         blast_furnace_slag = `Blast Furnace Slag (component 2)(kg in a m^3 mixture)`,
         fly_ash = `Fly Ash (component 3)(kg in a m^3 mixture)`,
         water = `Water  (component 4)(kg in a m^3 mixture)`,
         super_plasticizer = `Superplasticizer (component 5)(kg in a m^3 mixture)`,
         coarse_aggregate = `Coarse Aggregate  (component 6)(kg in a m^3 mixture)`,
         fine_aggregate = `Fine Aggregate (component 7)(kg in a m^3 mixture)`,
         strength = `Concrete compressive strength(MPa, megapascals)`)

#check column names
colnames(concrete_data)
[1] "cement"             "blast_furnace_slag" "fly_ash"           
[4] "water"              "super_plasticizer"  "coarse_aggregate"  
[7] "fine_aggregate"     "Age (day)"          "strength"          
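
The output above shows that the Age (day) column still carries its original name, which is why it has to be wrapped in backticks wherever it is used later. A minimal sketch of how such a column could be renamed too, using a two-row toy tibble with made-up values rather than the real data (the new name age and the object toy_concrete are assumptions for illustration; the rest of this report keeps Age (day) unchanged):

```r
library(dplyr)
library(tibble)

# Toy stand-in for concrete_data; the values are illustrative only
toy_concrete <- tibble(
  `Age (day)` = c(28, 270),
  strength    = c(40.1, 55.2)
)

# rename(new_name = old_name); the non-syntactic old name needs backticks
toy_concrete <- toy_concrete %>%
  rename(age = `Age (day)`)

colnames(toy_concrete)
```

With a syntactic name the column can be used in a formula as plain age, without backticks.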

Linear Modelling

In the first part of our analysis we assume linearity and fit a multiple linear regression. We first split the data into training and testing sets, keeping 80% of the data for training.

set.seed(66)
train_index<- createDataPartition(concrete_data$strength,p=0.8,list = FALSE)

train_concrete<- concrete_data[train_index,]
test_concrete<- concrete_data[-train_index,]
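
As a quick sanity check on the split, the sketch below applies the same createDataPartition() call to a simulated outcome of the same length (1,030 values) and reports the share of rows that land in the training set. The simulated vector and the names toy_strength and toy_index are assumptions, not the real data. For the real data, the model summary in the next section reports 817 residual degrees of freedom with 9 estimated coefficients, which implies 817 + 9 = 826 training rows, about 80.2% of 1,030.

```r
library(caret)

set.seed(66)
# Simulated stand-in for concrete_data$strength (illustrative values only)
toy_strength <- runif(1030, min = 2, max = 83)

# Stratified 80/20 split on the outcome; with list = FALSE the result is a
# one-column matrix of row indices for the training set
toy_index <- createDataPartition(toy_strength, p = 0.8, list = FALSE)

# Share of rows assigned to training; close to 0.8 but not necessarily exact,
# because sampling is done within groups of the outcome
train_share <- length(toy_index) / length(toy_strength)
train_share
```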

Linear regression

colnames(train_concrete)
[1] "cement"             "blast_furnace_slag" "fly_ash"           
[4] "water"              "super_plasticizer"  "coarse_aggregate"  
[7] "fine_aggregate"     "Age (day)"          "strength"          
concrete_train_lm<- lm(strength ~., data=train_concrete)
summary(concrete_train_lm)

Call:
lm(formula = strength ~ ., data = train_concrete)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.963  -6.311   0.599   6.799  33.999 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -15.227226  30.917902  -0.493  0.62249    
cement               0.115529   0.009660  11.959  < 2e-16 ***
blast_furnace_slag   0.100209   0.011745   8.532  < 2e-16 ***
fly_ash              0.078321   0.014499   5.402 8.65e-08 ***
water               -0.161495   0.045994  -3.511  0.00047 ***
super_plasticizer    0.328893   0.106001   3.103  0.00198 ** 
coarse_aggregate     0.016932   0.010925   1.550  0.12157    
fine_aggregate       0.016287   0.012399   1.314  0.18935    
`Age (day)`          0.109581   0.005944  18.436  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.48 on 817 degrees of freedom
Multiple R-squared:  0.6112,    Adjusted R-squared:  0.6074 
F-statistic: 160.6 on 8 and 817 DF,  p-value: < 2.2e-16

From the base linear model we can make some observations already.

The residual summary (minimum, first quartile, median, third quartile and maximum) shows that the middle 50% of the training residuals fall between -6.311 and 6.799 MPa, which on its own is not extremely bad. On the flip side, some fitted values miss the true value by as much as 33.999 MPa. Overall the model explains about 61% of the variability in compressive strength (R squared 0.6112).

From the coefficients: cement, blast furnace slag, fly ash, superplasticizer and age all have statistically significant positive coefficients, so higher values of these variables are associated with higher concrete compressive strength.

On the other hand, fine aggregate and coarse aggregate do not have statistically significant coefficients (p = 0.189 and p = 0.122), suggesting that their variation does not strongly influence concrete compressive strength in this data set.

Water has a statistically significant negative coefficient: each additional 1 kg/m^3 of water is associated with a decrease of about 0.16 MPa in compressive strength, holding the other variables constant.
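
To make the coefficient interpretation concrete, the sketch below plugs two estimates from the summary table into a hypothetical mix change: 50 kg/m^3 more cement and 10 kg/m^3 less water, with every other predictor held fixed. The size of the change is an assumption chosen for illustration; only the two coefficients come from the model output.

```r
# Estimates copied from summary(concrete_train_lm)
coef_cement <- 0.115529   # MPa per kg/m^3 of cement
coef_water  <- -0.161495  # MPa per kg/m^3 of water

# Hypothetical change in the mix (illustrative values)
delta_cement <- 50    # kg/m^3 of cement added
delta_water  <- -10   # kg/m^3 of water removed

# In a linear model the predicted change is the sum of coefficient * change
delta_strength <- coef_cement * delta_cement + coef_water * delta_water
delta_strength  # about 7.39 MPa
```

This describes an association in the fitted model for this data set; it is not a guarantee about how a real mix would behave.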

Making predictions

With the model fitted on the training data we can make predictions on the test set.

strength_pred<- predict(concrete_train_lm, newdata= test_concrete)# predicted values of strength

pred_actual<- data.frame(actual= test_concrete$strength,strength_pred)#dataframe of actual and predicted strength values

head(pred_actual,10)
      actual strength_pred
1  61.887366      53.76031
2  41.052780      66.09752
3  41.836714      47.28041
4   8.063422      22.07614
5  52.516697      58.54319
6  38.603761      29.05057
7  55.260122      57.37887
8  42.229026      38.14581
9  50.459301      27.82621
10 35.076402      28.32287

We now have a data frame of predicted and actual values for the test set and can move on to evaluate the quality of fit.

ggplot(pred_actual,aes(x=actual,y=strength_pred))+
         geom_point(color="blue")+
         geom_abline(slope = 1,intercept = 0,color="red")+
         labs(title = "actual strength vs predicted strength",x="actual strength",y="predicted strength")+
         theme_minimal()

From this plot we see that the predicted values follow the ideal diagonal line (slope 1, intercept 0) fairly well, but prediction error is larger for medium-strength concrete and the model seems to underpredict very strong concrete.

We can also compute the residuals between the actual and predicted test strength and look at their distribution.

concrete_residuals<-test_concrete$strength- strength_pred
hist(concrete_residuals,main = "Residual Distribution",xlab = "Error (MPa)")

The residual distribution looks satisfactory: most residuals are centered around 0 and the shape is approximately normal.

Lastly, to quantify the accuracy of our model we use two metrics: RMSE and R squared.
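
For reference, the two metrics as computed in the code that follows can be written as below, where y_i is the actual test strength, \hat{y}_i the predicted strength and n the number of test observations. Note that R squared is defined here as the squared Pearson correlation between predicted and actual values, which on held-out data is not always equal to the sums-of-squares version, 1 - SSE/SST.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = \left[\operatorname{cor}\left(\hat{y},\, y\right)\right]^2
```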

# root mean squared error on the test set
RMSE<-sqrt(mean((concrete_residuals^2)))
# R squared as the squared correlation between predicted and actual strength
R_squared<- cor(strength_pred, test_concrete$strength)^2
RMSE
[1] 10.14504
R_squared
[1] 0.6304237

The model's RMSE is approximately 10 MPa, which gives the typical size of a prediction error, and the linear regression model explains roughly 63% of the variability in test-set compressive strength. The linear model therefore still leaves a lot of variability in concrete compressive strength unexplained.
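
Because R squared is computed here as a squared correlation, it can be more flattering than the version based on sums of squares, and RMSE weights large errors more heavily than the mean absolute error (MAE) does. A minimal sketch comparing these quantities on the ten rows printed by head(pred_actual, 10) above. Ten rows are far too few for a stable estimate, and MAE and the sums-of-squares R squared are additions for illustration, not metrics used in this report.

```r
# The ten rows shown by head(pred_actual, 10)
actual <- c(61.887366, 41.052780, 41.836714, 8.063422, 52.516697,
            38.603761, 55.260122, 42.229026, 50.459301, 35.076402)
predicted <- c(53.76031, 66.09752, 47.28041, 22.07614, 58.54319,
               29.05057, 57.37887, 38.14581, 27.82621, 28.32287)

errors <- actual - predicted

mae_10  <- mean(abs(errors))      # mean absolute error
rmse_10 <- sqrt(mean(errors^2))   # root mean squared error

# R squared as squared correlation (the definition used in this report)
r2_cor <- cor(predicted, actual)^2

# R squared from sums of squares: 1 - SSE / SST
r2_sse <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

round(c(MAE = mae_10, RMSE = rmse_10, R2_cor = r2_cor, R2_sse = r2_sse), 3)
```

RMSE is never smaller than MAE, and the sums-of-squares R squared is never larger than the squared correlation, so reporting both gives a more conservative picture of test-set accuracy.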

Diagnostics

We can proceed with residual diagnostics to see if the residuals meet regression assumptions.

par(mfrow=c(2,2))
plot(concrete_train_lm)

Linearity: in the plot of residuals against fitted values there is a slight curve in the smoothed line, suggesting possible non-linearity. The relationship between the predictors and the response might not be perfectly linear.

Normality: in the normal Q-Q plot most points fall on the diagonal, apart from a few possible outliers at the ends, which supports approximate normality of the residuals.

Constant variance: the scale-location plot hints at heteroscedasticity, since the smoothed line is not quite horizontal and the spread of the residuals appears to increase with the fitted values.

Influence: most points have low leverage. There are some outliers, but they do not appear to be very influential because they lie inside the Cook's distance contours.
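
The visual checks above can be complemented with a few numbers. The sketch below is a minimal illustration on simulated data, not on the concrete model: the objects toy_fit, x and y are assumptions, and the 4/n and twice-average-leverage cutoffs are common rules of thumb rather than thresholds used in this report. The same calls could be applied to concrete_train_lm.

```r
set.seed(1)
# Simulated data with a deliberately curved relationship (illustrative only)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(n, sd = 1)
toy_fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
sw <- shapiro.test(residuals(toy_fit))

# Influence: Cook's distance, flagged with the 4/n rule of thumb
cooks_d <- cooks.distance(toy_fit)
influential <- which(cooks_d > 4 / n)

# Leverage: hat values, flagged at twice the average leverage (p / n)
lev <- hatvalues(toy_fit)
high_leverage <- which(lev > 2 * length(coef(toy_fit)) / n)

sw$p.value
length(influential)
length(high_leverage)
```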

Findings

  • The predictors explain about 63% of the variance in test-set compressive strength. That is fair, but it could be improved with interaction terms and quadratic transformations.

  • The model's RMSE is about 10 MPa, so a typical prediction is off by roughly 10 MPa in either direction.

  • From the model we can see that several variables are important predictors while others are not as important.

    At the 5% level of significance:

  • Cement is significant: for every 1 kg/m^3 increase in cement there is a corresponding increase of about 0.12 MPa in predicted strength, holding the other variables constant.

  • Blast furnace slag, fly ash and superplasticizer are also statistically significant, with p-values well below 0.05.

  • Age is very significant as well: the longer the concrete cures, the stronger it gets. On average, strength improves by about 0.11 MPa per additional day, holding the other variables constant.

  • Water is also very significant but in the opposite direction: each additional 1 kg/m^3 of water is associated with a decrease of about 0.16 MPa in strength, holding the other variables constant.
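
The first finding mentions interaction terms and quadratic transformations as possible improvements. A minimal sketch of what that could look like, fitted on simulated data rather than the concrete data: the data frame toy, its coefficients and the particular terms chosen (a cement-by-water interaction and a squared age term) are all assumptions for illustration, not results from this analysis.

```r
set.seed(42)
# Simulated mix data (illustrative only): strength rises with cement,
# falls with water, and the gain from age levels off over time
n <- 500
toy <- data.frame(
  cement = runif(n, 100, 540),
  water  = runif(n, 120, 250),
  age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
toy$strength <- 30 + 0.10 * toy$cement - 0.15 * toy$water +
  8 * log(toy$age) + rnorm(n, sd = 5)

# Baseline: main effects only
fit_linear <- lm(strength ~ cement + water + age, data = toy)

# Extended: adds a cement-by-water interaction and a quadratic age term
fit_extended <- lm(strength ~ cement * water + age + I(age^2), data = toy)

# Compare adjusted R squared of the two fits
c(linear   = summary(fit_linear)$adj.r.squared,
  extended = summary(fit_extended)$adj.r.squared)

# F-test for whether the extra terms improve on the nested baseline
anova(fit_linear, fit_extended)
```

In-sample R squared can only go up when terms are added, so any such extension would still need to be judged on the test set, as was done for the base model.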