1. Introduction

Concrete is one of the most important materials in civil engineering. Its ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. Sand and gravel are chemically inert particles that are bonded together by cement and water. Concrete compressive strength refers to concrete's capacity to withstand load, whether live load (moving load) or dead load (stationary load), that compresses the concrete.

Our goal: to predict concrete compressive strength using a linear regression model.

2. Data Description

Number of instances: 1030
Number of attributes: 9
Attribute breakdown: 8 quantitative input variables and 1 quantitative output variable

2.1 Variable Information

  • Cement – The agent that binds the other materials together – Input variable (quantitative) – kg/m3
  • Blast Furnace Slag – Formed when iron ore and iron pellets are melted in a blast furnace – Input variable (quantitative) – kg/m3
  • Fly Ash – A coal combustion product carried out of coal-fired boilers with the flue gases – Input variable (quantitative) – kg/m3
  • Water – Hydrogen and oxygen – Input variable (quantitative) – kg/m3
  • Superplasticizer – An additive used in making high-strength concrete – Input variable (quantitative) – kg/m3
  • Coarse Aggregate – A construction component made of quarried rock – Input variable (quantitative) – kg/m3
  • Fine Aggregate – Natural sand particles – Input variable (quantitative) – kg/m3
  • Age – Age of the concrete at testing – Input variable (quantitative) – Days (1–365)
  • Concrete compressive strength – Strength to withstand load – Output variable (quantitative) – MPa (megapascals)

3. Exploratory Data Analysis

# importing all libraries
library(here)     # to anchor the working directory at the project root
## here() starts at /Users/kelvinosuagwu/Desktop/parent/Machine_Learning_A-Z
library(readxl)   # to read the Excel dataset
library(ggplot2)  # for plotting
library(ggpubr)   # for arranging plots in a grid
library(corrplot) # for the correlation plot
## corrplot 0.84 loaded
library(caTools)  # to split the dataset
library(e1071)    # to get the skewness of variables
library(caret)    # to apply k-fold cross-validation
## Loading required package: lattice
library(leaps)    # for best-subset variable selection
setwd(here::here())

3.1 Data Preprocessing

# loading the dataset from the Excel file using the readxl library loaded above
concrete.data = read_excel("dataSet/Concrete_Data.xls")

Renaming the features

# renaming the features
 names(concrete.data) <- c("cement", "blastFurnace",
                 "flyAsh", "water",
                 "superPlasticizer",
                 "coarseAggregate",
                 "fineAggregate",
                 "age",
                 "concreteStrength")

# preview 10 rows
head(concrete.data, 10)
# confirming the dimension of the dataset
dim(concrete.data)
## [1] 1030    9

Summarize the variables

summary(concrete.data)
##      cement       blastFurnace       flyAsh           water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.27   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##  superPlasticizer coarseAggregate  fineAggregate        age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.350   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.203   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.160   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##  concreteStrength
##  Min.   : 2.332  
##  1st Qu.:23.707  
##  Median :34.443  
##  Mean   :35.818  
##  3rd Qu.:46.136  
##  Max.   :82.599

From the summary, concrete strength is our response/target variable. We can see that it ranges from 2.33 to 82.60 MPa. The median and mean are close, but the median is smaller, indicating a slight skew to the right.

flyAsh has a minimum of zero, from which we can assume that fly ash was not used in those mixes; its median is also 0. superPlasticizer and blastFurnace likewise have minimums of zero, so we can assume those mixes omitted them as well. We shall explore this further.
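As a quick check of this assumption, we can count how many mixes record a zero for each of these ingredients (this check is an addition to the original analysis):

# count mixes in which each ingredient is absent (assuming 0 means "not used")
colSums(concrete.data[c("flyAsh", "superPlasticizer", "blastFurnace")] == 0)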

Structure of the data

str(concrete.data)
## tibble [1,030 × 9] (S3: tbl_df/tbl/data.frame)
##  $ cement          : num [1:1030] 540 540 332 332 199 ...
##  $ blastFurnace    : num [1:1030] 0 0 142 142 132 ...
##  $ flyAsh          : num [1:1030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ water           : num [1:1030] 162 162 228 228 192 228 228 228 228 228 ...
##  $ superPlasticizer: num [1:1030] 2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarseAggregate : num [1:1030] 1040 1055 932 932 978 ...
##  $ fineAggregate   : num [1:1030] 676 676 594 594 826 ...
##  $ age             : num [1:1030] 28 28 270 365 360 90 365 28 28 28 ...
##  $ concreteStrength: num [1:1030] 80 61.9 40.3 41.1 44.3 ...

All variables are numeric doubles (including age, which simply holds whole-number day counts), so no type conversion is needed.

Check for missing values

# check for NA
concrete.data.isNa <- is.na(concrete.data) # flag missing values
sum(concrete.data.isNa)   # total count of missing values
## [1] 0

The data is clean, with no missing values or NAs.

3.2 Univariate statistics using visualization to check the data

Let us check the skewness of the variables. If a distribution is roughly symmetric, its skewness will be close to zero.
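For reference, the moment-based skewness that e1071::skewness computes is, up to a small-sample adjustment factor:

$$g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{3/2}}$$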

# checking how skewed each distribution is
apply(concrete.data,2,skewness)
##           cement     blastFurnace           flyAsh            water 
##       0.50803436       0.79840662       0.53588075       0.07410764 
## superPlasticizer  coarseAggregate    fineAggregate              age 
##       0.90546945      -0.04008937      -0.25224294       3.25966169 
## concreteStrength 
##       0.41570873
  • Let us visualize the skewness. The per-variable histogram objects arranged in the grid below are built first.
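The chunk that creates these plot objects is not shown above; here is a minimal sketch of how they might be built (the hist.plot helper is an addition, but the object names match those passed to ggarrange below):

# helper that builds one histogram per named variable (sketch)
hist.plot <- function(var) {
  ggplot(concrete.data, aes(x = .data[[var]])) +
    geom_histogram(bins = 15, fill = "lightblue", col = "black") +
    ggtitle(var) +
    theme_minimal()
}
cement.plot           <- hist.plot("cement")
blastFurnace.plot     <- hist.plot("blastFurnace")
flyAsh.plot           <- hist.plot("flyAsh")
water.plot            <- hist.plot("water")
superPlasticizer.plot <- hist.plot("superPlasticizer")
coarseAggregate.plot  <- hist.plot("coarseAggregate")
fineAggregate.plot    <- hist.plot("fineAggregate")
age.plot              <- hist.plot("age")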

# plot all variables using grid library (ggpubr) above
ggarrange(cement.plot,
          blastFurnace.plot, 
          flyAsh.plot,
          water.plot,
          superPlasticizer.plot,
          coarseAggregate.plot,
          fineAggregate.plot,
          age.plot,
          ncol = 3, nrow = 3)

  • The distribution of fineAggregate looks closest to a normal distribution.
  • We can confirm that age is strongly skewed to the right, followed by superPlasticizer, then blastFurnace, and lastly flyAsh. Age is densest at 28 days because conventional practice and laboratory testing hold that concrete achieves about 99% of its compressive strength by 28 days, which is why the tallest bar sits at 28 on the x-axis.

Let us visualize the response variable (concrete strength)

ggplot(data = concrete.data) +
  geom_histogram(mapping = aes(x = concreteStrength),
                 bins = 15, boundary = 0, fill = "lightblue", col = "black") +
  geom_vline(xintercept = mean(concrete.data$concreteStrength), col = "blue", size = 1) +
  geom_vline(xintercept = median(concrete.data$concreteStrength), col = "red", size = 1) +
  annotate("text", label = "Median = 34.44", x = 23, y = 100, col = "red", size = 5) +
  annotate("text", label = "Mean = 35.82", x = 45, y = 45, col = "blue", size = 5) +
  ggtitle("Histogram of concrete strength") +
  theme_minimal()  # only the last theme applied takes effect, so one is enough

The distribution of the concrete strength variable is not perfectly normal; however, we will proceed regardless. Since the mean and median are close, the mean remains representative of the data.

Let us look at the same distributions using boxplots.

# boxplots for each attribute
par(mfrow = c(2, 4))
for (i in 1:8) {
  boxplot(concrete.data[, i], main = names(concrete.data)[i], col = "orange")
}

We can see several potential outliers in age, water, superPlasticizer, fineAggregate, and blastFurnace. The age variable looks the most problematic, so let us look at it in isolation:

# Using boxplot
boxplot(concrete.data$age, col = "yellow")

Let's find out how many of these points are in the data, to judge whether removing them is feasible. We check values above 100 days:

# flag rows whose age exceeds 100 days
age.outliers <- which(concrete.data$age > 100)

# check the dimension
dim(concrete.data[age.outliers, "age"])
## [1] 62  1

We see that there are 62 such rows, not just the handful of distinct points visible on the boxplot. Let us see which rows they actually are:

age.outliers
##  [1]   3   4   5   7  13  18  20  21  24  25  26  27  28  31  32  33  34  35  36
## [20]  40  42  43  44  48  51  52  57  59  61  62  64  65  66  67 604 605 610 611
## [39] 616 617 620 621 622 623 726 731 736 756 757 763 769 770 792 793 795 796 798
## [58] 799 814 815 821 824

The same age values repeat across these 62 rows out of 1,030. That is too large a share of the data to discard; we would lose a lot of information, so we keep them.
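To confirm that these rows take only a few distinct curing ages, we can tabulate them (this check is an addition to the original analysis):

# distinct ages among the flagged rows and their frequencies
table(concrete.data$age[age.outliers])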

3.2.1 Multicollinearity

# check if they are independent of each other
pairs(concrete.data[-c(9)])

From this graphical check, there is no apparent multicollinearity among the independent variables; i.e., the predictors look sufficiently independent of one another. Otherwise we would have to remove the correlated variables to avoid biased predictions from the model.
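As a numeric complement to the scatterplot matrix, one could also compute variance inflation factors. This uses the car package, which is not loaded above, so treat it as an optional extra check:

# variance inflation factors; values above roughly 5-10 would flag collinearity
library(car)
vif(lm(concreteStrength ~ ., data = concrete.data))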

Correlation matrix

# use correlation function
corrplot(cor(concrete.data[-c(9)]))

superPlasticizer and flyAsh show only a weak correlation (around 0.2–0.4), as do water and age. No pair is strongly correlated, so we do not need to remove any attributes.
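To read exact coefficients rather than judging colour intensity from the plot, the relevant entries can be printed directly (an added convenience, not part of the original output):

# numeric correlations for the pairs discussed above
cor(concrete.data$superPlasticizer, concrete.data$flyAsh)
cor(concrete.data$water, concrete.data$age)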

4. Supervised Learning Experiments

# splitting the dataset into the training set and the test set
set.seed(123)
split.data = sample.split(concrete.data$concreteStrength, SplitRatio = 0.8) # splitting the data by 80 : 20
training.set <- subset(concrete.data, split.data == TRUE)
test.set <- subset(concrete.data, split.data == FALSE)

Initialize a dataframe to store the test set results.

#Dataframe to store all the results from different models / algorithm
results <- data.frame(test.instance = test.set$concreteStrength)
# using all the independent variables to assess statistical significance
regress.multiple <- lm(formula = concreteStrength ~ .,
                       data = training.set)
# view the model's statistical summary
summary(regress.multiple)
## 
## Call:
## lm(formula = concreteStrength ~ ., data = training.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.646  -6.386   0.810   6.548  33.490 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -43.293389  29.601248  -1.463  0.14398    
## cement             0.126005   0.009379  13.436  < 2e-16 ***
## blastFurnace       0.114229   0.011384  10.034  < 2e-16 ***
## flyAsh             0.097976   0.013990   7.003 5.24e-12 ***
## water             -0.134608   0.044242  -3.043  0.00242 ** 
## superPlasticizer   0.240285   0.102239   2.350  0.01900 *  
## coarseAggregate    0.026145   0.010554   2.477  0.01344 *  
## fineAggregate      0.028769   0.011938   2.410  0.01618 *  
## age                0.119164   0.006386  18.661  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.47 on 815 degrees of freedom
## Multiple R-squared:  0.6182, Adjusted R-squared:  0.6144 
## F-statistic: 164.9 on 8 and 815 DF,  p-value: < 2.2e-16

Observations from the model summary

  • Hypothesis test:
  • H0: No relationship exists between the predictors and concrete compressive strength.
  • H1: Some relationship exists between the predictors and concrete compressive strength.

For this model the significance level is 0.05. If a p-value is less than 0.05, we reject the null hypothesis that the coefficient is 0, thereby accepting the alternative hypothesis (H1).

From the summary, cement, blastFurnace, flyAsh, age, and water are the independent variables with the greatest impact on concrete compressive strength, with p-values far below 0.05. They reject the null hypothesis in favour of the alternative, which means industry can be confident that cement, blastFurnace, flyAsh, water, and curing age must be combined to achieve high compressive strength.

The p-values of superPlasticizer, coarseAggregate, and fineAggregate are also below 0.05, though only marginally significant; they too reject the null hypothesis. Despite their weaker significance, we will keep these variables for now.

R-squared

We have an R-squared of 61.82% and an adjusted R-squared of 61.44%, which is slightly lower. The adjusted R-squared penalizes the metric for the number of predictors in the model; it is always lower than the R-squared, and a wide gap between the two is a clue that the model has more predictors than it needs. Since the difference here is small, our model is reasonable.
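For reference, the adjusted R-squared is computed as

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},$$

and plugging in R² = 0.6182 with n = 824 training rows and p = 8 predictors gives 1 − 0.3818 × 823/815 ≈ 0.6144, matching the summary above.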

  • Let us now build the model and examine how the response variable relates to the predictors.

4.1 Model Building with K-fold validation

  1. Let us build the linear model with all the independent variables.
# lets us apply 10 Fold Cross Validation
ctrl <- trainControl(method = "cv", number=10)
set.seed(12345)  #randomization
lmModel.fit <- train(form = concreteStrength  ~ .,
                     data = training.set,
                     method = "lm",
                     trControl = ctrl)
lmModel.fit
## Linear Regression 
## 
## 824 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 741, 742, 741, 744, 743, 741, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   10.5392  0.6112689  8.258166
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Let us predict the results on the test set using the linear regression model.

# predict the compressive strength of concrete on the test set
results$lm.predict <- predict(lmModel.fit, test.set)
  2. Now let us build the SVM model with a radial kernel, again with all the independent variables.
#SVM
set.seed(12345)  #randomization
svmModel.fit <- train(form = concreteStrength  ~ .,
                      data = training.set,
                      method = "svmRadial",
                      trControl = ctrl,
                      tuneLength = 10)
svmModel.fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 824 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 741, 742, 741, 744, 743, 741, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  8.062180  0.7808144  6.135342
##     0.50  7.443007  0.8115086  5.551977
##     1.00  6.880396  0.8382757  5.043029
##     2.00  6.437072  0.8576558  4.682646
##     4.00  6.237900  0.8654134  4.511356
##     8.00  6.106665  0.8704855  4.419172
##    16.00  5.956974  0.8764042  4.260051
##    32.00  5.960119  0.8763146  4.178089
##    64.00  5.922881  0.8777646  4.120953
##   128.00  5.916695  0.8784661  4.059372
## 
## Tuning parameter 'sigma' was held constant at a value of 0.106065
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.106065 and C = 128.

Let us look at the predictions below against the observed values of the test set.

results$svm.predict <- predict(svmModel.fit, test.set)
results

Note that the SVM's predictions are much closer to the observed test-set values than the linear regression model's.
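To put a number on that visual comparison, we can compute each model's test-set RMSE (an added check; these values are not reported in the original):

# side-by-side test-set RMSE for both models
sapply(results[c("lm.predict", "svm.predict")],
       function(p) sqrt(mean((results$test.instance - p)^2)))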

Let us estimate the models' accuracy.

We estimate the accuracy of the two models using a baseline comparison. Because the variables have different units of measure, we center and scale the data. The algorithms compared below use their default tuning parameters.

We will use 10-fold cross-validation with 3 repeats and evaluate the two algorithms with the RMSE and R-squared metrics. R-squared indicates how well the model fits the data (1 is good, 0 is bad), while RMSE shows how far off the predictions are overall (values close to 0 are good).
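For reference, with observed strengths $y_i$, predictions $\hat{y}_i$, and mean $\bar{y}$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$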

# Run algorithms using 10-fold cross-validation
ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# LM
set.seed(12345)  #randomization
lm.fit <- train(form = concreteStrength  ~ .,
                data=training.set,
                method="lm",
                metric=metric,
                preProc=c("center","scale"),
                trControl=ctrl)


# SVM
set.seed(12345)  #randomization
svm.fit <- train(form = concreteStrength  ~ .,
                 data=training.set,
                method="svmRadial",
                metric=metric,
                preProc=c("center","scale"),
                trControl=ctrl)

Compare the two algorithms

# Compare algorithms
models.result <- resamples(list(LM=lm.fit,
                                SVM=svm.fit))
summary(models.result)
## 
## Call:
## summary.resamples(object = models.result)
## 
## Models: LM, SVM 
## Number of resamples: 30 
## 
## MAE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LM  6.637125 7.815566 8.128950 8.254382 8.583904 9.600142    0
## SVM 4.062141 4.622325 5.084283 5.060745 5.433060 6.175693    0
## 
## RMSE 
##         Min.   1st Qu.    Median      Mean  3rd Qu.      Max. NA's
## LM  8.799941 10.024167 10.321398 10.521906 11.15035 11.774198    0
## SVM 5.449613  6.325419  6.880557  6.859827  7.44712  8.365718    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LM  0.4712671 0.5731756 0.6324937 0.6127763 0.6536723 0.7117546    0
## SVM 0.7521655 0.8102512 0.8400413 0.8376898 0.8644023 0.9048097    0

We can see that the SVM fits the data best: its R-squared values are higher, meaning the predictors explain more of the variation in concrete compressive strength, and it also has the lowest root mean squared error (RMSE).

4.2 Variable Selection

Our aim is to fit a linear model using a subset of the variables, as a way of fine-tuning the model for better accuracy. We will use an exhaustive search to check whether we can focus on the variables with the most impact.

# exhaustive best-subset search over the predictors
regfit.full.exhaustive <- regsubsets(concreteStrength ~ ., data = concrete.data,
                                     method = "exhaustive", really.big = T, nvmax = 10)
subset.summary.exhaustive <- summary(regfit.full.exhaustive)
subset.summary.exhaustive$outmat
##          cement blastFurnace flyAsh water superPlasticizer coarseAggregate
## 1  ( 1 ) "*"    " "          " "    " "   " "              " "            
## 2  ( 1 ) "*"    " "          " "    " "   "*"              " "            
## 3  ( 1 ) "*"    " "          " "    " "   "*"              " "            
## 4  ( 1 ) "*"    "*"          " "    "*"   " "              " "            
## 5  ( 1 ) "*"    "*"          "*"    "*"   " "              " "            
## 6  ( 1 ) "*"    "*"          "*"    "*"   "*"              " "            
## 7  ( 1 ) "*"    "*"          "*"    "*"   "*"              "*"            
## 8  ( 1 ) "*"    "*"          "*"    "*"   "*"              "*"            
##          fineAggregate age
## 1  ( 1 ) " "           " "
## 2  ( 1 ) " "           " "
## 3  ( 1 ) " "           "*"
## 4  ( 1 ) " "           "*"
## 5  ( 1 ) " "           "*"
## 6  ( 1 ) " "           "*"
## 7  ( 1 ) " "           "*"
## 8  ( 1 ) "*"           "*"
# graph of adjusted R-squared for each subset size
plot(subset.summary.exhaustive$adjr2, type = "o", col = "red",
     ylab = "adjusted R2", xlab = "Number of variables")
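One could also extract the best subset size and its coefficients directly, rather than reading them off the plot (an added convenience, not in the original):

# subset size with the highest adjusted R-squared, and its coefficients
best.size <- which.max(subset.summary.exhaustive$adjr2)
coef(regfit.full.exhaustive, best.size)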

In this case, the best 5-variable model includes cement, blastFurnace, flyAsh, water, and age. We can confirm this against the p-values in the Section 4 summary statistics. Let us discard superPlasticizer, coarseAggregate, and fineAggregate; recall that although they reject the null hypothesis, their relationship with concrete strength is only weakly significant.

Let us see whether the R-squared improves compared with the case where we used all the independent variables.

Create a subset of the training set.

# discard superPlasticizer, coarseAggregate, and fineAggregate
training.set.new <- training.set[-c(5:7)]

Let us ascertain the model's accuracy using R-squared.

# using only the most significant variables to verify statistical significance
regressor <- lm(formula = concreteStrength ~ cement + blastFurnace + flyAsh + water + age,
                data = training.set)
# view the model's statistical summary
summary(regressor)
## 
## Call:
## lm(formula = concreteStrength ~ cement + blastFurnace + flyAsh + 
##     water + age, data = training.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.861  -6.450   0.674   6.420  33.233 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.369392   4.138509   8.546   <2e-16 ***
## cement        0.108944   0.004403  24.742   <2e-16 ***
## blastFurnace  0.093749   0.005225  17.943   <2e-16 ***
## flyAsh        0.078232   0.007617  10.271   <2e-16 ***
## water        -0.256037   0.018640 -13.736   <2e-16 ***
## age           0.118229   0.006365  18.574   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.52 on 818 degrees of freedom
## Multiple R-squared:  0.6136, Adjusted R-squared:  0.6113 
## F-statistic: 259.8 on 5 and 818 DF,  p-value: < 2.2e-16

5. Conclusion

As we can see, the R-squared is 61.36% (adjusted 61.13%) after removing the less significant variables, versus 61.82% (adjusted 61.44%) with all the independent variables. The reduced model therefore explains roughly the same share of the variation in concrete compressive strength with three fewer predictors, so little is lost by dropping superPlasticizer, coarseAggregate, and fineAggregate.

Overall, the SVM model performs better than the linear regression model.