Introduction

Concrete is the most important material in civil engineering. This dataset has 9 attributes: 8 quantitative input variables and 1 quantitative output variable.

The dataset can be viewed at: https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength

Load the packages

library(tidyverse) # load the tidyverse family of packages
## -- Attaching packages ----------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
concrete_data <- read_excel("Concrete_Data.xls") # Import dataset and store it.

Understand the dataset

First, we examine the data types of all the attributes in this dataset. From the glimpse() output below, we can observe that the dataset contains only numeric variables.

glimpse(concrete_data)
## Rows: 1,030
## Columns: 9
## $ `Cement (component 1)(kg in a m^3 mixture)`             <dbl> 540.0, 540....
## $ `Blast Furnace Slag (component 2)(kg in a m^3 mixture)` <dbl> 0.0, 0.0, 1...
## $ `Fly Ash (component 3)(kg in a m^3 mixture)`            <dbl> 0, 0, 0, 0,...
## $ `Water  (component 4)(kg in a m^3 mixture)`             <dbl> 162, 162, 2...
## $ `Superplasticizer (component 5)(kg in a m^3 mixture)`   <dbl> 2.5, 2.5, 0...
## $ `Coarse Aggregate  (component 6)(kg in a m^3 mixture)`  <dbl> 1040.0, 105...
## $ `Fine Aggregate (component 7)(kg in a m^3 mixture)`     <dbl> 676.0, 676....
## $ `Age (day)`                                             <dbl> 28, 28, 270...
## $ `Concrete compressive strength(MPa, megapascals)`       <dbl> 79.986111, ...
names(concrete_data) #display all the column names
## [1] "Cement (component 1)(kg in a m^3 mixture)"            
## [2] "Blast Furnace Slag (component 2)(kg in a m^3 mixture)"
## [3] "Fly Ash (component 3)(kg in a m^3 mixture)"           
## [4] "Water  (component 4)(kg in a m^3 mixture)"            
## [5] "Superplasticizer (component 5)(kg in a m^3 mixture)"  
## [6] "Coarse Aggregate  (component 6)(kg in a m^3 mixture)" 
## [7] "Fine Aggregate (component 7)(kg in a m^3 mixture)"    
## [8] "Age (day)"                                            
## [9] "Concrete compressive strength(MPa, megapascals)"

From the above output, we can see that some columns have very long and complicated names. To simplify the following analysis, we rename each column with an acronym.

colnames(concrete_data) = c("cem", "bfs", "fa", "water", "sp", "cagg", "fagg", "age", "ccs")
names(concrete_data) # validate the renaming is successful
## [1] "cem"   "bfs"   "fa"    "water" "sp"    "cagg"  "fagg"  "age"   "ccs"

Now we can see the column names are much shorter and cleaner.

summary(concrete_data) #understand the summary statistics
##       cem             bfs              fa             water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.27   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##        sp              cagg             fagg            age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.350   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.203   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.160   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##       ccs        
##  Min.   : 2.332  
##  1st Qu.:23.707  
##  Median :34.443  
##  Mean   :35.818  
##  3rd Qu.:46.136  
##  Max.   :82.599

The summary report shows that there is no NA value in any attribute, so we don’t have to perform any further cleanup on the dataset.
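For completeness, we can confirm this directly by counting the missing values per column (a quick check, not part of the original output):

colSums(is.na(concrete_data)) # number of NAs in each column; all zeros for this dataset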

In the next step, we examine the correlation between the dependent variable and the explanatory variables, using the complete set of variables. From the correlation plot below, we can observe that Cement has the highest correlation with Concrete Compressive Strength (CCS) among the explanatory variables, at about 0.50.

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
corrplot(cor(concrete_data), method = "number")
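As a convenience, the ccs column of the correlation matrix lists each predictor’s correlation with the outcome directly (ccs correlates perfectly with itself, so it tops the list):

sort(cor(concrete_data)[, "ccs"], decreasing = TRUE) # predictors ranked by correlation with ccs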

We further draw a scatter plot of Cement against Concrete Compressive Strength (CCS). The distribution of the data points indicates a positive linear trend: the higher the cement content, the higher the CCS.

ggplot(concrete_data) + geom_point(aes(y = ccs, x = cem))

cor(concrete_data$cem, concrete_data$ccs)
## [1] 0.4978327

Create Linear Regression Models

Simple Linear Regression

As an initial step, we use the explanatory variable with the highest correlation (cem) and the dependent variable to fit a simple linear regression model as our baseline.

slm <- lm(ccs~cem,data=concrete_data)
coef(slm)
## (Intercept)         cem 
## 13.44279487  0.07957957

We can see from the above output that the intercept is 13.44279487 and the coefficient for Cement is 0.07957957. In other words, the fitted line can be written as CCS = 0.07957957 × Cement + 13.44279487. Let’s compare the trend lines drawn by the geom_smooth command with the ‘lm’ method and by the geom_abline command with the coefficients retrieved from the model above.

ggplot(data=concrete_data, aes(x=cem,y=ccs)) + geom_point()+ 
  geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'

coef.lm <- coef(slm) #saving the coefficients to use in ggplot
ggplot(data=concrete_data, aes(x=cem,y=ccs)) + geom_point()+ 
  geom_abline(intercept=coef.lm[1],
  slope=coef.lm[2])

We can see that the two lines are identical; the only difference is that geom_smooth also displays a confidence band around the fit.
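As a sanity check on the fitted equation, we can compare predict() against a manual calculation; a small sketch using an assumed cement value of 300 kg in a m^3 mixture:

newdata_cem <- data.frame(cem = 300) # illustrative cement value, not from the dataset
predict(slm, newdata = newdata_cem) # model prediction
13.44279487 + 0.07957957 * 300 # manual calculation: about 37.32 MPa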

summary(slm)
## 
## Call:
## lm(formula = ccs ~ cem, data = concrete_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.594 -10.952  -0.572   9.992  43.241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.442795   1.296925   10.37   <2e-16 ***
## cem          0.079580   0.004324   18.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.5 on 1028 degrees of freedom
## Multiple R-squared:  0.2478, Adjusted R-squared:  0.2471 
## F-statistic: 338.7 on 1 and 1028 DF,  p-value: < 2.2e-16

There are several important statistics to observe in the above summary:

  1. The p-values for both the intercept and the coefficient of Cement are very small, indicating that both estimates are strongly statistically significant, which is a good sign
  2. The Adjusted R-Squared is about 0.247, meaning that roughly 25% of the variance in the data is explained by our model (see the consistency check after this list). Not an exciting result, but at least we have a baseline model
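As a quick consistency check, in simple linear regression the R-Squared equals the squared correlation between the predictor and the response:

cor(concrete_data$cem, concrete_data$ccs)^2 # 0.4978327^2 is about 0.2478, matching the summary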

Next, we further examine the model’s performance using R’s built-in diagnostic plots.

plot(slm)

These four diagnostic plots provide more information about the model:

  1. The first plot (Residuals vs Fitted) shows the residuals spread evenly around a horizontal line without distinct patterns, indicating no non-linear relationship between our predictor variable and the outcome variable
  2. The second plot (Normal Q-Q) shows that our residuals follow the straight line fairly well, indicating that they are approximately normally distributed
  3. The third plot (Scale-Location) shows whether the residuals are spread equally along the range of the predictor; in our case we see a roughly horizontal line with randomly spread points
  4. The fourth plot (Residuals vs Leverage) checks the influence of “outliers” on our model. No data points fall outside Cook’s distance, which indicates that those outliers are not influential, i.e., it is fine to either include or exclude them (see the numeric check below)

In summary, all four of these diagnostic plots show reasonably normal outcomes, indicating our baseline model is a relatively good fit.
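To complement the visual check of outlier influence, we can also quantify it; a minimal sketch using the common 4/n rule of thumb (the threshold is a convention, not part of the original analysis):

cd <- cooks.distance(slm) # Cook's distance for every observation
which(cd > 4 / length(cd)) # observations exceeding the 4/n rule of thumb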

Model Revision - Multiple Regression

The simple linear regression exercise gave us a model with reasonable diagnostics but a relatively low R-Squared, indicating that much of the variance is not explained by the model. Hence we try a multiple regression model. To keep the assumptions simple, we first include all the explanatory variables.

mlm1 <- lm(ccs~.,data=concrete_data)
summary(mlm1)
## 
## Call:
## lm(formula = ccs ~ ., data = concrete_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.653  -6.303   0.704   6.562  34.446 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.163756  26.588421  -0.871 0.383851    
## cem           0.119785   0.008489  14.110  < 2e-16 ***
## bfs           0.103847   0.010136  10.245  < 2e-16 ***
## fa            0.087943   0.012585   6.988 5.03e-12 ***
## water        -0.150298   0.040179  -3.741 0.000194 ***
## sp            0.290687   0.093460   3.110 0.001921 ** 
## cagg          0.018030   0.009394   1.919 0.055227 .  
## fagg          0.020154   0.010703   1.883 0.059968 .  
## age           0.114226   0.005427  21.046  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.4 on 1021 degrees of freedom
## Multiple R-squared:  0.6155, Adjusted R-squared:  0.6125 
## F-statistic: 204.3 on 8 and 1021 DF,  p-value: < 2.2e-16

From the above model summary, the R-squared is much better than for the simple linear regression model. However, some of the variables are only marginally significant (e.g., fagg and cagg). Let’s run a collinearity analysis on the input variables.

library(mctest) # use package mctest
## Warning: package 'mctest' was built under R version 3.6.3
VIF <- function(linear.model, no.intercept=FALSE, all.diagnostics=FALSE, plot=FALSE) {
    require(mctest)
    # build the design matrix, dropping the intercept column unless told otherwise
    if (no.intercept == FALSE) design.matrix <- model.matrix(linear.model)[, -1]
    if (no.intercept == TRUE) design.matrix <- model.matrix(linear.model)
    # optionally plot the collinearity diagnostics
    if (plot == TRUE) mc.plot(design.matrix, linear.model$model[1])
    # return either the VIF values alone or the full diagnostic table
    if (all.diagnostics == FALSE) output <- imcdiag(linear.model, method = 'VIF')$idiags[, 1]
    if (all.diagnostics == TRUE) output <- imcdiag(linear.model)
    output
} # helper wrapping mctest's imcdiag() for VIF-based collinearity checks
VIF(mlm1, all.diagnostics=TRUE) # use VIF to analyse collinearity
## 
## Call:
## imcdiag(mod = linear.model)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##          VIF    TOL       Wi        Fi Leamer   CVIF Klein   IND1   IND2
## cem   7.4887 0.1335 947.3440 1106.3161 0.3654 8.4309     1 0.0009 1.1843
## bfs   7.2765 0.1374 916.3732 1070.1481 0.3707 8.1920     1 0.0009 1.1790
## fa    6.1715 0.1620 755.0324  881.7330 0.4025 6.9479     1 0.0011 1.1453
## water 7.0047 0.1428 876.6808 1023.7951 0.3778 7.8860     1 0.0010 1.1717
## sp    2.9653 0.3372 286.9334  335.0832 0.5807 3.3384     1 0.0023 0.9059
## cagg  5.0760 0.1970 595.1024  694.9655 0.4439 5.7147     1 0.0013 1.0975
## fagg  7.0053 0.1427 876.7805 1023.9114 0.3778 7.8867     1 0.0010 1.1717
## age   1.1184 0.8942  17.2801   20.1799 0.9456 1.2591     0 0.0061 0.1447
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## cagg , fagg , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.6155 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

From the above diagnostics, collinearity is detected for most of the regressors, and the test flags the cagg and fagg coefficients as possibly non-significant due to multicollinearity. This suggests removing cagg and fagg from our model.
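For a quick cross-check, the car package computes the same VIF values with less ceremony (a sketch, assuming car is installed):

library(car)
vif(mlm1) # variance inflation factors; values above roughly 5-10 flag collinearity

Before dropping any variables, we also assess their importance with the Boruta algorithm as an independent view.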

library("Boruta")
## Warning: package 'Boruta' was built under R version 3.6.3
boruta_output <- Boruta(ccs ~ ., data=na.omit(concrete_data), doTrace=2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
## After 10 iterations, +3.7 secs:
##  confirmed 8 attributes: age, bfs, cagg, cem, fa and 3 more;
##  no more attributes left.
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables
print(boruta_signif)
## [1] "cem"   "bfs"   "fa"    "water" "sp"    "cagg"  "fagg"  "age"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

We also test variable importance with Boruta. From the plot above, we can identify age and cem as the two most important variables, while the remaining variables show broadly similar importance.
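For a numeric companion to the importance plot, Boruta’s attStats() summarizes the importance history per attribute (using the boruta_output object from above):

attStats(boruta_output) # mean/median importance and the final decision for each variable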

Based on the above analysis, we revise the model by removing the fagg and cagg variables.

mlm2 <- lm(ccs~ age + cem  + sp + fa + bfs + water,data=concrete_data)
summary(mlm2)
## 
## Call:
## lm(formula = ccs ~ age + cem + sp + fa + bfs + water, data = concrete_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.014  -6.474   0.650   6.546  34.726 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.030224   4.212476   6.891 9.64e-12 ***
## age          0.113495   0.005408  20.987  < 2e-16 ***
## cem          0.105427   0.004248  24.821  < 2e-16 ***
## sp           0.239003   0.084586   2.826  0.00481 ** 
## fa           0.068708   0.007736   8.881  < 2e-16 ***
## bfs          0.086494   0.004975  17.386  < 2e-16 ***
## water       -0.218292   0.021128 -10.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.41 on 1023 degrees of freedom
## Multiple R-squared:  0.614,  Adjusted R-squared:  0.6117 
## F-statistic: 271.2 on 6 and 1023 DF,  p-value: < 2.2e-16

We can see the adjusted R-squared barely dropped from mlm1 (0.6125 vs. 0.6117). All variables are now significant, though sp is the weakest (p ≈ 0.005). Let’s create another model without sp.

mlm3 <- lm(ccs~ age + cem  + fa + bfs + water,data=concrete_data)
summary(mlm3)
## 
## Call:
## lm(formula = ccs ~ age + cem + fa + bfs + water, data = concrete_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.787  -6.440   0.757   6.442  33.386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.825788   3.691910   9.433   <2e-16 ***
## age          0.114016   0.005423  21.025   <2e-16 ***
## cem          0.110045   0.003934  27.974   <2e-16 ***
## fa           0.079616   0.006727  11.835   <2e-16 ***
## bfs          0.092359   0.004536  20.360   <2e-16 ***
## water       -0.254971   0.016727 -15.243   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.45 on 1024 degrees of freedom
## Multiple R-squared:  0.611,  Adjusted R-squared:  0.6091 
## F-statistic: 321.6 on 5 and 1024 DF,  p-value: < 2.2e-16

The adjusted R-Squared barely changed compared to mlm2 (0.6117 vs. 0.6091), so we have arrived at a relatively good model.
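As a compact comparison of all four candidates, we can tabulate their AIC, where lower is better (a sketch assuming slm, mlm1, mlm2 and mlm3 are still in the session):

AIC(slm, mlm1, mlm2, mlm3) # degrees of freedom and AIC for each fitted model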

plot(mlm3)

From the above plots, we observe patterns similar to the simple linear regression diagnostics. In conclusion, mlm3 gives us a model with good performance and a good explanation of the variance.
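Finally, a usage sketch showing how mlm3 would be applied to a new mix; the values below are illustrative, not taken from the dataset:

new_mix <- data.frame(cem = 350, bfs = 100, fa = 0, water = 180, age = 28) # hypothetical mix (kg in a m^3 mixture; age in days)
predict(mlm3, newdata = new_mix) # predicted compressive strength in MPa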