Concrete is the most important material in civil engineering. This dataset has 9 attributes including 8 quantitative input variables, and 1 quantitative output variable.
This dataset can be viewed: https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength
library(tidyverse) # We need to load below packages.
## -- Attaching packages ----------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
concrete_data <- read_excel("Concrete_Data.xls") # Import dataset and store it.
First, we are trying to understand the data types of all the attributes in this dataset. From below glimpse, we can observe that this dataset only contains numeric variables.
glimpse(concrete_data)
## Rows: 1,030
## Columns: 9
## $ `Cement (component 1)(kg in a m^3 mixture)` <dbl> 540.0, 540....
## $ `Blast Furnace Slag (component 2)(kg in a m^3 mixture)` <dbl> 0.0, 0.0, 1...
## $ `Fly Ash (component 3)(kg in a m^3 mixture)` <dbl> 0, 0, 0, 0,...
## $ `Water (component 4)(kg in a m^3 mixture)` <dbl> 162, 162, 2...
## $ `Superplasticizer (component 5)(kg in a m^3 mixture)` <dbl> 2.5, 2.5, 0...
## $ `Coarse Aggregate (component 6)(kg in a m^3 mixture)` <dbl> 1040.0, 105...
## $ `Fine Aggregate (component 7)(kg in a m^3 mixture)` <dbl> 676.0, 676....
## $ `Age (day)` <dbl> 28, 28, 270...
## $ `Concrete compressive strength(MPa, megapascals)` <dbl> 79.986111, ...
names(concrete_data) #display all the column names
## [1] "Cement (component 1)(kg in a m^3 mixture)"
## [2] "Blast Furnace Slag (component 2)(kg in a m^3 mixture)"
## [3] "Fly Ash (component 3)(kg in a m^3 mixture)"
## [4] "Water (component 4)(kg in a m^3 mixture)"
## [5] "Superplasticizer (component 5)(kg in a m^3 mixture)"
## [6] "Coarse Aggregate (component 6)(kg in a m^3 mixture)"
## [7] "Fine Aggregate (component 7)(kg in a m^3 mixture)"
## [8] "Age (day)"
## [9] "Concrete compressive strength(MPa, megapascals)"
From above output, we can see that some columns have very long and complicated names. To simplify the efforts in our following analysis, we rename each columns with the acronym.
colnames(concrete_data) = c("cem", "bfs", "fa", "water", "sp", "cagg", "fagg", "age", "ccs")
names(concrete_data) # validate the renaming is successful
## [1] "cem" "bfs" "fa" "water" "sp" "cagg" "fagg" "age" "ccs"
Now we can see the column names are much shorter and cleaner.
summary(concrete_data) #understand the summary statistics
## cem bfs fa water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.27 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## sp cagg fagg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.350 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.203 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.160 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## ccs
## Min. : 2.332
## 1st Qu.:23.707
## Median :34.443
## Mean :35.818
## 3rd Qu.:46.136
## Max. :82.599
By running the summary report of this dataset, we can see that there is no NA value in any attribute, so we don’t have to perform any further clean up on the dataset.
In the next step, we are trying to understand the correlation between the dependent variable and the explanatory variables. Below we examine the complete set of variables. From the outcome, We can observe that the correlation between Cement and Concrete Compressive Strength (CCS) is the highest among all pairs, which is 0.5.
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
corrplot(cor(concrete_data), method = "number")
We further draw a scatter plot between Cement and Concrete Compressive Strength (CCS). The distribution of the data points indicate a linear trend looking like the higher cement value, the better CCS.
ggplot(concrete_data) + geom_point(aes(y = ccs, x = cem))
cor(concrete_data$cem, concrete_data$ccs)
## [1] 0.4978327
As an initial step, we will use the explanatory variable showing the highest correlation and the dependent variable to create a simple linear regression model as our baseline.
slm <- lm(ccs~cem,data=concrete_data)
coef(slm)
## (Intercept) cem
## 13.44279487 0.07957957
We can see from above command outcome, the intercept is 13.44279487, and the coefficient for Cement is 0.07957957. In other words, the formula can be depicted as CCS ~ 0.07957957*Cement + 13.44279487. Let’s try to compare the trend lines by both using geom_smooth command with ‘lm’ method and geom_abline command with the coefficient we retrieved from above model creation.
ggplot(data=concrete_data, aes(x=cem,y=ccs)) + geom_point()+
geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
coef.lm <- coef(slm) #saving the coefficients to use in ggplot
ggplot(data=concrete_data, aes(x=cem,y=ccs)) + geom_point()+
geom_abline(intercept=coef.lm[1],
slope=coef.lm[2])
We can see that the two lines are equivalent, and the only difference is that geom_smooth also includes the error estimate.
summary(slm)
##
## Call:
## lm(formula = ccs ~ cem, data = concrete_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.594 -10.952 -0.572 9.992 43.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.442795 1.296925 10.37 <2e-16 ***
## cem 0.079580 0.004324 18.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 1028 degrees of freedom
## Multiple R-squared: 0.2478, Adjusted R-squared: 0.2471
## F-statistic: 338.7 on 1 and 1028 DF, p-value: < 2.2e-16
There are several important statistics we observed from above summary.
The next step, we need to further examine the performance of our model.
plot(slm)
These 4 plots provide us more information about the model.
In summary, all of these 4 performance analysis demonstrates pretty normal outcome, indicating our model is relatively good.
From the simple linear regression model practice, we get model with good performance but the R-Squared score is relatively low. It indicates that some variances are not explained by our model. Hence we are trying to create a multiple regression model.To make assumption simple,we first include all the explanatory variables.
mlm1 <- lm(ccs~.,data=concrete_data)
summary(mlm1)
##
## Call:
## lm(formula = ccs ~ ., data = concrete_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.653 -6.303 0.704 6.562 34.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.163756 26.588421 -0.871 0.383851
## cem 0.119785 0.008489 14.110 < 2e-16 ***
## bfs 0.103847 0.010136 10.245 < 2e-16 ***
## fa 0.087943 0.012585 6.988 5.03e-12 ***
## water -0.150298 0.040179 -3.741 0.000194 ***
## sp 0.290687 0.093460 3.110 0.001921 **
## cagg 0.018030 0.009394 1.919 0.055227 .
## fagg 0.020154 0.010703 1.883 0.059968 .
## age 0.114226 0.005427 21.046 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.4 on 1021 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.6125
## F-statistic: 204.3 on 8 and 1021 DF, p-value: < 2.2e-16
From above model summary, R-squared score is much better than the simple linear regression model. However, some of the variables become marginally significant ( e.g., fagg and cagg ). Let’s do some collinearity analysis on input variables.
library(mctest) # use package mctest
## Warning: package 'mctest' was built under R version 3.6.3
VIF <- function(linear.model, no.intercept=FALSE, all.diagnostics=FALSE, plot=FALSE) {
require(mctest)
if(no.intercept==FALSE) design.matrix <- model.matrix(linear.model)[,-1]
if(no.intercept==TRUE) design.matrix <- model.matrix(linear.model)
if(plot==TRUE) mc.plot(design.matrix,linear.model$model[1])
if(all.diagnostics==FALSE) output <- imcdiag(linear.model, method='VIF')$idiags[,1]
if(all.diagnostics==TRUE) output <- imcdiag(linear.model)
output
} #define VIF process
VIF(mlm1, all.diagnostics=TRUE) # use VIF to analyse collinearity
##
## Call:
## imcdiag(mod = linear.model)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
## cem 7.4887 0.1335 947.3440 1106.3161 0.3654 8.4309 1 0.0009 1.1843
## bfs 7.2765 0.1374 916.3732 1070.1481 0.3707 8.1920 1 0.0009 1.1790
## fa 6.1715 0.1620 755.0324 881.7330 0.4025 6.9479 1 0.0011 1.1453
## water 7.0047 0.1428 876.6808 1023.7951 0.3778 7.8860 1 0.0010 1.1717
## sp 2.9653 0.3372 286.9334 335.0832 0.5807 3.3384 1 0.0023 0.9059
## cagg 5.0760 0.1970 595.1024 694.9655 0.4439 5.7147 1 0.0013 1.0975
## fagg 7.0053 0.1427 876.7805 1023.9114 0.3778 7.8867 1 0.0010 1.1717
## age 1.1184 0.8942 17.2801 20.1799 0.9456 1.2591 0 0.0061 0.1447
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## cagg , fagg , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.6155
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
From above summary, collinearity is detected by the test. It suggests that we need to remove cagg and fagg from our model.
library("Boruta")
## Warning: package 'Boruta' was built under R version 3.6.3
boruta_output <- Boruta(ccs ~ ., data=na.omit(concrete_data), doTrace=2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## After 10 iterations, +3.7 secs:
## confirmed 8 attributes: age, bfs, cagg, cem, fa and 3 more;
## no more attributes left.
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
print(boruta_signif)
## [1] "cem" "bfs" "fa" "water" "sp" "cagg" "fagg" "age"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
We also test the variable importance. From above plot, we can identify age and cement are two most important variables. All other variables are showing similar importance.
Based on above analysis, we revise the model by removing fagg and cagg variables.
mlm2 <- lm(ccs~ age + cem + sp + fa + bfs + water,data=concrete_data)
summary(mlm2)
##
## Call:
## lm(formula = ccs ~ age + cem + sp + fa + bfs + water, data = concrete_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.014 -6.474 0.650 6.546 34.726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.030224 4.212476 6.891 9.64e-12 ***
## age 0.113495 0.005408 20.987 < 2e-16 ***
## cem 0.105427 0.004248 24.821 < 2e-16 ***
## sp 0.239003 0.084586 2.826 0.00481 **
## fa 0.068708 0.007736 8.881 < 2e-16 ***
## bfs 0.086494 0.004975 17.386 < 2e-16 ***
## water -0.218292 0.021128 -10.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.41 on 1023 degrees of freedom
## Multiple R-squared: 0.614, Adjusted R-squared: 0.6117
## F-statistic: 271.2 on 6 and 1023 DF, p-value: < 2.2e-16
We can see R-squared didn’t drop much from mlm1, 0.6125 vs. 0.6117. Now we can observe all variables have strong significance except sp. Let’s create another model without sp.
mlm3 <- lm(ccs~ age + cem + fa + bfs + water,data=concrete_data)
summary(mlm3)
##
## Call:
## lm(formula = ccs ~ age + cem + fa + bfs + water, data = concrete_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.787 -6.440 0.757 6.442 33.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.825788 3.691910 9.433 <2e-16 ***
## age 0.114016 0.005423 21.025 <2e-16 ***
## cem 0.110045 0.003934 27.974 <2e-16 ***
## fa 0.079616 0.006727 11.835 <2e-16 ***
## bfs 0.092359 0.004536 20.360 <2e-16 ***
## water -0.254971 0.016727 -15.243 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.45 on 1024 degrees of freedom
## Multiple R-squared: 0.611, Adjusted R-squared: 0.6091
## F-statistic: 321.6 on 5 and 1024 DF, p-value: < 2.2e-16
R-Squared didn’t change much compared to mlm2, so we have already got a relatively good model.
plot(mlm3)
From above plots, we can observe similar patten as simple linear regression performace analysis. In conclusion, we have got mlm3 model which has good performance and good explanation of variance.