Data Description

Abstract

Concrete is one of the most important materials in civil engineering. Its compressive strength is a highly nonlinear function of age and ingredients: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Data Sourcing

The dataset was sourced from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

Original Owner and Donor
Prof. I-Cheng Yeh,
Department of Information Management,
Chung-Hua University,
Hsin Chu, Taiwan 30067, R.O.C.
E-mail: icyeh@chu.edu.tw
TEL: 886-3-5186511

Data Characteristics

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data are in raw form (not scaled).

Number of instances (observations): 1030
Number of Attributes: 9
Attribute breakdown: 8 quantitative input variables, and 1 quantitative output variable
Missing Attribute Values: None

Variable Information

      • Cement – quantitative – kg in a m³ mixture
      • Blast Furnace Slag – quantitative – kg in a m³ mixture
      • Fly Ash – quantitative – kg in a m³ mixture
      • Water – quantitative – kg in a m³ mixture
      • Superplasticizer – quantitative – kg in a m³ mixture
      • Coarse Aggregate – quantitative – kg in a m³ mixture
      • Fine Aggregate – quantitative – kg in a m³ mixture
      • Age – quantitative – days (1–365)
      • Concrete compressive strength – quantitative – MPa

Data Analysis

Data Preprocessing

Libraries Used
library(corrplot)   # correlation matrix plots
library(caret)      # data partitioning, model training, resampling
library(ggplot2)    # plotting
library(knitr)      # kable() tables
library(e1071)      # skewness()
library(rattle)     # fancyRpartPlot()
library(kableExtra) # table styling and scroll boxes
Importing Data
conc <- read.csv("D:\\R\\Data Sets\\concrete.csv") # adjust the path to the local copy of the file
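As a quick sanity check against the data characteristics listed above (1030 observations, 9 attributes, no missing values), a minimal sketch:

dim(conc)        # expect 1030 rows and 9 columns
sum(is.na(conc)) # expect 0 missing values
str(conc)        # all columns should be numeric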



Data Preview
kable(head(conc), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
cementcomp   slag  flyash  water  superplastisizer  coraseaggr  finraggr  age    CCS
     540.0    0.0       0    162               2.5      1040.0     676.0   28  79.99
     540.0    0.0       0    162               2.5      1055.0     676.0   28  61.89
     332.5  142.5       0    228               0.0       932.0     594.0  270  40.27
     332.5  142.5       0    228               0.0       932.0     594.0  365  41.05
     198.6  132.4       0    192               0.0       978.4     825.5  360  44.30
     266.0  114.0       0    228               0.0       932.0     670.0   90  47.03
Summary of variables
summary(conc)
##    cementcomp         slag           flyash           water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##  superplastisizer   coraseaggr        finraggr          age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##       CCS       
##  Min.   : 2.33  
##  1st Qu.:23.71  
##  Median :34.45  
##  Mean   :35.82  
##  3rd Qu.:46.13  
##  Max.   :82.60
Check the skewness of variables. If the distribution is roughly symmetric, skewness will be close to zero.
apply(conc,2,skewness)
##       cementcomp             slag           flyash            water 
##       0.50799821       0.79838622       0.53578981       0.07441116 
## superplastisizer       coraseaggr         finraggr              age 
##       0.90456195      -0.04010268      -0.25227315       3.25966169 
##              CCS 
##       0.41576358

Age is strongly right-skewed (skewness ≈ 3.26). Several attributes (slag, fly ash, superplasticizer) contain zero values, so a plain log transformation cannot be applied uniformly; the variables are therefore left in raw form.
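If a transformation were wanted despite the zeros, log1p (i.e. log(1 + x)) is defined at zero; a minimal sketch checking its effect on age:

#log1p handles zero values; it should pull the skewness of age much closer to zero
skewness(conc$age)
skewness(log1p(conc$age))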

CCS Distribution Plot
ggplot(conc, aes(x = CCS)) + geom_histogram(bins = 40) + labs(x = "CCS", y = "Frequency", title = "Distribution of CCS") 

Correlation matrix to examine relationships between variables.
corrplot(cor(conc), method = "number",type = "upper")
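The same matrix can also be read numerically for the outcome column, ranking each predictor's linear association with CCS; a small sketch:

#Correlations of every variable with CCS, strongest positive first
sort(cor(conc)[, "CCS"], decreasing = TRUE)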

Scatter plot – Water vs Superplasticizer
qplot(conc$superplastisizer, conc$water) + labs(x = "Superplastisizer", y = "Water", title = "Water vs Superplastisizer")
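The plot suggests an inverse relationship, consistent with superplasticizers being used to reduce water demand; a one-line check of the (expected negative) correlation:

cor(conc$water, conc$superplastisizer)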



Partition the data into training (70%) and testing (30%) sets.
#Splitting data into train and test set.
inTrain <- createDataPartition(conc$CCS, p = 0.7, list = F)
trainSET <- conc[inTrain,]
testSET <- conc[-inTrain,]
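createDataPartition samples within quantile groups of the outcome, so the CCS distributions of the two sets should be similar; a quick sketch to verify:

#Train and test outcome distributions should roughly match
summary(trainSET$CCS)
summary(testSET$CCS)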
Create a data frame to store the test-set observations and each model's predictions.
#Dataframe to store results
results <- data.frame(test_obs = testSET$CCS)



Model Building

  1. Linear Regression Model
#10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

lmmodel <- train(CCS ~ ., data = trainSET, method = "lm", trControl = ctrl)
lmmodel
## Linear Regression 
## 
## 722 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 649, 650, 650, 649, 650, 650, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   10.58996  0.6070471  8.375624
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Predict the results on the test set.

results$lm_predictions <- predict(lmmodel, testSET)
  2. CART (Classification and Regression Trees)
treemodel <- train(CCS ~ ., data = trainSET, method = "rpart", trControl = ctrl, tuneLength = 5)
plot(treemodel)


Fancy Decision Tree Plot

fancyRpartPlot(treemodel$finalModel)


Predict the results on the test set.

results$tree_predictions <- predict(treemodel, testSET)

  3. MARS (Multivariate Adaptive Regression Splines)

marmodel <- train(CCS ~ ., data = trainSET, method = "earth", trControl = ctrl, tuneLength = 15)
plot(marmodel)
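caret also exposes variable importance for the fitted MARS model; a quick sketch:

#Variable importance from the earth fit, via caret's varImp()
varImp(marmodel)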

Predict the results on the test set.

results$mar_predictions <- predict(marmodel, testSET)
  4. SVM Radial
svmmodel <- train(CCS ~ ., data = trainSET, method = "svmRadial", trControl = ctrl, tuneLength = 10)
svmmodel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 722 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 650, 650, 650, 650, 650, 648, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  8.193874  0.7727503  6.216962
##     0.50  7.510859  0.8057671  5.588933
##     1.00  6.975163  0.8315825  5.139610
##     2.00  6.614476  0.8484595  4.842685
##     4.00  6.335832  0.8617612  4.601815
##     8.00  6.187779  0.8681764  4.447095
##    16.00  6.024921  0.8747392  4.274296
##    32.00  5.994894  0.8761061  4.134932
##    64.00  6.047779  0.8746937  4.126316
##   128.00  6.257243  0.8664588  4.220552
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1117766
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1117766 and C = 32.
plot(svmmodel)

Predict the results on the test set.

results$svm_predictions <- predict(svmmodel, testSET)
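A predicted-vs-observed plot gives a visual check of the SVM fit on the test set; a minimal sketch (points near the 45-degree line indicate accurate predictions):

#Predicted vs observed CCS for the SVM model on the test set
ggplot(results, aes(x = test_obs, y = svm_predictions)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "Observed CCS", y = "Predicted CCS", title = "SVM: Predicted vs Observed")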

Predicted Values

kable(head(results), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
test_obs  lm_predictions  tree_predictions  mar_predictions  svm_predictions
   40.27        55.56344          40.41024         47.18905         40.89970
   44.30        59.60197          40.41024         39.63560         46.70834
   43.70        66.59378          57.15409         50.17871         43.40097
   36.45        29.91729          57.15409         38.31050         37.20360
   40.56        36.66490          57.15409         48.29527         39.59831
   42.62        47.84235          57.15409         49.37454         42.85674

Model Assessment

#Linear Regression
postResample(results$lm_predictions,results$test_obs)
##       RMSE   Rsquared        MAE 
## 10.1496003  0.6338727  8.2266351
#CART
postResample(results$tree_predictions, results$test_obs)
##       RMSE   Rsquared        MAE 
## 11.4951801  0.5315966  9.2424359
#MARS
postResample(results$mar_predictions, results$test_obs)
##      RMSE  Rsquared       MAE 
## 6.2252879 0.8622315 4.8071490
#SVMRadial
postResample(results$svm_predictions, results$test_obs)
##      RMSE  Rsquared       MAE 
## 5.4140668 0.8965451 3.8116596

The SVM radial model outperforms all the other models on the test set, with the highest R-squared (0.897) and the lowest RMSE (5.41) and MAE (3.81). It is therefore the best-fitting model for this data and gives the most accurate predictions.
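The four postResample calls above can also be collected into a single comparison table; a minimal sketch reusing the results data frame:

#Test-set metrics for all four models in one table
metrics <- rbind(
  LM   = postResample(results$lm_predictions,   results$test_obs),
  CART = postResample(results$tree_predictions, results$test_obs),
  MARS = postResample(results$mar_predictions,  results$test_obs),
  SVM  = postResample(results$svm_predictions,  results$test_obs)
)
kable(metrics, "html") %>% kable_styling("striped")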