Concrete is the most important material in civil engineering, and its compressive strength is a highly nonlinear function of age and ingredients: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
The dataset was sourced from the UC Irvine Machine Learning Repository.
Original Owner and Donor
Prof. I-Cheng Yeh,
Department of Information Management,
Chung-Hua University,
Hsin Chu, Taiwan 30067, R.O.C.
E-mail: icyeh@chu.edu.tw
TEL: 886-3-5186511
The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data are in raw form (not scaled).
Number of instances (observations): 1030
Number of Attributes: 9
Attribute breakdown: 8 quantitative input variables, and 1 quantitative output variable
Missing Attribute Values: None
     • Cement – quantitative – kg per m³ of mixture
     • Blast Furnace Slag – quantitative – kg per m³ of mixture
     • Fly Ash – quantitative – kg per m³ of mixture
     • Water – quantitative – kg per m³ of mixture
     • Superplasticizer – quantitative – kg per m³ of mixture
     • Coarse Aggregate – quantitative – kg per m³ of mixture
     • Fine Aggregate – quantitative – kg per m³ of mixture
     • Age – quantitative – days (1–365)
     • Concrete compressive strength (output) – quantitative – MPa
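Once the data are loaded (the read.csv call appears below), these counts are easy to verify. A minimal sanity-check sketch:
#Run after the read.csv call below
stopifnot(nrow(conc) == 1030, ncol(conc) == 9) #matches the stated instance/attribute counts
colSums(is.na(conc)) #should be all zeros: no missing values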
#Load required packages
library(corrplot) #correlation matrix plots
library(caret) #model training and resampling
library(ggplot2) #plotting
library(knitr) #kable tables
library(e1071) #skewness()
library(rattle) #fancyRpartPlot()
library(kableExtra) #table styling
#Read the raw data from a local path
conc <- read.csv("D:\\R\\Data Sets\\concrete.csv")
kable(head(conc), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| cementcomp | slag | flyash | water | superplastisizer | coraseaggr | finraggr | age | CCS |
|---|---|---|---|---|---|---|---|---|
| 540.0 | 0.0 | 0 | 162 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 540.0 | 0.0 | 0 | 162 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 198.6 | 132.4 | 0 | 192 | 0.0 | 978.4 | 825.5 | 360 | 44.30 |
| 266.0 | 114.0 | 0 | 228 | 0.0 | 932.0 | 670.0 | 90 | 47.03 |
summary(conc)
## cementcomp slag flyash water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.30 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## superplastisizer coraseaggr finraggr age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.400 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.205 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.200 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## CCS
## Min. : 2.33
## 1st Qu.:23.71
## Median :34.45
## Mean :35.82
## 3rd Qu.:46.13
## Max. :82.60
#skewness of each column (skewness() comes from e1071)
apply(conc, 2, skewness)
## cementcomp slag flyash water
## 0.50799821 0.79838622 0.53578981 0.07441116
## superplastisizer coraseaggr finraggr age
## 0.90456195 -0.04010268 -0.25227315 3.25966169
## CCS
## 0.41576358
Age is heavily right-skewed (skewness ≈ 3.26). Note from the summary that Age's minimum is 1, so a log transform is feasible for it; predictors such as slag, fly ash, and superplasticizer do contain zeros and cannot be log-transformed directly.
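A minimal sketch of such a transform (the helper column log_age is illustrative and dropped afterwards; the models below use raw age):
#log-transform age and re-check its skewness
conc$log_age <- log(conc$age) #safe: age has a minimum of 1
skewness(conc$log_age)
conc$log_age <- NULL #drop the helper column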
ggplot(conc, aes(x = CCS)) + geom_histogram(bins = 40) + labs(x = "CCS", y = "Frequency", title = "Distribution of CCS")
corrplot(cor(conc), method = "number",type = "upper")
The correlation plot suggests a notable negative relationship between water and superplasticizer, so that pair is plotted directly:
ggplot(conc, aes(x = superplastisizer, y = water)) + geom_point() + labs(x = "Superplastisizer", y = "Water", title = "Water vs Superplastisizer")
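To put a number on that relationship, the correlation between the two columns can be checked directly:
cor(conc$water, conc$superplastisizer)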
#Splitting data into train and test sets
set.seed(123) #fix the random seed so the partition is reproducible (seed value is arbitrary)
inTrain <- createDataPartition(conc$CCS, p = 0.7, list = FALSE)
trainSET <- conc[inTrain, ]
testSET <- conc[-inTrain, ]
#Data frame to store test-set predictions
results <- data.frame(test_obs = testSET$CCS)
#10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
1. Linear Regression
lmmodel <- train(CCS ~ ., data = trainSET, method = "lm", trControl = ctrl)
lmmodel
## Linear Regression
##
## 722 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 649, 650, 650, 649, 650, 650, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 10.58996 0.6070471 8.375624
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
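caret stores the fitted lm object in lmmodel$finalModel; it can be inspected directly if the coefficients are of interest (an optional check):
summary(lmmodel$finalModel)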
Predict the results on the test set.
results$lm_predictions <- predict(lmmodel, testSET)
2. Decision Tree (CART)
treemodel <- train(CCS ~ ., data = trainSET, method = "rpart", trControl = ctrl, tuneLength = 5)
plot(treemodel)
Fancy Decision Tree Plot
fancyRpartPlot(treemodel$finalModel)
Predict the results on the test set.
results$tree_predictions <- predict(treemodel, testSET)
3. Multivariate Adaptive Regression Splines (MARS)
marmodel <- train(CCS ~ ., data = trainSET, method = "earth", trControl = ctrl, tuneLength = 15)
plot(marmodel)
Predict the results on the test set.
results$mar_predictions <- predict(marmodel, testSET)
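As an optional check, caret's varImp() ranks the predictors the MARS model retained:
varImp(marmodel)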
4. Support Vector Machine (Radial Kernel)
svmmodel <- train(CCS ~ ., data = trainSET, method = "svmRadial", trControl = ctrl, tuneLength = 10)
svmmodel
## Support Vector Machines with Radial Basis Function Kernel
##
## 722 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 650, 650, 650, 650, 650, 648, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 8.193874 0.7727503 6.216962
## 0.50 7.510859 0.8057671 5.588933
## 1.00 6.975163 0.8315825 5.139610
## 2.00 6.614476 0.8484595 4.842685
## 4.00 6.335832 0.8617612 4.601815
## 8.00 6.187779 0.8681764 4.447095
## 16.00 6.024921 0.8747392 4.274296
## 32.00 5.994894 0.8761061 4.134932
## 64.00 6.047779 0.8746937 4.126316
## 128.00 6.257243 0.8664588 4.220552
##
## Tuning parameter 'sigma' was held constant at a value of 0.1117766
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1117766 and C = 32.
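Note that with tuneLength, caret estimates sigma once (via kernlab's sigest) and varies only the cost C. To tune both parameters, a custom grid can be supplied instead; a minimal sketch with illustrative grid values:
svmgrid <- expand.grid(sigma = c(0.05, 0.1, 0.2), C = c(8, 16, 32, 64))
svmmodel2 <- train(CCS ~ ., data = trainSET, method = "svmRadial", trControl = ctrl, tuneGrid = svmgrid)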
plot(svmmodel)
Predict the results on the test set.
results$svm_predictions <- predict(svmmodel, testSET)
Predicted Values
kable(head(results), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| test_obs | lm_predictions | tree_predictions | mar_predictions | svm_predictions |
|---|---|---|---|---|
| 40.27 | 55.56344 | 40.41024 | 47.18905 | 40.89970 |
| 44.30 | 59.60197 | 40.41024 | 39.63560 | 46.70834 |
| 43.70 | 66.59378 | 57.15409 | 50.17871 | 43.40097 |
| 36.45 | 29.91729 | 57.15409 | 38.31050 | 37.20360 |
| 40.56 | 36.66490 | 57.15409 | 48.29527 | 39.59831 |
| 42.62 | 47.84235 | 57.15409 | 49.37454 | 42.85674 |
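A quick visual check of fit, sketched here for the SVM predictions (the dashed line marks perfect prediction):
ggplot(results, aes(x = test_obs, y = svm_predictions)) + geom_point(alpha = 0.5) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + labs(x = "Observed CCS (MPa)", y = "Predicted CCS (MPa)", title = "SVM: Predicted vs Observed")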
#Linear Regression
postResample(results$lm_predictions,results$test_obs)
## RMSE Rsquared MAE
## 10.1496003 0.6338727 8.2266351
#CART
postResample(results$tree_predictions, results$test_obs)
## RMSE Rsquared MAE
## 11.4951801 0.5315966 9.2424359
#MARS
postResample(results$mar_predictions, results$test_obs)
## RMSE Rsquared MAE
## 6.2252879 0.8622315 4.8071490
#SVMRadial
postResample(results$svm_predictions, results$test_obs)
## RMSE Rsquared MAE
## 5.4140668 0.8965451 3.8116596
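The four calls above can also be collapsed into one comparison table (a small convenience sketch):
metrics <- sapply(results[, -1], postResample, obs = results$test_obs)
round(metrics, 3) #one column per model: RMSE, Rsquared, MAE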
The SVM with a radial kernel outperforms all the other models, with the highest R² (≈ 0.90) and the lowest RMSE and MAE on the test set. It is the best-fitting model for this data and gives the most accurate predictions.