This problem was originally proposed by Prof. I-Cheng Yeh of the Department of Information Management, Chung-Hua University, Hsinchu, Taiwan, in 2007. It is related to his 1998 research on predicting the compressive strength of concrete.
The conventional process of testing the compressive strength of concrete involves casting several cubes for the respective grade and observing the strength of the concrete over a period of time ranging from 7 to 28 days.
Various combinations of the components of concrete are selected, cubes for each combination are cast, and their test strength at 7, 14 and 28 days is noted down. This is a time-consuming and rather tedious process.
This project aims to predict the compressive strength of concrete with maximum accuracy and lowest error (evaluation metric: MAE), taking the quantities of the constituent components as input. Concrete cubes exhibit different compressive strengths depending on whether or not they are cured. Curing is the process of maintaining moisture to ensure uninterrupted hydration of the concrete.
The concrete strength increases if the concrete cubes are cured periodically. The rate of increase is typically summarized as a table of curing time versus the percentage of total strength achieved (table not reproduced here).
At 28 days, concrete achieves about 99% of its total strength, which is why strength measurements are usually taken at 28 days.
The goal of this case is to predict the compressive strength based on the mixture properties.
data <- read.csv("data_input/data-train.csv")
submission <- read.csv("data_input/data-submission.csv")
glimpse(data)
#> Observations: 825
#> Variables: 10
#> $ id <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
#> $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
#> $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
#> $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
#> $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
#> $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
#> $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
#> $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...
The observation data consists of the following variables:
- id: ID of each cement mixture
- cement: the amount of cement (kg) in a m3 mixture
- slag: the amount of blast furnace slag (kg) in a m3 mixture
- flyash: the amount of fly ash (kg) in a m3 mixture
- water: the amount of water (kg) in a m3 mixture
- super_plast: the amount of superplasticizer (kg) in a m3 mixture
- coarse_agg: the amount of coarse aggregate (kg) in a m3 mixture
- fine_agg: the amount of fine aggregate (kg) in a m3 mixture
- age: the number of resting days before the compressive strength measurement
- strength: concrete compressive strength measurement in MPa

Removing the unused column id:
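The removal code itself is not shown in this excerpt; since dplyr appears to be loaded (glimpse() and select() are used elsewhere), a one-liner such as the following would do it:

# drop the identifier column, which carries no predictive information
data <- select(data, -id)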
Take a peek at our dataset:
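The output below was presumably produced with base R's summary():

# five-number summary plus mean for every column
summary(data)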
#> cement slag flyash water
#> Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
#> 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
#> Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
#> Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
#> 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
#> Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
#> super_plast coarse_agg fine_agg age
#> Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
#> 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
#> Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
#> Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
#> 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
#> Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
#> strength
#> Min. : 2.33
#> 1st Qu.:23.64
#> Median :34.57
#> Mean :35.79
#> 3rd Qu.:45.94
#> Max. :82.60
It seems that all of our predictor variables are numeric. We can scale the features (if needed) in the cross-validation section later.
NA value checking :
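The call that produced the result below is not shown; a minimal check (assuming base R's anyNA()) would be:

# TRUE if any value in the dataset is missing
anyNA(data)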
#> [1] FALSE
Outliers checking :
Boxplot of each numeric variable (figure omitted).
Based on the boxplot above we can find outliers in several columns: slag, water, super_plast, fine_agg, age, and strength. We could treat these columns by removing or imputing the outliers. However, for the super_plast and age columns we leave the data as is, because the same extreme values appear in the submission dataset, and the model is intended to predict under those same conditions.
Outliers collecting :
slag <- which(data$slag %in% boxplot(data$slag, plot=FALSE)$out) #2 obs
water <- which(data$water %in% boxplot(data$water, plot=FALSE)$out) #9 obs
# which(data$fine_agg %in% boxplot(data$fine_agg, plot=FALSE)$out) #27 obs
fine_agg <- which(data$fine_agg %in% boxplot(data$fine_agg, plot=FALSE)$out)[23:27] #5
strength <- which(data$strength %in% boxplot(data$strength, plot=FALSE)$out) #5 obs
The next step is to remove the outliers from slag, water, fine_agg (only the top outliers, because 594 is flagged as a fine_agg outlier here but the same value also appears in the submission data) and strength. In total this removes only 2.55% of the observations.
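The removal itself is not shown in this excerpt; a minimal sketch that yields the row count printed below could be:

# drop the collected outlier rows and check how many observations remain
data <- data[-unique(c(slag, water, fine_agg, strength)), ]
nrow(data)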
#> [1] 804
There are several questions we should try to answer in this case by exploring the relationships between the target and the features:
- Is strength positively correlated with age? The correlation between strength and age is 0.347. As mentioned in the introduction, the number of resting days before the measurement affects the concrete's compressive strength.
- Do strength and cement have a strong correlation? The correlation between strength and cement is 0.49. In theory, the compressive strength of concrete is strongly influenced by the ratio of cement to water, so it makes sense that cement and water are variables that influence the strength of concrete.
- Does super_plast have a linear correlation with strength? Superplasticizer is also known as a water reducer. The amount of superplasticizer has a positive correlation with strength (0.35, as seen in the chart above) and a negative correlation with water (see the sketch below).

The next step is to split the data into a training dataset (to build the model) and a testing dataset (to validate the model). There are no strict rules for this split. An 80:20 proportion of training vs. testing data is the most commonly used, often referred to as the [Pareto principle][1]. Traditionally we use 5-fold cross-validation to verify our algorithm, so 80:20 would be fine, but in practice the appropriate split depends on the real situation and the size of the dataset.
[1]: https://en.wikipedia.org/wiki/Pareto_principle
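The correlation chart itself is not reproduced here; the quoted values can be computed with base R's cor(), for example:

# pairwise correlations between the target and selected predictors
cor(data$strength, data$age)
cor(data$strength, data$cement)
cor(data$strength, data$super_plast)
cor(data$water, data$super_plast)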
Specifically for this case, we already have data-train (804 observations after cleaning) and data-submission (205 observations). Therefore we use a 75:25 ratio (around 600:200 observations) to split data-train into a training dataset and a testing/validation dataset, in order to mirror the proportion of the data-submission set that we will predict later.
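The splitting code is not shown in this excerpt; a minimal 75:25 split producing the train and test objects used below (the seed value is an arbitrary assumption) could look like this:

# reproducible 75:25 train/test split of the cleaned data
set.seed(100)
idx <- sample(nrow(data), size = 0.75 * nrow(data))
train <- data[idx, ]
test <- data[-idx, ]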
Build Model
# names(train)
# fit a baseline linear regression on all predictors
LR <- lm(formula = strength ~ ., data = train)
# backward stepwise selection was also explored:
# step(object = LR, direction = "backward")
#> [1] 0.6170792
Based on the model summary, the super_plast variable does not have a significant effect on the target variable strength. That does not mean super_plast has no influence on concrete compressive strength in other cases or datasets; its effect may already be captured by other variables such as water, since, as noted earlier, superplasticizer acts as a water reducer.
Predict
LR.prediction1 <- predict(object = LR, newdata = train)
LR.prediction2 <- predict(object = LR, newdata = test)
Evaluate
#> [1] 8.12385
#> [1] 8.581061
Based on the adjusted R-squared (0.6052) and MAE (8.12 in-sample on the training data and 8.58 out-of-sample on the testing data), we can conclude that the Linear Regression model does not perform well enough. We can try to improve on it later with other modeling approaches.
Build Model
ctrl1 <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
# train the random forest (this can take a while)
RF5 <- train(strength ~ ., data = train, method = "rf", trControl = ctrl1, importance = TRUE)
RF5
#> Random Forest
#>
#> 604 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 544, 544, 543, 543, 544, 544, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 5.996248 0.8892703 4.629198
#> 5 5.436498 0.8942179 4.077684
#> 8 5.477170 0.8904525 4.092211
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.
Out-of-Bag (OOB) Error
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 5
#>
#> Mean of squared residuals: 28.80675
#> % Var explained: 89.27
Predict
RF5.prediction1 <- predict(RF5, newdata = train)
RF5.prediction2 <- predict(RF5, newdata = test)
head(RF5.prediction2)
#> 20 21 29 36 43 46
#> 39.52133 50.68752 41.44782 14.60070 11.18817 37.41862
Evaluate
#> [1] 2.235684
#> [1] 2.528387
To get the value of R-squared we can look at the % Var explained in the summary of our final model, or calculate it manually as below.
actual <- test$strength
predicted <- RF5.prediction2
R2test_5 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))
R2test_5
#> [1] 0.9453077
Build Model
ctrl2 <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
# train the random forest (this can take a while)
RF10 <- train(strength ~ ., data = train, method = "rf", trControl = ctrl2, importance = TRUE)
RF10
#> Random Forest
#>
#> 604 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 544, 544, 543, 543, 544, 544, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 5.996248 0.8892703 4.629198
#> 5 5.436498 0.8942179 4.077684
#> 8 5.477170 0.8904525 4.092211
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 5.
Out-of-Bag (OOB) Error
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 5
#>
#> Mean of squared residuals: 28.80675
#> % Var explained: 89.27
Predict
RF10.prediction1 <- predict(RF10, newdata = train)
RF10.prediction2 <- predict(RF10, newdata = test)
head(RF10.prediction2)
#> 20 21 29 36 43 46
#> 39.52133 50.68752 41.44782 14.60070 11.18817 37.41862
Evaluate
#> [1] 2.235684
#> [1] 2.528387
actual <- test$strength
predicted <- RF10.prediction2
R2test_10 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))
R2test_10
#> [1] 0.9453077
Based on the Random Forest model, we get an R-squared of 0.8927 (from % Var explained; the manual calculation on the test set gives 0.945) and MAE of 2.24 for the training data and 2.53 for the testing data. For this case there is no difference in the performance metrics even when using a different combination of repeats and k-fold number.
DATA PREPROCESS USING RECIPES
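The recipe and split code for this section are not shown in this excerpt. The back-transform function used later (which multiplies by the sds stored in step 4, adds the means stored in step 3, then squares) suggests a recipe with a square-root step followed by centering and scaling. A minimal sketch under those assumptions (train_raw and test_raw are placeholder names for the raw split) could be:

library(recipes)

# assumed recipe: sqrt transform, drop highly correlated predictors,
# then center and scale all numeric variables (center = step 3, scale = step 4)
data_recipe <- recipe(strength ~ ., data = train_raw) %>%
  step_sqrt(all_numeric()) %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()

# apply the trained recipe to obtain the scaled training and testing sets
data_train <- juice(data_recipe)
data_test <- bake(data_recipe, test_raw)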
randomForest engine package
MODEL FITTING USING PARSNIP
# set-up model specification
model_spec <- rand_forest(
mode = "regression",
mtry = 5,
trees = 650,
min_n = 1
)
model_spec
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 5
#> trees = 650
#> min_n = 1
# set-up model engine
model_engine1 <- set_engine(
object = model_spec,
engine = "randomForest"
)
model_engine1
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 5
#> trees = 650
#> min_n = 1
#>
#> Computational engine: randomForest
To fit our model, we have two options: fit(), which takes a formula, or fit_xy(), which takes the predictors and the outcome separately.
#> parsnip model object
#>
#> Fit in: 3.4s
#> Call:
#> randomForest(x = as.data.frame(x), y = y, ntree = ~650, mtry = ~5, nodesize = ~1)
#> Type of random forest: regression
#> Number of trees: 650
#> No. of variables tried at each split: 5
#>
#> Mean of squared residuals: 0.09477608
#> % Var explained: 90.51
# fit the model
TidyRF1 <- fit_xy(
object = model_engine1,
x = select(data_train, -strength),
y = select(data_train, strength)
)
Predict
scaled_prediction <- data_test %>%
select(strength) %>%
bind_cols((predict(TidyRF1, data_test)))
# quick check
scaled_prediction
Evaluate
Back Transform
# data_recipe$steps
# back-transform: undo centering and scaling using the recipe's stored
# means and sds for strength, then square to undo the sqrt step
recipe_bt <- function(x, data_recipe){
  means <- data_recipe$steps[[3]]$means[["strength"]]
  sds <- data_recipe$steps[[4]]$sds[["strength"]]
  (x*sds + means)^2
}
revert_prediction1 <- apply(scaled_prediction, MARGIN = 2, FUN = recipe_bt, data_recipe = data_recipe) %>%
as.data.frame()
head(revert_prediction1)
Evaluation Metrics
#> [1] "data.frame"
revert_prediction1 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
RMSE = rmse_vec(strength, .pred),
MAE = mae_vec(strength, .pred),
MAPE = mape_vec(strength, .pred),
MASE = mase_vec(strength, .pred)
)
ranger engine package
MODEL FITTING USING PARSNIP
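The definition of model_spec2 is not shown here; judging from the printed specification below, it is presumably identical to model_spec, for example:

# assumed: same random forest specification, reused for the ranger engine
model_spec2 <- rand_forest(
  mode = "regression",
  mtry = 5,
  trees = 650,
  min_n = 1
)
model_spec2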
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 5
#> trees = 650
#> min_n = 1
# set-up model engine
model_engine2 <- set_engine(
object = model_spec2,
engine = "ranger",
seed = 100
)
model_engine2
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 5
#> trees = 650
#> min_n = 1
#>
#> Engine-Specific Arguments:
#> seed = 100
#>
#> Computational engine: ranger
using fit()
# fit the model
TidyRanger <- fit(
object = model_engine2,
formula = strength ~ .,
data = data_train
)
TidyRanger
using fit_xy()
# fit the model
TidyRanger <- fit_xy(
object = model_engine2,
x = select(data_train, -strength),
y = select(data_train, strength)
)
# quick check
TidyRanger
#> parsnip model object
#>
#> Fit in: 1.6s
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~650, min.node.size = ~1, seed = ~100, num.threads = 1, verbose = FALSE)
#>
#> Type: Regression
#> Number of trees: 650
#> Sample size: 684
#> Number of independent variables: 8
#> Mtry: 5
#> Target node size: 1
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.09295988
#> R squared (OOB): 0.9070401
Predict
scaled_prediction2 <- data_test %>%
select(strength) %>%
bind_cols((predict(TidyRanger, data_test)))
# quick check
scaled_prediction2
Evaluate
revert_prediction2 <- apply(scaled_prediction2, MARGIN = 2, FUN = recipe_bt, data_recipe = data_recipe) %>%
as.data.frame()
head(revert_prediction2)
Evaluation Metrics
#> [1] "data.frame"
revert_prediction2 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
RMSE = rmse_vec(strength, .pred),
MAE = mae_vec(strength, .pred),
MAPE = mape_vec(strength, .pred),
MASE = mase_vec(strength, .pred)
)
In this project I tuned the model using a number of modeling approaches to improve its performance.
LR_Rsquared <- round(summary(LR)$r.squared*100,2)
LR_MAE <- round(MAE(pred = LR.prediction2, obs = test$strength),2)
LR_Model <- cbind(Model = "Linear Regression", `R-squared`=LR_Rsquared,MAE=LR_MAE ) %>%
as.data.frame()
LR_Model
RF5_Rsquared <- round(R2test_5*100,2)
RF5_MAE <- round(MAE(RF5.prediction2, test$strength),2)
RF5_Model <- cbind(Model = "Random Forest (5r5n)", `R-squared`=RF5_Rsquared,MAE=RF5_MAE ) %>%
as.data.frame()
RF5_Model
RF10_Rsquared <- round(R2test_10*100,2)
RF10_MAE <- round(MAE(RF10.prediction2, test$strength),2)
RF10_Model <- cbind(Model = "Random Forest (10r10n)", `R-squared`=RF10_Rsquared,MAE=RF10_MAE ) %>%
as.data.frame()
RF10_Model
Tidyrf1 <- revert_prediction1 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
MAE = mae_vec(strength, .pred)
)
TidyRF1_Rsquared <- round(Tidyrf1$R_SQUARED*100,2)
TidyRF1_MAE <- round(Tidyrf1$MAE,2)
TidyRF1_Model <- cbind(Model = "Tidymodels Random Forest", `R-squared`=TidyRF1_Rsquared,MAE=TidyRF1_MAE ) %>%
as.data.frame()
TidyRF1_Model
Tidyranger <- revert_prediction2 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
MAE = mae_vec(strength, .pred)
)
TidyRanger_Rsquared <- round(Tidyranger$R_SQUARED*100,2)
TidyRanger_MAE <- round(Tidyranger$MAE,2)
TidyRanger_Model <- cbind(Model = "Tidymodels Ranger", `R-squared`=TidyRanger_Rsquared,MAE=TidyRanger_MAE ) %>%
as.data.frame()
TidyRanger_Model
# saveRDS(combined2, "combined8515.RDS")
combined1 <- readRDS("combined7525.RDS")
combined2 <- readRDS("combined8515.RDS")
The following is a comparison of the performance of the models that I have tuned: combined1 uses a 75:25 proportion of training and validation data, while combined2 uses an 85:15 proportion.
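The code that assembles these comparison tables before saving them is not shown; a minimal sketch (assuming the one-row model summaries are simply stacked) could be:

# stack the one-row model summaries into a single comparison table
combined1 <- rbind(LR_Model, RF5_Model, RF10_Model, TidyRF1_Model, TidyRanger_Model)
combined1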
We cannot say that 85:15 is better than 75:25 as a training/validation proportion, because, as mentioned before, everything depends on the conditions and the goals to be achieved. In this case I want to build the best model to predict the strength value of the submission data that has been provided.
randomForest engine
Based on several modeling and tuning experiments, I decided to use the Tidymodels randomForest engine as the best tuned model for this case. I also tried several combinations of the available tidymodels function parameters, as follows:
- mtry: the number of predictors that will be randomly sampled at each split when creating the tree models.
- trees: the number of trees contained in the ensemble.
- min_n: the minimum number of data points in a node that are required for the node to be split further.

Pre-processing steps tried in the recipe: step_center(), step_scale(), step_corr(), step_sqrt().

The following are the results of the combinations that have been tried:
(mtry, trees, data pre-process) : R-Squared(%) / MAE
5, 650 corr : 88.93 / 3.73
5, 500 sqrt : 88.82 / 3.71
5, 650 corr + sqrt : 89.07 / 3.69
min_n = 15 : 87.32 / 4.30
min_n = 1 :
train:validation proportion :
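Before predicting on the submission data, a quick look at its structure (the inspection call itself is not shown; presumably glimpse() again):

# structure of the submission dataset
glimpse(submission)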
#> Observations: 205
#> Variables: 10
#> $ id <fct> S826, S827, S828, S829, S830, S831, S832, S833, S834, S...
#> $ cement <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5,...
#> $ slag <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237...
#> $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ water <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0,...
#> $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
#> $ coarse_agg <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0,...
#> $ fine_agg <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0,...
#> $ age <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, ...
#> $ strength <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
submission$strength <- c(1:205) # replace the NA target with placeholder values so the recipe can be baked
data_sub <- bake(data_recipe, submission)
head(data_sub)
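The prediction step that produces scaled.prediction (used in the back-transform below) is not shown; it was presumably analogous to the earlier test-set prediction with the chosen randomForest model, for example:

# predict the (scaled) strength of the submission mixtures with the chosen model
scaled.prediction <- data_sub %>%
  select(strength) %>%
  bind_cols(predict(TidyRF1, data_sub))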
# data_recipe$steps
# same back-transform helper as before: undo scaling and the sqrt step
recipe_bt <- function(x, data_recipe){
  means <- data_recipe$steps[[3]]$means[["strength"]]
  sds <- data_recipe$steps[[4]]$sds[["strength"]]
  (x*sds + means)^2
}
revert_sub <- apply(scaled.prediction, MARGIN = 2, FUN = recipe_bt, data_recipe = data_recipe) %>%
as.data.frame()
head(revert_sub)
#> [1] 5.523025 76.601212
After submitting, the results were 91% for R-squared and 3.5 for MAE as the model performance metrics.