Introduction
In making concrete, there are mixtures and treatments that must be done in order to yield strong concrete. This experiment is based on Prof. I-Cheng Yeh's research on predicting the compressive strength of concrete structures.
The goal of this experiment is to find the “perfect recipe” for predicting the concrete's compressive strength, while also explaining the relation between the ingredients, the age at testing, and the compressive strength.
Within our dataset, there are the following variables, which yield varying compressive strength:
- id: ID of each cement mixture
- cement: the amount of cement (kg) in a \(m^3\) mixture
- slag: the amount of blast furnace slag (kg) in a \(m^3\) mixture
- flyash: the amount of fly ash (kg) in a \(m^3\) mixture
- water: the amount of water (kg) in a \(m^3\) mixture
- super_plast: the amount of superplasticizer (kg) in a \(m^3\) mixture
- coarse_agg: the amount of coarse aggregate (kg) in a \(m^3\) mixture
- fine_agg: the amount of fine aggregate (kg) in a \(m^3\) mixture
- age: the number of resting days before the compressive strength was measured
- strength: the concrete compressive strength measurement in \(MPa\)
Data Preprocessing
For our dataset, we first begin by subsetting out the unnecessary id variable and checking whether our data types are correct or not.
'data.frame': 825 obs. of 10 variables:
$ id : chr "S1" "S2" "S3" "S4" ...
$ cement : num 540 540 332 332 199 ...
$ slag : num 0 0 142 142 132 ...
$ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
$ water : num 162 162 228 228 192 228 228 228 192 192 ...
$ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
$ coarse_agg : num 1040 1055 932 932 978 ...
$ fine_agg : num 676 676 594 594 826 ...
$ age : int 28 28 270 365 360 365 28 28 90 28 ...
$ strength : num 80 61.9 40.3 41 44.3 ...
If our data types are correct, we can now check whether any NA values have slipped into our dataset. For this we can use the syntax below:
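One call that produces the count table below (a sketch; the dataset name ccrt is assumed from the later code):
colSums(is.na(ccrt)) # count NA values per column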
cement slag flyash water super_plast coarse_agg
0 0 0 0 0 0
fine_agg age strength
0 0 0
Data Exploration
Before continuing to model creation, it is better to first look at the correlation between the predictor variables and the target variable, to observe which variables have the strongest influence on the target. For this we shall use the GGally::ggcorr() function on our dataset.
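A sketch of the correlation plot call (the label arguments are assumptions):
GGally::ggcorr(ccrt, label = TRUE, label_size = 3) # correlation heatmap with coefficients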
Our observation of the variables' correlation to the target indicates that only a few variables heavily influence the target strength variable. As indicated above, the ones with a correlation of 0.1 or above (\(corr \geq 0.1\)) are only the cement, slag, super_plast, and age variables; of these, age, super_plast, and cement have the strongest influence.
Besides observing the correlations in our dataset, we also need to examine the range of each predictor, which we will use later when categorizing them.
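A sketch of the call behind the summary below (the column selection is an assumption, since strength is absent from the output):
summary(ccrt[, names(ccrt) != "strength"]) # five-number summary plus mean for each predictor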
cement slag flyash water
Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
super_plast coarse_agg fine_agg age
Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
Using the summary above, we can gather all the information we need to determine the cut point for categorization. Here we shall use the mean of every predictor and make a binary categorization.
We will use the abbreviations abv and blw to ease our categorization: values that are equal to or above the mean of a predictor are considered abv (above), while values lower than the mean are considered blw (below).
To do this, we proceed as below:
# Copy the dataset (the %>% pipe below requires dplyr/magrittr to be loaded)
library(dplyr)
ctg.ccrt <- ccrt
# Categorizing
ctg.ccrt$cement <- ifelse(ctg.ccrt$cement<mean(ccrt$cement),"blw.cement","abv.cement") %>% as.factor()
ctg.ccrt$slag <- ifelse(ctg.ccrt$slag<mean(ccrt$slag),"blw.slag","abv.slag") %>% as.factor()
ctg.ccrt$flyash <- ifelse(ctg.ccrt$flyash<mean(ccrt$flyash),"blw.flyash","abv.flyash") %>% as.factor()
ctg.ccrt$water <- ifelse(ctg.ccrt$water<mean(ccrt$water),"blw.water","abv.water") %>% as.factor()
ctg.ccrt$super_plast <- ifelse(ctg.ccrt$super_plast<mean(ccrt$super_plast),"blw.super_plast","abv.super_plast") %>% as.factor()
ctg.ccrt$coarse_agg <- ifelse(ctg.ccrt$coarse_agg<mean(ccrt$coarse_agg),"blw.coarse_agg","abv.coarse_agg") %>% as.factor()
ctg.ccrt$fine_agg <- ifelse(ctg.ccrt$fine_agg<mean(ccrt$fine_agg),"blw.fine_agg","abv.fine_agg") %>% as.factor()
ctg.ccrt$age <- ifelse(ctg.ccrt$age<mean(ccrt$age),"blw.age","abv.age") %>% as.factor()
# Checking
head(ctg.ccrt)
      cement       slag     flyash     water     super_plast     coarse_agg
1 abv.cement blw.slag blw.flyash blw.water blw.super_plast abv.coarse_agg
2 abv.cement blw.slag blw.flyash blw.water blw.super_plast abv.coarse_agg
3 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
4 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
5 blw.cement abv.slag blw.flyash abv.water blw.super_plast abv.coarse_agg
6 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
fine_agg age strength
1 blw.fine_agg blw.age 79.99
2 blw.fine_agg blw.age 61.89
3 blw.fine_agg abv.age 40.27
4 blw.fine_agg abv.age 41.05
5 abv.fine_agg abv.age 44.30
6 blw.fine_agg abv.age 43.70
After categorizing our dataset, we shall now see how each predictor influences the target variable.
From the boxplot above we can infer that the influence of each predictor varies. Some give a positive influence on the concrete's strength when above their mean, while others give a positive influence when below their mean, hypothetically suggesting that the less we use of them, the stronger our concrete will be.
For our separated categories, we can check whether the group means differ by conducting a t-test using the t.test() function. From the boxplot above we pick the age variable, because it gives a lot of influence to the strength. Here we shall see whether abv.age differs in mean from blw.age.
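A sketch of the test call; judging from the group means in the output below, the age values themselves are compared between the two categories (the vector names are assumptions):
# Split the raw age values by their abv/blw category
abv.age <- ccrt$age[ctg.ccrt$age == "abv.age"]
blw.age <- ccrt$age[ctg.ccrt$age == "blw.age"]
# Welch two-sample t-test on the two groups
t.test(abv.age, blw.age)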
Welch Two Sample t-test
data: abv.age and blw.age
t = 17.488, df = 225.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
86.6052 108.6013
sample estimates:
mean of x mean of y
116.24554 18.64226
After conducting the t-test, we conclude that the null hypothesis (H0) is rejected, and we say that the abv and blw groups have different means.
We also need to see how our data is spread per variable, so that later we can decide whether we need to transform it or not. We can see this by using the hist() function on each variable, as sketched below.
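A minimal sketch of per-variable histograms (the grid layout is an assumption):
par(mfrow = c(2, 4)) # one panel per predictor
for (v in setdiff(names(ccrt), "strength")) {
  hist(ccrt[[v]], main = v, xlab = v)
}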
Our observation above concludes that the slag, flyash, super_plast, and age variables have skewed distributions.
We can use the dataset as is (without transforming any of the variables) or we can transform the skewed variables. For this instance, we shall do both and compare them later by their MAE and R-squared scores.
Data Splitting
In every machine learning task, the dataset needs to be split into two sets: a training set and a testing set. This way, once we have created a machine learning model from our training dataset, we can validate it against our testing dataset to see its performance without a biased end result.
For this we shall split our dataset in an 80:20 proportion, as sketched below.
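A sketch of the split, mirroring the rsample call used later for the transformed data (the package and seed are assumed to match):
library(rsample)
set.seed(1708) # Random
idx <- initial_split(ccrt, 0.8, "strength") # 80:20 split, stratified on strength
tr <- training(idx)
ts <- testing(idx)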
Model Creation
As mentioned earlier, our goal for this dataset and model creation is to build a model that can predict the strength of concrete based on the materials/composition used.
Prior to writing this report (and not shown here), we conducted several tests to see which method yields the lowest MAE and the highest R-squared. From those tests we concluded that the forward stepwise method gives a considerably low MAE with a fairly high R-squared, compared to a model that uses all variables or custom variables chosen with the correlation table as guidance.
1. Non-transformed Dataset
For our model creation, we shall use the following syntax:
lm.none <- lm(strength ~ 1, tr)
lm.all <- lm(strength ~ ., tr)
fwd <- step(lm.none, scope = list(lower=lm.none,upper=lm.all), direction = "forward")
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 25.52148948 5.190790879 4.916686 1.114383e-06
cement 0.10273710 0.005332436 19.266446 7.473091e-66
age 0.12954938 0.007371105 17.575299 6.363016e-57
super_plast 0.34144112 0.107725570 3.169546 1.598014e-03
slag 0.07971645 0.006161837 12.937124 2.997323e-34
water -0.19586023 0.025879119 -7.568273 1.290180e-13
flyash 0.06005248 0.009812022 6.120296 1.609605e-09
Using the forward stepwise method above, we obtain the influencing variables below: \[strength = cement + superplast + age + slag + water + flyash\] with each variable's estimated coefficient as shown below:
(Intercept) cement age super_plast slag water
25.52148948 0.10273710 0.12954938 0.34144112 0.07971645 -0.19586023
flyash
0.06005248
The estimated coefficients show that every unit increase in cement, age, super_plast, slag, and flyash increases the concrete strength by 0.1027371, 0.1295494, 0.3414411, 0.0797164, and 0.0600525 respectively, while every unit increase in water decreases the concrete's strength by 0.1958602.
1.1. Model Evaluation
After creating our model for compressive strength prediction, it is better to evaluate it by looking at the residuals vs fitted plot to assess the model's linearity, as well as other properties such as normality, multicollinearity, and heteroscedasticity.
- Linearity
In our residuals vs fitted plot above, the line turns negative toward the end of the plot. This indicates our current model is not linear whatsoever, which may result in a lower R-squared.
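For reference, the plot above can be reproduced with base R's diagnostic plot (assuming the fitted object fwd from above):
plot(fwd, which = 1) # residuals vs fitted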
- Normality (Shapiro-Wilk test)
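Judging from the data line of the output below, this is the call used:
shapiro.test(fwd$residuals) # test the residuals for normality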
Shapiro-Wilk normality test
data: fwd$residuals
W = 0.99687, p-value = 0.2294
Using the Shapiro-Wilk test, we infer that our residuals' p-value is above 0.05 (\(p > 0.05\)). We conclude that our model's residuals follow a normal distribution.
- Multicollinearity
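The VIF scores below can be obtained with the car package (an assumption; olsrr offers an equivalent):
car::vif(fwd) # variance inflation factor per predictor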
cement age super_plast slag water flyash
1.863885 1.113726 2.348014 1.742804 1.852011 2.368945
The multicollinearity scores above indicate that our variables do not have any multicollinearity problem, since every \(VIF < 5\); the rule of thumb is that \(VIF > 10\) signals a multicollinearity issue.
- Heteroscedasticity
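The studentized Breusch-Pagan test below is available in the lmtest package (an assumption based on the output's header):
lmtest::bptest(fwd) # test for heteroscedasticity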
studentized Breusch-Pagan test
data: fwd
BP = 98.506, df = 6, p-value < 2.2e-16
Seeing that our model's p-value is below 0.05 (\(p < 0.05\)) in the Breusch-Pagan test, we can conclude that there is heteroscedasticity within our current model.
1.2. Model Prediction
After reviewing our model by evaluating it, we now check its performance by using it to predict our test dataset and reviewing the MAE and R-squared produced.
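A sketch of how these two scores might be computed (the MLmetrics package is an assumption):
pred <- predict(fwd, newdata = ts)
MLmetrics::MAE(pred, ts$strength) # mean absolute error
MLmetrics::R2_Score(pred, ts$strength) * 100 # R-squared, in percent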
[1] 8.513054
[1] 61.68372
Our model yielded \(MAE = 8.5\) and \(R^2 = 61\%\), not a bad result for a regular forward stepwise method. At this point, we can hypothesize that our model may have some bad influences (outliers) holding back its current performance.
To test this hypothesis, we shall now try to reduce these bad influences (outliers) and get better performance from our model.
1.3. Outliers Removal
One way to detect and handle the outliers that might be scattered across our dataset is Cook's distance (CD). According to Wikipedia, CD is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis.
CD can be used in several ways:
- To indicate influential data points that are particularly worth checking for validity
- To indicate regions of the design space where it would be good to be able to obtain more data points.
For our model, we can plot the CD bar plot using the olsrr::ols_plot_cooksd_bar() function and see which observations are considered outliers by the CD standard:
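The call, assuming the fitted object fwd from above:
olsrr::ols_plot_cooksd_bar(fwd) # CD bar plot for the stepwise model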
We can see that a large number of observations are considered outliers by CD's standard (indicated in red). The horizontal red line stretching across the plot is CD's threshold, here with a value of 0.006 (shown in the top right corner).
The CD threshold is calculated as \(threshold_{CD} = 4 / n\), where \(n\) is the number of rows in the data; any observation that “jumps” over the threshold line is considered an outlier by CD.
To remove the outliers, we first bind our model’s CD to our training data.
# Make new copy of training dataset
cd_tr <- tr
# Attach CD value to copied training set
cd_tr$cd <- cooks.distance(fwd)
We then find the observations whose CD value is greater than our model's own CD threshold, and look for the CD value at which a clear break occurs (indicating the outliers beyond that value).
# Extract observations whose CD value is greater than the model's CD threshold
lrg_cd <- subset(cd_tr, cd > 4/nrow(cd_tr))
# Quantile the extracted observations to find which quantile holds the outliers
quantile(lrg_cd$cd, probs = seq(0, 1, 0.05))
         0%          5%         10%         15%         20%         25%
0.006217755 0.006485609 0.006628140 0.006798822 0.006811185 0.007018945
30% 35% 40% 45% 50% 55%
0.007580165 0.007772170 0.007986549 0.008130277 0.009077721 0.010348965
60% 65% 70% 75% 80% 85%
0.011695175 0.012573797 0.015801682 0.016165813 0.018868862 0.022142909
90% 95% 100%
0.030828194 0.058063588 0.080110802
From the quantile result above, there is a strong gap between the 45% and 50% quantiles, as if indicating a sudden “jump”. Using the 50% quantile value, we then drop any observation with a CD value above 0.009077721.
After removing the presumed outliers, we feed the newly created dataset, minus the outliers, into the forward stepwise model creator, as sketched below.
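A sketch of the removal and refit (all object names here are assumptions):
# Keep only observations below the 50% quantile cut found above
cln_tr <- subset(cd_tr, cd < 0.009077721)
cln_tr$cd <- NULL # drop the helper column before refitting
# Rerun forward stepwise selection on the cleaned training data
cln.none <- lm(strength ~ 1, cln_tr)
cln.all <- lm(strength ~ ., cln_tr)
fwd.cln <- step(cln.none, scope = list(lower = cln.none, upper = cln.all), direction = "forward")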
Start: AIC=3562.41
strength ~ 1
Df Sum of Sq RSS AIC
+ cement 1 42352 129317 3384.2
+ age 1 33826 137844 3424.8
+ super_plast 1 25266 146404 3463.2
+ water 1 14481 157189 3508.4
+ fine_agg 1 7442 164228 3536.2
+ coarse_agg 1 3893 167776 3549.8
+ slag 1 3241 168429 3552.3
+ flyash 1 1299 170370 3559.6
<none> 171670 3562.4
Step: AIC=3384.23
strength ~ cement
Df Sum of Sq RSS AIC
+ age 1 27887.3 101430 3231.7
+ super_plast 1 21315.9 108002 3271.7
+ slag 1 13741.9 115576 3314.8
+ water 1 10939.4 118378 3330.0
+ flyash 1 2296.4 127021 3374.8
+ coarse_agg 1 1694.9 127623 3377.8
+ fine_agg 1 1319.8 127998 3379.7
<none> 129317 3384.2
Step: AIC=3231.75
strength ~ cement + age
Df Sum of Sq RSS AIC
+ super_plast 1 28399.7 73030 3024.8
+ water 1 20021.2 81409 3093.9
+ slag 1 15479.7 85950 3128.4
+ flyash 1 3698.0 97732 3210.1
+ coarse_agg 1 2171.8 99258 3220.0
+ fine_agg 1 361.7 101069 3231.5
<none> 101430 3231.7
Step: AIC=3024.82
strength ~ cement + age + super_plast
Df Sum of Sq RSS AIC
+ slag 1 12645.2 60385 2905.9
+ fine_agg 1 2771.6 70259 3002.2
+ water 1 2073.7 70957 3008.5
+ flyash 1 442.4 72588 3023.0
<none> 73030 3024.8
+ coarse_agg 1 21.7 73009 3026.6
Step: AIC=2905.9
strength ~ cement + age + super_plast + slag
Df Sum of Sq RSS AIC
+ water 1 4295.3 56090 2861.0
+ flyash 1 3303.6 57082 2872.1
+ coarse_agg 1 804.5 59581 2899.4
<none> 60385 2905.9
+ fine_agg 1 54.0 60331 2907.3
Step: AIC=2860.97
strength ~ cement + age + super_plast + slag + water
Df Sum of Sq RSS AIC
+ flyash 1 4273.0 51817 2812.6
+ fine_agg 1 1176.2 54914 2849.5
<none> 56090 2861.0
+ coarse_agg 1 4.3 56086 2862.9
Step: AIC=2812.57
strength ~ cement + age + super_plast + slag + water + flyash
Df Sum of Sq RSS AIC
<none> 51817 2812.6
+ coarse_agg 1 27.363 51790 2814.2
+ fine_agg 1 0.005 51817 2814.6
Using this new model, we then predict the strength and check the MAE and R-squared scores:
[1] 8.902201
[1] 69.81593
At this point, we can say that by removing the outliers using CD, the R-squared score increased to around 69.8159262 from the previous 61.6837156, while our MAE worsened to 8.9022011 from the previous 8.5130542.
Thus, for this experiment on removing the outliers, we would say that removing outliers has a positive effect in increasing the R-squared, but beware of the worsened MAE.
2. Transformed Dataset
As mentioned earlier when exploring the data, some of our variables have skewed distributions. To normalize the spread within those variables, we can apply a logarithmic transformation.
Looking back at our earlier histograms, some variables need to be normalized in order to have a fairly even distribution. For this we shall apply the logarithmic transformation to the skewed variables only:
# Copy the dataset
log.ccrt <- ccrt
# Clip values <= 1 to 1 so the log transform never yields negative values
log.ccrt$slag <- ifelse(log.ccrt$slag<=1,1,log.ccrt$slag)
log.ccrt$slag <- log(log.ccrt$slag)
log.ccrt$flyash <- ifelse(log.ccrt$flyash<=1,1,log.ccrt$flyash)
log.ccrt$flyash <- log(log.ccrt$flyash)
log.ccrt$super_plast <- ifelse(log.ccrt$super_plast<=1,1,log.ccrt$super_plast)
log.ccrt$super_plast <- log(log.ccrt$super_plast)
log.ccrt$age <- ifelse(log.ccrt$age<=1,1,log.ccrt$age)
log.ccrt$age <- log(log.ccrt$age)
2.1. Transformed Data Exploration
After performing the log transform, we review our freshly transformed dataset and look at the spread of its data using the hist() function.
The transformed variables now turn out fairly evenly spread, even though some frequencies are not very high; overall, the distributions look considerably better.
We are also interested in the correlation between the variables and the target variable.
The correlations in our transformed dataset turn out higher than in the untransformed dataset.
2.2. Transformed Model Creation
With the newly log-transformed dataset, we then try to create a new model, first splitting the data with the same 80:20 proportion.
set.seed(1708) # Random
idx <- initial_split(log.ccrt,0.8,"strength")
tr.log <- training(idx)
ts.log <- testing(idx)
And we create the model using the training split:
log.none <- lm(strength ~ 1, tr.log)
log.all <- lm(strength ~ ., tr.log)
fwd.log <- step(log.none, scope = list(lower=log.none,upper=log.all), direction = "forward")
                Estimate   Std. Error    t value      Pr(>|t|)
(Intercept) 32.71053653 6.558415319 4.987567 7.846888e-07
age 8.54907548 0.229518761 37.247829 1.058802e-163
cement 0.08678604 0.003540640 24.511402 1.412444e-94
super_plast 2.35791154 0.434574683 5.425791 8.138015e-08
slag 2.09798670 0.150626730 13.928382 8.460616e-39
water -0.22936981 0.019430782 -11.804456 2.768655e-29
fine_agg -0.02035161 0.004324561 -4.706051 3.084903e-06
flyash 0.32545470 0.193453461 1.682341 9.298079e-02
After transforming the skewed variables using the logarithmic transformation to normalize them, and using the same variable selection method as before (forward stepwise), we obtain the following influencing variables: \[strength = cement + superplast + age + slag + water + flyash + fineagg\]
The predictors here are almost the same as those used in the non-transformed stepwise model. The only difference is the addition of the fine_agg variable.
And by seeing the estimating coefficient,
(Intercept) age cement super_plast slag water
32.71053653 8.54907548 0.08678604 2.35791154 2.09798670 -0.22936981
fine_agg flyash
-0.02035161 0.32545470
the increased use of super_plast and slag, together with the longer drying age shown in the age variable, will definitely increase our concrete's strength, since they contribute coefficients of 2.3579115, 2.0979867, and 8.5490755 respectively to our model.
Meanwhile, the addition of water and fine_agg will decrease the strength by 0.2293698 and 0.0203516 per unit respectively.
Our current assumption as to why this model yields different predictors and values than the non-transformed one is that the transformation spread the data out more, which may give more linear residuals than the non-transformed dataset, thus impacting how the forward stepwise procedure picks the predictors and calculates their values.
2.3. Transformed Model Evaluation
After acquiring the transformed model, we then evaluate it with the same methods as before: checking its linearity, normality, multicollinearity, and heteroscedasticity.
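The same diagnostic calls as before apply, now on fwd.log (a sketch; package choices as assumed earlier):
plot(fwd.log, which = 1) # residuals vs fitted
shapiro.test(fwd.log$residuals) # normality
car::vif(fwd.log) # multicollinearity
lmtest::bptest(fwd.log) # heteroscedasticity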
- Linearity
In our residuals vs fitted plot above, the line runs nearly straight across the plot. This indicates that our data is linear, from which we hypothesize that the model can predict strength values close to the actual strength values (a small MAE).
- Normality (Shapiro-Wilk test)
Shapiro-Wilk normality test
data: fwd.log$residuals
W = 0.9959, p-value = 0.08148
Using the Shapiro-Wilk test, we infer that our residuals' p-value is above 0.05 (\(p > 0.05\)). We conclude that our model's residuals follow a normal distribution.
- Multicollinearity
age cement super_plast slag water fine_agg
1.043669 1.896162 3.222187 1.825590 2.281857 1.706894
flyash
2.891306
The multicollinearity scores above indicate that our variables do not have any multicollinearity problem, since every \(VIF < 5\); the rule of thumb is that \(VIF > 10\) signals a multicollinearity issue.
- Heteroscedasticity
studentized Breusch-Pagan test
data: fwd.log
BP = 75.333, df = 7, p-value = 1.227e-13
Seeing that our model's p-value is below 0.05 (\(p < 0.05\)) in the Breusch-Pagan test, we can conclude that there is heteroscedasticity within our current model.
2.4. Transformed Model Prediction
With our transformed model, we now use it to predict the (also transformed) test dataset.
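Computed the same way as before (MLmetrics is again an assumption):
pred.log <- predict(fwd.log, newdata = ts.log)
MLmetrics::MAE(pred.log, ts.log$strength) # mean absolute error
MLmetrics::R2_Score(pred.log, ts.log$strength) * 100 # R-squared, in percent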
[1] 6.481728
[1] 82.95717
Our model yielded \(MAE = 6.4\) and \(R^2 = 82.9\%\) for the prediction of the test dataset, a much better result than the previous untransformed methods.
Conclusion
For our concrete strength composition analysis, we can state that:
- It is best to use the stepwise method to determine which variables are best suited for the target variable. For this particular case, using the forward direction yields a wonderful result.
- Besides seeing the correlation of each variable with the target variable, it is wise to also look at the spread of the data in each variable, so as to plan whether we need to transform the data or not. This can be done using hist().
- Use Cook's Distance (CD) to determine and remove bad influences (outliers) from the dataset; this may yield an increase in the R-squared of the model created.
- In this instance, applying CD only yielded an improvement in R-squared, not in MAE, and removing the outliers with CD showed only a small change in both MAE and R-squared.