Introduction
In making concrete, there are mixtures and treatments that must be done in order to yield strong concrete. This experiment is based on Prof. I-Cheng Yeh's research on predicting the compressive strength of concrete structures.
The goal of this experiment is to find the “perfect recipe” for predicting the concrete's compressive strength, while also explaining the relation between the ingredients, the age at testing, and the compressive strength.
Within our dataset, there are the following variables, which yield varying compressive strength:
- id: ID of each cement mixture
- cement: the amount of cement (kg) in a \(m^3\) mixture
- slag: the amount of blast furnace slag (kg) in a \(m^3\) mixture
- flyash: the amount of fly ash (kg) in a \(m^3\) mixture
- water: the amount of water (kg) in a \(m^3\) mixture
- super_plast: the amount of superplasticizer (kg) in a \(m^3\) mixture
- coarse_agg: the amount of coarse aggregate (kg) in a \(m^3\) mixture
- fine_agg: the amount of fine aggregate (kg) in a \(m^3\) mixture
- age: the number of resting days before the compressive strength was measured
- strength: the concrete compressive strength measurement in \(MPa\)
Data Preprocessing
For our dataset, we first begin by subsetting out the unnecessary id variable and checking whether our data types are correct or not.
'data.frame': 825 obs. of 10 variables:
$ id : chr "S1" "S2" "S3" "S4" ...
$ cement : num 540 540 332 332 199 ...
$ slag : num 0 0 142 142 132 ...
$ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
$ water : num 162 162 228 228 192 228 228 228 192 192 ...
$ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
$ coarse_agg : num 1040 1055 932 932 978 ...
$ fine_agg : num 676 676 594 594 826 ...
$ age : int 28 28 270 365 360 365 28 28 90 28 ...
$ strength : num 80 61.9 40.3 41 44.3 ...
If our data types are correct, we can now check whether any NA values have slipped into our dataset. For this we can use the syntax below:
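One call that produces the count table below (a sketch; the dataset name ccrt is assumed from the later code):
colSums(is.na(ccrt)) # count NA values per column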
cement slag flyash water super_plast coarse_agg
0 0 0 0 0 0
fine_agg age strength
0 0 0
Data Exploration
Before continuing to model creation, it is better to first look at the correlation between the predictor variables and the target variable, to observe which variables have the strongest influence on the target. For this we shall use the GGally::ggcorr() function on our dataset.
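A sketch of the correlation plot call (the label arguments are assumptions):
GGally::ggcorr(ccrt, label = TRUE, label_size = 3) # correlation heatmap with coefficients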
Our observation of the variables' correlation to the target indicates that only a few variables heavily influence the target strength variable. As indicated above, the ones with a correlation of 0.1 or above (\(corr \geq 0.1\)) are only the cement, slag, super_plast, and age variables; of these, age, super_plast, and cement have the strongest influence.
Besides observing the correlations in our dataset, we also need to examine the range of each predictor, which we will use later when categorizing them.
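A sketch of the call behind the summary below (the column selection is an assumption, since strength is absent from the output):
summary(ccrt[, names(ccrt) != "strength"]) # five-number summary plus mean for each predictor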
cement slag flyash water
Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
super_plast coarse_agg fine_agg age
Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
Using the summary above, we can gather all the information we need to determine the cut point for categorization. Here we shall use the mean of every predictor and make a binary categorization.
We will use the abbreviations abv and blw to ease our categorization: values that are equal to or above the mean of a predictor are considered abv (above), while values lower than the mean are considered blw (below).
To do this, we proceed as below:
# Copy the dataset (the %>% pipe below requires dplyr/magrittr to be loaded)
library(dplyr)
ctg.ccrt <- ccrt
# Categorizing
ctg.ccrt$cement <- ifelse(ctg.ccrt$cement<mean(ccrt$cement),"blw.cement","abv.cement") %>% as.factor()
ctg.ccrt$slag <- ifelse(ctg.ccrt$slag<mean(ccrt$slag),"blw.slag","abv.slag") %>% as.factor()
ctg.ccrt$flyash <- ifelse(ctg.ccrt$flyash<mean(ccrt$flyash),"blw.flyash","abv.flyash") %>% as.factor()
ctg.ccrt$water <- ifelse(ctg.ccrt$water<mean(ccrt$water),"blw.water","abv.water") %>% as.factor()
ctg.ccrt$super_plast <- ifelse(ctg.ccrt$super_plast<mean(ccrt$super_plast),"blw.super_plast","abv.super_plast") %>% as.factor()
ctg.ccrt$coarse_agg <- ifelse(ctg.ccrt$coarse_agg<mean(ccrt$coarse_agg),"blw.coarse_agg","abv.coarse_agg") %>% as.factor()
ctg.ccrt$fine_agg <- ifelse(ctg.ccrt$fine_agg<mean(ccrt$fine_agg),"blw.fine_agg","abv.fine_agg") %>% as.factor()
ctg.ccrt$age <- ifelse(ctg.ccrt$age<mean(ccrt$age),"blw.age","abv.age") %>% as.factor()
# Checking
head(ctg.ccrt)
      cement       slag     flyash     water     super_plast     coarse_agg
1 abv.cement blw.slag blw.flyash blw.water blw.super_plast abv.coarse_agg
2 abv.cement blw.slag blw.flyash blw.water blw.super_plast abv.coarse_agg
3 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
4 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
5 blw.cement abv.slag blw.flyash abv.water blw.super_plast abv.coarse_agg
6 abv.cement abv.slag blw.flyash abv.water blw.super_plast blw.coarse_agg
fine_agg age strength
1 blw.fine_agg blw.age 79.99
2 blw.fine_agg blw.age 61.89
3 blw.fine_agg abv.age 40.27
4 blw.fine_agg abv.age 41.05
5 abv.fine_agg abv.age 44.30
6 blw.fine_agg abv.age 43.70
After categorizing our dataset, we shall now see how each predictor influences the target variable.
From the boxplot above we can infer that the influence of each predictor varies. Some give a positive influence on the concrete's strength when above their mean, while others give a positive influence when below their mean, hypothetically suggesting that the less we use of them, the stronger our concrete will be.
For our separated categories, we can check whether the group means differ by conducting a t-test using the t.test() function. From the boxplot above we pick the age variable, because it gives a lot of influence to the strength. Here we shall see whether abv.age differs in mean from blw.age.
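A sketch of the test call; judging from the group means in the output below, the age values themselves are compared between the two categories (the vector names are assumptions):
# Split the raw age values by their abv/blw category
abv.age <- ccrt$age[ctg.ccrt$age == "abv.age"]
blw.age <- ccrt$age[ctg.ccrt$age == "blw.age"]
# Welch two-sample t-test on the two groups
t.test(abv.age, blw.age)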
Welch Two Sample t-test
data: abv.age and blw.age
t = 17.488, df = 225.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
86.6052 108.6013
sample estimates:
mean of x mean of y
116.24554 18.64226
After conducting the t-test, we conclude that the null hypothesis (H0) is rejected, and we say that the abv and blw groups have different means.
We also need to see how our data is spread per variable, so that later we can decide whether we need to transform it or not. We can see this by using the hist() function on each variable, as sketched below.
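A minimal sketch of per-variable histograms (the grid layout is an assumption):
par(mfrow = c(2, 4)) # one panel per predictor
for (v in setdiff(names(ccrt), "strength")) {
  hist(ccrt[[v]], main = v, xlab = v)
}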
Our observation above concludes that the slag, flyash, super_plast, and age variables have skewed distributions.
We can use the dataset as is (without transforming any of the variables) or we can transform the skewed variables. For this instance, we shall do both and compare them later by their MAE and R-squared scores.
Data Splitting
In every machine learning task, the dataset needs to be split into two sets: a training set and a testing set. This way, once we have created a machine learning model from our training dataset, we can validate it against our testing dataset to see its performance without a biased end result.
For this we shall split our dataset in an 80:20 proportion, as sketched below.
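A sketch of the split, mirroring the rsample call used later for the transformed data (the package and seed are assumed to match):
library(rsample)
set.seed(1708) # Random
idx <- initial_split(ccrt, 0.8, "strength") # 80:20 split, stratified on strength
tr <- training(idx)
ts <- testing(idx)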
Model Creation
As mentioned earlier, our goal for this dataset and model creation is to build a model that can predict the strength of concrete based on the materials/composition used.
Prior to writing this report (and not shown here), we conducted several tests to see which method yields the lowest MAE and the highest R-squared. From those tests we concluded that the forward stepwise method gives a considerably low MAE with a fairly high R-squared, compared to a model that uses all variables or custom variables chosen with the correlation table as guidance.
1. Non-transformed Dataset
For our model creation, we shall use the following syntax:
lm.none <- lm(strength ~ 1, tr)
lm.all <- lm(strength ~ ., tr)
fwd <- step(lm.none, scope = list(lower=lm.none,upper=lm.all), direction = "forward")
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 25.52148948 5.190790879 4.916686 1.114383e-06
cement 0.10273710 0.005332436 19.266446 7.473091e-66
age 0.12954938 0.007371105 17.575299 6.363016e-57
super_plast 0.34144112 0.107725570 3.169546 1.598014e-03
slag 0.07971645 0.006161837 12.937124 2.997323e-34
water -0.19586023 0.025879119 -7.568273 1.290180e-13
flyash 0.06005248 0.009812022 6.120296 1.609605e-09
Using the forward stepwise method above, we obtain the influencing variables below: \[strength = cement + superplast + age + slag + water + flyash\] with each variable's estimated coefficient as shown below:
(Intercept) cement age super_plast slag water
25.52148948 0.10273710 0.12954938 0.34144112 0.07971645 -0.19586023
flyash
0.06005248
The estimated coefficients show that every unit increase in cement, age, super_plast, slag, and flyash increases the concrete strength by 0.1027371, 0.1295494, 0.3414411, 0.0797164, and 0.0600525 respectively, while every unit increase in water decreases the concrete's strength by 0.1958602.
1.1. Model Evaluation
After creating our model for compressive strength prediction, it is better to evaluate it by looking at the residuals vs fitted plot to assess the model's linearity, as well as other properties such as normality, multicollinearity, and heteroscedasticity.
- Linearity
In our residuals vs fitted plot above, the line turns negative toward the end of the plot. This indicates our current model is not linear whatsoever, which may result in a lower R-squared.
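For reference, the plot above can be reproduced with base R's diagnostic plot (assuming the fitted object fwd from above):
plot(fwd, which = 1) # residuals vs fitted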
- Normality (Shapiro-Wilk test)
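Judging from the data line of the output below, this is the call used:
shapiro.test(fwd$residuals) # test the residuals for normality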
Shapiro-Wilk normality test
data: fwd$residuals
W = 0.99687, p-value = 0.2294
Using the Shapiro-Wilk test, we infer that our residuals' p-value is above 0.05 (\(p > 0.05\)). We conclude that our model's residuals follow a normal distribution.
- Multicollinearity
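The VIF scores below can be obtained with the car package (an assumption; olsrr offers an equivalent):
car::vif(fwd) # variance inflation factor per predictor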
cement age super_plast slag water flyash
1.863885 1.113726 2.348014 1.742804 1.852011 2.368945
The multicollinearity scores above indicate that our variables do not have any multicollinearity problem, since every \(VIF < 5\); the rule of thumb is that \(VIF > 10\) signals a multicollinearity issue.
- Heteroscedasticity
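The studentized Breusch-Pagan test below is available in the lmtest package (an assumption based on the output's header):
lmtest::bptest(fwd) # test for heteroscedasticity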
studentized Breusch-Pagan test
data: fwd
BP = 98.506, df = 6, p-value < 2.2e-16
Seeing that our model's p-value is below 0.05 (\(p < 0.05\)) in the Breusch-Pagan test, we can conclude that there is heteroscedasticity within our current model.
1.2. Model Prediction
After reviewing our model by evaluating it, we now check its performance by using it to predict our test dataset and reviewing the MAE and R-squared produced.
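A sketch of how these two scores might be computed (the MLmetrics package is an assumption):
pred <- predict(fwd, newdata = ts)
MLmetrics::MAE(pred, ts$strength) # mean absolute error
MLmetrics::R2_Score(pred, ts$strength) * 100 # R-squared, in percent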
[1] 8.513054
[1] 61.68372
Our model yielded \(MAE = 8.5\) and \(R^2 = 61\%\), not a bad result for a regular forward stepwise method. At this point, we can hypothesize that our model may have some bad influences (outliers) holding back its current performance.
To test this hypothesis, we shall now try to reduce these bad influences (outliers) and get better performance from our model.
1.3. Outliers Removal
One way to detect and handle the outliers that might be scattered across our dataset is Cook's distance (CD). According to Wikipedia, CD is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis.
CD can be used in several ways:
- To indicate influential data points that are particularly worth checking for validity
- To indicate regions of the design space where it would be good to be able to obtain more data points.
For our model, we can plot the CD bar plot using the olsrr::ols_plot_cooksd_bar() function and see which observations are considered outliers by the CD standard:
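The call, assuming the fitted object fwd from above:
olsrr::ols_plot_cooksd_bar(fwd) # CD bar plot for the stepwise model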
We can see that a large number of observations are considered outliers by CD's standard (indicated in red). The horizontal red line stretching across the plot is CD's threshold, here with a value of 0.006 (shown in the top right corner).
The CD threshold is calculated as \(threshold_{CD} = 4 / n\), where \(n\) is the number of rows in the data; any observation that “jumps” over the threshold line is considered an outlier by CD.
To remove the outliers, we first bind our model’s CD to our training data.
# Make new copy of training dataset
cd_tr <- tr
# Attach CD value to copied training set
cd_tr$cd <- cooks.distance(fwd)
We then find the observations whose CD value is greater than our model's own CD threshold, and look for the CD value at which a clear break occurs (indicating the outliers beyond that value).
# Extract observations whose CD value is greater than the model's CD threshold
lrg_cd <- subset(cd_tr, cd > 4/nrow(cd_tr))
# Quantile the extracted observations to find which quantile holds the outliers
quantile(lrg_cd$cd, probs = seq(0, 1, 0.05))
         0%          5%         10%         15%         20%         25%
0.006217755 0.006485609 0.006628140 0.006798822 0.006811185 0.007018945
30% 35% 40% 45% 50% 55%
0.007580165 0.007772170 0.007986549 0.008130277 0.009077721 0.010348965
60% 65% 70% 75% 80% 85%
0.011695175 0.012573797 0.015801682 0.016165813 0.018868862 0.022142909
90% 95% 100%
0.030828194 0.058063588 0.080110802
From the quantile result above, there is a strong gap between the 45% and 50% quantiles, as if indicating a sudden “jump”. Using the 50% quantile value, we then drop any observation with a CD value above 0.009077721.
After removing the presumed outliers, we feed the newly created dataset, minus the outliers, into the forward stepwise model creator, as sketched below.
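A sketch of the removal and refit (all object names here are assumptions):
# Keep only observations below the 50% quantile cut found above
cln_tr <- subset(cd_tr, cd < 0.009077721)
cln_tr$cd <- NULL # drop the helper column before refitting
# Rerun forward stepwise selection on the cleaned training data
cln.none <- lm(strength ~ 1, cln_tr)
cln.all <- lm(strength ~ ., cln_tr)
fwd.cln <- step(cln.none, scope = list(lower = cln.none, upper = cln.all), direction = "forward")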
Start: AIC=3562.41
strength ~ 1
Df Sum of Sq RSS AIC
+ cement 1 42352 129317 3384.2
+ age 1 33826 137844 3424.8
+ super_plast 1 25266 146404 3463.2
+ water 1 14481 157189 3508.4
+ fine_agg 1 7442 164228 3536.2
+ coarse_agg 1 3893 167776 3549.8
+ slag 1 3241 168429 3552.3
+ flyash 1 1299 170370 3559.6
<none> 171670 3562.4
Step: AIC=3384.23
strength ~ cement
Df Sum of Sq RSS AIC
+ age 1 27887.3 101430 3231.7
+ super_plast 1 21315.9 108002 3271.7
+ slag 1 13741.9 115576 3314.8
+ water 1 10939.4 118378 3330.0
+ flyash 1 2296.4 127021 3374.8
+ coarse_agg 1 1694.9 127623 3377.8
+ fine_agg 1 1319.8 127998 3379.7
<none> 129317 3384.2
Step: AIC=3231.75
strength ~ cement + age
Df Sum of Sq RSS AIC
+ super_plast 1 28399.7 73030 3024.8
+ water 1 20021.2 81409 3093.9
+ slag 1 15479.7 85950 3128.4
+ flyash 1 3698.0 97732 3210.1
+ coarse_agg 1 2171.8 99258 3220.0
+ fine_agg 1 361.7 101069 3231.5
<none> 101430 3231.7
Step: AIC=3024.82
strength ~ cement + age + super_plast
Df Sum of Sq RSS AIC
+ slag 1 12645.2 60385 2905.9
+ fine_agg 1 2771.6 70259 3002.2
+ water 1 2073.7 70957 3008.5
+ flyash 1 442.4 72588 3023.0
<none> 73030 3024.8
+ coarse_agg 1 21.7 73009 3026.6
Step: AIC=2905.9
strength ~ cement + age + super_plast + slag
Df Sum of Sq RSS AIC
+ water 1 4295.3 56090 2861.0
+ flyash 1 3303.6 57082 2872.1
+ coarse_agg 1 804.5 59581 2899.4
<none> 60385 2905.9
+ fine_agg 1 54.0 60331 2907.3
Step: AIC=2860.97
strength ~ cement + age + super_plast + slag + water
Df Sum of Sq RSS AIC
+ flyash 1 4273.0 51817 2812.6
+ fine_agg 1 1176.2 54914 2849.5
<none> 56090 2861.0
+ coarse_agg 1 4.3 56086 2862.9
Step: AIC=2812.57
strength ~ cement + age + super_plast + slag + water + flyash
Df Sum of Sq RSS AIC
<none> 51817 2812.6
+ coarse_agg 1 27.363 51790 2814.2
+ fine_agg 1 0.005 51817 2814.6
Using this new model, we then predict the strength and check the MAE and R-squared scores:
[1] 8.902201
[1] 69.81593
At this point, we can say that by removing the outliers using CD, the R-squared score increased to around 69.8159262 from the previous 61.6837156, while our MAE worsened to 8.9022011 from the previous 8.5130542.
Thus, for this experiment on removing the outliers, we would say that removing outliers has a positive effect in increasing the R-squared, but beware of the worsened MAE.
2. Transformed Dataset
As mentioned earlier when exploring the data, some of our variables have skewed distributions. To normalize the spread within those variables, we can apply a logarithmic transformation.
Looking back at our earlier histograms, some variables need to be normalized in order to have a fairly even distribution. For this we shall apply the logarithmic transformation to the skewed variables only:
# Copy the dataset
log.ccrt <- ccrt
# Clip values <= 1 to 1 so the log transform never yields negative values
log.ccrt$slag <- ifelse(log.ccrt$slag<=1,1,log.ccrt$slag)
log.ccrt$slag <- log(log.ccrt$slag)
log.ccrt$flyash <- ifelse(log.ccrt$flyash<=1,1,log.ccrt$flyash)
log.ccrt$flyash <- log(log.ccrt$flyash)
log.ccrt$super_plast <- ifelse(log.ccrt$super_plast<=1,1,log.ccrt$super_plast)
log.ccrt$super_plast <- log(log.ccrt$super_plast)
log.ccrt$age <- ifelse(log.ccrt$age<=1,1,log.ccrt$age)
log.ccrt$age <- log(log.ccrt$age)
2.1. Transformed Data Exploration
After performing the log transform, we review our freshly transformed dataset and look at the spread of its data using the hist() function.
The transformed variables now turn out fairly evenly spread, even though some frequencies are not very high; overall, the distributions look considerably better.
We are also interested in the correlation between the variables and the target variable.
The correlations in our transformed dataset turn out higher than in the untransformed dataset.
2.2. Transformed Model Creation
With the newly log-transformed dataset, we then try to create a new model, first splitting the data with the same 80:20 proportion.
set.seed(1708) # Random
idx <- initial_split(log.ccrt,0.8,"strength")
tr.log <- training(idx)
ts.log <- testing(idx)
And we create the model using the training split:
log.none <- lm(strength ~ 1, tr.log)
log.all <- lm(strength ~ ., tr.log)
fwd.log <- step(log.none, scope = list(lower=log.none,upper=log.all), direction = "forward")
                Estimate   Std. Error    t value      Pr(>|t|)
(Intercept) 32.71053653 6.558415319 4.987567 7.846888e-07
age 8.54907548 0.229518761 37.247829 1.058802e-163
cement 0.08678604 0.003540640 24.511402 1.412444e-94
super_plast 2.35791154 0.434574683 5.425791 8.138015e-08
slag 2.09798670 0.150626730 13.928382 8.460616e-39
water -0.22936981 0.019430782 -11.804456 2.768655e-29
fine_agg -0.02035161 0.004324561 -4.706051 3.084903e-06
flyash 0.32545470 0.193453461 1.682341 9.298079e-02
After transforming the skewed variables using the logarithmic transformation to normalize them, and using the same variable selection method as before (forward stepwise), we obtain the following influencing variables: \[strength = cement + superplast + age + slag + water + flyash + fineagg\]
The predictors here are almost the same as those used in the non-transformed stepwise model. The only difference is the addition of the fine_agg variable.
And by seeing the estimating coefficient,
(Intercept) age cement super_plast slag water
32.71053653 8.54907548 0.08678604 2.35791154 2.09798670 -0.22936981
fine_agg flyash
-0.02035161 0.32545470
the increased use of super_plast and slag, together with the longer drying age shown in the age variable, will definitely increase our concrete's strength, since they contribute coefficients of 2.3579115, 2.0979867, and 8.5490755 respectively to our model.
Meanwhile, the addition of water and fine_agg will decrease the strength by 0.2293698 and 0.0203516 per unit respectively.
Our current assumption as to why this model yields different predictors and values than the non-transformed one is that the transformation spread the data out more, which may give more linear residuals than the non-transformed dataset, thus impacting how the forward stepwise procedure picks the predictors and calculates their values.
2.3. Transformed Model Evaluation
After acquiring the transformed model, we then evaluate it with the same methods as before: checking its linearity, normality, multicollinearity, and heteroscedasticity.
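The same diagnostic calls as before apply, now on fwd.log (a sketch; package choices as assumed earlier):
plot(fwd.log, which = 1) # residuals vs fitted
shapiro.test(fwd.log$residuals) # normality
car::vif(fwd.log) # multicollinearity
lmtest::bptest(fwd.log) # heteroscedasticity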
- Linearity
In our residuals vs fitted plot above, the line runs nearly straight across the plot. This indicates that our data is linear, from which we hypothesize that the model can predict strength values close to the actual strength values (a small MAE).
- Normality (Shapiro-Wilk test)
Shapiro-Wilk normality test
data: fwd.log$residuals
W = 0.9959, p-value = 0.08148
Using the Shapiro-Wilk test, we infer that our residuals' p-value is above 0.05 (\(p > 0.05\)). We conclude that our model's residuals follow a normal distribution.
- Multicollinearity
age cement super_plast slag water fine_agg
1.043669 1.896162 3.222187 1.825590 2.281857 1.706894
flyash
2.891306
The multicollinearity scores above indicate that our variables do not have any multicollinearity problem, since every \(VIF < 5\); the rule of thumb is that \(VIF > 10\) signals a multicollinearity issue.
- Heteroscedasticity
studentized Breusch-Pagan test
data: fwd.log
BP = 75.333, df = 7, p-value = 1.227e-13
Seeing that our model's p-value is below 0.05 (\(p < 0.05\)) in the Breusch-Pagan test, we can conclude that there is heteroscedasticity within our current model.
2.4. Transformed Model Prediction
With our transformed model, we now use it to predict the (also transformed) test dataset.
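Computed the same way as before (MLmetrics is again an assumption):
pred.log <- predict(fwd.log, newdata = ts.log)
MLmetrics::MAE(pred.log, ts.log$strength) # mean absolute error
MLmetrics::R2_Score(pred.log, ts.log$strength) * 100 # R-squared, in percent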
[1] 6.481728
[1] 82.95717
Our model yielded \(MAE = 6.4\) and \(R^2 = 82.9\%\) for the prediction of the test dataset, a much better result than the previous untransformed methods.
Conclusion
For our concrete strength composition analysis, we can state that:
- It is best to use the stepwise method to determine which variables are best suited for the target variable. For this particular case, using the forward direction yields a wonderful result.
- Besides seeing the correlation of each variable with the target variable, it is wise to also look at the spread of the data in each variable, so as to plan whether we need to transform the data or not. This can be done using hist().
- Use Cook's Distance (CD) to determine and remove bad influences (outliers) from the dataset; this may yield an increase in the R-squared of the model created.
- In this instance, applying CD only yielded an improvement in R-squared, not in MAE, and removing the outliers with CD showed only a small change in both MAE and R-squared.