Concrete is the most important material in civial engineering.
This problem was orginally proposed by Prof. I-Cheng Yeh, Deparment of Information Management Chung-Hua University, Hsin Chu, Taiwan in 2007. It is related to his research in 1998 about how to predict compression strength in a concrete structure.
This project aims to predict the compressive strength of concrete with regression subject to maximum accuracy and lowest error MAE with various variables of components as the input.
The goal of this problem is to predict the compression strength based on the mixture properties.
># Observations: 825
># Variables: 10
># $ id <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
># $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
># $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
># $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
># $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
># $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
># $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
># $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
># $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
># $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...
Information data :
id : Id of each cement mixture,
cement : The amount of cement (Kg) in a m3 mixture,
slag : The amount of blast furnace slag (Kg) in a m3 mixture,
flyash : The amount of fly ash (Kg) in a m3 mixture,
water : The amount of water (Kg) in a m3 mixture,
super_plast: The amount of Superplasticizer (Kg) in a m3 mixture,
coarse_agg : The amount of Coarse Aggreagate (Kg) in a m3 mixture,
fine_agg : The amount of Fine Aggregate (Kg) in a m3 mixture,
age : the number of resting days before the compressive strength measurement,
strength : Concrete compressive strength measurement in MPa unit.
Dependent Variable : Strength
First we will check if there are any missing values for each variables,
># id cement slag flyash water super_plast
># 0 0 0 0 0 0
># coarse_agg fine_agg age strength
># 0 0 0 0
># id cement slag flyash water super_plast
># 0 0 0 0 0 0
># coarse_agg fine_agg age strength
># 0 0 0 205
From the result, can be seen that there are no missing values except strength variable in data test that we want to predict.
Begin with all variables, I want to see how the distribution all variables and are there any outliers in train and test data using boxplot.
From the boxplot, can be seen that I have several variables that have very skewed distributions (seen from Q2 or median in boxplot) such as :
in data train = Slag, Flyash, Water, Superplast, and Age
in data test = slag, Flyash, Water, Superplast, and Age
Can be seen too that several variables have outlier such as :
in data train = Slag, Water, SuperPlast, FineAgg, Age, and Strength
in data test = Super_plast and Age
So we have two problems such as : very skewed distributions and outlier
Now I will make logarithm transformation for several variables that have very skewed distributions
Before we make transformation, we will see how the value distribution for each variables
># id cement slag flyash
># S1 : 1 Min. :102.0 Min. : 0.00 Min. : 0.00
># S10 : 1 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
># S100 : 1 Median :275.1 Median : 20.00 Median : 0.00
># S101 : 1 Mean :280.9 Mean : 73.18 Mean : 54.03
># S102 : 1 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
># S103 : 1 Max. :540.0 Max. :359.40 Max. :200.10
># (Other):819
># water super_plast coarse_agg fine_agg
># Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
># 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
># Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
># Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
># 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
># Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
>#
># age strength
># Min. : 1.00 Min. : 2.33
># 1st Qu.: 7.00 1st Qu.:23.64
># Median : 28.00 Median :34.57
># Mean : 45.14 Mean :35.79
># 3rd Qu.: 56.00 3rd Qu.:45.94
># Max. :365.00 Max. :82.60
>#
Can be seen that there are several variables have zero value such Slag, Flyash, SuperPlast, and Age. we can’t make logarithm transformation for zero value. So I will manipulate the data for zero value to become one. (\(\log1=0\))
Manipulate data for data train
train <- train %>% select(-1)
train$slag <- ifelse(train$slag <=1, 1, train$slag)
train$flyash <- ifelse(train$flyash <=1, 1, train$flyash)
train$super_plast <- ifelse(train$super_plast <=1, 1, train$super_plast)
train$age <- ifelse(train$age <=1, 1, train$age)After manipulation data, I will make logarithm transformation for variable that have very skewed distribution such as Slag, Flyash, SuperPlast, Age, and Water. I will check the boxplot again after that
train$slag <- log(train$slag)
train$flyash <- log(train$flyash)
train$super_plast <- log(train$super_plast)
train$age <- log(train$age)
train$water <- log(train$water)
boxplot(train, las = 2)Now, can be seen outlier in data train = water, fineagg, strength. I will remove the outlier
Remove outlier in data train
water1 <- which(train$water %in% boxplot(train$water, plot=FALSE)$out)
train <- train[-water1, ]
nrow(train)># [1] 812
fine_agg1 <- which(train$fine_agg %in% boxplot(train$fine_agg, plot=FALSE)$out)
train <- train[-fine_agg1, ]
nrow(train)># [1] 790
strength1 <- which(train$strength %in% boxplot(train$strength, plot=FALSE)$out)
train <- train[-strength1, ]
nrow(train)># [1] 789
From 825 data, after logarithm transformation and delete outlier, final for data train is 789
Because the data test that I want to predict must have same unit variable with data train, I will make logarithm transformation in data test same like data train
Manipulate data for data test
#test <- test %>% select(-1)
test$slag <- ifelse(test$slag <=1, 1, test$slag)
test$flyash <- ifelse(test$flyash <=1, 1, test$flyash)
test$super_plast <- ifelse(test$super_plast <=1, 1, test$super_plast)
test$age <- ifelse(test$age <=1, 1, test$age)test$slag <- log(test$slag)
test$flyash <- log(test$flyash)
test$super_plast <- log(test$super_plast)
test$age <- log(test$age)
test$water <- log(test$water)Can be seen, the test data have outlier. But I will not remove it because I want to predict the strength. Now, data train and data test are ready to be explore.
First, I want to see how each variables are related one another
Insight :
1. Strength as dependent variable have enough strong positive relationship with Age that have correlation value (0.565)
2. Strength have enough strong positive relationship with cement too with correlation value (0.495)
3. SuperPlast and Cement have a very small negative relationship with correlation value (-0.00456)
Can be conclude strength have strong relationship with Cement (0.495), Water (-0.342), SuperPlast (0.345), and Age (0.565) compared to other variables.
From the data distribution, we can be seen from the data that variables looks normal distribution (peaking in the middle of the data) are Cement, Water, CoarseAgg, FineAgg, and Strength.
data that looks have skewed distribution are Slag, Flyash, SuperPlast, and Age.
We will divide the data train into data_train and data_test. For the data_test, I will make the amount is as close as possible to the actual data test (205 data). Because the target is data test around 200 data, I will make proportion 70:30. Cross validation used to see models that can predict the visible data, how accurate with Rsquared and residuals (error).
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(train), nrow(train)*0.70)
data_train <- train[index,]
data_test <- train[-index,]
nrow(data_train)
nrow(data_test)># [1] 552
># [1] 237
Right now I will make stepwise to get the best OLS model with Akaike Information Criterion (AIC) and regsubsets based by the highest Adjusted R-squared
modelnone <- lm(strength~1, data_train)
modelfull <- lm(strength~., data_train)
modelaic <- step(modelnone, scope = list(lower = modelnone, upper = modelfull), direction = "both")># Start: AIC=3109.99
># strength ~ 1
>#
># Df Sum of Sq RSS AIC
># + age 1 49083 104803 2899.9
># + cement 1 40768 113118 2942.1
># + super_plast 1 18467 135418 3041.4
># + water 1 15246 138640 3054.4
># + fine_agg 1 9358 144528 3077.4
># + slag 1 6035 147851 3089.9
># + coarse_agg 1 2555 151330 3102.8
># <none> 153886 3110.0
># + flyash 1 447 153439 3110.4
>#
># Step: AIC=2899.95
># strength ~ age
>#
># Df Sum of Sq RSS AIC
># + cement 1 40864 63939 2629.2
># + water 1 23944 80859 2758.8
># + super_plast 1 20242 84561 2783.5
># + slag 1 7038 97765 2863.6
># + fine_agg 1 6320 98483 2867.6
># + coarse_agg 1 2164 102639 2890.4
># <none> 104803 2899.9
># + flyash 1 197 104606 2900.9
># - age 1 49083 153886 3110.0
>#
># Step: AIC=2629.18
># strength ~ age + cement
>#
># Df Sum of Sq RSS AIC
># + super_plast 1 21633 42307 2403.2
># + slag 1 17530 46409 2454.3
># + water 1 14551 49388 2488.6
># + flyash 1 3246 60693 2602.4
># + fine_agg 1 1375 62564 2619.2
># + coarse_agg 1 1123 62817 2621.4
># <none> 63939 2629.2
># - cement 1 40864 104803 2899.9
># - age 1 49179 113118 2942.1
>#
># Step: AIC=2403.21
># strength ~ age + cement + super_plast
>#
># Df Sum of Sq RSS AIC
># + slag 1 12205 30102 2217.3
># + fine_agg 1 2628 39679 2369.8
># + flyash 1 2137 40169 2376.6
># + water 1 1220 41087 2389.1
># <none> 42307 2403.2
># + coarse_agg 1 1 42306 2405.2
># - super_plast 1 21633 63939 2629.2
># - cement 1 42254 84561 2783.5
># - age 1 51019 93326 2837.9
>#
># Step: AIC=2217.33
># strength ~ age + cement + super_plast + slag
>#
># Df Sum of Sq RSS AIC
># + water 1 3166 26936 2158.0
># + coarse_agg 1 1617 28485 2188.9
># + flyash 1 140 29962 2216.8
># <none> 30102 2217.3
># + fine_agg 1 101 30001 2217.5
># - slag 1 12205 42307 2403.2
># - super_plast 1 16307 46409 2454.3
># - cement 1 50620 80722 2759.8
># - age 1 52224 82326 2770.7
>#
># Step: AIC=2157.99
># strength ~ age + cement + super_plast + slag + water
>#
># Df Sum of Sq RSS AIC
># + fine_agg 1 1102 25834 2136.9
># + flyash 1 656 26280 2146.4
># + coarse_agg 1 265 26671 2154.5
># <none> 26936 2158.0
># - water 1 3166 30102 2217.3
># - super_plast 1 3614 30550 2225.5
># - slag 1 14151 41087 2389.1
># - cement 1 43496 70431 2686.6
># - age 1 55035 81971 2770.3
>#
># Step: AIC=2136.94
># strength ~ age + cement + super_plast + slag + water + fine_agg
>#
># Df Sum of Sq RSS AIC
># + flyash 1 367 25467 2131.0
># <none> 25834 2136.9
># + coarse_agg 1 79 25755 2137.2
># - fine_agg 1 1102 26936 2158.0
># - super_plast 1 3091 28925 2197.3
># - water 1 4167 30001 2217.5
># - slag 1 10122 35956 2317.4
># - cement 1 32189 58024 2581.6
># - age 1 54377 80211 2760.3
>#
># Step: AIC=2131.03
># strength ~ age + cement + super_plast + slag + water + fine_agg +
># flyash
>#
># Df Sum of Sq RSS AIC
># <none> 25467 2131.0
># + coarse_agg 1 23 25443 2132.5
># - flyash 1 367 25834 2136.9
># - super_plast 1 741 26208 2144.9
># - fine_agg 1 813 26280 2146.4
># - water 1 4445 29912 2217.8
># - slag 1 8774 34240 2292.4
># - cement 1 26204 51670 2519.6
># - age 1 54743 80209 2762.3
modeluji <- regsubsets(strength ~ cement + coarse_agg + age + super_plast + slag + water + fine_agg +
flyash, data = data_train, nbest = 2)
plot(modeluji, scale = "adjr2")Conclusion Model :
\[{Strength}={\beta_{0}} + {\beta_{1}}\log({Age}) + {\beta_{2}}{Cement} + {\beta_{3}}\log({SuperPlast}) +{\beta_{4}}\log({Slag})+ {\beta_{5}}\log({Water}) + {\beta_{6}}{FineAgg} + {\beta_{7}}\log({Flyash}) + {\upsilon_{i}}\]
>#
># Call:
># lm(formula = strength ~ age + cement + super_plast + slag + water +
># fine_agg + flyash, data = data_train)
>#
># Residuals:
># Min 1Q Median 3Q Max
># -22.1132 -4.7018 0.5054 4.5971 24.0549
>#
># Coefficients:
># Estimate Std. Error t value Pr(>|t|)
># (Intercept) 194.383230 23.017869 8.445 0.000000000000000278 ***
># age 8.584118 0.251027 34.196 < 0.0000000000000002 ***
># cement 0.089857 0.003798 23.659 < 0.0000000000000002 ***
># super_plast 1.951518 0.490462 3.979 0.000078601594019467 ***
># slag 2.256388 0.164820 13.690 < 0.0000000000000002 ***
># water -39.445423 4.047928 -9.745 < 0.0000000000000002 ***
># fine_agg -0.019782 0.004748 -4.167 0.000035922771413228 ***
># flyash 0.578654 0.206527 2.802 0.00526 **
># ---
># Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>#
># Residual standard error: 6.842 on 544 degrees of freedom
># Multiple R-squared: 0.8345, Adjusted R-squared: 0.8324
># F-statistic: 391.9 on 7 and 544 DF, p-value: < 0.00000000000000022
\[{Strength}= 194.383 + 8.584\log({Age}) + 0.089{Cement} + 1.951\log({SuperPlast}) + 2.256\log({Slag}) - 39.445\log({Water}) - 0.019{FineAgg} + 0.578\log({Flyash}) + {\upsilon_{i}}\] Analysis from the Model:
1. From generally perspective, can be seen from F-statistics have significant p-value < 0.05 (level of significance), it can be concluded that our models is a good model.
2. Because this is multiple regression, we see Adjusted R-squared that have value 0.8324. it means that 83.24% variation of independent variables can explain variation of dependent variable. We have a good model.
3. Look more specifically with each of independent variables, with t-test, all variables have significant p-value < 0.05. It means that all independent variables can explain dependent variables (Strength).
4. Residual standard error 6.842, it means that average distance of the data points from the fitted line is about 6,842%.
Interpretation from the Model:
1. Without intervention or influence from independent variable, concrete compressive strength have initial capital 194.383 unit value, Ceteris Paribus.
2. From all independent variables, Age is the variable that can provide the highest positive added value for Concrete compressive strength. Every increase in 1 percent Age (percent because log), it can increases concrete compresssive strength 8.584 unit value, Ceteris Paribus.
3. From all independent variables, Water is the variable that can provide the highest negative added value for Concrete compressive strength. Every increase in 1 percent Water (percent because log), it can decreases concrete compresssive strength 39.445 unit value, Ceteris Paribus.
4. The rest of the variables have influence but are not very strong like Cement (0.089), SuperPlast (1.951), Slag (2.256), FineAgg (-0.019), and flyash (0.578), Ceteris Paribus.
Notes : Ceteris Paribus = “all other things being unchanged or constant”
Now we will predict how the performance MAE in our data train
predict1 <- predict(object = modelaic, newdata = data_train)
predict2 <- predict(object = modelaic, newdata = data_test)># [1] 5.399918
># [1] 5.225995
Can be seen, MAE predict with own data as well and the split not too different (5.399918) in data_train with (5.225995) in data_test, we have good model.
means that the residuals from the linear regression model should be normally distributed because we expect to get residuals near the zero value. to test the condition, we will use Shapiro-Wilk Test for our residual :
>#
># Shapiro-Wilk normality test
>#
># data: modelaic$residuals
># W = 0.997, p-value = 0.4048
\(H_{0}\) = Residuals are normally distributed.
\(H_{1}\) = Residuals are not normally distributed.
The test has a p-value above the significance level of 0.05. Therefore we don’t reject the null hypothesis. We can conclude that the residuals are normally distributed.
How if not normal, will this be a problem with the model?
Quoting Gujarati (2004),
“Violation of the normality assumption doesn’t contribute to bias or inefficiency in regression models. It is only important for the calculation of p-values for significance testing, but this is only a consideration when sample size is very small. When the sample size is sufficiently large (>200), the normality assumption is not needed at all as the Central Limit Theorem (CLT) ensures that the distribution of disturbance term will approximate normality”
Conclusions :
Since our sample have sufficiently large (700++ samples), normality assumption is not needed because Central Limit Theorem applicable.
means to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. for this assumption, we will test use Variance Inflation Factor (VIF).
># age cement super_plast slag water fine_agg
># 1.037311 1.811081 3.437216 1.868256 2.170822 1.505277
># flyash
># 2.799520
A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection. Our VIF test to the model doesn’t have a VIF number that greater than 10, it means in the model don’t have high collinearity.
Conclusions : this assumption is fulfilled
common statistical test to check if there is Homoscedasticity or Heteroscedasticity using the Breusch-Pagan Test. Heteroscedasticity is a condition where the variability of a variable is unequal across its range of value. In a linear regression model, if the variance of its error is showing unequal variation across the target variable range, it shows that heteroscedasticity is present and the implication to that is related to the previous statement of a non-random pattern in residual.
>#
># studentized Breusch-Pagan test
>#
># data: modelaic
># BP = 62.869, df = 7, p-value = 0.00000000004027
\(H_{0}\) = Homoscedasticity.
\(H_{1}\) = Heteroscedasticity.
The test has a p-value < 0.05, therefore we reject the null hypothesis.
Conclusions : The residuals has not a constant variance or Heteroscedasticity.
To cure this problem, I will use the Box-Cox in the part below.
Box-Cox Transformation is the tools that transform residuals from having a non-normal distribution to a normal distribution.
BCdist <- BoxCoxTrans(data_train$strength)
data_train <- cbind(data_train, dist_BC = predict(BCdist, data_train$strength))
BClmdata <- lm(dist_BC ~ age + water + fine_agg + coarse_agg +
flyash + super_plast, data = data_train)
gvlma(BClmdata)>#
># Call:
># lm(formula = dist_BC ~ age + water + fine_agg + coarse_agg +
># flyash + super_plast, data = data_train)
>#
># Coefficients:
># (Intercept) age water fine_agg coarse_agg flyash
># 162.97814 2.16529 -23.32692 -0.02477 -0.01735 -0.30053
># super_plast
># 0.25956
>#
>#
># ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
># USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
># Level of Significance = 0.05
>#
># Call:
># gvlma(x = BClmdata)
>#
># Value p-value Decision
># Global Stat 27.47476 0.00001593 Assumptions NOT satisfied!
># Skewness 15.71702 0.00007356 Assumptions NOT satisfied!
># Kurtosis 3.69636 0.05453135 Assumptions acceptable.
># Link Function 0.06496 0.79882835 Assumptions acceptable.
># Heteroscedasticity 7.99642 0.00468699 Assumptions NOT satisfied!
After the Box-Cox Transformation, now the model still not fulfill the Homocedasticity assumption. Maybe for the advanced,, we can use GLS method to cure the homocedasticity but not discussed in this project.
In this part, I will divide the age (before the log) into 4 categories (0-90), (91-180), (181-270), (270-365) to find out the difference for each composition.
train1 <- read.csv("data/data-train.csv")
train1 <- train1 %>%
mutate(agegroup = case_when(age < 91 ~ "00-90",
age >= 91 & age < 181 ~ "91-180",
age >= 181 & age < 271 ~ "181-270",
age >= 271 & age < 366 ~ "271-365"))
trainfinal <- train1[, c(10,11)]
anova <- aov(strength~agegroup, data=trainfinal)
summary(anova)># Df Sum Sq Mean Sq F value Pr(>F)
># agegroup 3 25708 8569 34.4 <0.0000000000000002 ***
># Residuals 821 204503 249
># ---
># Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
to find the difference of mean of concrete compression strength in age, I will use One Way ANOVA,
Define Hypothesis :
\(H_{0} : \mu_{00-90} = \mu_{91-180} = \mu_{181-270} = \mu_{271-365}\)
\(H_{1}\) : Not all population means are equal
Analysis : Can be seen from agegroup variable in One Way ANOVA table, the p=value < 0.05 (Reject Ho) which is significant.
Can be conclude that age group class are different each other to influence concrete compression strength.
Now we have another problem that ANOVA test doesn’t indicate which ones differ. to find out which age group means differ, I will using pairwise t-test to see more detail.
>#
># Pairwise comparisons using t tests with pooled SD
>#
># data: trainfinal$strength and trainfinal$agegroup
>#
># 00-90 181-270 271-365
># 181-270 0.0004 - -
># 271-365 0.0094 0.2676 -
># 91-180 <0.0000000000000002 0.8864 0.1488
>#
># P value adjustment method: none
Define Hypothesis :
\(H_{0}\) : not different each other
\(H_{1}\) : different each other
if p-value < 0.05 , Reject Ho.
From the pairwise t-test table result, can be seen ONLY category 00-90 age group that have pvalue < 0.05 with all another age group.
Conclusions : use 00-90 age group to get maximum or higher concrete compression strength.
To validate our model, we will predict to data test(unseen data) that will be submit to leaderboard.
predict3 <- predict(modelaic, newdata = test)
hasil <- data.frame(id = test$id, strength = predict3)
write.csv(hasil, "data/datatest-concreteanalysisfinal.csv")From predict to data test in Algoritma Leaderboard score, I get MAE score 6 and R-squared 78%.