Concrete is the most important material in civial engineering.

This problem was orginally proposed by Prof. I-Cheng Yeh, Deparment of Information Management Chung-Hua University, Hsin Chu, Taiwan in 2007. It is related to his research in 1998 about how to predict compression strength in a concrete structure.

This project aims to predict the compressive strength of concrete with regression subject to maximum accuracy and lowest error MAE with various variables of components as the input.

The goal of this problem is to predict the compression strength based on the mixture properties.

1 Input Data

1.1 Library

library(tidyverse)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(dplyr)
library(ggplot2)
library(corrplot)
library(leaps)
library(gvlma)
library(caret)
library(MLmetrics)
library(caTools)
library(psych)

1.2 Data Setup

train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)

># Observations: 825
># Variables: 10
># $ id          <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
># $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
># $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
># $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
># $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
># $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
># $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
># $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
># $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
># $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...

Information data :
id : Id of each cement mixture,
cement : The amount of cement (Kg) in a m3 mixture,
slag : The amount of blast furnace slag (Kg) in a m3 mixture,
flyash : The amount of fly ash (Kg) in a m3 mixture,
water : The amount of water (Kg) in a m3 mixture,
super_plast: The amount of Superplasticizer (Kg) in a m3 mixture,
coarse_agg : The amount of Coarse Aggreagate (Kg) in a m3 mixture,
fine_agg : The amount of Fine Aggregate (Kg) in a m3 mixture,
age : the number of resting days before the compressive strength measurement,
strength : Concrete compressive strength measurement in MPa unit.

Dependent Variable : Strength

2 Data Preprocessing

First we will check if there are any missing values for each variables,

colSums(is.na(train))

>#          id      cement        slag      flyash       water super_plast 
>#           0           0           0           0           0           0 
>#  coarse_agg    fine_agg         age    strength 
>#           0           0           0           0

colSums(is.na(test))

>#          id      cement        slag      flyash       water super_plast 
>#           0           0           0           0           0           0 
>#  coarse_agg    fine_agg         age    strength 
>#           0           0           0         205

From the result, can be seen that there are no missing values except strength variable in data test that we want to predict.

Begin with all variables, I want to see how the distribution all variables and are there any outliers in train and test data using boxplot.

boxplot(train, las = 2)

boxplot(test, las = 2)

From the boxplot, can be seen that I have several variables that have very skewed distributions (seen from Q2 or median in boxplot) such as :
in data train = Slag, Flyash, Water, Superplast, and Age
in data test = slag, Flyash, Water, Superplast, and Age

Can be seen too that several variables have outlier such as :
in data train = Slag, Water, SuperPlast, FineAgg, Age, and Strength
in data test = Super_plast and Age

So we have two problems such as : very skewed distributions and outlier
Now I will make logarithm transformation for several variables that have very skewed distributions

2.1 Logarithm Transformation and Detect Outlier

Before we make transformation, we will see how the value distribution for each variables

summary(train)

>#        id          cement           slag            flyash      
>#  S1     :  1   Min.   :102.0   Min.   :  0.00   Min.   :  0.00  
>#  S10    :  1   1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00  
>#  S100   :  1   Median :275.1   Median : 20.00   Median :  0.00  
>#  S101   :  1   Mean   :280.9   Mean   : 73.18   Mean   : 54.03  
>#  S102   :  1   3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20  
>#  S103   :  1   Max.   :540.0   Max.   :359.40   Max.   :200.10  
>#  (Other):819                                                    
>#      water        super_plast       coarse_agg        fine_agg    
>#  Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0  
>#  1st Qu.:164.9   1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0  
>#  Median :184.0   Median : 6.500   Median : 968.0   Median :780.1  
>#  Mean   :181.1   Mean   : 6.266   Mean   : 972.8   Mean   :775.6  
>#  3rd Qu.:192.0   3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8  
>#  Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6  
>#                                                                   
>#       age            strength    
>#  Min.   :  1.00   Min.   : 2.33  
>#  1st Qu.:  7.00   1st Qu.:23.64  
>#  Median : 28.00   Median :34.57  
>#  Mean   : 45.14   Mean   :35.79  
>#  3rd Qu.: 56.00   3rd Qu.:45.94  
>#  Max.   :365.00   Max.   :82.60  
>#

Can be seen that there are several variables have zero value such Slag, Flyash, SuperPlast, and Age. we can’t make logarithm transformation for zero value. So I will manipulate the data for zero value to become one. (\(\log1=0\))

Manipulate data for data train

train <- train %>% select(-1)
train$slag <- ifelse(train$slag <=1, 1, train$slag)
train$flyash <- ifelse(train$flyash <=1, 1, train$flyash)
train$super_plast <- ifelse(train$super_plast <=1, 1, train$super_plast)
train$age <- ifelse(train$age <=1, 1, train$age)

After manipulation data, I will make logarithm transformation for variable that have very skewed distribution such as Slag, Flyash, SuperPlast, Age, and Water. I will check the boxplot again after that

train$slag <- log(train$slag)
train$flyash <- log(train$flyash)
train$super_plast <- log(train$super_plast)
train$age <- log(train$age)
train$water <- log(train$water)
boxplot(train, las = 2)

Now, can be seen outlier in data train = water, fineagg, strength. I will remove the outlier

Remove outlier in data train

water1 <- which(train$water %in% boxplot(train$water, plot=FALSE)$out)
train <- train[-water1, ]
nrow(train)

># [1] 812

fine_agg1 <- which(train$fine_agg %in% boxplot(train$fine_agg, plot=FALSE)$out)
train <- train[-fine_agg1, ]
nrow(train)

># [1] 790

strength1 <- which(train$strength %in% boxplot(train$strength, plot=FALSE)$out)
train <- train[-strength1, ]
nrow(train)

># [1] 789

From 825 data, after logarithm transformation and delete outlier, final for data train is 789

Because the data test that I want to predict must have same unit variable with data train, I will make logarithm transformation in data test same like data train

Manipulate data for data test

#test <- test %>% select(-1)
test$slag <- ifelse(test$slag <=1, 1, test$slag)
test$flyash <- ifelse(test$flyash <=1, 1, test$flyash)
test$super_plast <- ifelse(test$super_plast <=1, 1, test$super_plast)
test$age <- ifelse(test$age <=1, 1, test$age)

test$slag <- log(test$slag)
test$flyash <- log(test$flyash)
test$super_plast <- log(test$super_plast)
test$age <- log(test$age)
test$water <- log(test$water)

Can be seen, the test data have outlier. But I will not remove it because I want to predict the strength. Now, data train and data test are ready to be explore.

3 Exploratory Data

First, I want to see how each variables are related one another

GGally::ggpairs(train)

Insight :
1. Strength as dependent variable have enough strong positive relationship with Age that have correlation value (0.565)
2. Strength have enough strong positive relationship with cement too with correlation value (0.495)
3. SuperPlast and Cement have a very small negative relationship with correlation value (-0.00456)

Can be conclude strength have strong relationship with Cement (0.495), Water (-0.342), SuperPlast (0.345), and Age (0.565) compared to other variables.

From the data distribution, we can be seen from the data that variables looks normal distribution (peaking in the middle of the data) are Cement, Water, CoarseAgg, FineAgg, and Strength.
data that looks have skewed distribution are Slag, Flyash, SuperPlast, and Age.

4 Machine Learning and Modelling

4.1 Cross Validation

We will divide the data train into data_train and data_test. For the data_test, I will make the amount is as close as possible to the actual data test (205 data). Because the target is data test around 200 data, I will make proportion 70:30. Cross validation used to see models that can predict the visible data, how accurate with Rsquared and residuals (error).

RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(train), nrow(train)*0.70)
data_train <- train[index,]
data_test <- train[-index,]
nrow(data_train)
nrow(data_test)

># [1] 552
># [1] 237

4.2 Akaike Information Criterion (AIC)

Right now I will make stepwise to get the best OLS model with Akaike Information Criterion (AIC) and regsubsets based by the highest Adjusted R-squared

modelnone <- lm(strength~1, data_train)
modelfull <- lm(strength~., data_train)
modelaic <- step(modelnone, scope = list(lower = modelnone, upper = modelfull), direction = "both")

># Start:  AIC=3109.99
># strength ~ 1
># 
>#               Df Sum of Sq    RSS    AIC
># + age          1     49083 104803 2899.9
># + cement       1     40768 113118 2942.1
># + super_plast  1     18467 135418 3041.4
># + water        1     15246 138640 3054.4
># + fine_agg     1      9358 144528 3077.4
># + slag         1      6035 147851 3089.9
># + coarse_agg   1      2555 151330 3102.8
># <none>                     153886 3110.0
># + flyash       1       447 153439 3110.4
># 
># Step:  AIC=2899.95
># strength ~ age
># 
>#               Df Sum of Sq    RSS    AIC
># + cement       1     40864  63939 2629.2
># + water        1     23944  80859 2758.8
># + super_plast  1     20242  84561 2783.5
># + slag         1      7038  97765 2863.6
># + fine_agg     1      6320  98483 2867.6
># + coarse_agg   1      2164 102639 2890.4
># <none>                     104803 2899.9
># + flyash       1       197 104606 2900.9
># - age          1     49083 153886 3110.0
># 
># Step:  AIC=2629.18
># strength ~ age + cement
># 
>#               Df Sum of Sq    RSS    AIC
># + super_plast  1     21633  42307 2403.2
># + slag         1     17530  46409 2454.3
># + water        1     14551  49388 2488.6
># + flyash       1      3246  60693 2602.4
># + fine_agg     1      1375  62564 2619.2
># + coarse_agg   1      1123  62817 2621.4
># <none>                      63939 2629.2
># - cement       1     40864 104803 2899.9
># - age          1     49179 113118 2942.1
># 
># Step:  AIC=2403.21
># strength ~ age + cement + super_plast
># 
>#               Df Sum of Sq   RSS    AIC
># + slag         1     12205 30102 2217.3
># + fine_agg     1      2628 39679 2369.8
># + flyash       1      2137 40169 2376.6
># + water        1      1220 41087 2389.1
># <none>                     42307 2403.2
># + coarse_agg   1         1 42306 2405.2
># - super_plast  1     21633 63939 2629.2
># - cement       1     42254 84561 2783.5
># - age          1     51019 93326 2837.9
># 
># Step:  AIC=2217.33
># strength ~ age + cement + super_plast + slag
># 
>#               Df Sum of Sq   RSS    AIC
># + water        1      3166 26936 2158.0
># + coarse_agg   1      1617 28485 2188.9
># + flyash       1       140 29962 2216.8
># <none>                     30102 2217.3
># + fine_agg     1       101 30001 2217.5
># - slag         1     12205 42307 2403.2
># - super_plast  1     16307 46409 2454.3
># - cement       1     50620 80722 2759.8
># - age          1     52224 82326 2770.7
># 
># Step:  AIC=2157.99
># strength ~ age + cement + super_plast + slag + water
># 
>#               Df Sum of Sq   RSS    AIC
># + fine_agg     1      1102 25834 2136.9
># + flyash       1       656 26280 2146.4
># + coarse_agg   1       265 26671 2154.5
># <none>                     26936 2158.0
># - water        1      3166 30102 2217.3
># - super_plast  1      3614 30550 2225.5
># - slag         1     14151 41087 2389.1
># - cement       1     43496 70431 2686.6
># - age          1     55035 81971 2770.3
># 
># Step:  AIC=2136.94
># strength ~ age + cement + super_plast + slag + water + fine_agg
># 
>#               Df Sum of Sq   RSS    AIC
># + flyash       1       367 25467 2131.0
># <none>                     25834 2136.9
># + coarse_agg   1        79 25755 2137.2
># - fine_agg     1      1102 26936 2158.0
># - super_plast  1      3091 28925 2197.3
># - water        1      4167 30001 2217.5
># - slag         1     10122 35956 2317.4
># - cement       1     32189 58024 2581.6
># - age          1     54377 80211 2760.3
># 
># Step:  AIC=2131.03
># strength ~ age + cement + super_plast + slag + water + fine_agg + 
>#     flyash
># 
>#               Df Sum of Sq   RSS    AIC
># <none>                     25467 2131.0
># + coarse_agg   1        23 25443 2132.5
># - flyash       1       367 25834 2136.9
># - super_plast  1       741 26208 2144.9
># - fine_agg     1       813 26280 2146.4
># - water        1      4445 29912 2217.8
># - slag         1      8774 34240 2292.4
># - cement       1     26204 51670 2519.6
># - age          1     54743 80209 2762.3

modeluji <- regsubsets(strength ~ cement + coarse_agg + age + super_plast + slag + water + fine_agg + 
    flyash, data = data_train, nbest = 2)
plot(modeluji, scale = "adjr2")

Conclusion Model :
\[{Strength}={\beta_{0}} + {\beta_{1}}\log({Age}) + {\beta_{2}}{Cement} + {\beta_{3}}\log({SuperPlast}) +{\beta_{4}}\log({Slag})+ {\beta_{5}}\log({Water}) + {\beta_{6}}{FineAgg} + {\beta_{7}}\log({Flyash}) + {\upsilon_{i}}\]

4.3 Regression Result

summary(modelaic)

># 
># Call:
># lm(formula = strength ~ age + cement + super_plast + slag + water + 
>#     fine_agg + flyash, data = data_train)
># 
># Residuals:
>#      Min       1Q   Median       3Q      Max 
># -22.1132  -4.7018   0.5054   4.5971  24.0549 
># 
># Coefficients:
>#               Estimate Std. Error t value             Pr(>|t|)    
># (Intercept) 194.383230  23.017869   8.445 0.000000000000000278 ***
># age           8.584118   0.251027  34.196 < 0.0000000000000002 ***
># cement        0.089857   0.003798  23.659 < 0.0000000000000002 ***
># super_plast   1.951518   0.490462   3.979 0.000078601594019467 ***
># slag          2.256388   0.164820  13.690 < 0.0000000000000002 ***
># water       -39.445423   4.047928  -9.745 < 0.0000000000000002 ***
># fine_agg     -0.019782   0.004748  -4.167 0.000035922771413228 ***
># flyash        0.578654   0.206527   2.802              0.00526 ** 
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
># 
># Residual standard error: 6.842 on 544 degrees of freedom
># Multiple R-squared:  0.8345, Adjusted R-squared:  0.8324 
># F-statistic: 391.9 on 7 and 544 DF,  p-value: < 0.00000000000000022

\[{Strength}= 194.383 + 8.584\log({Age}) + 0.089{Cement} + 1.951\log({SuperPlast}) + 2.256\log({Slag}) - 39.445\log({Water}) - 0.019{FineAgg} + 0.578\log({Flyash}) + {\upsilon_{i}}\] Analysis from the Model:
1. From generally perspective, can be seen from F-statistics have significant p-value < 0.05 (level of significance), it can be concluded that our models is a good model.
2. Because this is multiple regression, we see Adjusted R-squared that have value 0.8324. it means that 83.24% variation of independent variables can explain variation of dependent variable. We have a good model.
3. Look more specifically with each of independent variables, with t-test, all variables have significant p-value < 0.05. It means that all independent variables can explain dependent variables (Strength).
4. Residual standard error 6.842, it means that average distance of the data points from the fitted line is about 6,842%.

Interpretation from the Model:
1. Without intervention or influence from independent variable, concrete compressive strength have initial capital 194.383 unit value, Ceteris Paribus.
2. From all independent variables, Age is the variable that can provide the highest positive added value for Concrete compressive strength. Every increase in 1 percent Age (percent because log), it can increases concrete compresssive strength 8.584 unit value, Ceteris Paribus.
3. From all independent variables, Water is the variable that can provide the highest negative added value for Concrete compressive strength. Every increase in 1 percent Water (percent because log), it can decreases concrete compresssive strength 39.445 unit value, Ceteris Paribus.
4. The rest of the variables have influence but are not very strong like Cement (0.089), SuperPlast (1.951), Slag (2.256), FineAgg (-0.019), and flyash (0.578), Ceteris Paribus.
Notes : Ceteris Paribus = “all other things being unchanged or constant”

Now we will predict how the performance MAE in our data train

4.4 Predict in data train

predict1 <- predict(object = modelaic, newdata = data_train)
predict2 <- predict(object = modelaic, newdata = data_test)

MAE(predict1, data_train$strength)
MAE(predict2, data_test$strength)

># [1] 5.399918
># [1] 5.225995

Can be seen, MAE predict with own data as well and the split not too different (5.399918) in data_train with (5.225995) in data_test, we have good model.

4.5 Assumption Test

4.5.1 Assumption of Normality

means that the residuals from the linear regression model should be normally distributed because we expect to get residuals near the zero value. to test the condition, we will use Shapiro-Wilk Test for our residual :

shapiro.test(modelaic$residuals)

># 
>#  Shapiro-Wilk normality test
># 
># data:  modelaic$residuals
># W = 0.997, p-value = 0.4048

\(H_{0}\) = Residuals are normally distributed.
\(H_{1}\) = Residuals are not normally distributed.

The test has a p-value above the significance level of 0.05. Therefore we don’t reject the null hypothesis. We can conclude that the residuals are normally distributed.

How if not normal, will this be a problem with the model?
Quoting Gujarati (2004),
“Violation of the normality assumption doesn’t contribute to bias or inefficiency in regression models. It is only important for the calculation of p-values for significance testing, but this is only a consideration when sample size is very small. When the sample size is sufficiently large (>200), the normality assumption is not needed at all as the Central Limit Theorem (CLT) ensures that the distribution of disturbance term will approximate normality”
Conclusions :
Since our sample have sufficiently large (700++ samples), normality assumption is not needed because Central Limit Theorem applicable.

4.5.2 Assumption of Multicollinearity

means to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. for this assumption, we will test use Variance Inflation Factor (VIF).

vif(modelaic)

>#         age      cement super_plast        slag       water    fine_agg 
>#    1.037311    1.811081    3.437216    1.868256    2.170822    1.505277 
>#      flyash 
>#    2.799520

A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection. Our VIF test to the model doesn’t have a VIF number that greater than 10, it means in the model don’t have high collinearity.
Conclusions : this assumption is fulfilled

4.5.3 Assumption of Homoscedasticity

common statistical test to check if there is Homoscedasticity or Heteroscedasticity using the Breusch-Pagan Test. Heteroscedasticity is a condition where the variability of a variable is unequal across its range of value. In a linear regression model, if the variance of its error is showing unequal variation across the target variable range, it shows that heteroscedasticity is present and the implication to that is related to the previous statement of a non-random pattern in residual.

bptest(modelaic)

># 
>#  studentized Breusch-Pagan test
># 
># data:  modelaic
># BP = 62.869, df = 7, p-value = 0.00000000004027

\(H_{0}\) = Homoscedasticity.
\(H_{1}\) = Heteroscedasticity.
The test has a p-value < 0.05, therefore we reject the null hypothesis.
Conclusions : The residuals has not a constant variance or Heteroscedasticity.
To cure this problem, I will use the Box-Cox in the part below.

4.5.3.1 Box-Cox Transformation (Model Improvement)

Box-Cox Transformation is the tools that transform residuals from having a non-normal distribution to a normal distribution.

BCdist <- BoxCoxTrans(data_train$strength)
data_train <- cbind(data_train, dist_BC = predict(BCdist, data_train$strength))
BClmdata <- lm(dist_BC ~ age + water + fine_agg + coarse_agg + 
    flyash + super_plast, data = data_train)
gvlma(BClmdata)

># 
># Call:
># lm(formula = dist_BC ~ age + water + fine_agg + coarse_agg + 
>#     flyash + super_plast, data = data_train)
># 
># Coefficients:
># (Intercept)          age        water     fine_agg   coarse_agg       flyash  
>#   162.97814      2.16529    -23.32692     -0.02477     -0.01735     -0.30053  
># super_plast  
>#     0.25956  
># 
># 
># ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
># USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
># Level of Significance =  0.05 
># 
># Call:
>#  gvlma(x = BClmdata) 
># 
>#                       Value    p-value                   Decision
># Global Stat        27.47476 0.00001593 Assumptions NOT satisfied!
># Skewness           15.71702 0.00007356 Assumptions NOT satisfied!
># Kurtosis            3.69636 0.05453135    Assumptions acceptable.
># Link Function       0.06496 0.79882835    Assumptions acceptable.
># Heteroscedasticity  7.99642 0.00468699 Assumptions NOT satisfied!

After the Box-Cox Transformation, now the model still not fulfill the Homocedasticity assumption. Maybe for the advanced,, we can use GLS method to cure the homocedasticity but not discussed in this project.

4.6 Finding Right Material Composition

In this part, I will divide the age (before the log) into 4 categories (0-90), (91-180), (181-270), (270-365) to find out the difference for each composition.

train1 <- read.csv("data/data-train.csv")
train1 <- train1 %>% 
  mutate(agegroup = case_when(age < 91 ~ "00-90",
                              age >= 91 & age < 181 ~ "91-180",
                              age >= 181 & age < 271 ~ "181-270",
                              age >= 271 & age < 366 ~ "271-365"))
trainfinal <- train1[, c(10,11)]
anova <- aov(strength~agegroup, data=trainfinal)
summary(anova)

>#              Df Sum Sq Mean Sq F value              Pr(>F)    
># agegroup      3  25708    8569    34.4 <0.0000000000000002 ***
># Residuals   821 204503     249                                
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

to find the difference of mean of concrete compression strength in age, I will use One Way ANOVA,

Define Hypothesis :
\(H_{0} : \mu_{00-90} = \mu_{91-180} = \mu_{181-270} = \mu_{271-365}\)
\(H_{1}\) : Not all population means are equal

Analysis : Can be seen from agegroup variable in One Way ANOVA table, the p=value < 0.05 (Reject Ho) which is significant.
Can be conclude that age group class are different each other to influence concrete compression strength.
Now we have another problem that ANOVA test doesn’t indicate which ones differ. to find out which age group means differ, I will using pairwise t-test to see more detail.

pairwise.t.test(trainfinal$strength, trainfinal$agegroup, p.adjust.method = "none")

># 
>#  Pairwise comparisons using t tests with pooled SD 
># 
># data:  trainfinal$strength and trainfinal$agegroup 
># 
>#         00-90               181-270 271-365
># 181-270 0.0004              -       -      
># 271-365 0.0094              0.2676  -      
># 91-180  <0.0000000000000002 0.8864  0.1488 
># 
># P value adjustment method: none

Define Hypothesis :
\(H_{0}\) : not different each other
\(H_{1}\) : different each other
if p-value < 0.05 , Reject Ho.

From the pairwise t-test table result, can be seen ONLY category 00-90 age group that have pvalue < 0.05 with all another age group.
Conclusions : use 00-90 age group to get maximum or higher concrete compression strength.

5 Predict Data Test

To validate our model, we will predict to data test(unseen data) that will be submit to leaderboard.

predict3 <- predict(modelaic, newdata = test)
hasil <- data.frame(id = test$id, strength = predict3)
write.csv(hasil, "data/datatest-concreteanalysisfinal.csv")

From predict to data test in Algoritma Leaderboard score, I get MAE score 6 and R-squared 78%.

6 Conclusion

Logarithm transformation needed to overcome the skewed distribution and remove some outlier problem too.
From the regression result, we have F-test pvalue < 0.05 and t-test with all variables pvalue <0.05, which is good to our model and significant to influence the strength variable.
From the regression result, we get R-squared 0.8313 which 83.13% variation of independent variable can explain variation of dependent variable.
This case can become recommendation to choose important material that can explain strength such as age, cement, super_plast, slag, water, fine_agg, and flyash.

7 Citation

Gujarati, D.N. (2004) Basic Econometrics. 4th Edition, McGraw-Hill Companies.
I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

Concrete, What Factors are Important to explain Strength?

Aji Putera Tanumihardja

3/30/2020