Introduction

Concrete is one of the most important and widely used building materials known to humans. Concrete was first used by the Greeks 1 . However, the Romans were the first to fully exploit its properties. Since then, concrete has remained a central building material, and it continues to evolve as new materials and formulas are developed. In this project, we'll be optimizing a concrete formula.

In this paper, we'll learn about the different mixtures of concrete and how aging / curing contributes to their strength. Our goal is to formulate a new mixture that maximizes concrete strength. To do that, we'll analyze and learn from a dataset of concrete formulas and their strengths. The dataset used here is provided as part of the Machine Learning Capstone Project from Algoritma Data Science School. Because all the predictors and the target variable are numeric, we'll be building regression models for our analysis.

Getting to know our data

Before getting into the data, let's import the required libraries.

library(tidyverse) # For data processing and visualization
library(GGally) # Create correlation matrix with ggcorr
library(MLmetrics) # To compare regression results (MAPE, MAE, R2_Score)
library(car) # To calculate VIF
library(lmtest) # for Breusch-Pagan test
library(glue) # Pretty printing
library(kableExtra) # Better looking tables

And then import our data

data <- read_csv('data/data-train.csv')
data %>% 
    head() %>% 
    kable() %>% 
    kable_styling()
id cement slag flyash water super_plast coarse_agg fine_agg age strength
S1 540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99
S2 540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89
S3 332.5 142.5 0 228 0.0 932.0 594.0 270 40.27
S4 332.5 142.5 0 228 0.0 932.0 594.0 365 41.05
S5 198.6 132.4 0 192 0.0 978.4 825.5 360 44.30
S6 380.0 95.0 0 228 0.0 932.0 594.0 365 43.70

Our dataset consists of 825 rows and 10 columns showing the composition of concrete mixtures:

  • id : (char) ID of the mixture
  • cement : (numeric) Amount of cement
  • slag : (numeric) Amount of slag
  • flyash : (numeric) Amount of flyash
  • water : (numeric) Amount of water
  • super_plast : (numeric) Amount of superplasticizer
  • coarse_agg : (numeric) Amount of coarse aggregates
  • fine_agg : (numeric) Amount of fine aggregates
  • age : (numeric) Age of the concrete
  • strength : (numeric) Strength of the mixture

Data preprocessing

Before doing analysis, it is good practice to preprocess our data. Preprocessing involves cleaning and transforming data before using it for analysis and modeling.

Missing values

The first thing that we'll do in data preprocessing is to check for missing values.

colSums(is.na(data))
##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

Our data does not have missing values, which is good. If there were missing values, we would have to decide whether to impute them or exclude the observations.
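As a sketch of what those two options could look like (hypothetical, since our data is complete; drop_na and replace_na come with the tidyverse):

# Hypothetical handling of missing values (not needed for this dataset)
# Option 1: drop any observation containing an NA
data.complete <- data %>% drop_na()

# Option 2: impute a numeric column, e.g. water, with its median
data.imputed <- data %>% 
    mutate(water = replace_na(water, median(water, na.rm = TRUE)))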

Removing outliers

Let's now take a look at the distribution of our data by drawing boxplots.

data %>% 
   select_if(is.numeric) %>% 
   boxplot(main = 'Boxplot of parameters', xlab = 'Parameters', ylab = 'Values')

Visual inspection of our data helps us see whether it has outliers. Outliers impact the model that we're building by forcing it to accommodate them. As the name implies, outliers lie outside the bulk of our data. Accommodating outliers in our model might lessen its resolving power.
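For reference, the boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule for a single column, equivalent in spirit to what boxplot()$out returns:

# Sketch of the 1.5 * IQR rule that boxplots use, applied to `age`
q <- quantile(data$age, c(0.25, 0.75))
iqr <- q[2] - q[1]
data %>% 
    filter(age < q[1] - 1.5 * iqr | age > q[2] + 1.5 * iqr)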

From the boxplot, we can see that some variables obviously have outliers, e.g. age, super_plast, slag. Let's identify the observations that contain these outliers and see if we find anything interesting.

# slag
outlier_slag <- min(boxplot(data['slag'], plot = FALSE)$out)
data.outliers <- data %>% 
   filter(slag >= outlier_slag)

# water
boxplot(data['water'], plot = FALSE)$out
## [1] 121.8 121.8 121.8 121.8 121.8 237.0 247.0 246.9 236.7
# water has two outlier groups: <= 121.8 and >= 237.0
data.outliers <- data %>% 
   filter(water <= 121.8 | water >= 237.0)

# super_plast
outlier_super_plast <- min(boxplot(data['super_plast'], plot = FALSE)$out)
data.outliers <- data %>% 
   filter(super_plast >= outlier_super_plast)

# fine_agg
outlier_fine_agg <- min(boxplot(data['fine_agg'], plot = FALSE)$out)

# fine_agg has two outlier groups: <= 594.0 and >= 992.6
data.outliers <- data %>% 
   filter(fine_agg <= 594.0 | fine_agg >= 992.6)

# age
# (note: each assignment above overwrites data.outliers, so the object
#  inspected below contains the age outliers)
outlier_age <- min(boxplot(data['age'], plot = FALSE)$out)
data.outliers <- data %>% 
   filter(age >= outlier_age)

data.outliers %>% 
    kable() %>% 
    kable_styling()
id cement slag flyash water super_plast coarse_agg fine_agg age strength
S3 332.5 142.5 0 228 0 932.0 594.0 270 40.27
S4 332.5 142.5 0 228 0 932.0 594.0 365 41.05
S5 198.6 132.4 0 192 0 978.4 825.5 360 44.30
S6 380.0 95.0 0 228 0 932.0 594.0 365 43.70
S14 342.0 38.0 0 228 0 932.0 670.0 365 56.14
S16 475.0 0.0 0 228 0 932.0 594.0 180 42.62
S17 427.5 47.5 0 228 0 932.0 594.0 180 41.84
S20 139.6 209.4 0 192 0 1047.0 806.9 180 44.21
S21 380.0 0.0 0 228 0 932.0 670.0 365 52.52
S22 380.0 95.0 0 228 0 932.0 594.0 270 41.15
S23 342.0 38.0 0 228 0 932.0 670.0 180 52.12
S25 304.0 76.0 0 228 0 932.0 670.0 365 55.26
S26 266.0 114.0 0 228 0 932.0 670.0 365 52.91
S27 475.0 0.0 0 228 0 932.0 594.0 270 42.13
S28 190.0 190.0 0 228 0 932.0 670.0 365 53.69
S29 237.5 237.5 0 228 0 932.0 594.0 270 38.41
S33 380.0 0.0 0 228 0 932.0 670.0 180 53.10
S39 332.5 142.5 0 228 0 932.0 594.0 180 39.78
S45 304.0 76.0 0 228 0 932.0 670.0 180 50.95
S47 304.0 76.0 0 228 0 932.0 670.0 270 54.38
S48 266.0 114.0 0 228 0 932.0 670.0 270 51.73
S50 190.0 190.0 0 228 0 932.0 670.0 270 50.66
S51 266.0 114.0 0 228 0 932.0 670.0 180 48.70
S52 342.0 38.0 0 228 0 932.0 670.0 270 55.06
S53 139.6 209.4 0 192 0 1047.0 806.9 360 44.70
S491 339.0 0.0 0 197 0 968.0 781.0 180 36.45
S495 236.0 0.0 0 193 0 968.0 885.0 180 24.10
S499 277.0 0.0 0 191 0 968.0 856.0 180 32.33
S500 277.0 0.0 0 191 0 968.0 856.0 360 33.70
S503 254.0 0.0 0 198 0 968.0 863.0 180 27.63
S504 254.0 0.0 0 198 0 968.0 863.0 365 29.79
S505 307.0 0.0 0 193 0 968.0 812.0 180 34.49
S506 307.0 0.0 0 193 0 968.0 812.0 365 36.15
S607 540.0 0.0 0 173 0 1125.0 613.0 180 71.62
S608 540.0 0.0 0 173 0 1125.0 613.0 270 74.17
S614 350.0 0.0 0 203 0 974.0 775.0 180 32.72
S620 331.0 0.0 0 192 0 978.0 825.0 180 39.00
S621 331.0 0.0 0 192 0 978.0 825.0 360 41.24
S638 349.0 0.0 0 192 0 1047.0 806.0 180 41.05
S639 349.0 0.0 0 192 0 1047.0 806.0 360 42.13
S641 302.0 0.0 0 203 0 974.0 817.0 180 26.74
S642 525.0 0.0 0 189 0 1125.0 613.0 180 61.92
S644 500.0 0.0 0 200 0 1125.0 613.0 180 51.04
S655 310.0 0.0 0 192 0 970.0 850.0 180 37.33
S656 310.0 0.0 0 192 0 970.0 850.0 360 38.11
S660 525.0 0.0 0 189 0 1125.0 613.0 270 67.11

Let's see whether these outliers have strength above average.

quantiles <- quantile(data$strength, c(0.25, 0.5, 0.75), type = 1)

data %>% 
    ggplot(aes(x = id, y = strength)) +
    geom_point() + 
    geom_point(data = data.outliers, aes(x = id, y = strength), col = 'red') + 
    labs(
        title = 'Distribution of strength : normal vs outlier (red)',
        x = 'ID',
        y = 'Strength'
    )

From the scatter plot above, we can see that the strength of the outlier population is well above the mean of the data. However, they do not form a group that consistently outperforms the others. To more objectively determine whether they are really different, let's do a t-test.

data.no_outliers <- data %>% filter(!id %in% data.outliers$id)
t.test(data.no_outliers$strength, data.outliers$strength)
## 
##  Welch Two Sample t-test
## 
## data:  data.no_outliers$strength and data.outliers$strength
## t = -5.5883, df = 57.951, p-value = 6.449e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.256896  -6.264277
## sample estimates:
## mean of x mean of y 
##  35.24376  45.00435

The t-test results in a p-value < 0.05. This means we can safely say that the outlier and normal populations have different strengths. They may merit a separate investigation, but for this project, let's remove the outliers. We have created data.no_outliers for our analysis.

This is what the data looks like after removing the outliers.

data.no_outliers %>% 
   select_if(is.numeric) %>% 
   boxplot(main = 'Boxplot of parameters, without outliers', xlab = 'Parameters', ylab = 'Values')

As you can see, there are still outliers. These are observations that are newly classified as outliers because the center of the data shifted after the removal. They are, however, not far from the whiskers of the boxplot.

Exploratory Data Analysis

Data Distribution

With the outliers gone, let's see the distribution of our data. To do this, we plot a density graph.

data.no_outliers %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'param') %>% 
    ggplot(aes(x = value)) +
    geom_density() +
    facet_wrap(~param, scales = 'free_x')  +
    labs(
        title = 'Density graph of parameters',
        x = 'Parameters',
        y = 'Frequency'
    )

From the graph, we can see that cement, coarse_agg, fine_agg, flyash and slag are relatively uniform. This might mean that these variables are combined freely rather than dosed to a specific amount; they are the basic materials in concrete. In most recipes, water tends to be used in quantities between 150 and 200. super_plast is either not used at all or used in quantities around 10. The recipes are mostly aged around 7, 30, 60, and 90-100.

Trend

Now let's see the trend of strength against each predictor.

data.no_outliers %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'param') %>% 
    ggplot(aes(x = value, y = strength)) +
    geom_point() +
    geom_smooth() +
    facet_wrap(~param, scales = 'free_x')  +
    labs(
        title = 'Trends in each parameter',
        x = 'Parameters',
        y = 'Values'
    )

From the plots, we can see that cement and super_plast have a strong positive correlation with strength. coarse_agg, fine_agg, flyash and slag have a slightly negative correlation. age and water do not show a linear relationship: age shows a curve that flattens out, while water shows a sinusoidal pattern. Linear regression works best with linear relationships, so we can try to transform the non-linear variables to make their relationship with strength more linear.

Let's try to make our data more linear by taking the log of age.

data.no_outliers %>% 
    select_if(is.numeric) %>% 
    select(age, strength) %>% 
    ggplot(aes(x = log(age), y = strength)) +
    geom_point() +
    geom_smooth() + 
    labs(
        title = 'log(age) vs. strength',
        x = 'log(age)',
        y = 'strength'
    )

We see that the relationship is now more linear. We'll persist this change in our dataset.

data.no_outliers <- data.no_outliers %>% 
    mutate(age = log(age))

Linearizing Water

Now let's try to linearize water by also taking its log.

data.no_outliers %>% 
    ggplot(aes(x = log(water), y = strength)) +
    geom_point() + 
    geom_smooth() + 
    labs(
        title = 'log(water) vs. strength',
        x = 'log(water)',
        y = 'strength'
    )

The shape is still sinusoidal; our transform does not help here. For this kind of relationship, linear regression does not fit the data well, and we may have to use other fitting approaches.

Correlation

One of the most important factors in linear regression is the correlation between predictors. Predictors with high correlation affect each other, which skews the results of the regression model. Correlation can be positive or negative. To check correlation, we can use the ggcorr function from the GGally package.

ggcorr(data.no_outliers, label = TRUE) + 
    labs(title = 'Correlation matrix')

The higher the absolute value of the correlation (closer to 1 or -1), the more related the predictors are. From the correlation matrix plot above, we can see the variables have relatively low correlation (absolute value < 0.5). The strongest correlation between predictors is between super_plast and water (-0.6), and it is negative.

The highest positive correlations with strength are age, cement and super_plast, meaning these variables contribute quite strongly and positively to strength. water, on the other hand, has the strongest negative correlation, meaning it negatively affects strength.
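To read exact numbers rather than the heat map, we can also pull each variable's correlation with strength directly from the same data:

# Numeric correlations of each variable with strength, sorted
cors <- data.no_outliers %>% 
    select_if(is.numeric) %>% 
    cor()
sort(cors[, 'strength'])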

Scaling data

Before we use the data, let's look at the scales of our variables.

data.no_outliers %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -strength, names_to = 'param') %>% 
    group_by(param) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = param, y = value)) +
    geom_col(fill = 'blue') + 
    labs(
        title = 'Range of data, unscaled',
        x = 'Parameters',
        y = 'Values'
    )

The chart above shows the range of values for our variables. We can see that they are not balanced, with coarse_agg reaching a maximum of 1145 while age (now log-transformed) only reaches 4.7874917. This skews the values of our coefficients: variables with larger values get small coefficients, while variables with smaller values may get large coefficients. Since we'll later compare these coefficients, such a skewed analysis would not lead to the right conclusions.

To better reflect the influence of each variable, we have to scale the values. By scaling the data, we put each variable on an equal footing, so we can interpret the coefficients of our model correctly.
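Under the hood, scale() standardizes each column into a z-score,

\(x_{scaled} = \frac{x - \bar{x}}{s_x}\)

where \(\bar{x}\) is the column mean and \(s_x\) its standard deviation. Both values are stored as attributes of the scaled object, which is what will let us unscale predictions later.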

At this step, we'll also remove the ID column, as we'll not be needing it anymore.

# Remove ID and scale
data.scaled <- data.no_outliers %>% 
    select(-id) %>% scale()

data.scaled %>% 
    as.tibble() %>% 
    pivot_longer(cols = -strength, names_to = 'param') %>% 
    group_by(param) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = param, y = value)) +
    geom_col(fill = 'blue') + 
    labs(
        title = 'Range of data, scaled',
        x = 'Parameters',
        y = 'Values'
    )

As we can see, the values of our parameters are more equal after scaling.

Splitting data

We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.

idx <- sample(nrow(data.scaled), nrow(data.scaled) * 0.8)

data.train <- as.tibble(data.scaled[idx, ])
data.validation <- as.tibble(data.scaled[-idx, ])
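The split is random (no seed was fixed here), so the exact rows in each set can vary between runs. A quick check that the proportions came out at roughly 80/20:

# Sizes of the training and validation sets
nrow(data.train)
nrow(data.validation)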

Now we are done with preprocessing and preparing our data. Let's start modeling.

Regression Model

With numeric predictors and a numeric target, it is natural to build a linear regression model. Let's build one and see how it fits our data.

Minimal model

Let's begin with a minimal model: a model with no predictors, only an intercept. In this case, the model simply outputs the mean of the target, because the mean statistically produces the least squared error.

model.none <- lm(strength ~ 1, data.train)
summary(model.none)
## 
## Call:
## lm(formula = strength ~ 1, data = data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9790 -0.7447 -0.1003  0.6289  2.7889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.02401    0.04042   0.594    0.553
## 
## Residual standard error: 1.009 on 622 degrees of freedom
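As a quick sanity check, the intercept of model.none should simply equal the mean of strength in the training set (about 0.024 above):

# The intercept-only model predicts the mean of the target
mean(data.train$strength)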

Maximal model

Now we'll build the maximal model. The maximal model uses all parameters in our data.

model.all <- lm(strength ~ ., data.frame(data.train))
summary(model.all)
## 
## Call:
## lm(formula = strength ~ ., data = data.frame(data.train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08024 -0.25520 -0.02474  0.23041  1.69142 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.005052   0.015922   0.317 0.751121    
## cement       0.881053   0.042879  20.548  < 2e-16 ***
## slag         0.617157   0.042535  14.509  < 2e-16 ***
## flyash       0.385277   0.039047   9.867  < 2e-16 ***
## water       -0.130143   0.038411  -3.388 0.000749 ***
## super_plast  0.007873   0.026061   0.302 0.762693    
## coarse_agg   0.162681   0.035030   4.644 4.18e-06 ***
## fine_agg     0.195524   0.040543   4.823 1.79e-06 ***
## age          0.611239   0.016073  38.030  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.397 on 614 degrees of freedom
## Multiple R-squared:  0.8472, Adjusted R-squared:  0.8452 
## F-statistic: 425.4 on 8 and 614 DF,  p-value: < 2.2e-16

Logically, as it uses all the variables in our data, we expect the maximal model to be the best. However, this might not be the case, so we'll check using stepwise regression.

Variable Selection

We've built model.none, which uses no predictors, and model.all, which uses all variables. Stepwise regression is a method to pick out an optimal model using the Akaike Information Criterion (AIC) as its metric. The method minimizes the AIC, meaning the least information loss. Let's try to pick the important variables using stepwise regression. Note that it uses a greedy algorithm that can get stuck in a local minimum, so it does not guarantee the best model.

model.step <- step(model.all, scope = formula(model.none), direction = 'backward', trace = 0)
summary(model.step)
## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + coarse_agg + 
##     fine_agg + age, data = data.frame(data.train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0791 -0.2541 -0.0258  0.2328  1.6867 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.005131   0.015908   0.323    0.747    
## cement       0.881286   0.042840  20.572  < 2e-16 ***
## slag         0.617556   0.042483  14.537  < 2e-16 ***
## flyash       0.386772   0.038703   9.993  < 2e-16 ***
## water       -0.136286   0.032561  -4.186 3.26e-05 ***
## coarse_agg   0.159365   0.033241   4.794 2.05e-06 ***
## fine_agg     0.194036   0.040213   4.825 1.77e-06 ***
## age          0.611798   0.015954  38.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3967 on 615 degrees of freedom
## Multiple R-squared:  0.8471, Adjusted R-squared:  0.8454 
## F-statistic: 486.9 on 7 and 615 DF,  p-value: < 2.2e-16

model.step uses only 7 variables, omitting super_plast. Let's see how this model performs during evaluation.
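Since stepwise selection optimizes AIC, we can also compare the two candidates' AIC directly before looking at validation metrics (lower is better):

# Compare the information criteria of the two candidate models
AIC(model.all, model.step)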

Model Evaluation

The models were trained on scaled data, so any prediction made by a model has to be scaled back. The function unscaleColumn below scales a column back using the scaling attributes stored in the scaled object. To make comparison easier, we'll also make an unscaled version of data.validation's strength.

unscaleColumn <- function(x, scaled_data, column) {
    unscaled <- x * attr(scaled_data, 'scaled:scale')[column] + attr(scaled_data, 'scaled:center')[column]
    return(unscaled)
}

# unscaled `strength` from `data.validation`
data.validation.strength.unscaled <- unscaleColumn(data.validation$strength, data.scaled, 'strength')
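As a sanity check, unscaling the scaled strength column should recover the original values from data.no_outliers:

# Round-trip check: this should return TRUE
all.equal(
    as.numeric(unscaleColumn(data.scaled[, 'strength'], data.scaled, 'strength')),
    data.no_outliers$strength
)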

Now let's start by evaluating model.all

# Evaluating
result.all <- predict(model.all, data.validation)
result.all <- unscaleColumn(result.all, data.scaled, 'strength')

result.all.mae <- MAE(result.all, data.validation.strength.unscaled)
result.all.mape <- MAPE(result.all, data.validation.strength.unscaled)
result.all.r2 <- R2_Score(result.all, data.validation.strength.unscaled)

glue('Result of `model.all`')
## Result of `model.all`
glue('MAE : {round(result.all.mae, 4)}')
## MAE : 5.7245
glue('MAPE : {round(result.all.mape, 4)}')
## MAPE : 0.2101
glue('R-squared : {round(result.all.r2, 4)}')
## R-squared : 0.7936
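For reference, the MAPE reported above is the mean absolute percentage error,

\(MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\)

so a value of about 0.21 means the predictions are off by roughly 21% on average.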

Let's compare it to model.step

result.step <- predict(model.step, data.validation)
result.step <- unscaleColumn(result.step, data.scaled, 'strength')

result.step.mae <- MAE(result.step, data.validation.strength.unscaled)
result.step.mape <- MAPE(result.step, data.validation.strength.unscaled)
result.step.r2 <- R2_Score(result.step, data.validation.strength.unscaled)

glue('Result of `model.step`')
## Result of `model.step`
glue('MAE : {round(result.step.mae, 4)}')
## MAE : 5.7362
glue('MAPE : {round(result.step.mape, 4)}')
## MAPE : 0.2107
glue('R-squared : {round(result.step.r2, 4)}')
## R-squared : 0.7926

Comparing the results, we can see that model.all performs slightly better than model.step: it has lower MAE and MAPE and a higher R-squared. Moving forward, we'll only be using model.all.

Checking Assumptions

Linear models rest on four assumptions. Before we carry on, we have to check whether these assumptions hold for our model.

1. Assumption of linearity

The assumption of linearity assumes that there is a linear relationship between the predictors and the target variable, so that our model can correctly describe it. A visual way to evaluate this is to plot the residuals against the fitted values.

plot(model.all$fitted.values, model.all$residuals, ylab = 'Residuals', xlab = 'Prediction', main = 'Residuals vs. fitted values of `model.all`')
abline(h = 0, col = 'red')

Visually, we can see that the residuals are not randomly and evenly distributed around zero. The residuals form clusters at lower fitted values and only spread out evenly after that.

2. Independence of Variables

Our model was built on the assumption that the variables are linearly independent, i.e., that there is no multicollinearity. We can check this by looking at the Variance Inflation Factor (VIF) using the vif function from the car package.

vif(model.all)
##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.648712    7.180219    5.970600    5.747052    2.832321    4.925328 
##    fine_agg         age 
##    6.682875    1.031804

A common rule of thumb is that independent variables have a VIF below 4. From the test, we can see that most of our variables exceed this threshold and are multicollinear: they are related and tend to move together, which might skew the interpretation of our model.
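To list just the offending predictors, we can filter the VIF values against the threshold:

# Predictors whose VIF exceeds the threshold of 4
vif.all <- vif(model.all)
vif.all[vif.all > 4]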

3. Homoscedasticity

The homoscedasticity assumption states that the variance of the error term is constant across all values of the inputs. This assumption can be checked using the Breusch-Pagan test with the hypotheses:

  • H0 : The error variance is the same across all inputs (homoscedastic)
  • H1 : The error variance is not the same across all inputs (heteroscedastic)

We can test the homoscedasticity of the model using the Breusch-Pagan test.

bptest(model.all)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.all
## BP = 55.2, df = 8, p-value = 4.038e-09

The p-value of the Breusch-Pagan test is less than 0.05. Therefore, we reject the null hypothesis and conclude that our model's errors are heteroscedastic.

4. Normality

The normality assumption states that the residuals of the model are normally distributed. This assumption can be checked with the hypotheses:

  • H0 : Residuals are normally distributed
  • H1 : Residuals are not normally distributed

Because our dataset is relatively small (< 5000 observations), we can test the normality of the residuals using the Shapiro-Wilk test.

shapiro.test(model.all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.all$residuals
## W = 0.99333, p-value = 0.007244

The p-value of the Shapiro-Wilk test is less than 0.05. Therefore, we reject the null hypothesis and conclude that the residuals of the model are not normally distributed.
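As a visual complement to the Shapiro-Wilk test, a normal Q-Q plot of the residuals shows where they depart from normality:

# Normal Q-Q plot of the residuals of `model.all`
qqnorm(model.all$residuals, main = 'Q-Q plot of `model.all` residuals')
qqline(model.all$residuals, col = 'red')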

Getting insights

Even though our linear model does not pass the assumption checks, we can still try to get some insights. Our model has a mean absolute percentage error of about 21%, which is pretty decent.

Let's look at the coefficients of the model.

summary(model.all)
## 
## Call:
## lm(formula = strength ~ ., data = data.frame(data.train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08024 -0.25520 -0.02474  0.23041  1.69142 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.005052   0.015922   0.317 0.751121    
## cement       0.881053   0.042879  20.548  < 2e-16 ***
## slag         0.617157   0.042535  14.509  < 2e-16 ***
## flyash       0.385277   0.039047   9.867  < 2e-16 ***
## water       -0.130143   0.038411  -3.388 0.000749 ***
## super_plast  0.007873   0.026061   0.302 0.762693    
## coarse_agg   0.162681   0.035030   4.644 4.18e-06 ***
## fine_agg     0.195524   0.040543   4.823 1.79e-06 ***
## age          0.611239   0.016073  38.030  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.397 on 614 degrees of freedom
## Multiple R-squared:  0.8472, Adjusted R-squared:  0.8452 
## F-statistic: 425.4 on 8 and 614 DF,  p-value: < 2.2e-16

The model was trained on the scaled data, with age log-transformed beforehand. Looking at the coefficients, all variables have a positive sign except water. This means that, according to our model, all variables except water contribute positively to strength.

Let's extract the coefficients and plot them to make them easier to compare.

coef.all <- model.all$coefficients[-1]
barplot(coef.all, xlab = 'Parameter', ylab = 'Value', main = 'Values of `model.all` coefficients')

From the chart, we can see that cement, slag and age (log-transformed) are strong factors influencing concrete strength.

Improving the model

Although we have tried to transform the dataset, we did not manage to get a satisfactory model. The relationships in the data might simply not be linear. So let's try another model, polynomial regression, to improve our fit.

Multiple Polynomial Regression Model

The polynomial regression model is an extension of the linear regression model. The difference is that the terms in a polynomial regression model can be made up of combinations of variables, up to a certain degree. Whereas linear models contain only linear terms, polynomial models can also contain power and interaction terms. Mathematically, the single-variable polynomial regression model is described as

\(y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \dots + \beta_nx^n + \epsilon\)

Just like linear regression, polynomial regression also has a multi-variable version, multiple polynomial regression. In R, the polynomial terms are generated using the polym function.
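To get a feel for what polym generates, here is a tiny example on toy data (not our dataset) with two variables and degree 2; the column names encode the degree of each input variable in every term, the same convention we'll meet again in the model's coefficient names:

# Toy example: polym() expands two variables into power and interaction terms
colnames(polym(1:5, 6:10, degree = 2, raw = TRUE))
## expected: "1.0" "2.0" "0.1" "1.1" "0.2"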

Model building

Here we define a polynomial model using all the variables in our dataset. We set the maximum degree of each term to be 3.

model.poly <- lm(strength ~ polym(age, cement, super_plast, slag, water, flyash, coarse_agg, fine_agg, degree = 3, raw = T), data.train )

Model evaluation

Now let's evaluate model.poly

result.poly <- predict(model.poly, data.validation)
result.poly <- unscaleColumn(result.poly, data.scaled, 'strength')

result.poly.mae <- MAE(result.poly, data.validation.strength.unscaled)
result.poly.mape <- MAPE(result.poly, data.validation.strength.unscaled)
result.poly.r2 <- R2_Score(result.poly, data.validation.strength.unscaled)

glue('Result of `model.poly`')
## Result of `model.poly`
glue('MAPE : {round(result.poly.mape, 4)}')
## MAPE : 0.1267
glue('MAE : {round(result.poly.mae, 4)}')
## MAE : 3.7229
glue('R-squared : {round(result.poly.r2, 4)}')
## R-squared : 0.8963

The model produces a much better result than model.all, which shows that the polynomial model fits the data better.

Checking assumptions

The polynomial model is still a linear model in its coefficients, even though its terms are non-linear in the predictors. So the model uses the same assumptions as linear regression, and we can use the same checks to see if the assumptions hold.

1. Assumption of linearity

Let's see how the residuals distribute along the model.

plot(model.poly$fitted.values, model.poly$residuals, xlab = 'Predicted', ylab = 'Residuals', main = 'Distribution of residuals of `model.poly`')
abline(h = 0, col = 'red')

Visually, the residuals of model.poly are distributed more evenly and randomly around zero.

2. Independence of variables

We cannot meaningfully evaluate this assumption for the polynomial model, because the variables no longer appear as independent terms; they are combined into power and interaction terms. We shall see this as we dive deeper into the model in the next subsection.

3. Homoscedasticity

bptest(model.poly)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.poly
## BP = 235.21, df = 164, p-value = 0.0002242

The Breusch-Pagan test of our model shows a p-value below 0.05, which means our model is heteroscedastic: the variance of the residuals changes with the inputs.

4. Normality

shapiro.test(model.poly$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.poly$residuals
## W = 0.98837, p-value = 7.423e-05

The p-value of the Shapiro-Wilk test is less than 0.05. Therefore, the residuals of the model are not normally distributed.

Interpreting model.poly

Although model.poly does not satisfy all the assumptions of a linear model, its metrics show a better fit to our data, which means the model can better describe the data. However, interpreting the coefficients of model.poly is more difficult because of the large number of parameter combinations. The model consists of 165 terms, made up of combinations of all the factors. An example of the terms in the model is as follows.

model.poly$coefficients[2]
## polym(age, cement, super_plast, slag, water, flyash, coarse_agg, fine_agg, degree = 3, raw = T)1.0.0.0.0.0.0.0 
##                                                                                                      0.7268689

The numeric suffix gives the degree of each variable in the term, in the order the variables were passed to polym(...), and the value shown is the coefficient of that term. To more easily see the values of the coefficients, let's plot them.

coef.poly <- model.poly$coefficients[-1]
plot(coef.poly, type = 'l', xlab = 'Term index', ylab = 'Coefficient', main = 'Values of `model.poly` coefficients')

From the plot, we can see many high and low coefficients. Let's pick the 5 highest and 5 lowest coefficients and try to learn from them. The function below converts the representation of each term into something easier to read.

# Convert polym() coefficient names into readable terms
# (the '+' joins the variables that appear together in a single term)
polymTerms <- function(name) {
    names <- c('age', 'cement', 'super_plast', 'slag', 'water', 'flyash', 'coarse_agg', 'fine_agg')
    pattern <- paste(rep("[:digit:]", length(names)), collapse=".")
    val <- str_extract(name, pattern)
    indices <- str_split(val, '\\.')
    
    # For each name in the list
    for(n in seq(1:length(indices))) {
        
        pow <- indices[[n]]
        
        # Iterate and create terms
        result_str <- ''
        for(i in seq(1:length(names))) {
            
            if(pow[i] != '0' ) {
                if(pow[i] == '1') {
                    term <- names[i]
                } else {
                    term <- paste(names[i], "^", pow[i], sep = "")
                }
                result_str <- paste(result_str, term, "+")
            }
        }

        # Remove last '+'
        name[[n]] <- substr(result_str, 2, nchar(result_str)-2)
    
    }
    return(name)
}

Let's look at the highest terms first.

coef.poly.high.5 <- coef.poly %>% sort(decreasing = T) %>% head(5)

coef.poly.high.5 <- data.frame(
    terms = polymTerms(names(coef.poly.high.5)),
    values = coef.poly.high.5,
    row.names = seq(1:length(coef.poly.high.5))
)

coef.poly.high.5 %>% 
    kable() %>% 
    kable_styling()
terms values
cement + super_plast + fine_agg 0.9359346
cement 0.8414100
super_plast + slag + fine_agg 0.7778911
cement + super_plast + slag 0.7573296
age 0.7268689

From the 5 highest terms, it seems that cement, super_plast and fine_agg make a major positive contribution to the strength of the concrete.

Now let's look at the 5 lowest terms.

coef.poly.low.5 <- coef.poly %>% sort(decreasing = T) %>% tail(5)

coef.poly.low.5 <- data.frame(
    terms = polymTerms(names(coef.poly.low.5)),
    values = coef.poly.low.5,
    row.names = seq(1:length(coef.poly.low.5))
)

coef.poly.low.5 %>% 
    kable() %>% 
    kable_styling()
terms values
cement + flyash + coarse_agg -3.112172
cement + slag + flyash -3.600430
flyash + coarse_agg + fine_agg -3.770759
slag + flyash + fine_agg -4.031544
cement + flyash + fine_agg -4.345191

The 5 lowest terms suggest that combinations involving flyash and fine_agg negatively impact a mixture's strength.

Finding a Better Material Composition

Learning from the dataset above, let's try to make our own formula. We'll go about it this way:

  1. Find the compositions with the highest strength
  2. Analyze and try to see patterns
  3. Formulate our own composition

Analyzing the top 10

Let's get the top 10 concrete compositions.

data.top.10.strength <- data.train %>% arrange(-strength) %>%  head(10)

Let's see how the statistics of this subset compare to the rest of the population.

# Make temporary highlight data
highlight <- data.top.10.strength %>% 
    pivot_longer(cols = -c(strength), names_to='parameter')

# Plot the data
data.train %>% 
    pivot_longer(cols = -c(strength), names_to='parameter') %>% 
    ggplot(aes(x = parameter, y = value)) +
    geom_jitter(col = 'blue', alpha = 0.2) + 
    geom_jitter(data = highlight, aes(x = parameter, y = value), col = 'red')  + 
    labs(
        title = 'Comparing top 10 parameters vs. population',
        x = 'Parameters',
        y = 'Values'
    )

Comparing the data visually, we can draw a conclusion for each parameter:

  1. age : tends to be high
  2. cement : average
  3. coarse_agg : average
  4. fine_agg : average
  5. flyash : minimum
  6. slag : quite high
  7. super_plast : average
  8. water : below average

With that, let's try to make our formulas.

Formulating

Formula 1. Average all variables

The most straightforward composition is simply to average each variable over the top 10.

formula.1 <- data.top.10.strength %>% 
    select(-strength) %>% 
    mutate_all(mean) %>% head(1)

formula.1
## # A tibble: 1 x 8
##   cement  slag flyash water super_plast coarse_agg fine_agg   age
##    <dbl> <dbl>  <dbl> <dbl>       <dbl>      <dbl>    <dbl> <dbl>
## 1  0.882  1.17 -0.889 -1.03       0.907     0.0761   -0.477 0.987

Formula 2. Weighted average of all variables

The 2nd formula is an improvement on the 1st. Instead of a simple average, we'll take a weighted average of each predictor, using strength as the weight.
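Here the weighted mean of a predictor \(x\) uses the strengths as the weights \(w_i\):

\(\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}\)

so compositions with higher strength pull the formula more strongly toward their values.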

formula.2 <- data.top.10.strength %>% 
    mutate(
        cement = weighted.mean(cement, strength),
        slag = weighted.mean(slag, strength),
        flyash = weighted.mean(flyash, strength),
        water = weighted.mean(water, strength),
        super_plast = weighted.mean(super_plast, strength),
        coarse_agg = weighted.mean(coarse_agg, strength),
        fine_agg = weighted.mean(fine_agg, strength),
        age = weighted.mean(age, strength)
    ) %>% 
    select(-strength) %>% 
    head(1)

formula.2
## # A tibble: 1 x 8
##   cement  slag flyash water super_plast coarse_agg fine_agg   age
##    <dbl> <dbl>  <dbl> <dbl>       <dbl>      <dbl>    <dbl> <dbl>
## 1  0.888  1.16 -0.889 -1.04       0.913     0.0860   -0.485 0.983

Formula 3. Maximize the current best

For the last formula, let's try to optimize the best formula we currently have. Before that, let's compare the composition of the best formula with the others.

highlight <- data.top.10.strength %>%
    head(1) %>% 
    pivot_longer(cols = -c(strength), names_to='parameter')

data.top.10.strength %>% 
    pivot_longer(cols = -c(strength), names_to='parameter') %>% 
    ggplot(aes(x = parameter, y = value)) +
    geom_jitter(col = 'blue', alpha = 0.5) + 
    geom_jitter(data = highlight, aes(x = parameter, y = value), col = 'red')  + 
    labs(
        title = 'Composition of best vs top 10',
        x = 'Parameters',
        y = 'Values'
    )

From the chart, we can see that for every parameter, the best formula already follows the group's trend. According to our earlier investigation of model.poly, cement contributes strongly to strength, and this formula is behind one other formula in the amount of cement. Let's change the value of cement to match the highest in the group.

formula.3 <- data.top.10.strength %>% 
    select(-strength) %>% 
    head(1) %>% 
    mutate(cement = max(data.top.10.strength$cement))

formula.3
## # A tibble: 1 x 8
##   cement  slag flyash water super_plast coarse_agg fine_agg   age
##    <dbl> <dbl>  <dbl> <dbl>       <dbl>      <dbl>    <dbl> <dbl>
## 1   2.52  1.32 -0.889 -1.67        2.58     -0.361   -0.300  1.38

With that, we have our 3 formulations. Now let's test them out.

Testing the formula

Let's see if our 3 formulations can sit among the top 10. To predict their strength, we'll use model.poly.

new.formulas <- as.tibble(
    rbind(
        formula.1,
        formula.2,
        formula.3
    )
)

new.3 <- predict(model.poly, new.formulas)

new.formulas <- new.formulas %>% 
    mutate(
        strength = new.3
    )

data.w.improved <- data.top.10.strength %>% 
    add_row(new.formulas)

data.w.improved %>% 
    mutate(
        formula = factor(c(seq(1:10), 'F1', 'F2', 'F3'), levels = c(seq(1:10), 'F1', 'F2', 'F3')) # add row names temporarily
    ) %>% 
    ggplot(aes(x = formula, y = unscaleColumn(strength, data.scaled, 'strength'))) +
    geom_col(fill = 'blue')  + 
    labs(
        title = 'Comparing new formulas vs. top 10',
        x = 'Recipe',
        y = 'Strength'
    )

From the results above, we see that formula 3 has the highest strength: an increase of 36.04 over the current best formula. Formula 1 and formula 2 did not perform as expected, with formula 1 even lower than number 10 in the list.

Right now, the formula is only a prediction, but it is a prediction based on data and one worth trying. The recipe for the best formula (F3) and its predicted strength are as follows.

data.no_outliers %>% 
    arrange(-strength) %>% 
    head(1) %>% 
    mutate(
        id = 'F3',
        cement = unscaleColumn(data.w.improved[13,'cement'], data.scaled, 'cement')[[1]], 
        strength = unscaleColumn(data.w.improved[13,'strength'], data.scaled, 'strength')[[1]],
        age = exp(age)
    ) %>% 
    rename(predicted_strength = strength) %>% 
    kable() %>% 
    kable_styling()
id cement slag flyash water super_plast coarse_agg fine_agg age predicted_strength
F3 540 189 0 145.9 22 944.7 755.8 91 99.66673

Conclusion

During this project, we have analyzed different concrete formulas with differing strengths. We built a model to fit the data that we have, and learning from the model, we made a new formulation and predicted its strength.

The model that we used in this project is a polynomial regression. It describes the data better than ordinary linear regression and, as we have discovered, it is still an interpretable model, albeit a more difficult one to interpret.

The methodology shown in this project can be applied to other problems where we would like to optimize an outcome. It will be most applicable to physical systems, where the relations between factors are governed more strictly than in social systems, but that does not mean it would not be applicable to social problems as well. This methodology can be used to answer optimization problems such as optimizing the flavor, texture and feel of a food product, or finding the right balance of chemicals to produce a certain taste or smell. Applied to the right problem, regression models provide an easy way to gain insight and show possibilities.

References

  1. Gromiko, N., CMI, Shepard, K. The History of Concrete. Accessed 17 Aug 2020 from https://www.nachi.org/history-of-concrete.htm