Concrete is one of the most important and most widely used building materials known to humankind. Concrete was first used by the Greeks [1]. However, the Romans were the first to fully exploit its properties. Since then, concrete has remained a central building material, and it continues to evolve, with new materials and formulas continuously being developed. In this project, we'll be optimizing a concrete formula.
In this paper, we'll learn about different concrete mixtures and how aging / curing contributes to their strength. Our goal is to formulate a new mixture that maximizes concrete strength. To do that, we'll analyze and learn from a dataset of concrete formulas and their strengths. The dataset used here is provided as part of the Machine Learning Capstone Project from Algoritma Data Science School. Because all the predictors and the target variable are numeric, we'll be building regression models for our analysis.
Before getting into the analysis, let's import the required libraries.
library(tidyverse) # For data processing and visualization
library(GGally) # Create correlation matrix with ggcorr
library(MLmetrics) # To compare regression results (MAPE, MAE, R2_Score)
library(car) # To calculate VIF
library(lmtest) # for Breusch-Pagan test
library(glue) # Pretty printing
library(kableExtra) # Better looking tables
And then import our data:
data <- read_csv('data/data-train.csv')
data %>%
head() %>%
kable() %>%
kable_styling()
| id | cement | slag | flyash | water | super_plast | coarse_agg | fine_agg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 540.0 | 0.0 | 0 | 162 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| S2 | 540.0 | 0.0 | 0 | 162 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| S3 | 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| S4 | 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| S5 | 198.6 | 132.4 | 0 | 192 | 0.0 | 978.4 | 825.5 | 360 | 44.30 |
| S6 | 380.0 | 95.0 | 0 | 228 | 0.0 | 932.0 | 594.0 | 365 | 43.70 |
Our dataset consists of 825 rows and 10 columns describing concrete mixtures: an id for each sample, seven mixture components (cement, slag, flyash, water, super_plast, coarse_agg, fine_agg), the age of the sample when its strength was measured, and strength itself, which is our target variable.
Before doing analysis, it is good practice to preprocess our data. Preprocessing involves cleaning and transforming data before using it for analysis and modeling.
The first thing that we'll do in data preprocessing is to check for missing values.
colSums(is.na(data))## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
Our data does not have any missing values, which is good. If there were missing values, we would have to decide whether to impute them or exclude the affected observations.
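Neither step is needed here, but as a minimal sketch of the two options (using cement purely as an example column, with the tidyverse functions loaded above):
# Option 1: drop every row that contains a missing value
data.complete <- data %>% drop_na()
# Option 2: impute a numeric column with its median (cement is only an example)
data.imputed <- data %>%
mutate(cement = if_else(is.na(cement), median(cement, na.rm = TRUE), cement))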
Let's now take a look at the distribution of our data by drawing boxplots.
data %>%
select_if(is.numeric) %>%
boxplot(main = 'Boxplot of parameters', xlab = 'Parameters', ylab = 'Values')
Visual inspection helps us see whether the data contains outliers. Outliers affect the model we're building by forcing it to accommodate them. As the name implies, outliers lie outside the bulk of our data, and accommodating them might lessen the resolving power of our model.
From the boxplot, we can see that some variables have obvious outliers, e.g. age, super_plast and slag. Let's identify the observations that contain these outliers and see if we find anything interesting.
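Before going column by column, it helps to recall the rule the boxplot applies: points beyond roughly 1.5 × IQR past the quartiles are flagged. A small sketch of that rule as a reusable helper (is_outlier is our own hypothetical function, not used in the code below; boxplot.stats uses hinges, so counts can differ slightly at the margins):
# Flag values beyond 1.5 x IQR from the quartiles (approximately the whisker rule)
is_outlier <- function(x) {
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}
# Example: count how many age values fall outside the whiskers
sum(is_outlier(data$age))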
# slag
outlier_slag <- min(boxplot(data['slag'], plot = FALSE)$out)
data.outliers <- data %>%
filter(slag >= outlier_slag)
# water
boxplot(data['water'], plot = FALSE)$out## [1] 121.8 121.8 121.8 121.8 121.8 237.0 247.0 246.9 236.7
# water has 2 groups of outliers: 121.8 (low) and >= 237.0 (high)
data.outliers <- data %>%
filter(water <= 121.8 | water >= 237.0)
# super_plast
outlier_super_plast <- min(boxplot(data['super_plast'], plot = FALSE)$out)
data.outliers <- data %>%
filter(super_plast >= outlier_super_plast)
# fine_agg
outlier_fine_agg <- min(boxplot(data['fine_agg'], plot = FALSE)$out)
# fine_agg has 2 groups of outliers: 594.0 (low) and 992.6 (high)
data.outliers <- data %>%
filter(fine_agg <= 594.0 | fine_agg >= 992.6)
# age
outlier_age <- min(boxplot(data['age'], plot = FALSE)$out)
data.outliers <- data %>%
filter(age >= outlier_age)
data.outliers %>%
kable() %>%
kable_styling()
| id | cement | slag | flyash | water | super_plast | coarse_agg | fine_agg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| S3 | 332.5 | 142.5 | 0 | 228 | 0 | 932.0 | 594.0 | 270 | 40.27 |
| S4 | 332.5 | 142.5 | 0 | 228 | 0 | 932.0 | 594.0 | 365 | 41.05 |
| S5 | 198.6 | 132.4 | 0 | 192 | 0 | 978.4 | 825.5 | 360 | 44.30 |
| S6 | 380.0 | 95.0 | 0 | 228 | 0 | 932.0 | 594.0 | 365 | 43.70 |
| S14 | 342.0 | 38.0 | 0 | 228 | 0 | 932.0 | 670.0 | 365 | 56.14 |
| S16 | 475.0 | 0.0 | 0 | 228 | 0 | 932.0 | 594.0 | 180 | 42.62 |
| S17 | 427.5 | 47.5 | 0 | 228 | 0 | 932.0 | 594.0 | 180 | 41.84 |
| S20 | 139.6 | 209.4 | 0 | 192 | 0 | 1047.0 | 806.9 | 180 | 44.21 |
| S21 | 380.0 | 0.0 | 0 | 228 | 0 | 932.0 | 670.0 | 365 | 52.52 |
| S22 | 380.0 | 95.0 | 0 | 228 | 0 | 932.0 | 594.0 | 270 | 41.15 |
| S23 | 342.0 | 38.0 | 0 | 228 | 0 | 932.0 | 670.0 | 180 | 52.12 |
| S25 | 304.0 | 76.0 | 0 | 228 | 0 | 932.0 | 670.0 | 365 | 55.26 |
| S26 | 266.0 | 114.0 | 0 | 228 | 0 | 932.0 | 670.0 | 365 | 52.91 |
| S27 | 475.0 | 0.0 | 0 | 228 | 0 | 932.0 | 594.0 | 270 | 42.13 |
| S28 | 190.0 | 190.0 | 0 | 228 | 0 | 932.0 | 670.0 | 365 | 53.69 |
| S29 | 237.5 | 237.5 | 0 | 228 | 0 | 932.0 | 594.0 | 270 | 38.41 |
| S33 | 380.0 | 0.0 | 0 | 228 | 0 | 932.0 | 670.0 | 180 | 53.10 |
| S39 | 332.5 | 142.5 | 0 | 228 | 0 | 932.0 | 594.0 | 180 | 39.78 |
| S45 | 304.0 | 76.0 | 0 | 228 | 0 | 932.0 | 670.0 | 180 | 50.95 |
| S47 | 304.0 | 76.0 | 0 | 228 | 0 | 932.0 | 670.0 | 270 | 54.38 |
| S48 | 266.0 | 114.0 | 0 | 228 | 0 | 932.0 | 670.0 | 270 | 51.73 |
| S50 | 190.0 | 190.0 | 0 | 228 | 0 | 932.0 | 670.0 | 270 | 50.66 |
| S51 | 266.0 | 114.0 | 0 | 228 | 0 | 932.0 | 670.0 | 180 | 48.70 |
| S52 | 342.0 | 38.0 | 0 | 228 | 0 | 932.0 | 670.0 | 270 | 55.06 |
| S53 | 139.6 | 209.4 | 0 | 192 | 0 | 1047.0 | 806.9 | 360 | 44.70 |
| S491 | 339.0 | 0.0 | 0 | 197 | 0 | 968.0 | 781.0 | 180 | 36.45 |
| S495 | 236.0 | 0.0 | 0 | 193 | 0 | 968.0 | 885.0 | 180 | 24.10 |
| S499 | 277.0 | 0.0 | 0 | 191 | 0 | 968.0 | 856.0 | 180 | 32.33 |
| S500 | 277.0 | 0.0 | 0 | 191 | 0 | 968.0 | 856.0 | 360 | 33.70 |
| S503 | 254.0 | 0.0 | 0 | 198 | 0 | 968.0 | 863.0 | 180 | 27.63 |
| S504 | 254.0 | 0.0 | 0 | 198 | 0 | 968.0 | 863.0 | 365 | 29.79 |
| S505 | 307.0 | 0.0 | 0 | 193 | 0 | 968.0 | 812.0 | 180 | 34.49 |
| S506 | 307.0 | 0.0 | 0 | 193 | 0 | 968.0 | 812.0 | 365 | 36.15 |
| S607 | 540.0 | 0.0 | 0 | 173 | 0 | 1125.0 | 613.0 | 180 | 71.62 |
| S608 | 540.0 | 0.0 | 0 | 173 | 0 | 1125.0 | 613.0 | 270 | 74.17 |
| S614 | 350.0 | 0.0 | 0 | 203 | 0 | 974.0 | 775.0 | 180 | 32.72 |
| S620 | 331.0 | 0.0 | 0 | 192 | 0 | 978.0 | 825.0 | 180 | 39.00 |
| S621 | 331.0 | 0.0 | 0 | 192 | 0 | 978.0 | 825.0 | 360 | 41.24 |
| S638 | 349.0 | 0.0 | 0 | 192 | 0 | 1047.0 | 806.0 | 180 | 41.05 |
| S639 | 349.0 | 0.0 | 0 | 192 | 0 | 1047.0 | 806.0 | 360 | 42.13 |
| S641 | 302.0 | 0.0 | 0 | 203 | 0 | 974.0 | 817.0 | 180 | 26.74 |
| S642 | 525.0 | 0.0 | 0 | 189 | 0 | 1125.0 | 613.0 | 180 | 61.92 |
| S644 | 500.0 | 0.0 | 0 | 200 | 0 | 1125.0 | 613.0 | 180 | 51.04 |
| S655 | 310.0 | 0.0 | 0 | 192 | 0 | 970.0 | 850.0 | 180 | 37.33 |
| S656 | 310.0 | 0.0 | 0 | 192 | 0 | 970.0 | 850.0 | 360 | 38.11 |
| S660 | 525.0 | 0.0 | 0 | 189 | 0 | 1125.0 | 613.0 | 270 | 67.11 |
Let's see if these outliers have strength above average.
quantiles <- quantile(data$strength, c(0.25, 0.5, 0.75), type = 1)
data %>%
ggplot(aes(x = id, y = strength)) +
geom_point() +
geom_point(data = data.outliers, aes(x = id, y = strength), col = 'red') +
labs(
title = 'Distribution of strength : normal vs outlier (red)',
x = 'ID',
y = 'Strength'
)
From the scatter plot above, we can see that the strength of the outlier population is well above the mean of the data. However, they do not form a group that consistently outperforms the others. To classify more objectively whether they are really different, let's do a t-test.
data.no_outliers <- data %>% filter(!id %in% data.outliers$id)
t.test(data.no_outliers$strength, data.outliers$strength)##
## Welch Two Sample t-test
##
## data: data.no_outliers$strength and data.outliers$strength
## t = -5.5883, df = 57.951, p-value = 6.449e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.256896 -6.264277
## sample estimates:
## mean of x mean of y
## 35.24376 45.00435
The t-test results in a p-value < 0.05. This means we can safely say that the outlier and normal populations have different strengths. They may merit a separate investigation, but for this project, let's remove the outliers. We have created data.no_outliers for our analysis.
This is what the data looks like after removing the outliers.
data.no_outliers %>%
select_if(is.numeric) %>%
boxplot(main = 'Boxplot of parameters, without outliers', xlab = 'Parameters', ylab = 'Values')
As you can see, there are still points flagged as outliers. These are observations that were previously inside the whiskers but are now classified as outliers because the center and spread of the data shifted after the removal. They are, however, not far from the whiskers of the boxplot.
With the outliers gone, let's see the distribution of our data. To do this, we plot a density graph.
data.no_outliers %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'param') %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~param, scales = 'free_x') +
labs(
title = 'Density graph of parameters',
x = 'Parameters',
y = 'Frequency'
)
From the graph, we can see that cement, coarse_agg, fine_agg, flyash and slag are relatively uniform. This might mean that these variables, the basic materials of concrete, are combined freely rather than dosed to a specific amount. In most recipes, water tends to fall between 150 and 200. super_plast is often not used at all, or used in quantities around 10. The recipes are mostly aged around 7, 30, 60, or 90-100.
Next, let's see how strength trends against each predictor.
data.no_outliers %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'param') %>%
ggplot(aes(x = value, y = strength)) +
geom_point() +
geom_smooth() +
facet_wrap(~param, scales = 'free_x') +
labs(
title = 'Trends in each parameter',
x = 'Parameters',
y = 'Values'
)
From the plots, we can see that cement and super_plast have a strong positive correlation with strength. coarse_agg, fine_agg, flyash and slag have a slightly negative correlation. age and water do not show a linear relationship: age follows a concave curve, while water shows a sinusoidal pattern. Regression models work best with linear relationships, so we can try to transform the non-linear variables to make them more linear.
Let's try to make our data more linear by taking the log of age.
data.no_outliers %>%
select_if(is.numeric) %>%
select(age, strength) %>%
ggplot(aes(x = log(age), y = strength)) +
geom_point() +
geom_smooth() +
labs(
title = 'log(age) vs. strength',
x = 'log(age)',
y = 'strength'
)
We can see that the relationship is now more linear. We'll persist this change to our dataset.
data.no_outliers <- data.no_outliers %>%
mutate(age = log(age))
Now let's try to linearize water by also taking its log.
data.no_outliers %>%
ggplot(aes(x = log(water), y = strength)) +
geom_point() +
geom_smooth() +
labs(
title = 'log(water) vs. strength',
x = 'log(water)',
y = 'strength'
)
The shape is still sinusoidal; our transformation does not help here. For this kind of relationship, plain linear regression does not fit the data well, and we have to consider other fitting approaches.
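One illustrative option, foreshadowing the polynomial model we build later, is to let the fit include power terms. A minimal sketch for water alone (model.water is only for illustration and is not reused):
# Allow curvature in water with a cubic polynomial term
model.water <- lm(strength ~ poly(water, 3), data = data.no_outliers)
summary(model.water)$r.squared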
One of the most important factors in linear regression is the correlation between predictors. Predictors with high correlation affect each other, which skews the results of the regression model. Correlation can be positive or negative. To check correlations, we can use the ggcorr function from the GGally package.
ggcorr(data.no_outliers, label = TRUE) +
labs(title = 'Correlation matrix')
The higher the absolute value of the correlation (closer to 1 or -1), the more related the variables are. From the correlation matrix above, we can see that the predictors have relatively low pairwise correlations (absolute values below 0.5), with the notable exception of super_plast and water, which have the strongest correlation in the data (-0.6, a negative one).
The predictors with the highest positive correlations to strength are age, cement and super_plast, meaning these variables contribute quite strongly in the positive direction. water, on the other hand, has the strongest negative correlation, meaning it negatively affects strength.
Before we use the data, let's look at the scales of our variables.
data.no_outliers %>%
select_if(is.numeric) %>%
pivot_longer(cols = -strength, names_to = 'param') %>%
group_by(param) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = param, y = value)) +
geom_col(fill = 'blue') +
labs(
title = 'Range of data, unscaled',
x = 'Parameters',
y = 'Values'
)
The chart above shows the range of values for our variables. We can see that they are not balanced: coarse_agg reaches a maximum of 1145, while age (now on a log scale) only reaches 4.7874917. This difference in scale skews the coefficients: variables with large values end up with small coefficients, while variables with small values may end up with large coefficients. Since we'll later compare these coefficients, such a skewed comparison would not lead to the right conclusions.
To better reflect the influence of each variable, we have to scale the values. By scaling the data, we put each variable on an equal footing, so we can interpret the coefficients of our model correctly.
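Under the hood, scale() applies z-score standardization: subtract the column mean and divide by the column standard deviation. A quick check on a single column (a sketch, with cement chosen arbitrarily):
cement.z <- (data.no_outliers$cement - mean(data.no_outliers$cement)) / sd(data.no_outliers$cement)
all.equal(cement.z, as.numeric(scale(data.no_outliers$cement)))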
At this step, we'll also remove the ID column, as we'll not be needing it anymore.
# Remove ID and scale
data.scaled <- data.no_outliers %>%
select(-id) %>% scale()
data.scaled %>%
as.tibble() %>%
pivot_longer(cols = -strength, names_to = 'param') %>%
group_by(param) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = param, y = value)) +
geom_col(fill = 'blue') +
labs(
title = 'Range of data, scaled',
x = 'Parameters',
y = 'Values'
)
As we can see, the values of our parameters are more equal after scaling.
We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.
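The split below relies on sample(), so it changes on every run; the original analysis does not fix a seed. If you want a reproducible split, a one-line sketch (the value 100 is arbitrary):
set.seed(100) # any fixed value makes the random split below reproducible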
idx <- sample(nrow(data.scaled), nrow(data.scaled) * 0.8)
data.train <- as.tibble(data.scaled[idx, ])
data.validation <- as.tibble(data.scaled[-idx, ])
Now we are done with preprocessing and preparing our data. Let's start modeling.
With numeric predictors and a numeric target, it is natural to start with a linear regression model. Let's build one and see how it fits our data.
Let's begin with a minimal model: a model with no predictors, only an intercept. In this case, the model always outputs the mean of the target, because the mean is the constant that minimizes the squared error.
model.none <- lm(strength ~ 1, data.train)
summary(model.none)##
## Call:
## lm(formula = strength ~ 1, data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9790 -0.7447 -0.1003 0.6289 2.7889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02401 0.04042 0.594 0.553
##
## Residual standard error: 1.009 on 622 degrees of freedom
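As a quick sanity check (a sketch, not part of the original analysis), the single coefficient of this empty model should equal the mean of strength in the training data:
# The intercept of the empty model is simply the training mean of the target
coef(model.none)[1]
mean(data.train$strength)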
Now we'll build the maximal model. The maximal model uses all parameters in our data.
model.all <- lm(strength ~ ., data.frame(data.train))
summary(model.all)##
## Call:
## lm(formula = strength ~ ., data = data.frame(data.train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08024 -0.25520 -0.02474 0.23041 1.69142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.005052 0.015922 0.317 0.751121
## cement 0.881053 0.042879 20.548 < 2e-16 ***
## slag 0.617157 0.042535 14.509 < 2e-16 ***
## flyash 0.385277 0.039047 9.867 < 2e-16 ***
## water -0.130143 0.038411 -3.388 0.000749 ***
## super_plast 0.007873 0.026061 0.302 0.762693
## coarse_agg 0.162681 0.035030 4.644 4.18e-06 ***
## fine_agg 0.195524 0.040543 4.823 1.79e-06 ***
## age 0.611239 0.016073 38.030 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.397 on 614 degrees of freedom
## Multiple R-squared: 0.8472, Adjusted R-squared: 0.8452
## F-statistic: 425.4 on 8 and 614 DF, p-value: < 2.2e-16
Logically, since it uses all the variables in our data, we might expect the maximal model to be the best. However, this is not necessarily the case, so we'll check using stepwise regression.
We've built model.none, which uses no predictors, and model.all, which uses all variables. Stepwise regression is a method for selecting a model using the Akaike Information Criterion (AIC) as its metric: it searches for the model with the lowest AIC, i.e. the least information loss. Because it uses a greedy algorithm that can get stuck in a local minimum, it does not guarantee the best possible model. Let's use it to pick out the important variables.
model.step <- step(model.all, scope = formula(model.none), direction = 'backward', trace = 0)
summary(model.step)##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + coarse_agg +
## fine_agg + age, data = data.frame(data.train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0791 -0.2541 -0.0258 0.2328 1.6867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.005131 0.015908 0.323 0.747
## cement 0.881286 0.042840 20.572 < 2e-16 ***
## slag 0.617556 0.042483 14.537 < 2e-16 ***
## flyash 0.386772 0.038703 9.993 < 2e-16 ***
## water -0.136286 0.032561 -4.186 3.26e-05 ***
## coarse_agg 0.159365 0.033241 4.794 2.05e-06 ***
## fine_agg 0.194036 0.040213 4.825 1.77e-06 ***
## age 0.611798 0.015954 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3967 on 615 degrees of freedom
## Multiple R-squared: 0.8471, Adjusted R-squared: 0.8454
## F-statistic: 486.9 on 7 and 615 DF, p-value: < 2.2e-16
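Since stepwise selection minimizes AIC, we can also compare the three candidate models directly on that criterion; a one-line sketch:
# Lower AIC means less information loss
AIC(model.none, model.all, model.step)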
The model.step uses 7 variables only, omitting super_plast. Let's see how this model performs during evaluation.
The models were trained on scaled data, so any prediction they make has to be scaled back before we can interpret it. The function unscaleColumn below scales a column back using the scaling attributes stored in the originally scaled object. To make comparisons easier, we'll also keep an unscaled version of data.validation's strength.
unscaleColumn <- function(x, scaled_data, column) {
unscaled <- x * attr(scaled_data, 'scaled:scale')[column] + attr(scaled_data, 'scaled:center')[column]
return(unscaled)
}
# unscaled `strength` from `data.validation`
data.validation.strength.unscaled <- unscaleColumn(data.validation$strength, data.scaled, 'strength')
Now let's start by evaluating model.all.
# Evaluating
result.all <- predict(model.all, data.validation)
result.all <- unscaleColumn(result.all, data.scaled, 'strength')
result.all.mae <- MAE(result.all, data.validation.strength.unscaled)
result.all.mape <- MAPE(result.all, data.validation.strength.unscaled)
result.all.r2 <- R2_Score(result.all, data.validation.strength.unscaled)
glue('Result of `model.all`')## Result of `model.all`
glue('MAE : {round(result.all.mae, 4)}')## MAE : 5.7245
glue('MAPE : {round(result.all.mape, 4)}')## MAPE : 0.2101
glue('R-squared : {round(result.all.r2, 4)}')## R-squared : 0.7936
Let's compare it to model.step
result.step <- predict(model.step, data.validation)
result.step <- unscaleColumn(result.step, data.scaled, 'strength')
result.step.mae <- MAE(result.step, data.validation.strength.unscaled)
result.step.mape <- MAPE(result.step, data.validation.strength.unscaled)
result.step.r2 <- R2_Score(result.step, data.validation.strength.unscaled)
glue('Result of `model.step`')## Result of `model.step`
glue('MAE : {round(result.step.mae, 4)}')## MAE : 5.7362
glue('MAPE : {round(result.step.mape, 4)}')## MAPE : 0.2107
glue('R-squared : {round(result.step.r2, 4)}')## R-squared : 0.7926
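To see the two sets of metrics side by side, a small sketch that reuses the values computed above:
data.frame(
model = c('model.all', 'model.step'),
MAE = c(result.all.mae, result.step.mae),
MAPE = c(result.all.mape, result.step.mape),
R2 = c(result.all.r2, result.step.r2)
)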
Comparing the results, we can see that model.all performs slightly better than model.step: it has lower MAE and MAPE and a higher R-squared. Moving forward, we'll only be using model.all.
Linear models rest on four assumptions. Before we carry on, we have to check whether these assumptions hold for our model.
The linearity assumption states that there is a linear relationship between the predictors and the target variable, so that our model can describe it correctly. A visual way to evaluate this is to plot the residuals against the fitted values.
plot(model.all$fitted.values, model.all$residuals, ylab = 'Residuals', xlab = 'Prediction', main = 'Residuals vs. predictions of `model.all`')
abline(h = 0, col = 'red')
Visually, we can see that the residuals are not randomly and evenly distributed around zero. They form clusters below a predicted value of about 40 and only spread out evenly after that.
Our model was built on the assumption that the predictors are linearly independent, i.e. that there is no multicollinearity. We can check this by looking at the Variance Inflation Factor (VIF) using the vif function from the car package.
vif(model.all)## cement slag flyash water super_plast coarse_agg
## 7.648712 7.180219 5.970600 5.747052 2.832321 4.925328
## fine_agg age
## 6.682875 1.031804
A common rule of thumb treats predictors with a VIF below 4 as sufficiently independent. From the test, we can see that most of our variables exceed this threshold and are multicollinear: they are related and tend to move together, which might skew the interpretation of our model.
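To pull out just the predictors that cross that threshold, a quick sketch:
# Predictors whose VIF exceeds the rule-of-thumb cutoff of 4
vif.values <- vif(model.all)
vif.values[vif.values > 4]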
The homoscedasticity assumption states that the variance of the error term is constant across all values of the predictors. We can check this with the Breusch-Pagan test, whose hypotheses are:
- H0: the residuals have constant variance (homoscedastic)
- H1: the variance of the residuals changes with the inputs (heteroscedastic)
bptest(model.all)##
## studentized Breusch-Pagan test
##
## data: model.all
## BP = 55.2, df = 8, p-value = 4.038e-09
The p-value of the Breusch-Pagan test is less than 0.05. Therefore, we reject the null hypothesis and conclude that the residuals of our model are heteroscedastic.
The normality assumption states that the residuals of the model are normally distributed, with hypotheses:
- H0: the residuals are normally distributed
- H1: the residuals are not normally distributed
Because our dataset is relatively small (< 5,000 observations), we can test the normality of the residuals using the Shapiro-Wilk test.
shapiro.test(model.all$residuals)##
## Shapiro-Wilk normality test
##
## data: model.all$residuals
## W = 0.99333, p-value = 0.007244
The p-value of the Shapiro-Wilk test is less than 0.05. Therefore, we reject the null hypothesis and conclude that the residuals of the model are not normally distributed.
Even though our linear model does not satisfy all of the assumptions, we can still try to get some insights. The model has a mean absolute percentage error of 21.01%, which is fairly decent.
Let's look at the coefficients of the model.
summary(model.all)##
## Call:
## lm(formula = strength ~ ., data = data.frame(data.train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08024 -0.25520 -0.02474 0.23041 1.69142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.005052 0.015922 0.317 0.751121
## cement 0.881053 0.042879 20.548 < 2e-16 ***
## slag 0.617157 0.042535 14.509 < 2e-16 ***
## flyash 0.385277 0.039047 9.867 < 2e-16 ***
## water -0.130143 0.038411 -3.388 0.000749 ***
## super_plast 0.007873 0.026061 0.302 0.762693
## coarse_agg 0.162681 0.035030 4.644 4.18e-06 ***
## fine_agg 0.195524 0.040543 4.823 1.79e-06 ***
## age 0.611239 0.016073 38.030 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.397 on 614 degrees of freedom
## Multiple R-squared: 0.8472, Adjusted R-squared: 0.8452
## F-statistic: 425.4 on 8 and 614 DF, p-value: < 2.2e-16
Remember that the model was trained on the scaled data, with age log-transformed beforehand. Looking at the coefficients, all variables have positive coefficients except water. This means that, according to our model, all variables except water contribute positively to strength.
Let's extract the parameters and plot them to make it easier to look at.
coef.all <- model.all$coefficients[-1]
barplot(coef.all, xlab = 'Parameters', main = 'Values of `model.all` coefficients', ylab = 'Value')
From the chart, we can see that cement, slag and log(age) are strong factors influencing concrete strength.
Although we have tried to transform the dataset, we did not manage to get a satisfactory model; the underlying relationships may simply not be linear. So let's try another approach, polynomial regression, to improve our fit.
The polynomial regression model is an extension of the linear regression model. The difference is that its terms can be made up of combinations of variables raised to powers, up to a chosen degree: whereas a plain linear model contains only linear terms, a polynomial model can also contain power terms. Mathematically, the single-variable polynomial regression model is described as
\(y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \dots + \beta_nx^n + \epsilon\)
Just like linear regression, polynomial regression has a version with multiple variables: multiple polynomial regression. In R, such a model can be built using the polym function.
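To get a feel for what polym generates, a toy sketch with two made-up vectors x and z (not columns of our data); each column it returns is a product term, and its name encodes the degree of each input in that term:
x <- 1:5
z <- 6:10
# Column names such as "1.0", "0.1" or "1.1" give the degree of x and of z in each term
colnames(polym(x, z, degree = 2, raw = TRUE))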
Here we define a polynomial model using all the variables in our dataset. We set the maximum degree of each term to be 3.
model.poly <- lm(strength ~ polym(age, cement, super_plast, slag, water, flyash, coarse_agg, fine_agg, degree = 3, raw = T), data.train)
Now let's evaluate model.poly.
result.poly <- predict(model.poly, data.validation)
result.poly <- unscaleColumn(result.poly, data.scaled, 'strength')
result.poly.mae <- MAE(result.poly, data.validation.strength.unscaled)
result.poly.mape <- MAPE(result.poly, data.validation.strength.unscaled)
result.poly.r2 <- R2_Score(result.poly, data.validation.strength.unscaled)
glue('Result of `model.poly`')## Result of `model.poly`
glue('MAPE : {round(result.poly.mape, 4)}')## MAPE : 0.1267
glue('MAE : {round(result.poly.mae, 4)}')## MAE : 3.7229
glue('R-squared : {round(result.poly.r2, 4)}')## R-squared : 0.8963
The model produces a much better result than model.all. This shows that our model fits the data better.
The polynomial model is still linear in its coefficients, even though its terms are not linear in the original variables. So the same linear regression assumptions apply, and we can use the same checks to see whether they hold.
Let's see how the residuals distribute along the model.
plot(model.poly$fitted.values, model.poly$residuals, xlab = 'Predicted', ylab = 'Residuals', main = 'Distribution of residuals of `model.poly`')
abline(h = 0, col = 'red')
Visually, the residuals of model.poly are distributed more evenly and randomly around zero.
We cannot meaningfully evaluate multicollinearity for this model, because the original variables no longer appear independently: they are combined into polynomial and interaction terms that are correlated by construction. We'll see this as we dive deeper into the model in the next subsection.
bptest(model.poly)##
## studentized Breusch-Pagan test
##
## data: model.poly
## BP = 235.21, df = 164, p-value = 0.0002242
Breusch-Pagan Test of our model shows that the p-value is below 0.05 (p < 0.05), which means our model is heteroscedastic. This means the variance of the residual changes with changes in input.
shapiro.test(model.poly$residuals)##
## Shapiro-Wilk normality test
##
## data: model.poly$residuals
## W = 0.98837, p-value = 7.423e-05
The p-value of the Shapiro-Wilk test is less than 0.05 (p < 0.05). Therefore, the residuals of the model are not normally distributed.
Although model.poly does not satisfy all the assumptions of a linear model, its metrics show a better fit, which means it describes the data better. However, interpreting the coefficients of model.poly is more difficult because of the large number of parameter combinations: the model consists of 165 terms, made up of combinations of all the factors. An example of a term in the model is as follows.
model.poly$coefficients[2]## polym(age, cement, super_plast, slag, water, flyash, coarse_agg, fine_agg, degree = 3, raw = T)1.0.0.0.0.0.0.0
## 0.7268689
The numeric suffix encodes the degree of each variable, in the order they were given to polym(...), within that term, and the value shown is the term's coefficient. To see the coefficients more easily, let's plot them.
coef.poly <- model.poly$coefficients[-1]
plot(coef.poly, type = 'l', xlab = 'Term index', ylab = 'Coefficient', main = 'Values of `model.poly` coefficients')
From the plot, we can see many high and low coefficients. Let's pick the 5 highest and the 5 lowest coefficients and try to learn from them. The function below converts the representation of each term to make it easier to read.
# Convert coef polyms to name
polymTerms <- function(name) {
names <- c('age', 'cement', 'super_plast', 'slag', 'water', 'flyash', 'coarse_agg', 'fine_agg')
pattern <- paste(rep("[:digit:]", length(names)), collapse=".")
val <- str_extract(name, pattern)
indices <- str_split(val, '\\.')
# For each name in the list
for(n in seq(1:length(indices))) {
pow <- indices[[n]]
# Iterate and create terms
result_str <- ''
for(i in seq(1:length(names))) {
if(pow[i] != '0' ) {
if(pow[i] == '1') {
term <- names[i]
} else {
term <- paste(names[i], "^", pow[i], sep = "")
}
result_str <- paste(result_str, term, "+")
}
}
# Remove last '+'
name[[n]] <- substr(result_str, 2, nchar(result_str)-2)
}
return(name)
}
Let's look at the 5 highest terms first.
coef.poly.high.5 <- coef.poly %>% sort(decreasing = T) %>% head(5)
coef.poly.high.5 <- data.frame(
terms = polymTerms(names(coef.poly.high.5)),
values = coef.poly.high.5,
row.names = seq(1:length(coef.poly.high.5))
)
coef.poly.high.5 %>%
kable() %>%
kable_styling()
| terms | values |
|---|---|
| cement + super_plast + fine_agg | 0.9359346 |
| cement | 0.8414100 |
| super_plast + slag + fine_agg | 0.7778911 |
| cement + super_plast + slag | 0.7573296 |
| age | 0.7268689 |
From the 5 highest terms, it seems that cement, super_plast and fine_agg (each term is a product of the listed variables) make a major positive contribution to the strength of the concrete.
Now let's look at the 5 lowest terms.
coef.poly.low.5 <- coef.poly %>% sort(decreasing = T) %>% tail(5)
coef.poly.low.5 <- data.frame(
terms = polymTerms(names(coef.poly.low.5)),
values = coef.poly.low.5,
row.names = seq(1:length(coef.poly.low.5))
)
coef.poly.low.5 %>%
kable() %>%
kable_styling()
| terms | values |
|---|---|
| cement + flyash + coarse_agg | -3.112172 |
| cement + slag + flyash | -3.600430 |
| flyash + coarse_agg + fine_agg | -3.770759 |
| slag + flyash + fine_agg | -4.031544 |
| cement + flyash + fine_agg | -4.345191 |
The 5 lowest terms suggest that combinations involving flyash, often together with fine_agg, negatively impact a mixture's strength.
Learning from the dataset above, let's try to make our own formula. We'll go about it this way: take the strongest mixtures in the training data, study how their compositions differ from the rest, derive candidate formulas from them, and then predict the strength of each candidate with model.poly.
Let's get the top 10 concrete compositions.
data.top.10.strength <- data.train %>% arrange(-strength) %>% head(10)
Let's see how the stats of this subset compare to the rest of the population.
# Make temporary highlight data
highlight <- data.top.10.strength %>%
pivot_longer(cols = -c(strength), names_to='parameter')
# Plot the data
data.train %>%
pivot_longer(cols = -c(strength), names_to='parameter') %>%
ggplot(aes(x = parameter, y = value)) +
geom_jitter(col = 'blue', alpha = 0.2) +
geom_jitter(data = highlight, aes(x = parameter, y = value), col = 'red') +
labs(
title = 'Comparing top 10 parameters vs. population',
x = 'Parameters',
y = 'Values'
)
Comparing the data visually, we can draw a conclusion for each parameter:
- age: tends to be high
- cement: average
- coarse_agg: average
- fine_agg: average
- flyash: minimum
- slag: quite high
- super_plast: average
- water: below average

With that, let's try to make a formula.
The most straightforward composition is to average each variable over the top 10.
formula.1 <- data.top.10.strength %>%
select(-strength) %>%
mutate_all(mean) %>% head(1)
formula.1## # A tibble: 1 x 8
## cement slag flyash water super_plast coarse_agg fine_agg age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.882 1.17 -0.889 -1.03 0.907 0.0761 -0.477 0.987
The 2nd formula will be an improvement on the 1st. Instead of a plain average, we'll use a weighted mean, weighting each observation by its strength, to get the value of each predictor.
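As a reminder of what the weighting does, weighted.mean(x, w) computes sum(x * w) / sum(w); a tiny sketch with arbitrary numbers:
weighted.mean(c(10, 20), w = c(1, 3)) # (10*1 + 20*3) / 4 = 17.5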
formula.2 <- data.top.10.strength %>%
mutate(
cement = weighted.mean(cement, strength),
slag = weighted.mean(slag, strength),
flyash = weighted.mean(flyash, strength),
water = weighted.mean(water, strength),
super_plast = weighted.mean(super_plast, strength),
coarse_agg = weighted.mean(coarse_agg, strength),
fine_agg = weighted.mean(fine_agg, strength),
age = weighted.mean(age, strength)
) %>%
select(-strength) %>%
head(1)
formula.2## # A tibble: 1 x 8
## cement slag flyash water super_plast coarse_agg fine_agg age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.888 1.16 -0.889 -1.04 0.913 0.0860 -0.485 0.983
For the last formula, let's try to optimize the best formula we currently have. Before that, let's compare the composition of the best formula with the other top-10 formulas.
highlight <- data.top.10.strength %>%
head(1) %>%
pivot_longer(cols = -c(strength), names_to='parameter')
data.top.10.strength %>%
pivot_longer(cols = -c(strength), names_to='parameter') %>%
ggplot(aes(x = parameter, y = value)) +
geom_jitter(col = 'blue', alpha = 0.5) +
geom_jitter(data = highlight, aes(x = parameter, y = value), col = 'red') +
labs(
title = 'Composition of best vs top 10',
x = 'Parameters',
y = 'Values'
)
From the chart, we can see that for every parameter, the best formula already follows the group's trend. According to our earlier investigation of model.poly, cement has a strong positive relationship with strength, and this formula is behind one other formula in the amount of cement. Let's try raising its cement value to the highest in the group.
formula.3 <- data.top.10.strength %>%
select(-strength) %>%
head(1) %>%
mutate(cement = max(data.top.10.strength$cement))
formula.3## # A tibble: 1 x 8
## cement slag flyash water super_plast coarse_agg fine_agg age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.52 1.32 -0.889 -1.67 2.58 -0.361 -0.300 1.38
With that, we have our 3 formulations. Now let's test them out.
Let's see whether our 3 formulations can sit among the top 10. To predict their strength, we'll use model.poly.
new.formulas <- as.tibble(
rbind(
formula.1,
formula.2,
formula.3
)
)
new.3 <- predict(model.poly, new.formulas)
new.formulas <- new.formulas %>%
mutate(
strength = new.3
)
data.w.improved <- data.top.10.strength %>%
add_row(new.formulas)
data.w.improved %>%
mutate(
formula = factor(c(seq(1:10), 'F1', 'F2', 'F3'), levels = c(seq(1:10), 'F1', 'F2', 'F3')) # add row names temporarily
) %>%
ggplot(aes(x = formula, y = unscaleColumn(strength, data.scaled, 'strength'))) +
geom_col(fill = 'blue') +
labs(
title = 'Comparing new formulas vs. top 10',
x = 'Recipe',
y = 'Strength'
)
From the results above, we see that formula 3 has the highest strength: it gives an increase of 36.04 over the current formula with the best strength. Formula 1 and formula 2 did not perform as expected, with formula 1 even lower than number 10 on the list.
Right now, this formula is only a prediction, but it is a prediction based on data and one worth trying. The recipe for the best formula (F3), with its predicted strength, is as follows.
data.no_outliers %>%
arrange(-strength) %>%
head(1) %>%
mutate(
id = 'F3',
cement = unscaleColumn(data.w.improved[13,'cement'], data.scaled, 'cement')[[1]],
strength = unscaleColumn(data.w.improved[13,'strength'], data.scaled, 'strength')[[1]],
age = exp(age)
) %>%
rename(predicted_strength = strength) %>%
kable() %>%
kable_styling()
| id | cement | slag | flyash | water | super_plast | coarse_agg | fine_agg | age | predicted_strength |
|---|---|---|---|---|---|---|---|---|---|
| F3 | 540 | 189 | 0 | 145.9 | 22 | 944.7 | 755.8 | 91 | 99.66673 |
In this project, we have analyzed different concrete formulas with differing strengths. We built a model that fits the data we have, and, learning from that model, we made a new formulation and predicted its strength.
The model we settled on is a polynomial regression. It describes the data better than ordinary linear regression and, as we have seen, it is still an interpretable model, albeit with more effort.
The methodology shown in this project can be applied to other problems where we would like to optimize an outcome. It is most applicable to physical systems, where the relationships between factors are governed more strictly than in social systems, but that does not mean it cannot be applied to social problems as well. It can be used to answer optimization questions such as tuning the flavor, texture and feel of a food product, or finding the right balance of chemicals to produce a certain taste or smell. Applied to the right problem, regression models provide an easy way to gain insight and explore possibilities.