Background
This is a capstone project to predict the compression strength based on the mixture properties by selecting several machine learning model: Ordinary Least Square, Neural Network, Decision Tree, and Random Forest.
This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management Chung-Hua University, Hsin Chu, Taiwan in 2007. It is related to his research in 1998 about how to predict compression strength in a concrete structure.
Concrete compression strength is determined not just only by water-cement mixture but also other ingredients, and how we treat the mixture. Using this dataset, we are going to find “the perfect recipe” to predict the concrete’s compression strength, and how to explain the relation between the ingredients concentration and the age of testing to the compression strength.
https://rpubs.com/AlgoritmaAcademy/ml-capstone
Target of Machine Learning Model
The target of “Mean Absolute Error” in validation dataset should reach level < 4.
The target of “R-squared” in validation dataset should reach level > 90%.
Importing Dataset
First, we will read a concrete file dataset.
It consists of 825 data observation and 10 variable columns.
Context
The dataset is created for prediction of the compression strength based on the mixture properties.
Content
The observation dataset consists of the following variables:
id: Id of each cement mixture,
cement: The amount of cement (Kg) in a m3 mixture,
slag: The amount of blast furnace slag (Kg) in a m3 mixture,
flyash: The amount of fly ash (Kg) in a m3 mixture,
water: The amount of water (Kg) in a m3 mixture,
super_plast: The amount of Superplasticizer (Kg) in a m3 mixture,
coarse_agg: The amount of Coarse Aggreagate (Kg) in a m3 mixture,
fine_agg: The amount of Fine Aggreagate (Kg) in a m3 mixture,
age: the number of resting days before the compressive strength measurement,
strength: Concrete compressive strength measurement in MPa unit.
Data Wrangling
Looking at this data, we are interested in creating a model that is able to predict the strength (MPa unit) a concrete has based on other variables such as cement (Kg/m3), slag (Kg/m3) , flyash (Kg/m3), water (Kg/m3), super_plast (Kg/m3), coarse_agg (Kg/m3), fine_agg (Kg/m3), and age (days).
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
There is no NA in concrete dataset.
We will subset id from dataset and use beton data frame for further data preprocessing.
Data Preprocessing and Exploratory Data Analysis
We will first check data correlation and data distribution respectively.
Checking data correlation
We find correlation of target variable strength as follows:
- Variable
strength has positively correlated with age (0.3).
- Variable
strength and cement has strong correlation (0.5).
- Variable
super_plast has a linear correlation with strength (0.4).

After getting the data prepared, the best way to see the relationship between variables would be to plot them and check out how our plots looks. To do this we will use the plot() command which lists the x-axis variable first and the y-axis variable second.
We will run a plot for every predictor variable with our target variable, strength.
There must be a linear relationship between the target variable and the predictor variables.
Scatterplots below show that most of the curves have a curvilinear, not a linear relationships.








Checking data distribution
By using the summary() function below, We find data distribution varied among predictor variables.
All the predictors, except age, are measured in Kg/m3, accordingly we choose not to scale the predictors. Rescaling a predictor in a regression has absolutely no effect on the magnitude of the relation being studied. The slope itself will not change its steepness, nor will the p-values or variance explained be changed.
## cement slag flyash water
## Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
## 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
## Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
## Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
## Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
## super_plast coarse_agg fine_agg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
## Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
## Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
## 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## strength
## Min. : 2.33
## 1st Qu.:23.64
## Median :34.57
## Mean :35.79
## 3rd Qu.:45.94
## Max. :82.60
By using the plot_num() function below, we find most of the numerical variables not distributed normally (plotted as normal bell curve).

As shown by the boxplot() function below, we can figure out outliers from several predictor variables, i.e.: slag, water, super_plas, fine_agg, and age.

Model Ordinary Least Square
Multiple linear regression is a tool for looking at the relationship among more than three variables (quantitative or categorical). The regression model uses two or more variables to predict one dependent variable of interest.
We will create our regression model, the Ordinary Least Squares, to get some numbers about the relationship between our predictors. Ordinary Least Squares, or Linear Least Squares, estimates the parameters in a regression model by minimizing the sum of squared residuals. This method draws a line through the data points that minimizes the sum of squared differences between the observed values and the corresponding fitted values.
We are looking for numbers that will tell us about our correlation and error of our model. Our correlation is the big one because it tells us how well our predictor variable predicts our target variable.
Model Fitting and Evaluation
Cross validation
We will create the index that we shall use to split the data into a training and validation dataset. We will use 80% of the data on training and the other 20% for validation dataset.
Building model
We will build a multiple linear regression model using all predictor variables by using the lm() function to predict strength as target variable.
##
## Call:
## lm(formula = strength ~ ., data = beton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.862 -6.322 0.598 6.610 34.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.840955 32.296665 -1.141 0.25441
## cement 0.122580 0.010443 11.738 < 2e-16 ***
## slag 0.105616 0.012206 8.653 < 2e-16 ***
## flyash 0.092117 0.015497 5.944 4.53e-09 ***
## water -0.133628 0.049834 -2.681 0.00752 **
## super_plast 0.286586 0.115709 2.477 0.01351 *
## coarse_agg 0.025708 0.011312 2.273 0.02338 *
## fine_agg 0.022144 0.012975 1.707 0.08835 .
## age 0.129584 0.007224 17.937 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 651 degrees of freedom
## Multiple R-squared: 0.6288, Adjusted R-squared: 0.6243
## F-statistic: 137.9 on 8 and 651 DF, p-value: < 2.2e-16
Feature selection
To optimize the model, we’ll do feature selection by using the step() backward method.
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = beton.train)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -36.84096 0.12258 0.10562 0.09212 -0.13363 0.28659
## coarse_agg fine_agg age
## 0.02571 0.02214 0.12958
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = beton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.862 -6.322 0.598 6.610 34.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.840955 32.296665 -1.141 0.25441
## cement 0.122580 0.010443 11.738 < 2e-16 ***
## slag 0.105616 0.012206 8.653 < 2e-16 ***
## flyash 0.092117 0.015497 5.944 4.53e-09 ***
## water -0.133628 0.049834 -2.681 0.00752 **
## super_plast 0.286586 0.115709 2.477 0.01351 *
## coarse_agg 0.025708 0.011312 2.273 0.02338 *
## fine_agg 0.022144 0.012975 1.707 0.08835 .
## age 0.129584 0.007224 17.937 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 651 degrees of freedom
## Multiple R-squared: 0.6288, Adjusted R-squared: 0.6243
## F-statistic: 137.9 on 8 and 651 DF, p-value: < 2.2e-16
There are 4 (four) most influencing predictors of the model, i.e.: cement, slag, flyash, age.
As we can see our Multiple R-squared value is 0.6288 which is not good enough. It means our model is not able to predict strength well. Also, this value means that predictors explain only 62.88% of the variability in strength.
Our Adjusted R-squared 0.6243 is a little lower obviously as we are using 8 (eight) variables to predict. This could possibly happen if there is one or more predictors that aren’t very good and are hurting our model.
Our p-value is very small so we can reject our null hypothesis and say that there is a relationship between at least one of the predictor variables and the target variable.
Coefficient confidence interval
We will show confidence interval by using the confint() function.
## 2.5 % 97.5 %
## (Intercept) -1.002592e+02 26.57725054
## cement 1.020746e-01 0.14308617
## slag 8.164769e-02 0.12958503
## flyash 6.168694e-02 0.12254772
## water -2.314815e-01 -0.03577366
## super_plast 5.937804e-02 0.51379434
## coarse_agg 3.495294e-03 0.04792151
## fine_agg -3.332949e-03 0.04762188
## age 1.153975e-01 0.14376969
We can interpret this output that for every Kg/m3 the concrete’s water decreases, we are 95% confident that the corresponding strength will change between -2.315 and -0.036 MPa unit.
Model Assumptions
In order to perform multiple linear regression, there are certain assumptions that need to be considered before building a model. Assumptions regarding residuals, such as normal distribution, constant variance i.e homoscedasticity, and multicolinearity need to be checked while building a model.
Check normality
An important thing when considering a multiple linear regression model is that our assumptions are met.
One important assumption is that our errors (or residuals) have an approximate normal distribution.
A residual is our observed value minus our predicted value for any value of x. We can check this easiest by getting our residuals for all our points and making a histogram for them by using the hist() command.
One of the assumptions for linear regression stated that the error obtained from the model must be distributed normally around the mean of 0.

This histogram looks have enough evidence to show that it is approximately normally distributed. It has fairly even tails and looks fairly bell shaped.
Another good way to look at this assumption is to plot our errors along a straight line in a quantile plot. If our error is normally distributed then they will follow the straight line.

The Shapiro-Wilk hypothesis test is as follow:
##
## Shapiro-Wilk normality test
##
## data: model_beton$residuals
## W = 0.99566, p-value = 0.06284
If p-value < 0.05, then reject Ho, accept H1
Shapiro-Wilk hypothesis test: (expectation: p-value > alpha)
H0: error/residual distributed normally
H1: error/residual not-distributed normally
We have p-value 0.06284, meaning that the assumption of error/residual distributed normally is fullfilled due to p-value > 0.05.
Check homoscedasticity
Another assumption needed to test is whether or not the error of the model is homoscedasticity which means the error is distributed with equal variance (constant error) over different data ranges.

The Breusch-Pagan hypothesis test is as follow:
##
## studentized Breusch-Pagan test
##
## data: model_beton
## BP = 85.502, df = 8, p-value = 3.792e-15
If p-value < 0.05, then reject Ho, accept H1
Breusch-Pagan hypothesis test: (expectation: p-value > alpha)
H0: error/residual splitted constantly (Homoscedasticity)
H1: error/residual not splitted constantly or creating pattern (Heteroscedasticity)
We have p-value 3.792e-15, meaning that the assumption of homocedasticity is not fullfilled due to p-value < 0.05.
Check multicolinearity
Other assumption is whether or not there is multicolinearity among predictor variables.
The VIF test is as follow:
## cement slag flyash water super_plast coarse_agg
## 7.551687 6.618902 6.073964 7.048351 3.120786 4.861049
## fine_agg age
## 7.065958 1.123345
The VIF numbers are below 10, meaning that there is no multicolinearity among predictor variables.
Overfitting/underfitting calculation
Let’s calculate the R-squared for the training dataset.
## [1] 0.6288358
Let’s calculate the R-squared for the validation dataset.
## [1] 0.5577481
The prediction performance in the training dataset is higher than the prediction performance in the validation dataset. Now we can calculate the severity of overfitting, which reveals 7.11%.
## [1] 0.07108765
Model Neural Network (NN)
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
We will create our neural network model to get some numbers about the relationship between our predictors.
Model Fitting and Evaluation
Cross validation
We will create the index that we shall use to split the data into a training and validation dataset. We will use 80% of the data on training and the other 20% for validation dataset.
Overfitting/underfitting calculation
Let’s calculate the R-squared for the training dataset.
## [,1]
## [1,] 0.6147568
Let’s calculate the R-squared for the validation dataset.
## [,1]
## [1,] 0.6296329
The prediction performance in the validation dataset is higher than the prediction performance in the training dataset. Now we can calculate the severity of underfitting, which reveals 1.49%.
## [,1]
## [1,] 0.01487619
Model Decision Tree (DT)
A decision tree learning is a supervised machine learning technique for inducing a decision tree from training data. A decision tree (also referred to as a classification tree or a reduction tree) is a predictive model which is a mapping from observations about an item to conclusions about its target value. In the tree structures, leaves represent classifications (also referred to as labels), nonleaf nodes are features, and branches represent conjunctions of features that lead to the classifications.
We will create our decision tree model to get some numbers about the relationship between our predictors.
Model Fitting and Evaluation
Building model

Overfitting/underfitting calculation
Let’s calculate the R-squared for the training dataset.
## [1] 0.8329197
Let’s calculate the R-squared for the validation dataset.
## [1] 0.7703695
The prediction performance in the training dataset is higher than the prediction performance in the validation dataset. Now we can calculate the severity of overfitting, which reveals 6.25%.
## [1] 0.06255018
Model Random Forest (RF)
A random forest learning is a supervised learning algorithm which uses ensemble learning method for classification and regression. Random forest is a bagging technique and not a boosting technique. The trees in random forests are run in parallel. There is no interaction between these trees while building the trees. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
We will create our random forest model to get some numbers about the relationship between our predictors.
Model Fitting and Evaluation
Building model
## Random Forest
##
## 660 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 528, 528, 528, 528, 528
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 6.047319 0.8941465 4.540488
## 5 5.507695 0.8970091 3.970218
## 8 5.534659 0.8936224 3.925503
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
Overfitting/underfitting calculation
Let’s calculate the R-squared for the training dataset.
## [1] 0.9818889
Let’s calculate the R-squared for the validation dataset.
## [1] 0.9108781
The prediction performance in the training dataset is higher than the prediction performance in the validation dataset. Now we can calculate the severity of overfitting, which reveals 7.1%.
## [1] 0.07101081
Prediction for Data Test
We will read data-test file and create data.test object.
We will subset id variable and create data_test object for further prediction processing.
We will predict target strength from data_test object by using our model and store in strength object.
We will combined strength prediction with id and store in data_subs object.
We will write into data_subs.csv for further submission.
Interpretation & Explanation
Variable importance
After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome and their values have a significant impact on the outcome values. By contrast, variables with low importance might be omitted from a model, making it simpler and faster to fit and predict.
We have 3 (three) important variables from the varImp() function: age, cement, water.
## rf variable importance
##
## Overall
## age 100.000
## cement 83.882
## water 28.307
## super_plast 13.403
## slag 9.639
## fine_agg 6.531
## coarse_agg 1.416
## flyash 0.000
LIME Model
Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. As the name implies, it is model agnostic so it can be applied to any supervised regression or classification model. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale and asserting that it is possible to fit a simple model around a single observation that will mimic how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.
The following creates our lime object.
## [1] "data_frame_explainer" "explainer" "list"
Once we created our lime objects, we can now perform the generalized LIME algorithm by using the explain() function.
We will use 6 (six) sample of data validation for explanation.
The explain output explanation_caret is a data frame containing different information on the simple model predictions.
Most importantly, for each observation in data_val it contains the simple model fit (model_r2) and the weighted importance (feature_weight) for each important feature (feature_desc) that best describes the local relationship.
## [1] "age" "cement" "super_plast"
However, the simplest approach to interpret the results is to visualize them. There are several plotting functions provided by lime but for tabular data we are only concerned with two functions: plot_features() and plot_explanation().
We will use 3 (three) features to explain the random forest model.

## Warning: Unknown or uninitialised column: 'label'.

LIME interpretation & explanation
We do not scale the data, accordingly we don’t need to scale back the data into original value in order to be more interpretable.
We use 3 (three) dominant features to explain the random forest model, i.e,: age, cement, and super_plast. The interpretation of the plot are as follows:
The feature age and cement have high weighted correlation with strength in 6 (six) samples.
In 4 (four) samples, feature age has positively correlation with strength. Only in 1 (one) sample, feature age <= 7 days has negatively correlation with strength.
In 2 (two) samples, feature cement has positively correlation with strength. In another 2 (two) samples, feature cement <= 195 Kg/m3 has negatively correlation with strength.
In 3(three) samples, feature super_plas only has low weighted, negatively correlation with strength.
By using LIME, we can use sample of data validation for creating simple model that will mimic the global model. The model can make several clusters of a variable base on its positive/negative correlation with the target variable. The model works like the interpretable machine learning model such as Multiple Linear or Logistic Regression.
By using LIME, we can interpret weight of selected predictors and its positive/negative correlation with the target variable. Compared to Variable Importance in Random Forest, the varImp() function only give insight the degree of importance of the predictors used in the model.
Conclusion
The conclusion of the capstone project are as follows:
- We have our goal or target achieved in the validation dataset.
- The figure of “R-squared” is 91.09% (target level > 90%).
- The figure of “Mean Absolute Error” is 3.58 (target level < 4).
We have our problem can be solved by selecting Random Forest as Machine Learning Model.
Overfitting may still occur due to bias-variance trade off.
We begin our model with Ordinary Least Square (MAE: 8.87; R-squared: 55.77% ), improved by Neural Network (MAE: 8.04; R-squared: 62.96%), then improved by Decision Tree (MAE: 8.04; R-squared: 77.04%), at last improved by Random Forest (MAE: 3.58; R-squared: 91.09%).
The capstone project has potential business implementation, by building a machine learning model to predict cost of good sold for manufactured product, such as car, housing, feed, etc. Accordingly, company can set certain selling price to safeguard its profit margin.