1 Background

This is a capstone project to predict concrete compression strength from mixture properties using several machine learning models: Ordinary Least Squares, Neural Network, Decision Tree, and Random Forest.

This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on predicting the compressive strength of concrete structures.

Concrete compressive strength is determined not only by the water-cement mixture but also by the other ingredients and by how the mixture is treated. Using this dataset, we are going to find “the perfect recipe” for predicting the concrete’s compressive strength, and to explain the relationship between the ingredient concentrations, the age at testing, and the compressive strength.

https://rpubs.com/AlgoritmaAcademy/ml-capstone

2 Source of Dataset

The analysis will use the following dataset:

https://bit.ly/capstone-ml-dataset

3 Target of Machine Learning Model

The target for the “Mean Absolute Error” on the validation dataset is a level below 4.

The target for the “R-squared” on the validation dataset is a level above 90%.

4 Importing Dataset

First, we will read the concrete dataset file.

It consists of 825 observations and 10 variable columns.
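The reading step itself is a one-liner; a minimal sketch, assuming the training data is stored in a CSV file named data-train.csv (the file name and object name are assumptions):

# read the training data (file name is an assumption)
concrete <- read.csv("data-train.csv")

dim(concrete)    # expect 825 rows and 10 columns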

4.1 Context

The dataset was created to predict compressive strength from the mixture properties.

4.2 Content

The observation dataset consists of the following variables:

id: ID of each cement mixture,
cement: The amount of cement (kg) in a m3 mixture,
slag: The amount of blast furnace slag (kg) in a m3 mixture,
flyash: The amount of fly ash (kg) in a m3 mixture,
water: The amount of water (kg) in a m3 mixture,
super_plast: The amount of superplasticizer (kg) in a m3 mixture,
coarse_agg: The amount of coarse aggregate (kg) in a m3 mixture,
fine_agg: The amount of fine aggregate (kg) in a m3 mixture,
age: The number of resting days before the compressive strength measurement,
strength: Concrete compressive strength, measured in MPa.

5 Data Wrangling

Looking at this data, we are interested in creating a model that can predict the strength (MPa) of concrete based on the other variables: cement (kg/m3), slag (kg/m3), flyash (kg/m3), water (kg/m3), super_plast (kg/m3), coarse_agg (kg/m3), fine_agg (kg/m3), and age (days).
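Before modeling, we check for missing values. A sketch of the check that produces the output below, using the concrete object from the reading sketch above:

# count the missing values in every column
colSums(is.na(concrete))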

##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

There are no NA values in the concrete dataset.

We will drop the id column from the dataset and use the resulting beton data frame for further data preprocessing.
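A minimal sketch of this step (object names follow the description above):

# drop the id column and keep the remaining variables as beton
beton <- concrete[, names(concrete) != "id"]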

6 Data Preprocessing and Exploratory Data Analysis

We will first check the data correlations and then the data distributions.

6.1 Checking data correlation

We find the following correlations with the target variable strength (a computation sketch follows the list):

  • strength is positively correlated with age (0.3).
  • strength and cement have a fairly strong correlation (0.5).
  • super_plast has a positive linear correlation with strength (0.4).
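One way to obtain these numbers is a plain correlation matrix; the optional ggcorr() call from the GGally package draws it as a labelled heatmap (the package choice is an assumption):

# correlation matrix of all numeric variables, rounded for readability
round(cor(beton), 2)

# optional visual version
# library(GGally)
# ggcorr(beton, label = TRUE, label_round = 2)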

After getting the data prepared, the best way to see the relationship between variables is to plot them and check how the plots look. To do this we will use the plot() command, which takes the x-axis variable first and the y-axis variable second.

We will run a plot for every predictor variable with our target variable, strength.
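A sketch of the plotting loop described above:

# scatterplot of each predictor (x-axis) against strength (y-axis)
predictors <- setdiff(names(beton), "strength")
par(mfrow = c(2, 4))
for (p in predictors) {
  plot(beton[[p]], beton$strength, xlab = p, ylab = "strength")
}
par(mfrow = c(1, 1))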

Ideally, for a linear regression model, there should be a linear relationship between the target variable and the predictor variables.

The scatterplots show that most of the relationships are curvilinear rather than linear.

6.2 Checking data distribution

Using the summary() function below, we find that the data distribution varies among the predictor variables.

All the predictors except age are measured in kg/m3; accordingly, we choose not to scale the predictors. Rescaling a predictor in a regression has no effect on the relationship being studied: the fit, the p-values, and the variance explained do not change; only the units (and hence the numeric value) of the coefficient change.

##      cement           slag            flyash           water      
##  Min.   :102.0   Min.   :  0.00   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :275.1   Median : 20.00   Median :  0.00   Median :184.0  
##  Mean   :280.9   Mean   : 73.18   Mean   : 54.03   Mean   :181.1  
##  3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.40   Max.   :200.10   Max.   :247.0  
##   super_plast       coarse_agg        fine_agg          age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0   1st Qu.:  7.00  
##  Median : 6.500   Median : 968.0   Median :780.1   Median : 28.00  
##  Mean   : 6.266   Mean   : 972.8   Mean   :775.6   Mean   : 45.14  
##  3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##     strength    
##  Min.   : 2.33  
##  1st Qu.:23.64  
##  Median :34.57  
##  Mean   :35.79  
##  3rd Qu.:45.94  
##  Max.   :82.60

Using the plot_num() function below, we find that most of the numerical variables are not normally distributed (they do not follow a bell curve).

As shown by the boxplot() function below, we can identify outliers in several predictor variables: slag, water, super_plast, fine_agg, and age.
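A sketch of both checks (plot_num() comes from the funModeling package; that dependency is an assumption):

# distribution of every numeric variable
# library(funModeling)
# plot_num(beton)

# boxplots of the predictors flagged above as containing outliers
boxplot(beton[, c("slag", "water", "super_plast", "fine_agg", "age")],
        main = "Boxplots of selected predictors")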

7 Model Ordinary Least Square

Multiple linear regression is a tool for looking at the relationship among three or more variables (quantitative or categorical). The regression model uses two or more predictor variables to predict one dependent variable of interest.

We will create our regression model using Ordinary Least Squares to get some numbers about the relationship between our predictors and the target. Ordinary Least Squares, or Linear Least Squares, estimates the parameters in a regression model by minimizing the sum of squared residuals. This method draws a line through the data points that minimizes the sum of squared differences between the observed values and the corresponding fitted values.

We are looking for numbers that tell us about the correlation and the error of our model. The correlation is the big one because it tells us how well our predictor variables predict the target variable.

7.1 Model Fitting and Evaluation

7.1.1 Cross validation

We will create the index that we shall use to split the data into a training and a validation dataset. We will use 80% of the data for training and the other 20% for validation.
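A sketch of the 80/20 split. The seed and object names are assumptions; beton.train matches the lm() call shown later, and data_val is the name used in the LIME section:

set.seed(100)
idx <- sample(nrow(beton), size = 0.8 * nrow(beton))
beton.train <- beton[idx, ]   # 80% of the rows (660 observations) for training
data_val    <- beton[-idx, ]  # remaining 20% (165 observations) for validation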

7.1.2 Building model

We will build a multiple linear regression model using all predictor variables with the lm() function, with strength as the target variable.
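A sketch of the fitting step that produces the summary below (the object name model_beton matches the residual tests later in the report):

# OLS with all predictors
model_beton <- lm(strength ~ ., data = beton.train)
summary(model_beton)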

## 
## Call:
## lm(formula = strength ~ ., data = beton.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.862  -6.322   0.598   6.610  34.017 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.840955  32.296665  -1.141  0.25441    
## cement        0.122580   0.010443  11.738  < 2e-16 ***
## slag          0.105616   0.012206   8.653  < 2e-16 ***
## flyash        0.092117   0.015497   5.944 4.53e-09 ***
## water        -0.133628   0.049834  -2.681  0.00752 ** 
## super_plast   0.286586   0.115709   2.477  0.01351 *  
## coarse_agg    0.025708   0.011312   2.273  0.02338 *  
## fine_agg      0.022144   0.012975   1.707  0.08835 .  
## age           0.129584   0.007224  17.937  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.31 on 651 degrees of freedom
## Multiple R-squared:  0.6288, Adjusted R-squared:  0.6243 
## F-statistic: 137.9 on 8 and 651 DF,  p-value: < 2.2e-16

7.1.3 Feature selection

To optimize the model, we’ll do feature selection using backward elimination with the step() function.
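A sketch of the backward elimination step:

# backward elimination based on AIC, starting from the full model
model_step <- step(model_beton, direction = "backward", trace = FALSE)
model_step
summary(model_step)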

## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + fine_agg + age, data = beton.train)
## 
## Coefficients:
## (Intercept)       cement         slag       flyash        water  super_plast  
##   -36.84096      0.12258      0.10562      0.09212     -0.13363      0.28659  
##  coarse_agg     fine_agg          age  
##     0.02571      0.02214      0.12958
## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + fine_agg + age, data = beton.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.862  -6.322   0.598   6.610  34.017 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.840955  32.296665  -1.141  0.25441    
## cement        0.122580   0.010443  11.738  < 2e-16 ***
## slag          0.105616   0.012206   8.653  < 2e-16 ***
## flyash        0.092117   0.015497   5.944 4.53e-09 ***
## water        -0.133628   0.049834  -2.681  0.00752 ** 
## super_plast   0.286586   0.115709   2.477  0.01351 *  
## coarse_agg    0.025708   0.011312   2.273  0.02338 *  
## fine_agg      0.022144   0.012975   1.707  0.08835 .  
## age           0.129584   0.007224  17.937  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.31 on 651 degrees of freedom
## Multiple R-squared:  0.6288, Adjusted R-squared:  0.6243 
## F-statistic: 137.9 on 8 and 651 DF,  p-value: < 2.2e-16

The four most influential predictors in the model are cement, slag, flyash, and age.

As we can see, our multiple R-squared value is 0.6288, which is not good enough; it means our model is not able to predict strength well. This value also means that the predictors explain only 62.88% of the variability in strength.

Our adjusted R-squared of 0.6243 is slightly lower, as expected, since we are using 8 (eight) predictor variables. This can happen when one or more predictors contribute little and end up hurting the model.

Our p-value is very small, so we can reject the null hypothesis and say that there is a relationship between at least one of the predictor variables and the target variable.

7.1.4 Coefficient confidence interval

We will show the coefficient confidence intervals using the confint() function.
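A one-line sketch of the call:

# 95% confidence intervals for the model coefficients
confint(model_beton, level = 0.95)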

##                     2.5 %      97.5 %
## (Intercept) -1.002592e+02 26.57725054
## cement       1.020746e-01  0.14308617
## slag         8.164769e-02  0.12958503
## flyash       6.168694e-02  0.12254772
## water       -2.314815e-01 -0.03577366
## super_plast  5.937804e-02  0.51379434
## coarse_agg   3.495294e-03  0.04792151
## fine_agg    -3.332949e-03  0.04762188
## age          1.153975e-01  0.14376969

We can interpret this output as follows: for every additional kg/m3 of water in the mixture, we are 95% confident that the corresponding strength changes by between -0.231 and -0.036 MPa.

7.2 Model Assumptions

In order to perform multiple linear regression, certain assumptions need to be checked while building the model. These assumptions concern the residuals: normal distribution, constant variance (homoscedasticity), and the absence of multicollinearity.

7.2.1 Check normality

An important thing when considering a multiple linear regression model is that our assumptions are met.

One important assumption is that our errors (or residuals) have an approximate normal distribution.

A residual is our observed value minus our predicted value for any value of x. We can check this most easily by taking the residuals for all our points and making a histogram of them with the hist() command.

One of the assumptions of linear regression states that the errors obtained from the model must be normally distributed around a mean of 0.

The histogram gives enough evidence that the residuals are approximately normally distributed: it has fairly even tails and looks fairly bell shaped.

Another good way to check this assumption is to plot our errors against a straight line in a quantile plot. If the errors are normally distributed, they will follow the straight line.
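A sketch of these checks on the fitted model (the Shapiro-Wilk call at the end produces the output shown below):

# histogram of the residuals
hist(model_beton$residuals, breaks = 30,
     main = "Histogram of residuals", xlab = "residual")

# normal quantile-quantile plot
qqnorm(model_beton$residuals)
qqline(model_beton$residuals)

# formal normality test
shapiro.test(model_beton$residuals)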

The Shapiro-Wilk hypothesis test is as follows:

## 
##  Shapiro-Wilk normality test
## 
## data:  model_beton$residuals
## W = 0.99566, p-value = 0.06284

If p-value < 0.05, then reject H0 and accept H1.

Shapiro-Wilk hypothesis test (expectation: p-value > alpha):

H0: the errors/residuals are normally distributed

H1: the errors/residuals are not normally distributed

We have a p-value of 0.06284; since p-value > 0.05, the assumption that the errors/residuals are normally distributed is fulfilled.

7.2.2 Check homoscedasticity

Another assumption to test is whether or not the errors of the model are homoscedastic, which means the errors are distributed with equal variance (constant error) over different data ranges.

The Breusch-Pagan hypothesis test is as follows:
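A sketch of the call (bptest() comes from the lmtest package):

# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model_beton)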

## 
##  studentized Breusch-Pagan test
## 
## data:  model_beton
## BP = 85.502, df = 8, p-value = 3.792e-15

If p-value < 0.05, then reject H0 and accept H1.

Breusch-Pagan hypothesis test (expectation: p-value > alpha):

H0: the errors/residuals have constant variance (homoscedasticity)

H1: the errors/residuals do not have constant variance, or form a pattern (heteroscedasticity)

We have a p-value of 3.792e-15; since p-value < 0.05, the assumption of homoscedasticity is not fulfilled.

7.2.3 Check multicollinearity

The other assumption is whether or not there is multicollinearity among the predictor variables.

The VIF test is as follows:
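A sketch of the call (vif() comes from the car package):

# variance inflation factors of the predictors
library(car)
vif(model_beton)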

##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.551687    6.618902    6.073964    7.048351    3.120786    4.861049 
##    fine_agg         age 
##    7.065958    1.123345

The VIF values are all below 10, meaning that there is no serious multicollinearity among the predictor variables.

7.3 Prediction Performance

We will now show the “Actual Strength vs. Predicted Strength - Ordinary Least Squares” plot.
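A sketch of the evaluation on the validation split; MAE and R-squared are computed by hand here, and the same pattern can be reused for the later models (object names are assumptions):

# predictions on the validation data
pred_ols <- predict(model_beton, newdata = data_val)

# Mean Absolute Error
mae_val <- mean(abs(data_val$strength - pred_ols))

# R-squared on the validation data
rsq_val <- 1 - sum((data_val$strength - pred_ols)^2) /
               sum((data_val$strength - mean(data_val$strength))^2)

mae_val
rsq_val

# actual vs. predicted plot with the ideal 45-degree line
plot(pred_ols, data_val$strength,
     xlab = "Predicted strength", ylab = "Actual strength",
     main = "Actual vs. Predicted - Ordinary Least Squares")
abline(0, 1, col = "red")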

The “R-squared” on the validation dataset is 55.77% (target: > 90%).

The “Mean Absolute Error” is 8.87 (target: < 4).

## [1] 8.872794

The predicted strength values are not close to the regression line.

7.4 Overfitting/underfitting calculation

Let’s calculate the R-squared for the training dataset.

## [1] 0.6288358

Let’s calculate the R-squared for the validation dataset.

## [1] 0.5577481

The prediction performance on the training dataset is higher than on the validation dataset. We can now calculate the severity of overfitting as the gap between the two R-squared values, which is 7.11 percentage points.
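A sketch of the calculation: the reported figure matches the simple difference between the training and validation R-squared values (rsq_val comes from the evaluation sketch above):

# R-squared on the training data
pred_train <- predict(model_beton, newdata = beton.train)
rsq_train  <- 1 - sum((beton.train$strength - pred_train)^2) /
                  sum((beton.train$strength - mean(beton.train$strength))^2)

# gap between training and validation R-squared
rsq_train - rsq_val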

## [1] 0.07108765

8 Model Neural Network (NN)

A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.

We will create our neural network model to get some numbers about the relationship between our predictors and the target.

8.1 Model Fitting and Evaluation

8.1.1 Cross validation

We will create the index that we shall use to split the data into a training and a validation dataset. We will use 80% of the data for training and the other 20% for validation.
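The chunk that fits the network is not shown in this report; purely as an illustration, one possible setup uses the nnet package (the package choice, hidden-layer size, and iteration count are all assumptions, so treat this as a sketch rather than the original model):

library(nnet)
set.seed(100)
# one hidden layer with 5 units; linout = TRUE because this is a regression task
model_nn <- nnet(strength ~ ., data = beton.train,
                 size = 5, linout = TRUE, maxit = 1000)

# predictions on the validation data (returned as a one-column matrix)
pred_nn <- predict(model_nn, newdata = data_val)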

8.2 Prediction Performance

We will show the “Actual Strength vs. Predicted Strength - Neural Network Model” plot.

The “R-squared” on the validation dataset is 62.96% (target: > 90%).

The “Mean Absolute Error” is 8.04 (target: < 4).

## [1] 8.037525
## R-squared is 0.629633

The predicted strength values are not close to the regression line.

8.3 Overfitting/underfitting calculation

Let’s calculate the R-squared for the training dataset.

##           [,1]
## [1,] 0.6147568

Let’s calculate the R-squared for the validation dataset.

##           [,1]
## [1,] 0.6296329

The prediction performance on the validation dataset is slightly higher than on the training dataset. The gap between the two R-squared values is 1.49 percentage points, indicating slight underfitting rather than overfitting.

##            [,1]
## [1,] 0.01487619

9 Model Decision Tree (DT)

Decision tree learning is a supervised machine learning technique for inducing a decision tree from training data. A decision tree (also referred to as a classification or regression tree) is a predictive model that maps observations about an item to conclusions about its target value. In the tree structure, leaves represent predictions (also referred to as labels), non-leaf nodes are features, and branches represent conjunctions of features that lead to the predictions.

We will create our decision tree model to get some numbers about the relationship between our predictors and the target.
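The fitting chunk is not shown; as a sketch, a regression tree can be grown with the rpart package (the package choice and object names are assumptions):

library(rpart)
# regression tree ("anova" method) on all predictors
model_dt <- rpart(strength ~ ., data = beton.train, method = "anova")

# predictions on the validation data
pred_dt <- predict(model_dt, newdata = data_val)

# optional tree diagram
# library(rpart.plot)
# rpart.plot(model_dt)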

9.2 Prediction Performance

We will show the “Actual Strength vs. Predicted Strength - Decision Tree Model” plot.

The “R-squared” on the validation dataset is 77.04% (target: > 90%).

The “Mean Absolute Error” is 8.04 (target: < 4).

## [1] 8.037525
## R-squared is 0.770370

The predicted strength values are not close to the regression line.

9.3 Overfitting/underfitting calculation

Let’s calculate the R-squared for the training dataset.

## [1] 0.8329197

Let’s calculate the R-squared for the validation dataset.

## [1] 0.7703695

The prediction performance on the training dataset is higher than on the validation dataset. The severity of overfitting, measured as the gap between the two R-squared values, is 6.25 percentage points.

## [1] 0.06255018

10 Model Random Forest (RF)

Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. Random forest is a bagging technique, not a boosting technique. The trees in a random forest are built in parallel, with no interaction between them during training. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

We will create our random forest model to get some numbers about the relationship between our predictors and the target.

10.1 Model Fitting and Evaluation

10.1.1 Building model
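A sketch of the fitting step that produces the resampling summary below; the 5-fold cross-validation and the default mtry grid (2, 5, 8) match the printed output, while the seed and object name are assumptions:

library(caret)
set.seed(100)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
model_rf <- train(strength ~ .,
                  data = beton.train,
                  method = "rf",
                  trControl = ctrl)
model_rf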

## Random Forest 
## 
## 660 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 528, 528, 528, 528, 528 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     6.047319  0.8941465  4.540488
##   5     5.507695  0.8970091  3.970218
##   8     5.534659  0.8936224  3.925503
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

10.2 Prediction Performance

We will show the “Actual Strength vs. Predicted Strength - Random Forest Model” plot.

The “R-squared” on the validation dataset is 91.09% (target: > 90%).

The “Mean Absolute Error” is 3.58 (target: < 4).

## [1] 3.583162
## R-squared is 0.910878

The predicted strength values are much closer to the regression line.

10.3 Overfitting/underfitting calculation

Let’s calculate the R-squared for the training dataset.

## [1] 0.9818889

Let’s calculate the R-squared for the validation dataset.

## [1] 0.9108781

The prediction performance on the training dataset is higher than on the validation dataset. The severity of overfitting, measured as the gap between the two R-squared values, is 7.1 percentage points.

## [1] 0.07101081

10.4 Prediction for Data Test

We will read the data-test file and create the data.test object.

We will drop the id variable and create the data_test object for further prediction processing.

We will predict the target strength from the data_test object using our model and store it in the strength object.

We will combine the strength predictions with id and store them in the data_subs object.

We will write the result to data_subs.csv for submission.
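A sketch of the whole submission step, following the description above (the data-test file name is an assumption):

# read the test data and set the id column aside
data.test <- read.csv("data-test.csv")
data_test <- data.test[, names(data.test) != "id"]

# predict strength with the selected random forest model
strength <- predict(model_rf, newdata = data_test)

# combine the predictions with id and write the submission file
data_subs <- data.frame(id = data.test$id, strength = strength)
write.csv(data_subs, "data_subs.csv", row.names = FALSE)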

11 Interpretation & Explanation

11.1 Variable importance

After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome and their values have a significant impact on the outcome values. By contrast, variables with low importance might be omitted from a model, making it simpler and faster to fit and predict.
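A one-line sketch of the call that produces the ranking below:

# scaled variable importance of the caret random forest model
varImp(model_rf)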

We have 3 (three) dominant variables according to the varImp() function: age, cement, and water.

## rf variable importance
## 
##             Overall
## age         100.000
## cement       83.882
## water        28.307
## super_plast  13.403
## slag          9.639
## fine_agg      6.531
## coarse_agg    1.416
## flyash        0.000

11.2 LIME Model

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. As the name implies, it is model-agnostic, so it can be applied to any supervised regression or classification model. Behind LIME lies the assumption that every complex model is linear on a local scale, and the assertion that it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.

The following creates our lime object.
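A sketch of this step (explainer_caret is an assumed name; lime() takes the training predictors and the fitted model):

library(lime)
# build the explainer from the training predictors and the random forest model
explainer_caret <- lime(beton.train[, names(beton.train) != "strength"], model_rf)
class(explainer_caret)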

## [1] "data_frame_explainer" "explainer"            "list"

Once we have created our lime object, we can perform the generalized LIME algorithm using the explain() function.

We will use 6 (six) samples from the validation data for the explanation.
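A sketch of the call (the choice of the first six validation rows is an assumption):

# explain 6 validation observations, keeping the 3 most influential features each
explanation_caret <- explain(
  x = data_val[1:6, names(data_val) != "strength"],
  explainer = explainer_caret,
  n_features = 3
)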

The explain output explanation_caret is a data frame containing different information on the simple model predictions.

Most importantly, for each observation in data_val it contains the simple model fit (model_r2) and the weighted importance (feature_weight) for each important feature (feature_desc) that best describes the local relationship.

## [1] "age"         "cement"      "super_plast"

However, the simplest approach to interpreting the results is to visualize them. There are several plotting functions provided by lime, but for tabular data we are only concerned with two: plot_features() and plot_explanations().

We will use 3 (three) features to explain the random forest model.
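A one-line sketch of the visualization:

# one panel per explained observation, one bar per feature
plot_features(explanation_caret)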

## Warning: Unknown or uninitialised column: 'label'.

11.3 LIME interpretation & explanation

  1. We did not scale the data; accordingly, we do not need to transform the data back to its original values to make it interpretable.

  2. We use 3 (three) dominant features to explain the random forest model, i.e., age, cement, and super_plast. The interpretation of the plot is as follows:

  • The features age and cement have high weights with respect to strength in all 6 (six) samples.

  • In 4 (four) samples, the feature age has a positive correlation with strength. Only in 1 (one) sample, where age <= 7 days, does it have a negative correlation with strength.

  • In 2 (two) samples, the feature cement has a positive correlation with strength. In another 2 (two) samples, where cement <= 195 kg/m3, it has a negative correlation with strength.

  • In 3 (three) samples, the feature super_plast has only a low, negative weight with respect to strength.

  3. By using LIME, we can use samples of the validation data to create a simple model that mimics the global model. The simple model bins each variable and shows its positive/negative correlation with the target variable, so it behaves like an interpretable machine learning model such as multiple linear or logistic regression.

  4. By using LIME, we can interpret the weight of the selected predictors and their positive/negative correlation with the target variable. By comparison, the varImp() function of the random forest only gives insight into the degree of importance of the predictors used in the model.

12 Conclusion

The conclusions of the capstone project are as follows:

  1. We achieved our targets on the validation dataset.
  • The “R-squared” is 91.09% (target: > 90%).
  • The “Mean Absolute Error” is 3.58 (target: < 4).
  2. The problem can be solved by selecting Random Forest as the machine learning model. Overfitting may still occur due to the bias-variance trade-off.

  3. We began with Ordinary Least Squares (MAE: 8.87; R-squared: 55.77%), then improved the results with a Neural Network (MAE: 8.04; R-squared: 62.96%), a Decision Tree (MAE: 8.04; R-squared: 77.04%), and finally a Random Forest (MAE: 3.58; R-squared: 91.09%).

  4. The capstone project has potential business applications, such as building a machine learning model to predict the cost of goods sold for manufactured products (cars, housing, feed, etc.). Accordingly, a company can set a selling price that safeguards its profit margin.