The chosen case is: Concrete Prediction
Business Question: Interpret how each ingredient and the testing age affect the concrete compressive strength.
Determine the steps to be performed in Data Preprocessing:
library(dplyr)
library(GGally)
library(rsample)
library(lmtest)
library(car)
library(caret)
library(partykit)
# install.packages("lime")
library(lime)
# install.packages("KScorrect")
library(KScorrect)
# install.packages("neuralnet")
library(neuralnet)
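The chunk that reads the training data is not shown in the knitted output; a minimal sketch, assuming the file sits next to the test set referenced at the end (the path data/data-train.csv is a hypothetical guess):
# hypothetical path; the actual training file name is not shown in this document
concrete <- read.csv("data/data-train.csv")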
# The `id` column isn't meaningful so we're going to take it out
concrete <- concrete[,-1]
head(concrete)
summary(concrete)
## cement slag flyash water
## Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
## 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
## Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
## Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
## Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
## super_plast coarse_agg fine_agg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
## Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
## Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
## 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## strength
## Min. : 2.33
## 1st Qu.:23.64
## Median :34.57
## Mean :35.79
## 3rd Qu.:45.94
## Max. :82.60
We have 825 observations of 9 variables, all numeric, with no missing values (NA).
The observation data consists of the following variables:
id: ID of each cement mixture,
cement: the amount of cement (kg) in a m3 mixture,
slag: the amount of blast furnace slag (kg) in a m3 mixture,
flyash: the amount of fly ash (kg) in a m3 mixture,
water: the amount of water (kg) in a m3 mixture,
super_plast: the amount of superplasticizer (kg) in a m3 mixture,
coarse_agg: the amount of coarse aggregate (kg) in a m3 mixture,
fine_agg: the amount of fine aggregate (kg) in a m3 mixture,
age: the number of resting days before the compressive strength measurement,
strength: concrete compressive strength measurement in MPa.
The concrete data has outliers in many of the predictor variables and in the target variable. We will scale the data later to normalize the distribution.
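The chunk that reveals these outliers is not shown; a minimal sketch of one way to see them, assuming base-R boxplots are enough:
# one boxplot per column makes the outliers visible at a glance
boxplot(concrete, las = 2, main = "Distribution of each variable")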
Here, we shall look at the distribution of the predictor variables using histograms:
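The plotting chunk itself is not shown in the knitted output; a minimal sketch of one way to produce these histograms in base R:
# draw one histogram per variable to inspect each distribution
par(mfrow = c(3, 3))
for (col in names(concrete)) {
  hist(concrete[[col]], main = col, xlab = col)
}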
Based on these histograms, the distributions of the predictor variables vary widely. Many of the variables with outliers do not follow a normal, bell-shaped distribution. I will normalize the data with scaling later.
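The correlation plot chunk is not shown either; given that GGally is loaded, it was presumably something like:
# pairwise correlation matrix with the coefficients printed in each cell
ggcorr(concrete, label = TRUE, label_round = 2)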
On this plot, we can see that cement, super_plast, and age have the highest positive correlations with the target variable strength.
To answer the questions:
- Is strength positively correlated with age? Yes, with a value of 0.2.
- Do strength and cement have a strong correlation? Yes, with a value of 0.4.
- Does super_plast have a linear correlation with strength? I will check with cor.test here:
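The test chunk is not shown, but the output below names its inputs, so it was presumably:
cor.test(concrete$super_plast, concrete$strength)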
##
## Pearson's product-moment correlation
##
## data: concrete$super_plast and concrete$strength
## t = 11.127, df = 823, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3007983 0.4195290
## sample estimates:
## cor
## 0.3616289
With p-value < 0.05, we can conclude that super_plast and strength have a significant linear correlation.
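The chunk that produced the p-values below is not shown; a minimal sketch, assuming it looped cor.test over the remaining predictors (super_plast was already tested above):
# print the cor.test p-value of each remaining predictor against strength
for (col in setdiff(names(concrete), c("strength", "super_plast"))) {
  print(cor.test(concrete[[col]], concrete$strength)$p.value)
}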
## [1] 0.0000000000000000000000000000000000000000000000000001903868
## [1] 0.0002271348
## [1] 0.002878826
## [1] 0.0000000000000001108019
## [1] 0.00009244434
## [1] 0.0000000658619
## [1] 0.000000000000000000000005228994
From the linearity test, all the variables have a significant linear correlation with strength.
We need to split the data into training and validation sets to check how well the model fits its own training data. Later we will compare the MAE and R-squared against other models.
set.seed(1000)
concrete_sample <- sample(nrow(concrete), nrow(concrete)*0.8)
concrete_train <- concrete[concrete_sample,]
concrete_val <- concrete[-concrete_sample,]
I use 80% of the data for training and the remaining 20% for validation.
As a first step, I will perform feature selection using stepwise regression.
# making model_none
model_none <- lm(strength ~ 1, concrete_train)
model_all <- lm(strength ~ ., concrete_train)
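The backward step is not shown in the chunk above, but the printed call below matches a backward elimination from model_all; presumably:
# backward
step(object = model_all, scope = list(lower = model_none, upper = model_all),
     direction = "backward", trace = 0)
##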
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
# forward
step(object = model_none, scope = list(lower = model_none, upper = model_all),
     direction = "forward", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + super_plast + age + slag + water +
## flyash + coarse_agg + fine_agg, data = concrete_train)
##
## Coefficients:
## (Intercept) cement super_plast age slag water
## -50.03594 0.12460 0.44089 0.11592 0.10375 -0.10724
## flyash coarse_agg fine_agg
## 0.08604 0.03306 0.02280
# both
step(object = model_all, scope = list(lower = model_none, upper = model_all),
     direction = "both", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
All three stepwise directions select the same variables, so we will use the backward-selection result.
I fit model_back on the concrete_train data and evaluate the R-squared:
model_back <- lm(formula = strength ~ cement + slag + flyash + water + super_plast +
coarse_agg + fine_agg + age, data = concrete_train)
summary(model_back)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.345 -6.705 0.358 7.385 33.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.035937 32.382924 -1.545 0.122800
## cement 0.124598 0.010469 11.901 < 0.0000000000000002 ***
## slag 0.103754 0.012352 8.400 0.00000000000000028 ***
## flyash 0.086035 0.015466 5.563 0.00000003877438740 ***
## water -0.107242 0.048940 -2.191 0.028785 *
## super_plast 0.440888 0.117559 3.750 0.000192 ***
## coarse_agg 0.033064 0.011513 2.872 0.004215 **
## fine_agg 0.022799 0.013111 1.739 0.082526 .
## age 0.115920 0.006639 17.460 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 651 degrees of freedom
## Multiple R-squared: 0.6149, Adjusted R-squared: 0.6102
## F-statistic: 129.9 on 8 and 651 DF, p-value: < 0.00000000000000022
In the summary, we can see that fine_agg is not a significant predictor in model_back (p-value 0.0825 > 0.05). The multiple R-squared is 0.6149 and the adjusted R-squared is 0.6102.
Before we predict on the validation data, let's check the normality of the residuals:
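The chunk is not shown; presumably it was:
# visual and formal checks of residual normality
hist(model_back$residuals)
shapiro.test(model_back$residuals)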
##
## Shapiro-Wilk normality test
##
## data: model_back$residuals
## W = 0.99619, p-value = 0.1123
From hist() and shapiro.test(), we can see that the residuals look normal in the histogram, and the Shapiro-Wilk p-value of 0.1123 > 0.05 means we fail to reject the null hypothesis: the residuals of the model are normally distributed.
The plot of the model's fitted.values against its residuals shows a pattern, suggesting the errors are not homoscedastic. We confirm with a Breusch-Pagan test:
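Neither chunk is shown; a minimal sketch:
# visual check, then the formal Breusch-Pagan test from lmtest
plot(model_back$fitted.values, model_back$residuals)
abline(h = 0, col = "red")
bptest(model_back)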
##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 91.326, df = 8, p-value = 0.0000000000000002501
With p-value < 0.05 we reject the null hypothesis of homoscedasticity: the residuals are heteroscedastic.
To see whether the model is overfitting, we evaluate it on the validation data:
pred_concretelm <- predict(object = model_back, newdata = concrete_val)
MAE(pred = pred_concretelm, obs = concrete_val$strength)
## [1] 8.083035
The requirement is an MAE < 4 and an R-squared > 90%. Since the MAE of model_back is 8.083035 and the adjusted R-squared is only about 61.02%, we need to try normalizing the data.
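The scaling chunk is not shown; presumably:
# z-score standardization of every column; scale() returns a matrix
concrete_scale <- as.data.frame(scale(concrete))
summary(concrete_scale)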
## cement slag flyash water
## Min. :-1.71772 Min. :-0.8498 Min. :-0.8458 Min. :-2.8030
## 1st Qu.:-0.82783 1st Qu.:-0.8498 1st Qu.:-0.8458 1st Qu.:-0.7663
## Median :-0.05602 Median :-0.6176 Median :-0.8458 Median : 0.1363
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.66299 3rd Qu.: 0.7911 3rd Qu.: 1.0044 3rd Qu.: 0.5144
## Max. : 2.48692 Max. : 3.3240 Max. : 2.2863 Max. : 3.1135
## super_plast coarse_agg fine_agg age
## Min. :-1.04874 Min. :-2.24044 Min. :-2.24166 Min. :-0.7114
## 1st Qu.:-1.04874 1st Qu.:-0.53228 1st Qu.:-0.51382 1st Qu.:-0.6147
## Median : 0.03924 Median :-0.06287 Median : 0.05514 Median :-0.2763
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.64181 3rd Qu.: 0.72471 3rd Qu.: 0.63150 3rd Qu.: 0.1750
## Max. : 4.34095 Max. : 2.24510 Max. : 2.67776 Max. : 5.1545
## strength
## Min. :-2.00171
## 1st Qu.:-0.72678
## Median :-0.07287
## Mean : 0.00000
## 3rd Qu.: 0.60737
## Max. : 2.80064
After we scale the data and check the histograms, we can see that the variables are centred near zero. As before, cement, super_plast, and age have the highest positive correlations with the target variable strength.
set.seed(1000)
concrete_sample2 <- sample(nrow(concrete_scale), nrow(concrete_scale)*0.8)
# note: this subsets the unscaled `concrete`; to actually train on the
# scaled data these should subset `concrete_scale` instead
concrete_train2 <- concrete[concrete_sample2,]
concrete_val2 <- concrete[-concrete_sample2,]
# making model_none2 & model_all2
model_none2 <- lm(strength ~ 1, concrete_train2)
model_all2 <- lm(strength ~ ., concrete_train2)
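As before, the backward step is not shown but its printed call follows; presumably:
# backward
step(object = model_all2, scope = list(lower = model_none2, upper = model_all2),
     direction = "backward", trace = 0)
##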
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
# forward
step(object = model_none2, scope = list(lower = model_none2, upper = model_all2),
     direction = "forward", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + super_plast + age + slag + water +
## flyash + coarse_agg + fine_agg, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement super_plast age slag water
## -50.03594 0.12460 0.44089 0.11592 0.10375 -0.10724
## flyash coarse_agg fine_agg
## 0.08604 0.03306 0.02280
# both
step(object = model_all2, scope = list(lower = model_none2, upper = model_all2),
     direction = "both", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
All three stepwise directions again select the same variables. Since fine_agg was not significant in model_back, I drop it and fit model_back2 on the concrete_train2 data, then evaluate the adjusted R-squared:
model_back2 <- lm(formula = strength ~ cement + slag + flyash + water + super_plast +
    coarse_agg + age, data = concrete_train2)
summary(model_back2)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + age, data = concrete_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.899 -6.956 0.541 7.537 33.651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.869198 11.107809 0.258 0.796253
## cement 0.108928 0.005337 20.412 < 0.0000000000000002 ***
## slag 0.085475 0.006496 13.158 < 0.0000000000000002 ***
## flyash 0.064925 0.009597 6.765 0.0000000000297 ***
## water -0.175148 0.029544 -5.928 0.0000000049629 ***
## super_plast 0.417081 0.116941 3.567 0.000388 ***
## coarse_agg 0.016744 0.006680 2.507 0.012428 *
## age 0.115094 0.006633 17.353 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 652 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.609
## F-statistic: 147.6 on 7 and 652 DF, p-value: < 0.00000000000000022
In the summary, every remaining predictor in model_back2 is significant. The multiple R-squared is 0.6131 and the adjusted R-squared is 0.609.
Based on model_back2, the suggested relationship between the ingredients, age, and strength is:
strength = 2.869198 + 0.108928*cement + 0.085475*slag + 0.064925*flyash - 0.175148*water + 0.417081*super_plast + 0.016744*coarse_agg + 0.115094*age
To see whether the model is overfitting, we evaluate it on the validation data:
pred_concretelm2 <- predict(object = model_back2, newdata = concrete_val2)
MAE(pred = pred_concretelm2, obs = concrete_val2$strength)
## [1] 8.122788
The requirement is an MAE < 4 and an R-squared > 90%. Since the MAE of model_back2 is 8.122788 and the adjusted R-squared is still around 60.9%, we will try another kind of model.
Do you need to log-transform or scale the variables with a square root? Let's try transforming the target:
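The transformation chunk is not shown; a minimal sketch, assuming a square-root transform of the target on the same predictors (the exact transform that produced the MAE below is not stated, so this is a guess):
# fit on sqrt(strength), predict, then square to get back to MPa
model_sqrt <- lm(sqrt(strength) ~ cement + slag + flyash + water + super_plast +
    coarse_agg + age, data = concrete_train2)
pred_sqrt <- predict(model_sqrt, newdata = concrete_val2)^2
MAE(pred = pred_sqrt, obs = concrete_val2$strength)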
## [1] 6.904247
The MAE improves to 6.904247, but it is still above the requirement, and the R-squared for this model is unknown.
Using the scaled data, I will try to build a model with a Neural Network, using two hidden layers (7 and 4 neurons) and 3 repetitions.
set.seed(100)
model_nn <- neuralnet(formula = strength ~ ., data = concrete_train2, hidden = c(7, 4), rep = 3)
plot(model_nn, rep = "best")
Before I predict with the model, I will check which repetition is best with which.min():
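The chunk is not shown; presumably:
# the repetition with the lowest training error
which.min(model_nn$result.matrix["error", ])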
## [1] 2
pred_nn <- compute(model_nn, concrete_val2[,-9], rep= 2)
MAE(pred = pred_nn$net.result, obs = concrete_val2$strength)
## [1] 12.99147
The MAE comes out at 12.99147, worse than any other model so far. Next, we will try to build a model with Random Forest.
library(rsample)
set.seed(1000)
splitted <- initial_split(data = concrete, prop = 0.7, strata = "strength")
concrete_trainRF <- training(splitted)
concrete_testRF <- testing(splitted)
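The chunk that trains the first random forest and prints its variable importance and MAE is not shown; a minimal sketch, assuming caret's train() with method = "rf" (the object name model_RF is taken from the final prediction chunk):
set.seed(1000)
model_RF <- train(strength ~ ., data = concrete_trainRF, method = "rf")
varImp(model_RF)
pred_RF <- predict(model_RF, concrete_testRF)
MAE(pred = pred_RF, obs = concrete_testRF$strength)
## rf variable importance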
##
## Overall
## age 100.0000
## cement 78.8663
## water 31.6940
## super_plast 12.2839
## slag 6.2119
## fine_agg 2.8969
## coarse_agg 0.5443
## flyash 0.0000
## [1] 4.07931
With Random Forest, the MAE drops to 4.07931. The varImp() output above ranks the variables from most to least important. I will now try to tune the model by setting the ntree parameter.
set.seed(1000)
control <- trainControl(method = "cv", number = 10)
model_RF2 <- train(strength ~., data = concrete_trainRF, method = "rf", ntree = 5000, trControl= control)
pred_RF2 <- predict(model_RF2, concrete_testRF)
MAE(pred = pred_RF2, obs = concrete_testRF$strength)
## [1] 4.055413
With the tuned Random Forest, the MAE drops to 4.055413, the closest yet to the requirement of < 4. To interpret the model, we will use LIME.
Do you need to scale the data back to its original values to make the model more interpretable? Since I use Random Forest regression, I don't need to scale back: this data was never scaled in the first place.
I build the explainer with lime() and generate explanations with explain(). How many features do I use to explain the model? Here I use 8 features, since all of the variables are important ingredients in the concrete mixture.
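The lime() chunk is not shown; a minimal sketch, assuming the tuned forest and the training predictors (the object name explainer_RF comes from the explain() call below, and model_RF2 is an assumption):
# build a LIME explainer from the training predictors and the fitted forest
explainer_RF <- lime(x = concrete_trainRF %>% select(-strength),
                     model = model_RF2)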
explanation <- explain(x = concrete_testRF %>% select(-strength),
explainer = explainer_RF,
n_permutations = 5000,
dist_fun = "gower",
labels = NULL,
n_features = 8,
feature_select = "highest_weights")
nrow(explanation)
## [1] 1960
Since explain() returns 1,960 rows (8 features for each of the 245 test observations), plotting all of them at once would fail, so we plot only a subset of the explanations:
## Warning: Unknown or uninitialised column: 'label'.
As we can see here, the most heavily weighted feature is age > 56 days (dark blue), followed by age between 28 and 56 days, cement > 359, then super_plast > 10.00, water <= 165, and so on, fading from blue to dark red.
What is the difference between using LIME and interpretable machine learning models such as Decision Trees, or metrics such as Variable Importance in Random Forest? LIME gives deeper detail on how the model weights each feature for an individual prediction. varImp(), by comparison, only ranks which variables are most important overall, without that per-prediction detail.
# read the unseen test data, predict with the random forest, and write the submission
data_test <- read.csv("data/data-test.csv")
predict_final <- predict(model_RF, data_test)
submit <- data.frame(id = data_test$id, strength = predict_final)
write.csv(submit, "concreteforest.csv", row.names = FALSE)
head(submit)