The chosen case is: Concrete Prediction
Business Question: Interpret how each ingredient and the testing age affect the concrete compressive strength.
Determine the steps to be performed in Data Preprocessing:
library(dplyr)
library(GGally)
library(rsample)
library(lmtest)
library(car)
library(caret)
library(partykit)
# install.packages("lime")
library(lime)
# install.packages("KScorrect")
library(KScorrect)
# install.packages("neuralnet")
library(neuralnet)
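The chunk that reads the training data is not shown in the knitted output; a minimal sketch, assuming the file sits next to the test set referenced at the end (the path data/data-train.csv is a hypothetical guess):
# hypothetical path; the actual training file name is not shown in this document
concrete <- read.csv("data/data-train.csv")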
# The `id` column isn't meaningful so we're going to take it out
concrete <- concrete[,-1]
head(concrete)
summary(concrete)
## cement slag flyash water
## Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
## 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
## Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
## Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
## Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
## super_plast coarse_agg fine_agg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
## Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
## Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
## 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## strength
## Min. : 2.33
## 1st Qu.:23.64
## Median :34.57
## Mean :35.79
## 3rd Qu.:45.94
## Max. :82.60
We have 825 observations of 9 variables, all numeric, with no missing values (NA).
The observation data consists of the following variables:
id: ID of each cement mixture,
cement: the amount of cement (kg) in a m3 mixture,
slag: the amount of blast furnace slag (kg) in a m3 mixture,
flyash: the amount of fly ash (kg) in a m3 mixture,
water: the amount of water (kg) in a m3 mixture,
super_plast: the amount of superplasticizer (kg) in a m3 mixture,
coarse_agg: the amount of coarse aggregate (kg) in a m3 mixture,
fine_agg: the amount of fine aggregate (kg) in a m3 mixture,
age: the number of resting days before the compressive strength measurement,
strength: concrete compressive strength measurement in MPa.
The concrete data has outliers in many of the predictor variables and in the target variable. We will scale the data later to normalize the distribution.
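The chunk that reveals these outliers is not shown; a minimal sketch of one way to see them, assuming base-R boxplots are enough:
# one boxplot per column makes the outliers visible at a glance
boxplot(concrete, las = 2, main = "Distribution of each variable")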
Here, we shall look at the distribution of the predictor variables using histograms:
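The plotting chunk itself is not shown in the knitted output; a minimal sketch of one way to produce these histograms in base R:
# draw one histogram per variable to inspect each distribution
par(mfrow = c(3, 3))
for (col in names(concrete)) {
  hist(concrete[[col]], main = col, xlab = col)
}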
Based on these histograms, the distributions of the predictor variables vary widely. Many of the variables with outliers do not follow a normal, bell-shaped distribution. I will normalize the data with scaling later.
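The correlation plot chunk is not shown either; given that GGally is loaded, it was presumably something like:
# pairwise correlation matrix with the coefficients printed in each cell
ggcorr(concrete, label = TRUE, label_round = 2)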
On this plot, we can see that cement, super_plast, and age have the highest positive correlations with the target variable strength.
To answer the questions:
- Is strength positively correlated with age? Yes, with a value of 0.2.
- Do strength and cement have a strong correlation? Yes, with a value of 0.4.
- Does super_plast have a linear correlation with strength? I will check with cor.test here:
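The test chunk is not shown, but the output below names its inputs, so it was presumably:
cor.test(concrete$super_plast, concrete$strength)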
##
## Pearson's product-moment correlation
##
## data: concrete$super_plast and concrete$strength
## t = 11.127, df = 823, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3007983 0.4195290
## sample estimates:
## cor
## 0.3616289
With p-value < 0.05, we can conclude that super_plast and strength have a significant linear correlation.
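The chunk that produced the p-values below is not shown; a minimal sketch, assuming it looped cor.test over the remaining predictors (super_plast was already tested above):
# print the cor.test p-value of each remaining predictor against strength
for (col in setdiff(names(concrete), c("strength", "super_plast"))) {
  print(cor.test(concrete[[col]], concrete$strength)$p.value)
}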
## [1] 0.0000000000000000000000000000000000000000000000000001903868
## [1] 0.0002271348
## [1] 0.002878826
## [1] 0.0000000000000001108019
## [1] 0.00009244434
## [1] 0.0000000658619
## [1] 0.000000000000000000000005228994
From the linearity test, all the variables have a significant linear correlation with strength.
We need to split the data into training and validation sets to check how well the model fits its own training data. Later we will compare the MAE and R-squared against other models.
set.seed(1000)
concrete_sample <- sample(nrow(concrete), nrow(concrete)*0.8)
concrete_train <- concrete[concrete_sample,]
concrete_val <- concrete[-concrete_sample,]
I use 80% of the data for training and the remaining 20% for validation.
As a first step, I will perform feature selection using stepwise regression.
# making model_none
model_none <- lm(strength ~ 1, concrete_train)
model_all <- lm(strength ~ ., concrete_train)
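The backward step is not shown in the chunk above, but the printed call below matches a backward elimination from model_all; presumably:
# backward
step(object = model_all, scope = list(lower = model_none, upper = model_all),
     direction = "backward", trace = 0)
##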
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
# forward
step(object = model_none, scope = list(lower = model_none, upper = model_all),
     direction = "forward", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + super_plast + age + slag + water +
## flyash + coarse_agg + fine_agg, data = concrete_train)
##
## Coefficients:
## (Intercept) cement super_plast age slag water
## -50.03594 0.12460 0.44089 0.11592 0.10375 -0.10724
## flyash coarse_agg fine_agg
## 0.08604 0.03306 0.02280
# both
step(object = model_all, scope = list(lower = model_none, upper = model_all),
     direction = "both", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
All three stepwise directions select the same variables, so we will use the backward-selection result.
I fit model_back on the concrete_train data and evaluate the R-squared:
model_back <- lm(formula = strength ~ cement + slag + flyash + water + super_plast +
coarse_agg + fine_agg + age, data = concrete_train)
summary(model_back)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.345 -6.705 0.358 7.385 33.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.035937 32.382924 -1.545 0.122800
## cement 0.124598 0.010469 11.901 < 0.0000000000000002 ***
## slag 0.103754 0.012352 8.400 0.00000000000000028 ***
## flyash 0.086035 0.015466 5.563 0.00000003877438740 ***
## water -0.107242 0.048940 -2.191 0.028785 *
## super_plast 0.440888 0.117559 3.750 0.000192 ***
## coarse_agg 0.033064 0.011513 2.872 0.004215 **
## fine_agg 0.022799 0.013111 1.739 0.082526 .
## age 0.115920 0.006639 17.460 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 651 degrees of freedom
## Multiple R-squared: 0.6149, Adjusted R-squared: 0.6102
## F-statistic: 129.9 on 8 and 651 DF, p-value: < 0.00000000000000022
In the summary, we can see that fine_agg is not a significant predictor in model_back (p-value 0.0825 > 0.05). The multiple R-squared is 0.6149 and the adjusted R-squared is 0.6102.
Before we predict on the validation data, let's check the normality of the residuals:
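The chunk is not shown; presumably it was:
# visual and formal checks of residual normality
hist(model_back$residuals)
shapiro.test(model_back$residuals)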
##
## Shapiro-Wilk normality test
##
## data: model_back$residuals
## W = 0.99619, p-value = 0.1123
From hist() and shapiro.test(), we can see that the residuals look normal in the histogram, and the Shapiro-Wilk p-value of 0.1123 > 0.05 means we fail to reject the null hypothesis: the residuals of the model are normally distributed.
The plot of the model's fitted.values against its residuals shows a pattern, suggesting the errors are not homoscedastic. We confirm with a Breusch-Pagan test:
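Neither chunk is shown; a minimal sketch:
# visual check, then the formal Breusch-Pagan test from lmtest
plot(model_back$fitted.values, model_back$residuals)
abline(h = 0, col = "red")
bptest(model_back)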
##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 91.326, df = 8, p-value = 0.0000000000000002501
With p-value < 0.05 we reject the null hypothesis of homoscedasticity: the residuals are heteroscedastic.
To see whether the model is overfitting, we evaluate it on the validation data:
pred_concretelm <- predict(object = model_back, newdata = concrete_val)
MAE(pred = pred_concretelm, obs = concrete_val$strength)
## [1] 8.083035
The requirement is an MAE < 4 and an R-squared > 90%. Since the MAE of model_back is 8.083035 and the adjusted R-squared is only about 61.02%, we need to try normalizing the data.
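The scaling chunk is not shown; presumably:
# z-score standardization of every column; scale() returns a matrix
concrete_scale <- as.data.frame(scale(concrete))
summary(concrete_scale)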
## cement slag flyash water
## Min. :-1.71772 Min. :-0.8498 Min. :-0.8458 Min. :-2.8030
## 1st Qu.:-0.82783 1st Qu.:-0.8498 1st Qu.:-0.8458 1st Qu.:-0.7663
## Median :-0.05602 Median :-0.6176 Median :-0.8458 Median : 0.1363
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.66299 3rd Qu.: 0.7911 3rd Qu.: 1.0044 3rd Qu.: 0.5144
## Max. : 2.48692 Max. : 3.3240 Max. : 2.2863 Max. : 3.1135
## super_plast coarse_agg fine_agg age
## Min. :-1.04874 Min. :-2.24044 Min. :-2.24166 Min. :-0.7114
## 1st Qu.:-1.04874 1st Qu.:-0.53228 1st Qu.:-0.51382 1st Qu.:-0.6147
## Median : 0.03924 Median :-0.06287 Median : 0.05514 Median :-0.2763
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.64181 3rd Qu.: 0.72471 3rd Qu.: 0.63150 3rd Qu.: 0.1750
## Max. : 4.34095 Max. : 2.24510 Max. : 2.67776 Max. : 5.1545
## strength
## Min. :-2.00171
## 1st Qu.:-0.72678
## Median :-0.07287
## Mean : 0.00000
## 3rd Qu.: 0.60737
## Max. : 2.80064
After we scale the data and check the histograms, we can see that the variables are centred near zero. As before, cement, super_plast, and age have the highest positive correlations with the target variable strength.
set.seed(1000)
concrete_sample2 <- sample(nrow(concrete_scale), nrow(concrete_scale)*0.8)
# note: this subsets the unscaled `concrete`; to actually train on the
# scaled data these should subset `concrete_scale` instead
concrete_train2 <- concrete[concrete_sample2,]
concrete_val2 <- concrete[-concrete_sample2,]
# making model_none2 & model_all2
model_none2 <- lm(strength ~ 1, concrete_train2)
model_all2 <- lm(strength ~ ., concrete_train2)
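As before, the backward step is not shown but its printed call follows; presumably:
# backward
step(object = model_all2, scope = list(lower = model_none2, upper = model_all2),
     direction = "backward", trace = 0)
##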
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
# forward
step(object = model_none2, scope = list(lower = model_none2, upper = model_all2),
     direction = "forward", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + super_plast + age + slag + water +
## flyash + coarse_agg + fine_agg, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement super_plast age slag water
## -50.03594 0.12460 0.44089 0.11592 0.10375 -0.10724
## flyash coarse_agg fine_agg
## 0.08604 0.03306 0.02280
# both
step(object = model_all2, scope = list(lower = model_none2, upper = model_all2),
     direction = "both", trace = 0)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete_train2)
##
## Coefficients:
## (Intercept) cement slag flyash water super_plast
## -50.03594 0.12460 0.10375 0.08604 -0.10724 0.44089
## coarse_agg fine_agg age
## 0.03306 0.02280 0.11592
All three stepwise directions again select the same variables. Since fine_agg was not significant in model_back, I drop it and fit model_back2 on the concrete_train2 data, then evaluate the adjusted R-squared:
model_back2 <- lm(formula = strength ~ cement + slag + flyash + water + super_plast +
    coarse_agg + age, data = concrete_train2)
summary(model_back2)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + age, data = concrete_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.899 -6.956 0.541 7.537 33.651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.869198 11.107809 0.258 0.796253
## cement 0.108928 0.005337 20.412 < 0.0000000000000002 ***
## slag 0.085475 0.006496 13.158 < 0.0000000000000002 ***
## flyash 0.064925 0.009597 6.765 0.0000000000297 ***
## water -0.175148 0.029544 -5.928 0.0000000049629 ***
## super_plast 0.417081 0.116941 3.567 0.000388 ***
## coarse_agg 0.016744 0.006680 2.507 0.012428 *
## age 0.115094 0.006633 17.353 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 652 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.609
## F-statistic: 147.6 on 7 and 652 DF, p-value: < 0.00000000000000022
In the summary, every remaining predictor in model_back2 is significant. The multiple R-squared is 0.6131 and the adjusted R-squared is 0.609.
Based on model_back2, the suggested relationship between the ingredients, age, and strength is:
strength = 2.869198 + 0.108928*cement + 0.085475*slag + 0.064925*flyash - 0.175148*water + 0.417081*super_plast + 0.016744*coarse_agg + 0.115094*age
To see whether the model is overfitting, we evaluate it on the validation data:
pred_concretelm2 <- predict(object = model_back2, newdata = concrete_val2)
MAE(pred = pred_concretelm2, obs = concrete_val2$strength)
## [1] 8.122788
The requirement is an MAE < 4 and an R-squared > 90%. Since the MAE of model_back2 is 8.122788 and the adjusted R-squared is still around 60.9%, we will try another kind of model.
Do you need to log-transform or scale the variables with a square root? Let's try transforming the target:
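The transformation chunk is not shown; a minimal sketch, assuming a square-root transform of the target on the same predictors (the exact transform that produced the MAE below is not stated, so this is a guess):
# fit on sqrt(strength), predict, then square to get back to MPa
model_sqrt <- lm(sqrt(strength) ~ cement + slag + flyash + water + super_plast +
    coarse_agg + age, data = concrete_train2)
pred_sqrt <- predict(model_sqrt, newdata = concrete_val2)^2
MAE(pred = pred_sqrt, obs = concrete_val2$strength)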
## [1] 6.904247
The MAE improves to 6.904247, but it is still above the requirement, and the R-squared for this model is unknown.
Using the scaled data, I will try to build a model with a Neural Network, using two hidden layers (7 and 4 neurons) and 3 repetitions.
set.seed(100)
model_nn <- neuralnet(formula = strength ~ ., data = concrete_train2, hidden = c(7, 4), rep = 3)
plot(model_nn, rep = "best")
Before I predict with the model, I will check which repetition is best with which.min():
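The chunk is not shown; presumably:
# the repetition with the lowest training error
which.min(model_nn$result.matrix["error", ])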
## [1] 2
pred_nn <- compute(model_nn, concrete_val2[,-9], rep= 2)
MAE(pred = pred_nn$net.result, obs = concrete_val2$strength)
## [1] 12.99147
The MAE comes out at 12.99147, worse than any other model so far. Next, we will try to build a model with Random Forest.
library(rsample)
set.seed(1000)
splitted <- initial_split(data = concrete, prop = 0.7, strata = "strength")
concrete_trainRF <- training(splitted)
concrete_testRF <- testing(splitted)
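The chunk that trains the first random forest and prints its variable importance and MAE is not shown; a minimal sketch, assuming caret's train() with method = "rf" (the object name model_RF is taken from the final prediction chunk):
set.seed(1000)
model_RF <- train(strength ~ ., data = concrete_trainRF, method = "rf")
varImp(model_RF)
pred_RF <- predict(model_RF, concrete_testRF)
MAE(pred = pred_RF, obs = concrete_testRF$strength)
## rf variable importance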
##
## Overall
## age 100.0000
## cement 78.8663
## water 31.6940
## super_plast 12.2839
## slag 6.2119
## fine_agg 2.8969
## coarse_agg 0.5443
## flyash 0.0000
## [1] 4.07931
With Random Forest, the MAE drops to 4.07931. The varImp() output above ranks the variables from most to least important. I will now try to tune the model by setting the ntree parameter.
set.seed(1000)
control <- trainControl(method = "cv", number = 10)
model_RF2 <- train(strength ~., data = concrete_trainRF, method = "rf", ntree = 5000, trControl= control)
pred_RF2 <- predict(model_RF2, concrete_testRF)
MAE(pred = pred_RF2, obs = concrete_testRF$strength)
## [1] 4.055413
With the tuned Random Forest, the MAE drops to 4.055413, the closest yet to the requirement of < 4. To interpret the model, we will use LIME.
Do you need to scale the data back to its original values to make the model more interpretable? Since I use Random Forest regression, I don't need to scale back: this data was never scaled in the first place.
I build the explainer with lime() and generate explanations with explain(). How many features do I use to explain the model? Here I use 8 features, since all of the variables are important ingredients in the concrete mixture.
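The lime() chunk is not shown; a minimal sketch, assuming the tuned forest and the training predictors (the object name explainer_RF comes from the explain() call below, and model_RF2 is an assumption):
# build a LIME explainer from the training predictors and the fitted forest
explainer_RF <- lime(x = concrete_trainRF %>% select(-strength),
                     model = model_RF2)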
explanation <- explain(x = concrete_testRF %>% select(-strength),
explainer = explainer_RF,
n_permutations = 5000,
dist_fun = "gower",
labels = NULL,
n_features = 8,
feature_select = "highest_weights")
nrow(explanation)
## [1] 1960
Since explain() returns 1,960 rows (8 features for each of the 245 test observations), plotting all of them at once would fail, so we plot only a subset of the explanations:
## Warning: Unknown or uninitialised column: 'label'.
As we can see here, the most heavily weighted feature is age > 56 days (dark blue), followed by age between 28 and 56 days, cement > 359, then super_plast > 10.00, water <= 165, and so on, fading from blue to dark red.
What is the difference between using LIME and interpretable machine learning models such as Decision Trees, or metrics such as Variable Importance in Random Forest? LIME gives deeper detail on how the model weights each feature for an individual prediction. varImp(), by comparison, only ranks which variables are most important overall, without that per-prediction detail.
# read the unseen test data, predict with the random forest, and write the submission
data_test <- read.csv("data/data-test.csv")
predict_final <- predict(model_RF, data_test)
submit <- data.frame(id = data_test$id, strength = predict_final)
write.csv(submit, "concreteforest.csv", row.names = FALSE)
head(submit)