Library and Setup

library(dplyr)
library(GGally)
library(MLmetrics)
library(recipes)
library(yardstick)
library(randomForest)
library(lime)
library(caret)

Import Data

train <- read.csv("data-train.csv")
test <- read.csv("data-test.csv")
submit <- read.csv("submission-example.csv")

Data Preprocess and Exploratory Data Analysis

Demonstrated how to apply some data preprocessing to make sure that your data is “ready”, such as handling outliers.

What data preprocessing do you do?

We’ll do some data preprocessing, such as inspecting a sample of the data, checking the data types, checking for NAs, and so on.

rmarkdown::paged_table(train)

The training data has 825 rows and 10 columns, and the data types look appropriate. We’ll check whether there are any NAs in the training data.

anyNA(train)
## [1] FALSE

Good, there are no NAs (missing values) in our training data.
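To back up the earlier claim about the data types, a quick structural check could look like this (a minimal sketch using base R’s str() on the raw training data):

# inspect column types and a few sample values per column
str(train)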

rmarkdown::paged_table(test)

Our test data has 205 rows and 10 columns, and the data types look appropriate. The column strength, our target variable, still contains NAs (missing values); our goal is to predict the value of strength in the test data. Let’s check whether there are any NAs among the predictors in the test data.

anyNA(test %>% select(-strength))
## [1] FALSE

There are no missing values in the predictors of the test data. Let’s move on.

Are there any outliers?

Our target variable in this case is strength, so we’ll check whether there are any outliers in strength.

boxplot(train$strength)

From the plot, we can see that our data contains outliers. Let’s inspect their values.

sort(boxplot(train$strength, plot = F)$out)
## [1] 79.40 79.99 80.20 81.75 82.60
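These flagged points are the values above the boxplot’s upper whisker; as a quick sketch, the usual threshold (Q3 + 1.5 × IQR) can be computed directly:

# upper whisker threshold used by the boxplot rule
quantile(train$strength, 0.75) + 1.5 * IQR(train$strength)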

These are the outlier values. Outliers can influence our model, so we’ll remove them.

# remove the id column and drop the strength outliers
train <- train %>% 
          select(-id) %>% 
          filter(strength < 79.40)

Do you need to scale the features or the target?

head(train,3)
##   cement  slag flyash water super_plast coarse_agg fine_agg age strength
## 2  540.0   0.0      0   162         2.5       1055      676  28    61.89
## 3  332.5 142.5      0   228         0.0        932      594 270    40.27
## 4  332.5 142.5      0   228         0.0        932      594 365    41.05

From this sample of the training data, we can see that the variables are on very different scales and have a large spread of values. To keep this from hurting the learning and testing process, we’ll scale the data. We’ll also compare the performance of models built on scaled versus unscaled data.
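As a quick check of that spread, a summary of the predictors (a minimal sketch) makes the difference in scale visible, e.g. super_plast versus coarse_agg:

# ranges of the predictors differ widely
summary(select(train, -strength))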

Explored the relation between the target and the features.

ggcorr(train, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Is strength positively correlated with age?

strength is positively correlated with age.

Do strength and cement have a strong correlation?

strength and cement have a strong positive correlation.

Does super_plast have a linear correlation with strength?

super_plast has a linear correlation with strength.

Model Fitting and Evaluation

Demonstrated how to prepare cross-validation data for this case.

What is the proportion of the training vs testing dataset?

nrow(train)
## [1] 820
nrow(test)
## [1] 205

The proportion of our train vs test dataset is 80% training data and 20% test data. Before predicting on the test data, we’ll split the training data further: 80% for training and 20% for validation. This lets us check performance before applying the model to the test data.

RNGkind(sample.kind = "Rounding")
set.seed(34)
idx <- sample(nrow(train), nrow(train) * 0.8)
data_train <- train[idx, ]
data_validasi <- train[-idx, ]
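As an alternative to a single hold-out split, k-fold cross-validation could also be used. A sketch with caret (not used in the rest of this report) might look like this:

# hypothetical 5-fold cross-validation of a linear model with caret
set.seed(34)
ctrl  <- trainControl(method = "cv", number = 5)
cv_lm <- train(strength ~ ., data = data_train, method = "lm", trControl = ctrl)
cv_lm$results  # resampled RMSE, R-squared, and MAE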

Demonstrated how to properly do model fitting and evaluation.

What model do you use?

We’ll use a linear regression model, with backward stepwise selection (based on AIC) to drop uninformative predictors.

mod1 <- lm(formula = strength ~., data_train)
mod1 <- stats::step(object = mod1, direction = "backward")
## Start:  AIC=3097.41
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg + 
##     fine_agg + age
## 
##               Df Sum of Sq    RSS    AIC
## <none>                      71709 3097.4
## - water        1     358.7  72068 3098.7
## - super_plast  1     487.6  72197 3099.9
## - coarse_agg   1     588.8  72298 3100.8
## - fine_agg     1     627.3  72337 3101.1
## - flyash       1    4256.1  75965 3133.2
## - slag         1    8199.0  79908 3166.4
## - cement       1   14762.4  86472 3218.2
## - age          1   30652.2 102362 3328.9
# predict
pred_mod1_train <- predict(mod1, newdata = data_train)       # predictions on data_train
pred_mod1_validasi <- predict(mod1, newdata = data_validasi) # predictions on data_validasi

How do you evaluate the model?

For model evaluation, we’ll use R-squared and MAE (Mean Absolute Error). An R-squared value closer to 1 indicates better performance. MAE measures the average error, so our goal is to get the lowest MAE score possible.
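To make the metrics concrete, here is a minimal sketch of what mae_vec() and rsq_vec() compute on the validation predictions (yardstick’s rsq is the squared correlation between truth and estimate):

# MAE and R-squared written out by hand
y    <- data_validasi$strength
yhat <- pred_mod1_validasi

mean(abs(y - yhat))  # MAE: average absolute error
cor(y, yhat)^2       # R-squared: squared correlation of truth and prediction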

rbind(
  #performance in predicting data train
  "Data Train" = select(data_train, strength) %>%
  mutate(
    pred_lm = predict(mod1, data_train)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  ),

  #performance in predicting data validasi
  "Data Validation" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(mod1, data_validasi)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  )
)
##                      mae       rsq
## Data Train      8.358577 0.5911806
## Data Validation 7.662895 0.6818137

Is your model overfit?

When predicting with model mod1:
* The MAE on the training data is 8.358577.
* The MAE on the validation data is 7.662895.
* Based on MAE, the model predicts the validation data slightly better than the training data, so the model is not overfitting.

Compared multiple data preprocessing approaches.

Do you need to normalize the data?

For a regression model, normalization is not strictly required; it depends on the data. Given the variance of the data in this case, we’ll normalize it.

Do you need to log-transform or scale the variables with square root?

For a regression model this isn’t strictly required either. But given the variance and spread of the values in this case, we’ll apply some transformations (square root and scaling) to the data.

Compared multiple models.

Build at least 2 models, or build a model and then tune its parameters later.

We have built a model using linear regression and obtained R-squared = 0.6818137 and MAE = 7.662895 on the validation data. Next, we’ll try a linear regression model and a random forest on scaled data.

# define preprocess recipe from train dataset
rec <- recipe(strength ~ ., data = data_train) %>% 
  step_sqrt(all_numeric()) %>%
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()

# prepare a function to revert the recipe transformations on the target (strength)
rec_rev <- function(x, rec){

  # centering means come from step 2 (step_center), sds from step 3 (step_scale)
  means <- rec$steps[[2]]$means[["strength"]]
  sds <- rec$steps[[3]]$sds[["strength"]]

  # undo scaling and centering, then square to undo the square-root transform
  x <- (x * sds + means) ^ 2

  x
}

# apply recipe to train data
dat_train_scaled <- juice(rec)

# apply recipe to validation data
dat_test_scaled <- bake(rec, data_validasi)
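As a quick sanity check (a sketch), reverting the scaled target with rec_rev() should recover the original strength values from data_train:

# the reverted, back-transformed target should match the original values
head(rec_rev(dat_train_scaled$strength, rec))
head(data_train$strength)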

#modeling with regression
model_lm_scaled <- stats::step(lm(strength ~., data = dat_train_scaled), direction = "backward", trace = 0)

# regression model performance comparison with and without scaling
rbind(
  # performance with scaling
 "Scaled" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(model_lm_scaled, dat_test_scaled),
    pred_lm = rec_rev(pred_lm, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  ),
  
  # performance without scaling
 "Not Scaled" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(mod1, data_validasi)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  )
)
##                 mae       rsq
## Scaled     5.803299 0.8040983
## Not Scaled 7.662895 0.6818137

The model trained on scaled data performs better than the model trained on unscaled data.

If the model is not satisfactory, what will you do to tune the model?

We can try a random forest or a neural network. Here, we’ll try a random forest model on the scaled data, using the randomForest package.

set.seed(34)

#modeling with random forest from package randomForest
model_rf <- randomForest(formula =  strength ~ ., data = dat_train_scaled)
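If further tuning were needed, one option (a sketch, not used for the final model) is randomForest’s own tuneRF() to search over mtry using the out-of-bag error:

# hypothetical mtry search with tuneRF
set.seed(34)
tuneRF(
  x = as.data.frame(dat_train_scaled %>% select(-strength)),
  y = dat_train_scaled$strength,
  ntreeTry   = 300,
  stepFactor = 1.5,
  improve    = 0.01
)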

Does the tuned model perform better?

Yes. The random forest model performs better than the regression model.

rbind(
  # performance with linear regression model
 "Regression" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(model_lm_scaled, dat_test_scaled),
    pred_lm = rec_rev(pred_lm, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  ),
  
  # performance with random forest with package randomForest
 "Random Forest" = select(data_validasi, strength) %>%
  mutate(
    pred_rf = predict(model_rf, dat_test_scaled),
    pred_rf = rec_rev(pred_rf, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_rf),
    rsq = rsq_vec(truth = strength, estimate = pred_rf),
  )
)
##                    mae       rsq
## Regression    5.803299 0.8040983
## Random Forest 3.496140 0.9445154

Prediction Performance

MAE in (your own) validation dataset reaches < 4.

R-squared in (your own) validation dataset reaches > 90%.

select(data_validasi, strength) %>%
  mutate(
    pred_rf = predict(model_rf, dat_test_scaled),
    pred_rf = rec_rev(pred_rf, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_rf),
    rsq = rsq_vec(truth = strength, estimate = pred_rf),
  )
##       mae       rsq
## 1 3.49614 0.9445154

Prediction performance of the random forest model on the validation dataset has reached MAE < 4 and R-squared > 90%.

MAE in test dataset reaches < 4.

R-squared in test dataset reaches > 90%.

After checking the prediction performance on the validation dataset, we’ll predict on the test dataset and submit the result to the leaderboard.

# apply recipe to test data
dat_test1_scaled <- bake(rec, test)

#predict
pred_test <- select(test, strength) %>%
  mutate(
    pred_rf = predict(model_rf, dat_test1_scaled),
    pred_rf = rec_rev(pred_rf, rec)
  )

#get the value of prediction
pred_rf <- pred_test$pred_rf

#write submission
result <- submit %>% mutate(strength = pred_rf)
write.csv(result, file = "submit_RF.csv")

Prediction performance of the random forest model on the test dataset has reached MAE < 4 and R-squared > 90% (as evaluated on the leaderboard).

Interpretation

Use LIME method to interpret the model that you have used

Do you need to scale back the data into original value in order to be more interpretable?

Yes, we’ll scale the data back to its original values so that the explanation is easier to interpret.

How many features do you use to explain the model?

In this model, we use 8 features (predictors) to explain the model.

What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?

We can use Variable Importance in Random Forest to see which variables are most important to the model across the whole dataset. LIME, on the other hand, gives the important factors for each individual observation, along with the direction (positive or negative) in which each predictor pushes that prediction.
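For comparison, the global variable importance of our random forest can be pulled directly from the fitted model; a minimal sketch:

# global importance (increase in node purity) and its plot
randomForest::importance(model_rf)
randomForest::varImpPlot(model_rf)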

Interpret the first 4 observations of the plot

If we check the documentation of model_type, lime supports the following model objects:
* train from caret
* WrappedModel from mlr
* xgb.Booster from xgboost
* H2OModel from h2o
* keras.engine.training.Model from keras
* lda from MASS (used for low-dependency examples)

In this case, we’re using a random forest model from the randomForest package, which lime does not support out of the box, so we’ll define model_type and predict_model methods for it before using LIME.

class(model_rf)
## [1] "randomForest.formula" "randomForest"
# tell lime that a randomForest object is a regression model
model_type.randomForest <- function(x) {
  return("regression")
}

# tell lime how to obtain predictions from a randomForest object
predict_model.randomForest <- function(x, newdata, type="response") {
  
  res <- predict(x, newdata, type = "response") %>% as.data.frame()
  
  return(res)
}
set.seed(34)
explainer <- lime(dat_train_scaled %>% select(-strength), model_rf)
explanation <- explain(dat_test_scaled %>% select(-strength) %>% head(4), 
                       explainer,
                       n_features = 8,
                       feature_select = "none"
                       )

plot_features(explanation)

What is the difference between interpreting black box model with LIME and using an interpretable machine learning model?

With an interpretable machine learning model such as linear regression, we can interpret the model directly and see how much each variable influences the target by looking at the coefficients. We can also run statistical checks such as linearity, normality of residuals, homoscedasticity, and multicollinearity. For a black box model (e.g. random forest), we can use LIME to interpret the model, but the explanation is limited and works best with a small number of weighted features. For random forest regression, LIME only gives us the weight (positive or negative) with which each feature influenced the predicted value.
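For contrast, the coefficients of the interpretable linear model are available directly; a quick sketch (on the scaled data, so the coefficient magnitudes are comparable):

# coefficients (and their significance) of the stepwise linear model
summary(model_lm_scaled)$coefficients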

How good is the explanation fit? What does it signify?

Explanation fit indicates how well LIME explains the model, somewhat like the R-squared value of a linear regression. For the first 4 observations, the explanation fit is only around 0.21 - 0.46 (21% - 46%), which can be interpreted as LIME explaining only a small part of our model’s behavior.
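The explanation fit values can also be read from the explanation object itself; a small sketch (lime stores them in the model_r2 column):

# explanation fit (local R-squared) per case
unique(explanation[, c("case", "model_r2")])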

What are the most and the least important factors for each observation?

#age case 1, 2, and 3
(0.496*rec$steps[[3]]$sds[["age"]] + rec$steps[[2]]$means[["age"]])^2
## [1] 55.98527
#super_plast case 1
(0.393*rec$steps[[3]]$sds[["super_plast"]] + rec$steps[[2]]$means[["super_plast"]])^2
## [1] 6.696426
#coarse_agg case 1
(-0.0252*rec$steps[[3]]$sds[["coarse_agg"]] + rec$steps[[2]]$means[["coarse_agg"]])^2
## [1] 967.9963
(0.7457*rec$steps[[3]]$sds[["coarse_agg"]] + rec$steps[[2]]$means[["coarse_agg"]])^2
## [1] 1028.1
#water case 2, 3
(0.536*rec$steps[[3]]$sds[["water"]] + rec$steps[[2]]$means[["water"]])^2
## [1] 192.0096
#coarse_agg case 2, 3
(-0.4984*rec$steps[[3]]$sds[["coarse_agg"]] + rec$steps[[2]]$means[["coarse_agg"]])^2
## [1] 931.9995
#cement case 4
(-0.8534*rec$steps[[3]]$sds[["cement"]] + rec$steps[[2]]$means[["cement"]])^2
## [1] 190.6969
#flyash case 4
(1.07*rec$steps[[3]]$sds[["flyash"]] + rec$steps[[2]]$means[["flyash"]])^2
## [1] 117.9146
#-0.868 < age <= -0.122 case 4
(-0.868*rec$steps[[3]]$sds[["age"]] + rec$steps[[2]]$means[["age"]])^2
## [1] 7.004619
(-0.122*rec$steps[[3]]$sds[["age"]] + rec$steps[[2]]$means[["age"]])^2
## [1] 27.99862
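Since the same back-transformation is repeated for every variable above, a small hypothetical helper (assuming the same recipe steps: sqrt, center, scale) could tidy this up:

# hypothetical helper: revert a scaled threshold for any variable in the recipe
rev_threshold <- function(x, var, rec){
  (x * rec$steps[[3]]$sds[[var]] + rec$steps[[2]]$means[[var]]) ^ 2
}

rev_threshold(0.496, "age", rec)  # same value as the manual calculation above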
  1. For the first observation (Case 1), age > 55.98 days has the most positive weight (positive correlation) on the predicted strength, while super_plast <= 6.69 has the most negative weight (negative correlation). coarse_agg between 967.99 and 1028.1 was the least important factor for strength in Case 1.

  2. For the second observation (Case 2), age > 55.98 days has the most positive weight (positive correlation) on the predicted strength, while water > 192.0096 has the most negative weight (negative correlation). coarse_agg <= 931.99 was the least important factor for strength in Case 2.

  3. For the third observation (Case 3), age > 55.98 days has the most positive weight (positive correlation) on the predicted strength, while water > 192.0096 has the most negative weight (negative correlation). coarse_agg <= 931.99 was the least important factor for strength in Case 3.

  4. For the fourth observation (Case 4), cement <= 190.69 has the most negative weight (negative correlation) on the predicted strength, while flyash <= 117.9146 has the most positive weight (positive correlation). age between 7 and 27.99 was the least important factor for strength in Case 4.

Conclusion

Write the conclusion of your capstone project

Is your goal achieved?

Our goal is to predict the compression strength based on the mixture properties, using MAE (error) and R-squared as the performance metrics. We have achieved this goal with model_rf (the random forest model).

Can the problem be solved by machine learning?

Yes. We can use a random forest model to solve this case and get good performance. A linear regression model can also solve it, but the random forest performs better here.

What model did you use and how is the performance?

# comparison of model performance on the validation data
rbind(
  # performance with linear regression model without scaling
 "Regression without scaling" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(mod1, data_validasi)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  ),
  
  # performance with linear regression model  with scaling
 "Regression with scaling" = select(data_validasi, strength) %>%
  mutate(
    pred_lm = predict(model_lm_scaled, dat_test_scaled),
    pred_lm = rec_rev(pred_lm, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_lm),
    rsq = rsq_vec(truth = strength, estimate = pred_lm),
  ),
  
  # performance with random forest model with scaling
 "Random Forest with scaling" = select(data_validasi, strength) %>%
  mutate(
    pred_rf = predict(model_rf, dat_test_scaled),
    pred_rf = rec_rev(pred_rf, rec)
  ) %>% 
  summarise(
    mae = mae_vec(truth = strength, estimate = pred_rf),
    rsq = rsq_vec(truth = strength, estimate = pred_rf),
  )
)
##                                 mae       rsq
## Regression without scaling 7.662895 0.6818137
## Regression with scaling    5.803299 0.8040983
## Random Forest with scaling 3.496140 0.9445154

We used a linear regression model and a random forest model for prediction. The random forest performed better than the regression model in this case.

What is the potential business implementation of your capstone project?

  1. From the first 3 observations in LIME, we can see that the variable age > 55.98 days has the most positive weight in predicting strength, so mixtures aged more than 55.98 days tend to produce higher strength.

  2. We can also use machine learning to estimate the strength in a much shorter time, instead of waiting a long period to observe it physically, by using strength predictions as we did above.

  3. By predicting the strength, we can also help optimize building prices (for the property sector), using the predicted strength as an input when pricing a property.