library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(randomForest)

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggplot2)
library(MLmetrics)

## 
## Attaching package: 'MLmetrics'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:base':
## 
##     Recall

library(e1071)
library(recipes)

## 
## Attaching package: 'recipes'
## 
## The following object is masked from 'package:stringr':
## 
##     fixed
## 
## The following object is masked from 'package:stats':
## 
##     step

library(lime)

## 
## Attaching package: 'lime'
## 
## The following object is masked from 'package:dplyr':
## 
##     explain

train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)

## Rows: 825
## Columns: 10
## $ id          <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…

glimpse(test)

## Rows: 205
## Columns: 10
## $ id          <chr> "S826", "S827", "S828", "S829", "S830", "S831", "S832", "S…
## $ cement      <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5, 23…
## $ slag        <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237.5,…
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water       <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0, 22…
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg  <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0, 93…
## $ fine_agg    <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0, 59…
## $ age         <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, 180…
## $ strength    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

train %>%  is.na() %>% colSums()

##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

test %>%  is.na() %>% colSums()

##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0         205

boxplot(train %>% select(-id))

hist(train$flyash)

hist(train$age)

(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? berkorelasi positif dengan age sebesat 0.3

Is strength and cement has strong correlation? iya, sebesar 0.5

Is super_plast has a linear correlation with the strength? ya, punya sebesar 0.4

ggcorr(train[,-1], label = T)

(2 Points) Demonstrate and explain how to apply some data preprocessing to make sure that your data is “ready”, such as handling outlier. What data preprocessing that you do? Answer : Data preproses yang saya lakukan contohnya diatas saya melakukan pengecekan missing value, outliers, dan distribusi data

Is there any outlier? Answer : Ya, Ada

Do you need to scale the features or the target? Answer : Perlu karena banyak data yang berdistribusi tidak normal

(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? Is strength and cement has strong correlation? Is super_plast has a linear correlation with the strength?

train_clean <- train[,-1]

train_log <- train_clean
train_log$age <- log(train_log$age)

train_no_outlier <- train_clean %>% 
  filter(strength<79.40)

train_scale <-  scale(train_no_outlier,center = TRUE,scale = TRUE)

(2 Points) Demonstrate how to prepare cross-validation data for this case. What is the proportion of the training vs testing dataset? answer : Training akan berisi 80% data dan test sisanya, dengan menggunakan cross validation

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
index <- sample(nrow(train_clean), nrow(train_clean) * 0.8)
data_train <- train_clean[index,]
data_test <- train_clean[-index,]

(2 Points) Demonstrate how to properly do model fitting and evaluation. What model do you use? Answer : Linear Regresi dan Random Forest

How do you evaluate the model? Melihat MAE dan R-squared dari setiap model dan juga melakukan pengecekan terhadap multikoliner, heteroscedasticity, normalitas dll

Is your model overfit?

Model 1 Regresi tanpa scaling

lm <- lm(strength ~., data_train)
model_original<- stats::step(lm, direction = "backward", trace = F)

model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test

pred_lm <- predict(object = lm, newdata = data_test)
MAE(pred_lm, data_test$strength)

## [1] 8.494832

summary(model_original)$adj.r.squared

## [1] 0.6146765

vif(model_original)

##      cement        slag      flyash       water super_plast         age 
##    1.864874    1.722825    2.251775    1.860621    2.332434    1.065414

lmtest::bptest(model_original)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_original
## BP = 99.71, df = 6, p-value < 2.2e-16

shapiro.test(model_original$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_original$residuals
## W = 0.99461, p-value = 0.01984

Model 2 Regresi denga scaling dan square root

train dibagi menjadi 80% dan test 20%

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
index <- sample(nrow(train_scale), nrow(train_scale) * 0.8)
data_train2 <- train_scale[index,]
data_test2 <- train_scale[-index,]

lm.scale <- lm(strength ~ .,data.frame(data_train2))
summary(lm.scale)

## 
## Call:
## lm(formula = strength ~ ., data = data.frame(data_train2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81741 -0.38551  0.03816  0.41970  1.99698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003372   0.023923  -0.141   0.8879    
## cement       0.786236   0.063934  12.298  < 2e-16 ***
## slag         0.529600   0.062454   8.480  < 2e-16 ***
## flyash       0.362403   0.058885   6.154 1.32e-09 ***
## water       -0.151167   0.060755  -2.488   0.0131 *  
## super_plast  0.107540   0.039818   2.701   0.0071 ** 
## coarse_agg   0.114383   0.051884   2.205   0.0278 *  
## fine_agg     0.120138   0.062475   1.923   0.0549 .  
## age          0.493483   0.026434  18.669  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6307 
## F-statistic: 140.8 on 8 and 647 DF,  p-value: < 2.2e-16

model_scale <- stats::step(lm.scale, direction = "backward", trace = 0)
summary(model_scale)

## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + fine_agg + age, data = data.frame(data_train2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81741 -0.38551  0.03816  0.41970  1.99698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003372   0.023923  -0.141   0.8879    
## cement       0.786236   0.063934  12.298  < 2e-16 ***
## slag         0.529600   0.062454   8.480  < 2e-16 ***
## flyash       0.362403   0.058885   6.154 1.32e-09 ***
## water       -0.151167   0.060755  -2.488   0.0131 *  
## super_plast  0.107540   0.039818   2.701   0.0071 ** 
## coarse_agg   0.114383   0.051884   2.205   0.0278 *  
## fine_agg     0.120138   0.062475   1.923   0.0549 .  
## age          0.493483   0.026434  18.669  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6307 
## F-statistic: 140.8 on 8 and 647 DF,  p-value: < 2.2e-16

model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test

Memberikan hasil yang jauh lebih memuaskan dari model 1 dan menunjukan cement, slag, flyash, dan age mempunyai korelasi yang sangat tinggi terhadap stregth

pred_scale <- predict(object = model_scale, newdata = data.frame(data_test2))
MAE(pred_scale, data.frame(data_test2)$strength)

## [1] 0.5414662

summary(model_scale)$adj.r.squared

## [1] 0.6306556

vif(model_scale)

##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.323742    6.838624    6.229559    6.141061    2.859650    4.789047 
##    fine_agg         age 
##    6.459908    1.104924

lmtest::bptest(model_scale)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_scale
## BP = 90.358, df = 8, p-value = 3.933e-16

shapiro.test(model_scale$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_scale$residuals
## W = 0.99565, p-value = 0.06401

(4 Points) Compare multiple data preprocess approach. Do you need to normalize the data? Ya butuh, karena distribusi data yang tidak normal maka di model pertama MAE yang didapat sangat besar Do you need to log-transform or scale the variables with square root? Bisa, disini saya melakukan square root dan hasil model menghasilkan nilai MAE yang lebih memuaskan dari model sebelumnya

Model 3

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(train_no_outlier), nrow(train_no_outlier) * 0.8)
data_train3 <- train_no_outlier[index,]
data_test3 <- train_no_outlier[-index,]

rf <- train[,-1]
rf <- rf %>% filter(strength<79.40)

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(417)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = train_no_outlier, method = "rf", trControl = ctrl)
saveRDS(forest, "R/model_rf.RDS") # simpan model

library(randomForest)
rfm <- readRDS("R/model_rf.RDS")
rfm

## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times) 
## Summary of sample sizes: 684, 683, 684, 683, 683, 683, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     5.670121  0.9001213  4.272428
##   5     5.040950  0.9091862  3.651791
##   8     5.072445  0.9063404  3.641625
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

rfm$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 23.79246
##                     % Var explained: 91.13

varImp(rfm)

## rf variable importance
## 
##              Overall
## age         100.0000
## cement       81.7304
## water        28.2220
## super_plast  12.7195
## slag          7.3321
## fine_agg      5.9187
## coarse_agg    0.9443
## flyash        0.0000

pred_rf <- predict(object = rfm, newdata = test)
test$strength <- pred_rf

write.csv(test %>%select(id,strength), "submission-rafif.csv",row.names=FALSE)

Compare multiple model.

Build at least 2 models or build a model then tune the parameter later. Saya membuat Regresi, Random Forest, Random Forest dengan tuner

If the model is not satisfactory, what will you do to tune the model? Satisfying, but I still performed tuning with RandomSearch and obtained MAE and R-square values that were more or less the same.

Is the tuned model perform better? Better, but it didn’t make a significant impact on the model compared to before tuning.

# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(123)
metric <- "MAE"
mtry <- sqrt(ncol(train))
rf_random <- train(strength~., data=train_no_outlier, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)

## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     7.500446  0.8448508  5.984596
##   4     4.980617  0.9127849  3.605522
##   5     4.947356  0.9125650  3.550881
##   6     4.958391  0.9113483  3.537329
##   7     4.966201  0.9106978  3.541672
##   8     4.993420  0.9092716  3.544059
## 
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.

plot(rf_random)

saveRDS(rf_random, "R/model_rfrandom.RDS") # simpan model

rfr <- read_rds("R/model_rfrandom.RDS")
varImp(rfr)

## rf variable importance
## 
##             Overall
## age         100.000
## cement       84.965
## water        28.907
## super_plast  14.563
## slag         10.698
## fine_agg      6.547
## coarse_agg    2.215
## flyash        0.000

pred_rftune <- predict(object = rfr, newdata = test)
test$strength <- pred_rftune

Not overfitting.

pred_train_rf <- predict(rfr, newdata = train)
rf.mod.R2 <- cor(pred_train_rf, train$strength)^2
rf.mod.test.R2 <- cor(pred_rftune, test$strength)^2
fit.rf.mod <- rf.mod.R2 - rf.mod.test.R2
fit.rf.mod

## [1] -0.0233749

write.csv(test %>%select(id,strength), "submission-rafif2.csv",row.names=FALSE)

Use LIME method to interpret the model that you have used.

Do you need to scale back the data into original value in order to be more interpretable?

How many features do you use to explain the model?

What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?

explainer <- lime(data_train3, rfr)
data_expl <- data_test3[1:6,]
explanation_caret <- explain(
  x = data_expl, 
  explainer = explainer, 
  n_permutations = 5000,
  dist_fun = "gower",
  kernel_width = .75,
  n_features = 2, 
  feature_select = "highest_weights",
  )
head(explainer)

## $model
## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     7.500446  0.8448508  5.984596
##   4     4.980617  0.9127849  3.605522
##   5     4.947356  0.9125650  3.550881
##   6     4.958391  0.9113483  3.537329
##   7     4.966201  0.9106978  3.541672
##   8     4.993420  0.9092716  3.544059
## 
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
## 
## $preprocess
## function (x) 
## x
## <bytecode: 0x000001d82d04e958>
## <environment: 0x000001d82d0476e0>
## 
## $bin_continuous
## [1] TRUE
## 
## $n_bins
## [1] 4
## 
## $quantile_bins
## [1] TRUE
## 
## $use_density
## [1] TRUE

plot_features(explanation_caret)

plot_explanations(explanation_caret)

## Warning: Unknown or uninitialised column: `label`.

Use LIME method to interpret the model that you have used.

Do you need to scale back the data into original value in order to be more interpretable? No need, as Lime has already assisted us in interpreting it.

How many features do you use to explain the model? 3 : Age and cement have a very high weight in the six samples above. Among the four samples, age shows a positive correlation with strength, while in the two samples, cement also exhibits a positive correlation with strength. Finally, super_plas, unlike cement, has a low weight and a negative correlation with strength.

What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest? In Vimp, we can only obtain insights into the degree of importance of its predictions. Meanwhile, with Lime, we can interpret the correlations between predictor variables and its target. Furthermore, we can create a dummy model that is specifically for interpretation purposes. This way, the model can be explained similar to a logistic regression model.

Interpret the first 4 observations of the plot.

What is the difference between interpreting black box model with LIME and using an interpretable machine learning model? A black box model can interpret complex models, which is suitable for models like Random Forest and others that have been scaled.

How good is the explanation fit? What does it signify? It’s much more helpful compared to the evaluation of a Random Forest model, which only provides a limited summary of its results.

What are the most and the least important factors for each observation? From the observations above, it can be seen that superplasticizer (superplas) doesn’t have a significant impact, while age has a high significance on strength.

Write the conclusion of your capstone project.

Is your goal achieved?

Yes, we have achieved our goals. We obtained an MAE value of 3.57 (less than 4) and an R-squared value of 91% (more than 90%).

Is the problem can be solved by machine learning? Yes, we can solve this problem with machine learning, for example, using random forest. However, overfitting can occur due to its bias-variance.

What model did you use and how is the performance? First, I used regression with an MAE of around 0.5 and an R-squared of 0.6. Then, with random forest, we achieved an MAE of around 3.5 and an R-squared of 91.

What is the potential business implementation of your capstone project? The potential use could be in predicting optimal prices for manufactured goods, such as houses, cars, etc., to maintain profit margins or to find cost-effective materials without compromising quality.

Regression Model

Rafif

2022-09-25

Model 1 Regresi tanpa scaling

Model 2 Regresi denga scaling dan square root

Model 3