library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2

## Warning: package 'ggplot2' was built under R version 4.2.2

## Warning: package 'stringr' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(GGally)

## Warning: package 'GGally' was built under R version 4.2.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.2.2

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

## Warning: package 'caret' was built under R version 4.2.2

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(car)

## Warning: package 'car' was built under R version 4.2.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggplot2)
library(MLmetrics)

## Warning: package 'MLmetrics' was built under R version 4.2.2

## 
## Attaching package: 'MLmetrics'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:base':
## 
##     Recall

library(e1071)

## Warning: package 'e1071' was built under R version 4.2.2

library(recipes)

## Warning: package 'recipes' was built under R version 4.2.2

## 
## Attaching package: 'recipes'
## 
## The following object is masked from 'package:stringr':
## 
##     fixed
## 
## The following object is masked from 'package:stats':
## 
##     step

library(lime)

## Warning: package 'lime' was built under R version 4.2.2

## 
## Attaching package: 'lime'
## 
## The following object is masked from 'package:dplyr':
## 
##     explain

train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)

## Rows: 825
## Columns: 10
## $ id          <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…

glimpse(test)

## Rows: 205
## Columns: 10
## $ id          <chr> "S826", "S827", "S828", "S829", "S830", "S831", "S832", "S…
## $ cement      <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5, 23…
## $ slag        <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237.5,…
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water       <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0, 22…
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg  <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0, 93…
## $ fine_agg    <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0, 59…
## $ age         <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, 180…
## $ strength    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

train %>%  is.na() %>% colSums()

##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

test %>%  is.na() %>% colSums()

##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0         205

boxplot(train %>% select(-id))

hist(train$flyash)

hist(train$age)

(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? Yes, strength is positively correlated with age, with a correlation of about 0.3.

Do strength and cement have a strong correlation? Yes, about 0.5.

Does super_plast have a linear correlation with strength? Yes, about 0.4.

ggcorr(train[,-1], label = T)
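The correlations read off the plot can also be listed numerically. The following is a minimal sketch on made-up data (the `toy` data frame and its coefficients are assumptions, not the concrete data set); it only shows how the feature-versus-target correlations could be tabulated with base R.

```r
# Toy data standing in for the concrete data set (all values are made up).
set.seed(42)
toy <- data.frame(
  cement = runif(200, 100, 540),
  age    = sample(c(3, 7, 28, 90, 365), 200, replace = TRUE)
)
toy$strength <- 0.08 * toy$cement + 4 * log(toy$age) + rnorm(200, sd = 5)

# Pearson correlation of every feature with the target, sorted by magnitude.
feature_cols <- setdiff(names(toy), "strength")
target_cor <- sapply(toy[feature_cols], function(x) cor(x, toy$strength))
target_cor <- target_cor[order(abs(target_cor), decreasing = TRUE)]
round(target_cor, 2)
```

On the real data the same pattern would be applied to `train[,-1]` with `strength` as the target.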

(2 Points) Demonstrate and explain how to apply some data preprocessing to make sure that your data is “ready”, such as handling outliers. What data preprocessing do you do? Answer: As shown above, I checked for missing values, outliers, and the distribution of the data.

Are there any outliers? Answer: Yes, there are.

Do you need to scale the features or the target? Answer: Yes, because many of the variables are not normally distributed.

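# Drop the id column, which carries no predictive information
train_clean <- train[,-1]

# Variant with log-transformed age
train_log <- train_clean
train_log$age <- log(train_log$age)

# Remove the extreme strength values
train_no_outlier <- train_clean %>% 
  filter(strength<79.40)

# Standardize every column (mean 0, sd 1); scale() returns a matrix
train_scale <-  scale(train_no_outlier,center = TRUE,scale = TRUE)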
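A cut-off such as the one used for `strength` is often derived from the 1.5 × IQR rule behind boxplot whiskers. Whether 79.40 was obtained this way is an assumption; the sketch below uses made-up values and only illustrates how such a fence could be computed.

```r
# Made-up strength values with two injected extremes.
set.seed(1)
strength <- c(rnorm(100, mean = 35, sd = 10), 95, 99)

# Tukey-style fences: 1.5 * IQR beyond the first and third quartiles.
q <- quantile(strength, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
upper_fence <- q[[2]] + 1.5 * iqr
lower_fence <- q[[1]] - 1.5 * iqr

# Flag and drop observations outside the fences.
is_outlier <- strength > upper_fence | strength < lower_fence
strength_clean <- strength[!is_outlier]
```

`boxplot()` computes its whiskers from hinges, which can differ slightly from the `quantile()` defaults used here.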

(2 Points) Demonstrate how to prepare cross-validation data for this case. What is the proportion of the training vs testing dataset? Answer: The training set contains 80% of the data and the test set the remaining 20%, using cross-validation.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
index <- sample(nrow(train_clean), nrow(train_clean) * 0.8)
data_train <- train_clean[index,]
data_test <- train_clean[-index,]
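The split above is a single 80/20 hold-out. For comparison, a k-fold scheme could look like the sketch below, written in base R on made-up data; the `toy` data frame, the choice of `k = 5`, and the simple `lm` model are all assumptions for illustration.

```r
# Made-up data: one predictor and a noisy linear response.
set.seed(100)
n <- 50
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

# Assign each row to one of k folds at random, with equal fold sizes.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, compute MAE on the held-out fold, repeat for each fold.
fold_mae <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[fold_id != i, ])
  pred <- predict(fit, newdata = toy[fold_id == i, ])
  mean(abs(pred - toy$y[fold_id == i]))
})
mean(fold_mae)  # cross-validated MAE
```

Each observation is used for validation exactly once, so the averaged MAE depends less on one particular random split.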

(2 Points) Demonstrate how to properly do model fitting and evaluation. What model do you use? Answer: Linear regression and random forest.

How do you evaluate the model? By looking at the MAE and R-squared of each model, and by checking for multicollinearity, heteroscedasticity, normality of the residuals, and so on.

Is your model overfit?

Model 1: Regression without scaling

lm <- lm(strength ~., data_train)
model_original<- stats::step(lm, direction = "backward", trace = F)

Evaluation of the first model: MAE, R-squared, multicollinearity, heteroscedasticity, normality test

# Predictions here come from the full model `lm`, not the stepwise-selected model_original
pred_lm <- predict(object = lm, newdata = data_test)
MAE(pred_lm, data_test$strength)

## [1] 8.494832

summary(model_original)$adj.r.squared

## [1] 0.6146765

vif(model_original)

##      cement        slag      flyash       water super_plast         age 
##    1.864874    1.722825    2.251775    1.860621    2.332434    1.065414

lmtest::bptest(model_original)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_original
## BP = 99.71, df = 6, p-value < 2.2e-16

shapiro.test(model_original$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_original$residuals
## W = 0.99461, p-value = 0.01984

Model 2: Regression with scaling and square root

The training data is split into 80% train and 20% test.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
index <- sample(nrow(train_scale), nrow(train_scale) * 0.8)
data_train2 <- train_scale[index,]
data_test2 <- train_scale[-index,]

lm.scale <- lm(strength ~ .,data.frame(data_train2))
summary(lm.scale)

## 
## Call:
## lm(formula = strength ~ ., data = data.frame(data_train2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81741 -0.38551  0.03816  0.41970  1.99698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003372   0.023923  -0.141   0.8879    
## cement       0.786236   0.063934  12.298  < 2e-16 ***
## slag         0.529600   0.062454   8.480  < 2e-16 ***
## flyash       0.362403   0.058885   6.154 1.32e-09 ***
## water       -0.151167   0.060755  -2.488   0.0131 *  
## super_plast  0.107540   0.039818   2.701   0.0071 ** 
## coarse_agg   0.114383   0.051884   2.205   0.0278 *  
## fine_agg     0.120138   0.062475   1.923   0.0549 .  
## age          0.493483   0.026434  18.669  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6307 
## F-statistic: 140.8 on 8 and 647 DF,  p-value: < 2.2e-16

model_scale <- stats::step(lm.scale, direction = "backward", trace = 0)
summary(model_scale)

## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + fine_agg + age, data = data.frame(data_train2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81741 -0.38551  0.03816  0.41970  1.99698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003372   0.023923  -0.141   0.8879    
## cement       0.786236   0.063934  12.298  < 2e-16 ***
## slag         0.529600   0.062454   8.480  < 2e-16 ***
## flyash       0.362403   0.058885   6.154 1.32e-09 ***
## water       -0.151167   0.060755  -2.488   0.0131 *  
## super_plast  0.107540   0.039818   2.701   0.0071 ** 
## coarse_agg   0.114383   0.051884   2.205   0.0278 *  
## fine_agg     0.120138   0.062475   1.923   0.0549 .  
## age          0.493483   0.026434  18.669  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6307 
## F-statistic: 140.8 on 8 and 647 DF,  p-value: < 2.2e-16

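Evaluation of the second model: MAE, R-squared, multicollinearity, heteroscedasticity, normality test

This gives a far more satisfactory result than Model 1 and shows that cement, slag, flyash, and age are very strongly associated with strength. Note that this MAE is measured on the standardized strength scale, so it is not directly comparable with the MAE of Model 1, which is in the original units.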

pred_scale <- predict(object = model_scale, newdata = data.frame(data_test2))
MAE(pred_scale, data.frame(data_test2)$strength)

## [1] 0.5414662
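An MAE computed on standardized data is expressed in standard deviations of the target. To report it in the original units, predictions can be mapped back using the centring and scaling attributes that `scale()` stores on its result. The sketch below uses made-up data and a one-predictor model; both are assumptions for illustration.

```r
# Made-up data with a target in "original" units.
set.seed(7)
toy <- data.frame(x = runif(100, 0, 10))
toy$strength <- 5 + 3 * toy$x + rnorm(100, sd = 2)

# scale() returns a matrix and keeps the column means and sds as attributes.
scaled <- scale(toy)
mu    <- attr(scaled, "scaled:center")[["strength"]]
sigma <- attr(scaled, "scaled:scale")[["strength"]]

# Fit and predict on the standardized scale.
fit <- lm(strength ~ x, data = data.frame(scaled))
pred_scaled <- predict(fit, newdata = data.frame(scaled))
mae_scaled  <- mean(abs(pred_scaled - scaled[, "strength"]))

# Undo the standardization before computing the error in original units.
pred_orig    <- pred_scaled * sigma + mu
mae_original <- mean(abs(pred_orig - toy$strength))
```

Here `mae_original` equals `mae_scaled * sigma`, which is the conversion needed before comparing a scaled model with an unscaled one.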

summary(model_scale)$adj.r.squared

## [1] 0.6306556

vif(model_scale)

##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.323742    6.838624    6.229559    6.141061    2.859650    4.789047 
##    fine_agg         age 
##    6.459908    1.104924

lmtest::bptest(model_scale)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_scale
## BP = 90.358, df = 8, p-value = 3.933e-16

shapiro.test(model_scale$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_scale$residuals
## W = 0.99565, p-value = 0.06401

(4 Points) Compare multiple data preprocessing approaches. Do you need to normalize the data? Yes. Because the data are not normally distributed, the MAE obtained from the first model was very large.

Do you need to log-transform or scale the variables with square root? It is possible. Here I applied a square root, and the resulting model gave a more satisfactory MAE than the previous one.

Model 3

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(train_no_outlier), nrow(train_no_outlier) * 0.8)
data_train3 <- train_no_outlier[index,]
data_test3 <- train_no_outlier[-index,]

rf <- train[,-1]
rf <- rf %>% filter(strength<79.40)

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(417)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = train_no_outlier, method = "rf", trControl = ctrl)
saveRDS(forest, "R/model_rf.RDS") # save the model

library(randomForest)
rfm <- readRDS("R/model_rf.RDS")
rfm

## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times) 
## Summary of sample sizes: 684, 683, 684, 683, 683, 683, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     5.670121  0.9001213  4.272428
##   5     5.040950  0.9091862  3.651791
##   8     5.072445  0.9063404  3.641625
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

rfm$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 23.79246
##                     % Var explained: 91.13

varImp(rfm)

## rf variable importance
## 
##              Overall
## age         100.0000
## cement       81.7304
## water        28.2220
## super_plast  12.7195
## slag          7.3321
## fine_agg      5.9187
## coarse_agg    0.9443
## flyash        0.0000

pred_rf <- predict(object = rfm, newdata = test)
test$strength <- pred_rf

write.csv(test %>%select(id,strength), "submission-rafif.csv",row.names=FALSE)

(4 Points) Compare multiple models. Build at least 2 models or build a model then tune the parameter later. I built a linear regression, a random forest, and a random forest with tuning.

If the model is not satisfactory, what will you do to tune the model? The model is satisfactory, but I still tuned it with random search, which produced roughly the same MAE and R-squared.

Does the tuned model perform better? It is better, but the improvement over the untuned model is small.

# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(123)
metric <- "MAE"
mtry <- sqrt(ncol(train))
rf_random <- train(strength~., data=train_no_outlier, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)

## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     7.500446  0.8448508  5.984596
##   4     4.980617  0.9127849  3.605522
##   5     4.947356  0.9125650  3.550881
##   6     4.958391  0.9113483  3.537329
##   7     4.966201  0.9106978  3.541672
##   8     4.993420  0.9092716  3.544059
## 
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.

plot(rf_random)

saveRDS(rf_random, "R/model_rfrandom.RDS") # save the model

rfr <- read_rds("R/model_rfrandom.RDS")
varImp(rfr)

## rf variable importance
## 
##             Overall
## age         100.000
## cement       84.965
## water        28.907
## super_plast  14.563
## slag         10.698
## fine_agg      6.547
## coarse_agg    2.215
## flyash        0.000

pred_rftune <- predict(object = rfr, newdata = test)
test$strength <- pred_rftune

The model is not overfit

pred_train_rf <- predict(rfr, newdata = train)
rf.mod.R2 <- cor(pred_train_rf, train$strength)^2
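# Note: test$strength was overwritten with pred_rftune above, so this R-squared equals 1 by construction
rf.mod.test.R2 <- cor(pred_rftune, test$strength)^2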
fit.rf.mod <- rf.mod.R2 - rf.mod.test.R2
fit.rf.mod

## [1] -0.0233749
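Because the competition test set has no observed `strength`, a train-versus-test comparison needs a labelled hold-out set. The sketch below shows the pattern on the built-in `mtcars` data with a linear model standing in for the random forest; the data set, the model, and the helper `r2()` are assumptions for illustration only.

```r
# Squared correlation between predictions and observed values.
r2 <- function(pred, actual) cor(pred, actual)^2

# 80/20 split of a labelled data set.
set.seed(100)
idx <- sample(nrow(mtcars), floor(0.8 * nrow(mtcars)))
train_part <- mtcars[idx, ]
valid_part <- mtcars[-idx, ]

# Stand-in model; any object with a predict() method fits the same pattern.
fit <- lm(mpg ~ wt + hp, data = train_part)

# Compare fit quality on seen versus unseen rows.
r2_train <- r2(predict(fit, train_part), train_part$mpg)
r2_valid <- r2(predict(fit, valid_part), valid_part$mpg)
r2_gap   <- r2_train - r2_valid
r2_gap
```

A large positive gap suggests overfitting. In this report the analogous comparison would use `data_train3` and `data_test3`, which both carry observed `strength`.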

write.csv(test %>%select(id,strength), "submission-rafif2.csv",row.names=FALSE)

(2 Points) Use LIME method to interpret the model that you have used. Do you need to scale back the data into original value in order to be more interpretable? How many features do you use to explain the model? What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?

# Build the explainer from the training data and the tuned random forest
explainer <- lime(data_train3, rfr)
# Explain the first six hold-out observations
data_expl <- data_test3[1:6,]
explanation_caret <- explain(
  x = data_expl, 
  explainer = explainer, 
  n_permutations = 5000,
  dist_fun = "gower",
  kernel_width = .75,
  n_features = 2, 
  feature_select = "highest_weights"
  )
head(explainer)

## $model
## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     7.500446  0.8448508  5.984596
##   4     4.980617  0.9127849  3.605522
##   5     4.947356  0.9125650  3.550881
##   6     4.958391  0.9113483  3.537329
##   7     4.966201  0.9106978  3.541672
##   8     4.993420  0.9092716  3.544059
## 
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
## 
## $preprocess
## function (x) 
## x
## <bytecode: 0x000002d3cbf693e8>
## <environment: 0x000002d3cbf5a620>
## 
## $bin_continuous
## [1] TRUE
## 
## $n_bins
## [1] 4
## 
## $quantile_bins
## [1] TRUE
## 
## $use_density
## [1] TRUE

plot_features(explanation_caret)

plot_explanations(explanation_caret)

## Warning: Unknown or uninitialised column: `label`.

(2 Points) Use LIME method to interpret the model that you have used. Do you need to scale back the data into original value in order to be more interpretable? No, because LIME already helps us interpret it.

How many features do you use to explain the model? 3: age and cement have very high weights across the 6 samples above. In 4 of the samples age is positively correlated with strength, and in 2 of the samples cement is positively correlated with strength. Finally, super_plast, unlike cement, has only a low weight and is negatively correlated with strength.

What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest? Variable importance only tells us how important each predictor is to the predictions, whereas LIME lets us interpret the direction of the relationship between each predictor and the target. LIME also builds a surrogate model used purely for interpretation, so the model can be explained in much the same way as a logistic regression.
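The surrogate-model idea can be sketched in a few lines of base R. This is not the `lime` package's implementation; the `black_box()` function, the perturbation spreads, and the kernel width are hypothetical choices made only to illustrate the principle of a locally weighted linear fit.

```r
# Hypothetical black-box model: nonlinear in age, linear in cement.
black_box <- function(df) 10 * sqrt(df$age) + 0.05 * df$cement

set.seed(1)
x0 <- data.frame(age = 28, cement = 300)   # observation to explain

# 1. Perturb the observation to sample its neighbourhood.
n <- 2000
perturbed <- data.frame(
  age    = pmax(rnorm(n, mean = x0$age, sd = 5), 1),  # keep sqrt() defined
  cement = rnorm(n, mean = x0$cement, sd = 30)
)

# 2. Ask the black box for predictions on the perturbed points.
perturbed$pred <- black_box(perturbed)

# 3. Weight each point by its proximity to x0 (exponential kernel).
d <- sqrt(((perturbed$age - x0$age) / 5)^2 +
          ((perturbed$cement - x0$cement) / 30)^2)
w <- exp(-(d^2) / (0.75^2))

# 4. Fit a weighted linear model; its coefficients are the local explanation.
surrogate <- lm(pred ~ age + cement, data = perturbed, weights = w)
coef(surrogate)
```

The sign and size of each coefficient describe how the prediction changes near `x0`, which is the per-observation information that a global variable-importance ranking does not provide.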

(2 Points) Interpret the first 4 observations of the plot. What is the difference between interpreting black box model with LIME and using an interpretable machine learning model? LIME can explain black-box models of high complexity, which makes it suitable for random forest and for other models that have been trained on scaled data.

How good is the explanation fit? What does it signify? It is very helpful compared with the random forest's own model evaluation, which shows only a brief summary of the model results.

What are the most and the least important factors for each observation? From the observations above, super_plast has little influence, while age has a strong effect on strength.

(2 Points) Write the conclusion of your capstone project. Is your goal achieved? Yes, we achieved our goals: an MAE of 3.57 (below 4) and an R-squared of 91% (above 90%).

Can the problem be solved by machine learning? Yes, this problem can be solved with machine learning, for example with random forest, although overfitting can still occur because of the bias-variance trade-off.

What model did you use and how is the performance? First I used linear regression, with an MAE of about 0.5 (on the standardized scale) and an R-squared of 0.6. Random forest then reached an MAE of about 3.5 and an R-squared of 91%.

What is the potential business implementation of your capstone project? It could be used to predict a good price for manufactured goods such as houses and cars in order to protect profit margins, or to find cheaper materials of comparable quality.

Airline

2022-12-18
