library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2)
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.2.2
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
library(recipes)
## Warning: package 'recipes' was built under R version 4.2.2
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
library(lime)
## Warning: package 'lime' was built under R version 4.2.2
##
## Attaching package: 'lime'
##
## The following object is masked from 'package:dplyr':
##
## explain
train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)
## Rows: 825
## Columns: 10
## $ id <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…
glimpse(test)
## Rows: 205
## Columns: 10
## $ id <chr> "S826", "S827", "S828", "S829", "S830", "S831", "S832", "S…
## $ cement <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5, 23…
## $ slag <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237.5,…
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0, 22…
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0, 93…
## $ fine_agg <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0, 59…
## $ age <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, 180…
## $ strength <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
train %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
test %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 205
boxplot(train %>% select(-id))
hist(train$flyash)
hist(train$age)
(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? berkorelasi positif dengan age sebesat 0.3
Is strength and cement has strong correlation? iya, sebesar 0.5
Is super_plast has a linear correlation with the strength? ya, punya sebesar 0.4
ggcorr(train[,-1], label = T)
(2 Points) Demonstrate and explain how to apply some data preprocessing to make sure that your data is “ready”, such as handling outlier. What data preprocessing that you do? Answer : Data preproses yang saya lakukan contohnya diatas saya melakukan pengecekan missing value, outliers, dan distribusi data
Is there any outlier? Answer : Ya, Ada
Do you need to scale the features or the target? Answer : Perlu karena banyak data yang berdistribusi tidak normal
(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? Is strength and cement has strong correlation? Is super_plast has a linear correlation with the strength?
train_clean <- train[,-1]
train_log <- train_clean
train_log$age <- log(train_log$age)
train_no_outlier <- train_clean %>%
filter(strength<79.40)
train_scale <- scale(train_no_outlier,center = TRUE,scale = TRUE)
(2 Points) Demonstrate how to prepare cross-validation data for this case. What is the proportion of the training vs testing dataset? answer : Training akan berisi 80% data dan test sisanya, dengan menggunakan cross validation
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_clean), nrow(train_clean) * 0.8)
data_train <- train_clean[index,]
data_test <- train_clean[-index,]
(2 Points) Demonstrate how to properly do model fitting and evaluation. What model do you use? Answer : Linear Regresi dan Random Forest
How do you evaluate the model? Melihat MAE dan R-squared dari setiap model dan juga melakukan pengecekan terhadap multikoliner, heteroscedasticity, normalitas dll
Is your model overfit?
lm <- lm(strength ~., data_train)
model_original<- stats::step(lm, direction = "backward", trace = F)
model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test
pred_lm <- predict(object = lm, newdata = data_test)
MAE(pred_lm, data_test$strength)
## [1] 8.494832
summary(model_original)$adj.r.squared
## [1] 0.6146765
vif(model_original)
## cement slag flyash water super_plast age
## 1.864874 1.722825 2.251775 1.860621 2.332434 1.065414
lmtest::bptest(model_original)
##
## studentized Breusch-Pagan test
##
## data: model_original
## BP = 99.71, df = 6, p-value < 2.2e-16
shapiro.test(model_original$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_original$residuals
## W = 0.99461, p-value = 0.01984
train dibagi menjadi 80% dan test 20%
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_scale), nrow(train_scale) * 0.8)
data_train2 <- train_scale[index,]
data_test2 <- train_scale[-index,]
lm.scale <- lm(strength ~ .,data.frame(data_train2))
summary(lm.scale)
##
## Call:
## lm(formula = strength ~ ., data = data.frame(data_train2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81741 -0.38551 0.03816 0.41970 1.99698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003372 0.023923 -0.141 0.8879
## cement 0.786236 0.063934 12.298 < 2e-16 ***
## slag 0.529600 0.062454 8.480 < 2e-16 ***
## flyash 0.362403 0.058885 6.154 1.32e-09 ***
## water -0.151167 0.060755 -2.488 0.0131 *
## super_plast 0.107540 0.039818 2.701 0.0071 **
## coarse_agg 0.114383 0.051884 2.205 0.0278 *
## fine_agg 0.120138 0.062475 1.923 0.0549 .
## age 0.493483 0.026434 18.669 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6307
## F-statistic: 140.8 on 8 and 647 DF, p-value: < 2.2e-16
model_scale <- stats::step(lm.scale, direction = "backward", trace = 0)
summary(model_scale)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = data.frame(data_train2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81741 -0.38551 0.03816 0.41970 1.99698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003372 0.023923 -0.141 0.8879
## cement 0.786236 0.063934 12.298 < 2e-16 ***
## slag 0.529600 0.062454 8.480 < 2e-16 ***
## flyash 0.362403 0.058885 6.154 1.32e-09 ***
## water -0.151167 0.060755 -2.488 0.0131 *
## super_plast 0.107540 0.039818 2.701 0.0071 **
## coarse_agg 0.114383 0.051884 2.205 0.0278 *
## fine_agg 0.120138 0.062475 1.923 0.0549 .
## age 0.493483 0.026434 18.669 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6307
## F-statistic: 140.8 on 8 and 647 DF, p-value: < 2.2e-16
model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test
Memberikan hasil yang jauh lebih memuaskan dari model 1 dan menunjukan cement, slag, flyash, dan age mempunyai korelasi yang sangat tinggi terhadap stregth
pred_scale <- predict(object = model_scale, newdata = data.frame(data_test2))
MAE(pred_scale, data.frame(data_test2)$strength)
## [1] 0.5414662
summary(model_scale)$adj.r.squared
## [1] 0.6306556
vif(model_scale)
## cement slag flyash water super_plast coarse_agg
## 7.323742 6.838624 6.229559 6.141061 2.859650 4.789047
## fine_agg age
## 6.459908 1.104924
lmtest::bptest(model_scale)
##
## studentized Breusch-Pagan test
##
## data: model_scale
## BP = 90.358, df = 8, p-value = 3.933e-16
shapiro.test(model_scale$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_scale$residuals
## W = 0.99565, p-value = 0.06401
(4 Points) Compare multiple data preprocess approach. Do you need to normalize the data? Ya butuh, karena distribusi data yang tidak normal maka di model pertama MAE yang didapat sangat besar Do you need to log-transform or scale the variables with square root? Bisa, disini saya melakukan square root dan hasil model menghasilkan nilai MAE yang lebih memuaskan dari model sebelumnya
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_no_outlier), nrow(train_no_outlier) * 0.8)
data_train3 <- train_no_outlier[index,]
data_test3 <- train_no_outlier[-index,]
rf <- train[,-1]
rf <- rf %>% filter(strength<79.40)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = train_no_outlier, method = "rf", trControl = ctrl)
saveRDS(forest, "R/model_rf.RDS") # simpan model
library(randomForest)
rfm <- readRDS("R/model_rf.RDS")
rfm
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times)
## Summary of sample sizes: 684, 683, 684, 683, 683, 683, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.670121 0.9001213 4.272428
## 5 5.040950 0.9091862 3.651791
## 8 5.072445 0.9063404 3.641625
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
rfm$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 23.79246
## % Var explained: 91.13
varImp(rfm)
## rf variable importance
##
## Overall
## age 100.0000
## cement 81.7304
## water 28.2220
## super_plast 12.7195
## slag 7.3321
## fine_agg 5.9187
## coarse_agg 0.9443
## flyash 0.0000
pred_rf <- predict(object = rfm, newdata = test)
test$strength <- pred_rf
write.csv(test %>%select(id,strength), "submission-rafif.csv",row.names=FALSE)
(4 Points) Compare multiple model. Build at least 2 models or build a model then tune the parameter later. Saya membuat Regresi, Random Forest, Random Forest dengan tuner
If the model is not satisfactory, what will you do to tune the model? Memuaskan, tetapi saya tetap melakukan tuning dengan RandomSearch dan menghasilkan nilai MAE dan R-square yang kurang lebih sama nilainya
Is the tuned model perform better? Lebih baik tapi tidak memberikan impact yang jauh dari model sebelum di tune
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(123)
metric <- "MAE"
mtry <- sqrt(ncol(train))
rf_random <- train(strength~., data=train_no_outlier, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 7.500446 0.8448508 5.984596
## 4 4.980617 0.9127849 3.605522
## 5 4.947356 0.9125650 3.550881
## 6 4.958391 0.9113483 3.537329
## 7 4.966201 0.9106978 3.541672
## 8 4.993420 0.9092716 3.544059
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(rf_random)
saveRDS(rf_random, "R/model_rfrandom.RDS") # simpan model
rfr <- read_rds("R/model_rfrandom.RDS")
varImp(rfr)
## rf variable importance
##
## Overall
## age 100.000
## cement 84.965
## water 28.907
## super_plast 14.563
## slag 10.698
## fine_agg 6.547
## coarse_agg 2.215
## flyash 0.000
pred_rftune <- predict(object = rfr, newdata = test)
test$strength <- pred_rftune
Tidak Overfit
pred_train_rf <- predict(rfr, newdata = train)
rf.mod.R2 <- cor(pred_train_rf, train$strength)^2
rf.mod.test.R2 <- cor(pred_rftune, test$strength)^2
fit.rf.mod <- rf.mod.R2 - rf.mod.test.R2
fit.rf.mod
## [1] -0.0233749
write.csv(test %>%select(id,strength), "submission-rafif2.csv",row.names=FALSE)
(2 Points) Use LIME method to interpret the model that you have used. Do you need to scale back the data into original value in order to be more interpretable? How many features do you use to explain the model? What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?
explainer <- lime(data_train3, rfr)
data_expl <- data_test3[1:6,]
explanation_caret <- explain(
x = data_expl,
explainer = explainer,
n_permutations = 5000,
dist_fun = "gower",
kernel_width = .75,
n_features = 2,
feature_select = "highest_weights",
)
head(explainer)
## $model
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 7.500446 0.8448508 5.984596
## 4 4.980617 0.9127849 3.605522
## 5 4.947356 0.9125650 3.550881
## 6 4.958391 0.9113483 3.537329
## 7 4.966201 0.9106978 3.541672
## 8 4.993420 0.9092716 3.544059
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
##
## $preprocess
## function (x)
## x
## <bytecode: 0x000002d3cbf693e8>
## <environment: 0x000002d3cbf5a620>
##
## $bin_continuous
## [1] TRUE
##
## $n_bins
## [1] 4
##
## $quantile_bins
## [1] TRUE
##
## $use_density
## [1] TRUE
plot_features(explanation_caret)
plot_explanations(explanation_caret)
## Warning: Unknown or uninitialised column: `label`.
(2 Points) Use LIME method to interpret the model that you have used. Do
you need to scale back the data into original value in order to be more
interpretable? Tidak perlu karena ada lime yang sudah membantu kita
untuk mengintrepetasikannya
How many features do you use to explain the model? 3 : age dan cement memliki weight yang sangat tinggi dari 6 sample diatas dari 4 sample, age berkorelasi positif terhadap strength dan dari 2 sample. cement berkorelasi positif terhadap strength terakhir super_plas kebalikan dari cement hanya memiliki weight yang rendah dan negatif korelasi terhadap stregth
What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest? Dalam Vimp kita hanya bisa mendapatkan insight degree of importance dari prediksi nya, sedangkan dengan lime kita bisa mengintrepetasi korelasi antar variable prediktor terhadap targer nya, kemudian kita juga bisa membuat dummy model yang hanya untuk intrepetasi yang dimana nantinya model bisa dijelaskan seperti logistic regression
(2 Points) Interpret the first 4 observations of the plot. What is the difference between interpreting black box model with LIME and using an interpretable machine learning model? black box model bisa mengintrepetasikan model yang mempunyai kompleksitas yang tinggi yang dimana cocok untuk random forest dan model lainnya yang sudah di scaling
How good is the explanation fit? What does it signify? sangat membantu dibandingkan model evaluate dari random forest yang hanya menampilkan sedikir summary dari hasil modelnya
What are the most and the least important factors for each observation? dari hasil observasi diatas bisa diliat superplas tidak terlalu berpengaruh dan age mempunyai signifikansi yang tinggi terhadap stregth
(2 Points) Write the conclusion of your capstone project. Is your goal achieved? ya, kita sudah achived goals kita diperoleh nilai MAE 3.57 (kurang dari 4) dan R Square sebesar 91% (lebih dari 90%).
Is the problem can be solved by machine learning? Ya kita bisa menyelesaikan masalah ini dengan machine learning contohnya dengan random forest yang akan tetapi overfitting akan bisa terjadi akibat bias-variance nya
What model did you use and how is the performance? pertama saya menggunakan Regresi Dengan MAE sekitar 0.5 dan rsquae 0.6, kemudian di random forest tercapai MAE sekitar 3.5 dan rsquare 91
What is the potential business implementation of your capstone project? potensi nya bisa digunakan untuk mempredict harga yang bagus untuk barang manufactured, seperti rumah, mobil, dll untuk menjaga margin profit. atau untuk mencari bahan yang lebih murah dan kualitas yang tidak kalah saing