library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2)
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
library(e1071)
library(recipes)
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
library(lime)
##
## Attaching package: 'lime'
##
## The following object is masked from 'package:dplyr':
##
## explain
train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)
## Rows: 825
## Columns: 10
## $ id <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10…
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19…
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132…
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228…
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, …
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82…
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36…
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38…
glimpse(test)
## Rows: 205
## Columns: 10
## $ id <chr> "S826", "S827", "S828", "S829", "S830", "S831", "S832", "S…
## $ cement <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5, 23…
## $ slag <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237.5,…
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ water <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0, 22…
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ coarse_agg <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0, 93…
## $ fine_agg <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0, 59…
## $ age <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, 180…
## $ strength <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
train %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
test %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 205
boxplot(train %>% select(-id))
hist(train$flyash)
hist(train$age)
(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? berkorelasi positif dengan age sebesat 0.3
Is strength and cement has strong correlation? iya, sebesar 0.5
Is super_plast has a linear correlation with the strength? ya, punya sebesar 0.4
ggcorr(train[,-1], label = T)
(2 Points) Demonstrate and explain how to apply some data preprocessing to make sure that your data is “ready”, such as handling outlier. What data preprocessing that you do? Answer : Data preproses yang saya lakukan contohnya diatas saya melakukan pengecekan missing value, outliers, dan distribusi data
Is there any outlier? Answer : Ya, Ada
Do you need to scale the features or the target? Answer : Perlu karena banyak data yang berdistribusi tidak normal
(2 Points) Explore the relation between the target and the features. Is strength positively correlated with age? Is strength and cement has strong correlation? Is super_plast has a linear correlation with the strength?
train_clean <- train[,-1]
train_log <- train_clean
train_log$age <- log(train_log$age)
train_no_outlier <- train_clean %>%
filter(strength<79.40)
train_scale <- scale(train_no_outlier,center = TRUE,scale = TRUE)
(2 Points) Demonstrate how to prepare cross-validation data for this case. What is the proportion of the training vs testing dataset? answer : Training akan berisi 80% data dan test sisanya, dengan menggunakan cross validation
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_clean), nrow(train_clean) * 0.8)
data_train <- train_clean[index,]
data_test <- train_clean[-index,]
(2 Points) Demonstrate how to properly do model fitting and evaluation. What model do you use? Answer : Linear Regresi dan Random Forest
How do you evaluate the model? Melihat MAE dan R-squared dari setiap model dan juga melakukan pengecekan terhadap multikoliner, heteroscedasticity, normalitas dll
Is your model overfit?
lm <- lm(strength ~., data_train)
model_original<- stats::step(lm, direction = "backward", trace = F)
model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test
pred_lm <- predict(object = lm, newdata = data_test)
MAE(pred_lm, data_test$strength)
## [1] 8.494832
summary(model_original)$adj.r.squared
## [1] 0.6146765
vif(model_original)
## cement slag flyash water super_plast age
## 1.864874 1.722825 2.251775 1.860621 2.332434 1.065414
lmtest::bptest(model_original)
##
## studentized Breusch-Pagan test
##
## data: model_original
## BP = 99.71, df = 6, p-value < 2.2e-16
shapiro.test(model_original$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_original$residuals
## W = 0.99461, p-value = 0.01984
train dibagi menjadi 80% dan test 20%
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_scale), nrow(train_scale) * 0.8)
data_train2 <- train_scale[index,]
data_test2 <- train_scale[-index,]
lm.scale <- lm(strength ~ .,data.frame(data_train2))
summary(lm.scale)
##
## Call:
## lm(formula = strength ~ ., data = data.frame(data_train2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81741 -0.38551 0.03816 0.41970 1.99698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003372 0.023923 -0.141 0.8879
## cement 0.786236 0.063934 12.298 < 2e-16 ***
## slag 0.529600 0.062454 8.480 < 2e-16 ***
## flyash 0.362403 0.058885 6.154 1.32e-09 ***
## water -0.151167 0.060755 -2.488 0.0131 *
## super_plast 0.107540 0.039818 2.701 0.0071 **
## coarse_agg 0.114383 0.051884 2.205 0.0278 *
## fine_agg 0.120138 0.062475 1.923 0.0549 .
## age 0.493483 0.026434 18.669 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6307
## F-statistic: 140.8 on 8 and 647 DF, p-value: < 2.2e-16
model_scale <- stats::step(lm.scale, direction = "backward", trace = 0)
summary(model_scale)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = data.frame(data_train2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81741 -0.38551 0.03816 0.41970 1.99698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003372 0.023923 -0.141 0.8879
## cement 0.786236 0.063934 12.298 < 2e-16 ***
## slag 0.529600 0.062454 8.480 < 2e-16 ***
## flyash 0.362403 0.058885 6.154 1.32e-09 ***
## water -0.151167 0.060755 -2.488 0.0131 *
## super_plast 0.107540 0.039818 2.701 0.0071 **
## coarse_agg 0.114383 0.051884 2.205 0.0278 *
## fine_agg 0.120138 0.062475 1.923 0.0549 .
## age 0.493483 0.026434 18.669 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6114 on 647 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6307
## F-statistic: 140.8 on 8 and 647 DF, p-value: < 2.2e-16
model evaluate dari yang pertama : MAE R-Square multicollinearity heteroscedasticity normality test
Memberikan hasil yang jauh lebih memuaskan dari model 1 dan menunjukan cement, slag, flyash, dan age mempunyai korelasi yang sangat tinggi terhadap stregth
pred_scale <- predict(object = model_scale, newdata = data.frame(data_test2))
MAE(pred_scale, data.frame(data_test2)$strength)
## [1] 0.5414662
summary(model_scale)$adj.r.squared
## [1] 0.6306556
vif(model_scale)
## cement slag flyash water super_plast coarse_agg
## 7.323742 6.838624 6.229559 6.141061 2.859650 4.789047
## fine_agg age
## 6.459908 1.104924
lmtest::bptest(model_scale)
##
## studentized Breusch-Pagan test
##
## data: model_scale
## BP = 90.358, df = 8, p-value = 3.933e-16
shapiro.test(model_scale$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_scale$residuals
## W = 0.99565, p-value = 0.06401
(4 Points) Compare multiple data preprocess approach. Do you need to normalize the data? Ya butuh, karena distribusi data yang tidak normal maka di model pertama MAE yang didapat sangat besar Do you need to log-transform or scale the variables with square root? Bisa, disini saya melakukan square root dan hasil model menghasilkan nilai MAE yang lebih memuaskan dari model sebelumnya
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(train_no_outlier), nrow(train_no_outlier) * 0.8)
data_train3 <- train_no_outlier[index,]
data_test3 <- train_no_outlier[-index,]
rf <- train[,-1]
rf <- rf %>% filter(strength<79.40)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = train_no_outlier, method = "rf", trControl = ctrl)
saveRDS(forest, "R/model_rf.RDS") # simpan model
library(randomForest)
rfm <- readRDS("R/model_rf.RDS")
rfm
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times)
## Summary of sample sizes: 684, 683, 684, 683, 683, 683, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.670121 0.9001213 4.272428
## 5 5.040950 0.9091862 3.651791
## 8 5.072445 0.9063404 3.641625
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
rfm$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 23.79246
## % Var explained: 91.13
varImp(rfm)
## rf variable importance
##
## Overall
## age 100.0000
## cement 81.7304
## water 28.2220
## super_plast 12.7195
## slag 7.3321
## fine_agg 5.9187
## coarse_agg 0.9443
## flyash 0.0000
pred_rf <- predict(object = rfm, newdata = test)
test$strength <- pred_rf
write.csv(test %>%select(id,strength), "submission-rafif.csv",row.names=FALSE)
Compare multiple model.
Build at least 2 models or build a model then tune the parameter later. Saya membuat Regresi, Random Forest, Random Forest dengan tuner
If the model is not satisfactory, what will you do to tune the model? Satisfying, but I still performed tuning with RandomSearch and obtained MAE and R-square values that were more or less the same.
Is the tuned model perform better? Better, but it didn’t make a significant impact on the model compared to before tuning.
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(123)
metric <- "MAE"
mtry <- sqrt(ncol(train))
rf_random <- train(strength~., data=train_no_outlier, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 7.500446 0.8448508 5.984596
## 4 4.980617 0.9127849 3.605522
## 5 4.947356 0.9125650 3.550881
## 6 4.958391 0.9113483 3.537329
## 7 4.966201 0.9106978 3.541672
## 8 4.993420 0.9092716 3.544059
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(rf_random)
saveRDS(rf_random, "R/model_rfrandom.RDS") # simpan model
rfr <- read_rds("R/model_rfrandom.RDS")
varImp(rfr)
## rf variable importance
##
## Overall
## age 100.000
## cement 84.965
## water 28.907
## super_plast 14.563
## slag 10.698
## fine_agg 6.547
## coarse_agg 2.215
## flyash 0.000
pred_rftune <- predict(object = rfr, newdata = test)
test$strength <- pred_rftune
Not overfitting.
pred_train_rf <- predict(rfr, newdata = train)
rf.mod.R2 <- cor(pred_train_rf, train$strength)^2
rf.mod.test.R2 <- cor(pred_rftune, test$strength)^2
fit.rf.mod <- rf.mod.R2 - rf.mod.test.R2
fit.rf.mod
## [1] -0.0233749
write.csv(test %>%select(id,strength), "submission-rafif2.csv",row.names=FALSE)
Use LIME method to interpret the model that you have used.
Do you need to scale back the data into original value in order to be more interpretable?
How many features do you use to explain the model?
What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?
explainer <- lime(data_train3, rfr)
data_expl <- data_test3[1:6,]
explanation_caret <- explain(
x = data_expl,
explainer = explainer,
n_permutations = 5000,
dist_fun = "gower",
kernel_width = .75,
n_features = 2,
feature_select = "highest_weights",
)
head(explainer)
## $model
## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 739, 739, 738, 738, 737, 737, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 7.500446 0.8448508 5.984596
## 4 4.980617 0.9127849 3.605522
## 5 4.947356 0.9125650 3.550881
## 6 4.958391 0.9113483 3.537329
## 7 4.966201 0.9106978 3.541672
## 8 4.993420 0.9092716 3.544059
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
##
## $preprocess
## function (x)
## x
## <bytecode: 0x000001d82d04e958>
## <environment: 0x000001d82d0476e0>
##
## $bin_continuous
## [1] TRUE
##
## $n_bins
## [1] 4
##
## $quantile_bins
## [1] TRUE
##
## $use_density
## [1] TRUE
plot_features(explanation_caret)
plot_explanations(explanation_caret)
## Warning: Unknown or uninitialised column: `label`.
Use LIME method to interpret the model that you have used.
Do you need to scale back the data into original value in order to be more interpretable? No need, as Lime has already assisted us in interpreting it.
How many features do you use to explain the model? 3 : Age and cement have a very high weight in the six samples above. Among the four samples, age shows a positive correlation with strength, while in the two samples, cement also exhibits a positive correlation with strength. Finally, super_plas, unlike cement, has a low weight and a negative correlation with strength.
What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest? In Vimp, we can only obtain insights into the degree of importance of its predictions. Meanwhile, with Lime, we can interpret the correlations between predictor variables and its target. Furthermore, we can create a dummy model that is specifically for interpretation purposes. This way, the model can be explained similar to a logistic regression model.
Interpret the first 4 observations of the plot.
What is the difference between interpreting black box model with LIME and using an interpretable machine learning model? A black box model can interpret complex models, which is suitable for models like Random Forest and others that have been scaled.
How good is the explanation fit? What does it signify? It’s much more helpful compared to the evaluation of a Random Forest model, which only provides a limited summary of its results.
What are the most and the least important factors for each observation? From the observations above, it can be seen that superplasticizer (superplas) doesn’t have a significant impact, while age has a high significance on strength.
Write the conclusion of your capstone project.
Is your goal achieved?
Yes, we have achieved our goals. We obtained an MAE value of 3.57 (less than 4) and an R-squared value of 91% (more than 90%).
Is the problem can be solved by machine learning? Yes, we can solve this problem with machine learning, for example, using random forest. However, overfitting can occur due to its bias-variance.
What model did you use and how is the performance? First, I used regression with an MAE of around 0.5 and an R-squared of 0.6. Then, with random forest, we achieved an MAE of around 3.5 and an R-squared of 91.
What is the potential business implementation of your capstone project? The potential use could be in predicting optimal prices for manufactured goods, such as houses, cars, etc., to maintain profit margins or to find cost-effective materials without compromising quality.