1. Package
Berikut ini adalah package yang dipakai dalam analisa week 5 ini.
library(mlr3tuning)## Loading required package: mlr3
## Loading required package: paradox
library(gam)## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
library(rpart)
library(pROC)## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x purrr::accumulate() masks foreach::accumulate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::when() masks foreach::when()
library(mlr3verse)
library(mlr3extralearners)##
## Attaching package: 'mlr3extralearners'
## The following objects are masked from 'package:mlr3':
##
## lrn, lrns
library(precrec)##
## Attaching package: 'precrec'
## The following object is masked from 'package:pROC':
##
## auc
library(adabag)## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(ROCR)
library(ROCit)
library(magrittr)##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(visdat)
library(naniar)
library(UpSetR)##
## Attaching package: 'UpSetR'
## The following object is masked from 'package:lattice':
##
## histogram
library(laeken)
library(vcd)## Loading required package: grid
library(VIM)## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
##
## coords
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(sm)## Package 'sm', version 2.2-5.7: type help(sm) for summary information
library(ggplot2)
library(dplyr)
library(mlbench)
library(caret)
library(mlr3verse)
library(mlr3fselect)
library(DataExplorer)
library(skimr)##
## Attaching package: 'skimr'
## The following object is masked from 'package:naniar':
##
## n_complete
## The following object is masked from 'package:mlr3':
##
## partition
library(corrplot)## corrplot 0.92 loaded
library(leaps)
library(olsrr)##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(kableExtra) #Tampilan Tabel##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(agricolae) #Pemeriksaan Asumsi
library(lmtest) #Untuk pengecekan asumsi## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) #Untuk pengecekan asumsi## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:ROCit':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(tseries) #Untuk pengecekan asumsi## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(glmnet)## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-2
library(glmnetUtils)##
## Attaching package: 'glmnetUtils'
## The following objects are masked from 'package:glmnet':
##
## cv.glmnet, glmnet
library(broom)
library(ggpubr)
library(modelr)##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
## The following object is masked from 'package:mlr3':
##
## resample
library(precrec)
library(adabag)
library(rpart.plot)
library(mice)##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(caret)
library(randomForest)## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(ROSE)## Loaded ROSE 0.0-4
2. Data Training
Data training yang dipakai adalah data training hasil pre-processing pada week sebelumnya dengan sedikit ada perubahan yaitu merubah variabel factor menjadi variabel dummy.
data2 <- read.csv2("D:/Magister IPB/Kuliah/Semester 2/STA582_Pembelajaran_Mesin_Statistika/Praktikum/Tugas_UTS/Problem 2/datatrainnew.csv",stringsAsFactors = TRUE)
dim(data2)## [1] 49692 13
data2$bed_type <- ifelse(data2$bed_type == "Airbed", 1, ifelse(data2$bed_type == "Couch", 2, ifelse(data2$bed_type == "Futon", 3, ifelse(data2$bed_type == "Pull-out Sofa", 4,5))))
data2$cancellation_policy <- ifelse(data2$cancellation_policy == "flexible", 1, ifelse(data2$cancellation_policy == "moderate", 2, ifelse(data2$cancellation_policy == "strict", 3, ifelse(data2$cancellation_policy == "Psuper_strict_30", 4,5))))
data2$city <- ifelse(data2$city == "Boston", 1, ifelse(data2$city == "Chicago", 2, ifelse(data2$city == "DC", 3, ifelse(data2$city == "LA", 4,ifelse(data2$city == "NYC", 5,6)))))
data2$property_type <- ifelse(data2$property_type == "Apartment", 1,
ifelse(data2$property_type == "Bed & Breakfast", 2,
ifelse(data2$property_type == "Boat", 3,
ifelse(data2$property_type == "Boutique hotel", 4,
ifelse(data2$property_type == "Bungalow", 5,
ifelse(data2$property_type == "Cabin", 6,
ifelse(data2$property_type == "Camper/RV", 7,
ifelse(data2$property_type == "Castle", 8,
ifelse(data2$property_type == "Cave", 9,
ifelse(data2$property_type == "Chalet", 10,
ifelse(data2$property_type == "Condominium", 11,
ifelse(data2$property_type == "Dorm", 12,
ifelse(data2$property_type == "Earth House", 13,
ifelse(data2$property_type == "Guest suite", 14,
ifelse(data2$property_type == "Guesthouse", 15,
ifelse(data2$property_type == "Hostel", 16,
ifelse(data2$property_type == "House", 17,
ifelse(data2$property_type == "Hut", 18,
ifelse(data2$property_type == "In-law", 19,
ifelse(data2$property_type == "Loft", 20,
ifelse(data2$property_type == "Other", 21,
ifelse(data2$property_type == "Serviced apartment", 22,
ifelse(data2$property_type == "Tent", 23,
ifelse(data2$property_type == "Timeshare", 24,
ifelse(data2$property_type == "Townhouse", 25,
ifelse(data2$property_type == "Train", 26,
ifelse(data2$property_type == "Treehouse", 27,
ifelse(data2$property_type == "Vacation home", 28,
ifelse(data2$property_type == "Villa", 29,
30)))))))))))))))))))))))))))))
str(data2)## 'data.frame': 49692 obs. of 13 variables:
## $ property_type : num 1 1 1 1 1 1 11 17 17 1 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 2 1 1 2 2 2 ...
## $ accommodates : int 3 7 5 2 2 3 2 2 2 2 ...
## $ bathrooms : int 1 1 1 1 1 1 1 1 1 1 ...
## $ bed_type : num 5 5 5 5 5 5 5 5 5 5 ...
## $ cancellation_policy: num 3 3 2 2 3 2 2 2 2 3 ...
## $ cleaning_fee : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ city : num 5 5 5 3 6 4 4 6 4 5 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 1 2 2 2 2 2 1 1 2 1 ...
## $ bedrooms : int 1 3 1 0 1 1 1 1 1 1 ...
## $ beds : int 1 3 3 1 1 1 1 1 1 1 ...
## $ price : num 5.01 5.13 4.98 4.74 4.44 ...
## $ sum_amenities : int 9 15 19 12 10 21 26 21 13 15 ...
anyNA(data2)## [1] FALSE
data2$cleaning_fee<- as.factor(data2$cleaning_fee)
glimpse(data2)## Rows: 49,692
## Columns: 13
## $ property_type <dbl> 1, 1, 1, 1, 1, 1, 11, 17, 17, 1, 1, 1, 1, 20, 25, ~
## $ room_type <fct> Entire home/apt, Entire home/apt, Entire home/apt,~
## $ accommodates <int> 3, 7, 5, 2, 2, 3, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 8,~
## $ bathrooms <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,~
## $ bed_type <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,~
## $ cancellation_policy <dbl> 3, 3, 2, 2, 3, 2, 2, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3,~
## $ cleaning_fee <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR~
## $ city <dbl> 5, 5, 5, 3, 6, 4, 4, 6, 4, 5, 4, 3, 4, 2, 1, 5, 5,~
## $ instant_bookable <fct> f, t, t, t, t, t, f, f, t, f, f, t, f, f, f, f, t,~
## $ bedrooms <int> 1, 3, 1, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3,~
## $ beds <int> 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 3,~
## $ price <dbl> 5.010635, 5.129899, 4.976734, 4.744932, 4.442651, ~
## $ sum_amenities <int> 9, 15, 19, 12, 10, 21, 26, 21, 13, 15, 17, 25, 7, ~
3. Data Testing
Data testing tetap menggunakan data test awal, namun ada sedikit modifkasi yaitu adanya proses imputasi data hilang agar bisa dieksekuis prediksinya dalam model yang sudah dibangun. Metode dalam imputasi menggunakan random forest.
data_test <- read.csv2("D:/Magister IPB/Kuliah/Semester 2/STA582_Pembelajaran_Mesin_Statistika/Praktikum/Tugas_UTS/Problem 2/test2.csv",stringsAsFactors = TRUE)
data_test<- data_test[,c(2,3,5,6,7,8,9,10,17,27,28,29)]
data_test$bed_type <- ifelse(data_test$bed_type == "Airbed", 1, ifelse(data_test$bed_type == "Couch", 2, ifelse(data_test$bed_type == "Futon", 3, ifelse(data_test$bed_type == "Pull-out Sofa", 4,5))))
data_test$cancellation_policy <- ifelse(data_test$cancellation_policy == "flexible", 1, ifelse(data_test$cancellation_policy == "moderate", 2, ifelse(data_test$cancellation_policy == "strict", 3, ifelse(data_test$cancellation_policy == "Psuper_strict_30", 4,5))))
data_test$city <- ifelse(data_test$city == "Boston", 1, ifelse(data_test$city == "Chicago", 2, ifelse(data_test$city == "DC", 3, ifelse(data_test$city == "LA", 4,ifelse(data_test$city == "NYC", 5,6)))))
data_test$property_type <- ifelse(data_test$property_type == "Apartment", 1,
ifelse(data_test$property_type == "Bed & Breakfast", 2,
ifelse(data_test$property_type == "Boat", 3,
ifelse(data_test$property_type == "Boutique hotel", 4,
ifelse(data_test$property_type == "Bungalow", 5,
ifelse(data_test$property_type == "Cabin", 6,
ifelse(data_test$property_type == "Camper/RV", 7,
ifelse(data_test$property_type == "Castle", 8,
ifelse(data_test$property_type == "Cave", 9,
ifelse(data_test$property_type == "Chalet", 10,
ifelse(data_test$property_type == "Condominium", 11,
ifelse(data_test$property_type == "Dorm", 12,
ifelse(data_test$property_type == "Earth House", 13,
ifelse(data_test$property_type == "Guest suite", 14,
ifelse(data_test$property_type == "Guesthouse", 15,
ifelse(data_test$property_type == "Hostel", 16,
ifelse(data_test$property_type == "House", 17,
ifelse(data_test$property_type == "Hut", 18,
ifelse(data_test$property_type == "In-law", 19,
ifelse(data_test$property_type == "Loft", 20,
ifelse(data_test$property_type == "Other", 21,
ifelse(data_test$property_type == "Serviced apartment", 22,
ifelse(data_test$property_type == "Tent", 23,
ifelse(data_test$property_type == "Timeshare", 24,
ifelse(data_test$property_type == "Townhouse", 25,
ifelse(data_test$property_type == "Train", 26,
ifelse(data_test$property_type == "Treehouse", 27,
ifelse(data_test$property_type == "Vacation home", 28,
ifelse(data_test$property_type == "Villa", 29,
30)))))))))))))))))))))))))))))
data_test$cleaning_fee <- as.factor(data_test$cleaning_fee)
glimpse(data_test)## Rows: 22,232
## Columns: 12
## $ property_type <dbl> 17, 1, 1, 17, 1, 1, 1, 1, 1, 1, 17, 17, 1, 1, 1, 1~
## $ room_type <fct> Entire home/apt, Private room, Entire home/apt, Pr~
## $ accommodates <int> 4, 2, 4, 2, 4, 6, 4, 8, 2, 1, 2, 2, 2, 5, 6, 1, 3,~
## $ bathrooms <dbl> 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ~
## $ bed_type <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,~
## $ cancellation_policy <dbl> 3, 2, 2, 2, 2, 3, 3, 2, 1, 1, 3, 3, 1, 3, 3, 1, 2,~
## $ cleaning_fee <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA~
## $ city <dbl> 4, 2, 4, 4, 3, 5, 4, 4, 5, 5, 5, 2, 4, 5, 1, 5, 4,~
## $ instant_bookable <fct> f, f, t, f, f, f, f, t, f, f, t, f, f, f, f, f, f,~
## $ bedrooms <int> 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 0, 2, 0, 1,~
## $ beds <int> 2, 1, 1, 1, 2, 3, 3, 7, 1, 1, 1, 1, 1, 2, 4, 1, 1,~
## $ sum_amenities <int> 22, 12, 29, 16, 14, 44, 27, 15, 14, 16, 22, 20, 15~
str(data_test)## 'data.frame': 22232 obs. of 12 variables:
## $ property_type : num 17 1 1 17 1 1 1 1 1 1 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 2 1 1 1 2 1 2 ...
## $ accommodates : int 4 2 4 2 4 6 4 8 2 1 ...
## $ bathrooms : num 1.5 1.5 1 1 1 1 1 1 1 1 ...
## $ bed_type : num 5 5 5 5 5 5 5 5 5 5 ...
## $ cancellation_policy: num 3 2 2 2 2 3 3 2 1 1 ...
## $ cleaning_fee : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 1 1 ...
## $ city : num 4 2 4 4 3 5 4 4 5 5 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 1 1 2 1 1 1 1 2 1 1 ...
## $ bedrooms : int 2 1 1 1 1 3 2 1 1 1 ...
## $ beds : int 2 1 1 1 2 3 3 7 1 1 ...
## $ sum_amenities : int 22 12 29 16 14 44 27 15 14 16 ...
Cek data hilang
sum_mis2<-miss_var_summary(data_test)
sum_mis_plot2<-head(sum_mis2,7)
sum_mis_plot2Imputasi data hilang
my_imptest<- mice(data_test, m = 2, method = "rf", maxit = 10, seed = 52500)##
## iter imp variable
## 1 1 bathrooms bedrooms beds
## 1 2 bathrooms bedrooms beds
## 2 1 bathrooms bedrooms beds
## 2 2 bathrooms bedrooms beds
## 3 1 bathrooms bedrooms beds
## 3 2 bathrooms bedrooms beds
## 4 1 bathrooms bedrooms beds
## 4 2 bathrooms bedrooms beds
## 5 1 bathrooms bedrooms beds
## 5 2 bathrooms bedrooms beds
## 6 1 bathrooms bedrooms beds
## 6 2 bathrooms bedrooms beds
## 7 1 bathrooms bedrooms beds
## 7 2 bathrooms bedrooms beds
## 8 1 bathrooms bedrooms beds
## 8 2 bathrooms bedrooms beds
## 9 1 bathrooms bedrooms beds
## 9 2 bathrooms bedrooms beds
## 10 1 bathrooms bedrooms beds
## 10 2 bathrooms bedrooms beds
summary(my_imptest)## Class: mids
## Number of multiple imputations: 2
## Imputation methods:
## property_type room_type accommodates bathrooms
## "" "" "" "rf"
## bed_type cancellation_policy cleaning_fee city
## "" "" "" ""
## instant_bookable bedrooms beds sum_amenities
## "" "rf" "rf" ""
## PredictorMatrix:
## property_type room_type accommodates bathrooms bed_type
## property_type 0 1 1 1 1
## room_type 1 0 1 1 1
## accommodates 1 1 0 1 1
## bathrooms 1 1 1 0 1
## bed_type 1 1 1 1 0
## cancellation_policy 1 1 1 1 1
## cancellation_policy cleaning_fee city instant_bookable
## property_type 1 1 1 1
## room_type 1 1 1 1
## accommodates 1 1 1 1
## bathrooms 1 1 1 1
## bed_type 1 1 1 1
## cancellation_policy 0 1 1 1
## bedrooms beds sum_amenities
## property_type 1 1 1
## room_type 1 1 1
## accommodates 1 1 1
## bathrooms 1 1 1
## bed_type 1 1 1
## cancellation_policy 1 1 1
data_test_final = complete(my_imptest,2)Cek kembali data hilang
anyNA(data_test_final)## [1] FALSE
sum_mis3<-miss_var_summary(data_test_final)
sum_mis_plot3<-head(sum_mis3,7)
sum_mis_plot3Data hilang sudah berhasil diselesaikan.
4. Eksplorasi Korelasi Data
Berikut ini adalah pola hubungan setiap variabel prediktor dengan variabel respon.
DataExplorer::plot_scatterplot(data = data2,
by = "price",nrow = 3,ncol = 3,geom_point_args = list(color="Steelblue"))5. Pemodelan
Pada pertemuan ini beberapa metode pemodelan yang akan dicobakan diantaranya regresi random forest dan regresi gradient boost.
Random Forest
Pertama kita akan bangun dengan menggunakan model random forest, untuk kemudian akan kita uji akurasinya dengan menggunakan fungsi predict yang mengacu pada nilai RMSE dan MAE.
Memprediksi respon pada data baru
Berikut ini kita definisikan dahulu fungsi untuk membangun regresi random forest.
task_house = TaskRegr$new(id="house",backend = data2,target = "price")Menentukan model yang digunakan
Karena menggunakan model random forest, maka fungsi lrn yang digunakan adalah regr.ranger. Berikut ini adalah list parameter pengukurand dari fungsi tsb.
as.data.table(lrn("regr.ranger")$param_set)Mengecek variabel importance
Kita juga perlu melihat variabel importance dari model yang kita buat dengan fungsi berikut.
model_rf <- lrn("regr.ranger",importance="impurity")Berikut adalah detail variabel importance.
model_rf$train(task = task_house)## Growing trees.. Progress: 82%. Estimated remaining time: 6 seconds.
model_rf$model$variable.importance## accommodates bathrooms bed_type bedrooms
## 2293.62605 785.52107 89.29425 2348.25392
## beds cancellation_policy city cleaning_fee
## 933.56353 234.57373 1033.56175 151.01545
## instant_bookable property_type room_type sum_amenities
## 125.70481 458.31736 5332.06627 785.84104
Jika dibuat ke dalam dataframe makan akan menjadi output sbb.
importance <- data.frame(Predictors = names(model_rf$model$variable.importance),
impurity = model_rf$model$variable.importance
)
rownames(importance) <- NULL
importance %>% arrange(desc(impurity))Secara terurut variabel importance digambarkan oleh tabel di atas, dimana biggest 5 nya adalah room type, bedrooms, accomodates, beds, dan city.
Mendefinisikan Tuning Hiperparameter
Selanjutnya kita akan coba membuat model dengan menggunakan validasi silang agar model yang dihasilkan tidak overfit.
param_bound_rf <- ParamSet$new(params =
list(ParamInt$new("mtry",
lower = 7,
upper = 12),
ParamInt$new("max.depth",
lower = 7,
upper = 12)
)
)Mendefinisikan Tuning Hiperparameter
Tentukan jumlah ulangannya.
terminate = trm("evals", n_evals = 8)
terminate$param_set## <ParamSet>
## id class lower upper nlevels default value
## 1: n_evals ParamInt 0 Inf Inf 100 8
## 2: k ParamInt 0 Inf Inf 0 0
Menentukan metode optimisasi
tuner <- tnr("random_search")
tuner$param_set## <ParamSet>
## id class lower upper nlevels default value
## 1: batch_size ParamInt -Inf Inf Inf 1 1
Menentukan metode resampling (inner resampling)
Definisikan jumlah cv yang diinginkan.
resample_inner = rsmp("cv", folds = 3)Menentukan metode resampling (inner resampling)
model_rf_tune <- AutoTuner$new(learner = model_rf,
measure = msr("regr.mae"),
terminator = terminate,
resampling = resample_inner,
search_space = param_bound_rf,
tuner = tuner,
store_models = TRUE
)Menentukan metode resampling (outer resampling)
resample_outer = rsmp("cv", folds = 3)
set.seed(123)
resample_outer$instantiate(task = task_house)6.Komparasi Model
model_house <- list( model_rf,
model_rf_tune
)
design <- benchmark_grid(tasks = task_house,
learners = model_house,
resamplings = resample_outer
)lgr::get_logger("bbotk")$set_threshold("warn")
bmr = benchmark(design,store_models = TRUE)## INFO [15:20:14.397] [mlr3] Running benchmark with 6 resampling iterations
## INFO [15:20:14.484] [mlr3] Applying learner 'regr.ranger.tuned' on task 'house' (iter 1/3)
## INFO [15:20:14.618] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:20:14.624] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:20:30.818] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:20:46.401] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:21:02.784] [mlr3] Finished benchmark
## INFO [15:21:03.455] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:21:03.461] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:21:19.505] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:21:36.520] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:21:51.886] [mlr3] Finished benchmark
## INFO [15:21:52.503] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:21:52.510] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:22:07.397] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:22:22.187] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:22:36.966] [mlr3] Finished benchmark
## INFO [15:22:37.194] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:22:37.200] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:22:54.553] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:23:11.844] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:23:28.791] [mlr3] Finished benchmark
## INFO [15:23:29.387] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:23:29.395] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:23:38.033] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:23:47.140] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:23:56.057] [mlr3] Finished benchmark
## INFO [15:23:56.209] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:23:56.216] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:24:15.048] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:24:32.845] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:24:49.524] [mlr3] Finished benchmark
## INFO [15:24:50.197] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:24:50.204] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:25:00.985] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:25:11.707] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:25:22.708] [mlr3] Finished benchmark
## INFO [15:25:22.866] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:25:22.874] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:25:42.444] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:26:02.257] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:26:21.768] [mlr3] Finished benchmark
## INFO [15:26:50.641] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:27:19.520] [mlr3] Applying learner 'regr.ranger.tuned' on task 'house' (iter 3/3)
## INFO [15:27:19.643] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:27:19.649] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:27:31.149] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:27:42.159] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:27:53.561] [mlr3] Finished benchmark
## INFO [15:27:53.793] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:27:53.799] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:28:06.183] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:28:18.448] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:28:30.139] [mlr3] Finished benchmark
## INFO [15:28:30.330] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:28:30.337] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:28:40.126] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:28:49.961] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:28:59.442] [mlr3] Finished benchmark
## INFO [15:28:59.614] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:28:59.621] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:29:12.469] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:29:24.543] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:29:37.098] [mlr3] Finished benchmark
## INFO [15:29:37.257] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:29:37.264] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:29:50.931] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:30:04.324] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:30:18.390] [mlr3] Finished benchmark
## INFO [15:30:18.635] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:30:18.644] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:30:28.388] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:30:37.149] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:30:46.012] [mlr3] Finished benchmark
## INFO [15:30:46.195] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:30:46.201] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:30:56.784] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:31:07.673] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:31:18.747] [mlr3] Finished benchmark
## INFO [15:31:18.925] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:31:18.933] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:31:32.773] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:31:46.543] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:32:00.815] [mlr3] Finished benchmark
## INFO [15:32:24.601] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:32:50.678] [mlr3] Applying learner 'regr.ranger.tuned' on task 'house' (iter 2/3)
## INFO [15:32:50.795] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:32:50.804] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:33:04.669] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:33:19.067] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:33:32.533] [mlr3] Finished benchmark
## INFO [15:33:32.746] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:33:32.753] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:33:41.388] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:33:50.290] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:33:59.081] [mlr3] Finished benchmark
## INFO [15:33:59.229] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:33:59.235] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:34:11.245] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:34:22.594] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:34:34.049] [mlr3] Finished benchmark
## INFO [15:34:34.651] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:34:34.657] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:34:46.018] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:34:57.249] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:35:08.307] [mlr3] Finished benchmark
## INFO [15:35:08.519] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:35:08.526] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:35:26.476] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:35:44.290] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:36:01.733] [mlr3] Finished benchmark
## INFO [15:36:02.000] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:36:02.007] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:36:10.351] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:36:19.094] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:36:28.122] [mlr3] Finished benchmark
## INFO [15:36:28.276] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:36:28.285] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:36:44.325] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:36:59.479] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:37:15.363] [mlr3] Finished benchmark
## INFO [15:37:15.576] [mlr3] Running benchmark with 3 resampling iterations
## INFO [15:37:15.582] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:37:30.180] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 1/3)
## INFO [15:37:44.831] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 2/3)
## INFO [15:37:59.670] [mlr3] Finished benchmark
## Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
## INFO [15:38:33.489] [mlr3] Applying learner 'regr.ranger' on task 'house' (iter 3/3)
## INFO [15:38:59.384] [mlr3] Finished benchmark
Hasil Komparasi model
result = bmr$aggregate(msr("regr.mae"))
resultHiperparameter Terbaik
get_param_res <- function(i){
as.data.table(bmr)$learner[[i]]$tuning_result
}Memprediksi respon pada data baru
best_rf_param =map_dfr(1:6,get_param_res)
best_rf_parambest_rf_param %>% slice_min(regr.mae)best_rf_param_value <- c(best_rf_param %>%
slice_min(regr.mae) %>%
pull(mtry),
best_rf_param %>%
slice_min(regr.mae) %>%
pull(max.depth)
)# data dummy
set.seed(123)
data_house_baru <- data2 %>% slice_sample(n=100)head(data_house_baru,6)model_rf_best <- lrn("regr.ranger",
mtry=best_rf_param_value[1],
max.depth=best_rf_param_value[2]
)
model_rf_best$train(task = task_house)## Growing trees.. Progress: 49%. Estimated remaining time: 31 seconds.
## Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
prediksi_rf_new <- model_rf_best$predict_newdata(newdata = data_house_baru)
as.data.table(prediksi_rf_new)prediksi_rf_new2 <- model_rf_best$predict_newdata(newdata = data_test_final)
hasilpred<-as.data.table(prediksi_rf_new2)
hasilpred<-as.data.frame(hasilpred)
a<-hasilpred$responseKemudian kita simulasikan dengan data test yang sebelumnya kita siapkan. Data test kita kembalikan ke bentuk asalnya karena data test yang kita gunakan sudah ditransformasi ke bentuk logaritma natural.
#Mengembalikan fungsi log natural
pred_rf<- 2.718282^(a)
pred_rf<-as.data.frame(pred_rf)7. Kesimpulan
Adapun yang dapat disimpulkan dari simulasi di atas adalah sbb:
- Model terbaik adalah regresi linier
- Metode spline dan loess tidak cocok dengan problem ini karena pada spline dan loess hanya bisa untuk variabel numerik minimum nilai unik = 4
- Metode spline dan loess akan lebih baik diterapkan pada variabel dengan skala interval atau rasio
- Untuk memprediksi dengan model loess dan spline tidak boleh ada data yang hilang
- Regresi linier lebih baik karena jumlah variabel bebas yang dimasukkan bisa lebih banyak dan fleksibel
Referensi
Breheny P. (n.d.). Getting started with grpreg. GitHub Pages. https://pbreheny.github.io/grpreg/articles/getting-started.html
Hastie T. (2013, May 9). glmnet: Lasso and elastic-net regularization in R. Revolutions. https://blog.revolutionanalytics.com/2013/05/hastie-glmnet.html
Post J. (2014, September 29). LASSO, Ridge, and Elastic Net. https://www4.stat.ncsu.edu/~post/josh/LASSO_Ridge_Elastic_Net_-_Examples.html
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91-108. https://doi.org/10.1111/j.1467-9868.2005.00490.x
Tripathy A. (2013, July 14). Regularization – Predictive Modeling Beyond Ordinary Least Squares Fit. ShatterLine Blog. https://shatterline.com/blog/2013/07/
Xiaotong C., Chen G., & Chong W. (n.d.). Statistical Learning and Data Mining Codes. Biostatistics - Academic Divisions - School of Public Health - University of Minnesota. https://www.biostat.umn.edu/~weip/course/dm/examples/