IPB University

Syaza Abdu Razzaq

Deri Siswara

18/02/2022

1. Packages

The following packages are used in this week 5 analysis.

library(gbm)
## Loaded gbm 2.1.8
library(mlr3tuning)
## Loading required package: mlr3
## Loading required package: paradox
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
library(rpart)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x purrr::accumulate() masks foreach::accumulate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x purrr::when()       masks foreach::when()
library(mlr3verse)
library(mlr3extralearners)
## 
## Attaching package: 'mlr3extralearners'
## The following objects are masked from 'package:mlr3':
## 
##     lrn, lrns
library(precrec)
## 
## Attaching package: 'precrec'
## The following object is masked from 'package:pROC':
## 
##     auc
library(adabag)
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(ROCR)
library(ROCit)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(visdat)
library(naniar)
library(UpSetR)
## 
## Attaching package: 'UpSetR'
## The following object is masked from 'package:lattice':
## 
##     histogram
library(laeken)
library(vcd)
## Loading required package: grid
library(VIM)
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
## 
##     coords
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(sm)
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
library(ggplot2)
library(dplyr)
library(mlbench)
library(caret)
library(mlr3verse)
library(mlr3fselect)
library(DataExplorer)
library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:naniar':
## 
##     n_complete
## The following object is masked from 'package:mlr3':
## 
##     partition
library(corrplot)
## corrplot 0.92 loaded
library(leaps)
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(kableExtra) #Table display
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(agricolae)  #Assumption checking
library(lmtest)     #For assumption checking
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)        #For assumption checking
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:ROCit':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(tseries)    #For assumption checking
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-2
library(glmnetUtils)
## 
## Attaching package: 'glmnetUtils'
## The following objects are masked from 'package:glmnet':
## 
##     cv.glmnet, glmnet
library(broom)
library(ggpubr)
library(modelr)
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap
## The following object is masked from 'package:mlr3':
## 
##     resample
library(precrec)
library(adabag)
library(rpart.plot)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(caret)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ROSE)
## Loaded ROSE 0.0-4

2. Training Data

The training data is the pre-processed training set from the previous week, with one small change: the factor variables are recoded as numeric codes.

data2 <- read.csv2("D:/Magister IPB/Kuliah/Semester 2/STA582_Pembelajaran_Mesin_Statistika/Praktikum/Tugas_UTS/Problem 2/datatrainnew.csv",stringsAsFactors = TRUE)
dim(data2)
## [1] 49692    13
data2$bed_type <- ifelse(data2$bed_type == "Airbed", 1, ifelse(data2$bed_type == "Couch", 2, ifelse(data2$bed_type == "Futon", 3, ifelse(data2$bed_type == "Pull-out Sofa", 4,5))))
data2$cancellation_policy <- ifelse(data2$cancellation_policy == "flexible", 1, ifelse(data2$cancellation_policy == "moderate", 2, ifelse(data2$cancellation_policy == "strict", 3, ifelse(data2$cancellation_policy == "Psuper_strict_30", 4,5))))
data2$city <- ifelse(data2$city  == "Boston", 1, ifelse(data2$city  == "Chicago", 2, ifelse(data2$city == "DC", 3, ifelse(data2$city  == "LA", 4,ifelse(data2$city  == "NYC", 5,6)))))
data2$property_type <- ifelse(data2$property_type == "Apartment", 1, 
                       ifelse(data2$property_type == "Bed & Breakfast", 2, 
                       ifelse(data2$property_type == "Boat", 3, 
                       ifelse(data2$property_type == "Boutique hotel", 4,
                       ifelse(data2$property_type == "Bungalow", 5,       
                       ifelse(data2$property_type == "Cabin", 6,       
                       ifelse(data2$property_type == "Camper/RV", 7,       
                       ifelse(data2$property_type == "Castle", 8,        
                       ifelse(data2$property_type == "Cave", 9,        
                       ifelse(data2$property_type == "Chalet", 10,        
                       ifelse(data2$property_type == "Condominium", 11,        
                       ifelse(data2$property_type == "Dorm", 12,        
                       ifelse(data2$property_type == "Earth House", 13,        
                       ifelse(data2$property_type == "Guest suite", 14,        
                       ifelse(data2$property_type == "Guesthouse", 15,        
                       ifelse(data2$property_type == "Hostel", 16,        
                       ifelse(data2$property_type == "House", 17,            
                       ifelse(data2$property_type == "Hut", 18,          
                       ifelse(data2$property_type == "In-law", 19,        
                       ifelse(data2$property_type == "Loft", 20,        
                       ifelse(data2$property_type == "Other", 21,        
                       ifelse(data2$property_type == "Serviced apartment", 22,        
                       ifelse(data2$property_type == "Tent", 23,        
                       ifelse(data2$property_type == "Timeshare", 24,        
                       ifelse(data2$property_type == "Townhouse", 25,        
                       ifelse(data2$property_type == "Train", 26,        
                       ifelse(data2$property_type == "Treehouse", 27,        
                       ifelse(data2$property_type == "Vacation home", 28,        
                       ifelse(data2$property_type == "Villa", 29,        
                   
                       30)))))))))))))))))))))))))))))

str(data2)
## 'data.frame':    49692 obs. of  13 variables:
##  $ property_type      : num  1 1 1 1 1 1 11 17 17 1 ...
##  $ room_type          : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 2 1 1 2 2 2 ...
##  $ accommodates       : int  3 7 5 2 2 3 2 2 2 2 ...
##  $ bathrooms          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ bed_type           : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ cancellation_policy: num  3 3 2 2 3 2 2 2 2 3 ...
##  $ cleaning_fee       : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ city               : num  5 5 5 3 6 4 4 6 4 5 ...
##  $ instant_bookable   : Factor w/ 2 levels "f","t": 1 2 2 2 2 2 1 1 2 1 ...
##  $ bedrooms           : int  1 3 1 0 1 1 1 1 1 1 ...
##  $ beds               : int  1 3 3 1 1 1 1 1 1 1 ...
##  $ price              : num  5.01 5.13 4.98 4.74 4.44 ...
##  $ sum_amenities      : int  9 15 19 12 10 21 26 21 13 15 ...
anyNA(data2)
## [1] FALSE
data2$cleaning_fee<- as.factor(data2$cleaning_fee)
glimpse(data2)
## Rows: 49,692
## Columns: 13
## $ property_type       <dbl> 1, 1, 1, 1, 1, 1, 11, 17, 17, 1, 1, 1, 1, 20, 25, ~
## $ room_type           <fct> Entire home/apt, Entire home/apt, Entire home/apt,~
## $ accommodates        <int> 3, 7, 5, 2, 2, 3, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 8,~
## $ bathrooms           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,~
## $ bed_type            <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,~
## $ cancellation_policy <dbl> 3, 3, 2, 2, 3, 2, 2, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3,~
## $ cleaning_fee        <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR~
## $ city                <dbl> 5, 5, 5, 3, 6, 4, 4, 6, 4, 5, 4, 3, 4, 2, 1, 5, 5,~
## $ instant_bookable    <fct> f, t, t, t, t, t, f, f, t, f, f, t, f, f, f, f, t,~
## $ bedrooms            <int> 1, 3, 1, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3,~
## $ beds                <int> 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 3,~
## $ price               <dbl> 5.010635, 5.129899, 4.976734, 4.744932, 4.442651, ~
## $ sum_amenities       <int> 9, 15, 19, 12, 10, 21, 26, 21, 13, 15, 17, 25, 7, ~
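
The long nested ifelse() chains above are hard to read and maintain. The sketch below shows a more compact equivalent; the helper recode_to_int() is hypothetical (not part of the original script) and reproduces the same integer codes by matching against an explicit level vector, with a fallback code for unmatched levels.

# Hypothetical helper (not in the original script): map a factor/character
# column to integer codes in the order given, with a fallback for other levels.
recode_to_int <- function(x, levels, other = length(levels) + 1) {
  idx <- match(as.character(x), levels)  # position of each value in `levels`
  ifelse(is.na(idx), other, idx)         # unmatched values get the fallback code
}

# Example: the bed_type mapping above (Airbed = 1, ..., anything else = 5).
# Shown commented out because data2$bed_type has already been recoded above.
# data2$bed_type <- recode_to_int(data2$bed_type,
#                                 c("Airbed", "Couch", "Futon", "Pull-out Sofa"),
#                                 other = 5)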

3. Test Data

The test data is still the original test set, with one small modification: missing values are imputed so that predictions can be generated from the fitted model. The imputation method used is random forest.

data_test <- read.csv2("D:/Magister IPB/Kuliah/Semester 2/STA582_Pembelajaran_Mesin_Statistika/Praktikum/Tugas_UTS/Problem 2/test2.csv",stringsAsFactors = TRUE)
data_test<- data_test[,c(2,3,5,6,7,8,9,10,17,27,28,29)]
data_test$bed_type <- ifelse(data_test$bed_type == "Airbed", 1, ifelse(data_test$bed_type == "Couch", 2, ifelse(data_test$bed_type == "Futon", 3, ifelse(data_test$bed_type == "Pull-out Sofa", 4,5))))
data_test$cancellation_policy <- ifelse(data_test$cancellation_policy == "flexible", 1, ifelse(data_test$cancellation_policy == "moderate", 2, ifelse(data_test$cancellation_policy == "strict", 3, ifelse(data_test$cancellation_policy == "Psuper_strict_30", 4,5))))
data_test$city <- ifelse(data_test$city  == "Boston", 1, ifelse(data_test$city  == "Chicago", 2, ifelse(data_test$city == "DC", 3, ifelse(data_test$city  == "LA", 4,ifelse(data_test$city  == "NYC", 5,6)))))
data_test$property_type <- ifelse(data_test$property_type == "Apartment", 1, 
                       ifelse(data_test$property_type == "Bed & Breakfast", 2, 
                       ifelse(data_test$property_type == "Boat", 3, 
                       ifelse(data_test$property_type == "Boutique hotel", 4,
                       ifelse(data_test$property_type == "Bungalow", 5,       
                       ifelse(data_test$property_type == "Cabin", 6,       
                       ifelse(data_test$property_type == "Camper/RV", 7,       
                       ifelse(data_test$property_type == "Castle", 8,        
                       ifelse(data_test$property_type == "Cave", 9,        
                       ifelse(data_test$property_type == "Chalet", 10,        
                       ifelse(data_test$property_type == "Condominium", 11,        
                       ifelse(data_test$property_type == "Dorm", 12,        
                       ifelse(data_test$property_type == "Earth House", 13,        
                       ifelse(data_test$property_type == "Guest suite", 14,        
                       ifelse(data_test$property_type == "Guesthouse", 15,        
                       ifelse(data_test$property_type == "Hostel", 16,        
                       ifelse(data_test$property_type == "House", 17,            
                       ifelse(data_test$property_type == "Hut", 18,          
                       ifelse(data_test$property_type == "In-law", 19,        
                       ifelse(data_test$property_type == "Loft", 20,        
                       ifelse(data_test$property_type == "Other", 21,        
                       ifelse(data_test$property_type == "Serviced apartment", 22,        
                       ifelse(data_test$property_type == "Tent", 23,        
                       ifelse(data_test$property_type == "Timeshare", 24,        
                       ifelse(data_test$property_type == "Townhouse", 25,        
                       ifelse(data_test$property_type == "Train", 26,        
                       ifelse(data_test$property_type == "Treehouse", 27,        
                       ifelse(data_test$property_type == "Vacation home", 28,        
                       ifelse(data_test$property_type == "Villa", 29,        
                   
                       30)))))))))))))))))))))))))))))

data_test$cleaning_fee <- as.factor(data_test$cleaning_fee)
glimpse(data_test)
## Rows: 22,232
## Columns: 12
## $ property_type       <dbl> 17, 1, 1, 17, 1, 1, 1, 1, 1, 1, 17, 17, 1, 1, 1, 1~
## $ room_type           <fct> Entire home/apt, Private room, Entire home/apt, Pr~
## $ accommodates        <int> 4, 2, 4, 2, 4, 6, 4, 8, 2, 1, 2, 2, 2, 5, 6, 1, 3,~
## $ bathrooms           <dbl> 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ~
## $ bed_type            <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,~
## $ cancellation_policy <dbl> 3, 2, 2, 2, 2, 3, 3, 2, 1, 1, 3, 3, 1, 3, 3, 1, 2,~
## $ cleaning_fee        <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA~
## $ city                <dbl> 4, 2, 4, 4, 3, 5, 4, 4, 5, 5, 5, 2, 4, 5, 1, 5, 4,~
## $ instant_bookable    <fct> f, f, t, f, f, f, f, t, f, f, t, f, f, f, f, f, f,~
## $ bedrooms            <int> 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 0, 2, 0, 1,~
## $ beds                <int> 2, 1, 1, 1, 2, 3, 3, 7, 1, 1, 1, 1, 1, 2, 4, 1, 1,~
## $ sum_amenities       <int> 22, 12, 29, 16, 14, 44, 27, 15, 14, 16, 22, 20, 15~
str(data_test)
## 'data.frame':    22232 obs. of  12 variables:
##  $ property_type      : num  17 1 1 17 1 1 1 1 1 1 ...
##  $ room_type          : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 2 1 1 1 2 1 2 ...
##  $ accommodates       : int  4 2 4 2 4 6 4 8 2 1 ...
##  $ bathrooms          : num  1.5 1.5 1 1 1 1 1 1 1 1 ...
##  $ bed_type           : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ cancellation_policy: num  3 2 2 2 2 3 3 2 1 1 ...
##  $ cleaning_fee       : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 1 1 ...
##  $ city               : num  4 2 4 4 3 5 4 4 5 5 ...
##  $ instant_bookable   : Factor w/ 2 levels "f","t": 1 1 2 1 1 1 1 2 1 1 ...
##  $ bedrooms           : int  2 1 1 1 1 3 2 1 1 1 ...
##  $ beds               : int  2 1 1 1 2 3 3 7 1 1 ...
##  $ sum_amenities      : int  22 12 29 16 14 44 27 15 14 16 ...

Check for missing values

sum_mis2<-miss_var_summary(data_test)
sum_mis_plot2<-head(sum_mis2,7)
sum_mis_plot2
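
As an optional visual check, the per-variable missingness can also be plotted; this is a sketch using the naniar package loaded earlier, not part of the original script.

naniar::gg_miss_var(data_test)  # bar chart of missing-value counts per variable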

Impute the missing values

my_imptest<- mice(data_test, m = 2, method = "rf", maxit = 10, seed = 52500)
## 
##  iter imp variable
##   1   1  bathrooms  bedrooms  beds
##   1   2  bathrooms  bedrooms  beds
##   2   1  bathrooms  bedrooms  beds
##   2   2  bathrooms  bedrooms  beds
##   3   1  bathrooms  bedrooms  beds
##   3   2  bathrooms  bedrooms  beds
##   4   1  bathrooms  bedrooms  beds
##   4   2  bathrooms  bedrooms  beds
##   5   1  bathrooms  bedrooms  beds
##   5   2  bathrooms  bedrooms  beds
##   6   1  bathrooms  bedrooms  beds
##   6   2  bathrooms  bedrooms  beds
##   7   1  bathrooms  bedrooms  beds
##   7   2  bathrooms  bedrooms  beds
##   8   1  bathrooms  bedrooms  beds
##   8   2  bathrooms  bedrooms  beds
##   9   1  bathrooms  bedrooms  beds
##   9   2  bathrooms  bedrooms  beds
##   10   1  bathrooms  bedrooms  beds
##   10   2  bathrooms  bedrooms  beds
summary(my_imptest)
## Class: mids
## Number of multiple imputations:  2 
## Imputation methods:
##       property_type           room_type        accommodates           bathrooms 
##                  ""                  ""                  ""                "rf" 
##            bed_type cancellation_policy        cleaning_fee                city 
##                  ""                  ""                  ""                  "" 
##    instant_bookable            bedrooms                beds       sum_amenities 
##                  ""                "rf"                "rf"                  "" 
## PredictorMatrix:
##                     property_type room_type accommodates bathrooms bed_type
## property_type                   0         1            1         1        1
## room_type                       1         0            1         1        1
## accommodates                    1         1            0         1        1
## bathrooms                       1         1            1         0        1
## bed_type                        1         1            1         1        0
## cancellation_policy             1         1            1         1        1
##                     cancellation_policy cleaning_fee city instant_bookable
## property_type                         1            1    1                1
## room_type                             1            1    1                1
## accommodates                          1            1    1                1
## bathrooms                             1            1    1                1
## bed_type                              1            1    1                1
## cancellation_policy                   0            1    1                1
##                     bedrooms beds sum_amenities
## property_type              1    1             1
## room_type                  1    1             1
## accommodates               1    1             1
## bathrooms                  1    1             1
## bed_type                   1    1             1
## cancellation_policy        1    1             1
data_test_final = complete(my_imptest,2)

Re-check for missing values

anyNA(data_test_final)
## [1] FALSE
sum_mis3<-miss_var_summary(data_test_final)
sum_mis_plot3<-head(sum_mis3,7)
sum_mis_plot3

The missing values have been successfully handled.
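
As an optional diagnostic (a sketch using mice's built-in plotting method, not part of the original script), the imputed values can be compared against the observed distributions of the imputed variables.

# Density of observed vs. imputed values for the variables imputed above
densityplot(my_imptest, ~ bathrooms + bedrooms + beds)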

4. Exploring Data Correlations

The plots below show the relationship between each predictor variable and the response variable.

DataExplorer::plot_scatterplot(data = data2,
                               by = "price",nrow = 3,ncol = 3,geom_point_args = list(color="Steelblue"))

plot_intro(data2)
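
The plots above cover scatterplots and dataset structure; a correlation matrix of the numeric predictors could be added as well. The sketch below is an optional addition using the corrplot package loaded earlier, not part of the original analysis.

# Correlation matrix of the numeric columns, shown as a colour map
num_vars <- dplyr::select(data2, where(is.numeric))
corrplot::corrplot(cor(num_vars), method = "color", type = "upper")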

5. Modeling

In this session, the modeling method used is gradient boosting regression.

Gradient Boosting

First we build a gradient boosting model, and then test its accuracy with the predict function, using RMSE and MAE as the reference metrics.

Predicting the response on new data

We first set up the gbm() call that fits the gradient boosting regression model.

gbm <- gbm(formula = price ~ .,
             distribution = "gaussian",
             data =data2,
             n.trees = 500,  
            interaction.depth = 12,
            cv.folds = 5,
            n.cores = NULL, # will use all cores by default
            verbose = FALSE)

print(gbm)
## gbm(formula = price ~ ., distribution = "gaussian", data = data2, 
##     n.trees = 500, interaction.depth = 12, cv.folds = 5, verbose = FALSE, 
##     n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 500 iterations were performed.
## The best cross-validation iteration was 305.
## There were 12 predictors of which 12 had non-zero influence.
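
The cross-validated optimum reported above can also be retrieved and plotted directly; the sketch below uses gbm.perf() from the gbm package and is not part of the original script.

# Returns the iteration with the lowest cross-validation error and plots the error curves
best_iter <- gbm.perf(gbm, method = "cv")
best_iter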

Variable Importance

Below, the variables are ranked by their influence on the response variable, in descending order.

sqrt(min(gbm$cv.error))  # cross-validated RMSE at the best iteration
## [1] 0.3777056
summary(gbm)

Hyperparameter Tuning

To obtain the best model, hyperparameter tuning is carried out as follows.

hyper_grid <- expand.grid(
  interaction.depth = c(2, 5, 7, 9, 12),     # maximum depth of each tree
  optimal_trees = 0,                         # placeholder: filled with the best iteration by the loop below
  min_mae = 0,                               # placeholder: filled with the validation RMSE (despite the name) by the loop below
  n.minobsinnode = c(1000, 500, 100, 50, 25) # minimum number of observations in the terminal nodes of the trees
)

nrow(hyper_grid) 
## [1] 25

Next, we draw a random sample of the data to serve as new data for testing the fitted model.

# new data sample drawn from the training data
set.seed(123)
data_house_baru <- data2 %>% slice_sample(n=100)
head(data_house_baru)
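
These sampled rows can be scored with the model fitted above at its best cross-validated iteration (305, as reported earlier). The lines below are an optional sketch, not part of the original script.

# Predict the (log) price for the 100 sampled rows and compare with the observed values
pred_baru <- predict(gbm, newdata = data_house_baru, n.trees = 305)
head(data.frame(observed = data_house_baru$price, predicted = pred_baru))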

Now we loop over the hyperparameter combinations to be tuned.

for(i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = price ~ .,
    distribution = "gaussian",
    data = data2,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    train.fraction = 0.75,
    n.cores = NULL,
    verbose = FALSE
  )
  

  # record the best iteration (lowest validation error) and its RMSE
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_mae[i] <- sqrt(min(gbm.tune$valid.error))

}

hyper_grid %>% 
  dplyr::arrange(min_mae) %>%
  head(10)
min(hyper_grid$min_mae)
## [1] 0.3791272
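
Rather than reading the winning combination off the printed table, the best row can also be picked programmatically; this is a small sketch using the hyper_grid object defined above.

# Row of hyper_grid with the lowest validation RMSE (stored in min_mae)
best_row <- hyper_grid[which.min(hyper_grid$min_mae), ]
best_row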

Based on the output above, the model attains its minimum validation RMSE (stored in the min_mae column) when we use interaction.depth = 12, n.trees = 316, and n.minobsinnode = 25, as this combination gives the smallest RMSE value.

Final Model

We now refit the model using the best hyperparameters found above.

# train GBM model
gbm.fit.final <- gbm(
  formula = price ~ .,
  distribution = "gaussian",
  data = data2,
  n.trees = 316,
  interaction.depth = 12,
  n.minobsinnode = 25,
  n.cores = NULL,
  verbose = FALSE
  ) 

summary(gbm.fit.final, cBars = 10,
        method = relative.influence, las = 2)

This yields the variable importance ranking for the final model.

Predicting on the test data

Next, we predict the response variable on the test data.

gbm_pred <- predict(object=gbm.fit.final, newdata=data_test_final)
## Using 316 trees...
gbm_pred<-as.data.frame(gbm_pred)
a<-gbm_pred$gbm_pred

We then apply this to the test data prepared earlier. The predictions are back-transformed to the original price scale, since the price variable was modeled on the natural-log scale.

#Back-transform from the natural-log scale
pred_gbm <- exp(a)
pred_gbm <- as.data.frame(pred_gbm)
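
Since the test set has no observed price, RMSE and MAE (the metrics mentioned earlier) can only be illustrated here on the training data; the sketch below is an in-sample check, not a true out-of-sample evaluation, and is not part of the original script.

# In-sample fit of the final model on the log-price scale
train_pred <- predict(gbm.fit.final, newdata = data2, n.trees = 316)
rmse <- sqrt(mean((data2$price - train_pred)^2))
mae  <- mean(abs(data2$price - train_pred))
c(RMSE = rmse, MAE = mae)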

6. Conclusion

The conclusions that can be drawn from the simulation above are as follows:

  1. The best model is gradient boosting regression with the tuned hyperparameters above.
  2. This method can explain the relationship between the response and the predictors very well.
  3. Hyperparameter tuning is an essential step.
  4. The more precisely the model fits the training data, the greater the risk of overfitting on the test data.
  5. This gradient boosting regression model performs better than the random forest model.
