Steps

1. Read Data
2. Data Wrangling
3. Exploratory Data Analysis
4. Data Pre-processing
5. Cross Validation
6. Model Fitting
7. Model Prediction
8. Model Evaluation
9. Conclusion

Library

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
library(e1071)
library(rpart)
## Warning: package 'rpart' was built under R version 4.4.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2

We will build machine learning models to predict whether a person prefers beaches or mountains as a vacation destination, based on demographic, geographic, behavioral, and other preference attributes.

Dataset Overview

- Demographics: Age, Gender, Education Level, and Income.
- Geographic factors: Proximity to Mountains and Proximity to Beaches.
- Behavioral characteristics: Travel Frequency, Preferred Activities, and Environmental Concerns.
- Other attributes: Favorite Season, Vacation Budget, Pets (pet ownership), and Location (where the person lives).
- Target variable: Preference (binary: 0 = Beach, 1 = Mountain).

I. Read Data

data <- read.csv("lbb/mountains_vs_beaches_preferences.csv")

glimpse(data)
## Rows: 52,444
## Columns: 14
## $ Age                    <int> 56, 69, 46, 32, 60, 25, 38, 56, 36, 40, 28, 28,…
## $ Gender                 <chr> "male", "male", "female", "non-binary", "female…
## $ Income                 <int> 71477, 88740, 46562, 99044, 106583, 110588, 222…
## $ Education_Level        <chr> "bachelor", "master", "master", "high school", …
## $ Travel_Frequency       <int> 9, 1, 0, 6, 5, 3, 1, 8, 6, 1, 4, 3, 7, 9, 3, 2,…
## $ Preferred_Activities   <chr> "skiing", "swimming", "skiing", "hiking", "sunb…
## $ Vacation_Budget        <int> 2477, 4777, 1469, 1482, 516, 2895, 4994, 3656, …
## $ Location               <chr> "urban", "suburban", "urban", "rural", "suburba…
## $ Proximity_to_Mountains <int> 175, 228, 71, 31, 23, 6, 157, 210, 218, 271, 15…
## $ Proximity_to_Beaches   <int> 267, 190, 280, 255, 151, 47, 225, 166, 263, 15,…
## $ Favorite_Season        <chr> "summer", "fall", "winter", "summer", "winter",…
## $ Pets                   <int> 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,…
## $ Environmental_Concerns <int> 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,…
## $ Preference             <int> 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
summary(data)
##       Age           Gender              Income       Education_Level   
##  Min.   :18.00   Length:52444       Min.   : 20001   Length:52444      
##  1st Qu.:31.00   Class :character   1st Qu.: 45048   Class :character  
##  Median :43.00   Mode  :character   Median : 70167   Mode  :character  
##  Mean   :43.51                      Mean   : 70017                     
##  3rd Qu.:56.00                      3rd Qu.: 95109                     
##  Max.   :69.00                      Max.   :119999                     
##  Travel_Frequency Preferred_Activities Vacation_Budget   Location        
##  Min.   :0.000    Length:52444         Min.   : 500    Length:52444      
##  1st Qu.:2.000    Class :character     1st Qu.:1622    Class :character  
##  Median :4.000    Mode  :character     Median :2733    Mode  :character  
##  Mean   :4.489                         Mean   :2742                      
##  3rd Qu.:7.000                         3rd Qu.:3869                      
##  Max.   :9.000                         Max.   :4999                      
##  Proximity_to_Mountains Proximity_to_Beaches Favorite_Season   
##  Min.   :  0.0          Min.   :  0.00       Length:52444      
##  1st Qu.: 75.0          1st Qu.: 75.75       Class :character  
##  Median :150.0          Median :150.00       Mode  :character  
##  Mean   :149.9          Mean   :149.89                         
##  3rd Qu.:225.0          3rd Qu.:225.00                         
##  Max.   :299.0          Max.   :299.00                         
##       Pets        Environmental_Concerns   Preference    
##  Min.   :0.0000   Min.   :0.0000         Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000         1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000         Median :0.0000  
##  Mean   :0.5009   Mean   :0.4984         Mean   :0.2507  
##  3rd Qu.:1.0000   3rd Qu.:1.0000         3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000         Max.   :1.0000

II. Data Wrangling

# Check for missing values
colSums(is.na(data))
##                    Age                 Gender                 Income 
##                      0                      0                      0 
##        Education_Level       Travel_Frequency   Preferred_Activities 
##                      0                      0                      0 
##        Vacation_Budget               Location Proximity_to_Mountains 
##                      0                      0                      0 
##   Proximity_to_Beaches        Favorite_Season                   Pets 
##                      0                      0                      0 
## Environmental_Concerns             Preference 
##                      0                      0
# Convert character columns to factors
data$Gender <- as.factor(data$Gender)
data$Education_Level <- as.factor(data$Education_Level)
data$Preferred_Activities <- as.factor(data$Preferred_Activities)
data$Location <- as.factor(data$Location)
data$Favorite_Season <- as.factor(data$Favorite_Season)

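# Handle values treated as inconsistent in the Gender column: mark "non-binary" as NA, then drop those rows.
# Note: ifelse() strips the factor class, so Gender is stored as integer level codes from here on.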
data$Gender <- ifelse(data$Gender %in% c("non-binary"), NA, data$Gender)
data <- na.omit(data)
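One caveat about the `ifelse()` call above: base R's `ifelse()` does not preserve factor attributes, which is why Gender later shows up as a numeric split (`Gender < 1.5`) in the decision tree output. The sketch below uses a small invented data frame (`toy` is hypothetical, not the real dataset) to show the effect, along with one possible way to drop a level while keeping the column a factor. It is an illustration only, not the approach used to produce the results in this report.

```r
# Toy data for illustration only; not the real dataset.
toy <- data.frame(
  Gender = factor(c("female", "male", "non-binary", "male")),
  Age    = c(30, 41, 25, 52)
)

# ifelse() on a factor returns the underlying integer codes, not the labels.
codes <- ifelse(toy$Gender %in% c("non-binary"), NA, toy$Gender)
print(class(codes))

# Alternative sketch: filter the rows, then drop the now-unused level.
kept <- toy[toy$Gender != "non-binary", ]
kept$Gender <- droplevels(kept$Gender)
print(levels(kept$Gender))
```

With the factor kept intact, tree printouts would list Gender by label rather than by a numeric threshold.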

III. Exploratory Data Analysis

In the EDA stage, we visualize the data to understand the distribution of the variables and the relationships between them.

# Age distribution
ggplot(data, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  theme_minimal()

# Income distribution by gender
ggplot(data, aes(x = Gender, y = Income, group = Gender)) + 
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Boxplot of Income by Gender", x = "Gender", y = "Income")

# Correlation between numeric variables
correlation_matrix <- cor(select(data, Age, Income, Travel_Frequency, Vacation_Budget, Proximity_to_Mountains, Proximity_to_Beaches))
print(correlation_matrix)
##                                 Age        Income Travel_Frequency
## Age                    1.0000000000  0.0043244524      0.004435364
## Income                 0.0043244524  1.0000000000     -0.001863704
## Travel_Frequency       0.0044353645 -0.0018637036      1.000000000
## Vacation_Budget        0.0041709242  0.0005608845      0.003651239
## Proximity_to_Mountains 0.0069122360 -0.0091209252      0.012263416
## Proximity_to_Beaches   0.0002210541  0.0027544914     -0.006831843
##                        Vacation_Budget Proximity_to_Mountains
## Age                       0.0041709242           0.0069122360
## Income                    0.0005608845          -0.0091209252
## Travel_Frequency          0.0036512389           0.0122634155
## Vacation_Budget           1.0000000000           0.0001055065
## Proximity_to_Mountains    0.0001055065           1.0000000000
## Proximity_to_Beaches     -0.0044640950           0.0015289262
##                        Proximity_to_Beaches
## Age                            0.0002210541
## Income                         0.0027544914
## Travel_Frequency              -0.0068318434
## Vacation_Budget               -0.0044640950
## Proximity_to_Mountains         0.0015289262
## Proximity_to_Beaches           1.0000000000

Insight:

The correlation matrix shows no meaningful linear relationship between any pair of these numeric variables: every off-diagonal coefficient is below 0.013 in absolute value. Data with such weak linear relationships can be studied further with a Decision Tree or a Random Forest, which can capture non-linear structure.
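Reading a 6 x 6 matrix by eye is error-prone, so the strongest pairwise correlation can also be pulled out programmatically. The sketch below runs on invented random data (`toy` is a hypothetical stand-in for the numeric columns); applied to the `correlation_matrix` printed above, the same steps should point at Travel_Frequency and Proximity_to_Mountains, whose coefficient is about 0.0123.

```r
# Toy numeric data (invented); stands in for the numeric columns above.
set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))

cm <- cor(toy)

# Blank out the diagonal (always 1) before looking for the strongest pair.
off_diag <- cm
diag(off_diag) <- NA
max_abs_r <- max(abs(off_diag), na.rm = TRUE)
print(round(max_abs_r, 3))

# Locate the pair of variables with that correlation.
idx <- which(abs(off_diag) == max_abs_r, arr.ind = TRUE)[1, ]
print(c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]]))
```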

IV. Data Pre-processing

# Min-max normalize selected numeric columns
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }
data$Income <- normalize(data$Income)
data$Vacation_Budget <- normalize(data$Vacation_Budget)
data$Proximity_to_Mountains <- normalize(data$Proximity_to_Mountains)
data$Proximity_to_Beaches <- normalize(data$Proximity_to_Beaches)

# Convert the target variable (Preference) to a factor
data$Preference <- as.factor(data$Preference)

Why normalize here?

1. Consistent scale: variables such as Income and Vacation_Budget have much larger ranges than Proximity_to_Mountains.
2. Normalized data is more uniform and can be used directly in visualization or analysis without being affected by differences in scale between columns.
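A small check of what min-max scaling does and does not change may help here. The sketch below reuses the `normalize()` function from the pre-processing step on five Income values taken from the `summary()` output (minimum, quartiles, maximum). It shows that the result lies in [0, 1] and that the ordering of the values is untouched. Since tree-based models split on the ordering of a variable, the scaling is best read as a convenience for consistent scale rather than a requirement for the models fitted later.

```r
# Min-max scaling, as defined in the pre-processing step above.
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }

# Income min, quartiles, and max from the summary() output.
income <- c(20001, 45048, 70167, 95109, 119999)
scaled <- normalize(income)

# The scaled values span exactly 0 to 1.
print(range(scaled))

# The ordering of observations is unchanged by the scaling.
print(identical(order(income), order(scaled)))
```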

V. Cross Validation

set.seed(88)
train_index <- createDataPartition(data$Preference, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

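The data is split 80/20 into training and test sets (a single hold-out split rather than k-fold cross-validation). `createDataPartition()` samples within each level of Preference, so both sets keep the same class balance.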
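To make the idea of a stratified split concrete, here is a minimal base R sketch on an invented target vector with roughly the same 75/25 imbalance as Preference. It illustrates the principle of sampling within each class; it is not caret's actual implementation, and the names `y`, `train_idx`, `train_y`, and `test_y` are hypothetical.

```r
# Toy target with a 75/25 class imbalance (invented data).
set.seed(88)
y <- factor(c(rep(0, 750), rep(1, 250)))

# Sample 80% of the row indices within each class separately.
train_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class proportions should match in both partitions.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

Running the same `prop.table(table(...))` check on `train_data$Preference` and `test_data$Preference` is a quick way to confirm the real split is balanced.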

VI. Model Fitting - Random Forest

# Train the Random Forest model
set.seed(88)
rf_model <- randomForest(Preference ~ ., data = train_data, ntree = 100, mtry = 3, importance = TRUE)
rf_model
## 
## Call:
##  randomForest(formula = Preference ~ ., data = train_data, ntree = 100,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.71%
## Confusion matrix:
##       0    1 class.error
## 0 21017   59 0.002799393
## 1   141 6911 0.019994328

The Random Forest model is highly accurate, with an overall OOB error rate of 0.71%. Class 0 is predicted very well (class error of about 0.28%). Class 1 has a higher class error (about 2.0%) but is still predicted very well overall.
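The OOB figures can be reproduced by hand from the confusion matrix in the printout, which is a useful sanity check on the numbers quoted in the text. The sketch below copies the four counts from the `rf_model` output, where rows are the actual class and columns the predicted class.

```r
# OOB confusion matrix copied from the rf_model printout above
# (rows = actual class, columns = predicted class).
oob <- matrix(c(21017, 59,
                141, 6911),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: misclassified / total for that row.
class_error <- 1 - diag(oob) / rowSums(oob)
print(round(class_error, 4))

# Overall OOB error: all off-diagonal counts over all observations.
oob_error <- 1 - sum(diag(oob)) / sum(oob)
print(round(100 * oob_error, 2))  # 0.71 (percent)
```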

VII. Model Prediction - Random Forest

# Predict on the test data
rf_pred <- predict(rf_model, test_data)
summary(rf_pred)
##    0    1 
## 5283 1749

VIII. Model Evaluation & Importance - Random Forest

# Evaluate the model
conf_matrix <- confusionMatrix(rf_pred, test_data$Preference)
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5261   22
##          1    8 1741
##                                           
##                Accuracy : 0.9957          
##                  95% CI : (0.9939, 0.9971)
##     No Information Rate : 0.7493          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9886          
##                                           
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.9985          
##             Specificity : 0.9875          
##          Pos Pred Value : 0.9958          
##          Neg Pred Value : 0.9954          
##              Prevalence : 0.7493          
##          Detection Rate : 0.7482          
##    Detection Prevalence : 0.7513          
##       Balanced Accuracy : 0.9930          
##                                           
##        'Positive' Class : 0               
## 

Observations:

- True Positives (TP; predicted 0, actual 0): 5261
- False Positives (FP; predicted 0, actual 1): 22
- True Negatives (TN; predicted 1, actual 1): 1741
- False Negatives (FN; predicted 1, actual 0): 8

Metrics:

- Sensitivity (recall for Class 0): 0.9985. The model correctly identifies 99.85% of all true Class 0 instances.
- Specificity (recall for Class 1): 0.9875. The model correctly identifies 98.75% of all true Class 1 instances.
- Positive Predictive Value (precision for Class 0): 0.9958. Of all the instances predicted as Class 0, 99.58% are correct.
- Negative Predictive Value (precision for Class 1): 0.9954. Of all the instances predicted as Class 1, 99.54% are correct.
- Balanced Accuracy: 0.9930
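Each of these metrics follows directly from the four cells of the confusion matrix. The sketch below recomputes them from the counts in the `confusionMatrix()` output, with Class 0 as the positive class. The variable names (`tp`, `fp`, `fn`, `tn`) are chosen here for illustration.

```r
# Test-set counts copied from the confusionMatrix() output above,
# with class 0 treated as the positive class.
tp <- 5261  # predicted 0, actual 0
fp <- 22    # predicted 0, actual 1
fn <- 8     # predicted 1, actual 0
tn <- 1741  # predicted 1, actual 1

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
balanced    <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv,
              accuracy = accuracy, balanced = balanced), 4))
```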

Model Importance

# Show feature importance
importance(rf_model)
##                                   0            1 MeanDecreaseAccuracy
## Age                      0.67007493   0.24854142           0.66604215
## Gender                   0.03858473   1.69334281           1.06646368
## Income                  -1.14335117   0.27072580          -0.61543273
## Education_Level         -0.52712435  -0.62405365          -0.81459098
## Travel_Frequency         0.23028935   0.48524666           0.53423575
## Preferred_Activities   221.08129388 218.53746184         224.82092746
## Vacation_Budget         -0.36460989   0.99507103           0.52089868
## Location                 0.77903353   0.01064259           0.58280744
## Proximity_to_Mountains 122.09716737 116.46917185         121.43619426
## Proximity_to_Beaches   124.93282074 130.49336984         134.25898958
## Favorite_Season         -0.76145144   0.31421150          -0.34252736
## Pets                     0.09400294  -0.03876043           0.03589835
## Environmental_Concerns   0.15035504   1.33166083           1.09506000
##                        MeanDecreaseGini
## Age                           174.02486
## Gender                         32.97309
## Income                        226.21421
## Education_Level                78.76138
## Travel_Frequency              111.26719
## Preferred_Activities         4365.42265
## Vacation_Budget               224.15908
## Location                       57.33189
## Proximity_to_Mountains       2603.55922
## Proximity_to_Beaches         2551.09551
## Favorite_Season                81.53310
## Pets                           27.44636
## Environmental_Concerns         28.33470
varImpPlot(rf_model)

Most important features (high MeanDecreaseAccuracy, MDA, and MeanDecreaseGini, MDG):

- Preferred_Activities: MDA 224.82 (highest), MDG 4365.42 (highest). This variable is the strongest determinant of Preference.
- Proximity_to_Mountains: MDA 121.44, MDG 2603.56. Distance to the mountains also helps determine the preference.
- Proximity_to_Beaches: MDA 134.26, MDG 2551.10. Like distance to the mountains, distance to the beach also helps determine the preference.
- Vacation_Budget: MDA 0.52, MDG 224.16. The vacation budget has a small influence, suggesting that financial factors play a minor role in the preference.
- Income: MDA -0.62, MDG 226.21. Although its MDA is negative, its MDG shows that Income is still used to make splits within the trees, though far less than the top three features.
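Because MDA and MDG measure different things, it can help to rank features under each and see where the two disagree. The sketch below does this for a subset of features, using values copied (rounded) from the `importance()` output. The data frame `imp` and its column names are hypothetical helpers for this illustration.

```r
# MeanDecreaseAccuracy and MeanDecreaseGini for a subset of features,
# copied (rounded) from the importance() output above.
imp <- data.frame(
  feature = c("Preferred_Activities", "Proximity_to_Beaches",
              "Proximity_to_Mountains", "Vacation_Budget", "Income", "Age"),
  mda = c(224.82, 134.26, 121.44, 0.52, -0.62, 0.67),
  mdg = c(4365.42, 2551.10, 2603.56, 224.16, 226.21, 174.02)
)

# Rank the same features under each measure to see where they disagree.
imp$rank_mda <- rank(-imp$mda)
imp$rank_mdg <- rank(-imp$mdg)
print(imp[order(imp$rank_mda), ])
```

In this subset, Income ranks last by MDA but fourth by MDG, which is the disagreement described in the text.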

VI. Model Fitting - Decision Tree

# Train the Decision Tree model
dt_model <- rpart(Preference ~ ., data = train_data, method = "class")
dt_model
## n= 28128 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 28128 7052 0 (0.749288965 0.250711035)  
##     2) Preferred_Activities=sunbathing,swimming 14076    0 0 (1.000000000 0.000000000) *
##     3) Preferred_Activities=hiking,skiing 14052 7000 1 (0.498149730 0.501850270)  
##       6) Proximity_to_Beaches< 0.5301003 7325 1914 0 (0.738703072 0.261296928)  
##        12) Proximity_to_Mountains>=0.3227425 4977  274 0 (0.944946755 0.055053245)  
##          24) Proximity_to_Mountains>=0.4498328 4021   34 0 (0.991544392 0.008455608) *
##          25) Proximity_to_Mountains< 0.4498328 956  240 0 (0.748953975 0.251046025)  
##            50) Proximity_to_Beaches< 0.3795987 698   16 0 (0.977077364 0.022922636) *
##            51) Proximity_to_Beaches>=0.3795987 258   34 1 (0.131782946 0.868217054) *
##        13) Proximity_to_Mountains< 0.3227425 2348  708 1 (0.301533220 0.698466780)  
##          26) Proximity_to_Beaches< 0.1755853 790  216 0 (0.726582278 0.273417722)  
##            52) Proximity_to_Mountains>=0.09866221 550   36 0 (0.934545455 0.065454545) *
##            53) Proximity_to_Mountains< 0.09866221 240   60 1 (0.250000000 0.750000000) *
##          27) Proximity_to_Beaches>=0.1755853 1558  134 1 (0.086007702 0.913992298) *
##       7) Proximity_to_Beaches>=0.5301003 6727 1589 1 (0.236212279 0.763787721)  
##        14) Proximity_to_Mountains>=0.6906355 2049  645 0 (0.685212299 0.314787701)  
##          28) Proximity_to_Beaches< 0.8311037 1305  117 0 (0.910344828 0.089655172) *
##          29) Proximity_to_Beaches>=0.8311037 744  216 1 (0.290322581 0.709677419)  
##            58) Proximity_to_Mountains>=0.867893 347  140 0 (0.596541787 0.403458213)  
##             116) Proximity_to_Beaches< 0.9247492 191   24 0 (0.874345550 0.125654450) *
##             117) Proximity_to_Beaches>=0.9247492 156   40 1 (0.256410256 0.743589744) *
##            59) Proximity_to_Mountains< 0.867893 397    9 1 (0.022670025 0.977329975) *
##        15) Proximity_to_Mountains< 0.6906355 4678  185 1 (0.039546815 0.960453185) *
summary(dt_model)
## Call:
## rpart(formula = Preference ~ ., data = train_data, method = "class")
##   n= 28128 
## 
##           CP nsplit  rel error    xerror        xstd
## 1 0.25163074      0 1.00000000 1.0000000 0.010307860
## 2 0.13216109      2 0.49673851 0.5086500 0.007932865
## 3 0.10762904      3 0.36457742 0.3946398 0.007101025
## 4 0.05076574      4 0.25694838 0.2616279 0.005887812
## 5 0.04424277      5 0.20618264 0.2029212 0.005225999
## 6 0.01701645      6 0.16193988 0.1741350 0.004859522
## 7 0.01347136      7 0.14492343 0.1460579 0.004466893
## 8 0.01013897      9 0.11798071 0.1185479 0.004038677
## 9 0.01000000     11 0.09770278 0.1103233 0.003900199
## 
## Variable importance
##   Preferred_Activities   Proximity_to_Beaches Proximity_to_Mountains 
##                     37                     31                     30 
##        Vacation_Budget                 Income 
##                      1                      1 
## 
## Node number 1: 28128 observations,    complexity param=0.2516307
##   predicted class=0  expected loss=0.250711  P(node) =1
##     class counts: 21076  7052
##    probabilities: 0.749 0.251 
##   left son=2 (14076 obs) right son=3 (14052 obs)
##   Primary splits:
##       Preferred_Activities   splits as  RRLL, improve=3542.068000, (0 missing)
##       Proximity_to_Beaches   < 0.5301003   to the left,  improve= 924.164000, (0 missing)
##       Proximity_to_Mountains < 0.506689    to the right, improve= 900.937900, (0 missing)
##       Vacation_Budget        < 0.5008891   to the right, improve=   2.739928, (0 missing)
##       Environmental_Concerns < 0.5         to the left,  improve=   1.676078, (0 missing)
##   Surrogate splits:
##       Proximity_to_Beaches   < 0.4364548   to the left,  agree=0.509, adj=0.016, (0 split)
##       Vacation_Budget        < 0.3937542   to the right, agree=0.508, adj=0.016, (0 split)
##       Income                 < 0.5127805   to the right, agree=0.507, adj=0.013, (0 split)
##       Age                    < 39.5        to the right, agree=0.505, adj=0.010, (0 split)
##       Proximity_to_Mountains < 0.6505017   to the right, agree=0.505, adj=0.009, (0 split)
## 
## Node number 2: 14076 observations
##   predicted class=0  expected loss=0  P(node) =0.5004266
##     class counts: 14076     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 14052 observations,    complexity param=0.2516307
##   predicted class=1  expected loss=0.4981497  P(node) =0.4995734
##     class counts:  7000  7052
##    probabilities: 0.498 0.502 
##   left son=6 (7325 obs) right son=7 (6727 obs)
##   Primary splits:
##       Proximity_to_Beaches   < 0.5301003   to the left,  improve=1770.831000, (0 missing)
##       Proximity_to_Mountains < 0.5033445   to the right, improve=1761.459000, (0 missing)
##       Age                    < 66.5        to the left,  improve=   3.166996, (0 missing)
##       Income                 < 0.9986999   to the left,  improve=   3.091624, (0 missing)
##       Vacation_Budget        < 0.5037786   to the right, improve=   1.860815, (0 missing)
##   Surrogate splits:
##       Vacation_Budget        < 0.9049789   to the left,  agree=0.525, adj=0.009, (0 split)
##       Age                    < 67.5        to the left,  agree=0.523, adj=0.004, (0 split)
##       Income                 < 0.9860794   to the left,  agree=0.523, adj=0.003, (0 split)
##       Proximity_to_Mountains < 0.9882943   to the left,  agree=0.521, adj=0.000, (0 split)
## 
## Node number 6: 7325 observations,    complexity param=0.1321611
##   predicted class=0  expected loss=0.2612969  P(node) =0.2604167
##     class counts:  5411  1914
##    probabilities: 0.739 0.261 
##   left son=12 (4977 obs) right son=13 (2348 obs)
##   Primary splits:
##       Proximity_to_Mountains < 0.3227425   to the right, improve=1320.896000, (0 missing)
##       Proximity_to_Beaches   < 0.3160535   to the left,  improve= 226.821200, (0 missing)
##       Income                 < 0.001320053 to the right, improve=   2.875532, (0 missing)
##       Vacation_Budget        < 0.01878195  to the left,  improve=   2.024209, (0 missing)
##       Environmental_Concerns < 0.5         to the left,  improve=   1.600880, (0 missing)
##   Surrogate splits:
##       Income < 0.001320053 to the right, agree=0.68, adj=0.001, (0 split)
## 
## Node number 7: 6727 observations,    complexity param=0.107629
##   predicted class=1  expected loss=0.2362123  P(node) =0.2391567
##     class counts:  1589  5138
##    probabilities: 0.236 0.764 
##   left son=14 (2049 obs) right son=15 (4678 obs)
##   Primary splits:
##       Proximity_to_Mountains < 0.6906355   to the right, improve=1188.026000, (0 missing)
##       Proximity_to_Beaches   < 0.777592    to the left,  improve= 170.034100, (0 missing)
##       Vacation_Budget        < 0.6269171   to the right, improve=   1.952756, (0 missing)
##       Income                 < 0.003360134 to the left,  improve=   1.833659, (0 missing)
##       Environmental_Concerns < 0.5         to the left,  improve=   1.623661, (0 missing)
## 
## Node number 12: 4977 observations,    complexity param=0.01347136
##   predicted class=0  expected loss=0.05505324  P(node) =0.1769411
##     class counts:  4703   274
##    probabilities: 0.945 0.055 
##   left son=24 (4021 obs) right son=25 (956 obs)
##   Primary splits:
##       Proximity_to_Mountains < 0.4498328   to the right, improve=90.9079000, (0 missing)
##       Proximity_to_Beaches   < 0.4197324   to the left,  improve=61.5071700, (0 missing)
##       Vacation_Budget        < 0.941098    to the left,  improve= 0.9842867, (0 missing)
##       Income                 < 0.8675797   to the left,  improve= 0.4652594, (0 missing)
##       Travel_Frequency       < 4.5         to the right, improve= 0.4535930, (0 missing)
## 
## Node number 13: 2348 observations,    complexity param=0.05076574
##   predicted class=1  expected loss=0.3015332  P(node) =0.08347554
##     class counts:   708  1640
##    probabilities: 0.302 0.698 
##   left son=26 (790 obs) right son=27 (1558 obs)
##   Primary splits:
##       Proximity_to_Beaches   < 0.1755853   to the left,  improve=430.195500, (0 missing)
##       Proximity_to_Mountains < 0.1488294   to the right, improve=106.272400, (0 missing)
##       Gender                 < 1.5         to the left,  improve=  2.005759, (0 missing)
##       Vacation_Budget        < 0.9843299   to the right, improve=  1.971968, (0 missing)
##       Income                 < 0.002115085 to the right, improve=  1.459730, (0 missing)
##   Surrogate splits:
##       Income          < 0.001050042 to the left,  agree=0.664, adj=0.001, (0 split)
##       Vacation_Budget < 0.001111358 to the left,  agree=0.664, adj=0.001, (0 split)
## 
## Node number 14: 2049 observations,    complexity param=0.04424277
##   predicted class=0  expected loss=0.3147877  P(node) =0.07284556
##     class counts:  1404   645
##    probabilities: 0.685 0.315 
##   left son=28 (1305 obs) right son=29 (744 obs)
##   Primary splits:
##       Proximity_to_Beaches   < 0.8311037   to the left,  improve=364.322500, (0 missing)
##       Proximity_to_Mountains < 0.8478261   to the right, improve=100.330400, (0 missing)
##       Vacation_Budget        < 0.9739942   to the right, improve=  3.131489, (0 missing)
##       Age                    < 66.5        to the left,  improve=  2.230838, (0 missing)
##       Income                 < 0.005735229 to the left,  improve=  2.191775, (0 missing)
##   Surrogate splits:
##       Income < 0.9820843   to the left,  agree=0.638, adj=0.004, (0 split)
## 
## Node number 15: 4678 observations
##   predicted class=1  expected loss=0.03954681  P(node) =0.1663111
##     class counts:   185  4493
##    probabilities: 0.040 0.960 
## 
## Node number 24: 4021 observations
##   predicted class=0  expected loss=0.008455608  P(node) =0.1429536
##     class counts:  3987    34
##    probabilities: 0.992 0.008 
## 
## Node number 25: 956 observations,    complexity param=0.01347136
##   predicted class=0  expected loss=0.251046  P(node) =0.03398749
##     class counts:   716   240
##    probabilities: 0.749 0.251 
##   left son=50 (698 obs) right son=51 (258 obs)
##   Primary splits:
##       Proximity_to_Beaches   < 0.3795987   to the left,  improve=269.192700, (0 missing)
##       Proximity_to_Mountains < 0.40301     to the right, improve=  5.609206, (0 missing)
##       Vacation_Budget        < 0.9405423   to the left,  improve=  2.341175, (0 missing)
##       Favorite_Season        splits as  RRLL, improve=  1.757646, (0 missing)
##       Age                    < 64.5        to the left,  improve=  1.716359, (0 missing)
## 
## Node number 26: 790 observations,    complexity param=0.01701645
##   predicted class=0  expected loss=0.2734177  P(node) =0.02808589
##     class counts:   574   216
##    probabilities: 0.727 0.273 
##   left son=52 (550 obs) right son=53 (240 obs)
##   Primary splits:
##       Proximity_to_Mountains < 0.09866221  to the right, improve=156.596300, (0 missing)
##       Proximity_to_Beaches   < 0.09531773  to the left,  improve= 38.331850, (0 missing)
##       Travel_Frequency       < 5.5         to the left,  improve=  2.115627, (0 missing)
##       Income                 < 0.9220619   to the right, improve=  1.747214, (0 missing)
##       Pets                   < 0.5         to the right, improve=  1.312365, (0 missing)
##   Surrogate splits:
##       Income          < 0.00151006  to the right, agree=0.699, adj=0.008, (0 split)
##       Vacation_Budget < 0.004334297 to the right, agree=0.699, adj=0.008, (0 split)
## 
## Node number 27: 1558 observations
##   predicted class=1  expected loss=0.0860077  P(node) =0.05538965
##     class counts:   134  1424
##    probabilities: 0.086 0.914 
## 
## Node number 28: 1305 observations
##   predicted class=0  expected loss=0.08965517  P(node) =0.04639505
##     class counts:  1188   117
##    probabilities: 0.910 0.090 
## 
## Node number 29: 744 observations,    complexity param=0.01013897
##   predicted class=1  expected loss=0.2903226  P(node) =0.02645051
##     class counts:   216   528
##    probabilities: 0.290 0.710 
##   left son=58 (347 obs) right son=59 (397 obs)
##   Primary splits:
##       Proximity_to_Mountains < 0.867893    to the right, improve=121.957000, (0 missing)
##       Proximity_to_Beaches   < 0.9013378   to the left,  improve= 40.182940, (0 missing)
##       Vacation_Budget        < 0.9792176   to the right, improve=  3.454962, (0 missing)
##       Income                 < 0.03120625  to the left,  improve=  3.317735, (0 missing)
##       Favorite_Season        splits as  RRRL, improve=  1.893465, (0 missing)
##   Surrogate splits:
##       Income               < 0.9312422   to the right, agree=0.552, adj=0.040, (0 split)
##       Vacation_Budget      < 0.975439    to the right, agree=0.548, adj=0.032, (0 split)
##       Favorite_Season      splits as  RRRL, agree=0.544, adj=0.023, (0 split)
##       Age                  < 18.5        to the left,  agree=0.539, adj=0.012, (0 split)
##       Proximity_to_Beaches < 0.9916388   to the right, agree=0.539, adj=0.012, (0 split)
## 
## Node number 50: 698 observations
##   predicted class=0  expected loss=0.02292264  P(node) =0.02481513
##     class counts:   682    16
##    probabilities: 0.977 0.023 
## 
## Node number 51: 258 observations
##   predicted class=1  expected loss=0.1317829  P(node) =0.009172355
##     class counts:    34   224
##    probabilities: 0.132 0.868 
## 
## Node number 52: 550 observations
##   predicted class=0  expected loss=0.06545455  P(node) =0.01955347
##     class counts:   514    36
##    probabilities: 0.935 0.065 
## 
## Node number 53: 240 observations
##   predicted class=1  expected loss=0.25  P(node) =0.008532423
##     class counts:    60   180
##    probabilities: 0.250 0.750 
## 
## Node number 58: 347 observations,    complexity param=0.01013897
##   predicted class=0  expected loss=0.4034582  P(node) =0.01233646
##     class counts:   207   140
##    probabilities: 0.597 0.403 
##   left son=116 (191 obs) right son=117 (156 obs)
##   Primary splits:
##       Proximity_to_Beaches   < 0.9247492   to the left,  improve=65.575930, (0 missing)
##       Proximity_to_Mountains < 0.9347826   to the right, improve=29.210700, (0 missing)
##       Vacation_Budget        < 0.6653701   to the right, improve= 3.761286, (0 missing)
##       Favorite_Season        splits as  RLLL, improve= 3.071775, (0 missing)
##       Gender                 < 1.5         to the right, improve= 2.127182, (0 missing)
##   Surrogate splits:
##       Favorite_Season splits as  RLLL, agree=0.585, adj=0.077, (0 split)
##       Age             < 27.5        to the right, agree=0.556, adj=0.013, (0 split)
##       Income          < 0.9227769   to the left,  agree=0.556, adj=0.013, (0 split)
##       Vacation_Budget < 0.8679707   to the left,  agree=0.556, adj=0.013, (0 split)
## 
## Node number 59: 397 observations
##   predicted class=1  expected loss=0.02267003  P(node) =0.01411405
##     class counts:     9   388
##    probabilities: 0.023 0.977 
## 
## Node number 116: 191 observations
##   predicted class=0  expected loss=0.1256545  P(node) =0.006790387
##     class counts:   167    24
##    probabilities: 0.874 0.126 
## 
## Node number 117: 156 observations
##   predicted class=1  expected loss=0.2564103  P(node) =0.005546075
##     class counts:    40   116
##    probabilities: 0.256 0.744

VII. Model Prediction - Decision Tree

# Predict on the test data
dt_pred <- predict(dt_model, test_data, type = "class")
summary(dt_pred)
##    0    1 
## 5219 1813

VIII. Model Evaluation - Decision Tree

# Plot
rpart.plot(dt_model)

# Evaluate the model
conf_matrix_dt <- confusionMatrix(dt_pred, test_data$Preference)
conf_matrix_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5153   66
##          1  116 1697
##                                           
##                Accuracy : 0.9741          
##                  95% CI : (0.9701, 0.9777)
##     No Information Rate : 0.7493          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9318          
##                                           
##  Mcnemar's Test P-Value : 0.0002811       
##                                           
##             Sensitivity : 0.9780          
##             Specificity : 0.9626          
##          Pos Pred Value : 0.9874          
##          Neg Pred Value : 0.9360          
##              Prevalence : 0.7493          
##          Detection Rate : 0.7328          
##    Detection Prevalence : 0.7422          
##       Balanced Accuracy : 0.9703          
##                                           
##        'Positive' Class : 0               
## 

IX. Conclusion: Model Comparison, Random Forest vs. Decision Tree

To decide which model is better, Random Forest or Decision Tree, we compare their performance metrics from the two confusion matrices:

Key metrics:

- Accuracy: Random Forest 0.9957 (99.57%), Decision Tree 0.9741 (97.41%). Random Forest is more accurate than the Decision Tree.
- Sensitivity (recall for Class 0): Random Forest 0.9985, Decision Tree 0.9780. Random Forest has higher sensitivity, meaning it correctly identifies more instances of Class 0 (the majority class).
- Specificity (recall for Class 1): Random Forest 0.9875, Decision Tree 0.9626. Random Forest is better at correctly identifying Class 1 (the minority class).
- Kappa (agreement between predicted and actual values, corrected for chance): Random Forest 0.9886, Decision Tree 0.9318. Random Forest has the higher Kappa, meaning better agreement between predictions and actual values.
- Balanced accuracy (mean of sensitivity and specificity): Random Forest 0.9930, Decision Tree 0.9703. Random Forest performs better across both classes, which matters on an imbalanced dataset.
- McNemar's test: Random Forest p-value = 0.01762, Decision Tree p-value = 0.0002811. Both p-values are below 0.05, so both models make significantly more errors in one direction than the other. The error counts themselves are much smaller for Random Forest (22 and 8) than for the Decision Tree (66 and 116).
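Kappa and the McNemar p-value are less intuitive than accuracy, so a short recomputation may clarify what they measure. The sketch below rebuilds both test-set confusion matrices from the outputs above, computes Cohen's kappa with a small helper (`cohen_kappa` is a hypothetical name, not a caret function), and runs base R's `mcnemar.test()` with its default continuity correction.

```r
# Test-set confusion matrices copied from the confusionMatrix() outputs above
# (rows = prediction, columns = reference).
rf <- matrix(c(5261, 22, 8, 1741), nrow = 2, byrow = TRUE)
dt <- matrix(c(5153, 66, 116, 1697), nrow = 2, byrow = TRUE)

# Cohen's kappa: observed agreement corrected for chance agreement.
cohen_kappa <- function(m) {
  n  <- sum(m)
  po <- sum(diag(m)) / n                      # observed agreement
  pe <- sum(rowSums(m) * colSums(m)) / n^2    # agreement expected by chance
  (po - pe) / (1 - pe)
}

print(round(c(rf = cohen_kappa(rf), dt = cohen_kappa(dt)), 4))

# McNemar's test compares the two off-diagonal (error) counts.
print(mcnemar.test(rf)$p.value)
print(mcnemar.test(dt)$p.value)
```

Note that McNemar's test here compares the two error types within one model. It is not a direct test of whether one model outperforms the other.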

Final conclusion: The Random Forest model outperforms the Decision Tree model on all key metrics, including accuracy, sensitivity, specificity, and balanced accuracy. Random Forest is therefore the better model for predicting Preference.