Contents: Read Data, Data Wrangling, Exploratory Data Analysis, Data Pre-processing, Cross Validation, Model Fitting, Model Prediction, Model Evaluation, Conclusion
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
library(e1071)
library(rpart)
## Warning: package 'rpart' was built under R version 4.4.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
We will build a machine learning model to predict whether a person prefers the beach or the mountains as a vacation destination, based on demographic, geographic, behavioral, and other preference attributes.
Dataset overview
Demographics: Age, Gender, Education Level, and Income.
Geographic factors: Proximity to Mountains and Proximity to Beaches.
Behavioral characteristics: Travel Frequency, Preferred Activities, and Environmental Concerns.
Other attributes: Favorite Season, Vacation Budget, Pets (pet ownership), and Location (place of residence).
Target variable: Preference (binary: 0 = beach, 1 = mountains).
data <- read.csv("lbb/mountains_vs_beaches_preferences.csv")
glimpse(data)
## Rows: 52,444
## Columns: 14
## $ Age <int> 56, 69, 46, 32, 60, 25, 38, 56, 36, 40, 28, 28,…
## $ Gender <chr> "male", "male", "female", "non-binary", "female…
## $ Income <int> 71477, 88740, 46562, 99044, 106583, 110588, 222…
## $ Education_Level <chr> "bachelor", "master", "master", "high school", …
## $ Travel_Frequency <int> 9, 1, 0, 6, 5, 3, 1, 8, 6, 1, 4, 3, 7, 9, 3, 2,…
## $ Preferred_Activities <chr> "skiing", "swimming", "skiing", "hiking", "sunb…
## $ Vacation_Budget <int> 2477, 4777, 1469, 1482, 516, 2895, 4994, 3656, …
## $ Location <chr> "urban", "suburban", "urban", "rural", "suburba…
## $ Proximity_to_Mountains <int> 175, 228, 71, 31, 23, 6, 157, 210, 218, 271, 15…
## $ Proximity_to_Beaches <int> 267, 190, 280, 255, 151, 47, 225, 166, 263, 15,…
## $ Favorite_Season <chr> "summer", "fall", "winter", "summer", "winter",…
## $ Pets <int> 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,…
## $ Environmental_Concerns <int> 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,…
## $ Preference <int> 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
summary(data)
## Age Gender Income Education_Level
## Min. :18.00 Length:52444 Min. : 20001 Length:52444
## 1st Qu.:31.00 Class :character 1st Qu.: 45048 Class :character
## Median :43.00 Mode :character Median : 70167 Mode :character
## Mean :43.51 Mean : 70017
## 3rd Qu.:56.00 3rd Qu.: 95109
## Max. :69.00 Max. :119999
## Travel_Frequency Preferred_Activities Vacation_Budget Location
## Min. :0.000 Length:52444 Min. : 500 Length:52444
## 1st Qu.:2.000 Class :character 1st Qu.:1622 Class :character
## Median :4.000 Mode :character Median :2733 Mode :character
## Mean :4.489 Mean :2742
## 3rd Qu.:7.000 3rd Qu.:3869
## Max. :9.000 Max. :4999
## Proximity_to_Mountains Proximity_to_Beaches Favorite_Season
## Min. : 0.0 Min. : 0.00 Length:52444
## 1st Qu.: 75.0 1st Qu.: 75.75 Class :character
## Median :150.0 Median :150.00 Mode :character
## Mean :149.9 Mean :149.89
## 3rd Qu.:225.0 3rd Qu.:225.00
## Max. :299.0 Max. :299.00
## Pets Environmental_Concerns Preference
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.5009 Mean :0.4984 Mean :0.2507
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
# Check for missing values
colSums(is.na(data))
## Age Gender Income
## 0 0 0
## Education_Level Travel_Frequency Preferred_Activities
## 0 0 0
## Vacation_Budget Location Proximity_to_Mountains
## 0 0 0
## Proximity_to_Beaches Favorite_Season Pets
## 0 0 0
## Environmental_Concerns Preference
## 0 0
# Convert character columns to factors
data$Gender <- as.factor(data$Gender)
data$Education_Level <- as.factor(data$Education_Level)
data$Preferred_Activities <- as.factor(data$Preferred_Activities)
data$Location <- as.factor(data$Location)
data$Favorite_Season <- as.factor(data$Favorite_Season)
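# Handle the Gender column: mark "non-binary" as NA, then drop those rows.
# Note: ifelse() on a factor returns its integer codes, so Gender is stored
# as integers from here on (hence splits such as `Gender < 1.5` later).
data$Gender <- ifelse(data$Gender %in% c("non-binary"), NA, data$Gender)
data <- na.omit(data)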
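The effect of calling `ifelse()` on a factor can be seen on a small example. This is a minimal sketch on a hypothetical toy vector `g` (not the real dataset). The second half shows one possible way to drop a level while keeping the column a factor. It is an alternative for illustration, not the approach used in this analysis, and switching to it would change the downstream output.

```r
# Hypothetical toy vector standing in for data$Gender
g <- factor(c("male", "female", "non-binary", "female"))

# Same pattern as the wrangling step: the result holds integer codes, not labels
codes <- ifelse(g %in% c("non-binary"), NA, g)
print(codes)
print(class(codes))

# Alternative sketch: filter first, then drop the unused level (stays a factor)
kept <- droplevels(g[g != "non-binary"])
print(levels(kept))
```

Because factor levels are sorted alphabetically by default, the integer codes follow that order.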
In the EDA stage, we visualize the data to understand the distributions of the variables and the relationships between them.
# Age distribution
ggplot(data, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
theme_minimal()
# Income distribution by gender
ggplot(data, aes(x = Gender, y = Income, group = Gender)) +
geom_boxplot(fill = "lightgreen") +
theme_minimal() +
labs(title = "Boxplot of Income by Gender", x = "Gender", y = "Income")
# Correlation between numeric variables
correlation_matrix <- cor(select(data, Age, Income, Travel_Frequency, Vacation_Budget, Proximity_to_Mountains, Proximity_to_Beaches))
print(correlation_matrix)
## Age Income Travel_Frequency
## Age 1.0000000000 0.0043244524 0.004435364
## Income 0.0043244524 1.0000000000 -0.001863704
## Travel_Frequency 0.0044353645 -0.0018637036 1.000000000
## Vacation_Budget 0.0041709242 0.0005608845 0.003651239
## Proximity_to_Mountains 0.0069122360 -0.0091209252 0.012263416
## Proximity_to_Beaches 0.0002210541 0.0027544914 -0.006831843
## Vacation_Budget Proximity_to_Mountains
## Age 0.0041709242 0.0069122360
## Income 0.0005608845 -0.0091209252
## Travel_Frequency 0.0036512389 0.0122634155
## Vacation_Budget 1.0000000000 0.0001055065
## Proximity_to_Mountains 0.0001055065 1.0000000000
## Proximity_to_Beaches -0.0044640950 0.0015289262
## Proximity_to_Beaches
## Age 0.0002210541
## Income 0.0027544914
## Travel_Frequency -0.0068318434
## Vacation_Budget -0.0044640950
## Proximity_to_Mountains 0.0015289262
## Proximity_to_Beaches 1.0000000000
The correlation matrix shows no meaningful linear relationship between any pair of numeric variables: every off-diagonal correlation is below 0.013 in absolute value. Because the linear relationships are this weak, models that can capture non-linear structure, such as a Decision Tree or Random Forest, are a reasonable next step.
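Rather than reading the largest correlation off the printed matrix by eye, it can be extracted programmatically. The sketch below uses a hypothetical toy data frame `toy` of independent random columns as a stand-in for the numeric columns. With the real data, `correlation_matrix` would take the place of `cm`.

```r
set.seed(1)

# Hypothetical stand-in for the numeric columns (independent random noise)
toy <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
cm <- cor(toy)

# Keep each pair once by taking the upper triangle (this excludes the diagonal)
off_diag <- cm[upper.tri(cm)]
max_abs <- max(abs(off_diag))
print(max_abs)
```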
# Min-max normalize numeric columns
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }
data$Income <- normalize(data$Income)
data$Vacation_Budget <- normalize(data$Vacation_Budget)
data$Proximity_to_Mountains <- normalize(data$Proximity_to_Mountains)
data$Proximity_to_Beaches <- normalize(data$Proximity_to_Beaches)
# Convert the target variable (Preference) to a factor
data$Preference <- as.factor(data$Preference)
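Why normalize here?
1. Consistent scale: variables such as Income (20,001 to 119,999) and Vacation_Budget (500 to 4,999) span a much wider range than Proximity_to_Mountains (0 to 299).
2. Normalized data is more uniform and can be used directly in visualization or analysis without being affected by scale differences between columns.
Note that the tree-based models used below split on thresholds, so they do not depend on feature scale; the normalization is done for consistency rather than because the models require it.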
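As a quick illustration of what the `normalize()` helper does, the sketch below applies it to the five summary values of Income reported by `summary(data)` (minimum, quartiles, maximum). The vector `income` is only a small illustrative sample, not the full column.

```r
# Same min-max helper as in the pre-processing step
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }

# Min, 1st quartile, median, 3rd quartile, and max of Income from summary(data)
income <- c(20001, 45048, 70167, 95109, 119999)
scaled <- normalize(income)
print(scaled)

# The ordering of observations is unchanged, so any threshold split on the
# raw values has an equivalent threshold split on the scaled values
same_order <- identical(order(income), order(scaled))
print(same_order)
```

One caveat: a constant column would give `max(x) - min(x) == 0`, and the helper would return `NaN` for every row.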
set.seed(88)
train_index <- createDataPartition(data$Preference, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
The data is split 80/20 into training and test sets, stratified by Preference.
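For a factor outcome, `createDataPartition()` samples within each class so that the class balance is preserved in both sets. The base-R sketch below illustrates that idea on a hypothetical toy outcome `y`. It shows the technique only and is not caret's actual implementation.

```r
set.seed(88)

# Hypothetical toy outcome with a 75/25 class balance, similar to Preference
y <- factor(rep(c(0, 1), times = c(750, 250)))

# Sample 80% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

# Both sets keep the original class proportions
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```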
# Train the Random Forest model
set.seed(88)
rf_model <- randomForest(Preference ~ ., data = train_data, ntree = 100, mtry = 3, importance = TRUE)
rf_model
##
## Call:
## randomForest(formula = Preference ~ ., data = train_data, ntree = 100, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.71%
## Confusion matrix:
## 0 1 class.error
## 0 21017 59 0.002799393
## 1 141 6911 0.019994328
The Random Forest model is highly accurate, with an overall out-of-bag (OOB) error rate of 0.71%. Class 0 is predicted very well (class error of about 0.28%); class 1 has a higher class error (about 2.0%), but overall performance is still very good.
# Predict on the test data
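The OOB error rate and per-class errors can be recomputed directly from the OOB confusion matrix printed by `rf_model`. The sketch below hard-codes those four counts, with rows as the actual class and columns as the predicted class. The name `oob_conf` is illustrative only.

```r
# OOB confusion matrix as printed by rf_model (rows = actual, columns = predicted)
oob_conf <- matrix(c(21017, 59,
                     141, 6911),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall OOB error: share of off-diagonal (misclassified) cases
oob_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)

print(round(100 * oob_error, 2))
print(class_error)
```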
rf_pred <- predict(rf_model, test_data)
summary(rf_pred)
## 0 1
## 5283 1749
# Evaluate the model
conf_matrix <- confusionMatrix(rf_pred, test_data$Preference)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5261 22
## 1 8 1741
##
## Accuracy : 0.9957
## 95% CI : (0.9939, 0.9971)
## No Information Rate : 0.7493
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9886
##
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.9985
## Specificity : 0.9875
## Pos Pred Value : 0.9958
## Neg Pred Value : 0.9954
## Prevalence : 0.7493
## Detection Rate : 0.7482
## Detection Prevalence : 0.7513
## Balanced Accuracy : 0.9930
##
## 'Positive' Class : 0
##
Observations (class 0 is the positive class):
True Positives (TP) (predicted 0, actual 0): 5261
False Positives (FP) (predicted 0, actual 1): 22
True Negatives (TN) (predicted 1, actual 1): 1741
False Negatives (FN) (predicted 1, actual 0): 8
Metrics:
Sensitivity (recall for class 0): 0.9985. The model correctly identifies 99.85% of all true class 0 instances.
Specificity (recall for class 1): 0.9875. The model correctly identifies 98.75% of all true class 1 instances.
Positive Predictive Value (precision for class 0): 0.9958. Of all instances predicted as class 0, 99.58% are correct.
Negative Predictive Value (precision for class 1): 0.9954. Of all instances predicted as class 1, 99.54% are correct.
Balanced Accuracy: 0.9930
# Show feature importance
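The metrics reported by `confusionMatrix()` follow directly from the four cells of the Random Forest test-set confusion matrix. The sketch below recomputes them from the printed counts, with class 0 as the positive class. The variable names are illustrative.

```r
# Test-set confusion matrix as printed above (rows = prediction, columns = reference)
cm <- matrix(c(5261, 22,
               8, 1741),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["0", "0"]  # predicted 0, actual 0
fp <- cm["0", "1"]  # predicted 0, actual 1
fn <- cm["1", "0"]  # predicted 1, actual 0
tn <- cm["1", "1"]  # predicted 1, actual 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced = balanced), 4))
```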
importance(rf_model)
## 0 1 MeanDecreaseAccuracy
## Age 0.67007493 0.24854142 0.66604215
## Gender 0.03858473 1.69334281 1.06646368
## Income -1.14335117 0.27072580 -0.61543273
## Education_Level -0.52712435 -0.62405365 -0.81459098
## Travel_Frequency 0.23028935 0.48524666 0.53423575
## Preferred_Activities 221.08129388 218.53746184 224.82092746
## Vacation_Budget -0.36460989 0.99507103 0.52089868
## Location 0.77903353 0.01064259 0.58280744
## Proximity_to_Mountains 122.09716737 116.46917185 121.43619426
## Proximity_to_Beaches 124.93282074 130.49336984 134.25898958
## Favorite_Season -0.76145144 0.31421150 -0.34252736
## Pets 0.09400294 -0.03876043 0.03589835
## Environmental_Concerns 0.15035504 1.33166083 1.09506000
## MeanDecreaseGini
## Age 174.02486
## Gender 32.97309
## Income 226.21421
## Education_Level 78.76138
## Travel_Frequency 111.26719
## Preferred_Activities 4365.42265
## Vacation_Budget 224.15908
## Location 57.33189
## Proximity_to_Mountains 2603.55922
## Proximity_to_Beaches 2551.09551
## Favorite_Season 81.53310
## Pets 27.44636
## Environmental_Concerns 28.33470
varImpPlot(rf_model)
Most important features (MDA = MeanDecreaseAccuracy, MDG = MeanDecreaseGini):
Preferred_Activities: MDA 224.82 (highest), MDG 4365.42 (highest). This variable is the strongest determinant of Preference.
Proximity_to_Mountains: MDA 121.44, MDG 2603.56. Distance to the mountains also determines preference.
Proximity_to_Beaches: MDA 134.26, MDG 2551.10. Like distance to the mountains, distance to the beach also determines preference.
Vacation_Budget: MDA 0.52, MDG 224.16. The vacation budget has a small influence, suggesting that financial factors play only a minor role in preference.
Income: MDA -0.62, MDG 226.21. Despite the negative MDA, the MDG shows that income is still used for splits in the trees. Both values are small compared with the top three features, so this should be read as a weak signal at most.
# Train the Decision Tree model
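To compare the two importance measures side by side, the values printed by `importance(rf_model)` for the five features discussed can be placed in a small data frame and ranked. The data frame `imp` below is built by hand from that output. With the fitted model available, `importance(rf_model, type = 1)` and `importance(rf_model, type = 2)` would return the MDA and MDG columns directly.

```r
# Values copied from the importance(rf_model) output for the five discussed features
imp <- data.frame(
  feature = c("Preferred_Activities", "Proximity_to_Mountains",
              "Proximity_to_Beaches", "Vacation_Budget", "Income"),
  MDA = c(224.82092746, 121.43619426, 134.25898958, 0.52089868, -0.61543273),
  MDG = c(4365.42265, 2603.55922, 2551.09551, 224.15908, 226.21421)
)

# Rank by each measure; the two rankings need not agree
by_mda <- imp$feature[order(-imp$MDA)]
by_mdg <- imp$feature[order(-imp$MDG)]
print(by_mda)
print(by_mdg)
```

The two rankings agree on the top feature but order the two proximity variables differently, and they disagree on Vacation_Budget versus Income. This is one reason to treat small differences in importance with caution.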
dt_model <- rpart(Preference ~ ., data = train_data, method = "class")
dt_model
## n= 28128
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 28128 7052 0 (0.749288965 0.250711035)
## 2) Preferred_Activities=sunbathing,swimming 14076 0 0 (1.000000000 0.000000000) *
## 3) Preferred_Activities=hiking,skiing 14052 7000 1 (0.498149730 0.501850270)
## 6) Proximity_to_Beaches< 0.5301003 7325 1914 0 (0.738703072 0.261296928)
## 12) Proximity_to_Mountains>=0.3227425 4977 274 0 (0.944946755 0.055053245)
## 24) Proximity_to_Mountains>=0.4498328 4021 34 0 (0.991544392 0.008455608) *
## 25) Proximity_to_Mountains< 0.4498328 956 240 0 (0.748953975 0.251046025)
## 50) Proximity_to_Beaches< 0.3795987 698 16 0 (0.977077364 0.022922636) *
## 51) Proximity_to_Beaches>=0.3795987 258 34 1 (0.131782946 0.868217054) *
## 13) Proximity_to_Mountains< 0.3227425 2348 708 1 (0.301533220 0.698466780)
## 26) Proximity_to_Beaches< 0.1755853 790 216 0 (0.726582278 0.273417722)
## 52) Proximity_to_Mountains>=0.09866221 550 36 0 (0.934545455 0.065454545) *
## 53) Proximity_to_Mountains< 0.09866221 240 60 1 (0.250000000 0.750000000) *
## 27) Proximity_to_Beaches>=0.1755853 1558 134 1 (0.086007702 0.913992298) *
## 7) Proximity_to_Beaches>=0.5301003 6727 1589 1 (0.236212279 0.763787721)
## 14) Proximity_to_Mountains>=0.6906355 2049 645 0 (0.685212299 0.314787701)
## 28) Proximity_to_Beaches< 0.8311037 1305 117 0 (0.910344828 0.089655172) *
## 29) Proximity_to_Beaches>=0.8311037 744 216 1 (0.290322581 0.709677419)
## 58) Proximity_to_Mountains>=0.867893 347 140 0 (0.596541787 0.403458213)
## 116) Proximity_to_Beaches< 0.9247492 191 24 0 (0.874345550 0.125654450) *
## 117) Proximity_to_Beaches>=0.9247492 156 40 1 (0.256410256 0.743589744) *
## 59) Proximity_to_Mountains< 0.867893 397 9 1 (0.022670025 0.977329975) *
## 15) Proximity_to_Mountains< 0.6906355 4678 185 1 (0.039546815 0.960453185) *
summary(dt_model)
## Call:
## rpart(formula = Preference ~ ., data = train_data, method = "class")
## n= 28128
##
## CP nsplit rel error xerror xstd
## 1 0.25163074 0 1.00000000 1.0000000 0.010307860
## 2 0.13216109 2 0.49673851 0.5086500 0.007932865
## 3 0.10762904 3 0.36457742 0.3946398 0.007101025
## 4 0.05076574 4 0.25694838 0.2616279 0.005887812
## 5 0.04424277 5 0.20618264 0.2029212 0.005225999
## 6 0.01701645 6 0.16193988 0.1741350 0.004859522
## 7 0.01347136 7 0.14492343 0.1460579 0.004466893
## 8 0.01013897 9 0.11798071 0.1185479 0.004038677
## 9 0.01000000 11 0.09770278 0.1103233 0.003900199
##
## Variable importance
## Preferred_Activities Proximity_to_Beaches Proximity_to_Mountains
## 37 31 30
## Vacation_Budget Income
## 1 1
##
## Node number 1: 28128 observations, complexity param=0.2516307
## predicted class=0 expected loss=0.250711 P(node) =1
## class counts: 21076 7052
## probabilities: 0.749 0.251
## left son=2 (14076 obs) right son=3 (14052 obs)
## Primary splits:
## Preferred_Activities splits as RRLL, improve=3542.068000, (0 missing)
## Proximity_to_Beaches < 0.5301003 to the left, improve= 924.164000, (0 missing)
## Proximity_to_Mountains < 0.506689 to the right, improve= 900.937900, (0 missing)
## Vacation_Budget < 0.5008891 to the right, improve= 2.739928, (0 missing)
## Environmental_Concerns < 0.5 to the left, improve= 1.676078, (0 missing)
## Surrogate splits:
## Proximity_to_Beaches < 0.4364548 to the left, agree=0.509, adj=0.016, (0 split)
## Vacation_Budget < 0.3937542 to the right, agree=0.508, adj=0.016, (0 split)
## Income < 0.5127805 to the right, agree=0.507, adj=0.013, (0 split)
## Age < 39.5 to the right, agree=0.505, adj=0.010, (0 split)
## Proximity_to_Mountains < 0.6505017 to the right, agree=0.505, adj=0.009, (0 split)
##
## Node number 2: 14076 observations
## predicted class=0 expected loss=0 P(node) =0.5004266
## class counts: 14076 0
## probabilities: 1.000 0.000
##
## Node number 3: 14052 observations, complexity param=0.2516307
## predicted class=1 expected loss=0.4981497 P(node) =0.4995734
## class counts: 7000 7052
## probabilities: 0.498 0.502
## left son=6 (7325 obs) right son=7 (6727 obs)
## Primary splits:
## Proximity_to_Beaches < 0.5301003 to the left, improve=1770.831000, (0 missing)
## Proximity_to_Mountains < 0.5033445 to the right, improve=1761.459000, (0 missing)
## Age < 66.5 to the left, improve= 3.166996, (0 missing)
## Income < 0.9986999 to the left, improve= 3.091624, (0 missing)
## Vacation_Budget < 0.5037786 to the right, improve= 1.860815, (0 missing)
## Surrogate splits:
## Vacation_Budget < 0.9049789 to the left, agree=0.525, adj=0.009, (0 split)
## Age < 67.5 to the left, agree=0.523, adj=0.004, (0 split)
## Income < 0.9860794 to the left, agree=0.523, adj=0.003, (0 split)
## Proximity_to_Mountains < 0.9882943 to the left, agree=0.521, adj=0.000, (0 split)
##
## Node number 6: 7325 observations, complexity param=0.1321611
## predicted class=0 expected loss=0.2612969 P(node) =0.2604167
## class counts: 5411 1914
## probabilities: 0.739 0.261
## left son=12 (4977 obs) right son=13 (2348 obs)
## Primary splits:
## Proximity_to_Mountains < 0.3227425 to the right, improve=1320.896000, (0 missing)
## Proximity_to_Beaches < 0.3160535 to the left, improve= 226.821200, (0 missing)
## Income < 0.001320053 to the right, improve= 2.875532, (0 missing)
## Vacation_Budget < 0.01878195 to the left, improve= 2.024209, (0 missing)
## Environmental_Concerns < 0.5 to the left, improve= 1.600880, (0 missing)
## Surrogate splits:
## Income < 0.001320053 to the right, agree=0.68, adj=0.001, (0 split)
##
## Node number 7: 6727 observations, complexity param=0.107629
## predicted class=1 expected loss=0.2362123 P(node) =0.2391567
## class counts: 1589 5138
## probabilities: 0.236 0.764
## left son=14 (2049 obs) right son=15 (4678 obs)
## Primary splits:
## Proximity_to_Mountains < 0.6906355 to the right, improve=1188.026000, (0 missing)
## Proximity_to_Beaches < 0.777592 to the left, improve= 170.034100, (0 missing)
## Vacation_Budget < 0.6269171 to the right, improve= 1.952756, (0 missing)
## Income < 0.003360134 to the left, improve= 1.833659, (0 missing)
## Environmental_Concerns < 0.5 to the left, improve= 1.623661, (0 missing)
##
## Node number 12: 4977 observations, complexity param=0.01347136
## predicted class=0 expected loss=0.05505324 P(node) =0.1769411
## class counts: 4703 274
## probabilities: 0.945 0.055
## left son=24 (4021 obs) right son=25 (956 obs)
## Primary splits:
## Proximity_to_Mountains < 0.4498328 to the right, improve=90.9079000, (0 missing)
## Proximity_to_Beaches < 0.4197324 to the left, improve=61.5071700, (0 missing)
## Vacation_Budget < 0.941098 to the left, improve= 0.9842867, (0 missing)
## Income < 0.8675797 to the left, improve= 0.4652594, (0 missing)
## Travel_Frequency < 4.5 to the right, improve= 0.4535930, (0 missing)
##
## Node number 13: 2348 observations, complexity param=0.05076574
## predicted class=1 expected loss=0.3015332 P(node) =0.08347554
## class counts: 708 1640
## probabilities: 0.302 0.698
## left son=26 (790 obs) right son=27 (1558 obs)
## Primary splits:
## Proximity_to_Beaches < 0.1755853 to the left, improve=430.195500, (0 missing)
## Proximity_to_Mountains < 0.1488294 to the right, improve=106.272400, (0 missing)
## Gender < 1.5 to the left, improve= 2.005759, (0 missing)
## Vacation_Budget < 0.9843299 to the right, improve= 1.971968, (0 missing)
## Income < 0.002115085 to the right, improve= 1.459730, (0 missing)
## Surrogate splits:
## Income < 0.001050042 to the left, agree=0.664, adj=0.001, (0 split)
## Vacation_Budget < 0.001111358 to the left, agree=0.664, adj=0.001, (0 split)
##
## Node number 14: 2049 observations, complexity param=0.04424277
## predicted class=0 expected loss=0.3147877 P(node) =0.07284556
## class counts: 1404 645
## probabilities: 0.685 0.315
## left son=28 (1305 obs) right son=29 (744 obs)
## Primary splits:
## Proximity_to_Beaches < 0.8311037 to the left, improve=364.322500, (0 missing)
## Proximity_to_Mountains < 0.8478261 to the right, improve=100.330400, (0 missing)
## Vacation_Budget < 0.9739942 to the right, improve= 3.131489, (0 missing)
## Age < 66.5 to the left, improve= 2.230838, (0 missing)
## Income < 0.005735229 to the left, improve= 2.191775, (0 missing)
## Surrogate splits:
## Income < 0.9820843 to the left, agree=0.638, adj=0.004, (0 split)
##
## Node number 15: 4678 observations
## predicted class=1 expected loss=0.03954681 P(node) =0.1663111
## class counts: 185 4493
## probabilities: 0.040 0.960
##
## Node number 24: 4021 observations
## predicted class=0 expected loss=0.008455608 P(node) =0.1429536
## class counts: 3987 34
## probabilities: 0.992 0.008
##
## Node number 25: 956 observations, complexity param=0.01347136
## predicted class=0 expected loss=0.251046 P(node) =0.03398749
## class counts: 716 240
## probabilities: 0.749 0.251
## left son=50 (698 obs) right son=51 (258 obs)
## Primary splits:
## Proximity_to_Beaches < 0.3795987 to the left, improve=269.192700, (0 missing)
## Proximity_to_Mountains < 0.40301 to the right, improve= 5.609206, (0 missing)
## Vacation_Budget < 0.9405423 to the left, improve= 2.341175, (0 missing)
## Favorite_Season splits as RRLL, improve= 1.757646, (0 missing)
## Age < 64.5 to the left, improve= 1.716359, (0 missing)
##
## Node number 26: 790 observations, complexity param=0.01701645
## predicted class=0 expected loss=0.2734177 P(node) =0.02808589
## class counts: 574 216
## probabilities: 0.727 0.273
## left son=52 (550 obs) right son=53 (240 obs)
## Primary splits:
## Proximity_to_Mountains < 0.09866221 to the right, improve=156.596300, (0 missing)
## Proximity_to_Beaches < 0.09531773 to the left, improve= 38.331850, (0 missing)
## Travel_Frequency < 5.5 to the left, improve= 2.115627, (0 missing)
## Income < 0.9220619 to the right, improve= 1.747214, (0 missing)
## Pets < 0.5 to the right, improve= 1.312365, (0 missing)
## Surrogate splits:
## Income < 0.00151006 to the right, agree=0.699, adj=0.008, (0 split)
## Vacation_Budget < 0.004334297 to the right, agree=0.699, adj=0.008, (0 split)
##
## Node number 27: 1558 observations
## predicted class=1 expected loss=0.0860077 P(node) =0.05538965
## class counts: 134 1424
## probabilities: 0.086 0.914
##
## Node number 28: 1305 observations
## predicted class=0 expected loss=0.08965517 P(node) =0.04639505
## class counts: 1188 117
## probabilities: 0.910 0.090
##
## Node number 29: 744 observations, complexity param=0.01013897
## predicted class=1 expected loss=0.2903226 P(node) =0.02645051
## class counts: 216 528
## probabilities: 0.290 0.710
## left son=58 (347 obs) right son=59 (397 obs)
## Primary splits:
## Proximity_to_Mountains < 0.867893 to the right, improve=121.957000, (0 missing)
## Proximity_to_Beaches < 0.9013378 to the left, improve= 40.182940, (0 missing)
## Vacation_Budget < 0.9792176 to the right, improve= 3.454962, (0 missing)
## Income < 0.03120625 to the left, improve= 3.317735, (0 missing)
## Favorite_Season splits as RRRL, improve= 1.893465, (0 missing)
## Surrogate splits:
## Income < 0.9312422 to the right, agree=0.552, adj=0.040, (0 split)
## Vacation_Budget < 0.975439 to the right, agree=0.548, adj=0.032, (0 split)
## Favorite_Season splits as RRRL, agree=0.544, adj=0.023, (0 split)
## Age < 18.5 to the left, agree=0.539, adj=0.012, (0 split)
## Proximity_to_Beaches < 0.9916388 to the right, agree=0.539, adj=0.012, (0 split)
##
## Node number 50: 698 observations
## predicted class=0 expected loss=0.02292264 P(node) =0.02481513
## class counts: 682 16
## probabilities: 0.977 0.023
##
## Node number 51: 258 observations
## predicted class=1 expected loss=0.1317829 P(node) =0.009172355
## class counts: 34 224
## probabilities: 0.132 0.868
##
## Node number 52: 550 observations
## predicted class=0 expected loss=0.06545455 P(node) =0.01955347
## class counts: 514 36
## probabilities: 0.935 0.065
##
## Node number 53: 240 observations
## predicted class=1 expected loss=0.25 P(node) =0.008532423
## class counts: 60 180
## probabilities: 0.250 0.750
##
## Node number 58: 347 observations, complexity param=0.01013897
## predicted class=0 expected loss=0.4034582 P(node) =0.01233646
## class counts: 207 140
## probabilities: 0.597 0.403
## left son=116 (191 obs) right son=117 (156 obs)
## Primary splits:
## Proximity_to_Beaches < 0.9247492 to the left, improve=65.575930, (0 missing)
## Proximity_to_Mountains < 0.9347826 to the right, improve=29.210700, (0 missing)
## Vacation_Budget < 0.6653701 to the right, improve= 3.761286, (0 missing)
## Favorite_Season splits as RLLL, improve= 3.071775, (0 missing)
## Gender < 1.5 to the right, improve= 2.127182, (0 missing)
## Surrogate splits:
## Favorite_Season splits as RLLL, agree=0.585, adj=0.077, (0 split)
## Age < 27.5 to the right, agree=0.556, adj=0.013, (0 split)
## Income < 0.9227769 to the left, agree=0.556, adj=0.013, (0 split)
## Vacation_Budget < 0.8679707 to the left, agree=0.556, adj=0.013, (0 split)
##
## Node number 59: 397 observations
## predicted class=1 expected loss=0.02267003 P(node) =0.01411405
## class counts: 9 388
## probabilities: 0.023 0.977
##
## Node number 116: 191 observations
## predicted class=0 expected loss=0.1256545 P(node) =0.006790387
## class counts: 167 24
## probabilities: 0.874 0.126
##
## Node number 117: 156 observations
## predicted class=1 expected loss=0.2564103 P(node) =0.005546075
## class counts: 40 116
## probabilities: 0.256 0.744
# Predict on the test data
dt_pred <- predict(dt_model, test_data, type = "class")
summary(dt_pred)
## 0 1
## 5219 1813
# Plot
rpart.plot(dt_model)
# Evaluate the model
conf_matrix_dt <- confusionMatrix(dt_pred, test_data$Preference)
conf_matrix_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5153 66
## 1 116 1697
##
## Accuracy : 0.9741
## 95% CI : (0.9701, 0.9777)
## No Information Rate : 0.7493
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9318
##
## Mcnemar's Test P-Value : 0.0002811
##
## Sensitivity : 0.9780
## Specificity : 0.9626
## Pos Pred Value : 0.9874
## Neg Pred Value : 0.9360
## Prevalence : 0.7493
## Detection Rate : 0.7328
## Detection Prevalence : 0.7422
## Balanced Accuracy : 0.9703
##
## 'Positive' Class : 0
##
To determine which model is better, Random Forest or Decision Tree, we compare their performance metrics from the two confusion matrices:
Key metrics
Accuracy: Random Forest 0.9957 (99.57%), Decision Tree 0.9741 (97.41%). Random Forest is more accurate than the Decision Tree.
Sensitivity (recall for class 0): Random Forest 0.9985, Decision Tree 0.9780. Random Forest has higher sensitivity, meaning it correctly identifies more instances of class 0 (the majority class).
Specificity (recall for class 1): Random Forest 0.9875, Decision Tree 0.9626. Random Forest is better at correctly identifying class 1 (the minority class).
Kappa (agreement between predicted and actual values, corrected for chance): Random Forest 0.9886, Decision Tree 0.9318. Random Forest has the higher Kappa, meaning better agreement between predictions and actual values.
Balanced accuracy (mean of sensitivity and specificity): Random Forest 0.9930, Decision Tree 0.9703. Random Forest performs better across both classes, which matters on an imbalanced dataset like this one.
McNemar's test: Random Forest p-value = 0.01762, Decision Tree p-value = 0.0002811. Both p-values are below 0.05, so in both models the two types of misclassification are not balanced. The test does not rank the models by itself, but Random Forest makes far fewer errors in total (30 versus 182 of the 7,032 test cases).
Final conclusion: the Random Forest model outperforms the Decision Tree model on all the key metrics, including accuracy, sensitivity, specificity, Kappa, and balanced accuracy. Random Forest is therefore the better model for predicting preference.
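The McNemar p-values can be reproduced from the two test-set confusion matrices with `mcnemar.test()` from base R's stats package, which applies a continuity correction by default. The sketch below hard-codes the printed counts; `rf_cm` and `dt_cm` are illustrative names.

```r
# Test-set confusion matrices as printed (rows = prediction, columns = reference)
rf_cm <- matrix(c(5261, 22, 8, 1741), nrow = 2, byrow = TRUE)
dt_cm <- matrix(c(5153, 66, 116, 1697), nrow = 2, byrow = TRUE)

# McNemar's test only looks at the two off-diagonal cells (the two error types)
p_rf <- mcnemar.test(rf_cm)$p.value
p_dt <- mcnemar.test(dt_cm)$p.value

# Total number of misclassified test cases per model
errors_rf <- sum(rf_cm) - sum(diag(rf_cm))
errors_dt <- sum(dt_cm) - sum(diag(dt_cm))

print(c(p_rf = p_rf, p_dt = p_dt))
print(c(errors_rf = errors_rf, errors_dt = errors_dt))
```

Because the test compares the two error types within one model, a small p-value signals asymmetric errors rather than poor performance. The error totals are what separate the two models here.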