TPM EXERCISE, PRACTICUM 4
Library
## corrplot 0.94 loaded
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
##
## Recode
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##
## The following objects are masked from 'package:Metrics':
##
## precision, recall
##
## The following objects are masked from 'package:DescTools':
##
## MAE, RMSE
Data
The data are property listings from Pakistan, with the following variables:
data <- read.csv("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv")
head(data)
## property_id location_id
## 1 237062 3325
## 2 346905 3236
## 3 386513 764
## 4 656161 340
## 5 841645 3226
## 6 850762 3390
## page_url
## 1 https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html
## 2 https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html
## 3 https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html
## 4 https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html
## 5 https://www.zameen.com/Property/dha_valley_dha_homes_islamabad_dha_valley_8_marla_home_for_sale-841645-3226-1.html
## 6 https://www.zameen.com/Property/ghauri_town_ghauri_town_phase_1_house_is_available_for_sale_in_ghauri_town_phase_1-850762-3390-1.html
## property_type price location city province_name latitude
## 1 Flat 10000000 G-10 Islamabad Islamabad Capital 33.67989
## 2 Flat 6900000 E-11 Islamabad Islamabad Capital 33.70099
## 3 House 16500000 G-15 Islamabad Islamabad Capital 33.63149
## 4 House 43500000 Bani Gala Islamabad Islamabad Capital 33.70757
## 5 House 7000000 DHA Defence Islamabad Islamabad Capital 33.49259
## 6 House 34500000 Ghauri Town Islamabad Islamabad Capital 33.62395
## longitude baths area purpose bedrooms date_added agency
## 1 73.01264 2 4 Marla For Sale 2 2/4/2019
## 2 72.97149 3 5.6 Marla For Sale 3 5/4/2019
## 3 72.92656 6 8 Marla For Sale 5 7/17/2019
## 4 73.15120 4 2 Kanal For Sale 4 4/5/2019
## 5 73.30134 3 8 Marla For Sale 3 7/10/2019 Easy Property
## 6 73.12659 8 1.6 Kanal For Sale 8 4/5/2019
## agent Area.Type Area.Size
## 1 Marla 4.0
## 2 Marla 5.6
## 3 Marla 8.0
## 4 Kanal 2.0
## 5 Muhammad Junaid Ceo Muhammad Shahid Director Marla 8.0
## 6 Kanal 1.6
## Area.Category
## 1 0-5 Marla
## 2 5-10 Marla
## 3 5-10 Marla
## 4 1-5 Kanal
## 5 5-10 Marla
## 6 1-5 Kanal
Data Exploration
## property_id location_id page_url property_type
## Min. : 86575 Min. : 1 Length:168446 Length:168446
## 1st Qu.:14883202 1st Qu.: 1058 Class :character Class :character
## Median :16658506 Median : 3286 Mode :character Mode :character
## Mean :15596255 Mean : 4376
## 3rd Qu.:17086620 3rd Qu.: 7220
## Max. :17357718 Max. :14220
## price location city province_name
## Min. :0.000e+00 Length:168446 Length:168446 Length:168446
## 1st Qu.:1.750e+05 Class :character Class :character Class :character
## Median :8.500e+06 Mode :character Mode :character Mode :character
## Mean :1.777e+07
## 3rd Qu.:1.950e+07
## Max. :2.000e+09
## latitude longitude baths area
## Min. :11.05 Min. :25.91 Min. : 0.000 Length:168446
## 1st Qu.:24.95 1st Qu.:67.13 1st Qu.: 0.000 Class :character
## Median :31.46 Median :73.06 Median : 3.000 Mode :character
## Mean :29.86 Mean :71.24 Mean : 2.874
## 3rd Qu.:33.56 3rd Qu.:73.26 3rd Qu.: 4.000
## Max. :73.18 Max. :80.16 Max. :403.000
## purpose bedrooms date_added agency
## Length:168446 Min. : 0.000 Length:168446 Length:168446
## Class :character 1st Qu.: 2.000 Class :character Class :character
## Mode :character Median : 3.000 Mode :character Mode :character
## Mean : 3.179
## 3rd Qu.: 4.000
## Max. :68.000
## agent Area.Type Area.Size Area.Category
## Length:168446 Length:168446 Min. : 0.000 Length:168446
## Class :character Class :character 1st Qu.: 3.000 Class :character
## Mode :character Mode :character Median : 5.000 Mode :character
## Mean : 5.892
## 3rd Qu.: 8.000
## Max. :800.000
Checking for missing values
## property_id location_id page_url property_type price
## 0 0 0 0 0
## location city province_name latitude longitude
## 0 0 0 0 0
## baths area purpose bedrooms date_added
## 0 0 0 0 0
## agency agent Area.Type Area.Size Area.Category
## 0 0 0 0 0
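The zeros above are NA counts. The following is a minimal sketch, on a small made-up data frame (the values are illustrative and not taken from the dataset), of how such a check might be written and why empty strings, like those visible in agency and agent, would not be counted as missing.

```r
# Toy data: one NA in price, two empty strings in agency (illustrative values only)
toy <- data.frame(
  price  = c(100, NA, 300),
  agency = c("", "Easy Property", ""),
  stringsAsFactors = FALSE
)

# NA count per column
na_counts <- colSums(is.na(toy))

# Empty-string count per column; non-character columns get 0
empty_counts <- vapply(
  toy,
  function(x) if (is.character(x)) sum(x == "", na.rm = TRUE) else 0L,
  integer(1)
)

print(na_counts)
print(empty_counts)
```

Whether empty strings should be treated as missing depends on the analysis; here agency and agent are dropped before modelling, so the choice does not affect the trees.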
## 'data.frame': 168446 obs. of 20 variables:
## $ property_id : int 237062 346905 386513 656161 841645 850762 937975 1258636 1402466 1418706 ...
## $ location_id : int 3325 3236 764 340 3226 3390 445 3241 376 3282 ...
## $ page_url : chr "https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html" "https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html" "https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html" "https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html" ...
## $ property_type: chr "Flat" "Flat" "House" "House" ...
## $ price : int 10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
## $ location : chr "G-10" "E-11" "G-15" "Bani Gala" ...
## $ city : chr "Islamabad" "Islamabad" "Islamabad" "Islamabad" ...
## $ province_name: chr "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" ...
## $ latitude : num 33.7 33.7 33.6 33.7 33.5 ...
## $ longitude : num 73 73 72.9 73.2 73.3 ...
## $ baths : int 2 3 6 4 3 8 8 2 7 5 ...
## $ area : chr "4 Marla" "5.6 Marla" "8 Marla" "2 Kanal" ...
## $ purpose : chr "For Sale" "For Sale" "For Sale" "For Sale" ...
## $ bedrooms : int 2 3 5 4 3 8 8 2 7 5 ...
## $ date_added : chr "2/4/2019" "5/4/2019" "7/17/2019" "4/5/2019" ...
## $ agency : chr "" "" "" "" ...
## $ agent : chr "" "" "" "" ...
## $ Area.Type : chr "Marla" "Marla" "Marla" "Kanal" ...
## $ Area.Size : num 4 5.6 8 2 8 1.6 1 6.2 1 1 ...
## $ Area.Category: chr "0-5 Marla" "5-10 Marla" "5-10 Marla" "1-5 Kanal" ...
numeric_vars <- data %>%
  select_if(is.numeric) %>%
  names()
# Loop over the numeric columns and draw one boxplot per column
for (col in numeric_vars) {
  # Boxplot of the column, grouped by property type
  print(
    ggplot(data, aes(x = property_type, y = .data[[col]], fill = property_type)) +
      geom_boxplot(color = "black", outlier.color = "red", outlier.size = 2) +
      labs(title = paste("Boxplot of", col), y = col, x = "property_type") +
      theme_light()
  )
}
Based on the boxplots of each numeric variable by property type, House is the most common property type, and there are outliers in the number of bedrooms and the number of bathrooms for the House type.
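Boxplots show spread and outliers but not group sizes, so the two statements above can be checked numerically. The sketch below uses made-up values (not the dataset) and assumes the same column names; it counts listings per type with table() and lists the points beyond the whiskers with boxplot.stats().

```r
# Illustrative values only (not taken from the dataset)
toy <- data.frame(
  property_type = c("House", "House", "House", "Flat", "House", "Flat"),
  bedrooms      = c(2, 3, 3, 4, 5, 68)
)

# Number of listings per property type, largest first
type_counts <- sort(table(toy$property_type), decreasing = TRUE)

# Values beyond the boxplot whiskers (1.5 * IQR rule used by boxplot.stats)
outliers <- boxplot.stats(toy$bedrooms)$out

print(type_counts)
print(outliers)
```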
Modelling
Input Data
data <- read.csv(("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv"))
data <- data[data$price >= 1000, ]
str(data)
## 'data.frame': 168427 obs. of 20 variables:
## $ property_id : int 237062 346905 386513 656161 841645 850762 937975 1258636 1402466 1418706 ...
## $ location_id : int 3325 3236 764 340 3226 3390 445 3241 376 3282 ...
## $ page_url : chr "https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html" "https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html" "https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html" "https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html" ...
## $ property_type: chr "Flat" "Flat" "House" "House" ...
## $ price : int 10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
## $ location : chr "G-10" "E-11" "G-15" "Bani Gala" ...
## $ city : chr "Islamabad" "Islamabad" "Islamabad" "Islamabad" ...
## $ province_name: chr "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" ...
## $ latitude : num 33.7 33.7 33.6 33.7 33.5 ...
## $ longitude : num 73 73 72.9 73.2 73.3 ...
## $ baths : int 2 3 6 4 3 8 8 2 7 5 ...
## $ area : chr "4 Marla" "5.6 Marla" "8 Marla" "2 Kanal" ...
## $ purpose : chr "For Sale" "For Sale" "For Sale" "For Sale" ...
## $ bedrooms : int 2 3 5 4 3 8 8 2 7 5 ...
## $ date_added : chr "2/4/2019" "5/4/2019" "7/17/2019" "4/5/2019" ...
## $ agency : chr "" "" "" "" ...
## $ agent : chr "" "" "" "" ...
## $ Area.Type : chr "Marla" "Marla" "Marla" "Kanal" ...
## $ Area.Size : num 4 5.6 8 2 8 1.6 1 6.2 1 1 ...
## $ Area.Category: chr "0-5 Marla" "5-10 Marla" "5-10 Marla" "1-5 Kanal" ...
The model uses price, purpose, area type, area size, number of bathrooms, and number of bedrooms.
# Create a binary label: "Murah" (cheap) if price <= 17767764 (the mean price), otherwise "Mahal" (expensive).
harga = ifelse(data$price <=17767764, "Murah", "Mahal")
# Add harga to the data set.
data=data.frame(data,harga)
# Keep only the predictors and harga; price is dropped because harga is derived from it.
data.H <- data %>% select(-c(property_type,city,province_name,property_id,location_id,page_url,date_added,location,latitude,longitude,area,agency,agent,Area.Category,price))
# Code harga as a factor variable
data.H$harga = as.factor(data$harga)
class(data.H$harga)
## [1] "factor"
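A hard-coded cut-off does not update when rows are filtered or the data change. As a minimal sketch with made-up prices (not the dataset), the threshold could instead be computed from the data before labelling; the variable name threshold is introduced here for illustration only.

```r
# Illustrative prices only
price <- c(5e6, 1e7, 2e7, 5e7)

# Use the sample mean as the cut-off instead of a hard-coded number
threshold <- mean(price)

# "Murah" (cheap) at or below the mean, "Mahal" (expensive) above it
harga <- factor(ifelse(price <= threshold, "Murah", "Mahal"),
                levels = c("Mahal", "Murah"))

print(threshold)
print(table(harga))
```

Because prices are strongly right-skewed, a mean cut-off puts most listings in the Murah class; a median cut-off would give balanced classes, at the cost of a different definition of "expensive".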
Preprocessing
Split the data into training and testing sets with a 70%/30% split.
library(caret)
set.seed(234)
train <- createDataPartition(as.factor(data.H$harga), p=0.7, list=FALSE)
data.train=data.H[train,]
data.test=data.H[-train,]
harga.test=harga[-train]
head(data.H)
## baths purpose bedrooms Area.Type Area.Size harga
## 1 2 For Sale 2 Marla 4.0 Murah
## 2 3 For Sale 3 Marla 5.6 Murah
## 3 6 For Sale 5 Marla 8.0 Murah
## 4 4 For Sale 4 Kanal 2.0 Mahal
## 5 3 For Sale 3 Marla 8.0 Murah
## 6 8 For Sale 8 Kanal 1.6 Mahal
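createDataPartition samples within each class of the outcome, so the class proportions are roughly preserved in both sets. The base R sketch below illustrates that idea on a made-up label vector; it is an approximation of the technique for illustration, not caret's implementation.

```r
set.seed(234)

# Made-up labels: 30 "Mahal" and 70 "Murah"
harga <- factor(rep(c("Mahal", "Murah"), times = c(30, 70)))

# Row indices grouped by class
idx_by_class <- split(seq_along(harga), harga)

# Draw 70% of the indices within each class
train_idx <- sort(unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
))
test_idx <- setdiff(seq_along(harga), train_idx)

# Class proportions are the same in both sets
print(prop.table(table(harga[train_idx])))
print(prop.table(table(harga[test_idx])))
```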
Hyperparameter Tuning
Hyperparameter tuning is carried out to find the best combination of cp, minsplit, and maxdepth.
library(rpart)
library(caret)
cp_values <- seq(0.001, 0.02, by = 0.002)
minsplit_values <- c(1,2,3,4,5,6,7,8,9,10)
maxdepth_values <- c(5, 10, 15)
best_acc <- 0
best_params <- list()
for (cp in cp_values) {
  for (minsplit in minsplit_values) {
    for (maxdepth in maxdepth_values) {
      control <- rpart.control(cp = cp, minsplit = minsplit, maxdepth = maxdepth)
      model <- rpart(harga ~ ., data = data.train, method = "class", control = control)
      pred <- predict(model, data.test, type = "class")
      acc <- mean(pred == data.test$harga)
      # Keep the combination with the highest accuracy seen so far
      if (acc > best_acc) {
        best_acc <- acc
        best_params <- list(cp = cp, minsplit = minsplit, maxdepth = maxdepth)
      }
    }
  }
}
print(paste("Best Accuracy:", best_acc))
print("Best Parameters:")
print(best_params)
## [1] "Best Accuracy: 0.915174065351198"
## [1] "Best Parameters:"
## $cp
## [1] 0.001
##
## $minsplit
## [1] 1
##
## $maxdepth
## [1] 10
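The grid search above scores each combination on data.test, so the best accuracy it reports is likely an optimistic estimate of out-of-sample performance. One possible alternative is to score the grid with k-fold cross-validation on the training data only. The sketch below shows the idea on the kyphosis data that ships with rpart (not the property data); the grid values, k = 5, and all variable names are assumptions chosen for illustration.

```r
library(rpart)
set.seed(234)

dat <- kyphosis                      # example data shipped with rpart
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold label per row

grid <- expand.grid(cp = c(0.001, 0.01, 0.05), maxdepth = c(2, 5))
grid$cv_acc <- NA_real_

for (i in seq_len(nrow(grid))) {
  acc <- numeric(k)
  for (f in seq_len(k)) {
    # xval = 0 turns off rpart's internal cross-validation; folds are handled here
    ctrl <- rpart.control(cp = grid$cp[i], maxdepth = grid$maxdepth[i], xval = 0)
    fit  <- rpart(Kyphosis ~ ., data = dat[fold != f, ], method = "class", control = ctrl)
    pred <- predict(fit, dat[fold == f, ], type = "class")
    acc[f] <- mean(pred == dat$Kyphosis[fold == f])
  }
  grid$cv_acc[i] <- mean(acc)        # mean held-out accuracy over the k folds
}

best <- grid[which.max(grid$cv_acc), ]
print(grid)
print(best)
```

With this scheme the test set is used once, after the parameters are fixed, to report the final accuracy.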
Initial Model
Based on the tuning results, the following parameters are used:
CP = 0.001
Minsplit = 1
Maxdepth = 10
library(rpart)
control= rpart.control(cp=0.001, minsplit = 1, maxdepth = 10, xval = 10, minbucket = 10)
fit.tree = rpart(harga ~ ., data=data.train, method = 'class', control = control)
fit.tree
## n= 117900
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 117900 32099 Murah (0.27225615 0.72774385)
## 2) bedrooms>=3.5 45359 20734 Mahal (0.54289116 0.45710884)
## 4) purpose=For Sale 37409 12784 Mahal (0.65826405 0.34173595)
## 8) Area.Type=Kanal 11231 109 Mahal (0.99029472 0.00970528) *
## 9) Area.Type=Marla 26178 12675 Mahal (0.51581481 0.48418519)
## 18) Area.Size>=8.55 13502 2463 Mahal (0.81758258 0.18241742) *
## 19) Area.Size< 8.55 12676 2464 Murah (0.19438309 0.80561691)
## 38) Area.Size>=6.25 3716 1356 Murah (0.36490850 0.63509150)
## 76) bedrooms>=5.5 768 350 Mahal (0.54427083 0.45572917)
## 152) Area.Size>=7.7 467 159 Mahal (0.65952891 0.34047109) *
## 153) Area.Size< 7.7 301 110 Murah (0.36544850 0.63455150) *
## 77) bedrooms< 5.5 2948 938 Murah (0.31818182 0.68181818) *
## 39) Area.Size< 6.25 8960 1108 Murah (0.12366071 0.87633929) *
## 5) purpose=For Rent 7950 0 Murah (0.00000000 1.00000000) *
## 3) bedrooms< 3.5 72541 7474 Murah (0.10303139 0.89696861)
## 6) Area.Size>=7.05 19599 4943 Murah (0.25220675 0.74779325)
## 12) purpose=For Sale 10936 4943 Murah (0.45199342 0.54800658)
## 24) Area.Size>=10.15 2117 513 Mahal (0.75767596 0.24232404) *
## 25) Area.Size< 10.15 8819 3339 Murah (0.37861436 0.62138564)
## 50) bedrooms< 1.5 1550 679 Mahal (0.56193548 0.43806452)
## 100) Area.Size>=8.45 945 359 Mahal (0.62010582 0.37989418) *
## 101) Area.Size< 8.45 605 285 Murah (0.47107438 0.52892562) *
## 51) bedrooms>=1.5 7269 2468 Murah (0.33952401 0.66047599)
## 102) Area.Size< 9.9 5699 2092 Murah (0.36708194 0.63291806)
## 204) Area.Size>=8.85 978 456 Mahal (0.53374233 0.46625767)
## 408) Area.Size< 8.95 227 49 Mahal (0.78414097 0.21585903) *
## 409) Area.Size>=8.95 751 344 Murah (0.45805593 0.54194407)
## 818) Area.Size>=9.65 118 39 Mahal (0.66949153 0.33050847) *
## 819) Area.Size< 9.65 633 265 Murah (0.41864139 0.58135861) *
## 205) Area.Size< 8.85 4721 1570 Murah (0.33255666 0.66744334) *
## 103) Area.Size>=9.9 1570 376 Murah (0.23949045 0.76050955) *
## 13) purpose=For Rent 8663 0 Murah (0.00000000 1.00000000) *
## 7) Area.Size< 7.05 52942 2531 Murah (0.04780703 0.95219297)
## 14) Area.Type=Kanal 5980 1392 Murah (0.23277592 0.76722408)
## 28) purpose=For Sale 1612 220 Mahal (0.86352357 0.13647643) *
## 29) purpose=For Rent 4368 0 Murah (0.00000000 1.00000000) *
## 15) Area.Type=Marla 46962 1139 Murah (0.02425365 0.97574635) *
The resulting tree classifies each property as “Murah” (cheap) or “Mahal” (expensive), where the label is determined by the mean property price. The tree can be interpreted as follows:
- Root Node
The root node contains all of the training data (100%) and is labelled Murah with probability 0.73. It splits on the number of bedrooms (bedrooms >= 3.5, that is, at least 4 bedrooms). Properties with at least 4 bedrooms go to a node labelled Mahal with probability 0.54, and properties with fewer than 4 bedrooms go to a node labelled Murah with probability 0.90. Both nodes are split further on other variables.
- Branches and Splits
The variables used to split the properties are:
Purpose (For Sale) → whether the property is for sale. Area.Type (Kanal) → the area type of the property. Area.Size → the size of the property in the unit given by Area.Type. bedrooms → the number of bedrooms. Each branch is a more specific decision rule that divides the properties into smaller groups.
- Leaf Nodes
The nodes at the bottom of the tree are the final decisions, Murah or Mahal, together with their probabilities.
A probability close to 1 means that nearly all training observations in that leaf belong to the predicted class.
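The probabilities in the printed tree follow directly from the n and loss columns: loss is the number of training observations in the node that do not belong to the predicted class. As a small worked example using the root node and node 8 from the output above (the variable names are introduced here for illustration):

```r
# n and loss copied from the printed tree
root  <- c(n = 117900, loss = 32099)   # node 1, predicted class Murah
leaf8 <- c(n = 11231,  loss = 109)     # node 8, predicted class Mahal

# Probability of the predicted class = 1 - loss / n
p_root_majority  <- 1 - root[["loss"]] / root[["n"]]
p_leaf8_majority <- 1 - leaf8[["loss"]] / leaf8[["n"]]

# Share of the training data that ends up in node 8
share_leaf8 <- leaf8[["n"]] / root[["n"]]

print(round(c(p_root_majority, p_leaf8_majority, share_leaf8), 4))
```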
Variable Importance
## Area.Size bedrooms purpose baths Area.Type
## 13226.803 11940.732 9413.208 8661.697 6216.822
library(tibble)
library(forcats)
fit.tree$variable.importance %>%
  data.frame() %>%
  rownames_to_column(var = "Feature") %>%
  rename(Overall = '.') %>%
  ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
  geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "", title = "Variable Importance with Simple Classification")
Based on variable importance, the most important variable for the classification is Area.Size, while the variable with the least influence is the area type (Area.Type).
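The raw importance values are easier to compare as shares of the total. The sketch below rescales the values printed above to percentages. Note that baths has a non-zero importance although it is not used in any split of the tree; rpart also credits variables that act as surrogate splits.

```r
# Importance values copied from the output above
importance <- c(Area.Size = 13226.803, bedrooms = 11940.732, purpose = 9413.208,
                baths = 8661.697, Area.Type = 6216.822)

# Rescale so that the values sum to 100
importance_pct <- 100 * importance / sum(importance)

print(round(sort(importance_pct, decreasing = TRUE), 1))
```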
Classification Prediction
Accuracy and Evaluation
##
## pred.tree_train Mahal Murah
## Mahal 26308 3911
## Murah 5791 81890
## harga.test
## pred.tree_test Mahal Murah
## Mahal 11180 1710
## Murah 2576 35061
## Confusion Matrix and Statistics
##
## Reference
## Prediction Mahal Murah
## Mahal 26308 3911
## Murah 5791 81890
##
## Accuracy : 0.9177
## 95% CI : (0.9161, 0.9193)
## No Information Rate : 0.7277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7885
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8196
## Specificity : 0.9544
## Pos Pred Value : 0.8706
## Neg Pred Value : 0.9340
## Prevalence : 0.2723
## Detection Rate : 0.2231
## Detection Prevalence : 0.2563
## Balanced Accuracy : 0.8870
##
## 'Positive' Class : Mahal
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Mahal Murah
## Mahal 11180 1710
## Murah 2576 35061
##
## Accuracy : 0.9152
## 95% CI : (0.9127, 0.9176)
## No Information Rate : 0.7277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7816
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8127
## Specificity : 0.9535
## Pos Pred Value : 0.8673
## Neg Pred Value : 0.9316
## Prevalence : 0.2723
## Detection Rate : 0.2213
## Detection Prevalence : 0.2551
## Balanced Accuracy : 0.8831
##
## 'Positive' Class : Mahal
##
akurasi_train <- cm_train$overall['Accuracy']
balanced_accuracy.train <- cm_train$byClass['Balanced Accuracy']
f1_score_train <- cm_train$byClass['F1']
akurasi_test <- cm_test$overall['Accuracy']
balanced_accuracy.test <- cm_test$byClass['Balanced Accuracy']
f1_score_test <- cm_test$byClass['F1']
df_eval <- data.frame("Accuracy" = c(akurasi_train,akurasi_test),
"Bal. Accuracy" = c(balanced_accuracy.train, balanced_accuracy.test),
"F1 Score" = c(f1_score_train, f1_score_test))
rownames(df_eval) <- c("Data Train", "Data Test")
round(df_eval, 2)
## Accuracy Bal..Accuracy F1.Score
## Data Train 0.92 0.89 0.84
## Data Test 0.92 0.88 0.84
On both the training and test data, accuracy is 92% and the F1 score is 84%, which is a very satisfactory result. The nearly identical values on the two sets also suggest that the tree is not overfitting.
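The reported statistics can be reproduced by hand from the test confusion matrix, with Mahal as the positive class. The following sketch uses only the four counts shown above; the variable names are introduced for illustration.

```r
# Test confusion matrix copied from the output above (rows = prediction, columns = reference)
cm <- matrix(c(11180, 2576, 1710, 35061), nrow = 2,
             dimnames = list(Prediction = c("Mahal", "Murah"),
                             Reference  = c("Mahal", "Murah")))

tp <- cm["Mahal", "Mahal"]   # predicted Mahal, actually Mahal
fp <- cm["Mahal", "Murah"]   # predicted Mahal, actually Murah
fn <- cm["Murah", "Mahal"]   # predicted Murah, actually Mahal
tn <- cm["Murah", "Murah"]   # predicted Murah, actually Murah

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)              # recall for Mahal
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)              # positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
              precision = precision, f1 = f1, balanced = balanced), 4))
```

The sensitivity of about 0.81 shows that most of the remaining errors are expensive properties predicted as cheap, which the overall accuracy alone does not reveal.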
Misclassification Error
conf_matrix_train <- table(pred.tree_train, data.train$harga)
print(conf_matrix_train)
conf_matrix_test <- table(pred.tree_test, harga.test)
print(conf_matrix_test)
##
## pred.tree_train Mahal Murah
## Mahal 26308 3911
## Murah 5791 81890
## harga.test
## pred.tree_test Mahal Murah
## Mahal 11180 1710
## Murah 2576 35061
accuracy_train <- sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
accuracy_test <- sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
misclassification_rate_train <- 1 - accuracy_train
misclassification_rate_test <- 1 - accuracy_test
cat("Tingkat Kesalahan Training:", misclassification_rate_train * 100, "%\n")
cat("Tingkat Kesalahan Testing:", misclassification_rate_test * 100, "%\n")
## Tingkat Kesalahan Training: 8.229008 %
## Tingkat Kesalahan Testing: 8.482593 %
The training and testing error rates are both small, at about 8%.
Optimization
##
## Classification tree:
## rpart(formula = harga ~ ., data = data.train, method = "class",
## control = control)
##
## Variables actually used in tree construction:
## [1] Area.Size Area.Type bedrooms purpose
##
## Root node error: 32099/117900 = 0.27226
##
## n= 117900
##
## CP nsplit rel error xerror xstd
## 1 0.1844450 0 1.00000 1.00000 0.0047615
## 2 0.1206891 2 0.63111 0.63111 0.0040352
## 3 0.0141001 4 0.38973 0.38973 0.0032944
## 4 0.0059815 9 0.31923 0.32017 0.0030174
## 5 0.0015473 10 0.31325 0.31415 0.0029916
## 6 0.0010904 13 0.30861 0.30976 0.0029726
## 7 0.0010281 14 0.30752 0.30552 0.0029541
## 8 0.0010000 18 0.30225 0.30552 0.0029541
## [1] 0.001028069
The cp value with the lowest cross-validated error (0.00103) is essentially the same as the value chosen by hyperparameter tuning (0.001), so no additional model is needed.
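The selection can be traced on the printed cp table. The sketch below re-enters the table as a data frame (values copied from the output above), picks the row with the lowest xerror, and, as an optional and commonly used alternative, applies the one-standard-error rule, which takes the simplest tree whose xerror is within one xstd of the minimum. The variable names are assumptions for illustration.

```r
# cp table copied from the printcp output above
cp_table <- data.frame(
  CP     = c(0.1844450, 0.1206891, 0.0141001, 0.0059815,
             0.0015473, 0.0010904, 0.0010281, 0.0010000),
  nsplit = c(0, 2, 4, 9, 10, 13, 14, 18),
  xerror = c(1.00000, 0.63111, 0.38973, 0.32017,
             0.31415, 0.30976, 0.30552, 0.30552),
  xstd   = c(0.0047615, 0.0040352, 0.0032944, 0.0030174,
             0.0029916, 0.0029726, 0.0029541, 0.0029541)
)

# Row with the lowest cross-validated error (first one in case of ties)
i_min  <- which.min(cp_table$xerror)
cp_min <- cp_table$CP[i_min]

# One-standard-error rule: simplest tree within one xstd of the minimum
limit  <- cp_table$xerror[i_min] + cp_table$xstd[i_min]
i_1se  <- min(which(cp_table$xerror <= limit))
cp_1se <- cp_table$CP[i_1se]

print(c(cp_min = cp_min, cp_1se = cp_1se))
```

On this table both rules point to the same row (14 splits), which is consistent with keeping the tuned model.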
Regression
After classification, a regression tree is fitted to predict the price of a property from the selected features.
Input Data
data <- read.csv(("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv"))
data <- as.data.frame(data)
head(data)
## property_id location_id
## 1 237062 3325
## 2 346905 3236
## 3 386513 764
## 4 656161 340
## 5 841645 3226
## 6 850762 3390
## page_url
## 1 https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html
## 2 https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html
## 3 https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html
## 4 https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html
## 5 https://www.zameen.com/Property/dha_valley_dha_homes_islamabad_dha_valley_8_marla_home_for_sale-841645-3226-1.html
## 6 https://www.zameen.com/Property/ghauri_town_ghauri_town_phase_1_house_is_available_for_sale_in_ghauri_town_phase_1-850762-3390-1.html
## property_type price location city province_name latitude
## 1 Flat 10000000 G-10 Islamabad Islamabad Capital 33.67989
## 2 Flat 6900000 E-11 Islamabad Islamabad Capital 33.70099
## 3 House 16500000 G-15 Islamabad Islamabad Capital 33.63149
## 4 House 43500000 Bani Gala Islamabad Islamabad Capital 33.70757
## 5 House 7000000 DHA Defence Islamabad Islamabad Capital 33.49259
## 6 House 34500000 Ghauri Town Islamabad Islamabad Capital 33.62395
## longitude baths area purpose bedrooms date_added agency
## 1 73.01264 2 4 Marla For Sale 2 2/4/2019
## 2 72.97149 3 5.6 Marla For Sale 3 5/4/2019
## 3 72.92656 6 8 Marla For Sale 5 7/17/2019
## 4 73.15120 4 2 Kanal For Sale 4 4/5/2019
## 5 73.30134 3 8 Marla For Sale 3 7/10/2019 Easy Property
## 6 73.12659 8 1.6 Kanal For Sale 8 4/5/2019
## agent Area.Type Area.Size
## 1 Marla 4.0
## 2 Marla 5.6
## 3 Marla 8.0
## 4 Kanal 2.0
## 5 Muhammad Junaid Ceo Muhammad Shahid Director Marla 8.0
## 6 Kanal 1.6
## Area.Category
## 1 0-5 Marla
## 2 5-10 Marla
## 3 5-10 Marla
## 4 1-5 Kanal
## 5 5-10 Marla
## 6 1-5 Kanal
Preprocessing
The same variables are selected as in the classification stage.
Cleaning
data <- data %>% select(-c(property_type,city,province_name,property_id,location_id,page_url,date_added,location,latitude,longitude,area,date_added,agency,agent,Area.Category))
any(is.na(data))
## [1] FALSE
## 'data.frame': 168446 obs. of 6 variables:
## $ price : int 10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
## $ baths : int 2 3 6 4 3 8 8 2 7 5 ...
## $ purpose : chr "For Sale" "For Sale" "For Sale" "For Sale" ...
## $ bedrooms : int 2 3 5 4 3 8 8 2 7 5 ...
## $ Area.Type: chr "Marla" "Marla" "Marla" "Kanal" ...
## $ Area.Size: num 4 5.6 8 2 8 1.6 1 6.2 1 1 ...
## Warning: package 'fastDummies' was built under R version 4.4.2
## 'data.frame': 168446 obs. of 6 variables:
## $ price : int 10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
## $ baths : int 2 3 6 4 3 8 8 2 7 5 ...
## $ purpose : Factor w/ 2 levels "For Rent","For Sale": 2 2 2 2 2 2 2 2 2 2 ...
## $ bedrooms : int 2 3 5 4 3 8 8 2 7 5 ...
## $ Area.Type: Factor w/ 2 levels "Kanal","Marla": 2 2 2 1 2 1 1 2 1 1 ...
## $ Area.Size: num 4 5.6 8 2 8 1.6 1 6.2 1 1 ...
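The second str() output shows purpose and Area.Type converted from character to factor. A minimal sketch of one way to do such a conversion, on a made-up data frame with the same column names (the values are illustrative only):

```r
# Illustrative rows only
toy <- data.frame(
  price     = c(10000000, 45000, 43500000),
  purpose   = c("For Sale", "For Rent", "For Sale"),
  Area.Type = c("Marla", "Marla", "Kanal"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving numeric columns unchanged
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```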
Split the data into training and testing sets with a 70%/30% split.
Building the Initial Model
## n= 117913
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 117913 1.496534e+20 17827750.00
## 2) Area.Type=Marla 96465 1.763716e+19 11204870.00
## 4) purpose=For Rent 24939 2.090168e+14 49106.58 *
## 5) purpose=For Sale 71526 1.345110e+19 15094560.00
## 10) Area.Size< 8.55 52346 2.975806e+18 10342790.00 *
## 11) Area.Size>=8.55 19180 6.067618e+18 28063070.00 *
## 3) Area.Type=Kanal 21448 1.087547e+20 47614950.00
## 6) purpose=For Rent 8547 6.069483e+14 229144.10 *
## 7) purpose=For Sale 12901 7.684804e+19 79008360.00
## 14) Area.Size< 1.75 10210 1.386297e+19 60138720.00
## 28) Area.Size< 1.15 8846 8.163188e+18 55083080.00 *
## 29) Area.Size>=1.15 1364 4.007345e+18 92926300.00 *
## 15) Area.Size>=1.75 2691 4.555647e+19 150602200.00
## 30) bedrooms< 5.5 1428 2.194962e+19 123225400.00 *
## 31) bedrooms>=5.5 1263 2.132649e+19 181555600.00
## 62) Area.Size< 3.65 1070 7.376142e+18 158045500.00 *
## 63) Area.Size>=3.65 193 1.008011e+19 311896400.00 *
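In a regression tree, yval is the mean price of the training observations in the node, so a parent's value is the size-weighted mean of its children. The sketch below checks this for the root split using the n and yval columns printed above (the printed values are rounded, so the match is approximate), and computes the share of the data in the highest-priced leaf.

```r
# n and yval of the two children of the root, copied from the printed tree
n_child    <- c(Marla = 96465,    Kanal = 21448)
mean_child <- c(Marla = 11204870, Kanal = 47614950)

# Size-weighted mean of the children; should be close to the root's yval
root_mean <- sum(n_child * mean_child) / sum(n_child)

# Share of training observations in node 63 (n = 193)
share_node63 <- 193 / sum(n_child)

print(root_mean)
print(round(100 * share_node63, 2))
```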
Variable Importance
## Area.Size purpose Area.Type bedrooms baths
## 4.595945e+19 3.609194e+19 2.326152e+19 1.332939e+19 1.008154e+19
fit.tree$variable.importance %>%
  data.frame() %>%
  rownames_to_column(var = "Feature") %>%
  rename(Overall = '.') %>%
  ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
  geom_pointrange(aes(ymin = 0, ymax = Overall), color = "red", size = .3) +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "", title = "Variable Importance with Simple Regression")
Based on variable importance, the most important variable for the regression is Area.Size, while the variable with the least influence is the number of bathrooms (baths).
Prediction
Accuracy and Evaluation
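# Metrics functions take (actual, predicted); the reference is the observed price.
# The warnings and table below come from the run that used data.train$Area.Size as the reference.
mse_train <- mse(data.train$price, pred.tree_train)
rmse_train <- rmse(data.train$price, pred.tree_train)
mape_train <- mape(data.train$price, pred.tree_train)
mae_train <- mae(data.train$price, pred.tree_train)
mse_test <- mse(data.test$price, pred.tree_test)
rmse_test <- rmse(data.test$price, pred.tree_test)
mape_test <- mape(data.test$price, pred.tree_test)
mae_test <- mae(data.test$price, pred.tree_test)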
## Warning in actual - predicted: longer object length is not a multiple of
## shorter object length
## Warning in actual - predicted: longer object length is not a multiple of
## shorter object length
## Warning in actual - predicted: longer object length is not a multiple of
## shorter object length
## Warning in ae(actual, predicted)/abs(actual): longer object length is not a
## multiple of shorter object length
## Warning in actual - predicted: longer object length is not a multiple of
## shorter object length
df_eval <- data.frame("MSE" = c(mse_train, mse_test),
"RMSE" = c(rmse_train,rmse_test),
"MAPE" = c(mape_train,mape_test),
"MAE" = c(mae_train, mae_test))
rownames(df_eval) <- c("Data Train", "Data Test")
round(df_eval, 2)
## MSE RMSE MAPE MAE
## Data Train 1.072900e+15 32755156 1 17827745
## Data Test 1.048212e+15 32376102 1 17822722
Optimization
##
## Regression tree:
## rpart(formula = price ~ ., data = data.train, method = "anova")
##
## Variables actually used in tree construction:
## [1] Area.Size Area.Type bedrooms purpose
##
## Root node error: 1.4965e+20/117913 = 1.2692e+15
##
## n= 117913
##
## CP nsplit rel error xerror xstd
## 1 0.184318 0 1.00000 1.00002 0.039778
## 2 0.116460 2 0.63136 0.63146 0.034123
## 3 0.028711 3 0.51490 0.51667 0.029380
## 4 0.020549 5 0.45748 0.45929 0.029362
## 5 0.011309 7 0.41638 0.41868 0.026741
## 6 0.010000 8 0.40507 0.40817 0.026699
Second Model
# Select the cp value with the lowest cross-validated error (xerror)
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
## [1] 0.01
bestcp <-fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
pruned.tree <- prune(fit.tree, cp = bestcp)
# Plot the pruned tree
rpart.plot(pruned.tree)
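The initial regression tree and the pruned tree are identical, because the cp with the lowest cross-validated error (0.01) is the same as the default cp used to fit the initial tree.
Several conclusions can be drawn from the plot:
- Properties with Area.Type Marla tend to be cheaper than those with Area.Type Kanal (mean price of about 11.2 million versus 47.6 million).
- Rental properties (For Rent) form terminal nodes with very low prices (about 49 thousand for Marla and 229 thousand for Kanal); they are not split further by area size or number of bedrooms.
- Among properties for sale with the same area type, a smaller area size means a lower price.
- The number of bedrooms only affects the price in the highest price range (Kanal properties for sale with Area.Size >= 1.75).
- Kanal properties for sale with Area.Size >= 3.65 and at least 6 bedrooms reach a mean price of about 312 million, but they are very rare: 193 of 117,913 training observations (about 0.2%, shown as 0% on the plot).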
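Pruning only changes a tree when it was grown with a cp smaller than the selected one. As a sketch of that workflow on the mtcars data that ships with R (not the property data), a tree is grown deliberately large with a small cp and then pruned back to the cp with the lowest cross-validated error; the control values and the helper n_leaves are assumptions for illustration.

```r
library(rpart)
set.seed(234)

# Grow a deliberately large tree (small cp) so that pruning has something to remove
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

# cp with the lowest cross-validated error
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to that cp
pruned <- prune(fit, cp = best_cp)

# Hypothetical helper: number of terminal nodes in an rpart tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

print(c(full = n_leaves(fit), pruned = n_leaves(pruned)))
```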
# Predictions from the pruned tree
pred.prune_train = predict(pruned.tree, data.train)
pred.prune_test = predict(pruned.tree, data.test)
# Metrics functions take (actual, predicted); the reference is the observed price.
# The table below comes from the run that used Area.Size as the reference.
mse_prune_train <- mse(data.train$price, pred.prune_train)
rmse_prune_train <- rmse(data.train$price, pred.prune_train)
mape_prune_train <- mape(data.train$price, pred.prune_train)
mae_prune_train <- mae(data.train$price, pred.prune_train)
mse_prune_test <- mse(data.test$price, pred.prune_test)
rmse_prune_test <- rmse(data.test$price, pred.prune_test)
mape_prune_test <- mape(data.test$price, pred.prune_test)
mae_prune_test <- mae(data.test$price, pred.prune_test)
df_prune_eval <- data.frame("MSE" = c(mse_prune_train, mse_prune_test),
"RMSE" = c(rmse_prune_train,rmse_prune_test),
"MAPE" = c(mape_prune_train,mape_prune_test),
"MAE" = c(mae_prune_train, mae_prune_test))
rownames(df_prune_eval) <- c("Data Train", "Data Test")
round(df_prune_eval, 2)
## MSE RMSE MAPE MAE
## Data Train 1.072900e+15 32755156 1 17827745
## Data Test 1.039152e+15 32235875 1 17740271
The model is evaluated with four metrics: MSE, RMSE, MAPE, and MAE. The values in the tables, however, were obtained by comparing the predicted prices with Area.Size rather than with price, so they do not measure the price prediction error. Because Area.Size is negligible next to prices in the tens of millions, the reported MAE (about 17.7 to 17.8 million) is essentially the mean predicted price, which matches the root node mean of 17,827,750, and the reported RMSE (about 32.2 to 32.8 million, the square root of an MSE of about 1.0e+15) is close to the root mean square of the predictions.
The MAPE of 1 has the same cause. The mape function in the Metrics package returns a proportion, so 1 means 100%, not 1%: each term is |predicted - Area.Size| / |predicted|, which is approximately 1. The metrics should be recomputed against price before drawing conclusions; note that MAPE is undefined for listings whose price is 0, which the data summary shows exist. In the meantime the cross-validation table gives a usable indication: the relative cross-validated error of the final tree is about 0.41, so the tree removes roughly 59% of the variance in price. The model therefore still needs improvement, whether through feature selection, parameter tuning, or more complex models.
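The effect of the reference variable on these metrics can be seen on a toy example. The sketch below defines the metrics in base R with the usual (actual, predicted) argument order; the helper names and all numbers are made up for illustration and are not the Metrics package functions or the dataset values.

```r
# Hypothetical helpers with the usual definitions; argument order is (actual, predicted)
mae_fn  <- function(actual, predicted) mean(abs(actual - predicted))
rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape_fn <- function(actual, predicted) mean(abs(actual - predicted) / abs(actual))

# Made-up prices, predictions and area sizes
price     <- c(10, 20, 40) * 1e6
predicted <- c(12, 18, 44) * 1e6
area_size <- c(4, 5.6, 8)

# Correct comparison: observed price versus predicted price
mae_ok  <- mae_fn(price, predicted)
rmse_ok <- rmse_fn(price, predicted)
mape_ok <- mape_fn(price, predicted)

# Comparison against Area.Size: MAE collapses to the mean prediction, MAPE to about 1
mae_wrong  <- mae_fn(predicted, area_size)
mape_wrong <- mape_fn(predicted, area_size)

print(c(mae_ok = mae_ok, rmse_ok = rmse_ok, mape_ok = mape_ok))
print(c(mae_wrong = mae_wrong, mean_prediction = mean(predicted), mape_wrong = mape_wrong))
```

A MAPE returned as a proportion can be multiplied by 100 when a percentage is wanted; rows with an actual value of 0 have to be excluded or handled separately.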