TPM PRACTICUM 4 EXERCISE

Library

library(DescTools)
library(ggplot2)
library(cowplot)
library(corrplot)
## corrplot 0.94 loaded
library(lattice)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
## 
##     Recode
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(Metrics)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(skimr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::recode()    masks car::recode()
## ✖ purrr::some()      masks car::some()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(caret)
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
## 
## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE
library(rpart.plot)
library(ROCR)
library(ISLR)

Data

The data set contains property listings from Pakistan. It is loaded below, and the first rows show the available variables:

data <- read.csv("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv")
head(data)
##   property_id location_id
## 1      237062        3325
## 2      346905        3236
## 3      386513         764
## 4      656161         340
## 5      841645        3226
## 6      850762        3390
##                                                                                                                                page_url
## 1                 https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html
## 2                                    https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html
## 3                                          https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html
## 4                   https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html
## 5                    https://www.zameen.com/Property/dha_valley_dha_homes_islamabad_dha_valley_8_marla_home_for_sale-841645-3226-1.html
## 6 https://www.zameen.com/Property/ghauri_town_ghauri_town_phase_1_house_is_available_for_sale_in_ghauri_town_phase_1-850762-3390-1.html
##   property_type    price    location      city     province_name latitude
## 1          Flat 10000000        G-10 Islamabad Islamabad Capital 33.67989
## 2          Flat  6900000        E-11 Islamabad Islamabad Capital 33.70099
## 3         House 16500000        G-15 Islamabad Islamabad Capital 33.63149
## 4         House 43500000   Bani Gala Islamabad Islamabad Capital 33.70757
## 5         House  7000000 DHA Defence Islamabad Islamabad Capital 33.49259
## 6         House 34500000 Ghauri Town Islamabad Islamabad Capital 33.62395
##   longitude baths      area  purpose bedrooms date_added        agency
## 1  73.01264     2   4 Marla For Sale        2   2/4/2019              
## 2  72.97149     3 5.6 Marla For Sale        3   5/4/2019              
## 3  72.92656     6   8 Marla For Sale        5  7/17/2019              
## 4  73.15120     4   2 Kanal For Sale        4   4/5/2019              
## 5  73.30134     3   8 Marla For Sale        3  7/10/2019 Easy Property
## 6  73.12659     8 1.6 Kanal For Sale        8   4/5/2019              
##                                          agent Area.Type Area.Size
## 1                                                  Marla       4.0
## 2                                                  Marla       5.6
## 3                                                  Marla       8.0
## 4                                                  Kanal       2.0
## 5 Muhammad Junaid Ceo Muhammad Shahid Director     Marla       8.0
## 6                                                  Kanal       1.6
##   Area.Category
## 1     0-5 Marla
## 2    5-10 Marla
## 3    5-10 Marla
## 4     1-5 Kanal
## 5    5-10 Marla
## 6     1-5 Kanal

Data Exploration

summary(data)
##   property_id        location_id      page_url         property_type     
##  Min.   :   86575   Min.   :    1   Length:168446      Length:168446     
##  1st Qu.:14883202   1st Qu.: 1058   Class :character   Class :character  
##  Median :16658506   Median : 3286   Mode  :character   Mode  :character  
##  Mean   :15596255   Mean   : 4376                                        
##  3rd Qu.:17086620   3rd Qu.: 7220                                        
##  Max.   :17357718   Max.   :14220                                        
##      price             location             city           province_name     
##  Min.   :0.000e+00   Length:168446      Length:168446      Length:168446     
##  1st Qu.:1.750e+05   Class :character   Class :character   Class :character  
##  Median :8.500e+06   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.777e+07                                                           
##  3rd Qu.:1.950e+07                                                           
##  Max.   :2.000e+09                                                           
##     latitude       longitude         baths             area          
##  Min.   :11.05   Min.   :25.91   Min.   :  0.000   Length:168446     
##  1st Qu.:24.95   1st Qu.:67.13   1st Qu.:  0.000   Class :character  
##  Median :31.46   Median :73.06   Median :  3.000   Mode  :character  
##  Mean   :29.86   Mean   :71.24   Mean   :  2.874                     
##  3rd Qu.:33.56   3rd Qu.:73.26   3rd Qu.:  4.000                     
##  Max.   :73.18   Max.   :80.16   Max.   :403.000                     
##    purpose             bedrooms       date_added           agency         
##  Length:168446      Min.   : 0.000   Length:168446      Length:168446     
##  Class :character   1st Qu.: 2.000   Class :character   Class :character  
##  Mode  :character   Median : 3.000   Mode  :character   Mode  :character  
##                     Mean   : 3.179                                        
##                     3rd Qu.: 4.000                                        
##                     Max.   :68.000                                        
##     agent            Area.Type           Area.Size       Area.Category     
##  Length:168446      Length:168446      Min.   :  0.000   Length:168446     
##  Class :character   Class :character   1st Qu.:  3.000   Class :character  
##  Mode  :character   Mode  :character   Median :  5.000   Mode  :character  
##                                        Mean   :  5.892                     
##                                        3rd Qu.:  8.000                     
##                                        Max.   :800.000

Check whether there are any missing values

colSums(is.na(data))
##   property_id   location_id      page_url property_type         price 
##             0             0             0             0             0 
##      location          city province_name      latitude     longitude 
##             0             0             0             0             0 
##         baths          area       purpose      bedrooms    date_added 
##             0             0             0             0             0 
##        agency         agent     Area.Type     Area.Size Area.Category 
##             0             0             0             0             0
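`is.na()` only flags values that R stores as `NA`. The `head(data)` output above shows blank `agency` and `agent` fields, and `read.csv` keeps blank character fields as empty strings by default, so the zero counts may understate how much information is missing. A minimal sketch of counting both kinds of gaps, using a toy data frame (the `toy` object and its values are invented for illustration):

```r
# Toy stand-in for the listings data; values are invented for illustration.
toy <- data.frame(
  agency = c("", "", "Easy Property"),
  baths  = c(2, 3, NA),
  stringsAsFactors = FALSE
)

# NA counts per column: empty strings are not counted here.
na_counts <- colSums(is.na(toy))

# Empty-string counts per column; NA comparisons are ignored via na.rm = TRUE.
blank_counts <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

On the real data the same two calls could be applied to `data` directly.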
str(data)
## 'data.frame':    168446 obs. of  20 variables:
##  $ property_id  : int  237062 346905 386513 656161 841645 850762 937975 1258636 1402466 1418706 ...
##  $ location_id  : int  3325 3236 764 340 3226 3390 445 3241 376 3282 ...
##  $ page_url     : chr  "https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html" "https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html" "https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html" "https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html" ...
##  $ property_type: chr  "Flat" "Flat" "House" "House" ...
##  $ price        : int  10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
##  $ location     : chr  "G-10" "E-11" "G-15" "Bani Gala" ...
##  $ city         : chr  "Islamabad" "Islamabad" "Islamabad" "Islamabad" ...
##  $ province_name: chr  "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" ...
##  $ latitude     : num  33.7 33.7 33.6 33.7 33.5 ...
##  $ longitude    : num  73 73 72.9 73.2 73.3 ...
##  $ baths        : int  2 3 6 4 3 8 8 2 7 5 ...
##  $ area         : chr  "4 Marla" "5.6 Marla" "8 Marla" "2 Kanal" ...
##  $ purpose      : chr  "For Sale" "For Sale" "For Sale" "For Sale" ...
##  $ bedrooms     : int  2 3 5 4 3 8 8 2 7 5 ...
##  $ date_added   : chr  "2/4/2019" "5/4/2019" "7/17/2019" "4/5/2019" ...
##  $ agency       : chr  "" "" "" "" ...
##  $ agent        : chr  "" "" "" "" ...
##  $ Area.Type    : chr  "Marla" "Marla" "Marla" "Kanal" ...
##  $ Area.Size    : num  4 5.6 8 2 8 1.6 1 6.2 1 1 ...
##  $ Area.Category: chr  "0-5 Marla" "5-10 Marla" "5-10 Marla" "1-5 Kanal" ...
numeric_vars <- data %>% 
  select_if(is.numeric) %>% 
  names()

# Loop over the numeric columns and draw one boxplot per column
for (col in numeric_vars) {
  # Build and print the boxplot, grouped by property type
  print(
    ggplot(data, aes(x = property_type, y = .data[[col]], fill = property_type)) +
      geom_boxplot(color = "black", outlier.color = "red", outlier.size = 2) +
      labs(title = paste("Boxplot of", col), y = col, x = "property_type") +
      theme_light()
  )
}

Exploring each numeric variable by property type with boxplots indicates that house is the most common property type, and that the house type has outliers in the number of bedrooms and the number of bathrooms.
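A boxplot shows the spread of a variable but not how many rows fall in each group, so the frequency of each property type is easier to confirm with a count. The sketch below uses toy vectors (invented values standing in for `data$property_type` and `data$bedrooms`) and flags outliers with a 1.5 * IQR rule similar to the one `geom_boxplot` applies by default:

```r
# Toy stand-ins for data$property_type and data$bedrooms; values are invented.
property_type <- c("House", "Flat", "House", "House", "Flat", "House")
bedrooms      <- c(3, 2, 4, 5, 2, 40)

# Frequency of each property type, most common first.
type_counts <- sort(table(property_type), decreasing = TRUE)

# Values lying more than 1.5 * IQR beyond the hinges.
bedroom_outliers <- boxplot.stats(bedrooms)$out

print(type_counts)
print(bedroom_outliers)
```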

Modelling

Input Data

data <- read.csv(("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv"))
data <- data[data$price >= 1000, ]
str(data)
## 'data.frame':    168427 obs. of  20 variables:
##  $ property_id  : int  237062 346905 386513 656161 841645 850762 937975 1258636 1402466 1418706 ...
##  $ location_id  : int  3325 3236 764 340 3226 3390 445 3241 376 3282 ...
##  $ page_url     : chr  "https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html" "https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html" "https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html" "https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html" ...
##  $ property_type: chr  "Flat" "Flat" "House" "House" ...
##  $ price        : int  10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
##  $ location     : chr  "G-10" "E-11" "G-15" "Bani Gala" ...
##  $ city         : chr  "Islamabad" "Islamabad" "Islamabad" "Islamabad" ...
##  $ province_name: chr  "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" "Islamabad Capital" ...
##  $ latitude     : num  33.7 33.7 33.6 33.7 33.5 ...
##  $ longitude    : num  73 73 72.9 73.2 73.3 ...
##  $ baths        : int  2 3 6 4 3 8 8 2 7 5 ...
##  $ area         : chr  "4 Marla" "5.6 Marla" "8 Marla" "2 Kanal" ...
##  $ purpose      : chr  "For Sale" "For Sale" "For Sale" "For Sale" ...
##  $ bedrooms     : int  2 3 5 4 3 8 8 2 7 5 ...
##  $ date_added   : chr  "2/4/2019" "5/4/2019" "7/17/2019" "4/5/2019" ...
##  $ agency       : chr  "" "" "" "" ...
##  $ agent        : chr  "" "" "" "" ...
##  $ Area.Type    : chr  "Marla" "Marla" "Marla" "Kanal" ...
##  $ Area.Size    : num  4 5.6 8 2 8 1.6 1 6.2 1 1 ...
##  $ Area.Category: chr  "0-5 Marla" "5-10 Marla" "5-10 Marla" "1-5 Kanal" ...

The model uses the following variables: price, purpose, area type, area size, number of bathrooms, and number of bedrooms.

# Label each listing "Murah" (cheap) if its price is at or below 17767764,
# the mean price reported by summary(data) above, and "Mahal" (expensive) otherwise.
harga = ifelse(data$price <=17767764, "Murah", "Mahal")
# Add the label to the data set.
data=data.frame(data,harga)
# Drop identifier, location, and free-text columns, and drop price itself
# because the label is derived from it.
data.H <- data %>% select(-c(property_type,city,province_name,property_id,location_id,page_url,date_added,location,latitude,longitude,area,agency,agent,Area.Category,price))
# Code harga as a factor variable
data.H$harga = as.factor(data$harga)
class(data.H$harga)
## [1] "factor"
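The cut-off 17767764 is typed in by hand and appears to match the mean price of the unfiltered data shown by `summary(data)`. A minimal sketch of deriving the cut-off from the data instead, so that it follows any filtering applied beforehand (the toy `price` vector is invented for illustration):

```r
# Toy prices, invented for illustration.
price <- c(5e6, 8e6, 12e6, 60e6)

# Compute the cut-off from the data rather than typing it in.
cutoff <- mean(price)

# Fix the level order explicitly so "Mahal" stays the first (positive) class.
harga <- factor(ifelse(price <= cutoff, "Murah", "Mahal"),
                levels = c("Mahal", "Murah"))

print(cutoff)
print(table(harga))
```

Whether the mean or another statistic such as the median is the better cut-off is a modelling choice; the mean is used here only because the report uses it.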

Preprocessing

Split the data into training and test sets, 70% and 30% respectively

library(caret)

set.seed(234)
train <- createDataPartition(as.factor(data.H$harga), p=0.7, list=FALSE)

data.train=data.H[train,]
data.test=data.H[-train,]

harga.test=harga[-train]

head(data.H)
##   baths  purpose bedrooms Area.Type Area.Size harga
## 1     2 For Sale        2     Marla       4.0 Murah
## 2     3 For Sale        3     Marla       5.6 Murah
## 3     6 For Sale        5     Marla       8.0 Murah
## 4     4 For Sale        4     Kanal       2.0 Mahal
## 5     3 For Sale        3     Marla       8.0 Murah
## 6     8 For Sale        8     Kanal       1.6 Mahal
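`createDataPartition` samples within each class, so the training and test sets keep roughly the same class balance. The sketch below illustrates that idea in base R on toy labels; it is not caret's implementation, and the 27/73 balance is taken from the root node of the tree fitted later in this report:

```r
set.seed(234)

# Toy labels with a 27/73 class balance.
harga <- factor(rep(c("Mahal", "Murah"), times = c(27, 73)))

# Sample 70% of the row indices within each class.
idx_by_class <- split(seq_along(harga), harga)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Class shares in each part should be close to the overall shares.
train_share <- prop.table(table(harga[train_idx]))
test_share  <- prop.table(table(harga[-train_idx]))

print(train_share)
print(test_share)
```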

Hyperparameter Tuning

Hyperparameter tuning is performed to find the best combination of cp, minsplit, and maxdepth. Each combination is scored by its accuracy on data.test.

library(rpart)
library(caret)

cp_values <- seq(0.001, 0.02, by = 0.002)  
minsplit_values <- c(1,2,3,4,5,6,7,8,9,10)
maxdepth_values <- c(5, 10, 15)

best_acc <- 0
best_params <- list()

for (cp in cp_values) {
  for (minsplit in minsplit_values) {
    for (maxdepth in maxdepth_values) {
      
      control <- rpart.control(cp = cp, minsplit = minsplit, maxdepth = maxdepth)
      
      model <- rpart(harga ~ ., data = data.train, method = "class", control = control)
      
      pred <- predict(model, data.test, type = "class")
      acc <- mean(pred == data.test$harga)
      
      if (acc > best_acc) {
        best_acc <- acc
        best_params <- list(cp = cp, minsplit = minsplit, maxdepth = maxdepth)
      }
    }
  }
}

print(paste("Best Accuracy:", best_acc))
## [1] "Best Accuracy: 0.915174065351198"
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_params)
## $cp
## [1] 0.001
## 
## $minsplit
## [1] 1
## 
## $maxdepth
## [1] 10
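The grid search above selects the combination with the highest accuracy on `data.test`, and the same `data.test` is used later to report the final accuracy, so that figure is likely to be somewhat optimistic. One common alternative is to hold out a validation part of the training data for the search and keep the test set untouched. The sketch below shows the idea with `iris` as a stand-in data set and a deliberately small grid; both are assumptions made only so the example runs on its own:

```r
library(rpart)

set.seed(1)

# Stand-in for data.train: iris is used only so the sketch is self-contained.
idx_val     <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
inner_train <- iris[-idx_val, ]
inner_val   <- iris[idx_val, ]

# Small illustrative grid; the report's grid also varies minsplit.
grid <- expand.grid(cp = c(0.001, 0.01, 0.1), maxdepth = c(2, 5))
grid$acc <- NA_real_

for (i in seq_len(nrow(grid))) {
  ctrl <- rpart.control(cp = grid$cp[i], maxdepth = grid$maxdepth[i])
  fit  <- rpart(Species ~ ., data = inner_train, method = "class", control = ctrl)
  pred <- predict(fit, inner_val, type = "class")
  # Score on the validation part, never on the final test set.
  grid$acc[i] <- mean(pred == inner_val$Species)
}

best <- grid[which.max(grid$acc), ]
print(best)
```

After the search, the chosen settings would be refitted on the full training set and evaluated once on the test set.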

Initial Model

Based on the hyperparameter tuning results, the following values are used:

  • CP = 0.001

  • Minsplit = 1

  • Maxdepth = 10

library(rpart)
control= rpart.control(cp=0.001, minsplit = 1, maxdepth = 10, xval = 10, minbucket = 10)

fit.tree = rpart(harga ~ ., data=data.train, method = 'class', control = control)

fit.tree
## n= 117900 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 117900 32099 Murah (0.27225615 0.72774385)  
##     2) bedrooms>=3.5 45359 20734 Mahal (0.54289116 0.45710884)  
##       4) purpose=For Sale 37409 12784 Mahal (0.65826405 0.34173595)  
##         8) Area.Type=Kanal 11231   109 Mahal (0.99029472 0.00970528) *
##         9) Area.Type=Marla 26178 12675 Mahal (0.51581481 0.48418519)  
##          18) Area.Size>=8.55 13502  2463 Mahal (0.81758258 0.18241742) *
##          19) Area.Size< 8.55 12676  2464 Murah (0.19438309 0.80561691)  
##            38) Area.Size>=6.25 3716  1356 Murah (0.36490850 0.63509150)  
##              76) bedrooms>=5.5 768   350 Mahal (0.54427083 0.45572917)  
##               152) Area.Size>=7.7 467   159 Mahal (0.65952891 0.34047109) *
##               153) Area.Size< 7.7 301   110 Murah (0.36544850 0.63455150) *
##              77) bedrooms< 5.5 2948   938 Murah (0.31818182 0.68181818) *
##            39) Area.Size< 6.25 8960  1108 Murah (0.12366071 0.87633929) *
##       5) purpose=For Rent 7950     0 Murah (0.00000000 1.00000000) *
##     3) bedrooms< 3.5 72541  7474 Murah (0.10303139 0.89696861)  
##       6) Area.Size>=7.05 19599  4943 Murah (0.25220675 0.74779325)  
##        12) purpose=For Sale 10936  4943 Murah (0.45199342 0.54800658)  
##          24) Area.Size>=10.15 2117   513 Mahal (0.75767596 0.24232404) *
##          25) Area.Size< 10.15 8819  3339 Murah (0.37861436 0.62138564)  
##            50) bedrooms< 1.5 1550   679 Mahal (0.56193548 0.43806452)  
##             100) Area.Size>=8.45 945   359 Mahal (0.62010582 0.37989418) *
##             101) Area.Size< 8.45 605   285 Murah (0.47107438 0.52892562) *
##            51) bedrooms>=1.5 7269  2468 Murah (0.33952401 0.66047599)  
##             102) Area.Size< 9.9 5699  2092 Murah (0.36708194 0.63291806)  
##               204) Area.Size>=8.85 978   456 Mahal (0.53374233 0.46625767)  
##                 408) Area.Size< 8.95 227    49 Mahal (0.78414097 0.21585903) *
##                 409) Area.Size>=8.95 751   344 Murah (0.45805593 0.54194407)  
##                   818) Area.Size>=9.65 118    39 Mahal (0.66949153 0.33050847) *
##                   819) Area.Size< 9.65 633   265 Murah (0.41864139 0.58135861) *
##               205) Area.Size< 8.85 4721  1570 Murah (0.33255666 0.66744334) *
##             103) Area.Size>=9.9 1570   376 Murah (0.23949045 0.76050955) *
##        13) purpose=For Rent 8663     0 Murah (0.00000000 1.00000000) *
##       7) Area.Size< 7.05 52942  2531 Murah (0.04780703 0.95219297)  
##        14) Area.Type=Kanal 5980  1392 Murah (0.23277592 0.76722408)  
##          28) purpose=For Sale 1612   220 Mahal (0.86352357 0.13647643) *
##          29) purpose=For Rent 4368     0 Murah (0.00000000 1.00000000) *
##        15) Area.Type=Marla 46962  1139 Murah (0.02425365 0.97574635) *
library(rpart.plot)
rpart.plot(fit.tree, cex = 0.5)

The resulting tree classifies each property as “Murah” (cheap) or “Mahal” (expensive), where the cut-off between the two classes is the mean property price. The diagram can be interpreted as follows:

  • Root Node

The root node at the top contains all 117,900 training observations (100%), of which 73% are Murah. It splits on the number of bedrooms at 3.5, which for an integer count is the same as bedrooms >= 4. Among properties with at least 4 bedrooms the majority class is Mahal (probability 0.54); among properties with fewer than 4 bedrooms the majority class is Murah (probability 0.90). Both branches are then split further on other variables.

  • Branches and Splits

The variables used to split the properties are:

purpose (For Sale or For Rent): whether the property is for sale or for rent. Area.Type (Kanal or Marla): the unit in which the area is recorded. Area.Size: the size of the property in that unit. bedrooms: the number of bedrooms. Each branch adds a more specific decision rule, dividing the properties into progressively smaller groups.

  • Leaf Nodes

The nodes at the bottom of the tree give the final decision, Murah or Mahal, together with the class probability.

A probability close to 1 means that nearly all training observations in that leaf belong to the predicted class; for example, every For Rent listing with at least 4 bedrooms is Murah (node 5, probability 1.00).
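The class probabilities in the printed tree can be recomputed from the `n` and `loss` columns, where `loss` counts the observations in a node that do not belong to its predicted class. A short worked check for nodes 1 and 2, with the counts copied from the printed output:

```r
# Counts copied from the printed tree (node: n, loss).
n_root <- 117900; loss_root <- 32099   # node 1, predicted class Murah
n_2    <- 45359;  loss_2    <- 20734   # node 2 (bedrooms >= 3.5), predicted class Mahal

# loss counts the observations that are NOT in the predicted class.
p_mahal_root <- loss_root / n_root     # share of Mahal at the root
p_murah_2    <- loss_2 / n_2           # share of Murah in node 2
p_mahal_2    <- 1 - p_murah_2          # share of Mahal in node 2

print(round(c(p_mahal_root = p_mahal_root, p_mahal_2 = p_mahal_2), 4))
```

The two values agree with the `yprob` entries printed for those nodes.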

Variable Importance

fit.tree$variable.importance
## Area.Size  bedrooms   purpose     baths Area.Type 
## 13226.803 11940.732  9413.208  8661.697  6216.822
library(tibble)
library(forcats)
fit.tree$variable.importance %>% 
   data.frame() %>%
   rownames_to_column(var = "Feature") %>%
   rename(Overall = '.') %>%
   ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
   geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
   theme_minimal() +
   coord_flip() +
   labs(x = "", y = "", title = "Variable Importance with Simple Classification")

Based on the variable importance, the most important variable in the classification is Area.Size, while the variable with the smallest influence is the area type (Area.Type).
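The raw importance scores are easier to compare once they are rescaled to percentages. The sketch below does this with the values copied from the output above. Note that rpart's importance measure also credits variables that act as surrogate splits, which is one possible reason `baths` has a sizeable score even though the `printcp` output later in this report does not list it among the variables used in the tree.

```r
# Importance scores copied from fit.tree$variable.importance above.
importance <- c(Area.Size = 13226.803, bedrooms = 11940.732,
                purpose = 9413.208, baths = 8661.697, Area.Type = 6216.822)

# Rescale so the scores sum to 100.
importance_share <- 100 * importance / sum(importance)
importance_pct   <- round(importance_share, 1)

print(importance_pct)
```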

Classification Prediction

pred.tree_train = predict(fit.tree, data.train, type = "class")
pred.tree_test = predict(fit.tree, data.test, type = "class")

Accuracy and Evaluation

table(pred.tree_train,data.train$harga)
##                
## pred.tree_train Mahal Murah
##           Mahal 26308  3911
##           Murah  5791 81890
table(pred.tree_test,harga.test)
##               harga.test
## pred.tree_test Mahal Murah
##          Mahal 11180  1710
##          Murah  2576 35061
cm_train <- confusionMatrix(pred.tree_train, data.train$harga)
cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Mahal Murah
##      Mahal 26308  3911
##      Murah  5791 81890
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.9161, 0.9193)
##     No Information Rate : 0.7277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7885          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8196          
##             Specificity : 0.9544          
##          Pos Pred Value : 0.8706          
##          Neg Pred Value : 0.9340          
##              Prevalence : 0.2723          
##          Detection Rate : 0.2231          
##    Detection Prevalence : 0.2563          
##       Balanced Accuracy : 0.8870          
##                                           
##        'Positive' Class : Mahal           
## 
cm_test <- confusionMatrix(pred.tree_test, data.test$harga)
cm_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Mahal Murah
##      Mahal 11180  1710
##      Murah  2576 35061
##                                           
##                Accuracy : 0.9152          
##                  95% CI : (0.9127, 0.9176)
##     No Information Rate : 0.7277          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7816          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8127          
##             Specificity : 0.9535          
##          Pos Pred Value : 0.8673          
##          Neg Pred Value : 0.9316          
##              Prevalence : 0.2723          
##          Detection Rate : 0.2213          
##    Detection Prevalence : 0.2551          
##       Balanced Accuracy : 0.8831          
##                                           
##        'Positive' Class : Mahal           
## 
akurasi_train <- cm_train$overall['Accuracy']
balanced_accuracy.train <- cm_train$byClass['Balanced Accuracy']
f1_score_train <- cm_train$byClass['F1']

akurasi_test <- cm_test$overall['Accuracy']
balanced_accuracy.test <- cm_test$byClass['Balanced Accuracy']
f1_score_test <- cm_test$byClass['F1']

df_eval <- data.frame("Accuracy" = c(akurasi_train,akurasi_test),
          "Bal. Accuracy" = c(balanced_accuracy.train, balanced_accuracy.test),
          "F1 Score" = c(f1_score_train, f1_score_test))
rownames(df_eval) <- c("Data Train", "Data Test")
round(df_eval, 2)
##            Accuracy Bal..Accuracy F1.Score
## Data Train     0.92          0.89     0.84
## Data Test      0.92          0.88     0.84

Accuracy and F1 score are nearly identical on the training and test data, at 92% and 84% respectively. The accuracy is well above the no-information rate of 73%, and the small gap between the training and test results gives no sign of overfitting.
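The reported metrics follow directly from the four cells of the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1, and accuracy for the test set from the counts printed above, with Mahal as the positive class:

```r
# Test-set counts copied from the confusion matrix above; "Mahal" is the positive class.
tp <- 11180  # predicted Mahal, actually Mahal
fp <- 1710   # predicted Mahal, actually Murah
fn <- 2576   # predicted Murah, actually Mahal
tn <- 35061  # predicted Murah, actually Murah

precision <- tp / (tp + fp)                                  # Pos Pred Value
recall    <- tp / (tp + fn)                                  # Sensitivity
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

print(round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 4))
```

F1 is lower than accuracy here because it ignores the large number of correctly classified Murah listings and focuses on the minority Mahal class.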

Misclassification Error

conf_matrix_train <- table(pred.tree_train, data.train$harga)
print(conf_matrix_train)
##                
## pred.tree_train Mahal Murah
##           Mahal 26308  3911
##           Murah  5791 81890
conf_matrix_test <- table(pred.tree_test, harga.test)
print(conf_matrix_test)
##               harga.test
## pred.tree_test Mahal Murah
##          Mahal 11180  1710
##          Murah  2576 35061
accuracy_train <- sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
accuracy_test <- sum(diag(conf_matrix_test)) / sum(conf_matrix_test)

misclassification_rate_train <- 1 - accuracy_train
misclassification_rate_test <- 1 - accuracy_test

cat("Training error rate:", misclassification_rate_train * 100, "%\n")
## Training error rate: 8.229008 %
cat("Test error rate:", misclassification_rate_test * 100, "%\n")
## Test error rate: 8.482593 %

The training and test error rates are both fairly small, at roughly 8%.

Optimization

printcp(fit.tree)
## 
## Classification tree:
## rpart(formula = harga ~ ., data = data.train, method = "class", 
##     control = control)
## 
## Variables actually used in tree construction:
## [1] Area.Size Area.Type bedrooms  purpose  
## 
## Root node error: 32099/117900 = 0.27226
## 
## n= 117900 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.1844450      0   1.00000 1.00000 0.0047615
## 2 0.1206891      2   0.63111 0.63111 0.0040352
## 3 0.0141001      4   0.38973 0.38973 0.0032944
## 4 0.0059815      9   0.31923 0.32017 0.0030174
## 5 0.0015473     10   0.31325 0.31415 0.0029916
## 6 0.0010904     13   0.30861 0.30976 0.0029726
## 7 0.0010281     14   0.30752 0.30552 0.0029541
## 8 0.0010000     18   0.30225 0.30552 0.0029541
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
## [1] 0.001028069

The CP value with the lowest cross-validated error (0.001028) is practically the same as the value chosen by hyperparameter tuning (0.001), and both rows of the CP table have the same xerror (0.30552), so no additional model is needed.

Regression

After classification, a regression tree is fitted to predict the price of a property from the selected features.
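When the CP with the lowest cross-validated error does differ noticeably from the one used to grow the tree, the usual next step is to prune the tree back to that CP. A minimal sketch of that step, using the `kyphosis` data set that ships with rpart as a stand-in (an assumption made only so the example is self-contained):

```r
library(rpart)

set.seed(1)

# Grow a deliberately large tree on the stand-in data.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))

# Pick the CP with the lowest cross-validated error and prune to it.
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

print(best_cp)
print(c(nodes_full = nrow(fit$frame), nodes_pruned = nrow(pruned$frame)))
```

The pruned tree never has more nodes than the full tree; if both have the same size, pruning changed nothing, which corresponds to the situation described above.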

Input Data

data <- read.csv(("C:\\Users\\Ghonniyu\\Documents\\Semester 6\\TPM\\Pakistan house price dataset.csv"))
data <- as.data.frame(data)
head(data)
##   property_id location_id
## 1      237062        3325
## 2      346905        3236
## 3      386513         764
## 4      656161         340
## 5      841645        3226
## 6      850762        3390
##                                                                                                                                page_url
## 1                 https://www.zameen.com/Property/g_10_g_10_2_ground_floor_corner_apartment_with_green_lawn_for_sale-237062-3325-1.html
## 2                                    https://www.zameen.com/Property/e_11_2_services_society_flat_available_for_sale-346905-3236-1.html
## 3                                          https://www.zameen.com/Property/islamabad_g_15_house_is_available_for_sale-386513-764-1.html
## 4                   https://www.zameen.com/Property/islamabad_bani_gala_a_rare_minimalist_concept_in_a_quiet_location-656161-340-1.html
## 5                    https://www.zameen.com/Property/dha_valley_dha_homes_islamabad_dha_valley_8_marla_home_for_sale-841645-3226-1.html
## 6 https://www.zameen.com/Property/ghauri_town_ghauri_town_phase_1_house_is_available_for_sale_in_ghauri_town_phase_1-850762-3390-1.html
##   property_type    price    location      city     province_name latitude
## 1          Flat 10000000        G-10 Islamabad Islamabad Capital 33.67989
## 2          Flat  6900000        E-11 Islamabad Islamabad Capital 33.70099
## 3         House 16500000        G-15 Islamabad Islamabad Capital 33.63149
## 4         House 43500000   Bani Gala Islamabad Islamabad Capital 33.70757
## 5         House  7000000 DHA Defence Islamabad Islamabad Capital 33.49259
## 6         House 34500000 Ghauri Town Islamabad Islamabad Capital 33.62395
##   longitude baths      area  purpose bedrooms date_added        agency
## 1  73.01264     2   4 Marla For Sale        2   2/4/2019              
## 2  72.97149     3 5.6 Marla For Sale        3   5/4/2019              
## 3  72.92656     6   8 Marla For Sale        5  7/17/2019              
## 4  73.15120     4   2 Kanal For Sale        4   4/5/2019              
## 5  73.30134     3   8 Marla For Sale        3  7/10/2019 Easy Property
## 6  73.12659     8 1.6 Kanal For Sale        8   4/5/2019              
##                                          agent Area.Type Area.Size
## 1                                                  Marla       4.0
## 2                                                  Marla       5.6
## 3                                                  Marla       8.0
## 4                                                  Kanal       2.0
## 5 Muhammad Junaid Ceo Muhammad Shahid Director     Marla       8.0
## 6                                                  Kanal       1.6
##   Area.Category
## 1     0-5 Marla
## 2    5-10 Marla
## 3    5-10 Marla
## 4     1-5 Kanal
## 5    5-10 Marla
## 6     1-5 Kanal

Preprocessing

The same predictors are selected as in the classification stage, and price is kept as the response.

Cleaning

data <- data %>% select(-c(property_type,city,province_name,property_id,location_id,page_url,date_added,location,latitude,longitude,area,date_added,agency,agent,Area.Category))
any(is.na(data))
## [1] FALSE
data <- na.omit(data)
str(data)
## 'data.frame':    168446 obs. of  6 variables:
##  $ price    : int  10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
##  $ baths    : int  2 3 6 4 3 8 8 2 7 5 ...
##  $ purpose  : chr  "For Sale" "For Sale" "For Sale" "For Sale" ...
##  $ bedrooms : int  2 3 5 4 3 8 8 2 7 5 ...
##  $ Area.Type: chr  "Marla" "Marla" "Marla" "Kanal" ...
##  $ Area.Size: num  4 5.6 8 2 8 1.6 1 6.2 1 1 ...
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.4.2
data$purpose <- as.factor(data$purpose)
data$Area.Type <- as.factor(data$Area.Type)
str(data)
## 'data.frame':    168446 obs. of  6 variables:
##  $ price    : int  10000000 6900000 16500000 43500000 7000000 34500000 27000000 7800000 50000000 40000000 ...
##  $ baths    : int  2 3 6 4 3 8 8 2 7 5 ...
##  $ purpose  : Factor w/ 2 levels "For Rent","For Sale": 2 2 2 2 2 2 2 2 2 2 ...
##  $ bedrooms : int  2 3 5 4 3 8 8 2 7 5 ...
##  $ Area.Type: Factor w/ 2 levels "Kanal","Marla": 2 2 2 1 2 1 1 2 1 1 ...
##  $ Area.Size: num  4 5.6 8 2 8 1.6 1 6.2 1 1 ...

Split the data into training and test sets, 70% and 30% respectively

library(caret)
set.seed(046)
train <- createDataPartition((data$price), p = 0.7, list = FALSE)

#train = sample(1:nrow(data.H), 200)
data.train = data[train,]
data.test = data[-train,]

Building the Initial Model

fit.tree = rpart(price ~ ., data=data.train, method = 'anova')
fit.tree
## n= 117913 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 117913 1.496534e+20  17827750.00  
##    2) Area.Type=Marla 96465 1.763716e+19  11204870.00  
##      4) purpose=For Rent 24939 2.090168e+14     49106.58 *
##      5) purpose=For Sale 71526 1.345110e+19  15094560.00  
##       10) Area.Size< 8.55 52346 2.975806e+18  10342790.00 *
##       11) Area.Size>=8.55 19180 6.067618e+18  28063070.00 *
##    3) Area.Type=Kanal 21448 1.087547e+20  47614950.00  
##      6) purpose=For Rent 8547 6.069483e+14    229144.10 *
##      7) purpose=For Sale 12901 7.684804e+19  79008360.00  
##       14) Area.Size< 1.75 10210 1.386297e+19  60138720.00  
##         28) Area.Size< 1.15 8846 8.163188e+18  55083080.00 *
##         29) Area.Size>=1.15 1364 4.007345e+18  92926300.00 *
##       15) Area.Size>=1.75 2691 4.555647e+19 150602200.00  
##         30) bedrooms< 5.5 1428 2.194962e+19 123225400.00 *
##         31) bedrooms>=5.5 1263 2.132649e+19 181555600.00  
##           62) Area.Size< 3.65 1070 7.376142e+18 158045500.00 *
##           63) Area.Size>=3.65 193 1.008011e+19 311896400.00 *
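Each row of the printed tree lists the node number, the split, the number of observations `n`, the `deviance`, and the fitted value `yval`. For an `anova` tree, `yval` is the mean response in the node and `deviance` is the sum of squared deviations from that mean. The sketch below checks this on the built-in `mtcars` data, used here only as a stand-in, and then applies the same relation to the root node printed above.

```r
library(rpart)

# mtcars is only a stand-in; the housing data is not needed for this check
fit_demo <- rpart(mpg ~ ., data = mtcars, method = "anova")

# The first row of the frame describes the root node
root <- fit_demo$frame[1, ]

# yval of an anova node is the mean response in that node
root_mean <- mean(mtcars$mpg)

# deviance is the sum of squared deviations from that mean
root_sse <- sum((mtcars$mpg - root_mean)^2)

all.equal(root$yval, root_mean)
all.equal(root$dev, root_sse)

# Same relation for the housing tree: root deviance divided by n gives
# the mean squared deviation of price around its overall mean
root_node_error <- 1.496534e+20 / 117913
root_node_error
```

Dividing the root deviance by `n` gives about 1.27e+15, the variance of price that a model with no splits would leave unexplained.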

Visualizing the Initial Regression Tree

rpart.plot(fit.tree)

Variable Importance

# Checking the order of variable importance
fit.tree$variable.importance
##    Area.Size      purpose    Area.Type     bedrooms        baths 
## 4.595945e+19 3.609194e+19 2.326152e+19 1.332939e+19 1.008154e+19
fit.tree$variable.importance %>% 
   data.frame() %>%
   rownames_to_column(var = "Feature") %>%
   rename(Overall = '.') %>%
   ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
   geom_pointrange(aes(ymin = 0, ymax = Overall), color = "red", size = .3) +
   theme_minimal() +
   coord_flip() +
   labs(x = "", y = "", title = "Variable Importance of the Initial Regression Tree")

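According to the variable importance scores, Area.Size is the most important variable in the regression tree, followed by purpose and Area.Type, while baths has the smallest importance. baths does not appear in any split of the fitted tree, so its score comes entirely from surrogate splits.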
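The raw importance scores are on the scale of squared prices and are hard to compare by eye. One option is to express each as a share of the total. This is a small sketch that copies the printed scores by hand; the variable names `importance` and `importance_pct` are introduced only for illustration.

```r
# Importance scores copied from the output above
importance <- c(Area.Size = 4.595945e+19,
                purpose   = 3.609194e+19,
                Area.Type = 2.326152e+19,
                bedrooms  = 1.332939e+19,
                baths     = 1.008154e+19)

# Share of the total importance, in percent
importance_pct <- round(100 * importance / sum(importance), 1)
importance_pct

# Most and least important variables
most_important  <- names(which.max(importance))
least_important <- names(which.min(importance))
```

With the fitted object available, `fit.tree$variable.importance` can replace the hand-copied vector.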

Prediction

pred.tree_train = predict(fit.tree, data.train)
pred.tree_test = predict(fit.tree, data.test)

Accuracy and Evaluation

# Metrics functions take (actual, predicted); the actual values are the observed prices
# mape() divides by the actual price, so it is only finite when no price is 0
mse_train <- mse(data.train$price, pred.tree_train)
rmse_train <- rmse(data.train$price, pred.tree_train)
mape_train <- mape(data.train$price, pred.tree_train)
mae_train <- mae(data.train$price, pred.tree_train)

# Test-set predictions are compared with the test-set prices
mse_test <- mse(data.test$price, pred.tree_test)
rmse_test <- rmse(data.test$price, pred.tree_test)
mape_test <- mape(data.test$price, pred.tree_test)
mae_test <- mae(data.test$price, pred.tree_test)
df_eval <- data.frame("MSE" = c(mse_train, mse_test),
          "RMSE" = c(rmse_train,rmse_test),
          "MAPE" = c(mape_train,mape_test),
          "MAE" = c(mae_train, mae_test))
rownames(df_eval) <- c("Data Train", "Data Test")
round(df_eval, 2)
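To make the four metrics concrete, here is a base-R sketch of their definitions. The helper names (`mse_manual` and so on) and the toy values are hypothetical. The argument order `(actual, predicted)` follows the Metrics package. The last lines illustrate two pitfalls: MAPE changes when the arguments are swapped, and it comes out at almost exactly 1 when price-scale predictions are compared with a small-scale column such as an area size.

```r
# Base-R versions of the four metrics, in (actual, predicted) order
mse_manual  <- function(actual, predicted) mean((actual - predicted)^2)
rmse_manual <- function(actual, predicted) sqrt(mse_manual(actual, predicted))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))
mape_manual <- function(actual, predicted) mean(abs(actual - predicted) / abs(actual))

# Hypothetical toy values, not taken from the housing data
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

mse_manual(actual, predicted)
rmse_manual(actual, predicted)
mae_manual(actual, predicted)
mape_correct <- mape_manual(actual, predicted)  # a proportion, not a percent

# MSE, RMSE and MAE are symmetric in their arguments; MAPE is not
mape_swapped <- mape_manual(predicted, actual)

# Price-scale predictions compared with a small-scale column:
# every relative error is close to 1, whatever the model quality
price_pred <- c(1e7, 1e7, 2.8e7)
area_size  <- c(4, 5.6, 8)
mape_wrong_column <- mape_manual(price_pred, area_size)
mape_wrong_column
```

A MAPE near 1 together with an MAE near the mean predicted price is therefore a hint to check which column was passed as the actual values.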

Optimization

printcp(fit.tree)
## 
## Regression tree:
## rpart(formula = price ~ ., data = data.train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Area.Size Area.Type bedrooms  purpose  
## 
## Root node error: 1.4965e+20/117913 = 1.2692e+15
## 
## n= 117913 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.184318      0   1.00000 1.00002 0.039778
## 2 0.116460      2   0.63136 0.63146 0.034123
## 3 0.028711      3   0.51490 0.51667 0.029380
## 4 0.020549      5   0.45748 0.45929 0.029362
## 5 0.011309      7   0.41638 0.41868 0.026741
## 6 0.010000      8   0.40507 0.40817 0.026699
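The cp table can be converted into errors on the price scale: `rel error` multiplied by the root node error gives the training MSE for each tree size. The table also supports an alternative to taking the lowest `xerror`, namely the one-standard-error rule, which this analysis does not use. The sketch below copies the printed table by hand; `cp_tab` and the other names are introduced only for illustration.

```r
# cp table copied from the printcp() output above
cp_tab <- data.frame(
  CP        = c(0.184318, 0.116460, 0.028711, 0.020549, 0.011309, 0.010000),
  nsplit    = c(0, 2, 3, 5, 7, 8),
  rel_error = c(1.00000, 0.63136, 0.51490, 0.45748, 0.41638, 0.40507),
  xerror    = c(1.00002, 0.63146, 0.51667, 0.45929, 0.41868, 0.40817),
  xstd      = c(0.039778, 0.034123, 0.029380, 0.029362, 0.026741, 0.026699)
)

# Root node error = root deviance / n
root_node_error <- 1.496534e+20 / 117913

# Training MSE and RMSE for each tree size, on the price scale
cp_tab$train_mse  <- cp_tab$rel_error * root_node_error
cp_tab$train_rmse <- sqrt(cp_tab$train_mse)

# Row with the lowest cross-validated error (the rule used in this analysis)
best_row <- which.min(cp_tab$xerror)

# One-standard-error rule: smallest tree whose xerror is within one
# standard error of the minimum
threshold  <- cp_tab$xerror[best_row] + cp_tab$xstd[best_row]
one_se_row <- min(which(cp_tab$xerror <= threshold))

cp_tab[c(best_row, one_se_row), ]
```

Under the one-standard-error rule the 7-split tree (cp 0.011309) would be chosen instead of the 8-split tree. The difference in cross-validated error between the two is well within one standard error.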

Second Model

# Explicitly request the lowest cp value
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
## [1] 0.01
plotcp(fit.tree, upper = "splits")

bestcp <-fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
pruned.tree <- prune(fit.tree, cp = bestcp)
rpart.plot(fit.tree)

rpart.plot(pruned.tree)

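The initial regression tree and the pruned tree are identical. The cp with the lowest cross-validated error is 0.01, which is also the default cp that rpart used to grow the initial tree, so pruning at that value removes no splits.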
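Pruning can only change the tree if the tree was first grown past the cp that is later selected. A minimal sketch of that workflow, on simulated stand-in data (all names and the cp value of 0.0005 are hypothetical choices): grow a deliberately large tree with a cp below the default, then prune back to the cp with the lowest cross-validated error.

```r
library(rpart)

set.seed(46)

# Simulated stand-in data, not the housing data
n   <- 500
x1  <- runif(n)
x2  <- runif(n)
y   <- 10 * (x1 > 0.5) + 5 * (x2 > 0.3) + rnorm(n)
sim <- data.frame(y, x1, x2)

# Grow a deliberately large tree by lowering cp below the default of 0.01
deep_tree <- rpart(y ~ ., data = sim, method = "anova",
                   control = rpart.control(cp = 0.0005))

# Pick the cp with the lowest cross-validated error and prune back to it
cp_table <- deep_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(deep_tree, cp = best_cp)

# Number of terminal nodes before and after pruning
leaves_before <- sum(deep_tree$frame$var == "<leaf>")
leaves_after  <- sum(pruned$frame$var == "<leaf>")
c(before = leaves_before, after = leaves_after)
```

For the housing tree, the equivalent step would be refitting with a smaller `cp` in `rpart.control` before calling `prune`.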

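Several conclusions can be drawn from the tree:

  1. Properties with Area.Type Marla tend to be cheaper than those with Area.Type Kanal (mean price about 11.2 million versus 47.6 million).
  2. Listings for rent (For Rent) fall into terminal nodes with very low mean prices (about 49 thousand for Marla and 229 thousand for Kanal) and are not split further; area size and number of bedrooms only separate the For Sale listings.
  3. Among For Sale listings, a smaller area size goes with a lower price.
  4. The number of bedrooms only affects the price in the most expensive group (Kanal, For Sale, Area.Size >= 1.75).
  5. Very large properties (Kanal, For Sale, Area.Size >= 3.65, at least 6 bedrooms) reach a mean price of about 312 million, but they are rare: 193 of 117,913 training observations, which the plot rounds to 0%.

# Evaluate the pruned tree on the training and test sets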
pred.prune_train = predict(pruned.tree, data.train)
pred.prune_test = predict(pruned.tree, data.test)
# Compare the predictions with the observed prices, in (actual, predicted) order
mse_prune_train <- mse(data.train$price, pred.prune_train)
rmse_prune_train <- rmse(data.train$price, pred.prune_train)
mape_prune_train <- mape(data.train$price, pred.prune_train)
mae_prune_train <- mae(data.train$price, pred.prune_train)

mse_prune_test <- mse(data.test$price, pred.prune_test)
rmse_prune_test <- rmse(data.test$price, pred.prune_test)
mape_prune_test <- mape(data.test$price, pred.prune_test)
mae_prune_test <- mae(data.test$price, pred.prune_test)
df_prune_eval <- data.frame("MSE" = c(mse_prune_train, mse_prune_test),
          "RMSE" = c(rmse_prune_train,rmse_prune_test),
          "MAPE" = c(mape_prune_train,mape_prune_test),
          "MAE" = c(mae_prune_train, mae_prune_test))
rownames(df_prune_eval) <- c("Data Train", "Data Test")
round(df_prune_eval, 2)

The model is evaluated with four metrics: MSE, RMSE, MAPE, and MAE. Each of them is only meaningful when the predictions are compared with the observed price. If the predicted prices are instead compared with Area.Size, whose values are in single or double digits, the results carry a recognizable signature: the MAE collapses to roughly the mean predicted price (about 17.8 million) and the MAPE comes out at almost exactly 1, that is 100%, no matter how good the tree is. The MAPE returned by the Metrics package is a proportion, so a value of 1 should never be read as 1%.

A training error that is consistent with the fitted tree can be read from the printcp output. The root node error is 1.2692e+15 and the final relative error is 0.40507, so the training MSE is about 0.40507 × 1.2692e+15 ≈ 5.14e+14 and the training RMSE is about 22.7 million currency units. The cross-validated relative error of 0.40817 is almost the same, which suggests little overfitting. Even so, the tree explains only about 59% of the variance in price, and an RMSE of 22.7 million is larger than the mean price of 17.8 million. The model therefore still needs improvement, whether through feature selection, parameter tuning, or a more complex model.