HomeScape Architecture

by MF-Faqih

28 June 2023

Welcome to the HomeScape architecture notes. Here you can see everything that happens behind HomeScape: reading the data, building the machine learning models, building the choropleth map, and creating the plots used in the application. I hope that anyone who reads this and understands the project can develop the application even further.

IMPORT LIBRARY

Import library

library(dplyr)
library(GGally)
library(factoextra)
library(FactoMineR)
library(MLmetrics)
library(randomForest)
library(caret)
library(partykit)
library(ROCR)
library(leaflet)
library(rgdal)
library(stringr)#str_to_title
library(DT)

library(sp)
library(rgeos)
library(sf)

library(ggplot2)
library(scales)
library(plotly)
library(glue)
library(gganimate)

READ DATA

HomeScape uses two types of data: CSV data and JSON data. The CSV data was obtained by web scraping the lamudi.co.id website and is used for prediction, recommendation, and other analysis. The JSON data was downloaded from gadm.org and processed through mapshaper.org.

Map Data

Read the json data

jakarta_json <- rgdal::readOGR("gadm41_IDN_3.json")
#> OGR data source with driver: TopoJSON 
#> Source: "C:\Faqih's file\Algoritma School\Final Project\gadm41_IDN_3.json", layer: "gadm41_IDN_3"
#> with 47 features
#> It has 17 fields

After reading the data, we need to convert it to an sf object. This step is important because readOGR returns an sp (Spatial) object, which cannot be preprocessed directly with dplyr verbs. After the conversion, we do some preprocessing, such as removing unused rows and columns.

jakarta_json_mod <- sf::st_as_sf(jakarta_json)

#Removing `Kepulauan Seribu` and some columns we won't use 
jakarta_json_mod <- jakarta_json_mod %>% 
  # Normalize sub-district names: remove spaces and convert to title case
  mutate(NAME_3 = str_replace_all(NAME_3, fixed(" "), "") %>% str_to_title()) %>%       
  # Removing `Kepulauan Seribu`
  filter(NAME_2 != "Kepulauan Seribu")  %>%                        
  # Removing some columns
  dplyr::select(-c(id, NL_NAME_1, NL_NAME_2, NL_NAME_3, VARNAME_3, HASC_3, TYPE_3, ENGTYPE_3)) 

glimpse(jakarta_json_mod)
#> Rows: 45
#> Columns: 10
#> $ GID_3    <chr> "IDN.7.1.1_1", "IDN.7.1.2_1", "IDN.7.1.3_1", "IDN.7.1.4_1", "…
#> $ GID_0    <chr> "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN"…
#> $ COUNTRY  <chr> "Indonesia", "Indonesia", "Indonesia", "Indonesia", "Indonesi…
#> $ GID_1    <chr> "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7…
#> $ NAME_1   <chr> "Jakarta Raya", "Jakarta Raya", "Jakarta Raya", "Jakarta Raya…
#> $ GID_2    <chr> "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_…
#> $ NAME_2   <chr> "Jakarta Barat", "Jakarta Barat", "Jakarta Barat", "Jakarta B…
#> $ NAME_3   <chr> "Cengkareng", "Grogolpetamburan", "Kalideres", "Kebonjeruk", …
#> $ CC_3     <chr> "3174070", "3174040", "3174080", "3174020", "3174010", "31740…
#> $ geometry <MULTIPOLYGON> MULTIPOLYGON (((106.7004 -6..., MULTIPOLYGON (((106.…

Apartment Data

Read the dataset. As the glimpse() output further below shows, the raw file contains 2,969 rows and 7 columns.

apartment <- read.csv("apartment_jakarta.csv")
datatable(apartment, options = list(scrollX = T))

Column description:

  • Name: Apartment name or the listing title given by the seller
  • Address: Apartment location (sub-district and district)
  • Bedroom: Number of bedrooms
  • Bathroom: Number of bathrooms
  • Total_area: Total area of the apartment (in m²)
  • Price: Apartment price (in rupiah)
  • Phone: Phone number of the contact person

DATA PREPROCESSING

Garbage in, garbage out. Data from any source can contain unusual values, called outliers, that degrade the analysis. An outlier is a value that differs markedly from the rest of the data. For example, apartment prices generally range from 300 million to 10 billion rupiah (depending on the location), so an apartment priced at, say, 100 million rupiah or less can be treated as an outlier and handled during preprocessing. Preprocessing is not only about outliers: correcting data types, feature selection, and feature engineering are also part of it. In short, data preprocessing is the set of steps and techniques used to prepare raw data for analysis or machine learning. It transforms and cleans the data to ensure its quality, consistency, and compatibility with the chosen analysis or machine learning algorithms.

Data Coercion

Before any other preprocessing, we need to do data coercion. Data coercion converts each column to its appropriate data type. Later steps only give correct results if the types are correct; for example, we cannot apply mathematical operations such as addition, subtraction, or square roots to character data.

glimpse(apartment)
#> Rows: 2,969
#> Columns: 7
#> $ Name       <chr> "Ciputra International 1 Bedroom Ruang Keluarga Luas Balkon…
#> $ Address    <chr> "Jakarta Barat", "Karet Semanggi, Jakarta Selatan", "Karet …
#> $ Bedroom    <int> 1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 2, 3, 1, 1, 2, 2, 1, 1, 2, 1,…
#> $ Bathroom   <chr> "1", "1", "1", "2", "3", "2", "1", "1", "1", "2", "1", "3",…
#> $ Total_area <chr> "52 m²", "60 m²", "40 m²", "86 m²", "117 m²", "93 m²", "30 …
#> $ Price      <chr> " 1,505,000,000", " 3,000,000,000", " 2,000,000,000", " 3,7…
#> $ Phone      <chr> "+6288212204495", "+6281191415518", "+6281191507415", "+628…

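From the result above, we can see that the Bathroom, Total_area, and Price columns do not have the correct data type yet: they are stored as character but should be numeric. Bedroom is stored as integer and will be converted to numeric as well.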

First, I remove every row whose Bathroom value contains a letter, because this column must contain only numbers.

apartment <- apartment[!grepl("[A-Za-z]", apartment$Bathroom), ]

The Total_area column will be converted to numeric, so we first remove the unit text it contains.

apartment$Total_area <- gsub(" m²", "", apartment$Total_area)

As with Total_area, the Price column must contain only digits before it can be converted to numeric, so we remove the comma separators.

apartment$Price <- gsub(",", "", apartment$Price)

Convert the columns to the numeric data type

apartment <- apartment %>% 
  mutate(Bedroom = as.numeric(Bedroom),
         Bathroom = as.numeric(Bathroom),
         Total_area = as.numeric(Total_area),
         Price = as.numeric(Price))
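The cleaning and conversion steps above can be tried on a few hand-made values. This is a minimal sketch in base R; the sample strings are invented for illustration and only mimic the format seen in the glimpse() output.

```r
# Hypothetical sample values that mimic the raw columns
bathroom   <- c("1", "2", "1 Bathroom")
total_area <- c("52 m²", "60 m²", "117 m²")
price      <- c(" 1,505,000,000", " 3,000,000,000", " 2,000,000,000")

# Arithmetic on character data fails, which is why coercion is needed
try(total_area + 1, silent = TRUE)

# Flag entries that contain letters (the document drops such rows)
has_letters <- grepl("[A-Za-z]", bathroom)

# Strip the unit and the thousands separators, then convert to numeric
area_num  <- as.numeric(gsub(" m²", "", total_area, fixed = TRUE))
price_num <- as.numeric(gsub(",", "", price, fixed = TRUE))

has_letters
area_num
price_num
```

Checking the result with is.numeric() before moving on is a cheap way to confirm that no column was left as character.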

Feature Engineering

Feature engineering is the process of creating new features or transforming existing features from raw data to improve the performance of machine learning models. It involves selecting, creating, or modifying features to make them more informative and representative of the underlying patterns or relationships in the data.

In this project we will create two new columns, District and Sub_district, both extracted from the Address column.

Create a new column that contains only the district name

apartment$District <- sapply(strsplit(apartment$Address, ",\\s*"), "[", 2)

Some rows in the Address column contain no sub-district name, so the split yields an NA (empty) value in the District column. We will remove all rows with NA values.

apartment <- apartment %>% 
  na.omit()

Create a new column that contains only the sub-district name

apartment$Sub_district <- sapply(strsplit(apartment$Address, ",\\s*"), "[", 1)
apartment <- apartment %>% 
  mutate(District = as.factor(District),
         Sub_district = as.factor(Sub_district))
apartment <- apartment %>% 
  mutate(Sub_district = gsub(" ", "", Sub_district),
         Sub_district = stringr::str_to_title(Sub_district),
         Sub_district = as.factor(Sub_district))
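To see what the split and the name clean-up do, here is a small base R sketch on made-up addresses. The title-casing helper is a stand-in written for this example (the document itself uses stringr::str_to_title), and it only behaves the same for labels that are a single word after the spaces are removed.

```r
# Hypothetical addresses in the "Sub district, District" format
address <- c("Karet Semanggi, Jakarta Selatan",
             "Kelapa Gading, Jakarta Utara",
             "Jakarta Barat")   # no sub-district given

parts <- strsplit(address, ",\\s*")

# Second element = district; NA when the address has no comma
district <- sapply(parts, "[", 2)
# First element = sub-district
sub_district <- sapply(parts, "[", 1)

# Normalize: drop spaces, then upper-case only the first letter
no_space <- gsub(" ", "", sub_district)
sub_district_norm <- paste0(toupper(substring(no_space, 1, 1)),
                            tolower(substring(no_space, 2)))

data.frame(address, district, sub_district_norm)
```

The third row shows why na.omit() is needed: an address without a comma has no second element, so its district is NA.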

Handling Abnormal Values

I’ll derive two data frames from the original data (apartment). The first (apartment_raw) will be used for building the map and the recommendations, and the second (apartment_clean) will be used for building the prediction model. The difference between them is that the first retains as much data as possible, so that the recommendation list reflects the apartments actually available on the market.

According to the RoomSketcher website, apartments can be grouped by floor area: studio apartments of about 25-44 m², 1-bed apartments of about 45-75 m², 2-bed apartments of about 55-95 m², 3-bed apartments of about 76-130 m², and 4-bed apartments of about 90-150 m². There is also a type smaller than a studio, the micro apartment, of about 12-24 m². So we keep only apartments with a total area between 12 and 150 m².

First data frame for map and recommendation

apartment_raw <- apartment %>% 
  filter(Total_area >= 12,
         Total_area <= 150,
         Bedroom <= 4,
         Bathroom <= Bedroom)
summary(apartment_raw)
#>      Name             Address             Bedroom         Bathroom    
#>  Length:2328        Length:2328        Min.   :1.000   Min.   :1.000  
#>  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
#>  Mode  :character   Mode  :character   Median :2.000   Median :1.000  
#>                                        Mean   :1.954   Mean   :1.384  
#>                                        3rd Qu.:2.000   3rd Qu.:2.000  
#>                                        Max.   :4.000   Max.   :4.000  
#>                                                                       
#>    Total_area         Price                Phone          
#>  Min.   : 14.00   Min.   :    8500000   Length:2328       
#>  1st Qu.: 35.00   1st Qu.:  500000000   Class :character  
#>  Median : 49.00   Median :  950000000   Mode  :character  
#>  Mean   : 63.44   Mean   : 1517959950                     
#>  3rd Qu.: 88.00   3rd Qu.: 2000000000                     
#>  Max.   :150.00   Max.   :37500000000                     
#>                                                           
#>             District         Sub_district 
#>  Jakarta Barat  :375   Kelapagading: 210  
#>  Jakarta Pusat  :369   Kuningan    : 135  
#>  Jakarta Selatan:865   Kemayoran   : 114  
#>  Jakarta Timur  :217   Setiabudi   : 103  
#>  Jakarta Utara  :502   Cempakaputih:  88  
#>                        Kalibata    :  70  
#>                        (Other)     :1608

Second data frame for building the model

apartment_clean <- apartment %>% 
  filter(Bedroom <= 4,
         Total_area >= 12,
         Total_area <= 150,
         Bathroom <= Bedroom,
         Price > 300000000,
         
         !(Total_area >= 12 & Total_area <= 54 & Bedroom > 1),
         !(Total_area >= 55 & Total_area <= 75 & Bedroom > 2),
         !(Total_area >= 76 & Total_area <= 90 & Bedroom < 2),
         !(Total_area >= 76 & Total_area <= 90 & Bedroom > 3),
         !(Total_area >= 91 & Total_area <= 95 & Bedroom < 2),
         !(Total_area >= 91 & Total_area <= 95 & Bedroom > 4),
         !(Total_area >= 96 & Total_area <= 150 & Bedroom < 3),
         !(Total_area >= 96 & Total_area <= 150 & Bedroom > 4),
         
         !(Total_area >= 25 & Total_area <= 44 & Sub_district == "Pasar Minggu" & Price == 2749000000),
         !(Total_area >= 25 & Total_area <= 44 & Sub_district == "Kuningan" & Price == 2500000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "SetiaBudi" & Price == 3500000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "Tebet" & Price == 3750000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "Kuningan" & Price == 3700000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "Sudirman" & Price == 4500000000),
         !(Total_area >= 55 & Total_area <= 95 & Sub_district == "Cilandak" & Price == 1750000000),
         !(Total_area >= 90 & Total_area <= 150 & Sub_district == "Kemang" & Price == 7900000000),
         !(Total_area >= 90 & Total_area <= 150 & Sub_district == "SetiaBudi" & Price > 7000000000),
         !(Total_area >= 90 & Total_area <= 150 & Sub_district == "Kebayoran Baru" & Price >= 9000000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "Thamrin" & Price >= 37500000000),
         !(Total_area >= 90 & Total_area <= 150 & Sub_district == "Tanah Abang" & Price >= 7750000000),
         !(Total_area >= 45 & Total_area <= 75 & Sub_district == "Kelapa Gading" & Price >= 3000000000),
         !(Total_area >= 55 & Total_area <= 95 & Sub_district == "Kelapa Gading" & Price >= 3250000000)
         
         )
summary(apartment_clean)
#>      Name             Address             Bedroom        Bathroom   
#>  Length:1253        Length:1253        Min.   :1.00   Min.   :1.00  
#>  Class :character   Class :character   1st Qu.:1.00   1st Qu.:1.00  
#>  Mode  :character   Mode  :character   Median :2.00   Median :1.00  
#>                                        Mean   :1.93   Mean   :1.52  
#>                                        3rd Qu.:3.00   3rd Qu.:2.00  
#>                                        Max.   :4.00   Max.   :4.00  
#>                                                                     
#>    Total_area         Price               Phone                      District  
#>  Min.   : 17.00   Min.   : 310000000   Length:1253        Jakarta Barat  :223  
#>  1st Qu.: 37.00   1st Qu.: 800000000   Class :character   Jakarta Pusat  :212  
#>  Median : 69.00   Median :1450000000   Mode  :character   Jakarta Selatan:556  
#>  Mean   : 71.53   Mean   :1762750211                      Jakarta Timur  : 73  
#>  3rd Qu.: 94.00   3rd Qu.:2500000000                      Jakarta Utara  :189  
#>  Max.   :150.00   Max.   :9000000000                                           
#>                                                                                
#>        Sub_district
#>  Kuningan    :116  
#>  Setiabudi   : 86  
#>  Kemayoran   : 76  
#>  Kelapagading: 71  
#>  Kembangan   : 38  
#>  Tanjungduren: 36  
#>  (Other)     :830

From the summaries above, we get some interesting insights:

  • The most common number of bedrooms among the apartments on the market is 2.
  • 3-bedroom and 4-bedroom apartments are the least common types (75% of the Total_area values are below 100 m²).
  • More apartments are listed for sale in Jakarta Selatan than in any other district.
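One detail in the apartment_clean filter may be worth double-checking. Sub_district was normalized earlier (spaces removed, title case), while some conditions in the filter compare it with spaced or mixed-case labels such as "Kelapa Gading" or "SetiaBudi". The sketch below uses made-up values to show that such a comparison never matches after normalization. Whether this affects the real data is an assumption that should be verified against the actual column; the normalize() helper is hypothetical.

```r
# Made-up labels in the normalized style (no spaces, title case)
sub_district <- c("Kelapagading", "Setiabudi", "Kuningan", "Tebet")

# Comparisons against spaced or mixed-case labels never match
spaced_match <- sub_district == "Kelapa Gading"
mixed_match  <- sub_district == "SetiaBudi"

# Hypothetical helper that applies the same normalization to a label
normalize <- function(x) {
  x <- gsub(" ", "", x)
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}
fixed_match <- sub_district == normalize("Kelapa Gading")

any(spaced_match)
any(mixed_match)
which(fixed_match)
```

Running table() or levels() on the real Sub_district column would show which spellings actually occur.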

K-MEANS CLUSTERING

Unsupervised learning is a machine learning approach in which a model is trained on unlabeled data, without a target variable or known outputs. This project uses one unsupervised method, K-means clustering.

K-means clustering groups the data into clusters with similar characteristics. The algorithm works by computing the distance from each data point to the center of each cluster (the centroid) and assigning the point to the nearest one. In this project we use it to recommend apartments with similar characteristics. If you want to know more about K-means clustering, you can follow this link.

First we need to apply one-hot encoding to every column with the factor data type (the District and Sub_district columns).

apartment_cluster <- apartment_raw

apartment_District <- apartment_cluster$District

apartment_Subdistrict <- apartment_cluster$Sub_district

District_encoded <- model.matrix(~ apartment_District - 1)

Subdistrict_encoded <- model.matrix(~ apartment_Subdistrict - 1)
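As a small illustration of what model.matrix(~ x - 1) produces, the sketch below encodes a made-up factor. The values are invented; only the technique matches the step above.

```r
# Hypothetical factor with three districts
district <- factor(c("Jakarta Barat", "Jakarta Selatan",
                     "Jakarta Barat", "Jakarta Utara"))

# "- 1" removes the intercept so every level gets its own 0/1 column
encoded <- model.matrix(~ district - 1)

dim(encoded)       # one row per observation, one column per level
rowSums(encoded)   # each row contains exactly one 1
colnames(encoded)
```

Each level becomes its own column, so a factor with many levels (such as Sub_district) adds many columns to the clustering input.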

Scaling

As mentioned earlier, K-means clustering works by computing the distance from each data point to the cluster centers, so columns on very different scales will distort the result. To handle this, we scale the data so that every numeric column is on the same scale (or at least so that the differences in scale are minimal).

apartment_scale <- apartment_cluster %>% 
  select(c(Bedroom, Bathroom, Total_area, Price)) %>% 
  scale()
summary(apartment_scale)
#>     Bedroom            Bathroom         Total_area          Price        
#>  Min.   :-1.29645   Min.   :-0.6498   Min.   :-1.3738   Min.   :-0.9447  
#>  1st Qu.:-1.29645   1st Qu.:-0.6498   1st Qu.:-0.7902   1st Qu.:-0.6371  
#>  Median : 0.06307   Median :-0.6498   Median :-0.4012   Median :-0.3555  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.06307   3rd Qu.: 1.0423   3rd Qu.: 0.6825   3rd Qu.: 0.3017  
#>  Max.   : 2.78212   Max.   : 4.4266   Max.   : 2.4053   Max.   :22.5197

As we can see above, the data has been scaled: every column now has a mean of 0. You can compare these numbers with the data before scale() was applied.
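By default, scale() computes a z-score for every column, z = (x - mean) / sd. A minimal sketch on invented numbers shows the effect:

```r
# Hypothetical values on very different scales
toy <- data.frame(Bedroom = c(1, 2, 3, 2),
                  Price   = c(5e8, 1.5e9, 4e9, 9.5e8))

# scale() centers each column on its mean and divides by its sd
toy_scaled <- scale(toy)

# The same z-score computed by hand for one column
price_z <- (toy$Price - mean(toy$Price)) / sd(toy$Price)

round(colMeans(toy_scaled), 10)   # column means after scaling
apply(toy_scaled, 2, sd)          # column standard deviations after scaling
```

After scaling, a difference of one bedroom and a difference of billions of rupiah are expressed in comparable units, so neither dominates the distance calculation.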

Combine all columns into one data frame.

apartment_z <- cbind(apartment_scale, District_encoded, Subdistrict_encoded)
apartment_z <- as.data.frame(apartment_z)

Find Optimum Number of K

After combining all columns, we can perform K-means clustering. But first we need to find the optimum value of K (the number of groups to split the data into).

fviz_nbclust(x = apartment_z,
             FUNcluster = kmeans,
             method = "wss")

From the graph above, I judge 5 to be the best number of clusters, because the total WSS no longer changes significantly between k = 5 and k = 6. So we will use 5 as our K.
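The curve drawn by the "wss" method can be reproduced by hand, which makes the elbow idea easier to inspect. The sketch below uses simulated data with three obvious groups, not the apartment data, and nstart = 10 is a choice made here to reduce the effect of unlucky starting centroids.

```r
set.seed(100)

# Toy data: three well-separated groups in two dimensions (made up)
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 5),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

round(wss, 1)
# Reduction in WSS gained by each additional cluster;
# the elbow is where these gains become small
round(-diff(wss), 1)
```

On this toy data the gains are large up to k = 3 and small afterwards, which is the pattern to look for in the real plot.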

Clustering

RNGkind(sample.kind = "Rounding")
set.seed(100)

apartment_km <- kmeans(x = apartment_z,
                       centers = 5)

Model Evaluation

Model evaluation shows how good our clustering is. We look at two quantities: the total within-cluster sum of squares (the sum of squared distances from each data point to its centroid; the smaller it is, the more similar the points within each cluster are to each other), and the ratio of the between-cluster sum of squares to the total sum of squares (the closer to 1, the better).

apartment_km$tot.withinss
#> [1] 6329.603
apartment_km$betweenss / apartment_km$totss
#> [1] 0.5251956

Add the clustering result to our data frame

apartment_raw$Cluster <- apartment_km$cluster
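The two quantities used in the model evaluation step are linked: the total sum of squares splits into a within-cluster part and a between-cluster part. A short sketch on simulated data (not the apartment data) makes the relationship visible:

```r
set.seed(100)

# Made-up data with two groups
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 10)

# Total SS = within-cluster SS + between-cluster SS
km$totss
km$tot.withinss + km$betweenss

# Share of the total variation explained by the grouping
ratio <- km$betweenss / km$totss
ratio
```

Because of this identity, a smaller tot.withinss and a ratio closer to 1 are two views of the same improvement.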

RANDOM FOREST

Random forest is a machine learning model built from many decision trees. It combines the outputs of the individual trees into a single prediction (for a regression task such as price prediction, by averaging them), and it is relatively robust to outliers. This model suits apartment price prediction because the relationship between the predictors and the target variable is not linear (for example, some people may sell a 2-bed apartment for more than 2 billion while others sell a similar apartment for less than 2 billion, because they want to sell it as fast as possible). If you want to know more about how random forest works, you can see here.

Cross Validation

Here, cross validation simply means dividing our data into a training set and a test set. The training set (as its name suggests) is used to train the model, while the test set acts as unseen data for testing the model, so we can tell whether the model is good or bad. I’ll use 80% of the data as the training set and the rest as the test set.

In our data, the Price column is the target variable (we will predict the price of the apartment), while the other columns (except Name, Address, and Phone) are the predictors. But first I need to apply a logarithmic transformation to the target variable (the Price column), because its distribution is skewed, as you can see in the plot below.

set.seed(123)

index <- sample(nrow(apartment_clean), 0.8*nrow(apartment_clean))

data_train <- apartment_clean[index,]
data_train <- data_train %>% 
  select(c(Bedroom, Bathroom, Total_area, District, Sub_district, Price)) %>% 
  mutate(Price = log(Price))

data_test <- apartment_clean[-index,]
data_test <- data_test %>% 
  select(c(Bedroom, Bathroom, Total_area, District, Sub_district, Price)) %>% 
  mutate(Price = log(Price))
summary(data_train)
#>     Bedroom         Bathroom       Total_area                District  
#>  Min.   :1.000   Min.   :1.000   Min.   : 17.00   Jakarta Barat  :190  
#>  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 36.25   Jakarta Pusat  :162  
#>  Median :2.000   Median :1.000   Median : 67.50   Jakarta Selatan:434  
#>  Mean   :1.921   Mean   :1.525   Mean   : 71.18   Jakarta Timur  : 62  
#>  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.: 94.00   Jakarta Utara  :154  
#>  Max.   :4.000   Max.   :4.000   Max.   :150.00                        
#>                                                                        
#>        Sub_district     Price      
#>  Kuningan    : 86   Min.   :19.55  
#>  Setiabudi   : 70   1st Qu.:20.45  
#>  Kelapagading: 55   Median :21.06  
#>  Kemayoran   : 55   Mean   :21.03  
#>  Kembangan   : 33   3rd Qu.:21.64  
#>  Tanjungduren: 31   Max.   :22.80  
#>  (Other)     :672

Look at the Price column: after the transformation its distribution is much more balanced (close to a normal distribution; if you want to know more about the normal distribution, you can see here), as you can see in the plot below.
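The effect of the log transformation can also be checked numerically. The sketch below uses simulated right-skewed prices rather than the real data, and the skewness() helper is defined here only for illustration.

```r
set.seed(123)

# Simulated right-skewed prices (log-normal), not the real data
price <- rlnorm(1000, meanlog = 21, sdlog = 0.7)

# Simple moment-based skewness (hypothetical helper)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)
skew_log <- skewness(log(price))

c(raw = skew_raw, log = skew_log)

# Predictions made on the log scale are mapped back with exp()
back <- exp(log(price))
all.equal(back, price)
```

A skewness near 0 after the transformation indicates a roughly symmetric distribution. The last two lines show why exp() is applied to the predictions before the errors are computed in rupiah.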

First Model

Our first model is built as shown below. The two settings to pay attention to are number and repeats. number is the number of folds used in k-fold cross validation, which caret performs automatically to test which parameter value gives the best result. repeats is how many times the whole k-fold cross validation is repeated. In our first model we use 5 as number and 3 as repeats.

#ctrl <- trainControl(method = "repeatedcv",
#                     number = 5,
#                     repeats = 3)


#Train random forest model
#fb_forest <- train(Price ~ .,
#                   data = data_train,
#                   method = "rf",
#                   trControl = ctrl)

#saveRDS(fb_forest, "apartment_forest.RDS")
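To make number = 5 and repeats = 3 concrete, the sketch below builds the resamples by hand in base R. It is a simplified illustration: caret's own fold construction may differ in detail, and n = 1002 is taken from the sample count that caret prints for this model.

```r
set.seed(123)

n       <- 1002   # number of training rows
number  <- 5      # folds per repetition
repeats <- 3      # repetitions of the whole k-fold procedure

# For each repetition, shuffle a balanced vector of fold labels
train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold <- sample(rep(seq_len(number), length.out = n))
  # Each fold is held out once; the remaining rows are used for fitting
  sapply(seq_len(number), function(k) sum(fold != k))
}))

length(train_sizes)   # number * repeats resamples in total
range(train_sizes)    # each fit uses roughly 80% of the rows
```

Every candidate parameter value is therefore fitted 15 times, which is why training takes a while.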

To save time (building the model takes a while), I’ll just read the model that I built before.

apartment_forest <- readRDS("apartment_forest.RDS")
print(apartment_forest)
#> Random Forest 
#> 
#> 1002 samples
#>    5 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 801, 802, 801, 802, 802, 799, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE       Rsquared   MAE      
#>     2   0.6445281  0.6438033  0.5302139
#>    80   0.3267135  0.8001731  0.2439477
#>   159   0.3315215  0.7945110  0.2453422
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 80.

See the variability explained by the model.

apartment_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: regression
#>                      Number of trees: 500
#> No. of variables tried at each split: 80
#> 
#>           Mean of squared residuals: 0.1050513
#>                     % Var explained: 80.2

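Predict the unseen test data we created before, then look at the RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error). The predictions and the actual values are transformed back with exp(), so the errors are expressed in rupiah.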

apartment_predict_forest <- predict(apartment_forest, newdata = data_test %>% select(-Price))
postResample(exp(apartment_predict_forest), exp(data_test$Price))
#>             RMSE         Rsquared              MAE 
#> 740877346.369777         0.677093 437207150.928202

See the standard deviation of the predictions

sd(apartment_predict_forest)
#> [1] 0.6024655

What we get from our first model:

  • An RMSE of 740877346.4
  • An R-squared of 0.68
  • A standard deviation of the predictions (on the log scale) of 0.6024655
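For reference, the three metrics can be computed by hand. The sketch below uses made-up numbers; the R-squared here is the squared correlation between predictions and observations, which is to my understanding how postResample reports it, so treat that detail as an assumption.

```r
# Made-up observed and predicted values
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)

rmse     <- sqrt(mean((pred - obs)^2))   # penalizes large errors more
mae      <- mean(abs(pred - obs))        # average absolute error
rsquared <- cor(pred, obs)^2             # squared correlation

c(RMSE = rmse, Rsquared = rsquared, MAE = mae)
```

RMSE is never smaller than MAE, and a large gap between them suggests a few large errors, which fits price data with some very expensive listings.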

Second Model

This is our second model. We still use the same number and repeats, but with some changes. As we can see below, we use tuneLength in the model and random search in trainControl. The tuneLength parameter determines how many hyperparameter combinations are evaluated during tuning. Hyperparameters are parameters that are not learned from the data but set by the user before training. For random forest in general, examples include the number of trees in the forest, the maximum depth of each tree, and the number of features considered at each split. With caret's method = "rf", the tuned hyperparameter is the last one (mtry), as the output below shows.

#ctrl <- trainControl(method = "repeatedcv",
#                     number = 5,
#                     repeats = 3,
#                     search = 'random')


#set.seed(123)

#fb_forest <- train(Price ~ .,
#                   data = data_train, 
#                   method = "rf",
#                   trControl = ctrl,
#                   tuneLength = 15,
#                   metric = 'RMSE')

#saveRDS(fb_forest, "apartment_forest2.RDS")
apartment_forest2 <- readRDS("apartment_forest2.RDS")
print(apartment_forest2)
#> Random Forest 
#> 
#> 1002 samples
#>    5 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 803, 801, 801, 800, 803, 802, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE       Rsquared   MAE      
#>     8   0.4080778  0.7232129  0.3286968
#>    17   0.3550654  0.7711534  0.2808036
#>    66   0.3263386  0.8001292  0.2447009
#>    73   0.3264412  0.8000018  0.2442177
#>    84   0.3272976  0.7988862  0.2445491
#>    88   0.3271607  0.7991070  0.2441659
#>    92   0.3271358  0.7990549  0.2438568
#>   108   0.3286199  0.7973431  0.2442020
#>   126   0.3301879  0.7954123  0.2448238
#>   141   0.3310163  0.7944084  0.2449911
#>   142   0.3312688  0.7941916  0.2452195
#>   144   0.3310306  0.7944264  0.2448428
#>   150   0.3316497  0.7936773  0.2452595
#>   153   0.3319564  0.7933476  0.2456927
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 66.
apartment_predict_forest <- predict(apartment_forest2, newdata = data_test %>% select(-Price))
postResample(exp(apartment_predict_forest), exp(data_test$Price))
#>              RMSE          Rsquared               MAE 
#> 745598924.7589433         0.6756595 439084601.3558860
sd(apartment_predict_forest)
#> [1] 0.5954363

In conclusion, the first model is slightly better: it has a lower RMSE (first model = 740877346.4, second model = 745598924.8) and a higher R-squared (R-squared describes how much of the variability is captured by the model; first model = 0.677, second model = 0.676).
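A side note on the tuning table of the second model: it lists 14 mtry values although tuneLength is 15. One plausible explanation is that random search draws candidate values with replacement and discards duplicates. This is an assumption about caret's internals, not something shown in the source, and the upper bound of 159 is likewise assumed from the largest mtry in the first model's table.

```r
set.seed(123)

n_columns   <- 159   # assumed number of predictor columns after encoding
tune_length <- 15

# Draw candidates with replacement, then drop duplicates
candidates <- sort(unique(sample(seq_len(n_columns),
                                 size = tune_length,
                                 replace = TRUE)))

candidates
length(candidates)   # can be smaller than tune_length
```

If exactly 15 distinct values are wanted, passing an explicit tuneGrid to train() avoids the issue.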

DECISION TREE

Another model that can be used to predict apartment prices is the decision tree. A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like structure where each internal node represents a test on a feature, each branch represents the outcome of the test, and each leaf node represents a class label or a value. If you want to know more about decision trees, you can see here.

First Model

Since we already split the data when building the random forest model, we can reuse data_train and data_test here. However, we need to remove Sub_district as a predictor, because the decision tree model cannot use a predictor with more than 35 unique levels.

data_train_tree <- data_train %>% select(-Sub_district)
data_test_tree <- data_test %>% select(-Sub_district)

Once the training and test data are ready, it’s time to build the decision tree model. In our first model, we do not tune any hyperparameters. The hyperparameters that can be tuned in this model include mincriterion, minsplit, and minbucket. mincriterion is the value that 1 - p-value must exceed for a node to be split: the higher the mincriterion, the more significant a split must be, so fewer branches are created, while a low mincriterion allows more splits and can result in overfitting (default: 0.95). minsplit is the minimum number of observations a node must contain to be considered for splitting; a node with fewer observations will not be split (default: 20). minbucket is the minimum number of observations in a terminal node; a split that would create a smaller terminal node is not made (default: 7). All three of these hyperparameters will be used in our second model.
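The way the three settings interact can be sketched as a simple rule. The helper below is hypothetical and is not part of partykit; it leaves out how ctree actually computes its p-values and only shows when a split would be allowed.

```r
# Hypothetical helper: would a node be split under these settings?
would_split <- function(p_value, n_node, n_left, n_right,
                        mincriterion = 0.95,
                        minsplit = 20,
                        minbucket = 7) {
  significant <- (1 - p_value) > mincriterion       # test is strong enough
  big_enough  <- n_node >= minsplit                 # node is large enough to split
  leaves_ok   <- min(n_left, n_right) >= minbucket  # children are not too small
  significant && big_enough && leaves_ok
}

# A weak association (p = 0.2) is rejected with the default settings
would_split(p_value = 0.2, n_node = 30, n_left = 12, n_right = 18)

# The looser settings used for the second model would accept the same split
would_split(p_value = 0.2, n_node = 30, n_left = 12, n_right = 18,
            mincriterion = 0.5, minsplit = 10, minbucket = 5)
```

Lowering mincriterion, minsplit, and minbucket therefore produces a larger tree, which is not guaranteed to predict better on unseen data.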

To get a better understanding of the decision tree structure, see the picture below.

Build first model

# Fit a conditional inference tree with the default settings
apartment_tree <- ctree(formula = Price ~ ., data = data_train_tree)

plot(apartment_tree, type = "extended")

apartment_predict_tree <- predict(apartment_tree, newdata = data_test_tree %>% select(-Price))

Model Evaluation

postResample(exp(apartment_predict_tree), exp(data_test_tree$Price))
#>              RMSE          Rsquared               MAE 
#> 880336202.7537044         0.5404666 539380234.9162143

Second Model

In our second model, we set all three hyperparameters explicitly.

# Looser settings than the defaults, so the tree can grow larger
apartment_tree2 <- ctree(formula = Price ~ ., data = data_train_tree,
                         control = ctree_control(mincriterion = 0.5,
                                                 minsplit = 10,
                                                 minbucket = 5))

plot(apartment_tree2, type = "simple")

apartment_predict_tree2 <- predict(apartment_tree2, newdata = data_test_tree %>% select(-Price))

Model Evaluation

postResample(exp(apartment_predict_tree2), exp(data_test_tree$Price))
#>              RMSE          Rsquared               MAE 
#> 883747984.2431744         0.5334467 538074756.3757199

Conclusion:

Judging by the RMSE, R-squared, and MAE of each model, random forest is the best model for predicting apartment prices: both decision tree models have a higher RMSE and MAE and a lower R-squared than the random forest models. So random forest is chosen as the final model.

FINAL MODEL

For the final model, we use the whole data set to build the model and store it as the final_model.RDS file.

df_final_model <- apartment_clean %>% 
  select(c(Bedroom, Bathroom, Total_area, Price, District, Sub_district)) %>% 
  mutate(Price = log(Price))
#ctrl <- trainControl(method = "repeatedcv",
#                     number = 5,
#                     repeats = 3)


#fb_forest <- train(Price ~ .,
#                   data = df_final_model,
#                   method = "rf",
#                   trControl = ctrl)

#saveRDS(fb_forest, "final_model.RDS")

BUILD MAP

To build the map, we first combine our first data frame (apartment_raw) with the JSON data.

# Build a data frame grouped by sub-district with the average price of each sub-district
apartment_raw_mean <- apartment_raw %>% 
  group_by(District, Sub_district) %>% 
  summarise(Mean = mean(Price),
            Total = n()) %>% 
  ungroup()

apartment_for_maps <- apartment_raw_mean %>% 
  left_join(jakarta_json_mod, by = c("District" = "NAME_2", "Sub_district" = "NAME_3")) %>% 
  na.omit() %>% 
  st_as_sf()
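The aggregate-then-join pattern above can be tried without the spatial data. The base R sketch below uses made-up listings and a made-up attribute table; merge(..., all.x = TRUE) plays the role of left_join, and the CC_3 value is a placeholder rather than a real code.

```r
# Made-up listings
listings <- data.frame(
  District     = c("Jakarta Selatan", "Jakarta Selatan", "Jakarta Pusat"),
  Sub_district = c("Tebet", "Tebet", "Sudirman"),
  Price        = c(9e8, 1.1e9, 4e9)
)

# Made-up map attributes; "Sudirman" is deliberately absent
map_attr <- data.frame(
  NAME_2 = "Jakarta Selatan",
  NAME_3 = "Tebet",
  CC_3   = "0000000"   # placeholder, not a real code
)

# Average price per district and sub-district
avg <- aggregate(Price ~ District + Sub_district, data = listings, FUN = mean)
names(avg)[3] <- "Mean"

# Left join: keep every row of avg, fill unmatched map columns with NA
joined <- merge(avg, map_attr,
                by.x = c("District", "Sub_district"),
                by.y = c("NAME_2", "NAME_3"),
                all.x = TRUE)

joined
# Rows without a matching polygon disappear after na.omit()
na.omit(joined)
```

Any sub-district whose name does not match a polygon name exactly is dropped from the map this way, so comparing the row counts before and after na.omit() is a useful check.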

To color the map, we need to decide which variable to use as the fill color. In this project we use the average apartment price in each sub-district.

pal <- colorBin(c("blue", "yellow", "red"), domain = apartment_for_maps$Mean)
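The idea behind a binned palette can be sketched in base R: cut the values into bins and give every bin one color. This is only an approximation of the concept, not leaflet's implementation, so the exact bin edges and colors that colorBin chooses may differ. The prices are made up.

```r
# Made-up average prices per sub-district
mean_price <- c(5.4e8, 8.5e8, 1.2e9, 1.7e9, 2.1e9)

# "Pretty" bin edges that cover the data
breaks <- pretty(mean_price, n = 7)

# One color per bin, interpolated from blue through yellow to red
bin_colors <- colorRampPalette(c("blue", "yellow", "red"))(length(breaks) - 1)

# Assign each value to its bin and look up the bin color
bin_index  <- cut(mean_price, breaks = breaks,
                  include.lowest = TRUE, labels = FALSE)
fill_color <- bin_colors[bin_index]

data.frame(mean_price, bin_index, fill_color)
```

Sub-districts in the same bin share one color, which keeps the legend short at the cost of hiding small price differences.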

Building Map

leaflet(apartment_for_maps) %>% 
  addTiles() %>% 
  addPolygons(fillColor = ~pal(Mean),
              label = paste0("Sub District: ", apartment_for_maps$Sub_district), 
              fillOpacity = .8,
              weight = 2,
              color = "white",
              highlightOptions = highlightOptions(
                weight = 1,
                color = "black", 
                bringToFront = TRUE,
                opacity = 0.8)) %>% 
  addLegend("bottomright", 
            pal = pal,
            values = ~Mean,
            title = "Average Price",
            labFormat = labelFormat(digits = 2),
            opacity = 1)

BUILD PLOT

Plots are used to visualize our analysis of the data. Here we use two kinds of plot: a bar plot and a donut plot. The bar plot shows the average price per sub-district for the chosen district, number of bedrooms, and number of bathrooms, so the user can judge whether the predicted value is cheap or expensive. The donut plot visualizes the characteristics of the apartments most often sold on the market, so the user can make better decisions.

Bar Plot

Build a new data frame that contains only the average price. As an example, we assume the user chooses “Jakarta Selatan” as the district, with a 1-bedroom, 1-bathroom apartment.

apartment_for_map <- apartment_raw %>% 
  left_join(jakarta_json_mod, by = c("District" = "NAME_2", "Sub_district" = "NAME_3")) %>% 
  na.omit()

apartment_filter <- apartment_for_map %>% 
  filter(Bedroom == 1,
         Bathroom == 1,
         District == "Jakarta Selatan") %>% 
  group_by(Sub_district) %>% 
  summarise(Price_mean = mean(Price)) %>% 
  ungroup() %>% 
  mutate(Sub_district = as.factor(Sub_district))

apartment_filter
#> # A tibble: 10 × 2
#>    Sub_district     Price_mean
#>    <fct>                 <dbl>
#>  1 Cilandak        1703673962.
#>  2 Jagakarsa       1200000000 
#>  3 Kebayoranbaru   2148444444.
#>  4 Kebayoranlama   1173518250 
#>  5 Mampangprapatan 1577500000 
#>  6 Pancoran         846000000 
#>  7 Pasarminggu     1071461326.
#>  8 Pesanggrahan     539640000 
#>  9 Setiabudi       1555200000 
#> 10 Tebet            946125000

# Build the tooltip text shown when hovering over a bar
apartment_filter <- apartment_filter %>% 
  mutate(
    label = glue(
      "Sub District: {Sub_district}
       Mean Price: {comma(Price_mean)}"
    )
  )

# Horizontal bar plot, ordered by average price
plot1 <- ggplot(apartment_filter, aes(x = reorder(Sub_district, Price_mean),
                                      y = Price_mean, 
                                      text = label)) +
  geom_col(aes(fill = Price_mean)) +
  scale_fill_gradient(low = "red", high = "black") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  theme_minimal()

plot1

ggplotly(plot1, tooltip = "text")
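
The filter above is hard-coded for one selection. In the application the district, bedroom and bathroom values come from the user, so the same steps could be wrapped in a function. This is only a sketch: `mean_price_by_sub_district()` is a hypothetical helper, not a function from the HomeScape source, and `toy` is made-up data with the same column names.

```r
library(dplyr)

# Hypothetical helper: average price per sub district for one user selection
mean_price_by_sub_district <- function(data, district, bedroom, bathroom) {
  data %>% 
    filter(District == district,
           Bedroom == bedroom,
           Bathroom == bathroom) %>% 
    group_by(Sub_district) %>% 
    summarise(Price_mean = mean(Price), .groups = "drop") %>% 
    mutate(Sub_district = as.factor(Sub_district))
}

# Made-up listings, for illustration only
toy <- tibble(
  District     = c("Jakarta Selatan", "Jakarta Selatan", "Jakarta Selatan", "Jakarta Barat"),
  Sub_district = c("Tebet", "Tebet", "Pancoran", "Kembangan"),
  Bedroom      = c(1, 1, 1, 1),
  Bathroom     = c(1, 1, 1, 1),
  Price        = c(9e8, 1.1e9, 8e8, 7e8)
)

result <- mean_price_by_sub_district(toy, "Jakarta Selatan", 1, 1)
result
```

The returned data frame has the same shape as `apartment_filter`, so the labelling and plotting code can be reused unchanged.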

Donut Plot

Assume the user chooses “Jakarta Selatan” as the district.

apartment_type <- apartment_for_map %>% 
  filter(District == "Jakarta Selatan") %>% 
  select(District, Total_area, Bedroom)

# Add a column with the apartment type: small units are classified by total area first,
# the remaining units by number of bedrooms
apartment_type <- apartment_type %>% 
  mutate(Type = case_when(
    Total_area >= 12 & Total_area <= 24 ~ "Micro",
    Total_area >= 25 & Total_area <= 44 ~ "Studio",
    Bedroom == 1 ~ "1 Bed",
    Bedroom == 2 ~ "2 Bed",
    Bedroom == 3 ~ "3 Bed",
    Bedroom == 4 ~ "4 Bed",
    TRUE ~ "None"
  ))

datatable(apartment_type, options = list(scrollX = TRUE))

A donut plot is essentially a stacked bar drawn in polar coordinates, with a hole in the middle. Donut plots are generally used to display each category's share of the whole.

apartment_type$Type <- as.factor(apartment_type$Type)

# Count the listings of each type (Jumlah = count)
apartment_type <- apartment_type %>% 
  group_by(Type) %>% 
  summarise(Jumlah = n()) %>% 
  ungroup()

# Compute the fraction of each type
apartment_type$fraction <- apartment_type$Jumlah / sum(apartment_type$Jumlah)

# Compute the cumulative percentages (top of each rectangle)
apartment_type$ymax <- cumsum(apartment_type$fraction)

# Compute the bottom of each rectangle
apartment_type$ymin <- c(0, head(apartment_type$ymax, n=-1))

# Compute label position
apartment_type$labelPosition <- (apartment_type$ymax + apartment_type$ymin) / 2

# Build the label text (percentage rounded down)
apartment_type$label <- paste0(apartment_type$Type, "\n Total: ", floor(apartment_type$fraction*100), "%")
    
# Convert to data frame
apartment_type <- as.data.frame(apartment_type)

apartment_type
#>     Type Jumlah   fraction      ymax      ymin labelPosition
#> 1  1 Bed     37 0.08096280 0.0809628 0.0000000     0.0404814
#> 2  2 Bed    203 0.44420131 0.5251641 0.0809628     0.3030635
#> 3  3 Bed     96 0.21006565 0.7352298 0.5251641     0.6301969
#> 4  4 Bed     14 0.03063457 0.7658643 0.7352298     0.7505470
#> 5  Micro     12 0.02625821 0.7921225 0.7658643     0.7789934
#> 6 Studio     95 0.20787746 1.0000000 0.7921225     0.8960613
#>                 label
#> 1   1 Bed\n Total: 8%
#> 2  2 Bed\n Total: 44%
#> 3  3 Bed\n Total: 21%
#> 4   4 Bed\n Total: 3%
#> 5   Micro\n Total: 2%
#> 6 Studio\n Total: 20%
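
The `ymin` and `ymax` columns define where each slice of the ring starts and ends. As a quick sanity check, the sketch below recomputes them in base R from the counts printed in the table above and verifies that the slices tile the interval from 0 to 1 without gaps. The variable names are local to this example.

```r
# Counts per type, copied from the table above
counts <- c("1 Bed" = 37, "2 Bed" = 203, "3 Bed" = 96,
            "4 Bed" = 14, "Micro" = 12, "Studio" = 95)

fraction <- counts / sum(counts)
ymax <- cumsum(fraction)            # top of each slice
ymin <- c(0, head(ymax, n = -1))    # bottom of each slice

# Each slice starts where the previous one ends, and the ring closes at 1
stopifnot(
  isTRUE(all.equal(unname(ymin[-1]), unname(ymax[-length(ymax)]))),
  isTRUE(all.equal(unname(ymax[length(ymax)]), 1)),
  isTRUE(all.equal(unname(ymax - ymin), unname(fraction)))
)
```

If any of these checks fails after changing the grouping, the slices would overlap or leave a gap in the donut.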

Make the plot

ggplot(apartment_type, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Type)) +
      geom_rect() +
      geom_text( x=2, aes(y=labelPosition, label=label, color=Type), size=6) +
      scale_fill_manual(values=c("#D52027", "#FFB049", "#1E4558", "#F8EEE3", "#ABCDEF", "#123456")) +
      coord_polar(theta="y") +
      xlim(c(-1, 4)) +
      theme_void() +
      theme(legend.position = "none")
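
The donut labels built earlier use `floor()`, so the displayed percentages add up to 98% rather than 100%. One possible fix, shown here as a sketch and not as the method used in HomeScape, is largest-remainder rounding: round everything down, then give the missing points to the types with the largest fractional parts. The counts are copied from the printed table, and the variable names are local to this example.

```r
# Counts per type, copied from the printed table
counts <- c("1 Bed" = 37, "2 Bed" = 203, "3 Bed" = 96,
            "4 Bed" = 14, "Micro" = 12, "Studio" = 95)

pct <- 100 * counts / sum(counts)

# Rounding down loses the fractional parts
floored <- floor(pct)
sum(floored)   # 98

# Largest-remainder rounding: hand the missing points to the largest remainders
missing <- 100 - sum(floored)
bump <- order(pct - floored, decreasing = TRUE)[seq_len(missing)]
rounded <- floored
rounded[bump] <- rounded[bump] + 1

rounded
sum(rounded)   # 100
```

The vector `rounded` could then replace `floor(apartment_type$fraction*100)` when building the label column.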