Welcome to the HomeScape architecture. Here you can see everything that happens behind HomeScape: reading the data, building the machine learning models, building the choropleth map, and creating the plots used in the application. I hope that anyone who reads this and understands the project can develop the application even further.
IMPORT LIBRARY
Load the libraries used throughout the project
library(dplyr)
library(GGally)
library(factoextra)
library(FactoMineR)
library(MLmetrics)
library(randomForest)
library(caret)
library(partykit)
library(ROCR)
library(leaflet)
library(rgdal)
library(stringr) # str_to_title()
library(DT)
library(sp)
library(rgeos)
library(sf)
library(ggplot2)
library(scales)
library(plotly)
library(glue)
library(gganimate)
READ DATA
HomeScape uses two types of data: CSV data and JSON data. The CSV data was obtained by web scraping the lamudi.co.id website and is used for prediction, recommendation and other analysis. The JSON data was downloaded from gadm.org and processed with mapshaper.org.
Map Data
Read the json data
jakarta_json <- rgdal::readOGR("gadm41_IDN_3.json")
#> OGR data source with driver: TopoJSON
#> Source: "C:\Faqih's file\Algoritma School\Final Project\gadm41_IDN_3.json", layer: "gadm41_IDN_3"
#> with 47 features
#> It has 17 fields
After reading the data, we need to convert it to an sf object. This step is important because readOGR() returns an sp (Spatial) object, which does not work with the dplyr verbs used for pre-processing. After the conversion, we do some pre-processing, such as removing unused rows and columns.
jakarta_json_mod <- sf::st_as_sf(jakarta_json)
# Normalise names, remove `Kepulauan Seribu` and drop the columns we won't use
jakarta_json_mod <- jakarta_json_mod %>%
  # Normalise sub-district names (remove spaces, then title case) so they can be used as join keys later
  mutate(NAME_3 = str_replace_all(NAME_3, fixed(" "), "") %>% str_to_title()) %>%
  # Remove `Kepulauan Seribu`
  filter(NAME_2 != "Kepulauan Seribu") %>%
  # Remove unused columns
  dplyr::select(-c(id, NL_NAME_1, NL_NAME_2, NL_NAME_3, VARNAME_3, HASC_3, TYPE_3, ENGTYPE_3))
glimpse(jakarta_json_mod)
#> Rows: 45
#> Columns: 10
#> $ GID_3 <chr> "IDN.7.1.1_1", "IDN.7.1.2_1", "IDN.7.1.3_1", "IDN.7.1.4_1", "…
#> $ GID_0 <chr> "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN", "IDN"…
#> $ COUNTRY <chr> "Indonesia", "Indonesia", "Indonesia", "Indonesia", "Indonesi…
#> $ GID_1 <chr> "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7_1", "IDN.7…
#> $ NAME_1 <chr> "Jakarta Raya", "Jakarta Raya", "Jakarta Raya", "Jakarta Raya…
#> $ GID_2 <chr> "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_1", "IDN.7.1_…
#> $ NAME_2 <chr> "Jakarta Barat", "Jakarta Barat", "Jakarta Barat", "Jakarta B…
#> $ NAME_3 <chr> "Cengkareng", "Grogolpetamburan", "Kalideres", "Kebonjeruk", …
#> $ CC_3 <chr> "3174070", "3174040", "3174080", "3174020", "3174010", "31740…
#> $ geometry <MULTIPOLYGON> MULTIPOLYGON (((106.7004 -6..., MULTIPOLYGON (((106.…
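The name clean-up in the mutate() step matters later, because NAME_3 becomes a join key against the apartment data. Here is a minimal base R sketch of the same normalisation. The helper normalize_name() is hypothetical (it is not part of the project) and only mimics str_replace_all() followed by str_to_title() for names that collapse into a single word.

```r
# Hypothetical helper (not in the original project): strip spaces, then
# upper-case only the first letter and lower-case the rest.
normalize_name <- function(x) {
  x <- gsub(" ", "", x, fixed = TRUE)
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

normalize_name(c("Kelapa Gading", "GROGOL PETAMBURAN", "Tebet"))
# "Kelapagading" "Grogolpetamburan" "Tebet"
```

Any later comparison against these names has to use the normalised spelling (for example "Kelapagading", not "Kelapa Gading").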
Apartment Data
Read the dataset. The raw data contains 2,969 rows and 7 columns.
apartment <- read.csv("apartment_jakarta.csv")
datatable(apartment, options = list(scrollX = T))
Column description:
- Name: Name of the apartment, or the title given by the seller
- Address: Apartment location (sub-district and district)
- Bedroom: Number of bedrooms
- Bathroom: Number of bathrooms
- Total_area: Total area of the apartment (in m²)
- Price: Apartment price (in rupiah)
- Phone: Contact phone number
DATA PREPROCESSING
Garbage in, garbage out. Data from any source can contain unusual values, called outliers, that degrade the results of our analysis. An outlier is a value that differs markedly from the rest of the data. For example, apartment prices generally range from 300 million rupiah to 10 billion rupiah (depending on where the apartment is), so an apartment priced at, say, 100 million rupiah or less can be treated as an outlier and handled during pre-processing.
Handling outliers is only one part of data preprocessing; correcting data types, feature selection and feature engineering belong to it as well. Data preprocessing is the set of steps and techniques used to prepare raw data for analysis or machine learning tasks. It involves transforming and cleaning the data to ensure its quality, consistency and compatibility with the chosen algorithms.
Data Coercion
Before any other pre-processing, we need to do data coercion. Data coercion converts each column to its appropriate data type. Later steps only give correct results if every column has the right type; for example, we cannot apply arithmetic operations such as addition, subtraction or square roots to character data.
glimpse(apartment)
#> Rows: 2,969
#> Columns: 7
#> $ Name <chr> "Ciputra International 1 Bedroom Ruang Keluarga Luas Balkon…
#> $ Address <chr> "Jakarta Barat", "Karet Semanggi, Jakarta Selatan", "Karet …
#> $ Bedroom <int> 1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 2, 3, 1, 1, 2, 2, 1, 1, 2, 1,…
#> $ Bathroom <chr> "1", "1", "1", "2", "3", "2", "1", "1", "1", "2", "1", "3",…
#> $ Total_area <chr> "52 m²", "60 m²", "40 m²", "86 m²", "117 m²", "93 m²", "30 …
#> $ Price <chr> " 1,505,000,000", " 3,000,000,000", " 2,000,000,000", " 3,7…
#> $ Phone <chr> "+6288212204495", "+6281191415518", "+6281191507415", "+628…
From the result above, Bathroom, Total_area and Price are stored as character and need to be converted to numeric. Bedroom is an integer and will be converted to numeric as well.
First, I remove every row whose Bathroom value contains a letter, because this column must contain only numbers.
apartment <- apartment[!grepl("[A-Za-z]", apartment$Bathroom), ]
The Total_area column will be converted to numeric, so we first remove the unit suffix (" m²") from it.
apartment$Total_area <- gsub(" m²", "", apartment$Total_area)
Likewise, to convert the Price column to numeric it must contain only digits, so we remove the comma separators.
apartment$Price <- gsub(",", "", apartment$Price)
Convert the columns to numeric
apartment <- apartment %>%
mutate(Bedroom = as.numeric(Bedroom),
Bathroom = as.numeric(Bathroom),
Total_area = as.numeric(Total_area),
Price = as.numeric(Price))
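To make the coercion steps concrete, here is a small sketch on made-up strings that look like the raw Price and Total_area values. The helper clean_number() is hypothetical and slightly more general than the gsub() calls above: it keeps only digits and the decimal point before converting.

```r
# Hypothetical helper: drop everything except digits and ".", then convert.
clean_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

raw_price <- c(" 1,505,000,000", " 3,000,000,000")
raw_area  <- c("52 m\u00b2", "60 m\u00b2")   # "\u00b2" is the superscript 2

price <- clean_number(raw_price)
area  <- clean_number(raw_area)

is.numeric(price)   # TRUE
area + 1            # arithmetic now works: 53 61

# A value with no digits at all becomes NA instead of raising an error.
clean_number("Studio")
```

Rows that turn into NA this way are worth inspecting before they are dropped, since they often point to a scraping problem.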
Feature Engineering
Feature engineering is the process of creating new features or transforming existing features from raw data to improve the performance of machine learning models. It involves selecting, creating, or modifying features to make them more informative and representative of the underlying patterns or relationships in the data.
In this project we will create two new columns, District and Sub_district, both extracted from the Address column.
Create a new column that contains only the district name
apartment$District <- sapply(strsplit(apartment$Address, ",\\s*"), "[", 2)
Some rows in the Address column have no sub-district name, which results in an NA (missing) value in the District column. We remove all rows with NA values.
apartment <- apartment %>%
na.omit()
Create a new column that contains only the sub-district name
apartment$Sub_district <- sapply(strsplit(apartment$Address, ",\\s*"), "[", 1)
apartment <- apartment %>%
mutate(District = as.factor(District),
Sub_district = as.factor(Sub_district))
apartment <- apartment %>%
mutate(Sub_district = gsub(" ", "", Sub_district),
Sub_district = stringr::str_to_title(Sub_district),
Sub_district = as.factor(Sub_district))
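The splitting logic is easier to follow on two sample addresses. This is a minimal sketch with the same regular expression as above; the two address strings are taken from the glimpse() output, everything else is illustrative.

```r
address <- c("Karet Semanggi, Jakarta Selatan", "Jakarta Barat")

# Split on a comma followed by optional whitespace.
parts <- strsplit(address, ",\\s*")

sub_district <- vapply(parts, function(p) p[1], character(1))
district     <- vapply(parts, function(p) p[2], character(1))

data.frame(address, sub_district, district)
```

For the second address there is no comma, so the second piece does not exist and district becomes NA. That is why the NA rows are removed before the Sub_district column is created: otherwise "Jakarta Barat" would be stored as a sub-district.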
Handling Abnormal Values
I’ll split the original data (apartment) into two data frames. The first (apartment_raw) is used for building the map and the recommendations, and the second (apartment_clean) is used for building the prediction model. The difference is that the first keeps as much data as possible, so that the recommendation list reflects the apartments actually available on the market.
According to the RoomSketcher website, apartments can be grouped by floor area: a studio apartment has a total area of around 25-44 m², a 1-bed apartment around 45-75 m², a 2-bed apartment around 55-95 m², a 3-bed apartment around 76-130 m² and a 4-bed apartment around 90-150 m². There is also a type smaller than a studio, the micro apartment, with a total area of around 12-24 m². We therefore keep only apartments with a total area between 12 and 150 m².
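Because the area bands overlap, one area can be plausible for several apartment types. The sketch below puts the bands quoted above into a lookup table. The table name area_bands and the helper plausible_types() are hypothetical and only illustrate the idea; they are not used elsewhere in the project.

```r
# Area bands in m², as quoted in the text above.
area_bands <- data.frame(
  type = c("Micro", "Studio", "1 Bed", "2 Bed", "3 Bed", "4 Bed"),
  min  = c(12, 25, 45, 55, 76, 90),
  max  = c(24, 44, 75, 95, 130, 150),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every type whose band contains the given area.
plausible_types <- function(area) {
  area_bands$type[area >= area_bands$min & area <= area_bands$max]
}

plausible_types(60)    # "1 Bed" "2 Bed"
plausible_types(92)    # "2 Bed" "3 Bed" "4 Bed"
plausible_types(200)   # character(0): outside every band
```

An area that returns character(0) is exactly what the Total_area filters below remove.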
First data frame for map and recommendation
apartment_raw <- apartment %>%
filter(Total_area >= 12,
Total_area <= 150,
Bedroom <= 4,
Bathroom <= Bedroom)
summary(apartment_raw)
#> Name Address Bedroom Bathroom
#> Length:2328 Length:2328 Min. :1.000 Min. :1.000
#> Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
#> Mode :character Mode :character Median :2.000 Median :1.000
#> Mean :1.954 Mean :1.384
#> 3rd Qu.:2.000 3rd Qu.:2.000
#> Max. :4.000 Max. :4.000
#>
#> Total_area Price Phone
#> Min. : 14.00 Min. : 8500000 Length:2328
#> 1st Qu.: 35.00 1st Qu.: 500000000 Class :character
#> Median : 49.00 Median : 950000000 Mode :character
#> Mean : 63.44 Mean : 1517959950
#> 3rd Qu.: 88.00 3rd Qu.: 2000000000
#> Max. :150.00 Max. :37500000000
#>
#> District Sub_district
#> Jakarta Barat :375 Kelapagading: 210
#> Jakarta Pusat :369 Kuningan : 135
#> Jakarta Selatan:865 Kemayoran : 114
#> Jakarta Timur :217 Setiabudi : 103
#> Jakarta Utara :502 Cempakaputih: 88
#> Kalibata : 70
#> (Other) :1608
Second data frame for building the model
apartment_clean <- apartment %>%
filter(Bedroom <= 4,
Total_area >= 12,
Total_area <= 150,
Bathroom <= Bedroom,
Price > 300000000,
!(Total_area >= 12 & Total_area <= 54 & Bedroom > 1),
!(Total_area >= 55 & Total_area <= 75 & Bedroom > 2),
!(Total_area >= 76 & Total_area <= 90 & Bedroom < 2),
!(Total_area >= 76 & Total_area <= 90 & Bedroom > 3),
!(Total_area >= 91 & Total_area <= 95 & Bedroom < 2),
!(Total_area >= 91 & Total_area <= 95 & Bedroom > 4),
!(Total_area >= 96 & Total_area <= 150 & Bedroom < 3),
!(Total_area >= 96 & Total_area <= 150 & Bedroom > 4),
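# Caution: Sub_district was normalised earlier (spaces removed, title case, e.g. "Kelapagading"),
# so conditions below that use labels such as "Pasar Minggu", "SetiaBudi" or "Kelapa Gading" match no rows as written.
!(Total_area >= 25 & Total_area <= 44 & Sub_district == "Pasar Minggu" & Price == 2749000000),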
!(Total_area >= 25 & Total_area <= 44 & Sub_district == "Kuningan" & Price == 2500000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "SetiaBudi" & Price == 3500000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "Tebet" & Price == 3750000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "Kuningan" & Price == 3700000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "Sudirman" & Price == 4500000000),
!(Total_area >= 55 & Total_area <= 95 & Sub_district == "Cilandak" & Price == 1750000000),
!(Total_area >= 90 & Total_area <= 150 & Sub_district == "Kemang" & Price == 7900000000),
!(Total_area >= 90 & Total_area <= 150 & Sub_district == "SetiaBudi" & Price > 7000000000),
!(Total_area >= 90 & Total_area <= 150 & Sub_district == "Kebayoran Baru" & Price >= 9000000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "Thamrin" & Price >= 37500000000),
!(Total_area >= 90 & Total_area <= 150 & Sub_district == "Tanah Abang" & Price >= 7750000000),
!(Total_area >= 45 & Total_area <= 75 & Sub_district == "Kelapa Gading" & Price >= 3000000000),
!(Total_area >= 55 & Total_area <= 95 & Sub_district == "Kelapa Gading" & Price >= 3250000000)
)
summary(apartment_clean)
#> Name Address Bedroom Bathroom
#> Length:1253 Length:1253 Min. :1.00 Min. :1.00
#> Class :character Class :character 1st Qu.:1.00 1st Qu.:1.00
#> Mode :character Mode :character Median :2.00 Median :1.00
#> Mean :1.93 Mean :1.52
#> 3rd Qu.:3.00 3rd Qu.:2.00
#> Max. :4.00 Max. :4.00
#>
#> Total_area Price Phone District
#> Min. : 17.00 Min. : 310000000 Length:1253 Jakarta Barat :223
#> 1st Qu.: 37.00 1st Qu.: 800000000 Class :character Jakarta Pusat :212
#> Median : 69.00 Median :1450000000 Mode :character Jakarta Selatan:556
#> Mean : 71.53 Mean :1762750211 Jakarta Timur : 73
#> 3rd Qu.: 94.00 3rd Qu.:2500000000 Jakarta Utara :189
#> Max. :150.00 Max. :9000000000
#>
#> Sub_district
#> Kuningan :116
#> Setiabudi : 86
#> Kemayoran : 76
#> Kelapagading: 71
#> Kembangan : 38
#> Tanjungduren: 36
#> (Other) :830
From the summaries above, we get some interesting insights:
- Two-bedroom apartments are the typical offering on the market (the median number of bedrooms is 2).
- 3-bedroom and 4-bedroom apartments are the least common types (75% of the listings have a Total_area below 100 m²).
- More apartments are offered for sale in Jakarta Selatan than in any other district.
K-MEANS CLUSTERING
Unsupervised learning is a machine learning approach in which a model is trained on unlabeled data, without a target variable or known outputs. This project uses one unsupervised method, K-means clustering.
K-means clustering groups the data into clusters with similar characteristics. The algorithm works by computing the distance from each observation to the cluster centers (centroids) and assigning the observation to the nearest one. In this project we use it to recommend apartments with similar characteristics.
First we need to one-hot encode every column with a factor data type (the District and Sub_district columns).
apartment_cluster <- apartment_raw
apartment_District <- apartment_cluster$District
apartment_Subdistrict <- apartment_cluster$Sub_district
District_encoded <- model.matrix(~ apartment_District - 1)
Subdistrict_encoded <- model.matrix(~ apartment_Subdistrict - 1)
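To see what model.matrix() produces here, consider a tiny factor with two levels. This is only an illustration on made-up values; the variable names are not from the project.

```r
district <- factor(c("Jakarta Barat", "Jakarta Selatan", "Jakarta Barat"))

# "- 1" removes the intercept, so every level gets its own 0/1 column.
encoded <- model.matrix(~ district - 1)

colnames(encoded)   # "districtJakarta Barat" "districtJakarta Selatan"
rowSums(encoded)    # each row has exactly one 1
```

Each row has a single 1 in the column of its level, which is what lets a distance-based method such as K-means use a categorical variable.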
Scaling
As mentioned earlier, K-means clustering works by computing the distance from each observation to the cluster centers, so variables on very different scales would distort the result. To handle this we scale the data so that every variable is on a comparable scale.
apartment_scale <- apartment_cluster %>%
select(c(Bedroom, Bathroom, Total_area, Price)) %>%
scale()
summary(apartment_scale)
#> Bedroom Bathroom Total_area Price
#> Min. :-1.29645 Min. :-0.6498 Min. :-1.3738 Min. :-0.9447
#> 1st Qu.:-1.29645 1st Qu.:-0.6498 1st Qu.:-0.7902 1st Qu.:-0.6371
#> Median : 0.06307 Median :-0.6498 Median :-0.4012 Median :-0.3555
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.06307 3rd Qu.: 1.0423 3rd Qu.: 0.6825 3rd Qu.: 0.3017
#> Max. : 2.78212 Max. : 4.4266 Max. : 2.4053 Max. :22.5197
As the summary above shows, the data has been scaled: every column now has a mean of 0. You can compare these numbers with the data before we applied the scale function.
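For reference, scale() with its default arguments computes a z-score: it subtracts the column mean and divides by the column standard deviation. A quick check on a few made-up areas:

```r
area <- c(30, 40, 52, 60, 117)

z_manual <- (area - mean(area)) / sd(area)
z_scale  <- as.numeric(scale(area))

all.equal(z_manual, z_scale)   # TRUE
```

A scaled value of 22.5 for Price (the maximum in the summary above) therefore means a listing more than 22 standard deviations above the mean price.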
Combine all columns into one data frame.
apartment_z <- cbind(apartment_scale, District_encoded, Subdistrict_encoded)
apartment_z <- as.data.frame(apartment_z)
Find Optimum Number of K
After combining all columns, we can perform K-means clustering. First, though, we need to find the optimum value of K (the number of groups to split the data into).
fviz_nbclust(x = apartment_z,
FUNcluster = kmeans,
method = "wss")
From the graph above, I chose 5 as the number of clusters: going from k = 5 to k = 6 brings no meaningful improvement in total WSS (in this plot it even increases), so we will use K = 5.
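The WSS curve that fviz_nbclust() draws can be reproduced by hand, which helps when reading the plot. The sketch below uses synthetic data with three clearly separated groups (not the apartment data), so the "elbow" is expected at k = 3.

```r
set.seed(100)

# Synthetic data: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

round(wss, 1)
# plot(1:6, wss, type = "b", xlab = "k", ylab = "Total WSS")
```

WSS drops sharply up to the true number of groups and flattens afterwards. Using nstart > 1 restarts K-means from several random centers, which makes the curve less likely to go up between two values of k.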
Clustering
RNGkind(sample.kind = "Rounding")
set.seed(100)
apartment_km <- kmeans(x = apartment_z,
centers = 5)
Model Evaluation
Model evaluation shows how good our clustering is. We look at two quantities: the total within-cluster sum of squares (tot.withinss, the sum of squared distances from each observation to its centroid; the smaller it is, the more similar the members of each cluster are to each other) and the ratio of the between-cluster sum of squares to the total sum of squares (betweenss / totss; the closer to 1, the better).
apartment_km$tot.withinss
#> [1] 6329.603
apartment_km$betweenss / apartment_km$totss
#> [1] 0.5251956
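The two numbers above are linked: the total sum of squares splits into a within-cluster part and a between-cluster part. A short check on the built-in iris data (used here only as a stand-in for the apartment data):

```r
set.seed(100)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)

# totss = tot.withinss + betweenss
all.equal(km$totss, km$tot.withinss + km$betweenss)   # TRUE

# Share of the total variation that is explained by the clustering.
km$betweenss / km$totss
```

So a ratio of about 0.53 for the apartment clusters means that roughly half of the total variation lies between the clusters and the other half remains inside them.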
Add the clustering result to our data frame
apartment_raw$Cluster <- apartment_km$cluster
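Once every listing carries a cluster label, a recommendation can be as simple as "other listings in the same cluster". The sketch below shows that idea on made-up listings. The helper recommend_similar() is hypothetical and is not necessarily how the application implements its recommendations.

```r
# Made-up listings; the cluster column stands in for apartment_raw$Cluster.
listings <- data.frame(
  name    = c("A", "B", "C", "D", "E"),
  price   = c(500e6, 520e6, 2.1e9, 2.0e9, 480e6),
  cluster = c(1, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: listings that share a cluster with the chosen one.
recommend_similar <- function(data, chosen_name) {
  chosen_cluster <- data$cluster[data$name == chosen_name]
  data[data$cluster == chosen_cluster & data$name != chosen_name, ]
}

recommend_similar(listings, "A")$name   # "B" "E"
```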
RANDOM FOREST
Random forest is a machine learning model built from many decision trees. It combines the outputs of the individual trees (averaging them for regression, taking the majority vote for classification), which makes it fairly robust to outliers. It suits apartment price prediction because the relationship between the predictors and the target variable is not linear (for example, some people sell a 2-bed apartment above 2 billion while others sell a similar apartment below 2 billion because they want to sell it as fast as possible).
Cross Validation
Here, cross validation means splitting our data into training data and test data (a hold-out split). The training data is used to train the model, while the test data serves as unseen data for checking whether the model is good or bad. I’ll use 80% of the data for training and the rest for testing.
The Price column is our target variable (we will predict the price of the apartment), and Bedroom, Bathroom, Total_area, District and Sub_district are the predictors. Before modelling, I apply a logarithmic transformation to the target variable because the distribution of Price is strongly right-skewed.
set.seed(123)
index <- sample(nrow(apartment_clean), 0.8*nrow(apartment_clean))
data_train <- apartment_clean[index,]
data_train <- data_train %>%
select(c(Bedroom, Bathroom, Total_area, District, Sub_district, Price)) %>%
mutate(Price = log(Price))
data_test <- apartment_clean[-index,]
data_test <- data_test %>%
select(c(Bedroom, Bathroom, Total_area, District, Sub_district, Price)) %>%
mutate(Price = log(Price))
summary(data_train)
#> Bedroom Bathroom Total_area District
#> Min. :1.000 Min. :1.000 Min. : 17.00 Jakarta Barat :190
#> 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 36.25 Jakarta Pusat :162
#> Median :2.000 Median :1.000 Median : 67.50 Jakarta Selatan:434
#> Mean :1.921 Mean :1.525 Mean : 71.18 Jakarta Timur : 62
#> 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 94.00 Jakarta Utara :154
#> Max. :4.000 Max. :4.000 Max. :150.00
#>
#> Sub_district Price
#> Kuningan : 86 Min. :19.55
#> Setiabudi : 70 1st Qu.:20.45
#> Kelapagading: 55 Median :21.06
#> Kemayoran : 55 Mean :21.03
#> Kembangan : 33 3rd Qu.:21.64
#> Tanjungduren: 31 Max. :22.80
#> (Other) :672
Look at the Price column: after the log transformation the values span a narrow range (19.55 to 22.80), and the mean (21.03) and the median (21.06) are nearly equal, which indicates a much more symmetric distribution than before.
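The effect of the log transformation can be checked numerically. The sketch below uses synthetic, right-skewed "prices" drawn from a log-normal distribution (not the real data) and a simple moment-based skewness function defined here for illustration.

```r
set.seed(123)

# Synthetic right-skewed prices; the parameters are arbitrary.
price <- rlnorm(1000, meanlog = 21, sdlog = 0.7)

# Simple skewness estimate: about 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(price)        # clearly positive: long right tail
skewness(log(price))   # close to zero

# Predictions made on the log scale are mapped back with exp().
all.equal(exp(log(price)), price)   # TRUE
```

This is also why the evaluation code wraps both the predictions and the observed values in exp(): the errors are then reported in rupiah rather than in log units.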
First Model
Our first model is built as shown below. Two settings of trainControl() deserve attention: number and repeats. number is the number of folds used for k-fold cross validation on the training data, which the model uses to test which hyperparameter value gives the best result. repeats is how many times the whole k-fold procedure is repeated. In our first model we use 5 as number and 3 as repeats.
#ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#Train random forest model
#fb_forest <- train(Price ~ .,
# data = data_train,
# method = "rf",
# trControl = ctrl)
#saveRDS(fb_forest, "apartment_forest.RDS")
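To picture what number = 5 and repeats = 3 mean, here is a plain base R sketch of repeated k-fold assignment. It is a simplification: caret builds its own folds, so the exact fold sizes in the model output can differ slightly from this sketch.

```r
set.seed(123)

n       <- 1002   # number of training rows (80% of 1253, rounded down)
number  <- 5      # folds per repeat
repeats <- 3      # how many times the fold assignment is redrawn

# One fold label per row, reshuffled for every repeat (one column per repeat).
fold_ids <- replicate(repeats, sample(rep(seq_len(number), length.out = n)))

dim(fold_ids)          # 1002 rows, 3 columns
table(fold_ids[, 1])   # fold sizes of about n / 5
number * repeats       # 15 model fits per candidate hyperparameter value
```

Each fold is held out once per repeat while the model is trained on the other four, so every candidate value is scored on 15 held-out folds and the scores are averaged.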
To save time (building the model takes a while), I’ll just read the model that I built before.
apartment_forest <- readRDS("apartment_forest.RDS")
print(apartment_forest)
#> Random Forest
#>
#> 1002 samples
#> 5 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 801, 802, 801, 802, 802, 799, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 0.6445281 0.6438033 0.5302139
#> 80 0.3267135 0.8001731 0.2439477
#> 159 0.3315215 0.7945110 0.2453422
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 80.
Check the variability explained by the final model.
apartment_forest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 80
#>
#> Mean of squared residuals: 0.1050513
#> % Var explained: 80.2
Predict the unseen test data we set aside earlier, then look at the RMSE (Root Mean Squared Error), R-squared and MAE (Mean Absolute Error).
apartment_predict_forest <- predict(apartment_forest, newdata = data_test %>% select(-Price))
postResample(exp(apartment_predict_forest), exp(data_test$Price))
#> RMSE Rsquared MAE
#> 740877346.369777 0.677093 437207150.928202
Check the standard deviation of the predictions (on the log scale)
sd(apartment_predict_forest)
#> [1] 0.6024655
What we get from our first model:
- RMSE of 740,877,346.4 rupiah
- R-squared of 0.68
- MAE of 437,207,150.9 rupiah
- Standard deviation of the predicted log-prices of 0.6024655
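The three metrics are simple to compute by hand. The sketch below uses four made-up prices in rupiah. I compute R-squared as the squared correlation between predictions and observations, which, as far as I know, is what postResample() reports by default; treat that detail as an assumption.

```r
# Made-up observed and predicted prices (rupiah), already back-transformed.
observed  <- c(800e6, 1.2e9, 2.5e9, 600e6)
predicted <- c(900e6, 1.0e9, 2.0e9, 650e6)

rmse <- sqrt(mean((predicted - observed)^2))   # penalises large errors more
mae  <- mean(abs(predicted - observed))        # average absolute error
r2   <- cor(predicted, observed)^2             # squared correlation

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

Here the errors are 100, 200, 500 and 50 million, so the MAE is 212.5 million and the RMSE is 275 million. RMSE is always at least as large as MAE, and the gap grows when a few predictions are far off.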
Second Model
This is our second model. We keep the same number and repeats, but add tuneLength and switch to a random search (search = 'random'). The tuneLength parameter sets how many candidate hyperparameter values are evaluated during the tuning process. Hyperparameters are parameters that are not learned from the data but set by the user before training the model. For a random forest, examples include the number of trees in the forest, the maximum depth of each tree, and the number of features to consider at each split (mtry). With method = "rf", caret tunes only mtry.
#ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3,
# search = 'random')
#set.seed(123)
#fb_forest <- train(Price ~ .,
# data = data_train,
# method = "rf",
# trControl = ctrl,
# tuneLength = 15,
# metric = 'RMSE')
#saveRDS(fb_forest, "apartment_forest2.RDS")
apartment_forest2 <- readRDS("apartment_forest2.RDS")
print(apartment_forest2)
#> Random Forest
#>
#> 1002 samples
#> 5 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 803, 801, 801, 800, 803, 802, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 8 0.4080778 0.7232129 0.3286968
#> 17 0.3550654 0.7711534 0.2808036
#> 66 0.3263386 0.8001292 0.2447009
#> 73 0.3264412 0.8000018 0.2442177
#> 84 0.3272976 0.7988862 0.2445491
#> 88 0.3271607 0.7991070 0.2441659
#> 92 0.3271358 0.7990549 0.2438568
#> 108 0.3286199 0.7973431 0.2442020
#> 126 0.3301879 0.7954123 0.2448238
#> 141 0.3310163 0.7944084 0.2449911
#> 142 0.3312688 0.7941916 0.2452195
#> 144 0.3310306 0.7944264 0.2448428
#> 150 0.3316497 0.7936773 0.2452595
#> 153 0.3319564 0.7933476 0.2456927
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 66.
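The output lists 5 predictors, yet mtry goes far beyond 5. My reading is that the formula interface expands every factor into dummy columns, so mtry counts columns of the expanded matrix rather than original variables. The sketch below shows the expansion on a made-up data frame; the level names are invented.

```r
toy <- data.frame(
  Bedroom      = c(1, 2, 3, 2),
  Total_area   = c(30, 60, 90, 70),
  District     = factor(c("Barat", "Selatan", "Selatan", "Utara")),
  Sub_district = factor(c("A", "B", "C", "D")),
  Price        = c(20.1, 21.0, 21.8, 21.2)
)

# A factor with L levels becomes L - 1 dummy columns.
x <- model.matrix(Price ~ ., data = toy)[, -1]   # drop the intercept column

ncol(x)       # 2 numeric + 2 district dummies + 3 sub-district dummies = 7
colnames(x)
```

With well over a hundred sub-districts in the training data, the expanded matrix has far more columns than the 5 original predictors, which would be consistent with mtry values as high as 159.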
apartment_predict_forest <- predict(apartment_forest2, newdata = data_test %>% select(-Price))
postResample(exp(apartment_predict_forest), exp(data_test$Price))
#> RMSE Rsquared MAE
#> 745598924.7589433 0.6756595 439084601.3558860
sd(apartment_predict_forest)
#> [1] 0.5954363
In conclusion, the first model is slightly better: it has a lower RMSE (first model = 740,877,346.4, second model = 745,598,924.8) and a higher R-squared (R-squared describes how much of the variability is captured by the model; first model = 0.677, second model = 0.676). The differences are small.
DECISION TREE
Another model that can be used to predict apartment prices is the decision tree. A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like structure in which each internal node represents a test on a feature, each branch represents an outcome of the test, and each leaf node represents a class label or a value.
First Model
Since we already split the data when building the random forest model, we can reuse data_train and data_test here. However, we need to remove Sub_district from the predictors, since the decision tree model can’t use a predictor with more than 35 unique levels.
data_train_tree <- data_train %>% select(-Sub_district)
data_test_tree <- data_test %>% select(-Sub_district)
Once the training and test data are ready, it’s time to build the decision tree model. In our first model we do not tune any hyperparameters. The hyperparameters that can be tuned in this model are mincriterion, minsplit and minbucket. mincriterion is the value of 1 - p-value that a test must exceed before a node is split (default: 0.95); the higher it is, the more significant a split must be, so lowering it lets the tree grow more branches and raises the risk of overfitting. minsplit is the minimum number of observations a node must contain to be considered for splitting (default: 20). minbucket is the minimum number of observations in a terminal node; a split that would leave fewer observations is not made (default: 7). All three hyperparameters are used in our second model.
[Figure: structure of a decision tree]
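As a mental model, a node is split only when the test is significant enough (1 - p-value exceeds mincriterion) and the node is large enough (at least minsplit observations). The function can_split() below is a hypothetical, simplified illustration of that rule. It is not partykit's implementation and it ignores minbucket.

```r
# Simplified mental model of the split decision, NOT partykit's own code.
can_split <- function(p_value, n_obs, mincriterion = 0.95, minsplit = 20) {
  (1 - p_value) > mincriterion && n_obs >= minsplit
}

can_split(p_value = 0.03, n_obs = 50)                       # TRUE
can_split(p_value = 0.20, n_obs = 50)                       # FALSE: not significant enough
can_split(p_value = 0.20, n_obs = 50, mincriterion = 0.5)   # TRUE: looser criterion
can_split(p_value = 0.03, n_obs = 12)                       # FALSE: node too small
```

Lowering mincriterion and minsplit therefore allows more splits and a deeper tree, at the price of a higher risk of overfitting.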
Build the first model
apartment_tree <- ctree(formula = Price ~ ., data = data_train_tree)
plot(apartment_tree, type = "extended")
apartment_predict_tree <- predict(apartment_tree, newdata = data_test_tree %>% select(-Price))
Model Evaluation
postResample(exp(apartment_predict_tree), exp(data_test_tree$Price))
#> RMSE Rsquared MAE
#> 880336202.7537044 0.5404666 539380234.9162143
Second Model
In our second model, we set all three tree hyperparameters ourselves.
apartment_tree2 <- ctree(formula = Price ~ ., data = data_train_tree,
                         control = ctree_control(mincriterion = 0.5,
                                                 minsplit = 10,
                                                 minbucket = 5))
plot(apartment_tree2, type = "simple")
apartment_predict_tree2 <- predict(apartment_tree2, newdata = data_test_tree %>% select(-Price))
Model Evaluation
postResample(exp(apartment_predict_tree2), exp(data_test_tree$Price))
#> RMSE Rsquared MAE
#> 883747984.2431744 0.5334467 538074756.3757199
Conclusion:
Based on the RMSE, R-squared and MAE values, random forest is the best model for predicting apartment prices. Both decision tree models have a higher RMSE and MAE and a lower R-squared than the random forest models, so random forest is chosen as the final model.
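Putting the test-set metrics side by side makes the comparison easier to check. The numbers below are copied (rounded) from the postResample() outputs above; the data frame itself is just an illustration.

```r
# Test-set metrics copied from the postResample() outputs above.
results <- data.frame(
  model = c("Random forest 1", "Random forest 2",
            "Decision tree 1", "Decision tree 2"),
  RMSE  = c(740877346, 745598925, 880336203, 883747984),
  Rsq   = c(0.677093, 0.6756595, 0.5404666, 0.5334467),
  MAE   = c(437207151, 439084601, 539380235, 538074756),
  stringsAsFactors = FALSE
)

results[order(results$RMSE), ]            # best (lowest RMSE) first
results$model[which.min(results$RMSE)]    # "Random forest 1"
results$model[which.max(results$Rsq)]     # "Random forest 1"
```

Lower is better for RMSE and MAE, higher is better for R-squared, and the first random forest comes out on top on all three.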
FINAL MODEL
For the final model, we use the whole data set to build the model and store it as the final_model.RDS file.
df_final_model <- apartment_clean %>%
select(c(Bedroom, Bathroom, Total_area, Price, District, Sub_district)) %>%
mutate(Price = log(Price))
#ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#fb_forest <- train(Price ~ .,
# data = df_final_model,
# method = "rf",
# trControl = ctrl)
#saveRDS(fb_forest, "final_model.RDS")
BUILD MAP
To build the map, we first combine our first data frame (apartment_raw) with the JSON data.
# Build a data frame grouped by sub-district with the average price of each sub-district
apartment_raw_mean <- apartment_raw %>%
group_by(District, Sub_district) %>%
summarise(Mean = mean(Price),
Total = n()) %>%
ungroup()
apartment_for_maps <- apartment_raw_mean %>%
left_join(jakarta_json_mod, by = c("District" = "NAME_2", "Sub_district" = "NAME_3")) %>%
na.omit() %>%
st_as_sf()
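Because na.omit() silently drops every sub-district that finds no match in the map layer, it is worth checking which keys fail to join. The sketch below uses base R merge() as a stand-in for left_join(), with made-up values; the CC_3 codes are invented.

```r
prices <- data.frame(
  Sub_district = c("Kelapagading", "Tebet", "Kelapa Gading"),
  Mean         = c(1.2e9, 0.9e9, 1.5e9),
  stringsAsFactors = FALSE
)
shapes <- data.frame(
  NAME_3 = c("Kelapagading", "Tebet", "Kuningan"),
  CC_3   = c("x1", "x2", "x3"),   # invented codes
  stringsAsFactors = FALSE
)

# Keys present in the price table but missing from the map layer.
setdiff(prices$Sub_district, shapes$NAME_3)   # "Kelapa Gading"

# Left join, then drop the rows that found no match.
joined <- merge(prices, shapes, by.x = "Sub_district", by.y = "NAME_3", all.x = TRUE)
nrow(na.omit(joined))                         # 2
```

Running setdiff() on the real keys before the join shows how many listings disappear from the map because of spelling differences rather than missing data.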
To colour the map, we need to decide which variable drives the fill colour. In this project we use the average apartment price in each sub-district as the fill colour.
pal <- colorBin(c("blue", "yellow", "red"), domain = apartment_for_maps$Mean)
Building Map
leaflet(apartment_for_maps) %>%
addTiles() %>%
addPolygons(fillColor = ~pal(Mean),
label = paste0("Sub District: ", apartment_for_maps$Sub_district),
fillOpacity = .8,
weight = 2,
color = "white",
highlightOptions = highlightOptions(
weight = 1,
color = "black",
bringToFront = TRUE,
opacity = 0.8)) %>%
addLegend("bottomright",
pal = pal,
values = ~Mean,
title = "Average Price",
labFormat = labelFormat(digits = 2),
opacity = 1)
BUILD PLOT
Plots are used to visualize the analysis of our data. Here we use two kinds of plot: a bar plot and a donut plot. The bar plot shows the average price per sub-district for the chosen district, number of bedrooms and number of bathrooms, so the user can judge whether the predicted price is cheap or expensive. The donut plot visualizes which apartment types are most commonly offered on the market, so the user can make better decisions.
Bar Plot
Create a new data frame that contains only the average price. As an example, assume the user chooses “Jakarta Selatan” as the district, with a 1-bedroom, 1-bathroom apartment.
apartment_for_map <- apartment_raw %>%
left_join(jakarta_json_mod, by = c("District" = "NAME_2", "Sub_district" = "NAME_3")) %>%
na.omit()
apartment_filter <- apartment_for_map %>%
filter(Bedroom == 1,
Bathroom == 1,
District == "Jakarta Selatan") %>%
group_by(Sub_district) %>%
summarise(Price_mean = mean(Price)) %>%
ungroup() %>%
mutate(Sub_district = as.factor(Sub_district))
apartment_filter
#> # A tibble: 10 × 2
#> Sub_district Price_mean
#> <fct> <dbl>
#> 1 Cilandak 1703673962.
#> 2 Jagakarsa 1200000000
#> 3 Kebayoranbaru 2148444444.
#> 4 Kebayoranlama 1173518250
#> 5 Mampangprapatan 1577500000
#> 6 Pancoran 846000000
#> 7 Pasarminggu 1071461326.
#> 8 Pesanggrahan 539640000
#> 9 Setiabudi 1555200000
#> 10 Tebet 946125000
apartment_filter <- apartment_filter %>%
  mutate(
    label = glue(
      "Sub District: {Sub_district}
      Mean Price: {comma(Price_mean)}"
    )
  )
plot1 <- ggplot(apartment_filter, aes(x = reorder(Sub_district, Price_mean),
y = Price_mean,
text = label)) +
geom_col(aes(fill = Price_mean)) +
scale_fill_gradient(low="red", high="black") +
coord_flip() +
scale_y_continuous(labels = comma) +
theme_minimal()
plot1
ggplotly(plot1, tooltip = "text")
Donut Plot
We assume the user chooses “Jakarta Selatan” as the district
apartment_type <- apartment_for_map %>%
filter(District == "Jakarta Selatan") %>%
select(c(District, Total_area, Bedroom))
# Create a new column containing the apartment type
apartment_type$Type <- ifelse(apartment_type$Total_area >= 12 & apartment_type$Total_area <= 24, "Micro",
ifelse(apartment_type$Total_area >= 25 & apartment_type$Total_area <= 44, "Studio",
ifelse(apartment_type$Bedroom == 1, "1 Bed",
ifelse(apartment_type$Bedroom == 2, "2 Bed",
ifelse(apartment_type$Bedroom == 3, "3 Bed",
ifelse(apartment_type$Bedroom == 4, "4 Bed", "None"))))))
datatable(apartment_type, options = list(scrollX = T))
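The nested ifelse() above is easier to check when the same rules are written as "start from the bedroom count, then overwrite by area". The helper apartment_type_label() below is a hypothetical rewrite with the same rules, shown on made-up values; it assumes there are no missing values.

```r
# Hypothetical helper: same rules as the nested ifelse() above.
apartment_type_label <- function(total_area, bedroom) {
  # Default label from the bedroom count.
  type <- ifelse(bedroom %in% 1:4, paste(bedroom, "Bed"), "None")
  # Small units are labelled by area, whatever the bedroom count says.
  type[total_area >= 25 & total_area <= 44] <- "Studio"
  type[total_area >= 12 & total_area <= 24] <- "Micro"
  type
}

apartment_type_label(total_area = c(20, 30, 60, 120, 44.5),
                     bedroom    = c(1, 1, 2, 3, 1))
# "Micro" "Studio" "2 Bed" "3 Bed" "1 Bed"
```

The last value shows an edge case of the rules: an area of 44.5 m² falls between the studio band and the next one, so it is labelled by its bedroom count.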
A donut plot is basically a bar plot with some adjustments (here, rectangles drawn in polar coordinates). Donut plots are generally used to display percentages of data.
apartment_type$Type <- as.factor(apartment_type$Type)
apartment_type <- apartment_type %>%
group_by(Type) %>%
summarise(Jumlah = n()) %>%
ungroup()
#Compute percentages
apartment_type$fraction <- apartment_type$Jumlah / sum(apartment_type$Jumlah)
# Compute the cumulative percentages (top of each rectangle)
apartment_type$ymax <- cumsum(apartment_type$fraction)
# Compute the bottom of each rectangle
apartment_type$ymin <- c(0, head(apartment_type$ymax, n=-1))
# Compute label position
apartment_type$labelPosition <- (apartment_type$ymax + apartment_type$ymin) / 2
# Compute a good label
apartment_type$label <- paste0(apartment_type$Type, "\n Total: ", floor(apartment_type$fraction*100), "%")
# Convert to data frame
apartment_type <- as.data.frame(apartment_type)
apartment_type
#> Type Jumlah fraction ymax ymin labelPosition
#> 1 1 Bed 37 0.08096280 0.0809628 0.0000000 0.0404814
#> 2 2 Bed 203 0.44420131 0.5251641 0.0809628 0.3030635
#> 3 3 Bed 96 0.21006565 0.7352298 0.5251641 0.6301969
#> 4 4 Bed 14 0.03063457 0.7658643 0.7352298 0.7505470
#> 5 Micro 12 0.02625821 0.7921225 0.7658643 0.7789934
#> 6 Studio 95 0.20787746 1.0000000 0.7921225 0.8960613
#> label
#> 1 1 Bed\n Total: 8%
#> 2 2 Bed\n Total: 44%
#> 3 3 Bed\n Total: 21%
#> 4 4 Bed\n Total: 3%
#> 5 Micro\n Total: 2%
#> 6 Studio\n Total: 20%
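The ring segments can be verified by hand from the counts in the table above. The sketch also shows a side effect of floor(): the percentage labels do not have to add up to 100.

```r
# Counts per type, copied from the table above.
counts <- c("1 Bed" = 37, "2 Bed" = 203, "3 Bed" = 96,
            "4 Bed" = 14, "Micro" = 12, "Studio" = 95)

fraction <- counts / sum(counts)
ymax <- cumsum(fraction)            # top of each ring segment
ymin <- c(0, head(ymax, n = -1))    # bottom = top of the previous segment

# floor() always rounds down, so the labels sum to less than 100.
sum(floor(fraction * 100))   # 98
sum(round(fraction * 100))   # 100
```

For these counts round() happens to give labels that sum to exactly 100, but that is not guaranteed in general.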
Make the plot
ggplot(apartment_type, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Type)) +
geom_rect() +
geom_text( x=2, aes(y=labelPosition, label=label, color=Type), size=6) +
scale_fill_manual(values=c("#D52027", "#FFB049", "#1E4558", "#F8EEE3", "#ABCDEF", "#123456")) +
coord_polar(theta="y") +
xlim(c(-1, 4)) +
theme_void() +
theme(legend.position = "none")