2. Main System
2.1 Load libraries and functions
source("function.R")
source("visualization.R")
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)
library(readr)
library(magrittr)
library(ggplot2)
library(gridExtra)
library(caret)
library(missForest)
library(mice)
library(gbm)
library(randomForest)
library(reshape2)
library(rpart)
library(rpart.plot)
library(knitr)
library(scales)
library(e1071)
library(patchwork)
library(GGally)
2.2 General overview of the raw dataset
2.2.1 Load raw dataset
df <- read_csv("raw_rental mala.csv")
glimpse(df)
## Rows: 19,991
## Columns: 14
## $ ads_id <dbl> 100323185, 100203973, 100323128, 100191767, 9702~
## $ prop_name <chr> "The Hipster @ Taman Desa", "Segar Courts", "Pan~
## $ completion_year <dbl> 2022, NA, NA, 2020, NA, NA, NA, 2018, 2014, NA, ~
## $ monthly_rent <chr> "RM 4 200 per month", "RM 2 300 per month", "RM ~
## $ location <chr> "Kuala Lumpur - Taman Desa", "Kuala Lumpur - Che~
## $ property_type <chr> "Condominium", "Condominium", "Apartment", "Apar~
## $ rooms <dbl> 5, 3, 3, 2, 1, 3, 3, 1, 2, 3, 2, 4, 1, 3, 5, 3, ~
## $ parking <dbl> 2, 1, NA, 1, 1, 1, 2, 1, 1, 1, NA, 2, 1, 1, NA, ~
## $ bathroom <dbl> 6, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, ~
## $ size <chr> "1842 sq.ft.", "1170 sq.ft.", "650 sq.ft.", "743~
## $ furnished <chr> "Fully Furnished", "Partially Furnished", "Fully~
## $ facilities <chr> "Minimart, Gymnasium, Security, Playground, Swim~
## $ additional_facilities <chr> "Air-Cond, Cooking Allowed, Washing Machine", "A~
## $ region <chr> "Kuala Lumpur", "Kuala Lumpur", "Kuala Lumpur", ~
It can be found that the raw dataset is very dirty. We first find the following issues:
- ads_id and prop_name can be meanless variables.
- completion_year, parking include large number of NA value.
- monthly_rent wrongly assigned to chr type because of RM and per month description.
- location is chr data, need further process.
- property_type, is char data, need further transformation.
- rooms, bathroom seems fine for first view.
- size wrongly assigned to chr type because of sq.ft.
- furnished is char data, need further transformation.
- facilities and additional_facilities seems meanless, need further process.
- region is char data, need further transformation.
2.3 Data Pre-processing
2.3.1 Remove duplicated data
# Remove duplicated values
dup1 <- sum(duplicated(df))
cat('The number of duplicated data from the dataset before processing is:',dup1)
## The number of duplicated data from the dataset before processing is: 8
df <- df[!duplicated(df), ]
dup2 <- sum(duplicated(df))
cat(' The number of duplicated data from the dataset after processing is:',dup2)
## The number of duplicated data from the dataset after processing is: 0
- Firstly, we check and delete the duplicated data from the raw dataset.
2.3.2 Removing ‘RM’ from monthly rent and rename the column
df <- df %>%
mutate(monthly_rent = gsub("RM|per month|,", "", monthly_rent),
monthly_rent = gsub("\\s+", "", monthly_rent),
monthly_rent = as.numeric(monthly_rent))
colnames(df)[colnames(df) == "monthly_rent"] <- "monthly_rent_rm"
- For monthly_rent, we remove the RM, per month, transform it as numeric and rename the column as monthly_rent_rm.
2.3.3 Dropping ‘sq.ft’ from size
df$size <- as.numeric(str_extract(df$size, "(.*?)(?= sq)"))
# rename column name as size_sqft
colnames(df)[colnames(df) == "size"] <- "size_sqft"
- For size, we remove the sq.ft, transform it as numeric and and rename the column as size_sqft.
2.3.4 Extract near_by_ktm_lrt from location count with yes/no
df$location <- str_extract(df$location, "\\w+$")
- For location, we remove the province name, such as KL and Selangor.
- Further transformation will process after visualization.
2.3.5 Facilities and additional_facilities processing
# count the no of facilities
df <- df %>%
mutate(fac_count = str_count(facilities, ",") + 1,
add_fac_count = str_count(additional_facilities, ",") + 1)
# generate a new column named nearby_ktm_lrt from aditional_facilities
df$nearby_ktm_lrt <- sapply(df$additional_facilities, function(x) extract_near_ktm_lrt(x))
- Facilities and additional_facilities, we count the number of facilities it included for each and create two new column as fac_count and add_fac_count.
- Secondly, we think that if the house near LRT can be a important factor that affects monthly rent.
- So, we create a new column named nearby_ktm_lrt and judgment the value yes/no based on LRT and KTM appeared in additional_facilities.
2.3.6 drop ads_id, prop_name, facilities, additional_facilities, location
df <- df[, !(names(df) %in% c('ads_id', 'prop_name', 'facilities', 'additional_facilities'))]
glimpse(df)
## Rows: 19,983
## Columns: 13
## $ completion_year <dbl> 2022, NA, NA, 2020, NA, NA, NA, 2018, 2014, NA, 2019, ~
## $ monthly_rent_rm <dbl> 4200, 2300, 1000, 1700, 1299, 1500, 2900, 1550, 1400, ~
## $ location <chr> "Desa", "Cheras", "Desa", "Sentul", "Kiara", "Setapak"~
## $ property_type <chr> "Condominium", "Condominium", "Apartment", "Apartment"~
## $ rooms <dbl> 5, 3, 3, 2, 1, 3, 3, 1, 2, 3, 2, 4, 1, 3, 5, 3, 3, 2, ~
## $ parking <dbl> 2, 1, NA, 1, 1, 1, 2, 1, 1, 1, NA, 2, 1, 1, NA, 1, 1, ~
## $ bathroom <dbl> 6, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, ~
## $ size_sqft <dbl> 1842, 1170, 650, 743, 494, 884, 982, 700, 750, 862, 86~
## $ furnished <chr> "Fully Furnished", "Partially Furnished", "Fully Furni~
## $ region <chr> "Kuala Lumpur", "Kuala Lumpur", "Kuala Lumpur", "Kuala~
## $ fac_count <dbl> 10, 9, 4, 8, 11, 6, 10, 7, 5, 6, 7, 9, 13, 11, 3, 13, ~
## $ add_fac_count <dbl> 3, 3, NA, 3, 1, 2, 4, 4, 4, 3, 2, 4, 4, 3, 3, 5, 5, 3,~
## $ nearby_ktm_lrt <chr> "no", "yes", NA, "yes", "no", "yes", "yes", "yes", "ye~
- Next, we drop the 4 useless columns ‘ads_id’, ‘prop_name’, ‘facilities’, ‘additional_facilities’.
2.4 Visualization
2.4.1 Bar Plot
ggplot(df) + geom_bar(mapping = aes(x = furnished, fill = "red")) + geom_text(mapping = aes(x = furnished, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")
ggplot(df) + geom_bar(mapping = aes(x = region, fill = "red")) + geom_text(mapping = aes(x = region, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")
ggplot(df) + geom_bar(mapping = aes(x = completion_year, fill = "red")) + geom_text(mapping = aes(x = completion_year, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")
ggplot(df) + geom_bar(mapping = aes(x = bathroom, fill = "red")) + geom_text(mapping = aes(x = bathroom, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")
- Fully furnished shows the most count and there are 5 NA data in it
- Kuala Lumpur and Selangor have almost the same value count.
- There are more rental shows in year 2021 compare to the rest.
- Most of the rented house have 2 bathroom and 3 of the rented house have 8 bathroom.
2.4.2 Stack bar description
ggplot(df, aes(fill=bathroom, y=monthly_rent_rm, x=rooms)) +
geom_bar(position="dodge", stat="identity")
- From the above bar chart we can conclude that 2 room and below have 1 bathroom. 4 rooms have 4-6 bathrooms. There are more rented house with 4 rooms with 4-6 bathrooms.
2.4.3 Parallel Coordinate Plot
#Normalization done to shows a better Parallel Coordinate results
process <- preProcess(as.data.frame(df), method=c("range"))
norm_scale <- predict(process, as.data.frame(df))
ggparcoord(norm_scale,
columns = 4:6,
groupColumn = "monthly_rent_rm",
showPoints = TRUE)
- Above plot is to shows relationship between the 3 variables which is rooms, parking and bathroom.
- The values in a parallel coordinate plot are always normalized.
- Each of the line are collection of points of each of the variables and all are connected.
- This graph below shows that all this 3 variables relationship with monthly rental is at the price below 0.25.
- The cons of this graph is graph may become over-cluttered as data hasn’t been clean yet which may be one of the reason all are showing below 0.25.
2.4.4 Histogram for interval cropped
# histogram
grid.arrange(his_completion_year, his_monthly_rm, his_parking, his_size,
his_rooms,his_bathroom,his_fac_count,his_add_fac_count,nrow = 4,ncol = 2)
- I first using histogram to overview all the interval varaibles.
- After the parameters adjustment, the histogram is shown as the above diagram.
- The best parameters setting wiil be shown in next data cleaning section.
2.4.5 Box plot for some cropped comparation
# boxplot
grid.arrange(box_uncropped_size,box_cropped_size,
box_uncropped_bathrooms,box_cropped_bathrooms,
box_uncropped_parking,box_cropped_parking,
box_uncropped_rent,box_cropped_rent,
box_uncropped_rooms,box_cropped_rooms,nrow = 5, ncol = 2)
- Secondly, I using box-plot to show the data distribution for before and after cropped.
2.4.6 Pie Chart for nominal variables
pie_furnished <- create_pie_chart(df, "furnished")
pie_property_type <- create_pie_chart(df, "property_type")
pie_region <- create_pie_chart(df, "region")
pie_nearby_ktm_lrt <- create_pie_chart(df, "nearby_ktm_lrt")
- Next I used pie chart to overview nominal variable distribution.
- As you can view, all the nominal variable except property_type distribution seems good.
- For property_type, more than 95% percent of data are in Apartment, Condominium, and Service Residence.
2.4.7 Line chart(interval) VS monthly_rent
grid.arrange(line_size_VS_rent, line_year_VS_rent, line_Parking_VS_rent,
line_add_fac_count_VS_rent,line_fac_count_VS_rent,line_rooms_VS_rent,line_bathroom_VS_rent,nrow = 4,ncol = 2)
- In the end, I use plot chart to view the relation among interval variable to monthly rent.
- And box plot to view the nominal variable compare to monthly rent.
- It can be found that, for plot chart, Rental prices tend to rise as most of these interval variable increase. Completion Year appear to have the least variation.
2.4.7 Box plot(normal) VS monthly_rent
grid.arrange(box_lrt_VS_rent,
box_furnished_VS_rent,
box_property_VS_rent,
box_region_VS_rent, nrow = 4, ncol = 1)
- Through the box plot, it can be seen that most of the nominal variables have some interference data when the rent price is greater than 4000 and smaller than 500.
2.5 Data cleaning after visualization
2.5.1 Noise/outliers processing
df1 <- df
df1 <- df1 %>% filter(monthly_rent_rm > 400 & monthly_rent_rm < 3500)
df1 <- df1 %>% filter(size_sqft > 100 & size_sqft < 2000)
df1 <- df1 %>% filter(rooms > 0 & rooms < 4)
df1 <- df1 %>% filter(parking > 0 & parking < 4)
df1 <- df1 %>% filter(bathroom > 0 & bathroom < 4)
df1 <- df1[df1$property_type %in% c("Apartment", "Condominium", "Service Residence"), ]
- After visualization, as you can see, we filter the most suitable range for the numeric variables.
- And for property_type, it can be found that 95% of data are in “Apartment”, “Condominium”, “Service Residence”. So, we only keep these 3 types of data.
2.5.2 Categorical to numerical transformation
df1$furnished <- round(as.numeric(factor(df1$furnished, levels = c("Not Furnished", "Partially Furnished", "Fully Furnished"), labels = c(1, 2, 3))),0)
df1$region <- round(as.numeric(factor(df1$region, levels = c("Kuala Lumpur", "Selangor"), labels = c(1, 2))),0)
df1$nearby_ktm_lrt <- round(as.numeric(factor(df1$nearby_ktm_lrt, levels = c("yes", "no"), labels = c(2, 1))),0)
df1$property_type <- round(as.numeric(factor(df1$property_type, levels = c("Condominium", "Apartment", "Service Residence"), labels = c(1, 2, 3 ))),0)
- For categorical data transformation, we transform it to 1,2,3 based the unique value for each type except location.
2.5.3 count unique value for categorical data
is_categorical <- df1[,sapply(df1, function(x) is.factor(x) || is.character(x))]
print(paste('The number of unique values for categorical data:'))
## [1] "The number of unique values for categorical data:"
sapply(is_categorical, function(x) length(unique(x)))
## location
## 90
- Next, we count the number of Unique values for all the categorical variables.
- The same time we want to check the if all categorical data transform successfully.
2.5.4 Location filter
# count the number of rows for each unique value
location_counts <- df1 %>%
group_by(location) %>%
summarise(n = n())
# get the unique value count greater than 100
selected_locations <- location_counts %>%
filter(n > 100) %>%
pull(location)
# keep the unique value count greater than 100
df1 <- df1 %>%
filter(location %in% selected_locations)
- We tried to drop location first, however, r2-score for regression models and accuracy for classification models are lower around 10%.
- So, we keep the location processing. Firstly, we count the number of each unique values of location, and remove the unique values counted less than 100 rows.
2.5.5 Location processing
# transform unique value of in category of X to new column
X_cat <- df1[,sapply(df1, function(x) is.factor(x) || is.character(x))]
X_cat_ohe <- dummyVars(~ . , data = X_cat)
X_cat_ohe <- predict(X_cat_ohe, newdata = X_cat)
- Next, We convert the unique values that appear in all locations into new dummy variables through the dummyVars() function, and the values of these columns are displayed in binary form.
- And we cbind the numeric data imputed and the location together.
2.5.6 Null value processing
#NA processing
colSums(is.na(df1))
## completion_year monthly_rent_rm location property_type rooms
## 4548 0 0 0 0
## parking bathroom size_sqft furnished region
## 0 0 0 0 0
## fac_count add_fac_count nearby_ktm_lrt
## 843 2741 2741
# Apply numerical imputation
num_name <- colnames(df1)[sapply(df1, function(x) is.numeric(x) || is.integer(x))]
X_num <- df1[, num_name]
X_num_imputed <- numericalImputation(X_num, strategy = "mode")
colSums(is.na(X_num_imputed))
## completion_year monthly_rent_rm property_type rooms parking
## 0 0 0 0 0
## bathroom size_sqft furnished region fac_count
## 0 0 0 0 0
## add_fac_count nearby_ktm_lrt
## 0 0
df1 <- cbind(X_num_imputed, X_cat_ohe)
- Next we check the NA value for all the columns.
- If can be found that for location there is no NA value, all NA are in numerical variables.
- So, we fill the NA value by using mode. Here we didn’t use mean because if we using mean, the value filled might not be integer for some data, such as parking number.
2.6 Regression model pre-processing
2.6.1 X/y split, train/test split
# Call the function and assign the results to X and y
result <- extractInputOutput(data = df1, output_column_name = "monthly_rent_rm")
X1 <- result$input_data
y1 <- result$output_data
X_clean1 <- standardizerData(X1)
set.seed(123)
train_indices1 <- createDataPartition(y1, p = 0.8, list = FALSE)
# Split the data into training and testing sets for regression model
X_train1 <- X_clean1[train_indices1, ]
y_train1 <- y1[train_indices1]
X_test1 <- X_clean1[-train_indices1, ]
y_test1 <- y1[-train_indices1]
cat("Proportion of test set:", round(nrow(X_test1) / nrow(X_clean1), 2))
## Proportion of test set: 0.2
- Before building models for regression, I split the dataset set the monthly_rent_rm as y value.
- Then Split to train set and test set.
- The proportion of the test set is 20%.
2.7 Train regression model
2.7.1 Linear regression
## linear regression model
y_baseline <- rep(mean(y_train1), length(y_train1))
start_time_lr <- Sys.time()
# Train the linear regression model
model_lr <- lm(y_train1 ~ ., data = X_train1)
end_time_lr <- Sys.time()
time_difference_lr <- end_time_lr - start_time_lr
# Predict using the test data
y_pred_lr_train <- round(predict(model_lr, X_train1),0)
# Calculate metrics
lr_score_train <- calculate_metrics(y_train1, y_pred_lr_train)
plot(y_train1, y_pred_lr_train, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_lr_train), col = "red")
# Predict using the test data
y_pred_lr_test <- round(predict(model_lr, X_test1),0)
# Calculate metrics
lr_score_test <- calculate_metrics(y_test1, y_pred_lr_test)
plot(y_test1, y_pred_lr_test, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_lr_test), col = "red")
2.7.2 Gradient Boosting Regression
start_time_gbr <- Sys.time()
# Build Gradient Boosting Regression model
model_gbr <- gbm(y_train1 ~ ., data = X_train1, n.trees = 100,
distribution = "gaussian", interaction.depth = 4, shrinkage = 0.1,
n.minobsinnode = 10, bag.fraction = 0.5, train.fraction = 0.8,
cv.folds = 5, verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 249821.4660 223845.3719 0.1000 18009.7644
## 2 234682.3656 211888.8292 0.1000 15311.9433
## 3 221694.2735 201495.1385 0.1000 13211.0232
## 4 210494.9206 192986.3171 0.1000 11478.8619
## 5 200848.5780 184834.0948 0.1000 9083.6325
## 6 192724.1202 179578.1656 0.1000 7552.5611
## 7 185941.5911 173882.7275 0.1000 6469.6144
## 8 179596.7761 168041.7649 0.1000 6148.0347
## 9 173787.9203 163247.1498 0.1000 5476.3547
## 10 168773.0376 159266.6308 0.1000 4728.8987
## 20 134899.3940 129697.8388 0.1000 2660.6819
## 40 107820.4139 103505.6837 0.1000 586.2031
## 60 95651.3158 90540.1896 0.1000 337.5581
## 80 88944.8184 83961.4944 0.1000 54.5086
## 100 84705.8242 80010.2710 0.1000 -2.6924
end_time_gbr <- Sys.time()
time_difference_gbr <- end_time_gbr - start_time_gbr
y_pred_gbr_train <- predict(model_gbr, newdata = X_train1, n.trees = 100, type = "response")
# Calculate mean absolute error and R- suqared
gbr_score_train <- calculate_metrics(y_train1, y_pred_gbr_train)
# Print R-squared and MAE scores
plot(y_train1, y_pred_gbr_train, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_gbr_train), col = "red")
# Predict on the test set
y_pred_gbr_test <- predict(model_gbr, newdata = X_test1, n.trees = 100, type = "response")
# Calculate mean absolute error and R- suqared
gbr_score_test <- calculate_metrics(y_test1, y_pred_gbr_test)
# Print R-squared and MAE scores
plot(y_test1, y_pred_gbr_test, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_gbr_test), col = "red")
2.7.3 Random Forest Regression
# Build random forest regression model
colnames(X_train1) <- gsub(" ", "_", colnames(X_train1))
start_time_rf <- Sys.time()
model_rf <- randomForest(y_train1 ~ ., data = X_train1)
end_time_rf <- Sys.time()
time_difference_rf <- end_time_rf - start_time_rf
# Predict train
y_pred_rf_train <- predict(model_rf, newdata = X_train1)
# Calculate mean absolute error and R2-squared
rf_score_train <- calculate_metrics(y_train1, y_pred_rf_train)
plot(y_train1, y_pred_rf_train, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_gbr_train), col = "red")
# Predict test
y_pred_rf_test <- predict(model_rf, newdata = X_test1)
# Calculate mean absolute error and R2-squared
rf_score_test <- calculate_metrics(y_test1, y_pred_rf_test)
plot(y_test1, y_pred_rf_test, type = "p",
xlab = "Actual Rent",
ylab = "Predicted Rent",
main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_gbr_test), col = "red")
- According to the fit chart, random forest has the best fit.
2.7.4 Create the summary DataFrame
summary_df <- data.frame(
Model = c("LR", "GBR", "RFR"), # Update with model names
MAE_Score_train = c(lr_score_train$MAE_score,
gbr_score_train$MAE_score,
rf_score_train$MAE_score), # Update with MAE scores
R2_Score_train = c(lr_score_train$R2_score,
gbr_score_train$R2_score,
rf_score_train$R2_score), # Update with R2 scores
MAE_Score_test = c(lr_score_test$MAE_score,
gbr_score_test$MAE_score,
rf_score_test$MAE_score), # Update with MAE scores
R2_Score_test = c(lr_score_test$R2_score,
gbr_score_test$R2_score,
rf_score_test$R2_score), # Update with R2 scores
Modeling_Time_taken = round(c(time_difference_lr, time_difference_gbr, time_difference_rf),2)
)
# Sort the DataFrame by R2-Score in descending order
regression_results <- summary_df[order(summary_df$R2_Score_test, decreasing = TRUE), ]
regression_results
## Model MAE_Score_train R2_Score_train MAE_Score_test R2_Score_test
## 3 RFR 92.28 0.9334 160.4 0.7938
## 2 GBR 212.8 0.6803 214.1 0.6764
## 1 LR 240.7 0.5981 238.3 0.6053
## Modeling_Time_taken
## 3 88.68 secs
## 2 11.68 secs
## 1 0.04 secs
- I used MAE_Score, R^2_Score to evaluate the three models.
- MAE score is a robust measure of how accurately the model predicts the target variable. Lower MAE values are better as they indicate less error.
- r^2 score provides a measure of how well future outcomes are likely
to be predicted by the model. R^2 of 1 indicates that the regression
predictions perfectly fit the data.
- The performance of random forest has the lowest MAE score and highest r^2 score for both train set and test set. However, it takes longest time to train the model. Also, compare to the score with train set and test set, random forest seems has the largest differences. Which mean the score for random forest is not stable, not so reliable. Even so, the random forest regression model far outperforms than LR and GBR. Thus, random forest is the winner among the three regression models.
2.8 Classification processing
2.8.1 Create new target renovation_level
q33 <- quantile(df1$monthly_rent_rm/ df1$rooms, probs = 0.33)
q67 <- quantile(df1$monthly_rent_rm/ df1$rooms, probs = 0.67)
df1$renovation_level <- ifelse(df1$monthly_rent_rm / df1$rooms < 450, "standard",
ifelse(df1$monthly_rent_rm / df1$rooms >= 450 &
df1$monthly_rent_rm / df1$rooms <= 666,
"exquisite","luxury"))
df1$renovation_level <- round(as.numeric(factor(df1$renovation_level, levels = c("luxury", "exquisite", "standard"), labels = c(3, 2, 1))),0)
- Since this project requires 2 research questions, which are regression and classification.
- Our dataset doesn’t have another meaningful classification target.
- So, we decided to create a new target variable based on the range of monthly_rent_rm.
- The new target variable is renovation_level, and it has three types of class which are luxury, exquisite, and standard.
2.8.2 X/y split, train/test split
df2 <- df1
result2 <- extractInputOutput(data = df2, output_column_name = "renovation_level")
X2 <- result2$input_data
y2 <- result2$output_data
X_clean2 <- standardizerData(X2)
# Train-test split
# Create the train and test indices
train_indices2 <- createDataPartition(y2, p = 0.8, list = FALSE)
# Split the data into training and testing sets
X_train2 <- X_clean2[train_indices2, ]
y_train2 <- y2[train_indices2]
X_test2 <- X_clean2[-train_indices2, ]
y_test2 <- y2[-train_indices2]
round(nrow(X_test2) / nrow(X_clean2), 2)
## [1] 0.2
2.8.3 Classification variables exploration
# classification variables exploration
# draw the correlation heat map
table(y_test2)
## y_test2
## 1 2 3
## 694 763 695
# From the table(y_test2), we can view the 3 types of class distribution to ensure that there is not much difference in the amount of 3 types of data.
corr <- c("monthly_rent_rm","completion_year", "rooms", "parking", "bathroom", "size_sqft",
"fac_count", "add_fac_count","region", "nearby_ktm_lrt","property_type","renovation_level")
heat <- df2[,corr]
options(repr.plot.width=8, repr.plot.height=6)
qplot(x=Var1, y=Var2, data=melt(cor(heat, use="p")), fill=value, geom="tile") +
scale_fill_gradient2(low = "green", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1))+
coord_fixed()+
ggtitle("Figure 7 Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.4))
2.9 Build Classification Model
2.9.1 Decision Tree Classification
# Create the decision tree model
X_train2 <- X_train2[, !(names(X_train2) %in% c("monthly_rent_rm"))]
X_test2 <- X_test2[, !(names(X_test2) %in% c("monthly_rent_rm"))]
ctrl=rpart.control(cp=0.001)
model_dt <- rpart(y_train2 ~ ., data = X_train2, method = 'class', control=ctrl)
# Predict using the tuned decision tree model
y_pred_dt <- predict(model_dt, X_test2, type = 'class')
# Get the predicted labels based on the highest column index
pre_mat_dt <- table(y_test2, y_pred_dt)
# print accuracy based on the confusion matrix prediction
acc_test_dt <- sum(diag(pre_mat_dt)) / sum(pre_mat_dt)
print(paste('Accuracy for Decision Tree test:', round(acc_test_dt, 4)))
## [1] "Accuracy for Decision Tree test: 0.7728"
- For classification model, we draw three types as well.
- Since the classification target is generate based on the monthly_rent, so before build the model, first we need to remove the monthly_rent_rm variable. otherwise the accuracy rate will be falsely high.
- The first one we used is Decision Tree.
- As you can see, the accuracy for decision tree is around 77%.
2.9.2 Naive Bayes Classification
# Naive Bayes using in-built package e1071
NBclassfier <- naiveBayes(y_train2 ~ ., data = X_train2)
y_pred_NB <- predict(NBclassfier, newdata = X_test2)
pre_mat_NB <- table(y_test2, y_pred_NB)
acc_test_NB <- sum(diag(pre_mat_NB)) / sum(pre_mat_NB)
print(paste('Accuracy for Naive Bayes test:', round(acc_test_NB, 4)))
## [1] "Accuracy for Naive Bayes test: 0.619"
2.9.3 Random Forest Classification
# Build random forest for classification
rf_classifer <- randomForest(y_train2 ~ ., data = X_train2, importance =TRUE,ntree=500,nodesize=7, na.action=na.roughfix)
# Predict
y_pred_rf2 <- round(predict(rf_classifer, newdata = X_test2))
pre_mat_rf2 <- table(y_test2, y_pred_rf2)
acc_test_rf <- sum(diag(pre_mat_rf2)) / sum(pre_mat_rf2)
print(paste('Accuracy for random forest test:', round(acc_test_rf, 4)))
## [1] "Accuracy for random forest test: 0.8397"
2.9.4 classification results
classification_results <- list(pre_mat_dt, pre_mat_NB, pre_mat_rf2)
# Calculate metrics
classification_results <- classification_results_metrics(classification_results)
classification_results
## Accuracy Precision 1 Precision 2 Precision 3 Recall 1 Recall 2 Recall 3
## DT 0.7728 0.8382 0.6803 0.8233 0.8285 0.7418 0.7511
## NB 0.6190 0.7674 0.5163 0.6651 0.5562 0.6855 0.6086
## RF 0.8397 0.9232 0.7485 0.8746 0.8660 0.8309 0.8230
## F1 Score 1 F1 Score 2 F1 Score 3
## DT 0.8333 0.7097 0.7856
## NB 0.6449 0.5890 0.6356
## RF 0.8937 0.7876 0.8480
These are the comparation results for classification models.
Here I used Accuracy, precision, recall, and the F1-score to evaluate their performance, since they can provide different perspectives on the model’s performance.
Accuracy is the proportion of correct predictions which is true positives and true negatives among the total number of cases examined. It’s a suitable measure when the classes are balanced, but can be misleading if the class distribution is imbalanced. For test set, the classes of y is balanced, so it is a reliable measures.
Precision is the proportion of true positives out of all positive predictions. It is a measure of a classifier’s exactness. A low precision can also indicate a high number of false positives. So, I want to know the Positive Predictive rate of the model.
Recall is the proportion of true positives out of all actual positives which is true positives + false negatives. Recall is a measure of a classifier’s completeness. A low recall indicates many false negatives. I want to know the True Positive Rate of the model.
The F1-score is the harmonic mean of precision and recall, and tries to balance the two. It is especially useful if the dataset have an uneven class distribution, as it seeks a balance between precision and recall. Here, I want to use F1-score as a reference.
As the results, we have 3 conclusion
- The precision for class 1 of all the three model are greater than accuracy, class 2 of all the three model are lower than accuracy and precision for class 1 and 2 very much, which mean the models are less confident and less correct when predicting Class 2. It could imply that Class 2 is more difficult for the models to correctly identify, possibly due to reasons such as overlap of features with other classes, lack of distinctive features, or poor quality of labels for Class 2.
- The precision, recall and F1-score for class 1 is clearly higher than 2 and 3 for all the three models. As the three classes are balanced. So, Class 1 might have more distinct or clear-cut features that make it easier to identify compared to classes 2 and 3.
- Random forest has the best performance.