Rental Price Prediction for Kuala Lumpur/ Selangor

#1.INTRODUCTION

The rental market in Kuala Lumpur and Selangor is highly competitive, and rental prices have risen significantly in recent years, which makes it harder for landlords to price their units. Underpricing results in lost income, while overpricing makes the property difficult to let and drives away suitable tenants. It is therefore crucial to examine rental prices carefully and suggest a fair rate that reflects the property’s value, so that both landlord and tenant benefit from the arrangement.

In recent years, machine learning has become a popular approach to prediction, driven by the growing availability of large datasets. We can apply it to rental price prediction by training models such as linear regression, decision trees or neural networks on relevant features such as property location, size and amenities. Such models can help prospective tenants and landlords arrive at a rental price per unit that is in line with the market.

#2.OBJECTIVE

Based on the existing problem, objectives can be established to clarify the research direction.

2.1 To use different machine learning techniques and models to identify the factors that most influence rental prices.

2.2 To use existing datasets to predict rents for properties with different conditions.

#3.DATA PRE-PROCESSING

In this section, the dataset is read, empty strings are recoded as missing values, and missing values are either imputed or removed.

data <- read.csv("RENTAL.csv", stringsAsFactors = FALSE)
data <- as.data.frame(data)

# Recode empty strings as missing values (NA)
blank_cols <- c("prop_name", "completion_year", "monthly_rent", "rooms", "parking",
                "bathroom", "furnished", "facilities", "additional_facilities")
for (col in blank_cols) {
  data[[col]][which(data[[col]] == "")] <- NA
}


data = subset(data,select = -c(ads_id))
data.frame("variable"=c(colnames(data)), 
           "missing values count"=sapply(data, function(x) sum(is.na(x))),
           row.names=NULL)

##                 variable missing.values.count
## 1              prop_name                  948
## 2        completion_year                 9185
## 3           monthly_rent                    2
## 4               location                    0
## 5          property_type                    0
## 6                  rooms                    6
## 7                parking                 5702
## 8               bathroom                    6
## 9                   size                    0
## 10             furnished                    5
## 11            facilities                 2209
## 12 additional_facilities                 5948
## 13                region                    0

Missing values in monthly_rent, rooms, bathroom and furnished are filled with the most frequent value (the mode) of each column.

# Return the most frequent non-missing value of x
Mode <- function(x){
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x,ux)))]
}

# Replace NA in each column with that column's mode
mode_monthly_rent <- Mode(data$monthly_rent)
data$monthly_rent[is.na(data$monthly_rent)] <- mode_monthly_rent

mode_rooms <- Mode(data$rooms)
data$rooms[is.na(data$rooms)] <- mode_rooms

mode_bathroom <- Mode(data$bathroom)
data$bathroom[is.na(data$bathroom)] <- mode_bathroom

mode_furnished <- Mode(data$furnished)
data$furnished[is.na(data$furnished)] <- mode_furnished
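One detail worth checking in mode imputation is what happens when NA is itself the most frequent value in a column. The following is a minimal sketch on toy data, assuming a helper that drops NA before counting; the names mode_ignoring_na and rooms_toy are hypothetical and the values are not taken from RENTAL.csv.

```r
# Mode that skips missing values before counting (hypothetical helper)
mode_ignoring_na <- function(x) {
  ux <- unique(x[!is.na(x)])
  # match() returns NA for missing entries and tabulate() ignores them
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy vector in which NA occurs more often than any real value
rooms_toy <- c(3, 2, NA, 3, NA, NA, 1)

rooms_mode <- mode_ignoring_na(rooms_toy)
rooms_filled <- replace(rooms_toy, is.na(rooms_toy), rooms_mode)

print(rooms_mode)
print(rooms_filled)
```

If NA were counted as a candidate value here, it would win the vote and the imputation would leave the column unchanged, so dropping NA first is the safer choice.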

# Drop rows with a missing prop_name
data <- data[complete.cases(data$prop_name), ]

# Calculate the number of missing values for each variable
missing_count <- sapply(data, function(x) sum(is.na(x)))

# Set a threshold for the maximum number of missing values allowed
max_missing_count <- 1000  

# Create a logical vector indicating which variables to keep
keep_variables <- missing_count <= max_missing_count

# Subset the data to include only the variables with fewer missing values
c_data <- subset(data, select = keep_variables)

data.frame("variable"=c(colnames(c_data)), 
           "missing values count"=sapply(c_data, function(x) sum(is.na(x))),
           row.names=NULL)

##        variable missing.values.count
## 1     prop_name                    0
## 2  monthly_rent                    0
## 3      location                    0
## 4 property_type                    0
## 5         rooms                    0
## 6      bathroom                    0
## 7          size                    0
## 8     furnished                    0
## 9        region                    0

# Monthly rent: strip non-numeric characters and convert to numeric
c_data$monthly_rent <- as.numeric(gsub("[^0-9\\.]", "", c_data$monthly_rent))
summary(c_data$monthly_rent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      70    1100    1400    2202    1800 2400000

remove_outliers <- function(c_data, colname) {
  # calculate the first and third quartiles
  Q1 <- quantile(c_data[[colname]], 0.25, na.rm = TRUE)
  Q3 <- quantile(c_data[[colname]], 0.75, na.rm = TRUE)
  
  # calculate the IQR
  IQR <- Q3 - Q1
  
  # calculate the lower and upper bounds for outliers
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # remove outliers
  c_data[c_data[[colname]] >= lower_bound & c_data[[colname]] <= upper_bound, ]
}

c_data <- remove_outliers(c_data, "monthly_rent")
c_data <- remove_outliers(c_data, "size")

# rooms
c_data <- remove_outliers(c_data, "rooms")
summary(c_data)

##   prop_name          monthly_rent    location         property_type     
##  Length:17246       Min.   : 100   Length:17246       Length:17246      
##  Class :character   1st Qu.:1100   Class :character   Class :character  
##  Mode  :character   Median :1400   Mode  :character   Mode  :character  
##                     Mean   :1440                                        
##                     3rd Qu.:1700                                        
##                     Max.   :2850                                        
##      rooms          bathroom          size         furnished        
##  Min.   :1.000   Min.   :1.000   Min.   : 320.0   Length:17246      
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 736.0   Class :character  
##  Median :3.000   Median :2.000   Median : 862.0   Mode  :character  
##  Mean   :2.645   Mean   :1.837   Mean   : 872.3                     
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1003.0                     
##  Max.   :4.000   Max.   :8.000   Max.   :1430.0                     
##     region         
##  Length:17246      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

The summary above reports the minimum, quartiles, median, mean and maximum of each numeric variable (monthly_rent, rooms, bathroom and size); the character variables (prop_name, location, property_type, furnished and region) are summarized only by their length. After cleaning, 17,246 listings remain.
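The maximum rent of 2850 in the cleaned data can be traced back to the 1.5 x IQR rule in remove_outliers. The sketch below redoes that arithmetic using the quartiles of monthly_rent printed by summary() before filtering (1100 and 1800); it assumes those printed values are exact rather than rounded.

```r
# Quartiles of monthly_rent as printed by summary() before outlier removal
q1 <- 1100
q3 <- 1800

# Interquartile range and the 1.5 * IQR fences used by remove_outliers()
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

print(c(lower = lower_bound, upper = upper_bound))
```

The fences come out at 50 and 2850, which agrees with the minimum (100) and maximum (2850) rent reported after cleaning.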

The 9 features in the cleaned dataset are described as follows:
prop_name indicates the name of the building;
monthly_rent indicates the monthly rent in Ringgit Malaysia;
location indicates the property location;
property_type indicates the type of property;
rooms, bathroom and size describe the unit itself;
furnished indicates the furnishing status;
region indicates whether the property is in Kuala Lumpur or Selangor.

# Split c_data into a training set (80%) and a test set (20%)
s <- sample(c(1:nrow(c_data)), nrow(c_data)*0.8, replace=FALSE)
trainset <- c_data[s,]
testset <- c_data[-s,]
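Because sample() draws different rows on every run, the split above is not repeatable unless the random seed is fixed. A minimal sketch of a reproducible 80/20 split on a toy data frame follows; the seed value and the names toy, toy_train and toy_test are hypothetical and not part of the original analysis.

```r
# Fix the random seed so the same rows are drawn on every run (hypothetical seed)
set.seed(2023)

# Toy data standing in for c_data
toy <- data.frame(id = 1:10, monthly_rent = seq(1000, 1900, by = 100))

# Draw 80% of the row indices without replacement
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(dim(toy_train))
print(dim(toy_test))
```

Every row ends up in exactly one of the two sets, and rerunning the block with the same seed reproduces the same partition.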

#4.EXPLORATORY DATA ANALYSIS

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forcats)
library(tidyr)

# Plot histogram of number of rooms
ggplot(c_data, aes(x = rooms)) +
  geom_histogram(bins = 30, color = "black", fill = "lightblue") +
  labs(x = "Number of Rooms", y = "Count", title = "Number of Rooms")

This histogram shows the distribution of the variable ‘rooms’. Three-bedroom units are the most common and four-bedroom units are the least common.

# region
c_data %>%
  group_by(region) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~region, values = ~countt, type = 'pie') %>%
  add_pie(hole = 0.4) %>%
  layout(title = "Distribution of Properties by Region")

The pie chart shows the distribution of listings by region. The two regions have fairly even shares: Selangor accounts for a slightly higher 52% and Kuala Lumpur for 48%.

#prop_type
c_data %>%
  group_by(property_type) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~property_type, values = ~countt, type = 'pie') %>%
  layout(title = "Distribution of Properties Type")

The pie chart depicts the distribution of property types in the dataset, condominiums being the most prevalent, followed by apartments and service residences. Other property types account for smaller proportions.

# Status
c_data %>%
  group_by(furnished) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~furnished, values = ~countt, type = 'pie') %>%
  layout(title = "Furnished vs Unfurnished Rental Properties")

The pie chart shows the furnishing status of the properties. Only a small percentage are not furnished at all, and the shares of partially and fully furnished properties are about the same.

#location
# count the number of rental properties in each location and region
location_counts <- c_data %>%
  group_by(location, region) %>%
  summarise(n = n(),.groups = "drop") %>%
  ungroup() %>%
  arrange(region, desc(n)) %>%
  group_by(region) %>%
  top_n(5, n)

# plot the bar chart
ggplot(location_counts, aes(x = fct_reorder(location, n), y = n, fill = region)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Location", y = "Count", title = "Top 5 Rental Properties by Location for Each Region") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_fill_discrete(name = "Region")


The chart shows the five locations with the most rental listings in each region, with Kuala Lumpur in red and Selangor in blue. The counts are not far apart: among these locations, Kuala Lumpur - Bukit Jalil has the fewest listings and Kuala Lumpur - Cheras has the most.

#monthly_rent
c_data  %>%
  ggplot(aes(x = monthly_rent)) + 
  geom_histogram(bins = 30, color = "black", fill = "steelblue") +
  labs(x = "Monthly Rent", y = "Count")+ 
  ggtitle("Histogram of Monthly Rent")

This histogram shows the distribution of monthly rent. Most listings fall between 1000 and 2000, although the counts dip noticeably within the 1000 to 1500 range; listings between 2000 and 3000 are few, and those below 500 are almost nonexistent.

#rent vs region
c_data  %>%
  ggplot(aes(x = monthly_rent, fill = region)) + 
  geom_histogram(bins = 30, color = "black") +
  facet_grid(rows = vars(region), scales = "free_y") +
  labs(x = "Monthly Rent", y = "Count")+ 
  ggtitle("Comparison of Monthly Rent by Region")

As the graph shows, most rents in Kuala Lumpur fall in the 1200 to 2000 range, while most rents in Selangor fall in the 700 to 1300 range. Overall, Kuala Lumpur has more high-rent properties than Selangor.

# rent vs prop_type
# Define ranges of monthly rent
rent_ranges <- c(0, 1000, 2000, 3000)

# Create a new column that labels each rental with its rent range
c_data$rent_range <- cut(c_data$monthly_rent, breaks = rent_ranges, labels = c("0-1000", "1001-2000", "2001-3000"))

# Group the data by rent range and property type, and count the listings in each group
counts_grouped <- c_data %>%
  group_by(rent_range, property_type) %>%
  summarise(count = n(),.groups = "drop")

# Plot the grouped bar chart
ggplot(counts_grouped, aes(x = rent_range, y = count, fill = property_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Monthly Rent Range", y = "Count") +
  ggtitle("Comparison of Apartment Types by Monthly Rent Range") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart compares the number of listings of each property type across the rent ranges. Flats are most numerous in the 0 to 1000 range, at around 2400. The 1001 to 2000 range has upwards of 5000 condominiums, followed by service residences at around 3400 and apartments at around 2500. The 2001 to 3000 range contains few property types other than condominiums and service residences.
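The rent ranges in this chart depend on how cut() treats values that sit exactly on a break. By default the intervals are closed on the right, so a rent of exactly 1000 is labelled "0-1000" and a rent of exactly 2000 is labelled "1001-2000". A small sketch with toy rents (the vector toy_rent is hypothetical) illustrates this:

```r
# Same breaks and labels as in the analysis
rent_breaks <- c(0, 1000, 2000, 3000)
rent_labels <- c("0-1000", "1001-2000", "2001-3000")

# Toy rents, including values that fall exactly on a break
toy_rent <- c(100, 1000, 1001, 2000, 2850)

# cut() uses right-closed intervals by default: (0,1000], (1000,2000], (2000,3000]
toy_range <- cut(toy_rent, breaks = rent_breaks, labels = rent_labels)

print(data.frame(rent = toy_rent, range = toy_range))
```

A rent of exactly 0, or anything above 3000, would fall outside every interval and become NA, which is harmless here because the cleaned rents lie between 100 and 2850.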

# Status VS rental

ggplot(c_data, aes(x = furnished, y = monthly_rent, fill = furnished)) +
  geom_boxplot() +
  labs(x = "Furnished Status", y = "Monthly Rent") +
  ggtitle("Comparison of Monthly Rent by Furnished Status")

The box plot shows the distribution of monthly rent by furnishing status. Fully furnished units have higher rents overall than partially furnished and unfurnished units, with a median of around 1700, and their lower quartile lies above the upper quartile of unfurnished units. Partially furnished units in turn rent for more than unfurnished ones. In summary, furnishing status is strongly associated with rent: the more furnished the unit, the higher the rent it tends to command.

# room vs rental
ggplot(c_data, aes(x = as.factor(rooms), y = monthly_rent, fill = as.factor(rooms))) +
  geom_boxplot() +
  labs(x = "Number of Rooms", y = "Monthly Rental") +
  ggtitle("Rental Prices by Number of Rooms")

The graph shows that 3-room units have a somewhat lower median rent and a narrower spread of rents than 2-room units. The 4-room units are priced highest overall and have the highest median, while the 1-room units have the narrowest rental price distribution.

#5.MODELING

##5.1 Factors affecting rent prices based on Linear regression

#install.packages("corrplot")
#install.packages("caret")
#install.packages("rpart.plot")
#install.packages("randomForest")
library(corrplot)

## corrplot 0.92 loaded

numeric_columns <- c("monthly_rent", "rooms", "bathroom", "size")
correlation_matrix <- cor(c_data[, numeric_columns], use = "complete.obs")

print(correlation_matrix)

##              monthly_rent       rooms  bathroom      size
## monthly_rent  1.000000000 0.009417684 0.1689390 0.3442384
## rooms         0.009417684 1.000000000 0.6884188 0.6544130
## bathroom      0.168939047 0.688418828 1.0000000 0.6535930
## size          0.344238364 0.654413015 0.6535930 1.0000000

The correlation coefficient between rent and rooms is close to 0 (about 0.01), so on its own the number of rooms has almost no linear relationship with rent. The coefficient between rent and bathrooms is weakly positive (about 0.17), so rent tends to rise slightly with the number of bathrooms. Size has the strongest correlation with rent of the three (about 0.34), roughly twice that of bathrooms, although it is still only moderate. Rooms, bathrooms and size are strongly correlated with one another (about 0.65 to 0.69).

library(dplyr)
#install.packages("car")
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Fit the linear regression model
lm_model <- lm(monthly_rent ~ location + property_type + rooms + bathroom + size + furnished + region, data = c_data)

# Get the coefficients and corresponding feature names
coefficients <- coef(lm_model)[-1]  # Exclude the intercept term
feature_names <- names(coefficients)

# Sort the coefficients by their absolute values in descending order
sorted_indices <- order(abs(coefficients), decreasing = TRUE)
sorted_coefficients <- coefficients[sorted_indices]
sorted_feature_names <- feature_names[sorted_indices]

# Create an empty data frame to store the largest coefficients for each variable
max_coeff_table <- data.frame(Variable = character(),
                              Coefficient = numeric(),
                              stringsAsFactors = FALSE)

# Create a vector to store the variable names
variables <- c("location", "property_type", "furnished", "size", "rooms", "bathroom")

# Iterate over the variables
for (variable in variables) {
  # Get the indices of coefficients corresponding to the current variable
  indices <- grep(paste0("^", variable), sorted_feature_names)
  
  # Check if any coefficient exists for the current variable
  if (length(indices) > 0) {
    # Get the largest absolute coefficient for the current variable
    largest_index <- indices[1]
    largest_coefficient <- abs(sorted_coefficients[largest_index])
    
    # Create a new row in the data frame with the variable name and coefficient
    row <- data.frame(Variable = variable,
                      Coefficient = largest_coefficient,
                      stringsAsFactors = FALSE)
    
    # Append the row to the max_coeff_table
    max_coeff_table <- rbind(max_coeff_table, row)
  }
}

# Print the maximum coefficient table
print(max_coeff_table)

##                                     Variable Coefficient
## locationSelangor - 360              location 639.0626815
## property_typeService Residence property_type 398.2590976
## furnishedNot Furnished             furnished 439.8938717
## size                                    size   0.7626243
## rooms                                  rooms  28.0329632
## bathroom                            bathroom  92.2344789

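We explored the factors influencing rental prices by building a linear regression model and comparing the largest absolute coefficient associated with each variable. By this measure, location is the largest factor affecting rent price. Note that the coefficients of the categorical variables are differences from a reference level, while those of size, rooms and bathroom are per-unit effects, so the magnitudes are not directly comparable across the two kinds of variable.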

library(ggplot2)

# Plot the bar chart
ggplot(max_coeff_table, aes(x = Variable, y = Coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Impact of Factors on Monthly Rent", x = "Variable", y = "Coefficient") +
  theme_minimal()

The graph shows the largest absolute coefficient for each variable. Location has the largest coefficient (about 639), followed by furnished (about 440) and property type (about 398). Among the numeric variables, the bathroom coefficient is about 92 per additional bathroom and the rooms coefficient about 28 per additional room, in absolute value. Size has the smallest raw coefficient (about 0.76), but this is the effect of a single unit of size, so it does not mean that size matters least.
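Raw coefficient magnitudes depend on the units of each variable, which is why a small coefficient for size need not mean a small effect. The sketch below uses simulated data, not the rental dataset; the seed, the variable names and the assumption that size is measured in square feet are all hypothetical. It shows that rescaling a predictor rescales its coefficient without changing the model's fit.

```r
# Simulated listings (hypothetical data and seed)
set.seed(1)
n <- 200
size_sqft <- runif(n, 300, 1500)
rooms <- sample(1:4, n, replace = TRUE)
rent <- 500 + 0.8 * size_sqft + 30 * rooms + rnorm(n, sd = 100)

toy <- data.frame(rent, size_sqft, rooms,
                  size_100sqft = size_sqft / 100)  # same size, in units of 100 sq ft

# Two fits that differ only in the unit used for size
fit_sqft <- lm(rent ~ size_sqft + rooms, data = toy)
fit_100  <- lm(rent ~ size_100sqft + rooms, data = toy)

coef_sqft <- unname(coef(fit_sqft)["size_sqft"])
coef_100  <- unname(coef(fit_100)["size_100sqft"])

print(c(per_sqft = coef_sqft, per_100sqft = coef_100))
print(c(r2_sqft = summary(fit_sqft)$r.squared,
        r2_100  = summary(fit_100)$r.squared))
```

The second coefficient is exactly 100 times the first while R-squared is identical, so ranking variables by raw coefficient size is sensitive to the choice of units.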

##5.2 Predicting rent prices

###5.2.1 Predicting rent prices based on linear regression models

In this section, the performance of the models is evaluated using three metrics.

MSE (Mean Squared Error): the average squared difference between the predicted and actual values. A lower MSE indicates better model performance.
RMSE (Root Mean Squared Error): the square root of the MSE, expressed in the same units as the rent. A lower RMSE indicates better model performance.
R-squared: the proportion of the variance in the dependent variable (target) that is predictable from the independent variables (features). For a linear model evaluated on its own training data it ranges from 0 to 1, with higher values indicating a better fit.
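The three metrics can be written as one small helper. This is a sketch, with regression_metrics as a hypothetical name; it computes R-squared as one minus the ratio of residual to total sum of squares, which is one common definition and can be negative when predictions are worse than simply predicting the mean.

```r
# Compute MSE, RMSE and R-squared for paired actual and predicted values
regression_metrics <- function(actual, predicted) {
  # The two vectors must refer to the same rows, in the same order
  stopifnot(length(actual) == length(predicted))
  residual <- actual - predicted
  mse <- mean(residual^2)
  c(MSE = mse,
    RMSE = sqrt(mse),
    R_squared = 1 - sum(residual^2) / sum((actual - mean(actual))^2))
}

# Toy example: two perfect predictions and one that is off by 2
metrics <- regression_metrics(actual = c(1, 2, 3), predicted = c(1, 2, 5))
print(metrics)
```

The length check at the top matters in practice: R silently recycles the shorter vector when lengths differ, which produces a number that looks like an error metric but pairs predictions with the wrong observations.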

In this section, a multiple linear regression model (model_multi) is created to predict monthly_rent from several variables: location, property type and furnishing status. Note that the model is fitted on the full dataset (c_data) rather than on the training set alone.

model_multi <- lm(monthly_rent ~ location + property_type + furnished, data = c_data)

predicted_multi <- predict(model_multi, newdata = trainset)

head(predicted_multi, 5)

##      2587     13328     16323      3283     11057 
## 1473.0705  844.8487 1382.0934 1915.6648 1468.4200

The output is the first five predicted values. Because newdata = trainset, these are predicted monthly rents for rows of the training set, not the test set.

mr_mse <- mean((c_data$monthly_rent - predicted_multi)^2)

## Warning in c_data$monthly_rent - predicted_multi:
## longer object length is not a multiple of shorter object length

mr_rmse <- sqrt(mr_mse)
print(paste("MSE (Multi-variable model):", mr_mse))

## [1] "MSE (Multi-variable model): 384146.23577515"

print(paste("RMSE (Multi-variable model):", mr_rmse))

## [1] "RMSE (Multi-variable model): 619.79531764539"

model_summary <- summary(model_multi)
mr_r_squared <- model_summary$r.squared
print(paste("R-squared:", mr_r_squared))

## [1] "R-squared: 0.538935361265564"

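In this section, the performance of the multiple linear regression model is evaluated.

The R-squared value of about 0.54 is moderate: the model explains roughly half of the variance in rent on the data it was fitted to. The MSE and RMSE are large (an RMSE of about 620 against a median rent of 1400). However, the warning above shows that these errors were computed from two vectors of different lengths, the rents of all of c_data and the predictions for the training set, so the values are paired with the wrong rows and the reported MSE and RMSE do not measure the model's actual prediction error.

In summary, the multiple linear regression model has some explanatory power, but its predictive accuracy has not been reliably measured here. The errors should be recomputed on matching rows of the test set before the model is compared with alternatives or refined further.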
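A held-out evaluation fits the model on the training rows only and compares its predictions with the rents of the same test rows. The sketch below shows the pattern on simulated data; the data frame toy, the seed and the effect sizes are hypothetical, and the resulting numbers say nothing about the real rental dataset.

```r
# Simulated listings (hypothetical data and seed)
set.seed(42)
n <- 500
toy <- data.frame(
  furnished = factor(sample(c("Fully Furnished", "Partially Furnished", "Not Furnished"),
                            n, replace = TRUE)),
  region = factor(sample(c("Kuala Lumpur", "Selangor"), n, replace = TRUE))
)
toy$monthly_rent <- 1200 +
  300 * (toy$furnished == "Fully Furnished") +
  150 * (toy$region == "Kuala Lumpur") +
  rnorm(n, sd = 200)

# 80/20 split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit on the training rows only, predict on the test rows only
fit  <- lm(monthly_rent ~ furnished + region, data = toy_train)
pred <- predict(fit, newdata = toy_test)

# Predictions and actual values must describe the same rows
stopifnot(length(pred) == nrow(toy_test))

test_mse  <- mean((toy_test$monthly_rent - pred)^2)
test_rmse <- sqrt(test_mse)
test_r2   <- 1 - sum((toy_test$monthly_rent - pred)^2) /
  sum((toy_test$monthly_rent - mean(toy_test$monthly_rent))^2)

print(c(MSE = test_mse, RMSE = test_rmse, R_squared = test_r2))
```

Computed this way, the three metrics all refer to the same unseen rows, so they can be compared directly across models evaluated on the same test set.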

###5.2.2 Predicting rent prices with decision trees

In this section, a decision tree model is trained on the training set, with monthly rent as the target variable and region, furnishing status and property type as the predictor variables.

library(rpart)
library(rpart.plot)

dt_model <- rpart(monthly_rent ~ region + property_type + furnished ,data = trainset)

The fitted tree (dt_model) is printed below and visualized with the rpart.plot function.

print(dt_model)
rpart.plot(dt_model)

## n= 13796 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 13796 3468947000 1442.987  
##    2) property_type=Apartment,Flat,Others,Studio 4538  572472000 1093.711  
##      4) furnished=Not Furnished,Partially Furnished 3698  311452500 1016.718 *
##      5) furnished=Fully Furnished 840  142592600 1432.662 *
##    3) property_type=Condominium,Duplex,Service Residence,Townhouse Condo 9258 2071508000 1614.191  
##      6) furnished=Not Furnished,Partially Furnished 4744  732888400 1436.066  
##       12) furnished=Not Furnished 967  131833400 1212.884 *
##       13) furnished=Partially Furnished 3777  540557200 1493.205  
##         26) region=Selangor 1751  263278700 1388.408 *
##         27) region=Kuala Lumpur 2026  241428400 1583.777 *
##      7) furnished=Fully Furnished 4514 1029907000 1801.393  
##       14) region=Selangor 2113  448835100 1653.433 *
##       15) region=Kuala Lumpur 2401  494103200 1931.606 *

The tree has five levels, counting the root. In the rpart.plot visualization, each node box shows the mean rent and the percentage of samples that fall under that node's condition, with darker colors indicating higher predicted rent.

Starting from the root node, the data is split on one attribute at a time. The root is split on property_type: Apartment, Flat, Others and Studio go to one subtree, while Condominium, Duplex, Service Residence and Townhouse Condo go to the other, which holds 67% of the training samples (9258 of 13796).

The tree ends in 7 leaf nodes, each of which gives the predicted rent for properties with a particular combination of conditions.
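To make the tree's logic concrete, its printed splits can be transcribed into a plain function. This is only a hand-written sketch of the rules shown in the output above, with predict_rent_from_tree as a hypothetical name; it assumes the inputs use exactly the category labels that appear in the printed tree and does not replace predict(dt_model, ...).

```r
# Hand transcription of the printed rpart splits (leaf means copied from the output)
predict_rent_from_tree <- function(property_type, furnished, region) {
  lower_rent_types <- c("Apartment", "Flat", "Others", "Studio")

  if (property_type %in% lower_rent_types) {
    # Node 2: split only on furnishing status
    if (furnished == "Fully Furnished") return(1432.662)  # node 5
    return(1016.718)                                      # node 4
  }

  # Node 3: Condominium, Duplex, Service Residence, Townhouse Condo
  if (furnished == "Fully Furnished") {
    if (region == "Kuala Lumpur") return(1931.606)        # node 15
    return(1653.433)                                      # node 14
  }
  if (furnished == "Not Furnished") return(1212.884)      # node 12
  if (region == "Kuala Lumpur") return(1583.777)          # node 27
  1388.408                                                # node 26
}

kl_condo <- predict_rent_from_tree("Condominium", "Fully Furnished", "Kuala Lumpur")
sel_flat <- predict_rent_from_tree("Flat", "Not Furnished", "Selangor")
print(c(kl_condo = kl_condo, sel_flat = sel_flat))
```

Every listing receives one of only seven possible values, the mean training rent of its leaf, which limits how closely a tree built on three categorical predictors can track individual rents.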

pre <- predict(dt_model, testset)
head(pre,5)

##        3       20       23       24       27 
## 1432.662 1931.606 1212.884 1212.884 1212.884

The output is the first five predicted values, which are the predicted monthly rents for the first five rows of the test set. Each value is the mean training rent of the leaf node that the row falls into.

dt_mse <- mean((c_data$monthly_rent - pre)^2)

## Warning in c_data$monthly_rent - pre: longer object length is not a multiple of shorter object length

dt_rmse <- sqrt(dt_mse)
print(paste("MSE (Decision tree):", dt_mse))

## [1] "MSE (Decision tree): 347340.060512765"

print(paste("RMSE (Decision tree):", dt_rmse))

## [1] "RMSE (Decision tree): 589.355631611988"

dt_r_squared <- cor(pre, testset$monthly_rent)^2
print(paste("R-squared:", dt_r_squared))

## [1] "R-squared: 0.393289838604463"

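The code above calculates the Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R-squared for the decision tree's predictions. As with the linear model, the MSE and RMSE are computed against c_data$monthly_rent, whose length differs from that of the test-set predictions (hence the warning), so they are not reliable estimates of the prediction error. The R-squared, which is computed against testset$monthly_rent, is about 0.39, indicating that the model needs further optimization or additional predictors to improve the fit.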

###5.2.3 Predicting rent prices with a random forest model

In this section, a random forest model is built to predict rents and the performance of the model is evaluated.

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf_model <- randomForest(monthly_rent ~ location + property_type + furnished, data = trainset)

prediction <- predict(rf_model, newdata = testset)
head(prediction, 5)

##        3       20       23       24       27 
## 1551.497 1862.144 1357.702 1357.702 1272.106

rf_mse <- mean((c_data$monthly_rent - prediction)^2)

## Warning in c_data$monthly_rent - prediction:
## longer object length is not a multiple of shorter object length

rf_rmse <- sqrt(rf_mse)
print(paste("MSE (Random forest):", rf_mse))

## [1] "MSE (Random forest): 334520.237356136"

print(paste("RMSE (Random forest):", rf_rmse))

## [1] "RMSE (Random forest): 578.37724484642"

rf_r_squared <- cor(prediction, testset$monthly_rent)^2
print(paste("R-squared:",rf_r_squared))

## [1] "R-squared: 0.42667261257275"

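For this model too, the MSE and RMSE are computed against c_data$monthly_rent rather than testset$monthly_rent and are therefore unreliable. The R-squared on the test set is about 0.43, slightly better than the decision tree but still leaving most of the variance unexplained. Further tuning of the model or the addition of other feature variables may be required to improve the accuracy of the predictions.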

#6.CONCLUSION

In this study, two questions were examined. The first question was to explore the factors that influence rent prices and the second question was to predict rent prices based on the selected dataset.

For the first question, we explored the factors influencing rental prices by building a linear regression model and performing correlation analysis on some of the variables. This shows that location is the largest factor affecting rent price.

For the second question, a multiple regression model, a decision tree and a random forest were built to predict the rent price.

The table below shows the performance metrics of the three models: Multiple Linear Regression, Decision Tree, and Random Forest. The metrics evaluated are Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R-squared.

# Create a data frame for the model results
model_results <- data.frame(Model = c("Multiple Regression", "Decision Tree", "Random Forest"),
                            MSE = c(mr_mse, dt_mse, rf_mse),
                            RMSE = c(mr_rmse, dt_rmse, rf_rmse),
                            R_squared = c(mr_r_squared, dt_r_squared,rf_r_squared))

# Print the model results table
print(model_results)

##                 Model      MSE     RMSE R_squared
## 1 Multiple Regression 384146.2 619.7953 0.5389354
## 2       Decision Tree 347340.1 589.3556 0.3932898
## 3       Random Forest 334520.2 578.3772 0.4266726

Based on the table, the Random Forest model has the lowest MSE and RMSE of the three models, while the Multiple Regression model has the highest R-squared. These figures should be read with caution. The MSE and RMSE values were computed from vectors of mismatched length, and the R-squared of the regression model is an in-sample value for a model fitted on the full dataset, whereas the R-squared values of the decision tree and random forest were measured on the test set. The three models are therefore not compared on an equal footing.

Of the two models evaluated on the test set, the random forest performs slightly better than the decision tree (R-squared of about 0.43 against 0.39), but both leave most of the variance in rent unexplained. Further refinement, additional features such as size, or alternative modeling approaches may be necessary to improve predictive capability.
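A useful yardstick for the RMSE values in the table is the error of a model that always predicts the mean rent. The root line of the printed decision tree reports the training-set deviance (the sum of squared deviations from the mean) and the number of training rows, so this baseline can be recovered by simple arithmetic. The sketch assumes the printed deviance, which is rounded, is accurate enough for the purpose.

```r
# Values copied from the root line of the printed decision tree
root_deviance <- 3468947000  # sum of squared deviations from the mean rent
root_n <- 13796              # number of training rows

# RMSE of a constant prediction equal to the training mean
baseline_mse <- root_deviance / root_n
baseline_rmse <- sqrt(baseline_mse)

print(round(baseline_rmse, 1))
```

The baseline comes out at roughly 501 on the training rows, which is lower than all three RMSE values in the table (578 to 620). Models with a positive R-squared would normally beat this baseline, so the comparison is consistent with the reported errors having been computed on misaligned vectors.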

In conclusion, building a predictive model with high accuracy is a challenging yet essential step towards a more efficient and transparent real estate market in Malaysia.

2023-05-26