1.INTRODUCTION

The real estate market in Kuala Lumpur and Selangor is highly competitive, and rental prices have risen significantly in recent years. This makes it harder for landlords to price their rentals. Underpricing results in lost income, while overpricing makes the property difficult to rent and drives away suitable tenants. It is therefore crucial to examine rental prices carefully and suggest a fair rental rate that reflects the property's value, so that both landlord and tenant benefit from the arrangement. In recent years, machine learning has become a popular prediction approach, driven by the growing trend towards Big Data. We can apply it to rental price prediction by developing models with various machine learning algorithms, such as regression models, decision trees or neural networks, using relevant features such as property location, size and amenities. Such models can help potential tenants and landlords arrive at a rental price per unit that is comparable to the market value.

2.OBJECTIVE

Based on the problem described above, the following objectives are set to clarify the research direction.

2.1 To use different machine learning techniques and models to identify the factors that most influence rental prices.

2.2 To use existing datasets to predict rents for properties with different characteristics.

3.DATA PRE-PROCESSING & EDA

data <- read.csv("RENTAL.csv", stringsAsFactors = FALSE)
data <- as.data.frame(data)

# Find missing values: treat empty strings as NA

data$prop_name[data$prop_name==""] <- NA
data$completion_year[data$completion_year==""] <- NA
data$monthly_rent[data$monthly_rent==""] <- NA
data$rooms[data$rooms==""] <- NA
data$parking[data$parking==""] <- NA
data$bathroom[data$bathroom==""] <- NA
data$furnished[data$furnished==""] <- NA
data$facilities[data$facilities==""] <- NA
data$additional_facilities[data$additional_facilities==""] <- NA


# Drop the ads_id column
data <- subset(data, select = -c(ads_id))
data.frame("variable"=c(colnames(data)), 
           "missing values count"=sapply(data, function(x) sum(is.na(x))),
           row.names=NULL)
##                 variable missing.values.count
## 1              prop_name                  948
## 2        completion_year                 9185
## 3           monthly_rent                    2
## 4               location                    0
## 5          property_type                    0
## 6                  rooms                    6
## 7                parking                 5702
## 8               bathroom                    6
## 9                   size                    0
## 10             furnished                    5
## 11            facilities                 2209
## 12 additional_facilities                 5948
## 13                region                    0

This code begins the pre-processing. It reads the data file “RENTAL.csv” and stores it in the variable data. The as.data.frame function ensures that the data is a data frame (read.csv already returns one, so this step is only a safeguard). Empty strings (“”) are replaced with missing values (NA) so that missing entries can be counted, and the ads_id column is dropped. A new data frame is then created containing each variable name and its number of missing values: sapply applies a function to each column of the data frame that counts the missing values in that column.
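The blank-to-NA replacement is written out one column at a time. As a minimal sketch, assuming a small made-up data frame (toy) in place of RENTAL.csv and a hypothetical helper named blank_to_na, the same step can be applied to every character column at once:

```r
# Toy data standing in for RENTAL.csv; the values are invented for illustration.
toy <- data.frame(
  prop_name = c("Block A", "", "Block C"),
  furnished = c("", "Fully Furnished", ""),
  rooms     = c(2, 3, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn empty strings into NA, leaving existing NAs alone.
blank_to_na <- function(x) {
  x[!is.na(x) & x == ""] <- NA
  x
}

# Apply the helper to every character column in one step.
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], blank_to_na)

# Count missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

An alternative worth considering is the na.strings argument of read.csv, for example na.strings = c("", "NA"), which marks blank fields as missing when the file is loaded.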

# Replace NA in selected columns with the mode
# Helper that returns the most frequent value of a vector
Mode <- function(x){
  ux <- unique(x)
  ux[which.max(tabulate(match(x,ux)))]
}

mode_monthly_rent <- Mode(data$monthly_rent)
data$monthly_rent[is.na(data$monthly_rent)] <- mode_monthly_rent

mode_rooms <- Mode(data$rooms)
data$rooms[is.na(data$rooms)] <- mode_rooms

mode_bathroom <- Mode(data$bathroom)
data$bathroom[is.na(data$bathroom)] <- mode_bathroom

mode_furnished <- Mode(data$furnished)
data$furnished[is.na(data$furnished)] <- mode_furnished

data.frame("variable"=c(colnames(data)), 
           "missing values count"=sapply(data, function(x) sum(is.na(x))),
           row.names=NULL)
##                 variable missing.values.count
## 1              prop_name                  948
## 2        completion_year                 9185
## 3           monthly_rent                    0
## 4               location                    0
## 5          property_type                    0
## 6                  rooms                    0
## 7                parking                 5702
## 8               bathroom                    0
## 9                   size                    0
## 10             furnished                    0
## 11            facilities                 2209
## 12 additional_facilities                 5948
## 13                region                    0
# CLEAN DATA
summary(data)
##   prop_name         completion_year monthly_rent         location        
##  Length:19991       Min.   :1977    Length:19991       Length:19991      
##  Class :character   1st Qu.:2012    Class :character   Class :character  
##  Mode  :character   Median :2017    Mode  :character   Mode  :character  
##                     Mean   :2015                                         
##                     3rd Qu.:2020                                         
##                     Max.   :2025                                         
##                     NA's   :9185                                         
##  property_type          rooms           parking          bathroom    
##  Length:19991       Min.   : 1.000   Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:2.000  
##  Mode  :character   Median : 3.000   Median : 1.000   Median :2.000  
##                     Mean   : 2.681   Mean   : 1.417   Mean   :1.892  
##                     3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.:2.000  
##                     Max.   :10.000   Max.   :10.000   Max.   :8.000  
##                                      NA's   :5702                    
##       size           furnished          facilities        additional_facilities
##  Min.   :       1   Length:19991       Length:19991       Length:19991         
##  1st Qu.:     750   Class :character   Class :character   Class :character     
##  Median :     886   Mode  :character   Mode  :character   Mode  :character     
##  Mean   :    5922                                                              
##  3rd Qu.:    1044                                                              
##  Max.   :99999999                                                              
##                                                                                
##     region         
##  Length:19991      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

A function Mode is defined that returns the mode (the most frequent value) of a vector. For monthly_rent, rooms, bathroom and furnished, the mode is computed and used to replace the missing values of that variable. The missing-value table is then rebuilt in the same way, using sapply to count the NA entries in each column of the data frame; the four imputed columns now show zero missing values. Finally, the summary function gives a statistical summary of the data to show the distribution of each variable. At this stage monthly_rent is still stored as character, and size has a maximum of 99999999, which points to extreme values in that column.
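One caveat of the Mode function as written is that NA is counted like any other value, so a column in which NA is the most frequent entry would be imputed with NA. The sketch below shows a variant that drops NA before counting. The names mode_value and impute_mode are hypothetical and not part of the original code, and the rooms vector is made up.

```r
# Mode that ignores missing values (hypothetical variant of Mode).
mode_value <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace the missing entries of a vector with its mode.
impute_mode <- function(x) {
  x[is.na(x)] <- mode_value(x)
  x
}

# Toy vector: 3 is the most frequent non-missing value.
rooms <- c(3, 2, NA, 3, NA, 1)
rooms_imputed <- impute_mode(rooms)
print(rooms_imputed)

# Edge case: NA is the most frequent entry, yet the mode is still 1.
print(mode_value(c(NA, NA, 1)))
```

For the four columns imputed in this report the difference is unlikely to matter, since each has only a handful of missing values.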

# DROP NULL IN PROP_NAME
c_data <- data[complete.cases(data$prop_name), ]

data.frame("variable"=c(colnames(c_data)), 
           "missing values count"=sapply(c_data, function(x) sum(is.na(x))),
           row.names=NULL)
##                 variable missing.values.count
## 1              prop_name                    0
## 2        completion_year                 8237
## 3           monthly_rent                    0
## 4               location                    0
## 5          property_type                    0
## 6                  rooms                    0
## 7                parking                 5392
## 8               bathroom                    0
## 9                   size                    0
## 10             furnished                    0
## 11            facilities                 2089
## 12 additional_facilities                 5629
## 13                region                    0
# Monthly rent: strip non-numeric characters and convert to numeric
c_data$monthly_rent <- as.numeric(gsub("[^0-9\\.]", "", c_data$monthly_rent))
summary(c_data$monthly_rent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      70    1100    1400    2202    1800 2400000
remove_outliers <- function(c_data, colname) {
  # calculate the first and third quartiles
  Q1 <- quantile(c_data[[colname]], 0.25, na.rm = TRUE)
  Q3 <- quantile(c_data[[colname]], 0.75, na.rm = TRUE)
  
  # calculate the IQR
  IQR <- Q3 - Q1
  
  # calculate the lower and upper bounds for outliers
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # remove outliers
  c_data[c_data[[colname]] >= lower_bound & c_data[[colname]] <= upper_bound, ]
}

c_data <- remove_outliers(c_data, "monthly_rent")
summary(c_data)
##   prop_name         completion_year  monthly_rent    location        
##  Length:17902       Min.   :1977    Min.   :  70   Length:17902      
##  Class :character   1st Qu.:2013    1st Qu.:1100   Class :character  
##  Mode  :character   Median :2017    Median :1400   Mode  :character  
##                     Mean   :2015    Mean   :1446                     
##                     3rd Qu.:2020    3rd Qu.:1750                     
##                     Max.   :2025    Max.   :2850                     
##                     NA's   :7931                                     
##  property_type          rooms           parking         bathroom    
##  Length:17902       Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  Class :character   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.:2.000  
##  Mode  :character   Median : 3.000   Median : 1.00   Median :2.000  
##                     Mean   : 2.657   Mean   : 1.39   Mean   :1.848  
##                     3rd Qu.: 3.000   3rd Qu.: 2.00   3rd Qu.:2.000  
##                     Max.   :10.000   Max.   :10.00   Max.   :8.000  
##                                      NA's   :5113                   
##       size           furnished          facilities        additional_facilities
##  Min.   :       1   Length:17902       Length:17902       Length:17902         
##  1st Qu.:     730   Class :character   Class :character   Class :character     
##  Median :     866   Mode  :character   Mode  :character   Mode  :character     
##  Mean   :    6472                                                              
##  3rd Qu.:    1010                                                              
##  Max.   :99999999                                                              
##                                                                                
##     region         
##  Length:17902      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# rooms
#c_data$rooms <- as.numeric(c_data$rooms)
#c_data$rooms <- parse_number(c_data$rooms)
c_data <- remove_outliers(c_data, "rooms")
summary(c_data)
##   prop_name         completion_year  monthly_rent    location        
##  Length:17833       Min.   :1977    Min.   :  70   Length:17833      
##  Class :character   1st Qu.:2013    1st Qu.:1100   Class :character  
##  Mode  :character   Median :2017    Median :1400   Mode  :character  
##                     Mean   :2015    Mean   :1447                     
##                     3rd Qu.:2020    3rd Qu.:1750                     
##                     Max.   :2025    Max.   :2850                     
##                     NA's   :7903                                     
##  property_type          rooms          parking          bathroom    
##  Length:17833       Min.   :1.000   Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000   1st Qu.: 1.000   1st Qu.:2.000  
##  Mode  :character   Median :3.000   Median : 1.000   Median :2.000  
##                     Mean   :2.647   Mean   : 1.389   Mean   :1.845  
##                     3rd Qu.:3.000   3rd Qu.: 2.000   3rd Qu.:2.000  
##                     Max.   :4.000   Max.   :10.000   Max.   :8.000  
##                                     NA's   :5085                    
##       size           furnished          facilities        additional_facilities
##  Min.   :       1   Length:17833       Length:17833       Length:17833         
##  1st Qu.:     728   Class :character   Class :character   Class :character     
##  Median :     865   Mode  :character   Mode  :character   Mode  :character     
##  Mean   :    6492                                                              
##  3rd Qu.:    1010                                                              
##  Max.   :99999999                                                              
##                                                                                
##     region         
##  Length:17833      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Delete rows with a missing prop_name: the complete.cases function identifies the rows in which prop_name is not missing, and these rows are stored in a new data frame, c_data.

Process the monthly rent variable: gsub removes every character other than digits and the decimal point from the monthly_rent column, and the result is converted to a numeric type. The summary function then shows a statistical summary of the processed column; the maximum of 2400000 against a median of 1400 indicates extreme values.

Removing outliers: a function called remove_outliers is defined to detect and remove outliers. Inside the function, the lower and upper bounds are calculated from the quartiles (Q1 and Q3) and the interquartile range (IQR) as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. Rows whose values lie between these bounds are kept in the data frame. Applied to monthly_rent, this leaves 17902 rows with a maximum rent of 2850.

Handling the room count variable: the same function is applied to rooms, which lowers the maximum from 10 to 4 rooms and leaves 17833 rows.
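To make the IQR rule concrete, here is a small worked sketch on an invented rent vector (the values are illustrative and not taken from the dataset). It follows the same steps as remove_outliers, using quantile with its default settings.

```r
# Invented rents with one extreme value.
rent <- c(900, 1100, 1200, 1400, 1500, 1800, 2400000)

# Quartiles and interquartile range.
q1 <- quantile(rent, 0.25, names = FALSE)   # 1150
q3 <- quantile(rent, 0.75, names = FALSE)   # 1650
iqr <- q3 - q1                              # 500

# Bounds of the accepted range.
lower <- q1 - 1.5 * iqr                     # 400
upper <- q3 + 1.5 * iqr                     # 2400

# Keep only the values inside the bounds.
kept <- rent[rent >= lower & rent <= upper]
print(kept)
```

Only the extreme value of 2400000 falls outside the range from 400 to 2400, so the other six values are kept.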

library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forcats)
library(tidyr)

# Plot histogram of number of rooms
ggplot(c_data, aes(x = rooms)) +
  geom_histogram(bins = 30, color = "black", fill = "lightblue") +
  labs(x = "Number of Rooms", y = "Count", title = "Number of Rooms")

This histogram visualizes the distribution of the variable ‘rooms’. It can be seen that three-bedroom units are the most common and four-bedroom units the least common.

# region
c_data %>%
  group_by(region) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~region, values = ~countt, type = 'pie') %>%
  add_pie(hole = 0.4) %>%
  layout(title = "Distribution of Properties by Region")

The pie chart shows the distribution of properties by region. The two regions account for fairly even shares: Selangor is slightly higher at 52.1% (9297), and Kuala Lumpur accounts for 47.9% (8536).
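The percentages can be checked directly from the two counts quoted in the text. The short sketch below recomputes the shares; the vector name region_counts is an assumption made for illustration.

```r
# Counts per region as quoted in the text.
region_counts <- c("Selangor" = 9297, "Kuala Lumpur" = 8536)

# Total number of listings and percentage share of each region.
total <- sum(region_counts)
shares <- round(100 * prop.table(region_counts), 1)

print(total)
print(shares)
```

The total of 17833 matches the number of rows left in c_data after outlier removal.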

#prop_type
c_data %>%
  group_by(property_type) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~property_type, values = ~countt, type = 'pie') %>%
  layout(title = "Distribution of Properties Type")

The pie chart depicts the distribution of property types in the dataset, condominiums being the most prevalent, followed by apartments and service residences. Other property types account for smaller proportions.

# Status
c_data %>%
  group_by(furnished) %>%
  summarise(countt = n()) %>%
  plot_ly(labels = ~furnished, values = ~countt, type = 'pie') %>%
  layout(title = "Furnished vs Unfurnished Rental Properties")

The pie chart depicts the furnishing status of the properties. Only a small percentage of properties are not furnished at all, and the percentages of partially and fully furnished properties are about the same.

#location
# count the number of rental properties in each location and region
location_counts <- c_data %>%
  group_by(location, region) %>%
  summarise(n = n(),.groups = "drop") %>%
  ungroup() %>%
  arrange(region, desc(n)) %>%
  group_by(region) %>%
  top_n(5, n)

# plot the bar chart
ggplot(location_counts, aes(x = fct_reorder(location, n), y = n, fill = region)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "Location", y = "Count", title = "Top 5 Rental Properties by Location for Each Region") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_fill_discrete(name = "Region")

# Open question: how do facilities affect the rental price, and can the effect be quantified?
# Facilities
# count the top 5 additional facilities and facilities

The bar chart shows the top 5 locations by number of rental properties in each region, with Kuala Lumpur in red and Selangor in blue. The counts are not far apart: among these locations, Kuala Lumpur-Bukit Jalil has the lowest number of rental properties and Kuala Lumpur-Cheras the highest.

data %>%
  separate_rows(facilities, sep = ", ") %>%
  count(facilities, sort = TRUE) %>%
  slice_head(n = 5) %>%
  mutate(facilities = fct_reorder(facilities, n)) %>%
  ggplot(aes(x = facilities, y = n, fill = n)) +
  geom_col() +
  scale_fill_gradient(low = "white", high = "purple") +
  coord_flip() +
  labs(x = "Facilities", y = "Count", title = "Top 5 Facilities") 

Among the top five facilities, Security and Parking have about the same counts, with Security the highest. Gymnasium has the lowest count, and Playground and Swimming Pool fall in between.

c_data %>%
  separate_rows(additional_facilities, sep = ", ") %>%
  filter(!is.na(additional_facilities)) %>%
  count(additional_facilities, sort = TRUE) %>%
  slice_head(n = 5) %>%
  mutate(additional_facilities = fct_reorder(additional_facilities, n)) %>%
  ggplot(aes(x = additional_facilities, y = n, fill = n)) +
  geom_col() +
  scale_fill_gradient(low = "white", high = "purple") +
  coord_flip() +
  labs(x = "Additional Facilities", y = "Count", title = "Top 5 Additional Facilities")

This bar chart shows the top 5 additional facilities. Cooking Allowed has the highest count and Air-Cond is in second place; Near KTM/LRT and Washing Machine are almost the same, and Internet has the lowest count.
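The facility columns store several items in one comma-separated string, which is why the rows are split before counting. As a base R sketch of the same idea, assuming a small invented vector in place of the real column:

```r
# Invented facility strings; NA marks a listing with no facilities recorded.
facilities <- c(
  "Security, Parking, Gymnasium",
  "Security, Playground",
  NA,
  "Parking, Security"
)

# Drop missing entries, split each string on commas and flatten the result.
tokens <- unlist(strsplit(facilities[!is.na(facilities)], ",\\s*"))

# Count each facility and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)
print(head(counts, 2))
```

In this toy example Security appears three times and Parking twice. The tidyr call separate_rows used in the report performs the splitting step on a whole data frame.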

#monthly_rent
c_data  %>%
  ggplot(aes(x = monthly_rent)) + 
  geom_histogram(bins = 30, color = "black", fill = "steelblue") +
  labs(x = "Monthly Rent", y = "Count")+ 
  ggtitle("Histogram of Monthly Rent")

This is a histogram of monthly rents. Most rents fall between 1000 and 2000, although a few bins between 1000 and 1500 have very low counts. Rents between 2000 and 3000 are less common, and almost no listings fall between 0 and 500.

###is this histogram correct???
#rent vs region
c_data  %>%
  ggplot(aes(x = monthly_rent, fill = region)) + 
  geom_histogram(bins = 30, color = "black") +
  facet_grid(rows = vars(region), scales = "free_y") +
  labs(x = "Monthly Rent", y = "Count")+ 
  ggtitle("Comparison of Monthly Rent by Region") 

####TODO
#rent vs prop_type
# Define ranges of monthly rent 
rent_ranges <- c(0, 1000, 2000, 3000)

# Create a new column that labels each rental with its rent range
c_data$rent_range <- cut(c_data$monthly_rent, breaks = rent_ranges,
                         labels = c("0-1000", "1001-2000", "2001-3000"))

# Group the data by rent range and apartment type, and calculate the count of apartments in each group
counts_grouped <- c_data %>%
  group_by(rent_range, property_type) %>%
  summarise(count = n(),.groups = "drop")

The faceted histogram of monthly rent by region shows that most rents in Kuala Lumpur fall in the 1200 to 2000 range, while most rents in Selangor fall in the 700 to 1300 range. Overall, Kuala Lumpur has more high-rent properties than Selangor.

# Plot the grouped bar chart
ggplot(counts_grouped, aes(x = rent_range, y = count, fill = property_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Monthly Rent Range", y = "Count") +
  ggtitle("Comparison of Apartment Types by Monthly Rent Range") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  #scale_fill_manual(values = my_colors) +
  #theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability

The bar chart compares the number of properties of each type across the rent ranges. In the 0 to 1000 range, flats are the most numerous, at around 2400. The 1001 to 2000 range has upwards of 5000 condominiums, followed by service residences at around 3400 and apartments at around 2500. The 2001 to 3000 range contains relatively few properties of any type other than condominiums and service residences.
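The rent ranges come from cut, whose intervals are open on the left and closed on the right by default. The sketch below illustrates this with a few invented rents and the same breaks and labels as in the report.

```r
# Same breaks and labels as in the report.
rent_ranges <- c(0, 1000, 2000, 3000)
range_labels <- c("0-1000", "1001-2000", "2001-3000")

# Invented rents, including boundary cases and values outside the breaks.
rents <- c(70, 1000, 1001, 2850, 0, 3500)

# By default each interval is (lower, upper], so 1000 lands in "0-1000".
ranges <- cut(rents, breaks = rent_ranges, labels = range_labels)
print(data.frame(rent = rents, range = ranges))
```

A rent of exactly 0 or anything above 3000 becomes NA under these breaks. This does not affect the cleaned data here, where rents run from 70 to 2850.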

#completion year
c_data  %>%
  ggplot(aes(x = completion_year)) + 
  geom_histogram(bins = 30, color = "black", fill = "steelblue") +
  labs(x = "Completion Year", y = "Count")+ 
  ggtitle("Histogram of Completion Year")
## Warning: Removed 7903 rows containing non-finite values (`stat_bin()`).

####TODO
#completion year vs average rental price
# Group data by completion year and calculate the average rental price for each year
avg_rental_prices <- c_data %>%
  group_by(completion_year) %>%
  summarise(avg_rent = mean(monthly_rent))

The histogram shows the distribution of rental listings by the year in which the building was completed. Completion years in the dataset are concentrated between 2015 and 2022, while relatively few listings come from buildings completed in other years. The overall trend is upward, which suggests that newer buildings are more heavily represented in the rental market. The 7903 rows with a missing completion year are excluded from the plot.

# Plot line chart
ggplot(avg_rental_prices, aes(x = completion_year, y = avg_rent)) +
  geom_line() +
  labs(x = "Completion Year", y = "Average Rental Price") +
  ggtitle("Completion Year vs. Average Rental Price")
## Warning: Removed 1 row containing missing values (`geom_line()`).

The line graph shows the relationship between a building's completion year and the average rental price of the properties completed in that year. Average rental prices show an overall upward trend as the completion year increases, although there are large fluctuations in a few years.
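The warning about one removed row arises because rows with a missing completion year form their own NA group. As a minimal base R sketch on invented values, tapply computes the same kind of per-year average and leaves out the missing year:

```r
# Invented completion years and rents; one year is missing.
completion_year <- c(2015, 2015, 2020, NA, 2020)
monthly_rent    <- c(1000, 1200, 1800, 1500, 2000)

# Average rent per completion year; observations with an NA year are dropped.
avg_rent <- tapply(monthly_rent, completion_year, mean)
print(avg_rent)
```

Here the two groups have averages of 1100 (2015) and 1900 (2020), and the listing with no completion year does not contribute to either.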

# Status VS rental

ggplot(c_data, aes(x = furnished, y = monthly_rent, fill = furnished)) +
  geom_boxplot() +
  labs(x = "Furnished Status", y = "Monthly Rent") +
  ggtitle("Comparison of Monthly Rent by Furnished Status")

The box plot shows the distribution of monthly rents for each furnishing status. Fully furnished properties have higher rents overall than partially furnished and unfurnished ones, with a median of around 1700, and their first quartile lies above the third quartile of unfurnished rentals. Partially furnished properties in turn have higher rents than unfurnished ones. In summary, furnishing is strongly associated with rent: the more fully furnished a property is, the higher its rent tends to be.
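The centre line of each box is the group median, so a claim read off the plot can be checked numerically. The sketch below does this with aggregate on a small invented data frame; the numbers are illustrative and not taken from the dataset.

```r
# Invented rents for two furnishing statuses.
toy <- data.frame(
  furnished    = rep(c("Fully Furnished", "Not Furnished"), each = 4),
  monthly_rent = c(1500, 1700, 1800, 2000, 900, 1000, 1100, 1300)
)

# Median rent per furnishing status.
medians <- aggregate(monthly_rent ~ furnished, data = toy, FUN = median)
print(medians)
```

Running the same call on c_data would give the exact medians behind the box plot.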

# room vs rental
ggplot(c_data, aes(x = as.factor(rooms), y = monthly_rent, fill = as.factor(rooms))) +
  geom_boxplot() +
  labs(x = "Number of Rooms", y = "Monthly Rental") +
  ggtitle("Rental Prices by Number of Rooms")

The box plot shows that 3-room units have a lower typical (median) rent and a narrower spread of rents than 2-room units. The 4-room units have the highest rent distribution and the highest typical rent, while 1-room units have the narrowest spread of rents.

4.MODELING

4.1 Factors affecting rent prices

4.1.1 Factors affecting rent prices based on Linear regression

# Creating a linear regression model
lm_model <- lm(monthly_rent ~ rooms + bathroom + furnished, data = c_data)

# Extract the coefficient estimates (dropping the intercept)
coefficients <- coef(lm_model)[-1]
labels <- names(coefficients)

Creating a linear regression model: A linear regression model is created using the lm() function. The formula for the model is monthly_rent ~ rooms + bathroom + furnished, where monthly_rent is the dependent variable and rooms, bathroom and furnished are the independent variables. The data source is the previously cleaned c_data data frame.

Extracting coefficient estimates: coef() function was used to extract coefficient estimates for the linear regression model lm_model. The intercept term is removed by [-1] and the remaining coefficient estimates of the independent variables are stored in the coefficients variable.

Create a data frame for plotting: store the coefficient estimates and the corresponding independent variable names in a data frame. The data.frame() function creates a data frame plot_data, where the Factors column stores the names of the independent variables (labels variable) and the Coefficient_Estimate column stores the corresponding coefficient estimates (coefficients variable).
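Because furnished is a text column, lm() converts it into dummy variables, so the extracted coefficient names are not simply the three variable names. The sketch below illustrates this on invented data; the level names are assumptions about how the furnishing statuses are spelled.

```r
# Invented data with a numeric and a categorical predictor.
toy <- data.frame(
  rooms        = c(1, 2, 3, 2, 3, 4, 1, 3, 2),
  furnished    = rep(c("Fully Furnished", "Partially Furnished", "Not Furnished"), times = 3),
  monthly_rent = c(1500, 1400, 1300, 1800, 1700, 1600, 1450, 1650, 1100)
)

fit <- lm(monthly_rent ~ rooms + furnished, data = toy)

# Drop the intercept, as in the report, and inspect the coefficient names.
coefs <- coef(fit)[-1]
print(names(coefs))
```

The first level in alphabetical order (here "Fully Furnished") is the reference category, and each furnished coefficient is the rent difference relative to it.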

# Create a data frame for plotting
plot_data <- data.frame(Factors = labels, Coefficient_Estimate = coefficients)

ggplot(data = plot_data, aes(x = Factors, y = Coefficient_Estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Factors", y = "Coefficient Estimate", title = "Impact of Factors on Monthly Rent") +
  theme_minimal()

The data frame for plotting: plot_data contains the names of the independent variables (Factors) and the corresponding coefficient estimates (Coefficient_Estimate).

Plotting the coefficient estimates: the ggplot() function creates a plot object with plot_data as the data. Inside aes(), the x-axis is set to Factors and the y-axis to Coefficient_Estimate. The geom_bar() function then draws a bar chart, where stat = “identity” means that the height of each bar is taken directly from the data and fill = “steelblue” sets the fill colour of the bars.

Add labels and titles: use the labs() function to set the x-axis label to “Factors”, the y-axis label to “Coefficient Estimate” and the title to “Impact of Factors on Monthly Rent”.

Set the plot style: use the theme_minimal() function to set the theme style of the plot.

ggplot(data = plot_data, aes(x = "Factors", y = Coefficient_Estimate, fill = Factors)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(fill = "", title = "Impact of Factors on Monthly Rent") +
  theme_minimal() +
  theme(legend.position = "right")

This code plots the coefficient estimates of the linear regression model using the ggplot2 package, showing the estimated effect of each independent variable on the dependent variable. After data cleaning, the linear regression model is created and its coefficient estimates are extracted. A data frame for plotting is then built, and a bar plot and a polar plot of the coefficient estimates are drawn with the ggplot() function. Because the predictors are measured on different scales and furnished enters the model as dummy variables, the bars show the change in rent per unit of each predictor and are not directly comparable as measures of importance.

4.2 Predicting rent prices

4.2.1 Predicting rent prices based on linear regression models

model_uni <- lm(monthly_rent ~ size, data = c_data)

summary(model_uni)
## 
## Call:
## lm(formula = monthly_rent ~ size, data = c_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1376.62  -346.62   -46.61   303.39  1403.39 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.447e+03  3.829e+00 377.838  < 2e-16 ***
## size        -1.328e-05  5.113e-06  -2.597  0.00942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 511.3 on 17831 degrees of freedom
## Multiple R-squared:  0.000378,   Adjusted R-squared:  0.000322 
## F-statistic: 6.743 on 1 and 17831 DF,  p-value: 0.009418

In this section, a simple linear regression model is created using the lm() function. The model predicts the monthly_rent based on the size variable from the c_data data frame.

The summary of this model provides information on the regression coefficients, statistical significance, model fit and residuals. The coefficient estimate for the size variable is statistically significant (p = 0.00942), but the very low R-squared value (0.000378) shows that size explains almost none of the variance in monthly rent. The coefficient is also negative and close to zero, which is likely driven by the extreme size values left in the data (the maximum is 99999999).

set.seed(123)

num_predictions <- 1000
num_iterations <- ceiling(nrow(c_data) / num_predictions)
predicted_uni <- vector("numeric", length = nrow(c_data))

for (i in 1:num_iterations) {
  start_index <- (i - 1) * num_predictions + 1
  end_index <- min(i * num_predictions, nrow(c_data))
  
  size_subset <- sample(400:1800, end_index - start_index + 1, replace = TRUE)
  
  predicted_uni[start_index:end_index] <- predict(model_uni, newdata = data.frame(size = size_subset))
}

head(predicted_uni, 5)
## [1] 1446.615 1446.614 1446.618 1446.614 1446.618

In this section, the one-variable linear regression model (model_uni) is used to predict monthly_rent from the size variable. For each row of c_data, a random integer size between 400 and 1800 is drawn, in batches of 1000, and the corresponding monthly_rent is predicted using the predict() function. The predicted values are stored in the predicted_uni vector, and the head() function displays the first five. All five are approximately 1446.6, which reflects how little the prediction changes with size.
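The batching loop is not required for correctness, since predict() accepts all new rows in a single call. The sketch below checks this on simulated data; the data-generating numbers (intercept 500, slope 0.8) are arbitrary assumptions chosen for illustration.

```r
set.seed(42)

# Simulated data with a known linear relationship.
toy <- data.frame(size = runif(50, 400, 1800))
toy$monthly_rent <- 500 + 0.8 * toy$size + rnorm(50, sd = 50)

fit <- lm(monthly_rent ~ size, data = toy)
new_sizes <- data.frame(size = c(500, 1000, 1500))

# One call over all rows...
pred_all <- predict(fit, newdata = new_sizes)

# ...gives the same values as predicting in two batches.
pred_batched <- c(
  predict(fit, newdata = new_sizes[1:2, , drop = FALSE]),
  predict(fit, newdata = new_sizes[3, , drop = FALSE])
)

print(all.equal(unname(pred_all), unname(pred_batched)))
```

Batching is mainly useful when the new data is too large to hold in memory at once, which is not the case for roughly 18000 rows.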

mse_uni <- mean((c_data$monthly_rent - predicted_uni)^2)
rmse_uni <- sqrt(mse_uni)
print(paste("MSE (One-variable model):", mse_uni))
## [1] "MSE (One-variable model): 261461.741488379"
print(paste("RMSE (One-variable model):", rmse_uni))
## [1] "RMSE (One-variable model): 511.333297848261"

In this section, the performance of the one-variable linear regression model is evaluated. The mean squared error (MSE) and root mean squared error (RMSE) quantify the model’s prediction accuracy: the MSE is the average squared difference between the actual monthly_rent values (c_data$monthly_rent) and the predicted values (predicted_uni), and the RMSE is its square root. The lower the MSE and RMSE, the better the model’s performance in predicting monthly rent. Because predicted_uni was generated from randomly sampled sizes and not from the actual size of each property, these figures are a rough indication and not a standard accuracy measure.

In this model, the MSE is about 261462 and the RMSE about 511.33, indicating a large error between the predicted and observed values relative to the median rent of 1400. The RMSE is nearly identical to the residual standard error of 511.3, which is expected because the predictions barely vary around 1446.6.

Based on these results, the predictive power of this one-variable regression model is weak, as the MSE and RMSE are relatively high. Further adjustments to the model, or the addition of other feature variables, may be required to improve the accuracy of the predictions.
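A more conventional way to measure prediction error is to hold out part of the data, fit the model on the rest, and compare predictions with the actual rents of the held-out rows. The sketch below shows this on simulated data. The 80/20 split and the data-generating numbers are assumptions for illustration, not the method used in this report.

```r
set.seed(1)

# Simulated listings: rent depends linearly on size plus noise.
n <- 200
toy <- data.frame(size = runif(n, 400, 1800))
toy$monthly_rent <- 500 + 0.8 * toy$size + rnorm(n, sd = 100)

# Random 80/20 split into training and test rows.
train_idx <- sample(n, size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only.
fit <- lm(monthly_rent ~ size, data = train)

# Predict for the actual sizes in the test set and compare with actual rents.
pred <- predict(fit, newdata = test)
mse  <- mean((test$monthly_rent - pred)^2)
rmse <- sqrt(mse)
print(c(MSE = mse, RMSE = rmse))
```

With this setup each prediction is paired with the rent of the same property, so the RMSE reflects how far off the model is on unseen listings.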

model_multi <- lm(monthly_rent ~ size + rooms + bathroom, data = c_data)

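# Note: this summarises model_uni again; call summary(model_multi) to inspect the multi-variable fit
summary(model_uni)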
## 
## Call:
## lm(formula = monthly_rent ~ size, data = c_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1376.62  -346.62   -46.61   303.39  1403.39 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.447e+03  3.829e+00 377.838  < 2e-16 ***
## size        -1.328e-05  5.113e-06  -2.597  0.00942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 511.3 on 17831 degrees of freedom
## Multiple R-squared:  0.000378,   Adjusted R-squared:  0.000322 
## F-statistic: 6.743 on 1 and 17831 DF,  p-value: 0.009418
num_predictions <- 1000
num_iterations <- ceiling(nrow(c_data) / num_predictions)
predicted_multi <- vector("numeric", length = nrow(c_data))

for (i in 1:num_iterations) {
  start_index <- (i - 1) * num_predictions + 1
  end_index <- min(i * num_predictions, nrow(c_data))
  
  # Named new_data so that the object does not mask base::subset
  new_data <- data.frame(
    size = runif(end_index - start_index + 1, 400, 1800),
    rooms = runif(end_index - start_index + 1, min(c_data$rooms), max(c_data$rooms)),
    bathroom = runif(end_index - start_index + 1, min(c_data$bathroom), max(c_data$bathroom))
  )
  
  predicted_multi[start_index:end_index] <- predict(model_multi, newdata = new_data)
}

head(predicted_multi, 5)
## [1] 1254.759 2801.370 2877.954 1174.069 2367.582

In this section, a multiple linear regression model (model_multi) is created to predict monthly_rent from three variables: size, rooms, and bathroom.

Similar to the previous section, random values are generated for the predictors: size is drawn uniformly between 400 and 1800, while rooms and bathroom are drawn uniformly between their observed minimum and maximum in c_data. The corresponding monthly_rent values are predicted using the predict() function and stored in the predicted_multi vector.
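The prediction step described above can also be written without batching, because predict() accepts a whole data frame at once. The sketch below runs on synthetic data standing in for c_data: the data frame toy_data, the invented rent formula, and the seed are all assumptions. It draws rooms and bathroom with sample() so that they stay whole numbers.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Synthetic stand-in for c_data; the relationship below is invented.
n <- 200
toy_data <- data.frame(
  size     = runif(n, 400, 1800),
  rooms    = sample(1:4, n, replace = TRUE),
  bathroom = sample(1:3, n, replace = TRUE)
)
toy_data$monthly_rent <- 500 + 0.8 * toy_data$size + 100 * toy_data$rooms +
  50 * toy_data$bathroom + rnorm(n, sd = 150)

toy_model <- lm(monthly_rent ~ size + rooms + bathroom, data = toy_data)

# Hypothetical new listings, generated in a single call per column.
toy_new <- data.frame(
  size     = runif(1000, 400, 1800),
  rooms    = sample(1:4, 1000, replace = TRUE),
  bathroom = sample(1:3, 1000, replace = TRUE)
)

# predict() is vectorised over the rows of newdata, so no loop is needed.
toy_pred <- predict(toy_model, newdata = toy_new)
head(toy_pred, 5)
```

Predictions made this way describe hypothetical listings, so they have no matching observed rent to be scored against.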

By analysing the results, the following conclusions can be drawn. Note that the summary shown above is that of the one-variable model (its Call line reads monthly_rent ~ size), so the figures below describe that fit:

Model coefficients: the intercept term is 1.447e+03, indicating that the predicted value of monthly_rent is about 1,447 when size is 0. The coefficient on size is -1.328e-05, indicating that each one-unit increase in size is associated with a decrease of 1.328e-05 in the predicted monthly_rent, an effect that is negligible in practical terms.

Multiple R-squared and adjusted R-squared are low: at 0.000378 and 0.000322 respectively, the model explains less than 0.04% of the variability in monthly_rent.

Small p-value for the F-statistic: the F-statistic tests the overall significance of the model. Its p-value of 0.009418 is below the usual significance level (e.g. 0.05), so the null hypothesis that the size coefficient is zero can be rejected. With 17,831 residual degrees of freedom, however, statistical significance does not imply a useful fit.

In summary, the predictive power shown by this summary is poor, and the model may need to be reconsidered in terms of feature selection or the addition of more relevant independent variables to improve its performance.
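The quantities discussed above (R-squared, adjusted R-squared, and the p-value of the F-statistic) can be read from a fitted lm object rather than copied from the printed summary. The sketch below does this on synthetic data in which size has only a very weak effect on rent. The data frame toy, the coefficient 0.05, and the seed are assumptions chosen to mimic a large sample with little explanatory power.

```r
set.seed(1)  # assumed seed for reproducibility

# Synthetic data: a large sample in which size barely influences rent.
n <- 20000
toy <- data.frame(size = runif(n, 400, 1800))
toy$monthly_rent <- 1400 + 0.05 * toy$size + rnorm(n, sd = 500)

fit <- lm(monthly_rent ~ size, data = toy)
s   <- summary(fit)

r2     <- s$r.squared      # share of variance explained
adj_r2 <- s$adj.r.squared  # penalised for the number of predictors

# s$fstatistic is a named vector: value, numdf, dendf
f_stat  <- s$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

print(c(R2 = r2, adj_R2 = adj_r2, F_p_value = p_value))
```

Reporting R-squared next to the p-value makes it easier to see when a model is statistically significant yet explains very little.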

mse_multi <- mean((c_data$monthly_rent - predicted_multi)^2)
rmse_multi <- sqrt(mse_multi)
print(paste("MSE (Multi-variable model):", mse_multi))
## [1] "MSE (Multi-variable model): 1818880.03797811"
print(paste("RMSE (Multi-variable model):", rmse_multi))
## [1] "RMSE (Multi-variable model): 1348.65860690469"

In this section, the performance of the multiple linear regression model is evaluated. The mean squared error (MSE) and root mean squared error (RMSE) are calculated to assess the model’s prediction accuracy. MSE is the average squared difference between the actual monthly_rent values (c_data$monthly_rent) and the predicted values (predicted_multi); RMSE is its square root.

Analysing these values together, the following conclusions can be drawn:

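The values of MSE and RMSE are large: the MSE is 1,818,880.04 and the RMSE is 1,348.66, which means that this multiple regression model has a large average prediction error and a poor fit to the actual observations, and further improvements are needed to improve the prediction accuracy. Because predicted_multi is computed from randomly generated predictor values, these figures change from run to run unless a random seed is set.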
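One alternative way to score a linear model, sketched here under the assumption that the aim is to measure fit to the observed rows, is to predict on the observed predictors themselves so that each prediction lines up with its own actual rent. The data frame toy, its coefficients, and the seed are invented for illustration.

```r
set.seed(3)  # assumed seed for reproducibility

# Synthetic stand-in for c_data; the relationship below is invented.
n <- 300
toy <- data.frame(
  size  = runif(n, 400, 1800),
  rooms = sample(1:4, n, replace = TRUE)
)
toy$monthly_rent <- 600 + 0.7 * toy$size + 120 * toy$rooms + rnorm(n, sd = 200)

fit <- lm(monthly_rent ~ size + rooms, data = toy)

# Predict on the observed rows: element i of pred_obs belongs to row i of toy.
pred_obs <- predict(fit, newdata = toy)
rmse_obs <- sqrt(mean((toy$monthly_rent - pred_obs)^2))

# The same in-sample RMSE, computed directly from the model residuals.
rmse_resid <- sqrt(mean(residuals(fit)^2))

print(c(rmse_obs = rmse_obs, rmse_resid = rmse_resid))
```

This is an in-sample figure and therefore tends to be optimistic; a held-out test set gives a fairer estimate.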

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

4.2.2 Use decision trees to predict rent prices

In this section, the rpart library is used to train a decision tree model for predicting monthly rent. The dataset c_data is divided into a training set (80% of the rows) and a test set (the remaining 20%). The decision tree is trained on the training set with monthly_rent as the target variable and region, furnished, rooms, bathroom, and property_type as the predictor variables. The fitted model is stored in the variable dt_model and is used for further analysis or prediction tasks.

library(rpart)
library(rpart.plot)

# Split c_data into a training set (80%) and a test set (20%)
s <- sample(seq_len(nrow(c_data)), nrow(c_data) * 0.8, replace = FALSE)
trainset <- c_data[s, ]
testset <- c_data[-s, ]

# Fit a regression tree on the training set
dt_model <- rpart(monthly_rent ~ region + furnished + rooms + bathroom + property_type,
                  data = trainset)
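The split above is random, so the rows that land in each set differ between runs. A minimal sketch of a reproducible 80/20 split follows. The seed value, the data frame toy, and its id column are assumptions introduced only to make the split repeatable and easy to check.

```r
set.seed(2024)  # assumed seed; any fixed value makes the split repeatable

# Toy data with an id column so that the split can be verified.
toy <- data.frame(
  id           = 1:100,
  monthly_rent = round(runif(100, 800, 3000))
)

# Draw 80% of the row indices without replacement.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets partition the data: no row appears in both.
shared_ids <- intersect(toy_train$id, toy_test$id)
print(c(train = nrow(toy_train), test = nrow(toy_test), shared = length(shared_ids)))
```

Fixing the seed before sampling means that later metrics can be reproduced exactly when the analysis is rerun.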

The decision tree model (dt_model) is visualized using the rpart.plot function.

The trained decision tree model (dt_model) is then applied to the test dataset (testset) using the predict function.

pre <- predict(dt_model, testset)
head(pre,5)
##         2         9        18        20        25 
## 1579.9894 1419.1201 1579.9894 1888.8332  908.5909

The Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are then calculated for the decision tree model’s predictions. The resulting values indicate the need to further optimize the model or adjust its parameters to improve the fit.

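# NOTE: pre holds one prediction per row of testset, whereas c_data$monthly_rent
# covers every row, so R recycles pre (hence the warning below). The actual values
# that match pre are testset$monthly_rent.
mse_multi <- mean((c_data$monthly_rent - pre)^2)
## Warning in c_data$monthly_rent - pre: longer object length is not a multiple of shorter object length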
rmse_multi <- sqrt(mse_multi)
print(paste("MSE (Decision tree model):", mse_multi))
## [1] "MSE (Decision tree model): 367930.481266472"
print(paste("RMSE (Decision tree model):", rmse_multi))
## [1] "RMSE (Decision tree model): 606.5727336985"
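To score a decision tree on held-out data, the actual values have to come from the same rows as the predictions. The sketch below illustrates this on synthetic data. The data frame toy, the invented rent formula, the factor levels of furnished, and the seed are all assumptions. It also compares the tree against a baseline that always predicts the mean training rent.

```r
library(rpart)

set.seed(7)  # assumed seed for reproducibility

# Synthetic stand-in for c_data; the relationship below is invented.
n <- 400
toy <- data.frame(
  rooms     = sample(1:4, n, replace = TRUE),
  bathroom  = sample(1:3, n, replace = TRUE),
  furnished = factor(sample(c("Fully", "Partially", "Not"), n, replace = TRUE))
)
toy$monthly_rent <- 800 + 300 * toy$rooms + 150 * toy$bathroom +
  ifelse(toy$furnished == "Fully", 400, 0) + rnorm(n, sd = 100)

idx       <- sample(seq_len(n), size = floor(0.8 * n))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

toy_tree <- rpart(monthly_rent ~ rooms + bathroom + furnished, data = toy_train)
toy_pre  <- predict(toy_tree, newdata = toy_test)

# Actual and predicted values both come from toy_test, so the lengths match.
stopifnot(length(toy_pre) == nrow(toy_test))
rmse_tree <- sqrt(mean((toy_test$monthly_rent - toy_pre)^2))

# Baseline: always predict the mean rent of the training set.
rmse_baseline <- sqrt(mean((toy_test$monthly_rent - mean(toy_train$monthly_rent))^2))

print(c(tree = rmse_tree, baseline = rmse_baseline))
```

Comparing against a constant baseline shows how much of the error the tree actually removes.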

A plot is then generated to compare the actual monthly rent values (c_data$monthly_rent) with the predicted values obtained from the decision tree model.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Fit a random forest on rooms, bathroom, and furnished
rf_model <- randomForest(monthly_rent ~ rooms + bathroom + furnished, data = c_data)

# Simulate 1000 hypothetical listings; parking is not a predictor in rf_model
new_data <- data.frame(
  rooms = sample(c(1, 2, 3, 4), 1000, replace = TRUE),
  parking = sample(c(0, 1, 2), 1000, replace = TRUE),
  bathroom = sample(c(1, 2), 1000, replace = TRUE),
  furnished = sample(c("Yes", "No"), 1000, replace = TRUE)
)

prediction <- predict(rf_model, newdata = new_data)
head(prediction, 5)
##        1        2        3        4        5 
## 1477.458 1477.458 1599.321 1330.993 1599.321
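# NOTE: prediction holds 1000 values for the simulated new_data, so R recycles it
# against the rows of c_data (hence the warning below).
mse_ran <- mean((c_data$monthly_rent - prediction)^2)
## Warning in c_data$monthly_rent - prediction:
## longer object length is not a multiple of shorter object length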
rmse_ran <- sqrt(mse_ran)
print(paste("MSE (Random forest model):", mse_ran))
## [1] "MSE (Random forest model): 338296.449813593"
print(paste("RMSE (Random forest model):", rmse_ran))
## [1] "RMSE (Random forest model): 581.63257286159"

In this section, the performance of the random forest model is evaluated. The mean squared error (MSE) and root mean squared error (RMSE) are calculated to quantify the model’s prediction accuracy. MSE is the average squared difference between the actual monthly_rent values (c_data$monthly_rent) and the predicted values (prediction); RMSE is its square root. The lower the MSE and RMSE, the better the model’s performance in predicting monthly rent.

In this model, the MSE was 338296.449813593 and the RMSE was 581.63257286159, indicating a large error between the predicted and actual observed values.

Based on these results, further adjustments to the model or the addition of other predictor variables may be required to improve the accuracy of the predictions.
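One possible adjustment, sketched below on synthetic data, is to score the random forest on a held-out test set instead of on simulated listings, so that every prediction has a matching observed rent. The data frame toy, the invented rent formula, the seed, and ntree = 200 are assumptions. The randomForest call is guarded so that the sketch still runs where the package is not installed.

```r
set.seed(11)  # assumed seed for reproducibility

# Synthetic stand-in for c_data; all names and values here are assumptions.
n <- 300
toy <- data.frame(
  rooms     = sample(1:4, n, replace = TRUE),
  bathroom  = sample(1:2, n, replace = TRUE),
  furnished = factor(sample(c("Yes", "No"), n, replace = TRUE))
)
toy$monthly_rent <- 900 + 250 * toy$rooms + 200 * toy$bathroom +
  ifelse(toy$furnished == "Yes", 300, 0) + rnorm(n, sd = 100)

idx       <- sample(seq_len(n), size = floor(0.8 * n))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

rmse_rf <- NA_real_
if (requireNamespace("randomForest", quietly = TRUE)) {
  toy_rf <- randomForest::randomForest(
    monthly_rent ~ rooms + bathroom + furnished,
    data = toy_train, ntree = 200
  )
  # Predictions and actual values both refer to the rows of toy_test.
  toy_pred <- predict(toy_rf, newdata = toy_test)
  rmse_rf  <- sqrt(mean((toy_test$monthly_rent - toy_pred)^2))
}
print(rmse_rf)
```

Building the test set from the same data frame as the training set keeps the factor levels of furnished consistent between fitting and prediction.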