1. Business Understanding

1.1 Problem Statement

In the real estate industry, one of the biggest issue is related to a very high cost in lead, when compared to the conversion ratio, and the problem expands to even second hand houses. This scenario could be due the mismatch in pricing, or the failure to find the common ground of pricing among the buyer and the seller. A property’s value usually depends on multiple factors, such as location, size, population density, and etc. Therefore, there is a need to have house price prediction models to help the buyers and sellers to reach a mutual agreement in the price of the house without bias. In addition, prediction of house prices will allow people who plan to buy a house to forecast the price range in the future, so they can have a better financial plan.

1.2 Project Objectives

  1. To determine the factors that influence house prices based on historical data.

  2. To develop a regression and classification model for house price prediction using machine learning.

  3. To evaluate the performance metrics of each model for the best model selection.

1.3 Data Mining Goals

  1. To develop a classification model that predicts the price of a house based on factors, such as age of the house, size of the property lot, and the surrounding conditions of the property.

  2. To develop a regression model that predicts the price of the house accurately

2. Data Understanding

Based on a set of house sale prices in King County, Seattle, we explored the application of regression and classification models for house price prediction. The data set contains houses sold from the period of May 2014 to May 2015, with information such as year built, size of house lot, and price of house. It contains 21,613 sample observations with 21 attribute variables that directly or indirectly affect the price of the house.

knitr::include_graphics("C:/Users/San/Documents/UM/WQD7004/R wd/Variable Explanation.png")

Image 1: Variable Explanation in dataset

3. Data Preparation

3.1 Install required packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(caret)
## Loading required package: lattice
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.2, built: 2024-04-10)
## ## Copyright (C) 2005-2024 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071)
library(readr)
library(RColorBrewer)
library(corrplot)
## corrplot 0.92 loaded
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ purrr   1.0.2     ✔ tibble  3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ randomForest::combine() masks gridExtra::combine(), dplyr::combine()
## ✖ mice::filter()          masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::lift()           masks caret::lift()
## ✖ randomForest::margin()  masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(readr)
library(Boruta)
library(Metrics)
## 
## Attaching package: 'Metrics'
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
suppressWarnings(library(readr))

3.2 Importing the dataset

house_df = read.csv("house_data.csv")
head(house_df)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
summary(house_df)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200
str(house_df)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
glimpse(house_df)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

3.3 Data Cleaning

Check for missing values

sum(is.na(house_df))
## [1] 0
#Check for missing values in each column
colSums(is.na(house_df))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0

There are no missing values in our data set.

Check for duplicates

duplicates <- house_df[duplicated(house_df), ]
print("Duplicate rows in the dataset:")
## [1] "Duplicate rows in the dataset:"
print(duplicates)
##  [1] id            date          price         bedrooms      bathrooms    
##  [6] sqft_living   sqft_lot      floors        waterfront    view         
## [11] condition     grade         sqft_above    sqft_basement yr_built     
## [16] yr_renovated  zipcode       lat           long          sqft_living15
## [21] sqft_lot15   
## <0 rows> (or 0-length row.names)
num_duplicates <- sum(duplicated(house_df))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0

There are no duplicates in our data set.

Checking unique values for each column

unique_counts <- sapply(house_df, function(x) length(unique(x)))
print(unique_counts)
##            id          date         price      bedrooms     bathrooms 
##         21436           372          4028            13            30 
##   sqft_living      sqft_lot        floors    waterfront          view 
##          1038          9782             6             2             5 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             5            12           946           306           116 
##  yr_renovated       zipcode           lat          long sqft_living15 
##            70            70          5034           752           777 
##    sqft_lot15 
##          8689

3.4 Data Transformation

house_df$date <- ymd_hms(house_df$date)
head(house_df)
##           id       date   price bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13  221900        3      1.00        1180     5650      1
## 2 6414100192 2014-12-09  538000        3      2.25        2570     7242      2
## 3 5631500400 2015-02-25  180000        2      1.00         770    10000      1
## 4 2487200875 2014-12-09  604000        4      3.00        1960     5000      1
## 5 1954400510 2015-02-18  510000        3      2.00        1680     8080      1
## 6 7237550310 2014-05-12 1225000        4      4.50        5420   101930      1
##   waterfront view condition grade sqft_above sqft_basement yr_built
## 1          0    0         3     7       1180             0     1955
## 2          0    0         3     7       2170           400     1951
## 3          0    0         3     6        770             0     1933
## 4          0    0         5     7       1050           910     1965
## 5          0    0         3     8       1680             0     1987
## 6          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

Making a copy of the data set

house_df_copy <- house_df
head(house_df_copy)
##           id       date   price bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13  221900        3      1.00        1180     5650      1
## 2 6414100192 2014-12-09  538000        3      2.25        2570     7242      2
## 3 5631500400 2015-02-25  180000        2      1.00         770    10000      1
## 4 2487200875 2014-12-09  604000        4      3.00        1960     5000      1
## 5 1954400510 2015-02-18  510000        3      2.00        1680     8080      1
## 6 7237550310 2014-05-12 1225000        4      4.50        5420   101930      1
##   waterfront view condition grade sqft_above sqft_basement yr_built
## 1          0    0         3     7       1180             0     1955
## 2          0    0         3     7       2170           400     1951
## 3          0    0         3     6        770             0     1933
## 4          0    0         5     7       1050           910     1965
## 5          0    0         3     8       1680             0     1987
## 6          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

4. EDA & Modeling For Regression

4.1 Data Transformation

Density Plot of Price and Sqft Living

d1<-ggplot(house_df_copy, aes(x=price)) +
  geom_density(fill="turquoise3") +
  scale_color_brewer(palette="Accent") + theme_minimal() +
  labs(title=bquote(bold("Density Plot for Price")),x="Price",y="Density")

d2<-ggplot(house_df_copy, aes(x=sqft_living)) + 
  geom_density(fill="turquoise3") + 
  scale_color_brewer(palette="Accent") + theme_minimal() +
  labs(title=bquote(bold("Density Plot for Sqft Living")),x="Sqft Living",y="Density")

grid.arrange(d1,d2,nrow=1,ncol=2)

Based on the density plot:

  • It is evident that there is a high level of skewness. It is indicated by the pronounced maximum value on the x-axis, making the extent of the right tail imperceptible.
  • However, it is highly leptokurtic distribution that lends this variable to be better classified as high rather than extreme.
  • From the observation, both house Price and Sqft living skewed to the right.
  • Log transformation will be the best fit in terms of normalizing the distribution.

Log Transformation

house_df_copy<-house_df_copy %>%
  mutate(log_price=log10(price),
         log_size=log10(sqft_living))

h1<-ggplot(house_df_copy) +
  geom_histogram(aes(x=log_price), fill="turquoise3", binwidth=0.10) +
  labs(title=bquote(bold("Histogram of Log-Transformed Price")),x="Log Price",y="Frequency")

h2<-ggplot(house_df_copy) +
  geom_histogram(aes(x=log_size), fill="turquoise3", binwidth=0.10) +
  labs(title=bquote(bold("Histogram of Log-Transformed Sqft Living")),x="Log Sqft Living",y="Frequency")

grid.arrange(h1,h2,nrow=1,ncol=2)

After performing log transformation:

  • Both histograms are normally distributed.
  • House price is between 5.4 to 6 million.
  • Sqft living is between of 3.0 to 3.5 sq.ft.

4.2 Analysis

Observe Bathroom Feature

ggplot(house_df_copy,aes(x=bathrooms)) + 
  geom_histogram(fill="turquoise3",binwidth=0.5,size=0.1) +
  scale_x_continuous(limits=c(1,8)) +
  labs(title=bquote(bold("Histogram of Bathroom Feature")),x="Bathroom",y="Count")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Based on the histogram:

  • The feature bathroom is right skewed with less instance for number of bathroom with shower.
  • The slight skewness shows that some houses have have multiple features like having both toilet and shower.

Scatterplot of Bedrooms with Log price

ggplot(house_df_copy, aes(x=bedrooms, y=log_price)) +
  geom_point(alpha=0.5, size=2) + 
  geom_smooth(method="lm", se = FALSE, color="darkred") + 
  labs(title = bquote(bold("Scatterplot of Bedrooms against Log Price")), x="Bedrooms", y="Log_price") + 
  theme(legend.position="none")
## `geom_smooth()` using formula = 'y ~ x'

Based on the scatterplot:

  • There is an outlier as seen in the scatter plot which unsually shows a large number of bathrooms from a normal house. Since the outlier can skew the mean and inflate the variance, it will lead to poor statistical inference.
  • Outliers tends to increase the residual sum of squares in regression analysis, which will led to a poorer fit and and cause overfitting to the outliers.
  • Hence, there is a significant evidence to exclude one outlier.

Handling One Outlier

house_df_copy %>% filter(bedrooms<30)%>%
  ggplot(aes(x=bedrooms,y=log_price,col=bedrooms)) +
  geom_point(alpha=0.5,size=2) +
  geom_smooth(method="lm",se=F, color="darkred") +
  labs(title=bquote(bold("Scatterplot of Bedrooms against Price")), x="Bedrooms", y="Log_price") +
  theme(legend.position="none")
## `geom_smooth()` using formula = 'y ~ x'

Based on the scatterplot after adding the handling:

  • There is a clearer representation of the relationship between the number of bathrooms and the house prices.
  • Since the one outlier have obscured pattern and is then removed, it can now reduce the impact of skewness.

Table of House Prices based on Condition

table(house_df_copy$condition)
## 
##     1     2     3     4     5 
##    30   172 14031  5679  1701

Based on the table:

  • 1 represent the worst condition and 5 represents the best rated house.

Table of Relative Mean Prices based on House Conditions

house_df_copy %>%
  group_by(condition=factor(condition)) %>%
  summarise(
    mean_price=mean(log_price),
    sd=sd(log_price),
    count=n()
  )
## # A tibble: 5 × 4
##   condition mean_price    sd count
##   <fct>          <dbl> <dbl> <int>
## 1 1               5.42 0.293    30
## 2 2               5.45 0.233   172
## 3 3               5.67 0.224 14031
## 4 4               5.65 0.228  5679
## 5 5               5.71 0.244  1701

Based on the table:

  • House with good condition (3) valued for relatively $5.67 million.
  • Majority of the houses greatly represent a significant number of good condition.

Distribution of Prices based on House Condition

ggplot(house_df_copy,aes(factor(condition),log_price,fill=factor(condition)))+
  geom_bar(stat="identity", fill="turquoise3") +
  theme(legend.position="none") +
  labs(title=bquote(bold("Distribution of Prices based on House Condition")), x="House Condition", y="Log_price")

Relationship of Log Price & Log Size against Condition(1-5)

options(repr.plot.width=8, repr.plot.height=5)
ggplot(house_df_copy, aes(x=log_size, y=log_price, color=factor(condition))) +
  geom_point(size=0.5) +
  geom_smooth(method="lm", se=FALSE, alpha=0.6, size=0.5, color="red") +
  scale_color_manual(values = rep("turquoise3", length(unique(house_df_copy$condition)))) +
  facet_wrap(~condition) +
  labs(title=bquote(bold("Relationship of Log Price & Log Size against House Condition")), x="Log_size", y="Log_price")
## `geom_smooth()` using formula = 'y ~ x'

Condition 1 and 2:

  • Observation: As for condition 1 and 2, they indicate a lower density of points with less variation in log price and log size
  • Reasoning: The houses which is in poor conditions are generally less desirable by people, hence it will cause less sales and limited variability in terms of the price and size.

Conditions 3, 4 and 5:

  • Observation: For conditions 3, 4, and 5, they indicate a much higher density of points and a wider spread in log price and log size.
  • Reasoning: When a house condition improves, the properties become more desirable, which will lead to more sales and greater variability in both size and price.

Display Table of House Floors

table1<- table(house_df_copy$floors)
print(table1)
## 
##     1   1.5     2   2.5     3   3.5 
## 10680  1910  8241   161   613     8

Barplot of Number of Houses against Floors

house_df_copy %>% 
  group_by(flr=factor(floors)) %>%
  summarise(floor_cnt=n()) %>%
  ggplot(aes(x=flr, y=floor_cnt, fill=flr)) +
  geom_bar(stat="identity", alpha=0.5) +
  scale_fill_manual(values=rep("turquoise3", length(unique(house_df_copy$floors)))) +
  theme(legend.position="none") +
  labs(x="Floors", y= "Number of Houses")

Single-Story Houses (1 Floor):

  • With 10,680 observations, single-story houses dominate the dataset, constituting the majority of residential properties.
  • Single-story homes are preferred by many homeowners.

Two-Story Houses (2 Floors):

  • Following single-story houses, two-story houses are the next most prevalent category, with 8,241 instances.
  • Two-story homes are popular for their efficient use of space.

Three Story Houses (3 Floors):

  • In contrast, three story houses are less common category, with only 613 observations.
  • These residences are rare and often feature unique architectural designs, potentially appealing to buyers seeking distinctive properties.

4.3 Split Train and Test Set

set.seed(123)
train_index <- createDataPartition(house_df_copy$price, p = 0.8, list = FALSE)
train_data <- house_df_copy[train_index, ]
test_data <- house_df_copy[-train_index, ]

4.4 Type of Regression Models

Linear Regression Model

linear_model <- lm(log_price ~ log_size, data = train_data)
summary(linear_model)
## 
## Call:
## lm(formula = log_price ~ log_size, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47878 -0.12766  0.00577  0.11188  0.56929 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.909201   0.022947   126.8   <2e-16 ***
## log_size    0.840864   0.006987   120.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.169 on 17290 degrees of freedom
## Multiple R-squared:  0.4559, Adjusted R-squared:  0.4558 
## F-statistic: 1.449e+04 on 1 and 17290 DF,  p-value: < 2.2e-16
# Predict
linear_pred <- predict(linear_model, test_data)
# NRMSE
linear_rmse <- rmse(test_data$log_price, linear_pred)
price_range <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- linear_rmse / price_range
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.08348207

Multiple Regression Model

mlr_model <- lm(log_price ~ log_size+bedrooms+bathrooms+view+grade, data = train_data)
summary(mlr_model)
## 
## Call:
## lm(formula = log_price ~ log_size + bedrooms + bathrooms + view + 
##     grade, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49508 -0.10869  0.00165  0.09916  0.57102 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646069   0.031281 116.559  < 2e-16 ***
## log_size     0.427195   0.012412  34.418  < 2e-16 ***
## bedrooms    -0.006439   0.001596  -4.036 5.47e-05 ***
## bathrooms   -0.002144   0.002371  -0.904    0.366    
## view         0.047087   0.001569  30.018  < 2e-16 ***
## grade        0.082918   0.001536  53.977  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.15 on 17286 degrees of freedom
## Multiple R-squared:  0.5716, Adjusted R-squared:  0.5715 
## F-statistic:  4614 on 5 and 17286 DF,  p-value: < 2.2e-16
# Predict
mlr_pred <- predict(mlr_model, test_data)
# NRMSE
mlr_rmse <- sqrt(mean((test_data$log_price - mlr_pred)^2))
price_range1 <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- mlr_rmse / price_range1
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.07414972

Multiple Linear Regression with Forward Selection

fwd_model<-lm(log_price~log_size+bedrooms+bathrooms+waterfront+view+condition+grade+yr_built+yr_renovated+zipcode+lat+long,data=train_data)
summary(fwd_model)
## 
## Call:
## lm(formula = log_price ~ log_size + bedrooms + bathrooms + waterfront + 
##     view + condition + grade + yr_built + yr_renovated + zipcode + 
##     lat + long, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58039 -0.07132  0.00025  0.06911  0.59150 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.115e+00  1.735e+00  -0.642    0.521    
## log_size      3.954e-01  9.488e-03  41.675  < 2e-16 ***
## bedrooms     -1.147e-02  1.202e-03  -9.546  < 2e-16 ***
## bathrooms     3.571e-02  1.933e-03  18.471  < 2e-16 ***
## waterfront    1.651e-01  1.112e-02  14.840  < 2e-16 ***
## view          3.091e-02  1.299e-03  23.784  < 2e-16 ***
## condition     2.125e-02  1.446e-03  14.698  < 2e-16 ***
## grade         8.175e-02  1.204e-03  67.920  < 2e-16 ***
## yr_built     -1.481e-03  4.325e-05 -34.230  < 2e-16 ***
## yr_renovated  1.176e-05  2.260e-06   5.203 1.99e-07 ***
## zipcode      -2.724e-04  2.026e-05 -13.443  < 2e-16 ***
## lat           6.128e-01  6.587e-03  93.023  < 2e-16 ***
## long         -4.278e-02  7.720e-03  -5.541 3.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1119 on 17279 degrees of freedom
## Multiple R-squared:  0.7618, Adjusted R-squared:  0.7616 
## F-statistic:  4604 on 12 and 17279 DF,  p-value: < 2.2e-16
# Predict
fwd_pred <- predict(fwd_model, test_data)
# NRMSE
fwd_rmse <- sqrt(mean((test_data$log_price - fwd_pred)^2))
price_range <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- fwd_rmse / price_range
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.05421957

4.4 Model Evaluation

Results:

regression_results <- data.frame(
  Model = c("Linear Regression", "Multiple Linear Regression", "Multiple Linear Regression with Forward Selection"),
  `Residual Standard Error` = c(0.169, 0.150, 0.111),
  `R squared` = c(0.4559, 0.5716, 0.7635),
  `Adjusted R squared` = c(0.4558, 0.5715, 0.7633),
  `NRMSE` = c("0.0835 (8.35%)", "0.07415 (7.42%)", "0.0542 (5.42%)"),
  check.names = FALSE
)

# Visualize in table form
kable(regression_results, format = "html", caption = "Model Performance Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "150px") # Adjust column width as needed
Model Performance Comparison
Model Residual Standard Error R squared Adjusted R squared NRMSE
Linear Regression 0.169 0.4559 0.4558 0.0835 (8.35%)
Multiple Linear Regression 0.150 0.5716 0.5715 0.07415 (7.42%)
Multiple Linear Regression with Forward Selection 0.111 0.7635 0.7633 0.0542 (5.42%)

Evaluation:

  • Multiple linear regression with forward selection shows a better outcome in its performance metrics compared to the simple linear regression and multiple linear regression.
  • Forward selection is a variable selection method in regression that adds variables to the model one by one, starting with the variable that has the largest correlation with the price.
  • By running multiple simulation, when the predictor variable meets the criteria, the process will continue by selecting the next variable. The simulations stops when no other predictor variables meet the criteria.
  • Hence, from the result of the simulations made, adding the the predictor variable whose inclusion to the model, gives the most statistically significant improvement of the fit as shown in the final model.
  • The smaller residual error suggest the model prediction are better.
  • R² and adjusted R² for linear regression shows the model has poor fit.
  • From range 0 to 1, multiple linear regression with forward selection R² and adjusted R² shows the best goodness of fit to the model.
  • NMRSE of the multiple linear regression model with forward selection shows lowest NMRSE suggesting the model prediction error is relatively the lower when compare to the range of the house prices.
  • Adjusted R² in this model is the highest among all the models. This model is the best choice among all.

4.5 Test Best Model

# Convert log price to actual
predicted_prices <- round(exp(fwd_pred), 0)

# Plot Predicted Price
ggplot(test_data, aes(x = price, y = predicted_prices)) +
  geom_point(color = "turquoise3") +
  geom_smooth(se = FALSE, color = "red4") +
  labs(title = "Actual vs Predicted Prices",
       x = "Actual Price",
       y = "Predicted Price")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Breusch-Pagan Test

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fwd_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  fwd_model
## BP = 605.95, df = 12, p-value < 2.2e-16

Homoscedasticity is the assumption that the residuals of the linear regression are distributed with equal variance at each level of the predictor variable. The Breusch-Pagan test is a test to determine if heteroscedasticity is present,

Since the best model is the multiple linear regression model, it is tested with the Breusch-Pagan test. With a p value less than 0.05, it concludes that the model contains heteroscedasticity, which means that there could be problems when the standard errors of the estimates is taken into account.

Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(fwd_model)
##     log_size     bedrooms    bathrooms   waterfront         view    condition 
##     4.209932     1.750675     3.049256     1.187862     1.338062     1.226033 
##        grade     yr_built yr_renovated      zipcode          lat         long 
##     2.756002     2.230270     1.146816     1.625435     1.154380     1.629755

Variance inflation factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIR number means that our predictors are correlated, so the lower variance the better.

There is no multicollinearity in our model, as the VIF values are all less than 5, which indicates a low correlation of a predictor with other predictors.

5. Modeling For Classification

5.1 Classification: Classifying house price into 4 categories

0:‘Very Affordable’, 1:‘Affordable’, 2:‘Moderate’, 3:‘Expensive’

house_df_class <- house_df_copy
quartiles <- quantile(house_df_class$price, probs = c(0, 0.25, 0.5, 0.75, 1))

# Create a new variable 'price_class' based on quartiles
house_df_class$price_class <- cut(house_df_class$price, 
                                  breaks = quartiles, 
                                  labels = c(0, 1, 2, 3),
                                  include.lowest = TRUE)
head(house_df_class$price_class, 5)
## [1] 0 2 0 2 2
## Levels: 0 1 2 3

For classification purpose, new column ‘price_class’ was created to classify the price of the houses into four categories based on percentiles.

house_df_class <- house_df_class %>%
  select(-price)
head(house_df_class)
##           id       date bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13        3      1.00        1180     5650      1
## 2 6414100192 2014-12-09        3      2.25        2570     7242      2
## 3 5631500400 2015-02-25        2      1.00         770    10000      1
## 4 2487200875 2014-12-09        4      3.00        1960     5000      1
## 5 1954400510 2015-02-18        3      2.00        1680     8080      1
## 6 7237550310 2014-05-12        4      4.50        5420   101930      1
##   waterfront view condition grade sqft_above sqft_basement yr_built
## 1          0    0         3     7       1180             0     1955
## 2          0    0         3     7       2170           400     1951
## 3          0    0         3     6        770             0     1933
## 4          0    0         5     7       1050           910     1965
## 5          0    0         3     8       1680             0     1987
## 6          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15 log_price
## 1            0   98178 47.5112 -122.257          1340       5650  5.346157
## 2         1991   98125 47.7210 -122.319          1690       7639  5.730782
## 3            0   98028 47.7379 -122.233          2720       8062  5.255273
## 4            0   98136 47.5208 -122.393          1360       5000  5.781037
## 5            0   98074 47.6168 -122.045          1800       7503  5.707570
## 6            0   98053 47.6561 -122.005          4760     101930  6.088136
##   log_size price_class
## 1 3.071882           0
## 2 3.409933           2
## 3 2.886491           0
## 4 3.292256           2
## 5 3.225309           2
## 6 3.733999           3

Once classified,the original ‘price’ column was removed from the dataset for house_df_class.

Distribution of price_class

ggplot(house_df_class, aes(x = price_class)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of price_class",
       x = "Price Class",
       y = "Count") +
  theme_minimal()

Correlation Heatmap

cor_matrix <- cor(house_df_copy[, sapply(house_df_copy, is.numeric)], use = "complete.obs")
melted_cor_matrix <- melt(cor_matrix)

ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "turquoise3", high = "red4", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "", y = "")

5.2 Feature Selection using Boruta Method

house_df_class$price_class <- as.factor(house_df_class$price_class)
set.seed(125)  # For reproducibility
boruta_result <- Boruta(price_class ~ ., data = house_df_class, doTrace = 2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
##  11. run of importance source...
##  12. run of importance source...
## After 12 iterations, +44 secs:
##  confirmed 22 attributes: bathrooms, bedrooms, condition, date, floors and 17 more;
##  no more attributes left.
print(boruta_result)
## Boruta performed 12 iterations in 44.13559 secs.
##  22 attributes confirmed important: bathrooms, bedrooms, condition,
## date, floors and 17 more;
##  No attributes deemed unimportant.
# Get a summary of the Boruta result
summary(boruta_result)
##               Length Class    Mode     
## finalDecision  22    factor   numeric  
## ImpHistory    300    -none-   numeric  
## pValue          1    -none-   numeric  
## maxRuns         1    -none-   numeric  
## light           1    -none-   logical  
## mcAdj           1    -none-   logical  
## timeTaken       1    difftime numeric  
## roughfixed      1    -none-   logical  
## call            4    -none-   call     
## impSource       1    -none-   character
# Plot the Boruta result for a visual inspection
plot(boruta_result, las = 2, cex.axis = 0.7)

# Finalize the tentative features (default settings)
final_boruta <- TentativeRoughFix(boruta_result)
## Warning in TentativeRoughFix(boruta_result): There are no Tentative attributes!
## Returning original object.
# Get the names of the confirmed important features
important_features <- getSelectedAttributes(final_boruta, withTentative = FALSE)

# Print the important features
print(important_features)
##  [1] "id"            "date"          "bedrooms"      "bathrooms"    
##  [5] "sqft_living"   "sqft_lot"      "floors"        "waterfront"   
##  [9] "view"          "condition"     "grade"         "sqft_above"   
## [13] "sqft_basement" "yr_built"      "yr_renovated"  "zipcode"      
## [17] "lat"           "long"          "sqft_living15" "sqft_lot15"   
## [21] "log_price"     "log_size"
selected_data <- house_df_class[, c(important_features, "price_class")]

Feature selection was done using the Boruta algorithm.It works as a wrapper algorithm around Random Forest.Boruta method outputs that all 20 variables are confirmed important in the dataset.

5.3 Split dataset into training and testing set

house_classmodel <- selected_data
set.seed(123)  # Set a seed for reproducibility
trainIndex <- createDataPartition(house_classmodel$price_class, p = 0.8, list = FALSE)
train_data <- house_classmodel[trainIndex, ]
test_data <- house_classmodel[-trainIndex, ]

5.4 Type of Classification Models

Decision Tree Classifier

dt_model <- rpart(price_class ~ ., data = train_data, method = "class")
test_pred <- predict(dt_model, newdata = test_data, type = "class")
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 1080    0    0    0
##          1    0 1092    0    0
##          2    0    0 1075    0
##          3    0    0    0 1074
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9991, 1)
##     No Information Rate : 0.2527     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2499   0.2527   0.2488   0.2486
## Detection Rate         0.2499   0.2527   0.2488   0.2486
## Detection Prevalence   0.2499   0.2527   0.2488   0.2486
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000

Random Forest Classifier

rf_model <- randomForest(price_class ~ ., data = train_data)
test_pred <- predict(rf_model, newdata = test_data)
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 1080    0    0    0
##          1    0 1092    0    0
##          2    0    0 1075    1
##          3    0    0    0 1073
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9998     
##                  95% CI : (0.9987, 1)
##     No Information Rate : 0.2527     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9997     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   1.0000   0.9991
## Specificity            1.0000   1.0000   0.9997   1.0000
## Pos Pred Value         1.0000   1.0000   0.9991   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9997
## Prevalence             0.2499   0.2527   0.2488   0.2486
## Detection Rate         0.2499   0.2527   0.2488   0.2483
## Detection Prevalence   0.2499   0.2527   0.2490   0.2483
## Balanced Accuracy      1.0000   1.0000   0.9998   0.9995

Support Vector Machine(SVM)

svm_model <- svm(price_class ~ ., data = train_data, kernel = "radial")
test_pred <- predict(svm_model, newdata = test_data)
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 1017   13    0    0
##          1   63 1047   44    0
##          2    0   32  998   44
##          3    0    0   33 1030
## 
## Overall Statistics
##                                           
##                Accuracy : 0.947           
##                  95% CI : (0.9399, 0.9535)
##     No Information Rate : 0.2527          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9293          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.9417   0.9588   0.9284   0.9590
## Specificity            0.9960   0.9669   0.9766   0.9898
## Pos Pred Value         0.9874   0.9073   0.9292   0.9690
## Neg Pred Value         0.9809   0.9858   0.9763   0.9865
## Prevalence             0.2499   0.2527   0.2488   0.2486
## Detection Rate         0.2354   0.2423   0.2310   0.2384
## Detection Prevalence   0.2384   0.2671   0.2486   0.2460
## Balanced Accuracy      0.9688   0.9628   0.9525   0.9744

5.5 Select Best Model

From the classification models, Decision Tree Model achieved the perfect accuracy of 1.00 and Kappa value of 1.00 followed by Random Forest Model with 0.99 and SVM with 0.95. Perfect accuracy achieved by Decision Tree Model might indicate that it is overfitting. An accuracy of 0.99 for Random Forest shows that the model correctly predicted 99% of the intances in the test set. Kappa value of 0.99 for Random Forest indicates almost perfect agreement between predicted and actual classes.An accuracy of 0.95% for SVM shows that the model correctly predicted 95% of the instances in the test set.Since decision tree model has achieved the perfect accuracy and taking overfitting into consideration, the best model selected is Random Forest Model in terms of accuracy and Kappa value.

6. Conclusion

In this study, we analyze house price data to explore the key factors affecting house prices. Data analysis includes the following aspects:

  1. Data visualization: By drawing scatter plots, histograms and other visualization tools, we discovered the relationship between housing prices and various variables.
  1. Classification analysis: We classify housing prices into four categories (very affordable, affordable, moderate, expensive) and use decision trees, random forests and support vector machines (SVM) to build classification models. The results show that the random forest model has the highest accuracy, reaching 0.99.

Based on the research results, we propose the following solutions:

  1. Improve the accuracy of house price forecasts:
  1. Improve transaction efficiency:
  1. Policy support:

In view of the research results, the contribution of this research to society can be mainly reflected in the following aspects:

  1. Market Transparency: A more transparent and predictable real estate market can help stabilize the market and reduce speculation.

  2. Policy formulation: Governments and policymakers can use these predictive models and data to formulate more effective housing policies and promote the healthy development of the real estate market.

In summary, through this study, we utilize multiple regression and classification techniques to significantly improve the accuracy and efficiency of house price prediction. This not only helps home buyers, sellers, real estate agents and investors make more informed decisions and reduce financial losses, but also provides financial institutions and policymakers with a powerful decision support system. Ultimately, these efforts will enhance the transparency and stability of the real estate market and promote its healthy and sustainable development.

7. References

Saskia. (2019). How to normalise the RMSE. Retrieved from https://www.marinedatascience.co/blog/2019/01/07/normalizing-the-rmse/. Accessed on 3rd June 2024.

Sheather, S.J. (2009). A Modern Approach to Regression with R. Springer. New York, USA.

Wang, L. (2017). Regression Methods for Skewed and Heteroscedastic Response with High Dimensional Covariates. Florida State University.