In the real estate industry, one of the biggest issue is related to a very high cost in lead, when compared to the conversion ratio, and the problem expands to even second hand houses. This scenario could be due the mismatch in pricing, or the failure to find the common ground of pricing among the buyer and the seller. A property’s value usually depends on multiple factors, such as location, size, population density, and etc. Therefore, there is a need to have house price prediction models to help the buyers and sellers to reach a mutual agreement in the price of the house without bias. In addition, prediction of house prices will allow people who plan to buy a house to forecast the price range in the future, so they can have a better financial plan.
To determine the factors that influence house prices based on historical data.
To develop a regression and classification model for house price prediction using machine learning.
To evaluate the performance metrics of each model for the best model selection.
To develop a classification model that predicts the price of a house based on factors, such as age of the house, size of the property lot, and the surrounding conditions of the property.
To develop a regression model that predicts the price of the house accurately
Based on a set of house sale prices in King County, Seattle, we explored the application of regression and classification models for house price prediction. The data set contains houses sold from the period of May 2014 to May 2015, with information such as year built, size of house lot, and price of house. It contains 21,613 sample observations with 21 attribute variables that directly or indirectly affect the price of the house.
knitr::include_graphics("C:/Users/San/Documents/UM/WQD7004/R wd/Variable Explanation.png")
Image 1: Variable Explanation in dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(caret)
## Loading required package: lattice
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.2, built: 2024-04-10)
## ## Copyright (C) 2005-2024 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(readr)
library(RColorBrewer)
library(corrplot)
## corrplot 0.92 loaded
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ randomForest::combine() masks gridExtra::combine(), dplyr::combine()
## ✖ mice::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(readr)
library(Boruta)
library(Metrics)
##
## Attaching package: 'Metrics'
##
## The following objects are masked from 'package:caret':
##
## precision, recall
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
suppressWarnings(library(readr))
house_df = read.csv("house_data.csv")
head(house_df)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
summary(house_df)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
str(house_df)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
glimpse(house_df)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
sum(is.na(house_df))
## [1] 0
#Check for missing values in each column
colSums(is.na(house_df))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
There are no missing values in our data set.
duplicates <- house_df[duplicated(house_df), ]
print("Duplicate rows in the dataset:")
## [1] "Duplicate rows in the dataset:"
print(duplicates)
## [1] id date price bedrooms bathrooms
## [6] sqft_living sqft_lot floors waterfront view
## [11] condition grade sqft_above sqft_basement yr_built
## [16] yr_renovated zipcode lat long sqft_living15
## [21] sqft_lot15
## <0 rows> (or 0-length row.names)
num_duplicates <- sum(duplicated(house_df))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0
There are no duplicates in our data set.
unique_counts <- sapply(house_df, function(x) length(unique(x)))
print(unique_counts)
## id date price bedrooms bathrooms
## 21436 372 4028 13 30
## sqft_living sqft_lot floors waterfront view
## 1038 9782 6 2 5
## condition grade sqft_above sqft_basement yr_built
## 5 12 946 306 116
## yr_renovated zipcode lat long sqft_living15
## 70 70 5034 752 777
## sqft_lot15
## 8689
house_df$date <- ymd_hms(house_df$date)
head(house_df)
## id date price bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13 221900 3 1.00 1180 5650 1
## 2 6414100192 2014-12-09 538000 3 2.25 2570 7242 2
## 3 5631500400 2015-02-25 180000 2 1.00 770 10000 1
## 4 2487200875 2014-12-09 604000 4 3.00 1960 5000 1
## 5 1954400510 2015-02-18 510000 3 2.00 1680 8080 1
## 6 7237550310 2014-05-12 1225000 4 4.50 5420 101930 1
## waterfront view condition grade sqft_above sqft_basement yr_built
## 1 0 0 3 7 1180 0 1955
## 2 0 0 3 7 2170 400 1951
## 3 0 0 3 6 770 0 1933
## 4 0 0 5 7 1050 910 1965
## 5 0 0 3 8 1680 0 1987
## 6 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
house_df_copy <- house_df
head(house_df_copy)
## id date price bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13 221900 3 1.00 1180 5650 1
## 2 6414100192 2014-12-09 538000 3 2.25 2570 7242 2
## 3 5631500400 2015-02-25 180000 2 1.00 770 10000 1
## 4 2487200875 2014-12-09 604000 4 3.00 1960 5000 1
## 5 1954400510 2015-02-18 510000 3 2.00 1680 8080 1
## 6 7237550310 2014-05-12 1225000 4 4.50 5420 101930 1
## waterfront view condition grade sqft_above sqft_basement yr_built
## 1 0 0 3 7 1180 0 1955
## 2 0 0 3 7 2170 400 1951
## 3 0 0 3 6 770 0 1933
## 4 0 0 5 7 1050 910 1965
## 5 0 0 3 8 1680 0 1987
## 6 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
d1<-ggplot(house_df_copy, aes(x=price)) +
geom_density(fill="turquoise3") +
scale_color_brewer(palette="Accent") + theme_minimal() +
labs(title=bquote(bold("Density Plot for Price")),x="Price",y="Density")
d2<-ggplot(house_df_copy, aes(x=sqft_living)) +
geom_density(fill="turquoise3") +
scale_color_brewer(palette="Accent") + theme_minimal() +
labs(title=bquote(bold("Density Plot for Sqft Living")),x="Sqft Living",y="Density")
grid.arrange(d1,d2,nrow=1,ncol=2)
Based on the density plot:
house_df_copy<-house_df_copy %>%
mutate(log_price=log10(price),
log_size=log10(sqft_living))
h1<-ggplot(house_df_copy) +
geom_histogram(aes(x=log_price), fill="turquoise3", binwidth=0.10) +
labs(title=bquote(bold("Histogram of Log-Transformed Price")),x="Log Price",y="Frequency")
h2<-ggplot(house_df_copy) +
geom_histogram(aes(x=log_size), fill="turquoise3", binwidth=0.10) +
labs(title=bquote(bold("Histogram of Log-Transformed Sqft Living")),x="Log Sqft Living",y="Frequency")
grid.arrange(h1,h2,nrow=1,ncol=2)
After performing log transformation:
ggplot(house_df_copy,aes(x=bathrooms)) +
geom_histogram(fill="turquoise3",binwidth=0.5,size=0.1) +
scale_x_continuous(limits=c(1,8)) +
labs(title=bquote(bold("Histogram of Bathroom Feature")),x="Bathroom",y="Count")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
Based on the histogram:
ggplot(house_df_copy, aes(x=bedrooms, y=log_price)) +
geom_point(alpha=0.5, size=2) +
geom_smooth(method="lm", se = FALSE, color="darkred") +
labs(title = bquote(bold("Scatterplot of Bedrooms against Log Price")), x="Bedrooms", y="Log_price") +
theme(legend.position="none")
## `geom_smooth()` using formula = 'y ~ x'
Based on the scatterplot:
house_df_copy %>% filter(bedrooms<30)%>%
ggplot(aes(x=bedrooms,y=log_price,col=bedrooms)) +
geom_point(alpha=0.5,size=2) +
geom_smooth(method="lm",se=F, color="darkred") +
labs(title=bquote(bold("Scatterplot of Bedrooms against Price")), x="Bedrooms", y="Log_price") +
theme(legend.position="none")
## `geom_smooth()` using formula = 'y ~ x'
Based on the scatterplot after adding the handling:
table(house_df_copy$condition)
##
## 1 2 3 4 5
## 30 172 14031 5679 1701
Based on the table:
house_df_copy %>%
group_by(condition=factor(condition)) %>%
summarise(
mean_price=mean(log_price),
sd=sd(log_price),
count=n()
)
## # A tibble: 5 × 4
## condition mean_price sd count
## <fct> <dbl> <dbl> <int>
## 1 1 5.42 0.293 30
## 2 2 5.45 0.233 172
## 3 3 5.67 0.224 14031
## 4 4 5.65 0.228 5679
## 5 5 5.71 0.244 1701
Based on the table:
ggplot(house_df_copy,aes(factor(condition),log_price,fill=factor(condition)))+
geom_bar(stat="identity", fill="turquoise3") +
theme(legend.position="none") +
labs(title=bquote(bold("Distribution of Prices based on House Condition")), x="House Condition", y="Log_price")
options(repr.plot.width=8, repr.plot.height=5)
ggplot(house_df_copy, aes(x=log_size, y=log_price, color=factor(condition))) +
geom_point(size=0.5) +
geom_smooth(method="lm", se=FALSE, alpha=0.6, size=0.5, color="red") +
scale_color_manual(values = rep("turquoise3", length(unique(house_df_copy$condition)))) +
facet_wrap(~condition) +
labs(title=bquote(bold("Relationship of Log Price & Log Size against House Condition")), x="Log_size", y="Log_price")
## `geom_smooth()` using formula = 'y ~ x'
Condition 1 and 2:
Conditions 3, 4 and 5:
Display Table of House Floors
table1<- table(house_df_copy$floors)
print(table1)
##
## 1 1.5 2 2.5 3 3.5
## 10680 1910 8241 161 613 8
house_df_copy %>%
group_by(flr=factor(floors)) %>%
summarise(floor_cnt=n()) %>%
ggplot(aes(x=flr, y=floor_cnt, fill=flr)) +
geom_bar(stat="identity", alpha=0.5) +
scale_fill_manual(values=rep("turquoise3", length(unique(house_df_copy$floors)))) +
theme(legend.position="none") +
labs(x="Floors", y= "Number of Houses")
Single-Story Houses (1 Floor):
Two-Story Houses (2 Floors):
Three Story Houses (3 Floors):
set.seed(123)
train_index <- createDataPartition(house_df_copy$price, p = 0.8, list = FALSE)
train_data <- house_df_copy[train_index, ]
test_data <- house_df_copy[-train_index, ]
linear_model <- lm(log_price ~ log_size, data = train_data)
summary(linear_model)
##
## Call:
## lm(formula = log_price ~ log_size, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47878 -0.12766 0.00577 0.11188 0.56929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.909201 0.022947 126.8 <2e-16 ***
## log_size 0.840864 0.006987 120.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.169 on 17290 degrees of freedom
## Multiple R-squared: 0.4559, Adjusted R-squared: 0.4558
## F-statistic: 1.449e+04 on 1 and 17290 DF, p-value: < 2.2e-16
# Predict
linear_pred <- predict(linear_model, test_data)
# NRMSE
linear_rmse <- rmse(test_data$log_price, linear_pred)
price_range <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- linear_rmse / price_range
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.08348207
mlr_model <- lm(log_price ~ log_size+bedrooms+bathrooms+view+grade, data = train_data)
summary(mlr_model)
##
## Call:
## lm(formula = log_price ~ log_size + bedrooms + bathrooms + view +
## grade, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49508 -0.10869 0.00165 0.09916 0.57102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646069 0.031281 116.559 < 2e-16 ***
## log_size 0.427195 0.012412 34.418 < 2e-16 ***
## bedrooms -0.006439 0.001596 -4.036 5.47e-05 ***
## bathrooms -0.002144 0.002371 -0.904 0.366
## view 0.047087 0.001569 30.018 < 2e-16 ***
## grade 0.082918 0.001536 53.977 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.15 on 17286 degrees of freedom
## Multiple R-squared: 0.5716, Adjusted R-squared: 0.5715
## F-statistic: 4614 on 5 and 17286 DF, p-value: < 2.2e-16
# Predict
mlr_pred <- predict(mlr_model, test_data)
# NRMSE
mlr_rmse <- sqrt(mean((test_data$log_price - mlr_pred)^2))
price_range1 <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- mlr_rmse / price_range1
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.07414972
fwd_model<-lm(log_price~log_size+bedrooms+bathrooms+waterfront+view+condition+grade+yr_built+yr_renovated+zipcode+lat+long,data=train_data)
summary(fwd_model)
##
## Call:
## lm(formula = log_price ~ log_size + bedrooms + bathrooms + waterfront +
## view + condition + grade + yr_built + yr_renovated + zipcode +
## lat + long, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58039 -0.07132 0.00025 0.06911 0.59150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.115e+00 1.735e+00 -0.642 0.521
## log_size 3.954e-01 9.488e-03 41.675 < 2e-16 ***
## bedrooms -1.147e-02 1.202e-03 -9.546 < 2e-16 ***
## bathrooms 3.571e-02 1.933e-03 18.471 < 2e-16 ***
## waterfront 1.651e-01 1.112e-02 14.840 < 2e-16 ***
## view 3.091e-02 1.299e-03 23.784 < 2e-16 ***
## condition 2.125e-02 1.446e-03 14.698 < 2e-16 ***
## grade 8.175e-02 1.204e-03 67.920 < 2e-16 ***
## yr_built -1.481e-03 4.325e-05 -34.230 < 2e-16 ***
## yr_renovated 1.176e-05 2.260e-06 5.203 1.99e-07 ***
## zipcode -2.724e-04 2.026e-05 -13.443 < 2e-16 ***
## lat 6.128e-01 6.587e-03 93.023 < 2e-16 ***
## long -4.278e-02 7.720e-03 -5.541 3.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1119 on 17279 degrees of freedom
## Multiple R-squared: 0.7618, Adjusted R-squared: 0.7616
## F-statistic: 4604 on 12 and 17279 DF, p-value: < 2.2e-16
# Predict
fwd_pred <- predict(fwd_model, test_data)
# NRMSE
fwd_rmse <- sqrt(mean((test_data$log_price - fwd_pred)^2))
price_range <- max(house_df_copy$log_price) - min(house_df_copy$log_price)
normalized_rmse <- fwd_rmse / price_range
cat("Normalized RMSE:", normalized_rmse, "\n")
## Normalized RMSE: 0.05421957
Results:
regression_results <- data.frame(
Model = c("Linear Regression", "Multiple Linear Regression", "Multiple Linear Regression with Forward Selection"),
`Residual Standard Error` = c(0.169, 0.150, 0.111),
`R squared` = c(0.4559, 0.5716, 0.7635),
`Adjusted R squared` = c(0.4558, 0.5715, 0.7633),
`NRMSE` = c("0.0835 (8.35%)", "0.07415 (7.42%)", "0.0542 (5.42%)"),
check.names = FALSE
)
# Visualize in table form
kable(regression_results, format = "html", caption = "Model Performance Comparison") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "center") %>%
column_spec(1, bold = TRUE) %>%
column_spec(2:5, width = "150px") # Adjust column width as needed
| Model | Residual Standard Error | R squared | Adjusted R squared | NRMSE |
|---|---|---|---|---|
| Linear Regression | 0.169 | 0.4559 | 0.4558 | 0.0835 (8.35%) |
| Multiple Linear Regression | 0.150 | 0.5716 | 0.5715 | 0.07415 (7.42%) |
| Multiple Linear Regression with Forward Selection | 0.111 | 0.7635 | 0.7633 | 0.0542 (5.42%) |
Evaluation:
# Convert log price to actual
predicted_prices <- round(exp(fwd_pred), 0)
# Plot Predicted Price
ggplot(test_data, aes(x = price, y = predicted_prices)) +
geom_point(color = "turquoise3") +
geom_smooth(se = FALSE, color = "red4") +
labs(title = "Actual vs Predicted Prices",
x = "Actual Price",
y = "Predicted Price")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fwd_model)
##
## studentized Breusch-Pagan test
##
## data: fwd_model
## BP = 605.95, df = 12, p-value < 2.2e-16
Homoscedasticity is the assumption that the residuals of the linear regression are distributed with equal variance at each level of the predictor variable. The Breusch-Pagan test is a test to determine if heteroscedasticity is present,
Since the best model is the multiple linear regression model, it is tested with the Breusch-Pagan test. With a p value less than 0.05, it concludes that the model contains heteroscedasticity, which means that there could be problems when the standard errors of the estimates is taken into account.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
vif(fwd_model)
## log_size bedrooms bathrooms waterfront view condition
## 4.209932 1.750675 3.049256 1.187862 1.338062 1.226033
## grade yr_built yr_renovated zipcode lat long
## 2.756002 2.230270 1.146816 1.625435 1.154380 1.629755
Variance inflation factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIR number means that our predictors are correlated, so the lower variance the better.
There is no multicollinearity in our model, as the VIF values are all less than 5, which indicates a low correlation of a predictor with other predictors.
0:‘Very Affordable’, 1:‘Affordable’, 2:‘Moderate’, 3:‘Expensive’
house_df_class <- house_df_copy
quartiles <- quantile(house_df_class$price, probs = c(0, 0.25, 0.5, 0.75, 1))
# Create a new variable 'price_class' based on quartiles
house_df_class$price_class <- cut(house_df_class$price,
breaks = quartiles,
labels = c(0, 1, 2, 3),
include.lowest = TRUE)
head(house_df_class$price_class, 5)
## [1] 0 2 0 2 2
## Levels: 0 1 2 3
For classification purpose, new column ‘price_class’ was created to classify the price of the houses into four categories based on percentiles.
house_df_class <- house_df_class %>%
select(-price)
head(house_df_class)
## id date bedrooms bathrooms sqft_living sqft_lot floors
## 1 7129300520 2014-10-13 3 1.00 1180 5650 1
## 2 6414100192 2014-12-09 3 2.25 2570 7242 2
## 3 5631500400 2015-02-25 2 1.00 770 10000 1
## 4 2487200875 2014-12-09 4 3.00 1960 5000 1
## 5 1954400510 2015-02-18 3 2.00 1680 8080 1
## 6 7237550310 2014-05-12 4 4.50 5420 101930 1
## waterfront view condition grade sqft_above sqft_basement yr_built
## 1 0 0 3 7 1180 0 1955
## 2 0 0 3 7 2170 400 1951
## 3 0 0 3 6 770 0 1933
## 4 0 0 5 7 1050 910 1965
## 5 0 0 3 8 1680 0 1987
## 6 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15 log_price
## 1 0 98178 47.5112 -122.257 1340 5650 5.346157
## 2 1991 98125 47.7210 -122.319 1690 7639 5.730782
## 3 0 98028 47.7379 -122.233 2720 8062 5.255273
## 4 0 98136 47.5208 -122.393 1360 5000 5.781037
## 5 0 98074 47.6168 -122.045 1800 7503 5.707570
## 6 0 98053 47.6561 -122.005 4760 101930 6.088136
## log_size price_class
## 1 3.071882 0
## 2 3.409933 2
## 3 2.886491 0
## 4 3.292256 2
## 5 3.225309 2
## 6 3.733999 3
Once classified,the original ‘price’ column was removed from the dataset for house_df_class.
ggplot(house_df_class, aes(x = price_class)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Distribution of price_class",
x = "Price Class",
y = "Count") +
theme_minimal()
cor_matrix <- cor(house_df_copy[, sapply(house_df_copy, is.numeric)], use = "complete.obs")
melted_cor_matrix <- melt(cor_matrix)
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "turquoise3", high = "red4", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab",
name = "Correlation") +
geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)) +
labs(title = "Correlation Heatmap", x = "", y = "")
house_df_class$price_class <- as.factor(house_df_class$price_class)
set.seed(125) # For reproducibility
boruta_result <- Boruta(price_class ~ ., data = house_df_class, doTrace = 2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## 12. run of importance source...
## After 12 iterations, +44 secs:
## confirmed 22 attributes: bathrooms, bedrooms, condition, date, floors and 17 more;
## no more attributes left.
print(boruta_result)
## Boruta performed 12 iterations in 44.13559 secs.
## 22 attributes confirmed important: bathrooms, bedrooms, condition,
## date, floors and 17 more;
## No attributes deemed unimportant.
# Get a summary of the Boruta result
summary(boruta_result)
## Length Class Mode
## finalDecision 22 factor numeric
## ImpHistory 300 -none- numeric
## pValue 1 -none- numeric
## maxRuns 1 -none- numeric
## light 1 -none- logical
## mcAdj 1 -none- logical
## timeTaken 1 difftime numeric
## roughfixed 1 -none- logical
## call 4 -none- call
## impSource 1 -none- character
# Plot the Boruta result for a visual inspection
plot(boruta_result, las = 2, cex.axis = 0.7)
# Finalize the tentative features (default settings)
final_boruta <- TentativeRoughFix(boruta_result)
## Warning in TentativeRoughFix(boruta_result): There are no Tentative attributes!
## Returning original object.
# Get the names of the confirmed important features
important_features <- getSelectedAttributes(final_boruta, withTentative = FALSE)
# Print the important features
print(important_features)
## [1] "id" "date" "bedrooms" "bathrooms"
## [5] "sqft_living" "sqft_lot" "floors" "waterfront"
## [9] "view" "condition" "grade" "sqft_above"
## [13] "sqft_basement" "yr_built" "yr_renovated" "zipcode"
## [17] "lat" "long" "sqft_living15" "sqft_lot15"
## [21] "log_price" "log_size"
selected_data <- house_df_class[, c(important_features, "price_class")]
Feature selection was done using the Boruta algorithm.It works as a wrapper algorithm around Random Forest.Boruta method outputs that all 20 variables are confirmed important in the dataset.
house_classmodel <- selected_data
set.seed(123) # Set a seed for reproducibility
trainIndex <- createDataPartition(house_classmodel$price_class, p = 0.8, list = FALSE)
train_data <- house_classmodel[trainIndex, ]
test_data <- house_classmodel[-trainIndex, ]
dt_model <- rpart(price_class ~ ., data = train_data, method = "class")
test_pred <- predict(dt_model, newdata = test_data, type = "class")
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 1080 0 0 0
## 1 0 1092 0 0
## 2 0 0 1075 0
## 3 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2527
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2499 0.2527 0.2488 0.2486
## Detection Rate 0.2499 0.2527 0.2488 0.2486
## Detection Prevalence 0.2499 0.2527 0.2488 0.2486
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000
rf_model <- randomForest(price_class ~ ., data = train_data)
test_pred <- predict(rf_model, newdata = test_data)
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 1080 0 0 0
## 1 0 1092 0 0
## 2 0 0 1075 1
## 3 0 0 0 1073
##
## Overall Statistics
##
## Accuracy : 0.9998
## 95% CI : (0.9987, 1)
## No Information Rate : 0.2527
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9997
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 1.0000 1.0000 0.9991
## Specificity 1.0000 1.0000 0.9997 1.0000
## Pos Pred Value 1.0000 1.0000 0.9991 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 0.9997
## Prevalence 0.2499 0.2527 0.2488 0.2486
## Detection Rate 0.2499 0.2527 0.2488 0.2483
## Detection Prevalence 0.2499 0.2527 0.2490 0.2483
## Balanced Accuracy 1.0000 1.0000 0.9998 0.9995
svm_model <- svm(price_class ~ ., data = train_data, kernel = "radial")
test_pred <- predict(svm_model, newdata = test_data)
confusionMatrix(data = as.factor(test_pred), reference = as.factor(test_data$price_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 1017 13 0 0
## 1 63 1047 44 0
## 2 0 32 998 44
## 3 0 0 33 1030
##
## Overall Statistics
##
## Accuracy : 0.947
## 95% CI : (0.9399, 0.9535)
## No Information Rate : 0.2527
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9293
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 0.9417 0.9588 0.9284 0.9590
## Specificity 0.9960 0.9669 0.9766 0.9898
## Pos Pred Value 0.9874 0.9073 0.9292 0.9690
## Neg Pred Value 0.9809 0.9858 0.9763 0.9865
## Prevalence 0.2499 0.2527 0.2488 0.2486
## Detection Rate 0.2354 0.2423 0.2310 0.2384
## Detection Prevalence 0.2384 0.2671 0.2486 0.2460
## Balanced Accuracy 0.9688 0.9628 0.9525 0.9744
From the classification models, Decision Tree Model achieved the perfect accuracy of 1.00 and Kappa value of 1.00 followed by Random Forest Model with 0.99 and SVM with 0.95. Perfect accuracy achieved by Decision Tree Model might indicate that it is overfitting. An accuracy of 0.99 for Random Forest shows that the model correctly predicted 99% of the intances in the test set. Kappa value of 0.99 for Random Forest indicates almost perfect agreement between predicted and actual classes.An accuracy of 0.95% for SVM shows that the model correctly predicted 95% of the instances in the test set.Since decision tree model has achieved the perfect accuracy and taking overfitting into consideration, the best model selected is Random Forest Model in terms of accuracy and Kappa value.
In this study, we analyze house price data to explore the key factors affecting house prices. Data analysis includes the following aspects:
As the number of bedrooms and bathrooms increases, so does the home price.
Factors such as the number of floors a home has, whether it has water views, views, condition and grade have a significant impact on home prices.
Based on the research results, we propose the following solutions:
Data-driven decisions: Using a variety of regression and classification techniques to predict home prices can greatly improve the accuracy of predictions, helping homebuyers, sellers, real estate agents, and investors make more informed decisions.
Reduce financial losses: More accurate price forecasts can help all parties avoid unnecessary financial losses.
In view of the research results, the contribution of this research to society can be mainly reflected in the following aspects:
Market Transparency: A more transparent and predictable real estate market can help stabilize the market and reduce speculation.
Policy formulation: Governments and policymakers can use these predictive models and data to formulate more effective housing policies and promote the healthy development of the real estate market.
In summary, through this study, we utilize multiple regression and classification techniques to significantly improve the accuracy and efficiency of house price prediction. This not only helps home buyers, sellers, real estate agents and investors make more informed decisions and reduce financial losses, but also provides financial institutions and policymakers with a powerful decision support system. Ultimately, these efforts will enhance the transparency and stability of the real estate market and promote its healthy and sustainable development.
Saskia. (2019). How to normalise the RMSE. Retrieved from https://www.marinedatascience.co/blog/2019/01/07/normalizing-the-rmse/. Accessed on 3rd June 2024.
Sheather, S.J. (2009). A Modern Approach to Regression with R. Springer. New York, USA.
Wang, L. (2017). Regression Methods for Skewed and Heteroscedastic Response with High Dimensional Covariates. Florida State University.