The Zillow Home Price Data are time series datasets available online via this link. The datasets selected are updated monthly:
Home Values – Zillow Home Values Index (ZHVI) that tells us the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time.).
Home Values Forescasts: is the one-year forecast of ZHVI and is created using the all homes, mid-tier cut of ZHVI.
Rentals: a smoothed measure of the typical observed market rate rent acreoss a given region.
Inventory: The count of unique listings that were active at any time in a given month.
LIST AND SALE PRICES:The median price at which homes across various geographies were listed and sold.
SALES COUNT AND PRICE CUTS: is the estimated number of unique properties that sold during the month after accounting for the latency between when sales occur and when they are reported.
These time series datasets were selected as they are consistent with my interest of working on a project of “Forecasting Home Value in Cincinnati”.
In 2020, as COVID-19 pandemic spreads the world and policies of lock down and social distancing coming out, the predictions on housing market has been very pessimistic. Experts and media made forecasts and implied a shrink in US proprietary prices and housing market as the GDP decreases and economic struggled to recover. However, the housing market had unprecedented increase since late 2020, rising by nearly 20% over last year (2021). In 2022, how will the housing market perform? Is the demand still strong or should we concern about home values decrease or even crash of the market? The abnormal pattern in 2021 may make it difficult to forecast as it is not a normal trend and will increase the expectation of sellers. We have to see if the increase trend is a one-time thing or it will continue in 2022 as well.
Import datasets
We loaded five separate datasets. These are all wide dataset that need to be transform to long dataset later:
home_value has 908 obs(regions) and 269 variables (monthly; Jan/2000- Dec/2021).
Rentals: has 102 obs(regions) and 99 variables (monthly; Jan/2014- Dec/2021).
Inventory: has 917 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).
LIST AND SALE PRICES:has 95 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).
SALES COUNT AND PRICE CUTS: has 95 obs(regions) and 172 variables (monthly; Feb/2008- Dec/2021).
home_value<- read_csv("data/home_values.csv")
inventory<- read_csv("data/inventory.csv")
list_and_sale_price<- read_csv("data/list_and_sale_price.csv")
rentals<- read_csv("data/rentals.csv")
sales_count_and_price_cuts<- read_csv("data/sales_count_and_price_cuts.csv")
Data for Cincinnati, OH
We can learn from the above datasets that the city Cincinnati, OH was given a unique region id “394466”. The size of the city ranked 28 among 900 and more cities (regions). Additionally, the region type of the city is Metropolitan statistical areas (MSA).
cin_home_value<- home_value%>%
filter(RegionID==394466)%>%
pivot_longer(6:269,names_to="date",values_to="home_value")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_inventory<- inventory%>%
filter(RegionID==394466)%>%
pivot_longer(6:53,names_to="date",values_to="inventory")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_list_and_sale_price<- list_and_sale_price%>%
filter(RegionID==394466)%>%
pivot_longer(6:53,names_to="date",values_to="list_sale_price")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_rentals<- rentals%>%
filter(RegionID==394466)%>%
pivot_longer(4:99,names_to="date",values_to="rentals")%>%
mutate(month= as.Date(as.yearmon(date)))%>%
mutate(new_month=as.Date(month) + months(1) - days(1))
cin_sales_count_and_price_cuts<- sales_count_and_price_cuts%>%
filter(RegionID==394466)%>%
pivot_longer(6:172,names_to="date",values_to="sales_count_price")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
This is the Cincinnati full combined data of the five separate datasets with different time length and the dimension is 264 observations and 12 columns.
cin_combine_full<- cin_home_value%>%
left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
left_join(cin_rentals[,c(5,7)], by="new_month")%>%
left_join(cin_inventory[,7:8], by="new_month")%>%
left_join(cin_list_and_sale_price[,7:8], by="new_month")
# good alternative library for descriptive statistics
kable(describe(cin_combine_full),
format='markdown',
caption="cin_combine_full",
digits=4)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RegionID | 1 | 264 | 394466.000 | 0.0000 | 394466.0 | 394466.000 | 0.0000 | 394466 | 394466 | 0 | NaN | NaN | 0.0000 |
| SizeRank | 2 | 264 | 28.000 | 0.0000 | 28.0 | 28.000 | 0.0000 | 28 | 28 | 0 | NaN | NaN | 0.0000 |
| RegionName* | 3 | 264 | 1.000 | 0.0000 | 1.0 | 1.000 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| RegionType* | 4 | 264 | 1.000 | 0.0000 | 1.0 | 1.000 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| StateName* | 5 | 264 | 1.000 | 0.0000 | 1.0 | 1.000 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| date* | 6 | 264 | 132.500 | 76.3544 | 132.5 | 132.500 | 97.8516 | 1 | 264 | 263 | 0.0000 | -1.2136 | 4.6993 |
| home_value | 7 | 264 | 156425.159 | 21975.3198 | 149947.5 | 152736.406 | 15177.3762 | 129290 | 240833 | 111543 | 1.7211 | 3.0916 | 1352.4874 |
| new_month | 8 | 264 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| sales_count_price | 9 | 167 | 2198.054 | 753.7124 | 2100.0 | 2191.318 | 907.3512 | 788 | 3819 | 3031 | 0.0967 | -1.0884 | 58.3240 |
| rentals | 10 | 96 | 1107.562 | 119.7290 | 1085.5 | 1099.769 | 135.6579 | 931 | 1380 | 449 | 0.4691 | -0.7964 | 12.2198 |
| inventory | 11 | 48 | 7716.104 | 1612.7197 | 7629.5 | 7742.450 | 2003.7339 | 4834 | 10286 | 5452 | -0.1090 | -1.1401 | 232.7760 |
| list_sale_price | 12 | 48 | 286471.938 | 37555.2336 | 283800.0 | 286422.300 | 46306.7871 | 218300 | 354433 | 136133 | 0.0240 | -1.1700 | 5420.6311 |
This is the Cincinnati combined data of the five separate datasets above with common time length and it has 48 obs and 12 variables.
cin_combine<- cin_inventory%>%
left_join(cin_list_and_sale_price[,7:8], by="new_month")%>%
left_join(cin_rentals[,c(5,7)], by="new_month")%>%
left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
left_join(cin_home_value[,7:8], by="new_month")
# good alternative library for descriptive statistics
kable(describe(cin_combine),
format='markdown',
caption="cin_combine",
digits=4)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RegionID | 1 | 48 | 394466.000 | 0.0000 | 394466.0 | 394466.00 | 0.0000 | 394466 | 394466 | 0 | NaN | NaN | 0.0000 |
| SizeRank | 2 | 48 | 28.000 | 0.0000 | 28.0 | 28.00 | 0.0000 | 28 | 28 | 0 | NaN | NaN | 0.0000 |
| RegionName* | 3 | 48 | 1.000 | 0.0000 | 1.0 | 1.00 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| RegionType* | 4 | 48 | 1.000 | 0.0000 | 1.0 | 1.00 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| StateName* | 5 | 48 | 1.000 | 0.0000 | 1.0 | 1.00 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000 |
| date* | 6 | 48 | 24.500 | 14.0000 | 24.5 | 24.50 | 17.7912 | 1 | 48 | 47 | 0.0000 | -1.2752 | 2.0207 |
| inventory | 7 | 48 | 7716.104 | 1612.7197 | 7629.5 | 7742.45 | 2003.7339 | 4834 | 10286 | 5452 | -0.1090 | -1.1401 | 232.7760 |
| new_month | 8 | 48 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| list_sale_price | 9 | 48 | 286471.938 | 37555.2336 | 283800.0 | 286422.30 | 46306.7871 | 218300 | 354433 | 136133 | 0.0240 | -1.1700 | 5420.6311 |
| rentals | 10 | 48 | 1207.042 | 82.0039 | 1197.0 | 1202.75 | 89.6973 | 1087 | 1380 | 293 | 0.4379 | -0.8635 | 11.8362 |
| sales_count_price | 11 | 48 | 2828.875 | 520.0404 | 2929.0 | 2847.30 | 558.1989 | 1797 | 3819 | 2022 | -0.3858 | -0.8844 | 75.0614 |
| home_value | 12 | 48 | 194761.771 | 21069.2998 | 187386.5 | 193074.45 | 17209.2795 | 167537 | 240833 | 73296 | 0.7606 | -0.6403 | 3041.0915 |
This is a subset of combined dataset which omited info about the city.
cin_sub<- cin_combine[, 7:12]
# good alternative library for descriptive statistics
kable(describe(cin_sub),
format='markdown',
caption="cin_sub",
digits=4)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| inventory | 1 | 48 | 7716.104 | 1612.7197 | 7629.5 | 7742.45 | 2003.7339 | 4834 | 10286 | 5452 | -0.1090 | -1.1401 | 232.7760 |
| new_month | 2 | 48 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| list_sale_price | 3 | 48 | 286471.938 | 37555.2336 | 283800.0 | 286422.30 | 46306.7871 | 218300 | 354433 | 136133 | 0.0240 | -1.1700 | 5420.6311 |
| rentals | 4 | 48 | 1207.042 | 82.0039 | 1197.0 | 1202.75 | 89.6973 | 1087 | 1380 | 293 | 0.4379 | -0.8635 | 11.8362 |
| sales_count_price | 5 | 48 | 2828.875 | 520.0404 | 2929.0 | 2847.30 | 558.1989 | 1797 | 3819 | 2022 | -0.3858 | -0.8844 | 75.0614 |
| home_value | 6 | 48 | 194761.771 | 21069.2998 | 187386.5 | 193074.45 | 17209.2795 | 167537 | 240833 | 73296 | 0.7606 | -0.6403 | 3041.0915 |
home_value_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,home_value))+
theme_bw()+
xlab("Month")+
ylab("Home Value")+
labs(
title = 'Home Value in Cincinnati',
subtitle = 'January 2000 - December 2021')
inventory_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,inventory))+
theme_bw()+
xlab("Month")+
ylab("Inventory")+
labs(
title = 'Inventory in Cincinnati',
subtitle = 'January 2018 - December 2021')
list_and_sale_price_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,list_sale_price))+
theme_bw()+
xlab("Month")+
ylab("List and sale price")+
labs(
title = 'List and sale price in Cincinnati',
subtitle = 'January 2018 - December 2021')
rentals_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,rentals))+
theme_bw()+
xlab("Month")+
ylab("Rentals")+
labs(
title = 'Rentals in Cincinnati',
subtitle = 'January 2014 - December 2021')
sales_count_price_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,sales_count_price))+
theme_bw()+
xlab("Month")+
ylab("Sales count and price cut")+
labs(
title = 'Sales count and price cut in Cincinnati',
subtitle = 'Feburary 2008 - December 2021')
grid.arrange(home_value_full_plot,
inventory_full_plot,
list_and_sale_price_full_plot,
rentals_full_plot,
sales_count_price_full_plot,
ncol=2, nrow =3)
home_value
home_value_plot = ggplot(cin_combine)+
geom_line(aes(new_month,home_value))+
geom_smooth(aes(new_month,home_value),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Home Value")+
labs(
title = 'Home Value in Cincinnati',
subtitle = 'January 2018 - December 2021')
home_value_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
home_value, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Home Value in Cincinnati by Year' )
grid.arrange(home_value_plot, home_value_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
inventory
inventory_plot = ggplot(cin_combine)+
geom_line(aes(new_month,inventory))+
geom_smooth(aes(new_month,inventory),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Inventory")+
labs(
title = 'Inventory in Cincinnati',
subtitle = 'January 2018 - December 2021')
inventory_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
inventory, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Inventory in Cincinnati by Year' )
grid.arrange(inventory_plot, inventory_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
list_sale_price
list_and_sale_price_plot = ggplot(cin_combine)+
geom_line(aes(new_month,list_sale_price))+
geom_smooth(aes(new_month,list_sale_price),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("List and sale price")+
labs(
title = 'List and sale price in Cincinnati',
subtitle = 'January 2018 - December 2021')
list_and_sale_price_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
list_sale_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'List and sale price in Cincinnati by Year' )
grid.arrange(list_and_sale_price_plot, list_and_sale_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
`rentals``
rentals_plot = ggplot(cin_combine)+
geom_line(aes(new_month,rentals))+
geom_smooth(aes(new_month,rentals),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Rentals")+
labs(
title = 'Rentals in Cincinnati',
subtitle = 'January 2018 - December 2021')
rentals_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
rentals, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Rentals in Cincinnati by Year' )
grid.arrange(rentals_plot, rentals_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
`sales_count_price``
sales_count_price_plot = ggplot(cin_combine)+
geom_line(aes(new_month,sales_count_price))+
geom_smooth(aes(new_month,sales_count_price),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Sales count and price cut")+
labs(
title = 'Sales count and price cut in Cincinnati',
subtitle = 'January 2018 - December 2021')
sales_count_price_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
sales_count_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Sales count and price cut in Cincinnati by Year' )
grid.arrange(sales_count_price_plot, sales_count_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
As implied in the EDA part, home value , and rentalsseem to increase over time and indicated the same for each year (2018-2021); Similarly, for list and sale price, there’s an increasing trend over the years but a variation within each year (this maybe the seasonal); On the contrary, inventory seems to decreases overtime as well as yearly; Lastly, for sales count and price cut the variation tends to slightly decrease over the years.
Now we want to model the linear trend and see if it’s consistent with our observations from the visualization.
# Test whether there is a true linaer trend over time
mod1 = lm(home_value~new_month,data=cin_combine_full)
summary(mod1)
mod2 = lm(inventory~new_month,data=cin_combine_full)
summary(mod2)
mod3 = lm(list_sale_price~new_month,data=cin_combine_full)
summary(mod3)
mod4 = lm(rentals~new_month,data=cin_combine_full)
summary(mod4)
mod5 = lm(sales_count_price~new_month,data=cin_combine_full)
summary(mod5)
The outcome table:
tab_model(mod1, mod2,mod3,mod4,mod5)
| home value | inventory | list sale price | rentals | sales count price | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 67293.58 | 53729.25 – 80857.90 | <0.001 | 62976.81 | 50644.46 – 75309.16 | <0.001 | -1130447.61 | -1357341.28 – -903553.93 | <0.001 | -1332.23 | -1420.22 – -1244.23 | <0.001 | -4112.02 | -4975.12 – -3248.91 | <0.001 |
| new month | 5.95 | 5.05 – 6.84 | <0.001 | -3.02 | -3.70 – -2.35 | <0.001 | 77.53 | 65.12 – 89.94 | <0.001 | 0.14 | 0.13 – 0.14 | <0.001 | 0.38 | 0.33 – 0.44 | <0.001 |
| Observations | 264 | 48 | 48 | 96 | 167 | ||||||||||
| R2 / R2 adjusted | 0.396 / 0.393 | 0.639 / 0.631 | 0.775 / 0.770 | 0.970 / 0.970 | 0.560 / 0.557 | ||||||||||