Section 1

Datasets Description

The Zillow Home Price Data are time series datasets available online via this link. The datasets selected are updated monthly:

  • Home Values – Zillow Home Values Index (ZHVI) that tells us the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time.).

  • Home Values Forescasts: is the one-year forecast of ZHVI and is created using the all homes, mid-tier cut of ZHVI.

  • Rentals: a smoothed measure of the typical observed market rate rent acreoss a given region.

  • Inventory: The count of unique listings that were active at any time in a given month.

  • LIST AND SALE PRICES:The median price at which homes across various geographies were listed and sold.

  • SALES COUNT AND PRICE CUTS: is the estimated number of unique properties that sold during the month after accounting for the latency between when sales occur and when they are reported.

Project Proposal

These time series datasets were selected as they are consistent with my interest of working on a project of “Forecasting Home Value in Cincinnati”.

In 2020, as COVID-19 pandemic spreads the world and policies of lock down and social distancing coming out, the predictions on housing market has been very pessimistic. Experts and media made forecasts and implied a shrink in US proprietary prices and housing market as the GDP decreases and economic struggled to recover. However, the housing market had unprecedented increase since late 2020, rising by nearly 20% over last year (2021). In 2022, how will the housing market perform? Is the demand still strong or should we concern about home values decrease or even crash of the market? The abnormal pattern in 2021 may make it difficult to forecast as it is not a normal trend and will increase the expectation of sellers. We have to see if the increase trend is a one-time thing or it will continue in 2022 as well.

Section 2

Data preparation

Import datasets

We loaded five separate datasets. These are all wide dataset that need to be transform to long dataset later:

  • home_value has 908 obs(regions) and 269 variables (monthly; Jan/2000- Dec/2021).

  • Rentals: has 102 obs(regions) and 99 variables (monthly; Jan/2014- Dec/2021).

  • Inventory: has 917 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).

  • LIST AND SALE PRICES:has 95 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).

  • SALES COUNT AND PRICE CUTS: has 95 obs(regions) and 172 variables (monthly; Feb/2008- Dec/2021).

home_value<- read_csv("data/home_values.csv")
inventory<- read_csv("data/inventory.csv")
list_and_sale_price<- read_csv("data/list_and_sale_price.csv")
rentals<- read_csv("data/rentals.csv")
sales_count_and_price_cuts<- read_csv("data/sales_count_and_price_cuts.csv")

Transformed datasets

Data for Cincinnati, OH

  • Separate datasets

We can learn from the above datasets that the city Cincinnati, OH was given a unique region id “394466”. The size of the city ranked 28 among 900 and more cities (regions). Additionally, the region type of the city is Metropolitan statistical areas (MSA).

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_inventory<- inventory%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="inventory")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_list_and_sale_price<- list_and_sale_price%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="list_sale_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_rentals<- rentals%>%
  filter(RegionID==394466)%>%
  pivot_longer(4:99,names_to="date",values_to="rentals")%>%
  mutate(month= as.Date(as.yearmon(date)))%>%
  mutate(new_month=as.Date(month) + months(1) - days(1))

cin_sales_count_and_price_cuts<- sales_count_and_price_cuts%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:172,names_to="date",values_to="sales_count_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))
  • Combined dataset

This is the Cincinnati full combined data of the five separate datasets with different time length and the dimension is 264 observations and 12 columns.

cin_combine_full<- cin_home_value%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_inventory[,7:8], by="new_month")%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")
# good alternative library for descriptive statistics
kable(describe(cin_combine_full), 
      format='markdown', 
      caption="cin_combine_full",
      digits=4)
cin_combine_full
vars n mean sd median trimmed mad min max range skew kurtosis se
RegionID 1 264 394466.000 0.0000 394466.0 394466.000 0.0000 394466 394466 0 NaN NaN 0.0000
SizeRank 2 264 28.000 0.0000 28.0 28.000 0.0000 28 28 0 NaN NaN 0.0000
RegionName* 3 264 1.000 0.0000 1.0 1.000 0.0000 1 1 0 NaN NaN 0.0000
RegionType* 4 264 1.000 0.0000 1.0 1.000 0.0000 1 1 0 NaN NaN 0.0000
StateName* 5 264 1.000 0.0000 1.0 1.000 0.0000 1 1 0 NaN NaN 0.0000
date* 6 264 132.500 76.3544 132.5 132.500 97.8516 1 264 263 0.0000 -1.2136 4.6993
home_value 7 264 156425.159 21975.3198 149947.5 152736.406 15177.3762 129290 240833 111543 1.7211 3.0916 1352.4874
new_month 8 264 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
sales_count_price 9 167 2198.054 753.7124 2100.0 2191.318 907.3512 788 3819 3031 0.0967 -1.0884 58.3240
rentals 10 96 1107.562 119.7290 1085.5 1099.769 135.6579 931 1380 449 0.4691 -0.7964 12.2198
inventory 11 48 7716.104 1612.7197 7629.5 7742.450 2003.7339 4834 10286 5452 -0.1090 -1.1401 232.7760
list_sale_price 12 48 286471.938 37555.2336 283800.0 286422.300 46306.7871 218300 354433 136133 0.0240 -1.1700 5420.6311

This is the Cincinnati combined data of the five separate datasets above with common time length and it has 48 obs and 12 variables.

cin_combine<- cin_inventory%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_home_value[,7:8], by="new_month")
# good alternative library for descriptive statistics
kable(describe(cin_combine), 
      format='markdown', 
      caption="cin_combine",
      digits=4)
cin_combine
vars n mean sd median trimmed mad min max range skew kurtosis se
RegionID 1 48 394466.000 0.0000 394466.0 394466.00 0.0000 394466 394466 0 NaN NaN 0.0000
SizeRank 2 48 28.000 0.0000 28.0 28.00 0.0000 28 28 0 NaN NaN 0.0000
RegionName* 3 48 1.000 0.0000 1.0 1.00 0.0000 1 1 0 NaN NaN 0.0000
RegionType* 4 48 1.000 0.0000 1.0 1.00 0.0000 1 1 0 NaN NaN 0.0000
StateName* 5 48 1.000 0.0000 1.0 1.00 0.0000 1 1 0 NaN NaN 0.0000
date* 6 48 24.500 14.0000 24.5 24.50 17.7912 1 48 47 0.0000 -1.2752 2.0207
inventory 7 48 7716.104 1612.7197 7629.5 7742.45 2003.7339 4834 10286 5452 -0.1090 -1.1401 232.7760
new_month 8 48 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
list_sale_price 9 48 286471.938 37555.2336 283800.0 286422.30 46306.7871 218300 354433 136133 0.0240 -1.1700 5420.6311
rentals 10 48 1207.042 82.0039 1197.0 1202.75 89.6973 1087 1380 293 0.4379 -0.8635 11.8362
sales_count_price 11 48 2828.875 520.0404 2929.0 2847.30 558.1989 1797 3819 2022 -0.3858 -0.8844 75.0614
home_value 12 48 194761.771 21069.2998 187386.5 193074.45 17209.2795 167537 240833 73296 0.7606 -0.6403 3041.0915

This is a subset of combined dataset which omited info about the city.

cin_sub<- cin_combine[, 7:12]
# good alternative library for descriptive statistics
kable(describe(cin_sub), 
      format='markdown', 
      caption="cin_sub",
      digits=4)
cin_sub
vars n mean sd median trimmed mad min max range skew kurtosis se
inventory 1 48 7716.104 1612.7197 7629.5 7742.45 2003.7339 4834 10286 5452 -0.1090 -1.1401 232.7760
new_month 2 48 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
list_sale_price 3 48 286471.938 37555.2336 283800.0 286422.30 46306.7871 218300 354433 136133 0.0240 -1.1700 5420.6311
rentals 4 48 1207.042 82.0039 1197.0 1202.75 89.6973 1087 1380 293 0.4379 -0.8635 11.8362
sales_count_price 5 48 2828.875 520.0404 2929.0 2847.30 558.1989 1797 3819 2022 -0.3858 -0.8844 75.0614
home_value 6 48 194761.771 21069.2998 187386.5 193074.45 17209.2795 167537 240833 73296 0.7606 -0.6403 3041.0915

EDA -Full combined dataset

home_value_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,home_value))+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2000 - December 2021')

inventory_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,inventory))+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
   labs(
    title  = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,list_sale_price))+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
   labs(
    title  = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,rentals))+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title  = 'Rentals in Cincinnati',
    subtitle = 'January 2014 - December 2021')

sales_count_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,sales_count_price))+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
   labs(
    title  = 'Sales count and price cut in Cincinnati',
    subtitle = 'Feburary 2008 - December 2021')

grid.arrange(home_value_full_plot,
          inventory_full_plot,
          list_and_sale_price_full_plot,
          rentals_full_plot,
          sales_count_price_full_plot, 
          ncol=2, nrow =3)

EDA -Time Trend -combined Data (2018-2021)

home_value

home_value_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,home_value))+
  geom_smooth(aes(new_month,home_value),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2018 - December 2021')

home_value_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                home_value, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Home Value in Cincinnati by Year' )

grid.arrange(home_value_plot, home_value_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

inventory

inventory_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,inventory))+
  geom_smooth(aes(new_month,inventory),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
  labs(
    title = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

inventory_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                inventory, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Inventory in Cincinnati by Year' )

grid.arrange(inventory_plot, inventory_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

list_sale_price

list_and_sale_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,list_sale_price))+
  geom_smooth(aes(new_month,list_sale_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
  labs(
    title = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                list_sale_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'List and sale price in Cincinnati by Year' )

grid.arrange(list_and_sale_price_plot, list_and_sale_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

`rentals``

rentals_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,rentals))+
  geom_smooth(aes(new_month,rentals),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title = 'Rentals in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                rentals, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Rentals in Cincinnati by Year' )

grid.arrange(rentals_plot, rentals_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

`sales_count_price``

sales_count_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,sales_count_price))+
  geom_smooth(aes(new_month,sales_count_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
  labs(
    title = 'Sales count and price cut in Cincinnati',
    subtitle = 'January 2018 - December 2021')

sales_count_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                sales_count_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Sales count and price cut in Cincinnati by Year' )

grid.arrange(sales_count_price_plot, sales_count_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

Section 3

Initial data analysis

As implied in the EDA part, home value , and rentalsseem to increase over time and indicated the same for each year (2018-2021); Similarly, for list and sale price, there’s an increasing trend over the years but a variation within each year (this maybe the seasonal); On the contrary, inventory seems to decreases overtime as well as yearly; Lastly, for sales count and price cut the variation tends to slightly decrease over the years.

Now we want to model the linear trend and see if it’s consistent with our observations from the visualization.

# Test whether there is a true linaer trend over time
mod1 = lm(home_value~new_month,data=cin_combine_full)
summary(mod1)

mod2 = lm(inventory~new_month,data=cin_combine_full)
summary(mod2)

mod3 = lm(list_sale_price~new_month,data=cin_combine_full)
summary(mod3)


mod4 = lm(rentals~new_month,data=cin_combine_full)
summary(mod4)

mod5 = lm(sales_count_price~new_month,data=cin_combine_full)
summary(mod5)

The outcome table:

tab_model(mod1, mod2,mod3,mod4,mod5)
  home value inventory list sale price rentals sales count price
Predictors Estimates CI p Estimates CI p Estimates CI p Estimates CI p Estimates CI p
(Intercept) 67293.58 53729.25 – 80857.90 <0.001 62976.81 50644.46 – 75309.16 <0.001 -1130447.61 -1357341.28 – -903553.93 <0.001 -1332.23 -1420.22 – -1244.23 <0.001 -4112.02 -4975.12 – -3248.91 <0.001
new month 5.95 5.05 – 6.84 <0.001 -3.02 -3.70 – -2.35 <0.001 77.53 65.12 – 89.94 <0.001 0.14 0.13 – 0.14 <0.001 0.38 0.33 – 0.44 <0.001
Observations 264 48 48 96 167
R2 / R2 adjusted 0.396 / 0.393 0.639 / 0.631 0.775 / 0.770 0.970 / 0.970 0.560 / 0.557