Introduction

‘Zillow Housing Data’ here includes Zillow’s Home Value Index (ZHVI), Zillow Observed Rent Index (ZORI), List Price and, Sales Count all collected monthly. The data is publicly available and provided by Zillow.

All data points selected here are smoothed and seasonally adjusted. ZHVI reflects the data for the for homes in the 35th to 65th percentile range for ZHVI and 40th to 60th percentile for ZORI. This is done to give us a better representation of the housing market.

The US housing market as seen from Bubble bursting in the late 2000’s showed us that it was a linchpin in the US Economy. Understanding its volatility and being able to forecast home values can be a tricky proposition even with first hand quality data as Zillow themselves learnt.

Data Preperation

To enable any useful analysis the data will have to be joined into a single dataframe.

#zillow home value index
test1<-zhvi %>% 
  pivot_longer(,cols = 6:ncol(zhvi),
               names_to = "Time",
               values_to = "value",
               values_drop_na = TRUE) %>% mutate(Month = ymd(Time)) 

#list price
test2<-list_price %>% 
  pivot_longer(,cols = 6:ncol(list_price),
               names_to = "Time",
               values_to = "list_price",
               values_drop_na = TRUE) %>% mutate(Month = ymd(Time)) %>% 
  select(RegionID,Month,list_price)


#sales_count
test3<-sales_count %>% 
  pivot_longer(,cols = 6:ncol(sales_count),
               names_to = "Time",
               values_to = "sales_count",
               values_drop_na = TRUE) %>% mutate(Month = ymd(Time)) %>% 
  select(RegionID,Month,sales_count)

#zori
test4<-zori %>% 
  pivot_longer(,cols = 4:ncol(zori),
               names_to = "Time",
               values_to = "zori",
               values_drop_na = TRUE) %>% 
  mutate(Month = ceiling_date(ym(Time),'month') -1 )%>% 
  select(RegionID,Month,zori)

#combining all 4
combo_df <-inner_join(test1,test4)
combo_df <-inner_join(combo_df,test2)
combo_df <-right_join(combo_df,test3)
combo_df = combo_df %>% subset(select = -c(Time))

str(combo_df)

## tibble [15,429 x 10] (S3: tbl_df/tbl/data.frame)
##  $ RegionID   : num [1:15429] 102001 102001 102001 102001 102001 ...
##  $ SizeRank   : num [1:15429] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RegionName : chr [1:15429] "United States" "United States" "United States" "United States" ...
##  $ RegionType : chr [1:15429] "Country" "Country" "Country" "Country" ...
##  $ StateName  : chr [1:15429] NA NA NA NA ...
##  $ value      : num [1:15429] 226599 227967 229298 230614 231830 ...
##  $ Month      : Date[1:15429], format: "2018-01-31" "2018-02-28" ...
##  $ zori       : num [1:15429] 1544 1549 1553 1558 1562 ...
##  $ list_price : num [1:15429] 271617 272983 279333 288667 295633 ...
##  $ sales_count: num [1:15429] 294228 279864 386522 409798 475661 ...

Exploratory Data Analysis

Below is the list of Cities that have the highest average ‘Zillow Home Value Index’.

top_10_all <- combo_df %>% 
  group_by(RegionName) %>% 
  summarise(mean = mean(value)) %>% 
  arrange(desc(mean)) %>% 
  top_n(10)
top_10_all

## # A tibble: 10 x 2
##    RegionName                             mean
##    <chr>                                 <dbl>
##  1 San Jose, CA                       1303264.
##  2 San Francisco, CA                  1159456.
##  3 Urban Honolulu, HI                  719912.
##  4 Los Angeles-Long Beach-Anaheim, CA  708701.
##  5 San Diego, CA                       642897.
##  6 Ventura, CA                         618729.
##  7 Seattle, WA                         556200.
##  8 Boston, MA                          512618.
##  9 New York, NY                        494284.
## 10 Denver, CO                          474328.

To get a better understanding of the data we can look at the standard deviation or variance in the Index over the years as proxy for the volatility in the home prices.

top_10_sd_all <- combo_df %>% 
  group_by(RegionName) %>% 
  summarise(Volatility = sd(value),
            Average = mean(value)) %>% 
  arrange(desc(Volatility)) %>% 
  top_n(10)
top_10_sd_all

## # A tibble: 10 x 3
##    RegionName                         Volatility  Average
##    <chr>                                   <dbl>    <dbl>
##  1 San Jose, CA                           92256. 1303264.
##  2 San Francisco, CA                      83979. 1159456.
##  3 San Diego, CA                          78324.  642897.
##  4 Seattle, WA                            65487.  556200.
##  5 Los Angeles-Long Beach-Anaheim, CA     64987.  708701.
##  6 Denver, CO                             47407.  474328.
##  7 Boston, MA                             42583.  512618.
##  8 New York, NY                           32990.  494284.
##  9 Ventura, CA                            15209.  618729.
## 10 Urban Honolulu, HI                      5703.  719912.

We can see that the cities with the greater volatility are also the ones with higer home values on average.

Visualising the trends

While the above list gives us an idea about the housing market on average it may behoove us to get a better understanding of the trend.

#zhvi for top 10 cities
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>% 
  ggplot(aes(x= Month,y =value,group = RegionName, color = RegionName)) +
  geom_line() + 
  ggtitle("Trend of Zillow Home Value Index for Top 10 Cities") + 
  scale_x_date(date_labels ="%Y") +
  theme(legend.position = "right") +
  ylab("ZHVI") + xlab("")

Here we can see that the trend for all the top 10 cities are similar in the sense that the it is relatively stable from ’19-’21 with a rise after that for all cities.

#sales count for top 10 cities
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>% 
  ggplot(aes(x= Month,y =sales_count,group = RegionName, color = RegionName)) +
  geom_line() + ggtitle("Sales count for top 10 Cities") +
  theme(legend.position = "right") +
  ylab("Sales (count)") + xlab("")

It is visible across all Regions that the COVID - 19 pandemic affected the Sales of houses in US. There is a great amount of volatility before 2020 but there is a sharp downturn across all regions.

#list prices
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>% 
  ggplot(aes(x= Month,y =list_price,group = RegionName, color = RegionName)) +
  geom_line() + ggtitle("list prices for top 10 Cities") +
  ylab("List Price") + xlab("")+ 
    theme(legend.position = "right")

Price listing seems to be be a very volatile in nature for certain regions as we can see that there are cities with the downturn post 2020, however this can be a market reaction to the pandemic as the trend is quite similar to the Sales Count noted above.

#ZORI
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>% 
  ggplot(aes(x= Month,y =zori,group = RegionName, color = RegionName)) +
  geom_line() + ggtitle("ZORI for top 10 Cities") +
  ylab("Observed Rent Index") + xlab("") + 
    theme(legend.position = "right")

The Rent Index is inline with the Home Value Index and is relatively stable across all regions. Interestingly New York shows to have the highest ZORI among all the regions and was also the most affected by the pandemic. It can also be seen that the market is trying to correct the decrease with observable increase in Rent Index to reach greater than pandemic levels.

Modelling the Zillow Home Value Index

As the housing market in itself can be very different across regions, it would behoove us to focus on a specific region while forecasting as to avoid incorrect analysis. Here we will focus on the Seattle, WA housing Market.

df <- filter(combo_df,RegionName == "Seattle, WA")
summary(df)

##     RegionID         SizeRank   RegionName         RegionType       
##  Min.   :395078   Min.   :15   Length:48          Length:48         
##  1st Qu.:395078   1st Qu.:15   Class :character   Class :character  
##  Median :395078   Median :15   Mode  :character   Mode  :character  
##  Mean   :395078   Mean   :15                                        
##  3rd Qu.:395078   3rd Qu.:15                                        
##  Max.   :395078   Max.   :15                                        
##   StateName             value            Month                 zori     
##  Length:48          Min.   :484561   Min.   :2018-01-31   Min.   :1789  
##  Class :character   1st Qu.:509258   1st Qu.:2019-01-23   1st Qu.:1854  
##  Mode  :character   Median :523373   Median :2020-01-15   Median :1906  
##                     Mean   :556200   Mean   :2020-01-14   Mean   :1913  
##                     3rd Qu.:589868   3rd Qu.:2021-01-07   3rd Qu.:1929  
##                     Max.   :718944   Max.   :2021-12-31   Max.   :2163  
##    list_price      sales_count  
##  Min.   :508316   Min.   :3594  
##  1st Qu.:596929   1st Qu.:5132  
##  Median :638167   Median :6060  
##  Mean   :630231   Mean   :5909  
##  3rd Qu.:664709   3rd Qu.:6893  
##  Max.   :701650   Max.   :7780

To get a better understanding of the features and their relationship w.r.t to the Home Value we can plot a scatter plot.

df %>% ggplot(aes(x = value,y = list_price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x+I(x^2), level = 0.99) +
  xlab("ZHVI") + ylab("List Price") +
  ggtitle("ZHVI vs List Prices" )

Home Value and List Price appear to have some relation, one that appears to be quadratic in nature and can thus be used in forecasting the Home Value Index for Seattle.

df %>% ggplot(aes(x =  value,y =zori)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x+I(x^2), level = 0.99)+
xlab("ZHVI") + ylab("ZORI") +
  ggtitle("Home Value vs Rent" )

Zillow’s Home Value Index and Rent Index seem to have a strong relationship.

df %>% ggplot(aes(x = value,y = sales_count)) +
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99)+
  xlab("ZHVI") + ylab("Sales (Count)") +
  ggtitle("Home Value Index vs Sales (Count)" )

Here we can see that in Seattle the ZHVI and Sales (Count) have little to no relation as there appears to be random scatter across.

To build a regression model we can use

rent_mod <- lm(value ~ (zori),data = df)
price_mod <- lm(value ~ (list_price),data = df)
lin_mod <- lm(value ~ (list_price) + (zori),data = df)

summary(rent_mod)

## 
## Call:
## lm(formula = value ~ (zori), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50178 -21198   5212  23482  43040 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -735085.25   87759.81  -8.376  8.3e-11 ***
## zori            674.98      45.83  14.729  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27690 on 46 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8213 
## F-statistic: 216.9 on 1 and 46 DF,  p-value: < 2.2e-16

summary(price_mod)

## 
## Call:
## lm(formula = value ~ (list_price), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -63286 -35471   -406  25472 108421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.861e+04  8.002e+04  -1.232    0.224    
## list_price   1.039e+00  1.266e-01   8.207 1.47e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42170 on 46 degrees of freedom
## Multiple R-squared:  0.5942, Adjusted R-squared:  0.5854 
## F-statistic: 67.36 on 1 and 46 DF,  p-value: 1.467e-10

summary(lin_mod)

## 
## Call:
## lm(formula = value ~ (list_price) + (zori), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48207 -23050   8494  23448  41677 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.093e+05  9.449e+04  -7.507 1.82e-09 ***
## list_price   1.104e-01  1.455e-01   0.759    0.452    
## zori         6.251e+02  8.022e+01   7.792 6.93e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27820 on 45 degrees of freedom
## Multiple R-squared:  0.8273, Adjusted R-squared:  0.8196 
## F-statistic: 107.8 on 2 and 45 DF,  p-value: < 2.2e-16

We can observe here that the R squared for the model with the Rent Index alone performs much better than a linear combination with the List Price or just List Price alone. The Residual Standard Error and F-Statistic are indicate that the ZORI only model is the better indicator.

The ZORI only model may benefit by exploring a polynomial relation of degree 2 as observed in the scatter plots.

Zillow Home Value Index Forecasting - Assignment 1

Saksham Bhutani

20/01/2022

Introduction

Data Preperation

Exploratory Data Analysis

Visualising the trends

Modelling the Zillow Home Value Index