‘Zillow Housing Data’ here includes Zillow’s Home Value Index (ZHVI), Zillow Observed Rent Index (ZORI), List Price and, Sales Count all collected monthly. The data is publicly available and provided by Zillow.
All data points selected here are smoothed and seasonally adjusted. ZHVI reflects the data for the for homes in the 35th to 65th percentile range for ZHVI and 40th to 60th percentile for ZORI. This is done to give us a better representation of the housing market.
The US housing market as seen from Bubble bursting in the late 2000’s showed us that it was a linchpin in the US Economy. Understanding its volatility and being able to forecast home values can be a tricky proposition even with first hand quality data as Zillow themselves learnt.
To enable any useful analysis the data will have to be joined into a single dataframe.
#zillow home value index
test1<-zhvi %>%
pivot_longer(,cols = 6:ncol(zhvi),
names_to = "Time",
values_to = "value",
values_drop_na = TRUE) %>% mutate(Month = ymd(Time))
#list price
test2<-list_price %>%
pivot_longer(,cols = 6:ncol(list_price),
names_to = "Time",
values_to = "list_price",
values_drop_na = TRUE) %>% mutate(Month = ymd(Time)) %>%
select(RegionID,Month,list_price)
#sales_count
test3<-sales_count %>%
pivot_longer(,cols = 6:ncol(sales_count),
names_to = "Time",
values_to = "sales_count",
values_drop_na = TRUE) %>% mutate(Month = ymd(Time)) %>%
select(RegionID,Month,sales_count)
#zori
test4<-zori %>%
pivot_longer(,cols = 4:ncol(zori),
names_to = "Time",
values_to = "zori",
values_drop_na = TRUE) %>%
mutate(Month = ceiling_date(ym(Time),'month') -1 )%>%
select(RegionID,Month,zori)
#combining all 4
combo_df <-inner_join(test1,test4)
combo_df <-inner_join(combo_df,test2)
combo_df <-right_join(combo_df,test3)
combo_df = combo_df %>% subset(select = -c(Time))
str(combo_df)
## tibble [15,429 x 10] (S3: tbl_df/tbl/data.frame)
## $ RegionID : num [1:15429] 102001 102001 102001 102001 102001 ...
## $ SizeRank : num [1:15429] 0 0 0 0 0 0 0 0 0 0 ...
## $ RegionName : chr [1:15429] "United States" "United States" "United States" "United States" ...
## $ RegionType : chr [1:15429] "Country" "Country" "Country" "Country" ...
## $ StateName : chr [1:15429] NA NA NA NA ...
## $ value : num [1:15429] 226599 227967 229298 230614 231830 ...
## $ Month : Date[1:15429], format: "2018-01-31" "2018-02-28" ...
## $ zori : num [1:15429] 1544 1549 1553 1558 1562 ...
## $ list_price : num [1:15429] 271617 272983 279333 288667 295633 ...
## $ sales_count: num [1:15429] 294228 279864 386522 409798 475661 ...
Below is the list of Cities that have the highest average ‘Zillow Home Value Index’.
top_10_all <- combo_df %>%
group_by(RegionName) %>%
summarise(mean = mean(value)) %>%
arrange(desc(mean)) %>%
top_n(10)
top_10_all
## # A tibble: 10 x 2
## RegionName mean
## <chr> <dbl>
## 1 San Jose, CA 1303264.
## 2 San Francisco, CA 1159456.
## 3 Urban Honolulu, HI 719912.
## 4 Los Angeles-Long Beach-Anaheim, CA 708701.
## 5 San Diego, CA 642897.
## 6 Ventura, CA 618729.
## 7 Seattle, WA 556200.
## 8 Boston, MA 512618.
## 9 New York, NY 494284.
## 10 Denver, CO 474328.
To get a better understanding of the data we can look at the standard deviation or variance in the Index over the years as proxy for the volatility in the home prices.
top_10_sd_all <- combo_df %>%
group_by(RegionName) %>%
summarise(Volatility = sd(value),
Average = mean(value)) %>%
arrange(desc(Volatility)) %>%
top_n(10)
top_10_sd_all
## # A tibble: 10 x 3
## RegionName Volatility Average
## <chr> <dbl> <dbl>
## 1 San Jose, CA 92256. 1303264.
## 2 San Francisco, CA 83979. 1159456.
## 3 San Diego, CA 78324. 642897.
## 4 Seattle, WA 65487. 556200.
## 5 Los Angeles-Long Beach-Anaheim, CA 64987. 708701.
## 6 Denver, CO 47407. 474328.
## 7 Boston, MA 42583. 512618.
## 8 New York, NY 32990. 494284.
## 9 Ventura, CA 15209. 618729.
## 10 Urban Honolulu, HI 5703. 719912.
We can see that the cities with the greater volatility are also the ones with higer home values on average.
While the above list gives us an idea about the housing market on average it may behoove us to get a better understanding of the trend.
#zhvi for top 10 cities
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>%
ggplot(aes(x= Month,y =value,group = RegionName, color = RegionName)) +
geom_line() +
ggtitle("Trend of Zillow Home Value Index for Top 10 Cities") +
scale_x_date(date_labels ="%Y") +
theme(legend.position = "right") +
ylab("ZHVI") + xlab("")
Here we can see that the trend for all the top 10 cities are similar in the sense that the it is relatively stable from ’19-’21 with a rise after that for all cities.
#sales count for top 10 cities
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>%
ggplot(aes(x= Month,y =sales_count,group = RegionName, color = RegionName)) +
geom_line() + ggtitle("Sales count for top 10 Cities") +
theme(legend.position = "right") +
ylab("Sales (count)") + xlab("")
It is visible across all Regions that the COVID - 19 pandemic affected the Sales of houses in US. There is a great amount of volatility before 2020 but there is a sharp downturn across all regions.
#list prices
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>%
ggplot(aes(x= Month,y =list_price,group = RegionName, color = RegionName)) +
geom_line() + ggtitle("list prices for top 10 Cities") +
ylab("List Price") + xlab("")+
theme(legend.position = "right")
Price listing seems to be be a very volatile in nature for certain regions as we can see that there are cities with the downturn post 2020, however this can be a market reaction to the pandemic as the trend is quite similar to the Sales Count noted above.
#ZORI
inner_join(x = combo_df, y = top_10_all, by = "RegionName") %>%
ggplot(aes(x= Month,y =zori,group = RegionName, color = RegionName)) +
geom_line() + ggtitle("ZORI for top 10 Cities") +
ylab("Observed Rent Index") + xlab("") +
theme(legend.position = "right")
The Rent Index is inline with the Home Value Index and is relatively stable across all regions. Interestingly New York shows to have the highest ZORI among all the regions and was also the most affected by the pandemic. It can also be seen that the market is trying to correct the decrease with observable increase in Rent Index to reach greater than pandemic levels.
As the housing market in itself can be very different across regions, it would behoove us to focus on a specific region while forecasting as to avoid incorrect analysis. Here we will focus on the Seattle, WA housing Market.
df <- filter(combo_df,RegionName == "Seattle, WA")
summary(df)
## RegionID SizeRank RegionName RegionType
## Min. :395078 Min. :15 Length:48 Length:48
## 1st Qu.:395078 1st Qu.:15 Class :character Class :character
## Median :395078 Median :15 Mode :character Mode :character
## Mean :395078 Mean :15
## 3rd Qu.:395078 3rd Qu.:15
## Max. :395078 Max. :15
## StateName value Month zori
## Length:48 Min. :484561 Min. :2018-01-31 Min. :1789
## Class :character 1st Qu.:509258 1st Qu.:2019-01-23 1st Qu.:1854
## Mode :character Median :523373 Median :2020-01-15 Median :1906
## Mean :556200 Mean :2020-01-14 Mean :1913
## 3rd Qu.:589868 3rd Qu.:2021-01-07 3rd Qu.:1929
## Max. :718944 Max. :2021-12-31 Max. :2163
## list_price sales_count
## Min. :508316 Min. :3594
## 1st Qu.:596929 1st Qu.:5132
## Median :638167 Median :6060
## Mean :630231 Mean :5909
## 3rd Qu.:664709 3rd Qu.:6893
## Max. :701650 Max. :7780
To get a better understanding of the features and their relationship w.r.t to the Home Value we can plot a scatter plot.
df %>% ggplot(aes(x = value,y = list_price)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x+I(x^2), level = 0.99) +
xlab("ZHVI") + ylab("List Price") +
ggtitle("ZHVI vs List Prices" )
Home Value and List Price appear to have some relation, one that appears to be quadratic in nature and can thus be used in forecasting the Home Value Index for Seattle.
df %>% ggplot(aes(x = value,y =zori)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x+I(x^2), level = 0.99)+
xlab("ZHVI") + ylab("ZORI") +
ggtitle("Home Value vs Rent" )
Zillow’s Home Value Index and Rent Index seem to have a strong relationship.
df %>% ggplot(aes(x = value,y = sales_count)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, level = 0.99)+
xlab("ZHVI") + ylab("Sales (Count)") +
ggtitle("Home Value Index vs Sales (Count)" )
Here we can see that in Seattle the ZHVI and Sales (Count) have little to no relation as there appears to be random scatter across.
To build a regression model we can use
rent_mod <- lm(value ~ (zori),data = df)
price_mod <- lm(value ~ (list_price),data = df)
lin_mod <- lm(value ~ (list_price) + (zori),data = df)
summary(rent_mod)
##
## Call:
## lm(formula = value ~ (zori), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50178 -21198 5212 23482 43040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -735085.25 87759.81 -8.376 8.3e-11 ***
## zori 674.98 45.83 14.729 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27690 on 46 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.8213
## F-statistic: 216.9 on 1 and 46 DF, p-value: < 2.2e-16
summary(price_mod)
##
## Call:
## lm(formula = value ~ (list_price), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63286 -35471 -406 25472 108421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.861e+04 8.002e+04 -1.232 0.224
## list_price 1.039e+00 1.266e-01 8.207 1.47e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42170 on 46 degrees of freedom
## Multiple R-squared: 0.5942, Adjusted R-squared: 0.5854
## F-statistic: 67.36 on 1 and 46 DF, p-value: 1.467e-10
summary(lin_mod)
##
## Call:
## lm(formula = value ~ (list_price) + (zori), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48207 -23050 8494 23448 41677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.093e+05 9.449e+04 -7.507 1.82e-09 ***
## list_price 1.104e-01 1.455e-01 0.759 0.452
## zori 6.251e+02 8.022e+01 7.792 6.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27820 on 45 degrees of freedom
## Multiple R-squared: 0.8273, Adjusted R-squared: 0.8196
## F-statistic: 107.8 on 2 and 45 DF, p-value: < 2.2e-16
We can observe here that the R squared for the model with the Rent Index alone performs much better than a linear combination with the List Price or just List Price alone. The Residual Standard Error and F-Statistic are indicate that the ZORI only model is the better indicator.
The ZORI only model may benefit by exploring a polynomial relation of degree 2 as observed in the scatter plots.