I have started taking data science certificate courses to obtain practical applications and skills of data science. And in order to learn better, I started to apply the data science skills to a project I have been interested in which is ‘housing market’ For ‘housing market’ project, I will slice and dice data provided in Zillow using R programming which will help me understand trends of the housing market.
I am intersted in single family homes, and I have been waiting for the right moment to purchase a home. Home prices are higher than ever, and while some speculate it will continue to go up, some speculate it will go home. By answering questions below, I want to understand the market trends. And it will help me make a decision if I want to hold off on purchasing a house or not.
Is sale and listing price ratio going up or down? Is the sqft price of sale and listing ratio going up or down?
source("dbook.R")
load.packages(c("dplyr", "ggplot2", "tidyr", "knitr","stringr", "DT", "forecast", "fpp","lattice"))
## [1] loading dplyr
## [1] loading ggplot2
## [1] loading tidyr
## [1] loading knitr
## [1] loading stringr
## [1] loading DT
## [1] loading forecast
## [1] loading fpp
## [1] loading lattice
opts_chunk$set(comment=NA, fig.width=8, fig.height=5)
To facilitate the analysis, I also wrote a function that transforms ‘wide’ (one column per month) format from Zillow website into ‘narrow’ format (one row per month).
read.ztbl <- function(filename){
prep.ztbl(read.csv(filename))
}
prep.ztbl <- function(atbl){
c.nondate = length(which(!str_detect(colnames(atbl), "^X")))
range.datecol = (c.nondate+1):length(colnames(atbl))
tbl = atbl %>% gather("Month", "Value", range.datecol)
tbl$Date = as.Date(sprintf("%s-%s-01", str_extract(tbl$Month, "\\d{4}"),
str_extract(tbl$Month, "\\d{2}$")))
tbl$Month = NULL
tbl
}
Below shows the trend of seasonally adjusted percentages of listings with price cut for single family homes from 1/1/2010 to present for Seattle and Kirkland cities. Y-axis shows the price cut points for both cities, so the higher the value is, the higher the price cut.
pricecut = read.ztbl("http://files.zillowstatic.com/research/public/City/City_Listings_PriceCut_SeasAdj_SingleFamilyResidence.csv")
citylist = c("Seattle", "Kirkland")
pricecutcity = pricecut %>% subset(RegionName %in% citylist & State == "WA" & Date >= as.Date("2010-01-01"))
ggplot(pricecutcity, aes(x=Date, y=Value, color=as.factor(RegionName))) +
geom_line(aes(Date, Value), data=pricecutcity, size=0.5) + labs(x="Year", y="Percentages of listings with price cut-Seasonally Adjusted") + ggtitle("Percentages of listings with price cut-seasonally adjusted for single family homes")
We can see that the overall price cut has been decreasing from 2010-2013. After 2013, the price cut seems to increase until 2015 when it starts decreasing again.
Below shows the comparison between the median listing price and sale price per Sq Ft of all homes in Seattle and kirkland cities from 1/1/2010 to present. But unfortunately, the data Zillow provides for sale price is missing last few months of 2016 so it is hard to seehow listing price compares to sale price from end of 2016. Based on the chart, the sale price is exceeding the listing price most of 2012-2016. But the sale price closely trails behind the listing price from 2012 for Kirkland.
listprice = read.ztbl("http://files.zillowstatic.com/research/public/City/City_MedianListingPricePerSqft_AllHomes.csv")
saleprice = read.ztbl("http://files.zillowstatic.com/research/public/City/City_MedianSoldPricePerSqft_AllHomes.csv")
citylist = c("Seattle", "Kirkland")
for(city in citylist)
{
listpricecity = listprice %>% subset(RegionName == city & State == "WA" & Date >= as.Date("2010-01-01"))
salepricecity = saleprice %>% subset(RegionName == city & State == "WA" & Date >= as.Date("2010-01-01"))
combined <-full_join(listpricecity, salepricecity, by="Date")
plot(combined$Date, combined$Value.x, col="Red", type="l", lty=1, main= paste(city, "listing price and sale price per Sq Ft"), xlab="Year", ylab="Listing and sale price")
lines(combined$Date, combined$Value.y, col="Blue", type="l", lty=1)
legend("topleft", c("listing price", "sale price"), col = c("Red", "Blue"), cex=.75, lty=1)
}
Sale for loss is percentages of homes selling for a price lower than the previous sale and sale for gain is percentage of homes selling for a price higher than a previous sale. The chart shows that the gap between sale for loss and sale for gain for Seattle is widening from 2012 to present. Data for city of Kirkland was not available in Zillow, so the data is only for Seattle. Percentage of homes selling for gain is trending upward while the homes selling for loss is trending downward.
sellloss <- read.ztbl("http://files.zillowstatic.com/research/public/City/City_PctOfHomesSellingForLoss_AllHomes.csv")
sellgain <-read.ztbl("http://files.zillowstatic.com/research/public/City/City_PctOfHomesSellingForGain_AllHomes.csv")
citylist = c("Seattle")
for(city in citylist)
{
lossvalue = sellloss %>% subset(RegionName == city & State == "WA" & Date >= as.Date("2010-01-01"))
gainvalue = sellgain %>% subset(RegionName == city & State == "WA" & Date >= as.Date("2010-01-01"))
lossgain <-full_join(lossvalue, gainvalue, by="Date")
plot(lossgain$Date, lossgain$Value.x, col="Red", type="l", lty=1, main= paste(city, "Sell for Loss and Sell for Gain percentage"), xlab="Year", ylab="Sell Loss and Gain percentage",ylim=c(0,100))
lines(lossgain$Date, lossgain$Value.y, col="Blue", type="l", lty=1)
legend("topleft", c("% of homes selling for loss", "% of homes selling for gain"), col = c("Red", "Blue"), cex=.75, lty=1)
}
Age of inventory is calculated as the median number of days all active listings as of that Wednesday have been current. These medians are then aggregated into the number reported by taking the median across weekly values.
inventoryage<-read.ztbl("http://files.zillowstatic.com/research/public/Metro/AgeOfInventory_Metro_Public.csv")
city = "Seattle"
inventorycity = inventoryage %>% subset(str_detect(RegionName, city) & StateFullName == "Washington" & Date >= as.Date("2010-01-01"))
ggplot(inventorycity, aes(x=Date, y=Value, color=as.factor(RegionName))) +
geom_line(size=0.5) + labs(x="Year", y="Age of inventory") + ggtitle("Age of Inventory")
What is zhvi(Zillow home value index)? The Zillow Home Value Index is the median Zestimate valuation for a given geographic area on a given day. A Zestimate is Zillow’s estimate of the current market value for a home. It is the middle point, so exactly half the Zestimates for a region are below this number and half the Zestimates are above it.
The chart below shows that zhvi for both Seattle and Kirkland have been steadily going up since 2012.
zhviindex = read.ztbl("http://files.zillowstatic.com/research/public/City/City_Zhvi_SingleFamilyResidence.csv")
citylist = c("Seattle", "Kirkland")
indexvalue = zhviindex %>% subset(RegionName %in% citylist & State == "WA" & Date >= as.Date("2010-01-01"))
ggplot(indexvalue, aes(x=Date, y=Value, color=as.factor(RegionName))) +
geom_line(aes(Date, Value), data=indexvalue, size=0.5) + labs(x="Year", y="Zhvi index value") + ggtitle("Trend of Zhvi index value over time")