Buying a property involves a large financial commitment. To invest in the right property, it is important to understand the factors that drive its price and, more importantly, what its price is likely to be in the future. Many agencies and individual agents are in the business of buying and selling property. One of the biggest players in the real estate industry is Zillow, which deals in buying, selling, and renting homes.
Zillow Group, Inc., or simply Zillow, is an American online real estate marketplace company founded in 2006 by Rich Barton and Lloyd Frink. Its services include advertising for agents and real estate professionals; a home-flipping business, which the company announced it would shut down, selling off the on-hand inventory after incurring heavy losses; and a marketplace to advertise homes for sale and rent.
Zillow also lists properties that are not on the market and provides estimates of their value, commonly referred to as the Zestimate.
The ZHVI will now be calculated using a new set of assumptions:
The assumptions above are used to estimate property prices, which are then provided to users of the Zillow website and app. It is also important for investors and buyers to be aware of past trends as well as future projections to get a better feel for property prices. This project addresses both questions: the data currently on hand gives us the past trends, and our further analysis will provide the future projections.
Currently we have ZHVI data at the ZIP-code level for four-bedroom homes across the United States from January 2000 to November 2021. Using this data, I will try to forecast prices at least two years into the future. For this project, I will use the four-bedroom time series for the region around Tampa Bay.
To restrict the analysis to a specific area, I will choose a ZIP code from the area around Tampa, FL. As part of the initial steps, I will focus on data cleaning and data manipulation.
The data set has been sourced from the Zillow ZHVI research data page: https://www.zillow.com/research/data/.
library(fpp3)
library(tidyverse)
library(knitr)
library(lubridate)
library(patchwork)
library(maps)
library(forecast)
library(tseries)
library(sarima)
library(kableExtra)
library(reshape2)
library(dplyr)
library(prophet)
Table 1: Package Summaries
Package | Version | Summary |
---|---|---|
tidyverse | 1.3.1 | A compilation of R tools for data manipulation and visualization |
knitr | 1.36 | Dynamic report generation in R, esp. for displaying tables |
lubridate | 1.8.0 | A package that makes handling dates easier |
patchwork | 1.1.1 | A package for combining multiple ggplot2 plots into a single figure |
maps | 3.4.0 | Used to prepare and display geographical maps |
forecast | 8.15 | Forecasting functions for time series modeling |
tseries | 0.10-49 | Time series analysis |
sarima | 0.8.5 | Generation of ARIMA models for time series modeling |
kableExtra | 1.3.4 | Construct complex tables with ‘kable’ and pipe syntax |
reshape2 | 1.4.4 | Flexibly reshape data |
In this section, our data preparation process is described. This includes importing the data, handling missing values, and refactoring several of the variables. Each variable will also be briefly described, including the data-gathering steps.
Table 2: Variable Summary
Column Name | Description |
---|---|
RegionID | A unique ID for each region |
SizeRank | A ranking based on the size of the region |
RegionName | The ZIP code of the region |
RegionType | The type of region; always "Zip" in this data set |
StateName | The state abbreviation for the region |
City | The city associated with the housing data |
Metro | The metro area surrounding the region |
CountyName | The county of the region |
Month columns | One column per month since January 2000, containing the region's price for that month |
First, we need to import the data.
# Read in Zip_zhvi_4bdrm.csv (ZHVI for 4-bedroom homes at the ZIP-code level)
Four_Bedroom<-read.csv("Zip_zhvi_4bdrm.csv",header = TRUE, stringsAsFactors = FALSE)
# Check the structure of the data
str(Four_Bedroom, list.len=12)
## 'data.frame': 26619 obs. of 273 variables:
## $ RegionID : int 61639 84654 61637 91982 84616 91940 61616 91733 93144 84640 ...
## $ SizeRank : int 0 1 2 3 4 5 6 7 8 9 ...
## $ RegionName : int 10025 60657 10023 77494 60614 77449 10002 77084 79936 60640 ...
## $ RegionType : chr "Zip" "Zip" "Zip" "Zip" ...
## $ StateName : chr "NY" "IL" "NY" "TX" ...
## $ State : chr "NY" "IL" "NY" "TX" ...
## $ City : chr "New York" "Chicago" "New York" "Katy" ...
## $ Metro : chr "New York-Newark-Jersey City" "Chicago-Naperville-Elgin" "New York-Newark-Jersey City" "Houston-The Woodlands-Sugar Land" ...
## $ CountyName : chr "New York County" "Cook County" "New York County" "Harris County" ...
## $ X2000.01.31: num 1616896 520117 NA 220137 630771 ...
## $ X2000.02.29: num 1561368 521933 1564476 220152 633028 ...
## $ X2000.03.31: num 1531088 525219 1564714 221038 635739 ...
## [list output truncated]
It looks like the ZHVI dataset was imported correctly. However, there are clear issues that require correction during data preparation. For example, the monthly valuation dates appear as individual variables in wide format; these need to be tidied into long format, and the character "X" prepended to each date column name needs to be removed.
Since every date column name in the imported data is prefixed with an X, we first strip that X to bring the dates back to their standard format.
names(Four_Bedroom) <- sub("^X", "", names(Four_Bedroom))
head(Four_Bedroom[,c(1:11)],10)
## RegionID SizeRank RegionName RegionType StateName State City
## 1 61639 0 10025 Zip NY NY New York
## 2 84654 1 60657 Zip IL IL Chicago
## 3 61637 2 10023 Zip NY NY New York
## 4 91982 3 77494 Zip TX TX Katy
## 5 84616 4 60614 Zip IL IL Chicago
## 6 91940 5 77449 Zip TX TX Katy
## 7 61616 6 10002 Zip NY NY New York
## 8 91733 7 77084 Zip TX TX Houston
## 9 93144 8 79936 Zip TX TX El Paso
## 10 84640 9 60640 Zip IL IL Chicago
## Metro CountyName 2000.01.31 2000.02.29
## 1 New York-Newark-Jersey City New York County 1616896 1561368
## 2 Chicago-Naperville-Elgin Cook County 520117 521933
## 3 New York-Newark-Jersey City New York County NA 1564476
## 4 Houston-The Woodlands-Sugar Land Harris County 220137 220152
## 5 Chicago-Naperville-Elgin Cook County 630771 633028
## 6 Houston-The Woodlands-Sugar Land Harris County 123383 123367
## 7 New York-Newark-Jersey City New York County NA NA
## 8 Houston-The Woodlands-Sugar Land Harris County 128259 128127
## 9 El Paso El Paso County 115560 115606
## 10 Chicago-Naperville-Elgin Cook County 427818 431305
The data is in wide format; we first convert it into long format so that further analysis of the prices can be performed conveniently.
Four_Bedroom_long<-Four_Bedroom %>%
pivot_longer(-c(RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName),
names_to = "Monthly",
values_to = "Price"
)
str(Four_Bedroom_long)
## tibble [7,027,416 × 11] (S3: tbl_df/tbl/data.frame)
## $ RegionID : int [1:7027416] 61639 61639 61639 61639 61639 61639 61639 61639 61639 61639 ...
## $ SizeRank : int [1:7027416] 0 0 0 0 0 0 0 0 0 0 ...
## $ RegionName: int [1:7027416] 10025 10025 10025 10025 10025 10025 10025 10025 10025 10025 ...
## $ RegionType: chr [1:7027416] "Zip" "Zip" "Zip" "Zip" ...
## $ StateName : chr [1:7027416] "NY" "NY" "NY" "NY" ...
## $ State : chr [1:7027416] "NY" "NY" "NY" "NY" ...
## $ City : chr [1:7027416] "New York" "New York" "New York" "New York" ...
## $ Metro : chr [1:7027416] "New York-Newark-Jersey City" "New York-Newark-Jersey City" "New York-Newark-Jersey City" "New York-Newark-Jersey City" ...
## $ CountyName: chr [1:7027416] "New York County" "New York County" "New York County" "New York County" ...
## $ Monthly : chr [1:7027416] "2000.01.31" "2000.02.29" "2000.03.31" "2000.04.30" ...
## $ Price : num [1:7027416] 1616896 1561368 1531088 1508281 1533773 ...
Four_Bedroom_Clean1 <- Four_Bedroom_long %>%
mutate(Monthly_parsed = as.Date(Monthly,"%Y.%m.%d"))
# Ensure the Monthly column is stored as character
Four_Bedroom_long[["Monthly"]]<- as.character(Four_Bedroom_long$Monthly)
summary(Four_Bedroom_long)
## RegionID SizeRank RegionName RegionType
## Min. : 58196 Min. : 0 Min. : 1001 Length:7027416
## 1st Qu.: 68437 1st Qu.: 6734 1st Qu.:25039 Class :character
## Median : 78622 Median :13555 Median :47382 Mode :character
## Mean : 80140 Mean :14184 Mean :47959
## 3rd Qu.: 88570 3rd Qu.:21109 3rd Qu.:70085
## Max. :753844 Max. :34430 Max. :99901
##
## StateName State City Metro
## Length:7027416 Length:7027416 Length:7027416 Length:7027416
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## CountyName Monthly Price
## Length:7027416 Length:7027416 Min. : 9043
## Class :character Class :character 1st Qu.: 136178
## Mode :character Mode :character Median : 202624
## Mean : 279140
## 3rd Qu.: 317109
## Max. :10006952
## NA's :1580283
The summary above shows an issue with the Price column: it contains 1,580,283 NA values.
Before taking any action on these NAs, we first filter to the ZIP code we want to analyze and then continue the data cleaning. For this analysis we filter to the ZIP code covering Brandon.
For time series modelling, I will work with a specific region of the US to understand how housing prices there have varied over time. This will also help us analyze the factors contributing to price changes in that particular region.
For this case study, I have chosen the city of Brandon near Tampa, mainly because I have visited it many times over the years and have observed drastic changes around the region in terms of development and population growth.
So, let’s begin with our analysis.
Four_Bedroom_Tampa <- Four_Bedroom_Clean1 %>%
  filter(RegionName %in% c(33511))
str(Four_Bedroom_Tampa)
## tibble [264 × 12] (S3: tbl_df/tbl/data.frame)
## $ RegionID : int [1:264] 72653 72653 72653 72653 72653 72653 72653 72653 72653 72653 ...
## $ SizeRank : int [1:264] 476 476 476 476 476 476 476 476 476 476 ...
## $ RegionName : int [1:264] 33511 33511 33511 33511 33511 33511 33511 33511 33511 33511 ...
## $ RegionType : chr [1:264] "Zip" "Zip" "Zip" "Zip" ...
## $ StateName : chr [1:264] "FL" "FL" "FL" "FL" ...
## $ State : chr [1:264] "FL" "FL" "FL" "FL" ...
## $ City : chr [1:264] "Brandon" "Brandon" "Brandon" "Brandon" ...
## $ Metro : chr [1:264] "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" ...
## $ CountyName : chr [1:264] "Hillsborough County" "Hillsborough County" "Hillsborough County" "Hillsborough County" ...
## $ Monthly : chr [1:264] "2000.01.31" "2000.02.29" "2000.03.31" "2000.04.30" ...
## $ Price : num [1:264] 139153 139852 139941 140537 139994 ...
## $ Monthly_parsed: Date[1:264], format: "2000-01-31" "2000-02-29" ...
colSums(is.na(Four_Bedroom_Tampa)) # To find the number of NA's in the data frame.
## RegionID SizeRank RegionName RegionType StateName
## 0 0 0 0 0
## State City Metro CountyName Monthly
## 0 0 0 0 0
## Price Monthly_parsed
## 0 0
# To check the authenticity of the data.
table(Four_Bedroom_Tampa$RegionID)
##
## 72653
## 264
table(Four_Bedroom_Tampa$SizeRank)
##
## 476
## 264
table(Four_Bedroom_Tampa$RegionName)
##
## 33511
## 264
table(Four_Bedroom_Tampa$StateName)
##
## FL
## 264
table(Four_Bedroom_Tampa$State)
##
## FL
## 264
table(Four_Bedroom_Tampa$City)
##
## Brandon
## 264
table(Four_Bedroom_Tampa$CountyName)
##
## Hillsborough County
## 264
zill_ts_df <- Four_Bedroom_Tampa[,c('Monthly_parsed','Price')]
sum(is.na(zill_ts_df))
## [1] 0
ggplot(data = Four_Bedroom_Tampa ,aes(x = Monthly_parsed, y=Price/1000)) +
geom_line(color = "red") +
xlab("Year") +
ylab ("Price in K")+
ggtitle("House Prices over time in Brandon Region of Tampa Bay") +
scale_x_date(date_labels = "%Y", breaks = "2 year") +
theme(axis.text.x = element_text(angle = 90,hjust = 1))
boxplot(Four_Bedroom_Tampa$Price)
The boxplot shows a few high-price points that look like outliers, but none extreme enough to warrant removal from the data set.
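As a quick, hedged sanity check of that read, the 1.5×IQR rule can be applied to the price column (a minimal sketch using the Four_Bedroom_Tampa data frame built above; flagged points are only inspected, not removed):
# Count points outside the 1.5*IQR fences; these are inspected, not dropped.
price_q   <- quantile(Four_Bedroom_Tampa$Price, c(0.25, 0.75), na.rm = TRUE)
price_iqr <- IQR(Four_Bedroom_Tampa$Price, na.rm = TRUE)
outlier_flag <- Four_Bedroom_Tampa$Price < price_q[1] - 1.5 * price_iqr |
  Four_Bedroom_Tampa$Price > price_q[2] + 1.5 * price_iqr
sum(outlier_flag)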
model<-lm(RegionName ~ Price, data = Four_Bedroom_Tampa)
summary(model)
##
## Call:
## lm(formula = RegionName ~ Price, data = Four_Bedroom_Tampa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.284e-09 -1.150e-12 6.320e-12 1.103e-11 1.608e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.351e+04 2.134e-11 1.571e+15 <2e-16 ***
## Price 1.452e-16 9.584e-17 1.515e+00 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.984e-11 on 262 degrees of freedom
## Multiple R-squared: 0.5017, Adjusted R-squared: 0.4998
## F-statistic: 263.7 on 1 and 262 DF, p-value: < 2.2e-16
Next we calculate and plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) for Brandon. These help determine the orders needed to fit an ARIMA model to the time series.
Assess the behavior/data-generating process of your time-series using the techniques discussed in class. From your understanding of the process, would you expect the time-series to be mean stationary? Variance stationary? Why or why not?
Analyze the autocorrelation of the time series, starting with its lag structure.
ts1<-ts(Four_Bedroom_Tampa$Price, Four_Bedroom_Tampa$Monthly_parsed)
par(mfrow=c(1,2))
acf(ts1, lag.max = 50)
pacf(ts1,lag.max = 10)
**Answer:** The ACF decays slowly (a dampening effect) and the PACF cuts off after lag 1, suggesting an AR(1)-type structure. Based on this, the data-generating process appears non-stationary: I would expect the series to be mean non-stationary but variance stationary.
ts = Four_Bedroom_Tampa[,c("Monthly_parsed","Price")] %>%
mutate(lag_price = lag(Price)) %>%
drop_na()
cor = cor(ts$Price,ts$lag_price)
ts_plot = ggplot(ts)+
geom_line(aes(Monthly_parsed,Price))+
geom_hline(aes(yintercept=0))+
theme_bw()+
ggtitle(paste0("Correlation = ",cor))
#scatter_plot = ggplot(ts)+
# geom_point(aes(lag_price,Price))+
# geom_smooth(aes(lag_price,Price),method='lm',color='red',se=F)+
# theme_bw()
dens_plot = ggplot(ts)+
geom_density(aes(Price),alpha=0.4,fill='blue')+
geom_density(aes(lag_price),alpha=0.4,fill='red')+
theme_bw()
These plots indicate that the time series is not mean reverting and appears roughly additive in nature.
Four_Bedroom_Tampa<- Four_Bedroom_Tampa %>%
mutate(lag_price = lag(Four_Bedroom_Tampa$Price),
lag_price2 = lag(Four_Bedroom_Tampa$Price,2),
) %>%
drop_na()
cor(Four_Bedroom_Tampa$Price,Four_Bedroom_Tampa$lag_price)
## [1] 0.9987113
cor(Four_Bedroom_Tampa$Price,Four_Bedroom_Tampa$lag_price2)
## [1] 0.994927
The correlation values above show that the current price is more closely related to the price one period prior than to the price two periods prior.
Let's analyze the last 10 years of data rather than the full 20 years to get a better picture of the recent trend.
Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd('2021-12-31')) %>%
ggplot(aes(x = Monthly_parsed, y=Price/1000)) +
geom_line(color = "red") +
xlab("Year") +
ylab ("Price in K")+
ggtitle("House Prices over time in Brandon Region of Tampa Bay", subtitle = "(2012-2021)") +
scale_x_date(date_labels = "%Y", breaks = "1 year") +
theme(axis.text.x = element_text(angle = 90,hjust = 1))
House prices do not revert to a mean here; the series looks mean non-stationary and additive in nature. The variance appears stationary from 2012 through 2020, after which it increases.
Four_Bedroom_Tampa %>%
arrange(Monthly_parsed) %>%
filter(Monthly_parsed>ymd("2012-01-31")) %>%
mutate(return = (Price - lag(Price))/Price) %>%
drop_na() %>%
ggplot()+
geom_line(aes(Monthly_parsed,return))+
theme_bw()+
ggtitle("Monthly Change in House Price Over Time", subtitle = "(2012-2021)")+
ylab("Montlhly House Price")+
xlab("Date")
Looking at the plot above, the house price level is non-stationary over time, but the monthly growth in house prices is roughly stationary if we set aside the pandemic period, during which the monthly change also appears non-stationary.
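To complement that visual read, a quick unit-root check on the growth series can be run (a hedged sketch that rebuilds the monthly return column the same way as above; the formal ADF/KPSS testing on levels and differences comes later):
# Recompute monthly returns since 2012 and run an Augmented Dickey-Fuller test.
returns_check <- Four_Bedroom_Tampa %>%
  arrange(Monthly_parsed) %>%
  filter(Monthly_parsed > ymd("2012-01-31")) %>%
  mutate(return = (Price - lag(Price)) / lag(Price)) %>%
  drop_na()
tseries::adf.test(returns_check$return)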
model1<- arima(Four_Bedroom_Tampa$Price, order = c(1,0,0), method = "ML")
model2<- arima(Four_Bedroom_Tampa$Price, order = c(2,0,0), method = "ML")
summary(model1)
##
## Call:
## arima(x = Four_Bedroom_Tampa$Price, order = c(1, 0, 0), method = "ML")
##
## Coefficients:
## ar1 intercept
## 0.9997 222568.3
## s.e. 0.0004 136423.5
##
## sigma^2 estimated as 8217066: log likelihood = -2461.28, aic = 4928.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 926.3237 2866.542 2048.481 0.3726319 0.8847368 0.9996889 0.9369554
summary(model2)
##
## Call:
## arima(x = Four_Bedroom_Tampa$Price, order = c(2, 0, 0), method = "ML")
##
## Coefficients:
## ar1 ar2 intercept
## 1.9634 -0.9641 299628.9
## s.e. 0.0181 0.0183 131921.0
##
## sigma^2 estimated as 677107: log likelihood = -2132.35, aic = 4272.69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.7191751 825.8193 586.1337 -0.003108578 0.2631556 0.2860418
## ACF1
## Training set 0.105912
Model 2 has a lower AIC (4272.69 vs. 4928.55 for Model 1), so Model 2 is preferred.
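For a compact side-by-side comparison, AIC() accepts both fitted objects at once (a small convenience sketch using the models fitted above):
# Tabulate the AIC of both candidate models in one call.
AIC(model1, model2)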
print(BIC(model1,model2))
## df BIC
## model1 3 4939.258
## model2 4 4286.964
BIC also suggests that Model 2 is better than Model 1.
aa = Four_Bedroom_Tampa[,c("Monthly_parsed","Price")] %>%
arrange(Monthly_parsed) %>%
filter(Monthly_parsed>ymd("2012-01-31")) %>%
mutate(return = (Price - lag(Price))/lag(Price)) %>%
drop_na()
forecast::auto.arima(aa$return)
## Series: aa$return
## ARIMA(0,1,0)
##
## sigma^2 estimated as 1.087e-05: log likelihood=502.63
## AIC=-1003.26 AICc=-1003.23 BIC=-1000.5
zill_data_roll <- Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
mutate(
price_mean = zoo::rollmean(
Price,
k = 6,
fill = NA),
price_sd = zoo::rollapply(
Price,
FUN = sd,
width = 6,
fill = NA)
)
par(mfrow=c(2,1))
zill_data_rollmean <- zill_data_roll %>%
ggplot() +
geom_line(aes(Monthly_parsed, price_mean)) +
theme_bw() +
ggtitle("House Price Mean over Time (6 month rolling window)") +
ylab("Housing Price") +
xlab("Time")
zill_data_rollsd <- zill_data_roll %>%
ggplot() +
geom_line(aes(Monthly_parsed, price_sd)) +
theme_bw() +
ggtitle("Differenced Housing Price SD over Time (6 month rolling window)") +
ylab("Housing Price") +
xlab("Time")
**If the data appear to be variance non-stationary, transform your time-series using a natural log or Box-Cox transformation.**

## 11. Lag Transformation

Now let's try some transformations on the data and see whether they give better results.
zill_data_log = Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
mutate(price_log = log1p(Price))
zill_data_boxcox = Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
mutate(price_boxcox = forecast::BoxCox(Price,lambda='auto'))
par(mfrow=c(2,1))
zill_data_log<-zill_data_log %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_log))+
theme_bw()+
ggtitle("House Prices over Time (Log)")+
ylab("Housing Price (Log)")+
xlab("Time")
zill_data_boxcox<-zill_data_boxcox %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_boxcox))+
theme_bw()+
ggtitle("Housing Price over Time (Box-Cox Transformation)")+
ylab("Housing Price (Box-Cox Tranformation)")+
xlab("Time")
The transformations above are not conclusive, since the original series already appears variance stationary.
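One hedged way to back up that claim is to look at the estimated Box-Cox lambda for the filtered price series; a value near 1 would indicate that little variance-stabilizing transformation is needed (sketch assumes the same 2012-2021 window used above):
# Estimate the Box-Cox lambda; a value close to 1 suggests the variance is
# already roughly stable and a log/Box-Cox transform adds little.
lambda_hat <- forecast::BoxCox.lambda(
  Four_Bedroom_Tampa %>%
    filter(Monthly_parsed >= ymd("2012-01-31"),
           Monthly_parsed <= ymd("2021-12-31")) %>%
    pull(Price)
)
lambda_hat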
zill_data_diff = Four_Bedroom_Tampa %>%
mutate(price_diff = Price - lag(Price))
zill_data_diff %>%
select(Monthly_parsed,Price,price_diff) %>%
head()
## # A tibble: 6 × 3
## Monthly_parsed Price price_diff
## <date> <dbl> <dbl>
## 1 2000-03-31 139941 NA
## 2 2000-04-30 140537 596
## 3 2000-05-31 139994 -543
## 4 2000-06-30 139775 -219
## 5 2000-07-31 139821 46
## 6 2000-08-31 140500 679
zill_data_diff %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_diff))+
theme_bw()+
ggtitle("Housing Price (First Difference)")+
ylab("Housing Price (Difference))")+
xlab("Time")
## Warning: Removed 1 row(s) containing missing values (geom_path).
zill_data_diff = Four_Bedroom_Tampa %>%
mutate(price_log = log1p(Price)) %>%
mutate(price_diff = price_log - lag(price_log))
zill_data_diff %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_diff))+
theme_bw()+
ggtitle("Housing Price (Log; First Difference)")+
ylab("Log Housing Price (Difference))")+
xlab("Time")
## Warning: Removed 1 row(s) containing missing values (geom_path).
# First Price Difference
zill_data_diff = Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd("2012-01-31") & Monthly_parsed <= ymd("2021-12-31")) %>%
mutate(price_diff = Price - lag(Price)) %>%
drop_na()
zill_data_diff %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_diff))+
theme_bw()+
ggtitle("Housing Price (First Difference)")+
ylab("Housing Price (Difference))")+
xlab("Time")
# First Price Difference in Log
zill_data_log_diff = Four_Bedroom_Tampa %>%
filter(Monthly_parsed>=ymd("2012-01-31") & Monthly_parsed <= ymd("2021-12-31")) %>%
mutate(price_log = log1p(Price)) %>%
mutate(price_diff = price_log - lag(price_log)) %>%
drop_na()
zill_data_log_diff %>%
ggplot()+
geom_line(aes(Monthly_parsed,price_diff))+
theme_bw()+
ggtitle("Housing Price (Log; First Difference)")+
ylab("Log Housing Price (Difference))")+
xlab("Time")
Conduct and interpret an Augmented Dickey Fuller and KPSS test for stationarity. Do the results suggest that the process is mean stationary? If the process is non-stationary, calculate the first difference of the data.
library(tseries) # Use tseries package for adf.test function
adf.test(zill_data_diff$Price) # Raw Price Value
##
## Augmented Dickey-Fuller Test
##
## data: zill_data_diff$Price
## Dickey-Fuller = -2.8206, Lag order = 4, p-value = 0.2358
## alternative hypothesis: stationary
adf.test(zill_data_diff$price_diff) # First Difference
##
## Augmented Dickey-Fuller Test
##
## data: zill_data_diff$price_diff
## Dickey-Fuller = -1.3887, Lag order = 4, p-value = 0.8304
## alternative hypothesis: stationary
adf.test(zill_data_log_diff$price_log) # Log Price Value
##
## Augmented Dickey-Fuller Test
##
## data: zill_data_log_diff$price_log
## Dickey-Fuller = -3.3303, Lag order = 4, p-value = 0.06955
## alternative hypothesis: stationary
adf.test(zill_data_log_diff$price_diff) # Differenced Log Price Value
##
## Augmented Dickey-Fuller Test
##
## data: zill_data_log_diff$price_diff
## Dickey-Fuller = -1.9341, Lag order = 4, p-value = 0.6039
## alternative hypothesis: stationary
kpss.test(zill_data_diff$Price) # Raw Price Value
## Warning in kpss.test(zill_data_diff$Price): p-value smaller than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: zill_data_diff$Price
## KPSS Level = 2.299, Truncation lag parameter = 4, p-value = 0.01
kpss.test(zill_data_diff$price_diff) # First Difference
##
## KPSS Test for Level Stationarity
##
## data: zill_data_diff$price_diff
## KPSS Level = 0.70956, Truncation lag parameter = 4, p-value = 0.01268
kpss.test(zill_data_log_diff$price_log) # Log Price Value
## Warning in kpss.test(zill_data_log_diff$price_log): p-value smaller than printed
## p-value
##
## KPSS Test for Level Stationarity
##
## data: zill_data_log_diff$price_log
## KPSS Level = 2.3643, Truncation lag parameter = 4, p-value = 0.01
kpss.test(zill_data_log_diff$price_diff) # Differenced Log Price Value
##
## KPSS Test for Level Stationarity
##
## data: zill_data_log_diff$price_diff
## KPSS Level = 0.42909, Truncation lag parameter = 4, p-value = 0.06462
Visually, the plots generated by the various methods do not show any clear seasonal pattern in house prices. The major factor affecting prices is sudden, unexpected financial stress in the market. For example, the global financial crisis of 2007-2008 led to falling house prices from 2009 to 2012. After 2012 prices were fairly steady until 2020, when the pandemic triggered a surge in housing prices, partly because white-collar jobs were only minimally affected. Remote work may also have channelled investment into remote but preferred locations around the US, which could have pushed prices up further.
ts1 <- ts(zill_data_diff$Price)
ts2 <- ts(zill_data_diff$price_diff)
ts3 <- ts(zill_data_log_diff$price_log)
ts4 <- ts(zill_data_log_diff$price_diff)
par(mfrow=c(2,2))
plot.ts(ts1)
plot.ts(ts2)
plot.ts(ts3)
plot.ts(ts4)
acf(ts1)
pacf(ts1)
acf(ts2)
pacf(ts2)
acf(ts3)
pacf(ts3)
acf(ts4)
pacf(ts4)
#ts_arima<- arima(Four_Bedroom_Tampa$Price)
# plot.ts(ts_arima)
We can observe a dampening effect in the ACF plots for almost every transformation, and the PACF plots suggest a first-order lag.
BIC values for the candidate ARIMA orders:
BIC(
arima(zill_data_log_diff$price_log, order=c(0,0,1)),
arima(zill_data_log_diff$price_log, order=c(0,1,1)),
arima(zill_data_log_diff$price_log, order=c(0,1,2)),
arima(zill_data_log_diff$price_log, order=c(0,1,3)),
arima(zill_data_log_diff$price_log, order=c(1,1,0)),
arima(zill_data_log_diff$price_log, order=c(1,1,1)),
arima(zill_data_log_diff$price_log, order=c(2,1,0)),
arima(zill_data_log_diff$price_log, order=c(3,1,0))
)
## Warning in BIC.default(arima(zill_data_log_diff$price_log, order = c(0, : models
## are not all fitted to the same number of observations
## df BIC
## arima(zill_data_log_diff$price_log, order = c(0, 0, 1)) 3 -181.8744
## arima(zill_data_log_diff$price_log, order = c(0, 1, 1)) 2 -848.9640
## arima(zill_data_log_diff$price_log, order = c(0, 1, 2)) 3 -949.3890
## arima(zill_data_log_diff$price_log, order = c(0, 1, 3)) 4 -981.8349
## arima(zill_data_log_diff$price_log, order = c(1, 1, 0)) 2 -1006.6656
## arima(zill_data_log_diff$price_log, order = c(1, 1, 1)) 3 -1003.2269
## arima(zill_data_log_diff$price_log, order = c(2, 1, 0)) 3 -1003.2888
## arima(zill_data_log_diff$price_log, order = c(3, 1, 0)) 4 -998.5988
The BIC values above indicate that the AR(1) specification, ARIMA(1,1,0), fits best among these candidate models.
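As a cross-check before the auto.arima search below, the BIC-preferred specification can be fitted directly and its residuals inspected (a minimal sketch on the same log price series):
# Fit the BIC-preferred ARIMA(1,1,0) on log prices and check its residuals.
bic_mod <- arima(zill_data_log_diff$price_log, order = c(1, 1, 0))
forecast::checkresiduals(bic_mod)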
auto.arima(zill_data_log_diff$price_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: zill_data_log_diff$price_log
## ARIMA(1,2,2)
##
## Coefficients:
## ar1 ma1 ma2
## -0.5158 1.0415 0.9664
## s.e. 0.0871 0.0595 0.0998
##
## sigma^2 estimated as 1.1e-05: log likelihood=515.96
## AIC=-1023.92 AICc=-1023.57 BIC=-1012.87
The auto.arima search selects an ARIMA(1,2,2) specification for the log price series; the fits below use an ARIMA(1,1,2) model.
Fit several ARIMA models to your time-series variable. Which model is the “best” according to the AIC? According to the BIC?
best_model <- arima(Four_Bedroom_Tampa$Price, order = c(1,1,2) )
forecast::checkresiduals(best_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)
## Q* = 45.678, df = 7, p-value = 1.01e-07
##
## Model df: 3. Total lags used: 10
The residual time plot looks broadly like white noise, but the Ljung-Box test (p ≈ 1e-07) rejects independence, so some autocorrelation remains in the residuals.
best_mod = arima(zill_data_log_diff$price_log,order=c(1,1,2))
resid = best_mod$residuals
pred = resid+zill_data_log_diff$price_log
ggplot()+
geom_line(aes(zill_data_log_diff$Monthly_parsed,zill_data_log_diff$price_log))+
geom_line(aes(zill_data_log_diff$Monthly_parsed,pred),color='blue',alpha=0.4)+
theme_bw()+
xlab("Date")+
ylab("Log Price Diff Housing")
RMSE = sqrt(mean((expm1(pred) - expm1(zill_data_log_diff$price_log))^2,na.rm=T))
RMSE
## [1] 747.3862
The in-sample RMSE is about 747 in price units, and the fitted values track the observed series closely.
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
The ACF and PACF of the residuals show only mild remaining structure; the Ljung-Box tests below reject independence at short lags but not at lags 10 and 20.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 11.114, df = 1, p-value = 0.000857
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 12.199, df = 5, p-value = 0.03215
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = 13.597, df = 10, p-value = 0.1922
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = 21.677, df = 20, p-value = 0.3583
best_mod %>%
forecast(h=6) %>%
autoplot(ylab = "Price Log")
prediction = predict(best_mod,n.ahead=6)
future_dates = c(
max(zill_data_log_diff$Monthly_parsed) %m+% months(1),
max(zill_data_log_diff$Monthly_parsed) %m+% months(2),
max(zill_data_log_diff$Monthly_parsed) %m+% months(3),
max(zill_data_log_diff$Monthly_parsed) %m+% months(4),
max(zill_data_log_diff$Monthly_parsed) %m+% months(5),
max(zill_data_log_diff$Monthly_parsed) %m+% months(6)
)
pred_data = data.frame(
pred = prediction,
Monthly_parsed = future_dates
)
pred_merge = Four_Bedroom_Tampa %>%
full_join(pred_data) %>%
filter(Monthly_parsed>ymd("2012-01-31") & Monthly_parsed<=ymd("2022-06-30")) %>%
mutate(
pred_high = expm1(pred.pred+2*pred.se),
pred_low = expm1(pred.pred - 2*pred.se),
pred.pred = expm1(pred.pred),
error = pred.pred - Price
)
## Joining, by = "Monthly_parsed"
pred_merge %>%
ggplot()+
geom_line(aes(Monthly_parsed,Price))+
geom_line(aes(Monthly_parsed,pred.pred),color='blue')+
geom_ribbon(aes(x=Monthly_parsed,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)
Note: You can use auto.arima() to find the “best” model, but you should also compare manually specified ARIMA fits.
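To act on that note, a small holdout comparison between the hand-picked ARIMA(1,1,2) and whatever auto.arima() selects can be run on the log price series (a hedged sketch; the 24-month test window is an assumption chosen for illustration):
# Hold out the last 24 months of the log series, fit both models on the rest,
# and compare their out-of-sample RMSE.
n_test  <- 24
log_all <- zill_data_log_diff$price_log
train_y <- head(log_all, length(log_all) - n_test)
test_y  <- tail(log_all, n_test)

manual_fit <- arima(train_y, order = c(1, 1, 2))
auto_fit   <- forecast::auto.arima(train_y)

manual_rmse <- sqrt(mean((test_y - forecast::forecast(manual_fit, h = n_test)$mean)^2))
auto_rmse   <- sqrt(mean((test_y - forecast::forecast(auto_fit, h = n_test)$mean)^2))
c(manual_ARIMA_1_1_2 = manual_rmse, auto_arima = auto_rmse)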
library(prophet)
prophet_data = Four_Bedroom_Tampa %>%
rename(ds = Monthly_parsed, # Have to name our date variable "ds"
y = Price) # Have to name our time series "y"
train = prophet_data %>%
filter(ds<ymd("2020-01-01"))
test = prophet_data %>%
filter(ds>=ymd("2020-01-01"))
model = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = 730)
forecast = predict(model,future)
plot(model,forecast)+
ylab("Price")+xlab("Date")+theme_bw()
### 14.3 Interactive Charts
dyplot.prophet(model,forecast)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The component plots are very similar to a general time series decomposition; a rough STL comparison follows the component summary below.
prophet_plot_components(model,forecast)
- The overall trend shows a gradual increase in price from 2018 to 2022.
- The yearly component shows some seasonality: prices go up in summer around June, drop during the fall, and spike again toward the end of the year.
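For a rough side-by-side with a classical decomposition, the same training series can be decomposed with STL (a sketch that assumes monthly frequency; the periodic seasonal window is an illustrative choice):
# Classical STL decomposition of the Prophet training series for comparison.
train_ts <- ts(train$y, frequency = 12)
plot(stl(train_ts, s.window = "periodic"))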
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Price")
### 14.6 Changepoint Hyperparameters

A greater number of changepoints and more flexibility may improve performance, but may also increase the risk of overfitting.
# Number of Changepoints
model = prophet(train,n.changepoints=50)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Price")
The identified changepoints are not especially informative, since the monthly data show only gradual increases or decreases in housing prices.
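Besides the number of changepoints, trend flexibility can also be controlled with the changepoint.prior.scale argument (a hedged sketch; the value 0.5 is an illustrative assumption, not a tuned choice):
# A more flexible trend via a larger changepoint prior scale (illustrative value).
flexible_model    <- prophet(train, changepoint.prior.scale = 0.5)
flexible_forecast <- predict(flexible_model, future)
plot(flexible_model, flexible_forecast) +
  add_changepoints_to_plot(flexible_model) + theme_bw() + xlab("Date") + ylab("Price")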
prophet_plot_components(model,forecast)
### 14.7 Non-Scaled Forecast
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
forecast_not_scaled = ggplot()+
geom_line(aes(test$ds,test$y))+
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red')
#forecast_scaled = forecast_not_scaled +
#ylim(0,1000)
forecast_not_scaled
The plot above shows that the forecast is noticeably off, underestimating the sharp run-up in actual prices during the test period.
three_yr_future = make_future_dataframe(model,periods = 1100)
three_yr_forecast = predict(model,three_yr_future)
plot(model,three_yr_forecast)+theme_bw()+xlab("Date")+ylab("Price")
We cannot conclude much from the three-year forecast: it increases steadily, whereas in reality housing prices rose sharply during COVID.
Looking at the three-year trend, there is no obvious saturating maximum or minimum that the model needs to account for.
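If a saturating maximum or minimum ever did seem warranted, Prophet can encode one through logistic growth with a capacity column (a hedged sketch; the cap value of 600000 is purely illustrative, not estimated from the data):
# Logistic growth requires a "cap" column in both the history and future frames.
train_cap  <- train %>% mutate(cap = 600000)
future_cap <- three_yr_future %>% mutate(cap = 600000)
logistic_model    <- prophet(train_cap, growth = "logistic")
logistic_forecast <- predict(logistic_model, future_cap)
plot(logistic_model, logistic_forecast) + theme_bw() + xlab("Date") + ylab("Price")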
prophet_plot_components(model,forecast)
The plot above shows the seasonal component, with a sharp decline in prices during March followed by an increase during the summer months.
Additive and multiplicative seasonality:
additive = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
add_fcst = predict(additive,future)
plot(additive,add_fcst)
### 15.3 Additive Seasonality
prophet_plot_components(additive,add_fcst)
multi = prophet(train,seasonality.mode = 'multiplicative')
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)
prophet_plot_components(multi, multi_fcst)
Holiday effects should have no impact on this housing data, since the observations are monthly rather than daily.
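For completeness, country holidays could still be attached to a model, though with month-end observations they would have essentially nothing to act on (a hedged sketch of the prophet R API; the calls add_country_holidays() and fit.prophet() are used as documented, not taken from the original analysis):
# Build an unfitted model, register US holidays, then fit on the training data.
holiday_model <- prophet(weekly.seasonality = FALSE, daily.seasonality = FALSE)
holiday_model <- add_country_holidays(holiday_model, country_name = "US")
holiday_model <- fit.prophet(holiday_model, train)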
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2020-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
MAE = mean(abs(test$y - forecast_metric_data$yhat))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
## Warning in (test$y - forecast_metric_data$yhat)/test$y: longer object length is
## not a multiple of shorter object length
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 42188.7"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 30525.93"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.09"
df.cv <- cross_validation(model, initial = 2920, period = 365, horizon = 365, units = "days")
## Making 11 forecasts with cutoffs between 2009-01-02 and 2018-12-31
head(df.cv)
## # A tibble: 6 × 6
## y ds yhat yhat_lower yhat_upper cutoff
## <dbl> <dttm> <dbl> <dbl> <dbl> <dttm>
## 1 206999 2009-01-31 00:00:00 209510. 207065. 211890. 2009-01-02 00:00:00
## 2 204291 2009-02-28 00:00:00 206775. 204376. 209036. 2009-01-02 00:00:00
## 3 201840 2009-03-31 00:00:00 203442. 200783. 206162. 2009-01-02 00:00:00
## 4 198242 2009-04-30 00:00:00 200184. 197353. 202907. 2009-01-02 00:00:00
## 5 195884 2009-05-31 00:00:00 196736. 193641. 199828. 2009-01-02 00:00:00
## 6 192684 2009-06-30 00:00:00 193250. 189825. 196810. 2009-01-02 00:00:00
unique(df.cv$cutoff)
## [1] "2009-01-02 GMT" "2010-01-02 GMT" "2011-01-02 GMT" "2012-01-02 GMT"
## [5] "2013-01-01 GMT" "2014-01-01 GMT" "2015-01-01 GMT" "2016-01-01 GMT"
## [9] "2016-12-31 GMT" "2017-12-31 GMT" "2018-12-31 GMT"
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Price")+
scale_color_discrete(name = 'Cutoff')
performance_metrics(df.cv)
## horizon mse rmse mae mape mdape smape
## 1 57 days 105825408 10287.148 7194.786 0.03823971 0.01865136 0.03981570
## 2 58 days 124209984 11144.953 7907.724 0.04242600 0.01451994 0.04433405
## 3 59 days 93099180 9648.792 6182.004 0.03257739 0.00943273 0.03399875
## 4 88 days 129665891 11387.093 7602.062 0.04061261 0.01939580 0.04260262
## 5 89 days 139050169 11791.954 8016.831 0.04285878 0.01557659 0.04500396
## 6 90 days 114628506 10706.470 6899.224 0.03622710 0.01421974 0.03797148
## 7 118 days 156552866 12512.109 8343.723 0.04440717 0.01639912 0.04680881
## 8 119 days 167146696 12928.523 8944.088 0.04759875 0.02099539 0.05015176
## 9 120 days 138380635 11763.530 7741.450 0.04034309 0.01830325 0.04242291
## 10 149 days 190773923 13812.093 9193.089 0.04858329 0.02099539 0.05149475
## 11 150 days 203931897 14280.473 9988.467 0.05270238 0.02099539 0.05575262
## 12 151 days 167858618 12956.026 8409.832 0.04361577 0.01706310 0.04609178
## 13 179 days 231412077 15212.234 9982.783 0.05268921 0.01782473 0.05619120
## 14 180 days 246346622 15695.433 10865.039 0.05721455 0.03389750 0.06088714
## 15 181 days 203321416 14259.082 9097.765 0.04709150 0.01759668 0.05009397
## 16 210 days 276242686 16620.550 10954.344 0.05779244 0.02340230 0.06199901
## 17 211 days 290264013 17037.136 11889.490 0.06253203 0.03566445 0.06686004
## 18 212 days 237510576 15411.378 9848.494 0.05106021 0.02131809 0.05457992
## 19 241 days 319683278 17879.689 11932.112 0.06317913 0.03044333 0.06809612
## 20 242 days 331490983 18206.894 12745.518 0.06717952 0.03980312 0.07217087
## 21 243 days 269618146 16420.053 10495.483 0.05465120 0.02416739 0.05869990
## 22 271 days 355762724 18861.673 12817.504 0.06826580 0.03589756 0.07382762
## 23 272 days 371115120 19264.348 13575.832 0.07169865 0.03922339 0.07734059
## 24 273 days 301032603 17350.291 11203.925 0.05846624 0.03027716 0.06303050
## 25 302 days 399699656 19992.490 13926.812 0.07443759 0.04464477 0.08077389
## 26 303 days 430241322 20742.259 14824.596 0.07824909 0.04464477 0.08483365
## 27 304 days 350221970 18714.218 12368.340 0.06454354 0.03770889 0.06988204
## 28 332 days 458612270 21415.235 15403.817 0.08245176 0.05077511 0.08978582
## 29 333 days 498794063 22333.698 16299.472 0.08580852 0.05077511 0.09343648
## 30 334 days 403613576 20090.136 13598.821 0.07085551 0.03877666 0.07700207
## 31 363 days 529904666 23019.658 16966.091 0.09082214 0.05742924 0.09933773
## 32 364 days 566437309 23799.943 17724.813 0.09329064 0.05742924 0.10201734
## 33 365 days 453030016 21284.502 14704.084 0.07680833 0.03858939 0.08381069
## coverage
## 1 0.3653846
## 2 0.4230769
## 3 0.4871795
## 4 0.4615385
## 5 0.4615385
## 6 0.5000000
## 7 0.5128205
## 8 0.4230769
## 9 0.4230769
## 10 0.4102564
## 11 0.3076923
## 12 0.3846154
## 13 0.4102564
## 14 0.3076923
## 15 0.3846154
## 16 0.4102564
## 17 0.3076923
## 18 0.4615385
## 19 0.4871795
## 20 0.3846154
## 21 0.5000000
## 22 0.4102564
## 23 0.3076923
## 24 0.4230769
## 25 0.2820513
## 26 0.2307692
## 27 0.3461538
## 28 0.2307692
## 29 0.2307692
## 30 0.3461538
## 31 0.2307692
## 32 0.3076923
## 33 0.4230769
plot_cross_validation_metric(df.cv, metric = 'rmse')