The Zillow Home Price Data are time series datasets available online via this link. The datasets selected are updated monthly:
Home Values: the Zillow Home Value Index (ZHVI), which reflects the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time.
Home Values Forecasts: the one-year forecast of ZHVI, created using the all-homes, mid-tier cut of ZHVI.
Rentals: a smoothed measure of the typical observed market-rate rent across a given region.
Inventory: the count of unique listings that were active at any time in a given month.
List and Sale Prices: the median price at which homes across various geographies were listed and sold.
Sales Count and Price Cuts: the estimated number of unique properties that sold during the month, after accounting for the latency between when sales occur and when they are reported.
These monthly time series datasets from Zillow were selected because they align with my project on "Forecasting Home Value in Cincinnati".
Experts in real estate have long been concerned about the cyclical nature of the housing market, as illustrated below:
(Figure: the housing market cycle)
On top of that, in 2020, as the COVID-19 pandemic spread across the world and policies such as lockdowns and social distancing came into effect, predictions about the housing market turned very pessimistic. Experts and the media forecast a shrinking US housing market and falling property prices as GDP declined and the economy struggled to recover.
However, the housing market has seen an unprecedented increase since late 2020, with prices rising by nearly 20% year over year in 2021. How will the housing market perform in 2022? Is demand still strong, or should we be concerned about home values declining or even a market crash? The abnormal pattern in 2021 may make forecasting difficult, since it departs from the normal trend and will raise sellers' expectations. We want to forecast these time series to see whether the 2021 increase was a one-time event or whether it will continue into 2022.
Import datasets
We loaded five separate datasets. These are all wide datasets that will need to be reshaped to long format later:
home_value: 908 obs (regions) and 269 variables (monthly; Jan 2000 - Dec 2021).
rentals: 102 obs (regions) and 99 variables (monthly; Jan 2014 - Dec 2021).
inventory: 917 obs (regions) and 53 variables (monthly; Jan 2018 - Dec 2021).
list_and_sale_price: 95 obs (regions) and 53 variables (monthly; Jan 2018 - Dec 2021).
sales_count_and_price_cuts: 95 obs (regions) and 172 variables (monthly; Feb 2008 - Dec 2021).
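The code chunks throughout this report assume the following packages are attached (a sketch of the setup chunk, which is not shown in the rendered output):
library(tidyverse)   # read_csv, dplyr verbs, tidyr::pivot_longer, ggplot2, stringr
library(lubridate)   # ymd(), month(), year(), months()
library(zoo)         # as.yearmon(), rollmean(), rollapply()
library(kableExtra)  # kbl(), kable_classic()
library(gridExtra)   # grid.arrange()
library(forecast)    # BoxCox(), auto.arima(), checkresiduals()
library(tseries)     # adf.test(), kpss.test()
library(sjPlot)      # tab_model()
library(prophet)     # prophet(), make_future_dataframe(), cross_validation()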
home_value<- read_csv("data/home_values.csv")
inventory<- read_csv("data/inventory.csv")
list_and_sale_price<- read_csv("data/list_and_sale_price.csv")
rentals<- read_csv("data/rentals.csv")
sales_count_and_price_cuts<- read_csv("data/sales_count_and_price_cuts.csv")
Data for Cincinnati, OH
We can learn from the datasets above that the city of Cincinnati, OH has the unique region ID 394466. The city ranks 28th in size among the 900-plus cities (regions). Additionally, the region type for Cincinnati is Metropolitan Statistical Area (MSA).
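One way to confirm Cincinnati's RegionID in the wide file (a sketch, assuming the identifier columns shown in the tables below):
# Look up Cincinnati's RegionID, size rank, and region type
home_value %>%
  filter(str_detect(RegionName, "Cincinnati")) %>%
  select(RegionID, SizeRank, RegionName, RegionType, StateName)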
cin_home_value<- home_value%>%
filter(RegionID==394466)%>%
pivot_longer(6:269,names_to="date",values_to="home_value")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_inventory<- inventory%>%
filter(RegionID==394466)%>%
pivot_longer(6:53,names_to="date",values_to="inventory")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_list_and_sale_price<- list_and_sale_price%>%
filter(RegionID==394466)%>%
pivot_longer(6:53,names_to="date",values_to="list_sale_price")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_rentals<- rentals%>%
filter(RegionID==394466)%>%
pivot_longer(4:99,names_to="date",values_to="rentals")%>%
mutate(month= as.Date(as.yearmon(date)))%>%
mutate(new_month=as.Date(month) + months(1) - days(1))
cin_sales_count_and_price_cuts<- sales_count_and_price_cuts%>%
filter(RegionID==394466)%>%
pivot_longer(6:172,names_to="date",values_to="sales_count_price")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
This is the full combined Cincinnati dataset built from the five separate datasets, which span different time periods; it has 264 observations and 12 columns. Note that inventory, list and sale price, rentals, and sales count cover shorter time periods than the home value data (a quick missing-value check follows the table below).
cin_combine_full<- cin_home_value%>%
left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
left_join(cin_rentals[,c(5,7)], by="new_month")%>%
left_join(cin_inventory[,7:8], by="new_month")%>%
left_join(cin_list_and_sale_price[,7:8], by="new_month")
# good alternative library for descriptive statistics
cin_combine_full%>%
tail(5)%>%
kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2000-2021)")%>%
kable_classic(full_width = F, html_font = "Cambria")
| RegionID | SizeRank | RegionName | RegionType | StateName | date | home_value | new_month | sales_count_price | rentals | inventory | list_sale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2021-08-31 | 231514 | 2021-08-31 | 3071 | 1338 | 6254 | 336233 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2021-09-30 | 234720 | 2021-09-30 | 3010 | 1348 | 6479 | 324600 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2021-10-31 | 236772 | 2021-10-31 | 3143 | 1359 | 6452 | 316633 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2021-11-30 | 238710 | 2021-11-30 | 3156 | 1369 | 6006 | 310300 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2021-12-31 | 240833 | 2021-12-31 | 2941 | 1380 | 5302 | 306967 |
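Because the other series start later than home_value, the leading rows of the full combined dataset contain NAs; a quick way to see how many (a sketch, not in the original code):
# Count missing values per column in the full combined dataset
colSums(is.na(cin_combine_full))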
This is the combined Cincinnati dataset restricted to the time span common to all five datasets; it has 48 observations and 12 variables.
cin_combine<- cin_inventory%>%
left_join(cin_list_and_sale_price[,7:8], by="new_month")%>%
left_join(cin_rentals[,c(5,7)], by="new_month")%>%
left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
left_join(cin_home_value[,7:8], by="new_month")
# good alternative library for descriptive statistics
cin_combine%>%
head(5)%>%
kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2018-2021)")%>%
kable_classic(full_width = F, html_font = "Cambria")
| RegionID | SizeRank | RegionName | RegionType | StateName | date | inventory | new_month | list_sale_price | rentals | sales_count_price | home_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2018-01-31 | 8017 | 2018-01-31 | 218300 | 1087 | 2100 | 167537 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2018-02-28 | 7636 | 2018-02-28 | 219967 | 1091 | 1797 | 168013 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2018-03-31 | 7846 | 2018-03-31 | 231600 | 1095 | 2765 | 168827 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2018-04-30 | 8252 | 2018-04-30 | 248267 | 1099 | 3206 | 170175 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2018-05-31 | 8957 | 2018-05-31 | 261600 | 1104 | 3278 | 172092 |
This is a subset of the combined dataset that omits the city metadata columns.
cin_sub<- cin_combine[, 7:12]
# good alternative library for descriptive statistics
cin_sub%>%
tail(5)%>%
kbl(caption = "Table: Subset dataset of home information in Cincinnati")%>%
kable_classic(full_width = F, html_font = "Cambria")
| inventory | new_month | list_sale_price | rentals | sales_count_price | home_value |
|---|---|---|---|---|---|
| 6254 | 2021-08-31 | 336233 | 1338 | 3071 | 231514 |
| 6479 | 2021-09-30 | 324600 | 1348 | 3010 | 234720 |
| 6452 | 2021-10-31 | 316633 | 1359 | 3143 | 236772 |
| 6006 | 2021-11-30 | 310300 | 1369 | 3156 | 238710 |
| 5302 | 2021-12-31 | 306967 | 1380 | 2941 | 240833 |
home_value_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,home_value))+
theme_bw()+
xlab("Month")+
ylab("Home Value")+
labs(
title = 'Home Value in Cincinnati',
subtitle = 'January 2000 - December 2021')
inventory_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,inventory))+
theme_bw()+
xlab("Month")+
ylab("Inventory")+
labs(
title = 'Inventory in Cincinnati',
subtitle = 'January 2018 - December 2021')
list_and_sale_price_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,list_sale_price))+
theme_bw()+
xlab("Month")+
ylab("List and sale price")+
labs(
title = 'List and sale price in Cincinnati',
subtitle = 'January 2018 - December 2021')
rentals_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,rentals))+
theme_bw()+
xlab("Month")+
ylab("Rentals")+
labs(
title = 'Rentals in Cincinnati',
subtitle = 'January 2014 - December 2021')
sales_count_price_full_plot = ggplot(cin_combine_full)+
geom_line(aes(new_month,sales_count_price))+
theme_bw()+
xlab("Month")+
ylab("Sales count and price cut")+
labs(
title = 'Sales count and price cut in Cincinnati',
subtitle = 'February 2008 - December 2021')
grid.arrange(home_value_full_plot,
inventory_full_plot,
list_and_sale_price_full_plot,
rentals_full_plot,
sales_count_price_full_plot,
ncol=2, nrow =3)
home_value
home_value_plot = ggplot(cin_combine)+
geom_line(aes(new_month,home_value))+
geom_smooth(aes(new_month,home_value),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Home Value")+
labs(
title = 'Home Value in Cincinnati',
subtitle = 'January 2018 - December 2021')
home_value_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
home_value, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Home Value in Cincinnati by Year' )
grid.arrange(home_value_plot, home_value_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
#boxplot of home value
cin_home_value%>%ggplot(aes(year(new_month),
home_value, group=factor(year(new_month)),
colour=factor(year(new_month)))) +
geom_boxplot()+
labs(title= 'Boxplot of Home Value in Cincinnati by Year' )
inventory
inventory_plot = ggplot(cin_combine)+
geom_line(aes(new_month,inventory))+
geom_smooth(aes(new_month,inventory),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Inventory")+
labs(
title = 'Inventory in Cincinnati',
subtitle = 'January 2018 - December 2021')
inventory_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
inventory, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Inventory in Cincinnati by Year' )
grid.arrange(inventory_plot, inventory_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
list_sale_price
list_and_sale_price_plot = ggplot(cin_combine)+
geom_line(aes(new_month,list_sale_price))+
geom_smooth(aes(new_month,list_sale_price),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("List and sale price")+
labs(
title = 'List and sale price in Cincinnati',
subtitle = 'January 2018 - December 2021')
list_and_sale_price_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
list_sale_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'List and sale price in Cincinnati by Year' )
grid.arrange(list_and_sale_price_plot, list_and_sale_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
rentals
rentals_plot = ggplot(cin_combine)+
geom_line(aes(new_month,rentals))+
geom_smooth(aes(new_month,rentals),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Rentals")+
labs(
title = 'Rentals in Cincinnati',
subtitle = 'January 2018 - December 2021')
rentals_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
rentals, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Rentals in Cincinnati by Year' )
grid.arrange(rentals_plot, rentals_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
sales_count_price
sales_count_price_plot = ggplot(cin_combine)+
geom_line(aes(new_month,sales_count_price))+
geom_smooth(aes(new_month,sales_count_price),method='lm',color='red')+
theme_bw()+
xlab("Month")+
ylab("Sales count and price cut")+
labs(
title = 'Sales count and price cut in Cincinnati',
subtitle = 'January 2018 - December 2021')
sales_count_price_year_plot<- cin_sub%>%
ggplot(aes(month(new_month, label=TRUE, abbr=TRUE),
sales_count_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()+
labs(title= 'Sales count and price cut in Cincinnati by Year' )
grid.arrange(sales_count_price_plot, sales_count_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'
As the EDA above suggests, home value and rentals appear to increase over time, and the same holds within each year (2018-2021). Similarly, list and sale price shows an increasing trend across years but variation within each year (this may be seasonal). In contrast, inventory appears to decrease both over time and within each year. Lastly, for sales count and price cuts, the variation tends to decrease slightly over the years.
Now we want to model the linear trend and see if it’s consistent with our observations from the visualization.
# Test whether there is a true linear trend over time
mod1 = lm(home_value~new_month,data=cin_combine_full)
summary(mod1)
mod2 = lm(inventory~new_month,data=cin_combine_full)
summary(mod2)
mod3 = lm(list_sale_price~new_month,data=cin_combine_full)
summary(mod3)
mod4 = lm(rentals~new_month,data=cin_combine_full)
summary(mod4)
mod5 = lm(sales_count_price~new_month,data=cin_combine_full)
summary(mod5)
The outcome table:
tab_model(mod1, mod2,mod3,mod4,mod5)
| Outcome | Intercept (95% CI) | new_month slope (95% CI) | Observations | R2 / adjusted R2 |
|---|---|---|---|---|
| home value | 67293.58 (53729.25 – 80857.90) | 5.95 (5.05 – 6.84) | 264 | 0.396 / 0.393 |
| inventory | 62976.81 (50644.46 – 75309.16) | -3.02 (-3.70 – -2.35) | 48 | 0.639 / 0.631 |
| list sale price | -1130447.61 (-1357341.28 – -903553.93) | 77.53 (65.12 – 89.94) | 48 | 0.775 / 0.770 |
| rentals | -1332.23 (-1420.22 – -1244.23) | 0.14 (0.13 – 0.14) | 96 | 0.970 / 0.970 |
| sales count price | -4112.02 (-4975.12 – -3248.91) | 0.38 (0.33 – 0.44) | 167 | 0.560 / 0.557 |
All coefficient p-values are < 0.001.
We can see that most of the linear regression output is consistent with the over-time trends observed in the plots above: home value, rentals, and list and sale price appear to increase over time and have positive coefficients (positive linear association). In contrast, inventory appears to decrease over time and has a negative coefficient (negative linear association). Lastly, for sales count and price cuts, the variation tends to decrease slightly over the years, yet the coefficient is positive, indicating a positive linear association.
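One detail to keep in mind when reading these coefficients: new_month is a Date, so each slope is the change per day. A rough conversion of the home value slope to an annual change (an illustrative sketch):
# lm() slopes on a Date predictor are per day; scale to an approximate yearly change
coef(mod1)[["new_month"]] * 365.25   # roughly 5.95 * 365.25, i.e. about $2,170 per year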
From this part, we will start model development focusing on time series of **home value data** in Cincinnati:
Load data:
cin_home_value<- home_value%>%
filter(RegionID==394466)%>%
pivot_longer(6:269,names_to="date",values_to="home_value")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
cin_home_value%>%
head(5)%>%
kbl(caption = "Table: Cincinnati Home Value data")%>%
kable_classic(full_width = F, html_font = "Cambria")
| RegionID | SizeRank | RegionName | RegionType | StateName | date | home_value | new_month |
|---|---|---|---|---|---|---|---|
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-01-31 | 129379 | 2000-01-31 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-02-29 | 129364 | 2000-02-29 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-03-31 | 129290 | 2000-03-31 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-04-30 | 129780 | 2000-04-30 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-05-31 | 130454 | 2000-05-31 |
We want to look at the plot and assess the data generating process.
# Plot the Data
cin_home_value%>% ggplot()+
geom_line(aes(new_month,home_value))+
theme_bw()+
xlab("Month")+
ylab("Home Value")+
labs(
title = 'Home Value in Cincinnati',
subtitle = 'January 2000 - December 2021')
As the plot above indicates non-stationarity, we want to look at the 6-month rolling mean and SD plots:
homevalue_roll <- cin_home_value %>%
mutate(
home_mean = zoo::rollmean(
home_value,
k = 6,
fill = NA),
close_sd = zoo::rollapply(
home_value,
FUN = sd,
width = 6,
fill = NA)
)
homevalue_rollmean <- homevalue_roll %>%
ggplot() +
geom_line(aes(new_month, home_mean)) +   # plot the 6-month rolling mean
theme_bw() +
ggtitle("Home value mean over time (6 month rolling window)") +
ylab("home value") +
xlab("Month")
homevalue_rollsd <- homevalue_roll %>%
ggplot() +
geom_line(aes(new_month, close_sd)) +    # plot the 6-month rolling SD
theme_bw() +
ggtitle("Home value SD over time (6 month rolling window)") +
ylab("home value (rolling SD)") +
xlab("Month")
grid.arrange(homevalue_rollmean,homevalue_rollsd,ncol=1, nrow =2)
The rolling plots above mainly imply mean non-stationarity and only slight/very limited variance non-stationarity. This may be because our data are monthly (monthly averages, so there is less variation). We want to confirm that the data are only mean non-stationary and not variance non-stationary, so we start with variance transformations (log/Box-Cox):
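To see which lambda the 'auto' option selects (a sketch; forecast::BoxCox.lambda() is roughly what lambda = "auto" uses internally):
# Estimate the Box-Cox lambda for the home value series
forecast::BoxCox.lambda(cin_home_value$home_value)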
cinhome_var_transform = cin_home_value %>%
mutate(home_log = log1p(home_value),
home_box = BoxCox(home_value,lambda='auto'))
homevalue_log_trans <- cinhome_var_transform %>%
ggplot() +
geom_line(aes(new_month, home_log)) +
theme_bw() +
ggtitle("Homevalue in Cincinnati over Time (log transform)") +
ylab("home value(log)") +
xlab("Month")
homevalue_box_trans <- cinhome_var_transform %>%
ggplot() +
geom_line(aes(new_month, home_box)) +
theme_bw() +
ggtitle("Homevalue in Cincinnati over Time (box-cox transform)") +
ylab("home value(box-cox)") +
xlab("Month")
grid.arrange(homevalue_log_trans,homevalue_box_trans,ncol=1, nrow =2)
We notice from the plots that, after the variance transformations, the series still appear non-stationary, and the log and Box-Cox transformations make little difference. Since the non-stationarity did not improve much, we use the ADF and KPSS tests to verify. Note that the two tests have opposite null hypotheses: the ADF null is non-stationarity (a unit root), so a large p-value suggests non-stationarity, while the KPSS null is stationarity, so a small p-value suggests non-stationarity:
a1<-adf.test(cinhome_var_transform$home_log) # Non-stationary
k1<-kpss.test(cinhome_var_transform$home_log) # Non-stationary
a2<-adf.test(cinhome_var_transform$home_box) # Non-stationary
k2<-kpss.test(cinhome_var_transform$home_box) # Non-stationary
test<- matrix(c("ADF","Log Transform",a1$p.value,"Non-stationary",
"KPSS","Log Transform",k1$p.value,"Non-stationary",
"ADF","Box-cox Transform",a2$p.value,"Non-stationary",
"KPSS","Box-cox Transform",k2$p.value,"Non-stationary"),
ncol=4,byrow=TRUE)
test %>%
kbl(caption = "Table: ADF/KPSS test on Log vs. Box-cox transform") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Test | Transformation | p-value | Conclusion |
|---|---|---|---|
| ADF | Log Transform | 0.99 | Non-stationary |
| KPSS | Log Transform | 0.01 | Non-stationary |
| ADF | Box-cox Transform | 0.99 | Non-stationary |
| KPSS | Box-cox Transform | 0.01 | Non-stationary |
Both the plots and the test results show that the transformed data are still non-stationary. Now we will focus on mean non-stationarity and start with first-order differencing:
cinhome_mean_transform = cin_home_value %>%
mutate(home_log = log1p(home_value),
home_box = BoxCox(home_value,lambda='auto'),
home_diff= home_value-lag(home_value),
home_diff_log = home_log - lag(home_log),
home_diff_box = home_box - lag(home_box))%>%
drop_na()
#first difference
d1<-cinhome_mean_transform %>%
ggplot()+
geom_line(aes(new_month,home_diff))+
theme_bw()+
ggtitle("homevalue First Difference: 2000-2021")+
ylab("Homevalue (Difference))")+
xlab("Month")
#First order difference on log transformed
d2<- cinhome_mean_transform %>%
ggplot()+
geom_line(aes(new_month,home_diff_log))+
theme_bw()+
ggtitle("homevalue First Difference + Log tranformation: 2000-2021")+
ylab("Homevalue (Difference + log))")+
xlab("Month")
#First order difference on box transformed
d3<- cinhome_mean_transform %>%
ggplot()+
geom_line(aes(new_month,home_diff_box))+
theme_bw()+
ggtitle("homevalue First Difference + Box-Cox tranformation: 2000-2021")+
ylab("Homevalue (Difference + Box-cox))")+
xlab("Month")
grid.arrange(d1,d2,d3,ncol=1, nrow =3)
We notice a similar pattern for differencing alone and for differencing on top of the variance transformations: the series still appear non-stationary and a trend is still visible. We use the ADF and KPSS tests to verify:
# only difference
ad1<-adf.test(cinhome_mean_transform$home_diff) # Non-stationary
kd1<-kpss.test(cinhome_mean_transform$home_diff) # Non-stationary
#difference+log
adl1<-adf.test(cinhome_mean_transform$home_diff_log) # Non-stationary
kdl1<-kpss.test(cinhome_mean_transform$home_diff_log) # Non-stationary
#difference +box
adb1<-adf.test(cinhome_mean_transform$home_diff_box) # Non-stationary
kdb1<-kpss.test(cinhome_mean_transform$home_diff_box) # Non-stationary
test2<- matrix(c("ADF","Difference",ad1$p.value,"Non-stationary",
"KPSS","Difference",kd1$p.value,"Non-stationary",
"ADF","Difference+log",adl1$p.value,"Non-stationary",
"KPSS","Difference+Log",kdl1$p.value,"Non-stationary",
"ADF","Difference+Box-cox",adb1$p.value,"Non-stationary",
"KPSS","Difference+Box-cox",kdb1$p.value,"Non-stationary"),
ncol=4,byrow=TRUE)
test2 %>%
kbl(caption = " Table: ADF/KPSS test on First Difference & Log/ Box-cox Transform+Difference") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Test | Transformation | p-value | Conclusion |
|---|---|---|---|
| ADF | Difference | 0.788741858628823 | Non-stationary |
| KPSS | Difference | 0.01 | Non-stationary |
| ADF | Difference+log | 0.55797735934197 | Non-stationary |
| KPSS | Difference+Log | 0.01 | Non-stationary |
| ADF | Difference+Box-cox | 0.43768833056367 | Non-stationary |
| KPSS | Difference+Box-cox | 0.01 | Non-stationary |
Results from the ADF/KPSS tests for the first-order difference alone and for the first-order difference combined with the log/Box-Cox transformations all imply non-stationarity. We need to take the second-order difference:
#second order difference
cinhome_second_diff<- cinhome_mean_transform%>%
mutate(home_second_diff=home_diff-lag(home_diff),
home_second_diff_log=home_diff_log-lag(home_diff_log),
home_second_diff_box=home_diff_box-lag(home_diff_box))%>%
drop_na()
# another way to take the second difference
# diff(diff(cin_home_value$home_value))
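As a quick cross-check of the differencing order (a sketch, not part of the original analysis), forecast::ndiffs() estimates the number of regular differences needed using a unit-root test:
# KPSS-based estimate of how many regular differences are required
forecast::ndiffs(cin_home_value$home_value)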
Now we plot the second-order differenced series and check:
#second difference
sd1<-cinhome_second_diff %>%
ggplot()+
geom_line(aes(new_month,home_second_diff))+
theme_bw()+
ggtitle("Homevalue Second order Difference: 2000-2021")+
ylab("Homevalue (2nd-Difference))")+
xlab("Month")
#second order difference on log transformed
sd2<- cinhome_second_diff %>%
ggplot()+
geom_line(aes(new_month,home_second_diff_log))+
theme_bw()+
ggtitle("Homevalue Second order Difference + Log tranformation: 2000-2021")+
ylab("Homevalue (2nd-Difference + log))")+
xlab("Month")
#second order difference on box transformed
sd3<- cinhome_second_diff %>%
ggplot()+
geom_line(aes(new_month,home_second_diff_box))+
theme_bw()+
ggtitle("Homevalue Second order Difference + Box-Cox tranformation: 2000-2021")+
ylab("Homevalue (2nd-Difference + Box-cox))")+
xlab("Month")
grid.arrange(sd1,sd2,sd3,ncol=1, nrow =3)
The plots above now look more mean-reverting. However, we do notice the variation changing over time. Now we can check whether the differenced and transformed data are stationary using the ADF/KPSS tests:
# only 2nd difference
ad2<-adf.test(cinhome_second_diff$home_second_diff) # stationary
kd2<-kpss.test(cinhome_second_diff$home_second_diff) # stationary
#second difference+log
adl2<-adf.test(cinhome_second_diff$home_second_diff_log) # stationary
kdl2<-kpss.test(cinhome_second_diff$home_second_diff_log) # stationary
#second difference +box
adb2<-adf.test(cinhome_second_diff$home_second_diff_box) # stationary
kdb2<-kpss.test(cinhome_second_diff$home_second_diff_box) # stationary
test3<- matrix(c("ADF","Difference",ad2$p.value,"Stationary",
"KPSS","Difference",kd2$p.value,"Stationary",
"ADF","Difference+log",adl2$p.value,"Stationary",
"KPSS","Difference+Log",kdl2$p.value,"Stationary",
"ADF","Difference+Box-cox",adb2$p.value,"Stationary",
"KPSS","Difference+Box-cox",kdb2$p.value,"Stationary"),
ncol=4,byrow=TRUE)
test3 %>%
kbl(caption = "Table: ADF/KPSS test on 2nd Difference vs. 2nd Difference + Log/ Box-cox Transform") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Test | Transformation | p-value | Conclusion |
|---|---|---|---|
| ADF | Difference | 0.01 | Stationary |
| KPSS | Difference | 0.1 | Stationary |
| ADF | Difference+log | 0.01 | Stationary |
| KPSS | Difference+Log | 0.1 | Stationary |
| ADF | Difference+Box-cox | 0.01 | Stationary |
| KPSS | Difference+Box-cox | 0.1 | Stationary |
The tests all suggest that the second-order difference, with or without a log/Box-Cox transformation, is stationary. We will plot the ACF and PACF of the second-order differenced data and decide on the parameters of the ARIMA model:
# Plot ACF/PACF of second differenced
acf(cinhome_second_diff$home_second_diff)
pacf(cinhome_second_diff$home_second_diff)
### Seasonality
The home value series is seasonally adjusted. We can also see this in the ACF/PACF plots above.
We observed a trend in the earlier data-generating plot. In the ACF/PACF plots we only notice some wave patterns, but no obvious seasonality or fixed within-year frequency (there may be some cyclical pattern). For real estate markets, there are certainly months that sell better than others. However, barring events such as a housing market crash, home values are usually unlikely to fall back to their previous seasonal values.
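As a quick numeric cross-check of this no-seasonality reading (a sketch, not part of the original analysis), forecast::nsdiffs() estimates how many seasonal differences a monthly series needs:
# 0 would be consistent with a seasonally adjusted series needing no seasonal differencing
forecast::nsdiffs(ts(cin_home_value$home_value, frequency = 12))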
We can also double-check our assumption of no seasonality for this home_value time series using decomposition:
#make home value a ts dataframe
home<- ts(cin_home_value$home_value,frequency = 12, start = c(2000, 1))
#decompose-- additive
home_decompose_ad = decompose(home,type='additive')
plot(home_decompose_ad)
#decompose -- multiplicative
home_decompose_mul = decompose(home,type='multiplicative')
plot(home_decompose_mul)
From the ADF/KPSS test results and the ACF/PACF plots above, we can confirm that the data are only mean non-stationary and need a second-order difference; we do not need to worry about a variance transformation. Thus, we will only build models using the difference transformation, which deals with the mean non-stationarity. Additionally, from the ACF/PACF plots above, we can propose the following ARIMA specification:
home_value (original data): an AR process with second-order differencing, so d = 2, p = 18 (an AR process), and q = 0. Other significant lags are 3, 4, 6, 12, and 15; we can also try these in ARIMA models.
#use original parameter: home_value
mod1 = arima(cinhome_second_diff$home_value,order=c(18,2,0))
mod2 = arima(cinhome_second_diff$home_value,order=c(3,2,0))
mod3 = arima(cinhome_second_diff$home_value,order=c(4,2,0))
mod4 = arima(cinhome_second_diff$home_value,order=c(6,2,0))
mod5 = arima(cinhome_second_diff$home_value,order=c(12,2,0))
mod6 = arima(cinhome_second_diff$home_value,order=c(15,2,0))
mod7 = arima(cinhome_second_diff$home_value,order=c(18,2,1))
model_summary<- matrix(
c("Model","ARIMA(p,d,q)","Parameter","sigma^2","log Likelihood","AIC","BIC",
"1","ARIMA(18,2,0)","Home value",mod1$sigma2,mod1$loglik,mod1$aic,BIC(mod1),
"2","ARIMA(3,2,0)","Home value",mod2$sigma2,mod2$loglik,mod2$aic,BIC(mod2),
"3","ARIMA(4,2,0)","Home value",mod3$sigma2,mod3$loglik,mod3$aic,BIC(mod3),
"4","ARIMA(6,2,0)","Home value",mod4$sigma2,mod4$loglik,mod4$aic,BIC(mod4),
"5","ARIMA(12,2,0)","Home value",mod5$sigma2,mod5$loglik,mod5$aic,BIC(mod5),
"6","ARIMA(15,2,0)","Home value",mod6$sigma2,mod6$loglik,mod6$aic,BIC(mod6),
"7","ARIMA(18,2,1)","Home value",mod7$sigma2,mod7$loglik, mod7$aic,BIC(mod7)),
ncol=7,byrow=TRUE)
model_summary %>%
kbl(caption = "Table: ARIMA Model output: AIC vs BIC") %>%
kable_classic(full_width = F, html_font = "Cambria")%>%
row_spec(1, bold = T)
| Model | ARIMA(p,d,q) | Parameter | sigma^2 | log Likelihood | AIC | BIC |
|---|---|---|---|---|---|---|
| 1 | ARIMA(18,2,0) | Home value | 35339.0625546411 | -1733.42177827751 | 3504.84355655501 | 3572.49650754431 |
| 2 | ARIMA(3,2,0) | Home value | 52017.2052625915 | -1780.92673285475 | 3569.8534657095 | 3584.09619223356 |
| 3 | ARIMA(4,2,0) | Home value | 49859.9552256206 | -1775.51214578929 | 3561.02429157858 | 3578.82769973366 |
| 4 | ARIMA(6,2,0) | Home value | 46951.6410027904 | -1767.87902494625 | 3549.7580498925 | 3574.68282130961 |
| 5 | ARIMA(12,2,0) | Home value | 42855.4641643852 | -1756.5986295862 | 3539.19725917239 | 3585.4861203756 |
| 6 | ARIMA(15,2,0) | Home value | 39563.6255128961 | -1746.94105520269 | 3525.88211040538 | 3582.85301650163 |
| 7 | ARIMA(18,2,1) | Home value | 35325.8882484904 | -1733.39468462616 | 3506.78936925232 | 3578.00300187263 |
We can tell from the model output that Model 1, ARIMA(18,2,0) on home value, performs best, as it has the smallest AIC and the smallest BIC.
#auto.arima for home value
mod1_auto = auto.arima(cinhome_second_diff$home_value,max.p=20,max.q=20,max.order=20,stepwise=FALSE,approximation=FALSE,stationary=FALSE)
The auto.arima output above also suggests that the best model for home value is ARIMA(18,2,0).
Now we want to look at the residuals of our best model.
# Diagnostics from the Model ---------------------
# Examine residuals - the Ljung-Box tests below indicate no residual autocorrelation
checkresiduals(mod1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
##
## Model df: 18. Total lags used: 21
checkresiduals(mod1_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
##
## Model df: 18. Total lags used: 21
Box.test(mod1$residuals,type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1$residuals,type = 'Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1$residuals,type = 'Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652
The Ljung-Box tests above do not indicate an autocorrelation problem in the residuals: for both mod1 and mod1_auto (the same ARIMA(18,2,0) specification on home value), the p-values are above 0.05.
Box.test(mod1_auto$residuals,type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652
in_sample_pred = cinhome_second_diff$home_value-mod1_auto$residuals
# if a Box-Cox transform had been used, predictions would need to be back-transformed:
# in_sample_pred = InvBoxCox(in_sample_pred, lambda = best_lambda)
# RMSE Calculation
# In-sample RMSE is about 188, which is small relative to home values in the $130k-$240k range
RMSE = sqrt(mean((in_sample_pred - cinhome_second_diff$home_value)^2,na.rm=T))
RMSE
## [1] 187.6045
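Since RMSE is scale-dependent, a scale-free companion metric (a sketch, not part of the original analysis) is the in-sample MAPE:
# In-sample mean absolute percentage error for the ARIMA(18,2,0) fit
MAPE = mean(abs(in_sample_pred - cinhome_second_diff$home_value) /
              cinhome_second_diff$home_value, na.rm = TRUE) * 100
MAPE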
Now we want to predict using our best model and get predictions for 12 steps (a year) ahead:
# Get predictions for 12 steps ahead
prediction = predict(mod1_auto,n.ahead=12)
# Create prediction dataset for the next 12 months
pred_data = data.frame(
pred = prediction$pred,
se = prediction$se,
new_month = seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=12)
)
pred_merge = cinhome_second_diff %>%
mutate(in_sample_pred = in_sample_pred) %>%
full_join(pred_data) %>%
mutate(
pred_high = pred+2*se, # approximate 95% prediction interval bounds
pred_low = pred - 2*se,
pred = pred,
error = pred - home_value
)
# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>%
mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
ggplot()+
geom_line(aes(ymd(new_month),home_value))+
geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
geom_line(aes(ymd(new_month),pred),color='blue')+
geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)
We are still focusing on the monthly home value data for Cincinnati. To fit a Prophet model, we need to rename our columns: the date variable must be called ds and the time series variable (home_value) must be called y.
cin_home_value<- home_value%>%
filter(RegionID==394466)%>%
pivot_longer(6:269,names_to="date",values_to="home_value")%>%
mutate(new_month= as.Date(date, "%Y-%m-%d"))
#change to prophet format data
prophet_data = cin_home_value %>%
rename(ds = new_month, # Have to name our date variable "ds"
y = home_value) # Have to name our time series "y"
prophet_data%>%
head(5)%>%
kbl(caption = "Table: Prophet Format Data: Cincinnati Home Value")%>%
kable_classic(full_width = F, html_font = "Cambria")
| RegionID | SizeRank | RegionName | RegionType | StateName | date | y | ds |
|---|---|---|---|---|---|---|---|
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-01-31 | 129379 | 2000-01-31 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-02-29 | 129364 | 2000-02-29 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-03-31 | 129290 | 2000-03-31 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-04-30 | 129780 | 2000-04-30 |
| 394466 | 28 | Cincinnati, OH | Msa | OH | 2000-05-31 | 130454 | 2000-05-31 |
Prophet is usually used with daily data; fitting monthly data also works, but in that case we are asking for daily forecasts. We will start with a basic Prophet forecast using the default daily frequency.
We take the months before 2017-08-31 as the training set (approximately 80%) and the months from 2017-08-31 onward as the test set (approximately 20%). We can plot an interactive forecast plot after modeling.
Additionally, we will plot the forecast home value against the actual home value for comparison.
train1 = prophet_data %>%
filter(ds<ymd("2017-08-31"))
test1 = prophet_data %>%
filter(ds>=ymd("2017-08-31"))
model1 = prophet(train1)
future1 = make_future_dataframe(model1,periods = 1583)#freq=daily; 53 month ahead
forecast1 = predict(model1,future1)
#plot interactive forecast plot
dyplot.prophet(model1,forecast1)
#plot actual vs forecast line for comparison
ggplot()+
geom_line(data=forecast1,aes(x=as.Date(ds),y=yhat,col="blue"))+
geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
scale_color_identity(name = "Legend", #for manual legend
breaks = c("red","blue"),
labels = c("Actual","Forecast"),
guide = "legend")+
labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
x = "Date",
y = "Cincinnati Home Value")
As noticed above, the output is strange: the forecast is not smooth and it generates daily rather than monthly forecasts. Now we will change the frequency to monthly.
train = prophet_data %>%
filter(ds<ymd("2017-08-31"))
test = prophet_data %>%
filter(ds>=ymd("2017-08-31"))
model = prophet(train)
future = make_future_dataframe(model,periods = 53,freq = 'month')
forecast = predict(model,future)
#plot interactive forecast plot
dyplot.prophet(model,forecast)
#plot actual vs forecast line for comparison
ggplot()+
geom_line(data=forecast,aes(x=as.Date(ds),y=yhat,col="blue"))+
geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
scale_color_identity(name = "Legend", #for manual legend
breaks = c("red","blue"),
labels = c("Actual","Forecast"),
guide = "legend")+
labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
x = "Date",
y = "Cincinnati Home Value")
The plot now looks smoother. However, we can see that the predicted home value is clearly lower than the actual home value after 2017-08-31. Thus, we will look into the trend and seasonality impact on home_value:
prophet_plot_components(model,forecast)
We can see from the decomposition plot above that there is a clear trend, but seasonality is much less clear. This may be because the home value dataset is seasonally adjusted and includes only the middle price tier of homes. However, there might be a cyclical component that follows economic patterns (e.g., the 2008 financial crisis, the 2019-2020 COVID pandemic). Now let's check the changepoints:
Specifying changepoints lets us adjust the flexibility of the fit. We can examine the changepoints identified for the trend component of the time series in the home value plot above. There are at least 5 changepoints (one around 2005, two around 2010, one around 2015, and one around 2019):
model_changepoint = prophet(train,n.changepoints=5)
future_changepoint = make_future_dataframe(model_changepoint,periods = 53,freq = 'month')
forecast_changepoint = predict(model_changepoint,future_changepoint)
plot(model_changepoint,forecast_changepoint)+
add_changepoints_to_plot(model_changepoint)+
theme_bw()+xlab("Date")+ylab("Home value")
prophet_plot_components(model_changepoint,forecast_changepoint)
We notice from the trend plot that setting the number of changepoints to 5 roughly captures the trend changes. If we set a higher number of changepoints, there is a chance of overfitting.
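To see exactly where those five changepoints were placed (a sketch; the fitted Prophet object stores the candidate changepoint dates in its changepoints element):
# Dates at which the trend slope is allowed to change
model_changepoint$changepoints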
From the forecast output above, the predicted values are all above 0, so in principle our model does not need a saturating point for home value. However, if we use logistic growth with a floor, we must also specify a maximum capacity (cap). Thus, we set reasonable minimum and maximum values: the Cincinnati typical home value is unlikely to fall below $0 or rise above $0.3 million.
train_saturation = prophet_data %>%
filter(ds<ymd("2017-08-31"))
test_saturation = prophet_data %>%
filter(ds>=ymd("2017-08-31"))
model_saturation = prophet(train_saturation)
future_saturation = make_future_dataframe(model_saturation,periods = 53,freq = 'month')#freq=monthly; 53 month ahead
# Set floor and cap in training data
train_saturation$floor = 0
train_saturation$cap=300000
# Set floor and cap in forecast data
future_saturation$floor = 0
future_saturation$cap=300000
sat_model = prophet(train_saturation,growth='logistic')
sat_forecast = predict(sat_model,future_saturation)
plot(sat_model,sat_forecast)+ylim(0,300000)+
theme_bw()+xlab("Date")+ylab("Home Value")
From the plot, we notice that all the home values stay within the range of $0.1-0.3 million, well inside the floor and cap. Therefore, we do not really need to specify saturation points.
From the decomposition above, we noticed that there is little yearly seasonality. This may be because this value is a smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. Now we will try to identify daily and weekly seasonality, as well as additive vs. multiplicative seasonality:
We fit model_season with the daily.seasonality and weekly.seasonality options set to TRUE to check for daily and weekly seasonality:
model_season = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
future_season = make_future_dataframe(model_season,periods = 1583)#freq=daily; 53 month ahead
forecast_season = predict(model_season,future_season)
#plot interactive forecast plot
dyplot.prophet(model_season,forecast_season)
#decompose
prophet_plot_components(model_season,forecast_season)
Daily and weekly seasonality are also not very obvious. Now we will look at additive vs. multiplicative seasonality:
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)
prophet_plot_components(additive,add_fcst)
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)
prophet_plot_components(multi,multi_fcst)
The decomposition plots under additive and multiplicative seasonality appear similar: seasonality is not obvious. Thus, we might only have a trend and no seasonality.
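If we wanted to encode that conclusion directly (a sketch, not fitted in the original analysis), Prophet allows yearly seasonality to be switched off so the model fits the trend only:
# Trend-only Prophet model with yearly seasonality disabled
model_trend_only = prophet(train, yearly.seasonality = FALSE)
forecast_trend_only = predict(model_trend_only, future)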
We also want to try including holidays. The results may not be very accurate for specific holidays, since we are examining monthly rather than daily data:
#Can use built-in holidays, or specify our own
model_holiday = prophet(train,fit=FALSE,daily.seasonality = TRUE, weekly.seasonality = TRUE)
model_holiday = add_country_holidays(model_holiday,country_name = 'US')
model_holiday = fit.prophet(model_holiday,train)
forecast_holiday = predict(model_holiday,future1)
dyplot.prophet(model_holiday,forecast_holiday)
prophet_plot_components(model_holiday,forecast_holiday)
The fitted line looks very good. Additionally, the plots indicate a few important days over the years; these days may correspond to the changepoints in the trend. Now we can look into those days:
forecast_holiday %>%
filter(holidays != 0) %>%
select_if(~ !is.numeric(.) || sum(.) != 0)%>%
select(ds,holidays,`Memorial Day`,`New Year's Day (Observed)`)%>%
kbl(caption = "Table: Important Holidays")%>%
kable_classic(full_width = F, html_font = "Cambria")
| ds | holidays | Memorial Day | New Year’s Day (Observed) |
|---|---|---|---|
| 2004-05-31 | 1267.9140 | 1267.914 | 0.0000 |
| 2004-12-31 | -399.0302 | 0.000 | -399.0302 |
| 2010-05-31 | 1267.9140 | 1267.914 | 0.0000 |
| 2010-12-31 | -399.0302 | 0.000 | -399.0302 |
| 2018-05-28 | 1267.9140 | 1267.914 | 0.0000 |
| 2019-05-27 | 1267.9140 | 1267.914 | 0.0000 |
| 2020-05-25 | 1267.9140 | 1267.914 | 0.0000 |
| 2021-05-31 | 1267.9140 | 1267.914 | 0.0000 |
The returned result seems reasonable:
The important positive holiday effect is Memorial Day, which falls in May, when the market is in its peak season and buyers are active. In contrast, the negative holiday effect is New Year's Day, when the housing market is in its slow season. Moreover, the negative New Year's Day effect shows up in 2004 and 2010; around 2010-2011 the market was still dealing with the crash that followed the 2008 financial crisis.
forecast_holiday %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
anom_train = prophet_data %>%
filter(ds<ymd("2017-08-31"))
anom_test = prophet_data %>%
filter(ds>=ymd("2017-08-31"))
anom_model = prophet(anom_train)
anom_future = make_future_dataframe(anom_model,periods = 53,freq = 'month')
forecast = predict(anom_model,anom_future)
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2017-08-31")) %>%
mutate(y = anom_test$y) %>%
mutate(anomaly = if_else(y>yhat_upper | y<yhat_lower,TRUE,as.logical(NA)))
forecast_plot_data %>%
ggplot()+
geom_line(aes(ds,y))+
geom_point(aes(ds,y,color=anomaly))+
geom_line(aes(ds,yhat),color='red')+
geom_ribbon(aes(ds,ymin=yhat_lower,ymax=yhat_upper),alpha=0.4,fill='blue') +
scale_colour_discrete(na.translate = F)
We can see that the actual values in the test set are much higher than the predicted values, so nearly all test points are flagged as anomalies. However, we cannot simply delete these data points, as they were generated during the COVID period, which is itself an anomalous period for the business trend.
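A quick tally of the flagged points confirms this (a sketch using the forecast_plot_data frame built above):
# Count how many test-period months fall outside the Prophet uncertainty band
forecast_plot_data %>% count(anomaly)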
arima_train = cinhome_second_diff %>%
filter(new_month<ymd("2017-08-31"))
arima_test = cinhome_second_diff %>%
filter(new_month>=ymd("2017-08-31"))
train_mod1 = arima(arima_train$home_value,order=c(18,2,0))
test_pred1 = predict(train_mod1,53)$pred
error1 = arima_test$home_value - test_pred1
# Get predictions for 53 steps (months) ahead
prediction = predict(train_mod1,n.ahead=53)
# Create prediction dataset for the next 53 months
pred_data = data.frame(
pred = prediction$pred,
se = prediction$se,
new_month = seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=53)
)
pred_merge = cinhome_second_diff %>%
mutate(in_sample_pred = in_sample_pred) %>%
full_join(pred_data) %>%
mutate(
pred_high = pred+2*se, # approximate 95% prediction interval bounds
pred_low = pred - 2*se,
pred = pred,
error = pred - home_value
)
## Joining, by = "new_month"
# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>%
mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
ggplot()+
geom_line(aes(ymd(new_month),home_value))+
geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
geom_line(aes(ymd(new_month),pred),color='blue')+
geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)+
geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
scale_color_identity(name = "Legend", #for manual legend
breaks = c("red","blue"),
labels = c("Actual","Forecast"),
guide = "legend")+
labs(title = "Cincy Home Value: Actual value vs ARIMA(18,2,0) forecast",
x = "Date",
y = "Cincinnati Home Value")
## Warning: Removed 52 row(s) containing missing values (geom_path).
## Warning: Removed 52 rows containing missing values (geom_point).
## Warning: Removed 260 row(s) containing missing values (geom_path).
Out of sample RMSE:
out_RMSE = sqrt(mean((arima_test$home_value - test_pred1)^2))
out_RMSE
## [1] 16330.02
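For comparison with the ARIMA number above, the same out-of-sample RMSE can be computed for the basic monthly Prophet model (a sketch reusing forecast_plot_data, which already aligns the Prophet forecast with the test window):
# Out-of-sample RMSE of the Prophet forecast over the 2017-2021 test window
sqrt(mean((forecast_plot_data$y - forecast_plot_data$yhat)^2))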
We want to use rolling-window cross-validation to assess the performance of the model at horizons that are meaningful for the data (e.g., one year out for monthly data), and then compare various metrics (e.g., RMSE, MAE, MAPE).
df.cv_holiday <- cross_validation(model_holiday, initial = 4*365,horizon = 365, units = 'days')
unique(df.cv_holiday$cutoff)%>%
kbl(caption = "Table: Holidays model CV cutoff")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Cutoff |
|---|
| 2004-02-02 12:00:00 |
| 2004-08-03 00:00:00 |
| 2005-02-01 12:00:00 |
| 2005-08-03 00:00:00 |
| 2006-02-01 12:00:00 |
| 2006-08-03 00:00:00 |
| 2007-02-01 12:00:00 |
| 2007-08-03 00:00:00 |
| 2008-02-01 12:00:00 |
| 2008-08-02 00:00:00 |
| 2009-01-31 12:00:00 |
| 2009-08-02 00:00:00 |
| 2010-01-31 12:00:00 |
| 2010-08-02 00:00:00 |
| 2011-01-31 12:00:00 |
| 2011-08-02 00:00:00 |
| 2012-01-31 12:00:00 |
| 2012-08-01 00:00:00 |
| 2013-01-30 12:00:00 |
| 2013-08-01 00:00:00 |
| 2014-01-30 12:00:00 |
| 2014-08-01 00:00:00 |
| 2015-01-30 12:00:00 |
| 2015-08-01 00:00:00 |
| 2016-01-30 12:00:00 |
| 2016-07-31 00:00:00 |
df.cv_holiday %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Home value")+
scale_color_discrete(name = 'Cutoff')
performance_metrics(df.cv_holiday)%>%
kbl(caption = "Table: Holidays model CV performance metrics")%>%
kable_classic(full_width = F, html_font = "Cambria")
| horizon | mse | rmse | mae | mape | mdape | smape | coverage |
|---|---|---|---|---|---|---|---|
| 1380 hours | 7085812 | 2661.919 | 1847.714 | 0.0127661 | 0.0088902 | 0.0128621 | 0.2016129 |
| 1392 hours | 6830350 | 2613.494 | 1828.086 | 0.0125604 | 0.0088902 | 0.0126436 | 0.2258065 |
| 1404 hours | 7929313 | 2815.904 | 2004.096 | 0.0137685 | 0.0089556 | 0.0138474 | 0.2258065 |
| 1416 hours | 8326729 | 2885.607 | 2134.955 | 0.0147409 | 0.0099573 | 0.0148185 | 0.2580645 |
| 1428 hours | 8925242 | 2987.514 | 2227.183 | 0.0154213 | 0.0099573 | 0.0155174 | 0.2661290 |
| 1440 hours | 9782313 | 3127.669 | 2321.821 | 0.0161352 | 0.0109473 | 0.0162752 | 0.2661290 |
| 1452 hours | 9628944 | 3103.054 | 2311.548 | 0.0160471 | 0.0109473 | 0.0161895 | 0.2580645 |
| 1464 hours | 9701966 | 3114.798 | 2331.786 | 0.0161688 | 0.0121212 | 0.0163127 | 0.2580645 |
| 2100 hours | 8151873 | 2855.149 | 2072.723 | 0.0142187 | 0.0109473 | 0.0143113 | 0.3225806 |
| 2124 hours | 10653239 | 3263.930 | 2488.605 | 0.0170862 | 0.0125826 | 0.0171767 | 0.2500000 |
| 2136 hours | 10717861 | 3273.815 | 2522.003 | 0.0172941 | 0.0137316 | 0.0173834 | 0.2661290 |
| 2148 hours | 11607770 | 3407.018 | 2606.955 | 0.0180729 | 0.0128247 | 0.0182391 | 0.2903226 |
| 2160 hours | 11352351 | 3369.325 | 2607.621 | 0.0180847 | 0.0137316 | 0.0182502 | 0.2661290 |
| 2172 hours | 11216270 | 3349.070 | 2586.079 | 0.0179203 | 0.0128247 | 0.0180888 | 0.2580645 |
| 2184 hours | 11792074 | 3433.959 | 2619.637 | 0.0181315 | 0.0128247 | 0.0183149 | 0.2580645 |
| 2208 hours | 11505619 | 3391.993 | 2605.975 | 0.0179717 | 0.0128247 | 0.0181463 | 0.2500000 |
| 2844 hours | 10081178 | 3175.087 | 2346.764 | 0.0160457 | 0.0128247 | 0.0161706 | 0.3225806 |
| 2856 hours | 10113834 | 3180.225 | 2369.322 | 0.0161690 | 0.0128247 | 0.0162876 | 0.3145161 |
| 2868 hours | 10650577 | 3263.522 | 2476.660 | 0.0169122 | 0.0128247 | 0.0170360 | 0.2903226 |
| 2880 hours | 10977307 | 3313.202 | 2621.293 | 0.0180028 | 0.0151552 | 0.0181192 | 0.2419355 |
| 2892 hours | 12358409 | 3515.453 | 2777.737 | 0.0191477 | 0.0151552 | 0.0193032 | 0.2338710 |
| 2904 hours | 14357527 | 3789.133 | 2931.750 | 0.0202899 | 0.0172276 | 0.0205209 | 0.2258065 |
| 2916 hours | 14253307 | 3775.355 | 2913.183 | 0.0201432 | 0.0150671 | 0.0203787 | 0.2258065 |
| 2928 hours | 14524051 | 3811.043 | 2968.882 | 0.0204882 | 0.0172276 | 0.0207292 | 0.2258065 |
| 3564 hours | 12279925 | 3504.272 | 2647.313 | 0.0180714 | 0.0150671 | 0.0182375 | 0.2903226 |
| 3588 hours | 15215440 | 3900.697 | 3084.814 | 0.0211192 | 0.0181413 | 0.0212932 | 0.2177419 |
| 3600 hours | 15354601 | 3918.495 | 3130.387 | 0.0214060 | 0.0182110 | 0.0215772 | 0.2096774 |
| 3612 hours | 17190227 | 4146.110 | 3253.759 | 0.0224584 | 0.0182110 | 0.0227317 | 0.2258065 |
| 3624 hours | 16918182 | 4113.172 | 3218.015 | 0.0222109 | 0.0187418 | 0.0224809 | 0.2580645 |
| 3636 hours | 16781197 | 4096.486 | 3194.597 | 0.0220248 | 0.0152658 | 0.0223010 | 0.2580645 |
| 3648 hours | 17269927 | 4155.710 | 3217.739 | 0.0221652 | 0.0156254 | 0.0224531 | 0.2258065 |
| 3672 hours | 17037972 | 4127.708 | 3227.050 | 0.0221259 | 0.0156254 | 0.0224043 | 0.2258065 |
| 4308 hours | 14889876 | 3858.740 | 2929.773 | 0.0199010 | 0.0156254 | 0.0201010 | 0.3225806 |
| 4332 hours | 17594367 | 4194.564 | 3322.505 | 0.0226795 | 0.0203274 | 0.0228883 | 0.2500000 |
| 4344 hours | 15710286 | 3963.620 | 3077.467 | 0.0209027 | 0.0172238 | 0.0210921 | 0.2903226 |
| 4356 hours | 19101619 | 4370.540 | 3391.115 | 0.0233096 | 0.0187418 | 0.0236241 | 0.2741935 |
| 4368 hours | 17969227 | 4239.013 | 3358.235 | 0.0230029 | 0.0222755 | 0.0232267 | 0.2661290 |
| 4380 hours | 17227279 | 4150.576 | 3303.528 | 0.0225634 | 0.0222755 | 0.0227665 | 0.2580645 |
| 4392 hours | 19851147 | 4455.463 | 3518.704 | 0.0241879 | 0.0175517 | 0.0245044 | 0.2258065 |
| 4416 hours | 20588467 | 4537.452 | 3617.151 | 0.0247918 | 0.0234997 | 0.0251231 | 0.2258065 |
| 5016 hours | 18019475 | 4244.935 | 3322.038 | 0.0225760 | 0.0175517 | 0.0228247 | 0.2903226 |
| 5040 hours | 19584840 | 4425.476 | 3542.102 | 0.0240430 | 0.0234997 | 0.0242296 | 0.2741935 |
| 5052 hours | 18301392 | 4278.013 | 3338.617 | 0.0225972 | 0.0197556 | 0.0227711 | 0.3225806 |
| 5064 hours | 21050377 | 4588.069 | 3610.164 | 0.0247866 | 0.0234997 | 0.0250675 | 0.2903226 |
| 5076 hours | 21696987 | 4658.003 | 3789.650 | 0.0259998 | 0.0259075 | 0.0262356 | 0.2419355 |
| 5088 hours | 20711486 | 4550.987 | 3754.801 | 0.0255614 | 0.0259075 | 0.0257638 | 0.2258065 |
| 5100 hours | 24205695 | 4919.928 | 3995.778 | 0.0273792 | 0.0259075 | 0.0277396 | 0.2258065 |
| 5124 hours | 24254011 | 4924.836 | 4004.360 | 0.0274321 | 0.0259075 | 0.0277935 | 0.2258065 |
| 5760 hours | 21769420 | 4665.771 | 3698.630 | 0.0251265 | 0.0197556 | 0.0254030 | 0.2903226 |
| 5772 hours | 20421540 | 4519.020 | 3499.653 | 0.0237404 | 0.0197556 | 0.0239631 | 0.3440860 |
| 5784 hours | 21911436 | 4680.965 | 3639.825 | 0.0247012 | 0.0191914 | 0.0249144 | 0.3306452 |
| 5796 hours | 23601327 | 4858.120 | 3958.128 | 0.0269945 | 0.0259075 | 0.0272247 | 0.2500000 |
| 5808 hours | 25180261 | 5017.994 | 4119.196 | 0.0281903 | 0.0223888 | 0.0284723 | 0.2258065 |
| 5820 hours | 27535914 | 5247.467 | 4224.078 | 0.0290345 | 0.0223888 | 0.0294426 | 0.2258065 |
| 5832 hours | 27644336 | 5257.788 | 4235.336 | 0.0290310 | 0.0223888 | 0.0294559 | 0.2258065 |
| 5844 hours | 27326546 | 5227.480 | 4203.571 | 0.0288354 | 0.0209749 | 0.0292542 | 0.2258065 |
| 6480 hours | 23858964 | 4884.564 | 3851.119 | 0.0261895 | 0.0205783 | 0.0264908 | 0.2661290 |
| 6504 hours | 26782552 | 5175.186 | 4169.212 | 0.0282877 | 0.0248724 | 0.0285277 | 0.2473118 |
| 6516 hours | 26994411 | 5195.615 | 4210.219 | 0.0285457 | 0.0260384 | 0.0287818 | 0.2500000 |
| 6528 hours | 29529574 | 5434.112 | 4456.319 | 0.0305953 | 0.0224780 | 0.0310035 | 0.2016129 |
| 6540 hours | 29242776 | 5407.659 | 4454.195 | 0.0306165 | 0.0260384 | 0.0310294 | 0.1935484 |
| 6552 hours | 29074404 | 5392.069 | 4442.977 | 0.0303074 | 0.0260384 | 0.0307263 | 0.1935484 |
| 6564 hours | 30811671 | 5550.826 | 4527.305 | 0.0308395 | 0.0260384 | 0.0313125 | 0.1935484 |
| 6588 hours | 29796426 | 5458.610 | 4469.971 | 0.0303514 | 0.0252500 | 0.0307951 | 0.1935484 |
| 7224 hours | 26723551 | 5169.483 | 4124.258 | 0.0278534 | 0.0252500 | 0.0281878 | 0.2580645 |
| 7236 hours | 26469106 | 5144.814 | 4052.574 | 0.0273652 | 0.0252500 | 0.0276596 | 0.2795699 |
| 7248 hours | 27369799 | 5231.615 | 4161.627 | 0.0281096 | 0.0224780 | 0.0284085 | 0.2661290 |
| 7260 hours | 28774453 | 5364.182 | 4437.877 | 0.0301684 | 0.0295566 | 0.0304749 | 0.1854839 |
| 7272 hours | 31675690 | 5628.116 | 4639.708 | 0.0316485 | 0.0273066 | 0.0320463 | 0.1612903 |
| 7284 hours | 36247338 | 6020.576 | 4872.729 | 0.0333665 | 0.0295566 | 0.0339473 | 0.1612903 |
| 7296 hours | 36729861 | 6060.517 | 4903.854 | 0.0334686 | 0.0295566 | 0.0340743 | 0.1612903 |
| 7308 hours | 36430203 | 6035.744 | 4890.075 | 0.0333955 | 0.0289497 | 0.0339958 | 0.1612903 |
| 7944 hours | 31867621 | 5645.141 | 4469.368 | 0.0302531 | 0.0252500 | 0.0306949 | 0.2258065 |
| 7968 hours | 34731632 | 5893.355 | 4790.511 | 0.0323881 | 0.0301323 | 0.0327692 | 0.2150538 |
| 7980 hours | 35127034 | 5926.806 | 4856.017 | 0.0328047 | 0.0319223 | 0.0331780 | 0.2096774 |
| 7992 hours | 39436339 | 6279.836 | 5174.038 | 0.0353709 | 0.0289497 | 0.0359697 | 0.1693548 |
| 8004 hours | 39089330 | 6252.146 | 5143.341 | 0.0352034 | 0.0319223 | 0.0357977 | 0.1612903 |
| 8016 hours | 38828669 | 6231.265 | 5147.085 | 0.0349462 | 0.0319223 | 0.0355429 | 0.1612903 |
| 8028 hours | 40680158 | 6378.100 | 5251.867 | 0.0356294 | 0.0319223 | 0.0362751 | 0.1612903 |
| 8052 hours | 39581249 | 6291.363 | 5210.728 | 0.0352145 | 0.0319223 | 0.0358262 | 0.1612903 |
| 8688 hours | 35425308 | 5951.916 | 4810.196 | 0.0323424 | 0.0319223 | 0.0328037 | 0.2258065 |
| 8712 hours | 38865763 | 6234.241 | 5151.744 | 0.0347855 | 0.0349362 | 0.0352124 | 0.2150538 |
| 8724 hours | 36237637 | 6019.770 | 4868.439 | 0.0327822 | 0.0264664 | 0.0331871 | 0.2338710 |
| 8736 hours | 42574661 | 6524.926 | 5351.314 | 0.0365124 | 0.0349362 | 0.0371695 | 0.1774194 |
| 8748 hours | 39604089 | 6293.178 | 5186.128 | 0.0352838 | 0.0332404 | 0.0357510 | 0.1935484 |
| 8760 hours | 39307861 | 6269.598 | 5196.775 | 0.0350580 | 0.0332404 | 0.0355287 | 0.1935484 |
plot_cross_validation_metric(df.cv_holiday, metric = 'rmse')
The CV performance does not look great. Now let's compare the other models:
Model Comparison
Model 1: changepoint-in-trend model with freq = 'month' and n.changepoints = 5;
Model 2: seasonality model with daily frequency and daily/weekly seasonality set to TRUE;
Model 3: additive seasonality model;
Model 4: multiplicative seasonality model.
#mod1
mod1 = prophet(train,n.changepoints=5)
mod1_future = make_future_dataframe(mod1,periods = 53,freq = 'month')
mod1_forecast = predict(mod1,mod1_future)
df_cv_mod1 <- cross_validation(mod1, initial = 4*365,horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv_mod1) %>%
mutate(model = 'mod1')
#mod2
mod2 = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
mod2_future = make_future_dataframe(mod2,periods = 1583)#freq=daily; 53 month ahead
mod2_forecast = predict(mod2,mod2_future)
df_cv_mod2 <- cross_validation(mod2, initial = 4*365,horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv_mod2) %>%
mutate(model = "mod2")
metrics1 %>%
bind_rows(metrics2) %>%
ggplot()+
geom_line(aes(horizon,rmse,color=model))
#mod3
mod3 = prophet(train,seasonality.mode='additive')
mod3_forecast = predict(mod3,future)
df_cv_mod3 <- cross_validation(mod3, initial = 4*365,horizon = 365, units = 'days')
metrics3 = performance_metrics(df_cv_mod3) %>%
mutate(model = 'mod3')
#mod4
mod4 = prophet(train,seasonality.mode='multiplicative')
forecast4 = predict(mod4,future)
df_cv_mod4 <- cross_validation(mod4, initial = 4*365,horizon = 365, units = 'days')
metrics4 = performance_metrics(df_cv_mod4) %>%
mutate(model = "mod4")
metrics3 %>%
bind_rows(metrics4) %>%
ggplot()+
geom_line(aes(horizon,rmse,color=model))
# Note: this call re-tabulates df.cv_holiday, so the table below repeats the
# holiday-model metrics shown above; Model 1's metrics would come from
# performance_metrics(df_cv_mod1).
performance_metrics(df.cv_holiday)%>%
kbl(caption = "Table: Holiday-model CV performance metrics")%>%
kable_classic(full_width = F, html_font = "Cambria")
| horizon | mse | rmse | mae | mape | mdape | smape | coverage |
|---|---|---|---|---|---|---|---|
| 1380 hours | 7085812 | 2661.919 | 1847.714 | 0.0127661 | 0.0088902 | 0.0128621 | 0.2016129 |
| 1392 hours | 6830350 | 2613.494 | 1828.086 | 0.0125604 | 0.0088902 | 0.0126436 | 0.2258065 |
| 1404 hours | 7929313 | 2815.904 | 2004.096 | 0.0137685 | 0.0089556 | 0.0138474 | 0.2258065 |
| 1416 hours | 8326729 | 2885.607 | 2134.955 | 0.0147409 | 0.0099573 | 0.0148185 | 0.2580645 |
| 1428 hours | 8925242 | 2987.514 | 2227.183 | 0.0154213 | 0.0099573 | 0.0155174 | 0.2661290 |
| 1440 hours | 9782313 | 3127.669 | 2321.821 | 0.0161352 | 0.0109473 | 0.0162752 | 0.2661290 |
| 1452 hours | 9628944 | 3103.054 | 2311.548 | 0.0160471 | 0.0109473 | 0.0161895 | 0.2580645 |
| 1464 hours | 9701966 | 3114.798 | 2331.786 | 0.0161688 | 0.0121212 | 0.0163127 | 0.2580645 |
| 2100 hours | 8151873 | 2855.149 | 2072.723 | 0.0142187 | 0.0109473 | 0.0143113 | 0.3225806 |
| 2124 hours | 10653239 | 3263.930 | 2488.605 | 0.0170862 | 0.0125826 | 0.0171767 | 0.2500000 |
| 2136 hours | 10717861 | 3273.815 | 2522.003 | 0.0172941 | 0.0137316 | 0.0173834 | 0.2661290 |
| 2148 hours | 11607770 | 3407.018 | 2606.955 | 0.0180729 | 0.0128247 | 0.0182391 | 0.2903226 |
| 2160 hours | 11352351 | 3369.325 | 2607.621 | 0.0180847 | 0.0137316 | 0.0182502 | 0.2661290 |
| 2172 hours | 11216270 | 3349.070 | 2586.079 | 0.0179203 | 0.0128247 | 0.0180888 | 0.2580645 |
| 2184 hours | 11792074 | 3433.959 | 2619.637 | 0.0181315 | 0.0128247 | 0.0183149 | 0.2580645 |
| 2208 hours | 11505619 | 3391.993 | 2605.975 | 0.0179717 | 0.0128247 | 0.0181463 | 0.2500000 |
| 2844 hours | 10081178 | 3175.087 | 2346.764 | 0.0160457 | 0.0128247 | 0.0161706 | 0.3225806 |
| 2856 hours | 10113834 | 3180.225 | 2369.322 | 0.0161690 | 0.0128247 | 0.0162876 | 0.3145161 |
| 2868 hours | 10650577 | 3263.522 | 2476.660 | 0.0169122 | 0.0128247 | 0.0170360 | 0.2903226 |
| 2880 hours | 10977307 | 3313.202 | 2621.293 | 0.0180028 | 0.0151552 | 0.0181192 | 0.2419355 |
| 2892 hours | 12358409 | 3515.453 | 2777.737 | 0.0191477 | 0.0151552 | 0.0193032 | 0.2338710 |
| 2904 hours | 14357527 | 3789.133 | 2931.750 | 0.0202899 | 0.0172276 | 0.0205209 | 0.2258065 |
| 2916 hours | 14253307 | 3775.355 | 2913.183 | 0.0201432 | 0.0150671 | 0.0203787 | 0.2258065 |
| 2928 hours | 14524051 | 3811.043 | 2968.882 | 0.0204882 | 0.0172276 | 0.0207292 | 0.2258065 |
| 3564 hours | 12279925 | 3504.272 | 2647.313 | 0.0180714 | 0.0150671 | 0.0182375 | 0.2903226 |
| 3588 hours | 15215440 | 3900.697 | 3084.814 | 0.0211192 | 0.0181413 | 0.0212932 | 0.2177419 |
| 3600 hours | 15354601 | 3918.495 | 3130.387 | 0.0214060 | 0.0182110 | 0.0215772 | 0.2096774 |
| 3612 hours | 17190227 | 4146.110 | 3253.759 | 0.0224584 | 0.0182110 | 0.0227317 | 0.2258065 |
| 3624 hours | 16918182 | 4113.172 | 3218.015 | 0.0222109 | 0.0187418 | 0.0224809 | 0.2580645 |
| 3636 hours | 16781197 | 4096.486 | 3194.597 | 0.0220248 | 0.0152658 | 0.0223010 | 0.2580645 |
| 3648 hours | 17269927 | 4155.710 | 3217.739 | 0.0221652 | 0.0156254 | 0.0224531 | 0.2258065 |
| 3672 hours | 17037972 | 4127.708 | 3227.050 | 0.0221259 | 0.0156254 | 0.0224043 | 0.2258065 |
| 4308 hours | 14889876 | 3858.740 | 2929.773 | 0.0199010 | 0.0156254 | 0.0201010 | 0.3225806 |
| 4332 hours | 17594367 | 4194.564 | 3322.505 | 0.0226795 | 0.0203274 | 0.0228883 | 0.2500000 |
| 4344 hours | 15710286 | 3963.620 | 3077.467 | 0.0209027 | 0.0172238 | 0.0210921 | 0.2903226 |
| 4356 hours | 19101619 | 4370.540 | 3391.115 | 0.0233096 | 0.0187418 | 0.0236241 | 0.2741935 |
| 4368 hours | 17969227 | 4239.013 | 3358.235 | 0.0230029 | 0.0222755 | 0.0232267 | 0.2661290 |
| 4380 hours | 17227279 | 4150.576 | 3303.528 | 0.0225634 | 0.0222755 | 0.0227665 | 0.2580645 |
| 4392 hours | 19851147 | 4455.463 | 3518.704 | 0.0241879 | 0.0175517 | 0.0245044 | 0.2258065 |
| 4416 hours | 20588467 | 4537.452 | 3617.151 | 0.0247918 | 0.0234997 | 0.0251231 | 0.2258065 |
| 5016 hours | 18019475 | 4244.935 | 3322.038 | 0.0225760 | 0.0175517 | 0.0228247 | 0.2903226 |
| 5040 hours | 19584840 | 4425.476 | 3542.102 | 0.0240430 | 0.0234997 | 0.0242296 | 0.2741935 |
| 5052 hours | 18301392 | 4278.013 | 3338.617 | 0.0225972 | 0.0197556 | 0.0227711 | 0.3225806 |
| 5064 hours | 21050377 | 4588.069 | 3610.164 | 0.0247866 | 0.0234997 | 0.0250675 | 0.2903226 |
| 5076 hours | 21696987 | 4658.003 | 3789.650 | 0.0259998 | 0.0259075 | 0.0262356 | 0.2419355 |
| 5088 hours | 20711486 | 4550.987 | 3754.801 | 0.0255614 | 0.0259075 | 0.0257638 | 0.2258065 |
| 5100 hours | 24205695 | 4919.928 | 3995.778 | 0.0273792 | 0.0259075 | 0.0277396 | 0.2258065 |
| 5124 hours | 24254011 | 4924.836 | 4004.360 | 0.0274321 | 0.0259075 | 0.0277935 | 0.2258065 |
| 5760 hours | 21769420 | 4665.771 | 3698.630 | 0.0251265 | 0.0197556 | 0.0254030 | 0.2903226 |
| 5772 hours | 20421540 | 4519.020 | 3499.653 | 0.0237404 | 0.0197556 | 0.0239631 | 0.3440860 |
| 5784 hours | 21911436 | 4680.965 | 3639.825 | 0.0247012 | 0.0191914 | 0.0249144 | 0.3306452 |
| 5796 hours | 23601327 | 4858.120 | 3958.128 | 0.0269945 | 0.0259075 | 0.0272247 | 0.2500000 |
| 5808 hours | 25180261 | 5017.994 | 4119.196 | 0.0281903 | 0.0223888 | 0.0284723 | 0.2258065 |
| 5820 hours | 27535914 | 5247.467 | 4224.078 | 0.0290345 | 0.0223888 | 0.0294426 | 0.2258065 |
| 5832 hours | 27644336 | 5257.788 | 4235.336 | 0.0290310 | 0.0223888 | 0.0294559 | 0.2258065 |
| 5844 hours | 27326546 | 5227.480 | 4203.571 | 0.0288354 | 0.0209749 | 0.0292542 | 0.2258065 |
| 6480 hours | 23858964 | 4884.564 | 3851.119 | 0.0261895 | 0.0205783 | 0.0264908 | 0.2661290 |
| 6504 hours | 26782552 | 5175.186 | 4169.212 | 0.0282877 | 0.0248724 | 0.0285277 | 0.2473118 |
| 6516 hours | 26994411 | 5195.615 | 4210.219 | 0.0285457 | 0.0260384 | 0.0287818 | 0.2500000 |
| 6528 hours | 29529574 | 5434.112 | 4456.319 | 0.0305953 | 0.0224780 | 0.0310035 | 0.2016129 |
| 6540 hours | 29242776 | 5407.659 | 4454.195 | 0.0306165 | 0.0260384 | 0.0310294 | 0.1935484 |
| 6552 hours | 29074404 | 5392.069 | 4442.977 | 0.0303074 | 0.0260384 | 0.0307263 | 0.1935484 |
| 6564 hours | 30811671 | 5550.826 | 4527.305 | 0.0308395 | 0.0260384 | 0.0313125 | 0.1935484 |
| 6588 hours | 29796426 | 5458.610 | 4469.971 | 0.0303514 | 0.0252500 | 0.0307951 | 0.1935484 |
| 7224 hours | 26723551 | 5169.483 | 4124.258 | 0.0278534 | 0.0252500 | 0.0281878 | 0.2580645 |
| 7236 hours | 26469106 | 5144.814 | 4052.574 | 0.0273652 | 0.0252500 | 0.0276596 | 0.2795699 |
| 7248 hours | 27369799 | 5231.615 | 4161.627 | 0.0281096 | 0.0224780 | 0.0284085 | 0.2661290 |
| 7260 hours | 28774453 | 5364.182 | 4437.877 | 0.0301684 | 0.0295566 | 0.0304749 | 0.1854839 |
| 7272 hours | 31675690 | 5628.116 | 4639.708 | 0.0316485 | 0.0273066 | 0.0320463 | 0.1612903 |
| 7284 hours | 36247338 | 6020.576 | 4872.729 | 0.0333665 | 0.0295566 | 0.0339473 | 0.1612903 |
| 7296 hours | 36729861 | 6060.517 | 4903.854 | 0.0334686 | 0.0295566 | 0.0340743 | 0.1612903 |
| 7308 hours | 36430203 | 6035.744 | 4890.075 | 0.0333955 | 0.0289497 | 0.0339958 | 0.1612903 |
| 7944 hours | 31867621 | 5645.141 | 4469.368 | 0.0302531 | 0.0252500 | 0.0306949 | 0.2258065 |
| 7968 hours | 34731632 | 5893.355 | 4790.511 | 0.0323881 | 0.0301323 | 0.0327692 | 0.2150538 |
| 7980 hours | 35127034 | 5926.806 | 4856.017 | 0.0328047 | 0.0319223 | 0.0331780 | 0.2096774 |
| 7992 hours | 39436339 | 6279.836 | 5174.038 | 0.0353709 | 0.0289497 | 0.0359697 | 0.1693548 |
| 8004 hours | 39089330 | 6252.146 | 5143.341 | 0.0352034 | 0.0319223 | 0.0357977 | 0.1612903 |
| 8016 hours | 38828669 | 6231.265 | 5147.085 | 0.0349462 | 0.0319223 | 0.0355429 | 0.1612903 |
| 8028 hours | 40680158 | 6378.100 | 5251.867 | 0.0356294 | 0.0319223 | 0.0362751 | 0.1612903 |
| 8052 hours | 39581249 | 6291.363 | 5210.728 | 0.0352145 | 0.0319223 | 0.0358262 | 0.1612903 |
| 8688 hours | 35425308 | 5951.916 | 4810.196 | 0.0323424 | 0.0319223 | 0.0328037 | 0.2258065 |
| 8712 hours | 38865763 | 6234.241 | 5151.744 | 0.0347855 | 0.0349362 | 0.0352124 | 0.2150538 |
| 8724 hours | 36237637 | 6019.770 | 4868.439 | 0.0327822 | 0.0264664 | 0.0331871 | 0.2338710 |
| 8736 hours | 42574661 | 6524.926 | 5351.314 | 0.0365124 | 0.0349362 | 0.0371695 | 0.1774194 |
| 8748 hours | 39604089 | 6293.178 | 5186.128 | 0.0352838 | 0.0332404 | 0.0357510 | 0.1935484 |
| 8760 hours | 39307861 | 6269.598 | 5196.775 | 0.0350580 | 0.0332404 | 0.0355287 | 0.1935484 |
plot_cross_validation_metric(df_cv_mod1, metric = 'rmse')
Cross-validation results for the monthly data are not very reliable, perhaps because the cross-validation window and horizon are specified in days. Even so, all models seem to fit reasonably well and show similar RMSE and other performance metrics.
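One way around the day-based cross_validation() settings is a simple hand-rolled, rolling-origin check on the monthly series itself. The sketch below is only illustrative: the helper name rolling_rmse, the 12-month test window, and the 6 origins are assumptions, not part of the original analysis.
# Rolling-origin RMSE for a monthly Prophet model (illustrative sketch only).
rolling_rmse <- function(df, n_test = 12, n_origins = 6) {
  errs <- c()
  for (k in seq_len(n_origins)) {
    cutoff <- nrow(df) - n_test - (k - 1)               # index of last training row
    fit <- prophet(df[1:cutoff, ], n.changepoints = 5)  # refit on the shorter history
    fut <- make_future_dataframe(fit, periods = n_test, freq = "month")
    fc <- predict(fit, fut)
    actual <- df$y[(cutoff + 1):(cutoff + n_test)]      # the next n_test observed months
    errs <- c(errs, actual - tail(fc$yhat, n_test))     # forecast errors for this origin
  }
  sqrt(mean(errs^2))
}
# rolling_rmse(train)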
Out-of-sample RMSE:
RMSE = sqrt(mean((test$y - mod1_forecast$yhat)^2))
## Warning in test$y - mod1_forecast$yhat: longer object length is not a multiple
## of shorter object length
RMSE
## [1] 44973.13
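The recycling warning above indicates that test$y and mod1_forecast$yhat have different lengths (the forecast also contains fitted values for the training months), so this RMSE mixes misaligned rows. A minimal date-aligned version, assuming test has ds and y columns for the held-out months, could look like this:
# Join the forecast to the test dates so both vectors line up
# (mod1_forecast$ds is POSIXct, hence the as.Date() conversions).
aligned <- test %>%
  mutate(ds = as.Date(ds)) %>%
  inner_join(mod1_forecast %>% mutate(ds = as.Date(ds)) %>% select(ds, yhat),
             by = "ds")
sqrt(mean((aligned$y - aligned$yhat)^2))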
In our project, the ARIMA model performs better than the Prophet model with 5 changepoints: the ARIMA model has a much lower out-of-sample RMSE (16330.02 vs. 44973.13). Two likely reasons: first, our data are monthly, and Prophet naturally works better with daily data; second, we did not add an external regressor to the Prophet model, which could change the outcome.
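On the last point, Prophet does support extra regressors through add_regressor(). The sketch below shows the general pattern only; the training frame train_reg and its inventory column are hypothetical and not part of this analysis.
# Sketch only: train_reg (columns ds, y, inventory) is a hypothetical input.
m <- prophet(n.changepoints = 5)      # instantiate the model without fitting
m <- add_regressor(m, 'inventory')    # register the extra regressor
m <- fit.prophet(m, train_reg)        # fit on ds, y and inventory
future_reg <- make_future_dataframe(m, periods = 12, freq = 'month')
# future_reg must also carry an inventory column (e.g. its own forecast)
# before calling predict(m, future_reg).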