1 Data Info and Exploratory Data Analysis (EDA)

1.1 Datasets Description

The Zillow Home Price Data are time series datasets available online via this link. The datasets selected are updated monthly:

  • Home Values: the Zillow Home Value Index (ZHVI), which gives the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time.

  • Home Values Forecasts: the one-year forecast of ZHVI, created using the all-homes, mid-tier cut of ZHVI.

  • Rentals: a smoothed measure of the typical observed market-rate rent across a given region.

  • Inventory: The count of unique listings that were active at any time in a given month.

  • List and Sale Prices: the median price at which homes across various geographies were listed and sold.

  • Sales Count and Price Cuts: the estimated number of unique properties that sold during the month, after accounting for the latency between when sales occur and when they are reported.

1.2 Project Proposal

These monthly time series datasets from Zillow were selected because they fit my interest in a project on “Forecasting Home Value in Cincinnati”.

Experts in real estate have long been concerned about the housing market cycle, illustrated below:

[Figure: the housing market cycle]

On top of that, in 2020, as the COVID-19 pandemic spread across the world and policies such as lockdowns and social distancing were introduced, predictions for the housing market were very pessimistic. Experts and the media forecast a shrinking US property market and falling prices as GDP decreased and the economy struggled to recover.

However, the housing market saw an unprecedented increase starting in late 2020, with values rising by nearly 20% over the past year (2021). How will the housing market perform in 2022? Is demand still strong, or should we be concerned about home values declining or even a market crash? The abnormal pattern in 2021 may make forecasting difficult, as it is not a normal trend and will raise sellers’ expectations. We want to forecast these time series to see whether the upward trend is a one-time event or will continue into 2022.

1.3 Data preparation

Import datasets

We loaded five separate datasets. These are all wide datasets that need to be transformed to long format later:

  • home_value has 908 obs(regions) and 269 variables (monthly; Jan/2000- Dec/2021).

  • Rentals: has 102 obs(regions) and 99 variables (monthly; Jan/2014- Dec/2021).

  • Inventory: has 917 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).

  • List and Sale Prices: has 95 obs (regions) and 53 variables (monthly; Jan/2018 - Dec/2021).

  • Sales Count and Price Cuts: has 95 obs (regions) and 172 variables (monthly; Feb/2008 - Dec/2021).

home_value<- read_csv("data/home_values.csv")
inventory<- read_csv("data/inventory.csv")
list_and_sale_price<- read_csv("data/list_and_sale_price.csv")
rentals<- read_csv("data/rentals.csv")
sales_count_and_price_cuts<- read_csv("data/sales_count_and_price_cuts.csv")

1.3.1 Transformed datasets

Data for Cincinnati, OH

  • Separate datasets

From the datasets above, we can see that the city of Cincinnati, OH has the unique region ID “394466”. The city’s size ranks 28th among the 900+ regions, and its region type is Metropolitan Statistical Area (MSA).

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_inventory<- inventory%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="inventory")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_list_and_sale_price<- list_and_sale_price%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="list_sale_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_rentals<- rentals%>%
  filter(RegionID==394466)%>%
  pivot_longer(4:99,names_to="date",values_to="rentals")%>%
  mutate(month= as.Date(as.yearmon(date)))%>%
  mutate(new_month=as.Date(month) + months(1) - days(1))

cin_sales_count_and_price_cuts<- sales_count_and_price_cuts%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:172,names_to="date",values_to="sales_count_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))
  • Combined dataset

This is the full combined Cincinnati dataset built from the five separate datasets of different time lengths; its dimensions are 264 observations and 12 columns. Note that inventory, list price, rentals, and sales cover shorter time periods than the home value data.

cin_combine_full<- cin_home_value%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_inventory[,7:8], by="new_month")%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")
# good alternative library for descriptive statistics
cin_combine_full%>%
  tail(5)%>%
  kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2000-2021)")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Combined dataset of home information in Cincinnati–(2000-2021)
RegionID SizeRank RegionName RegionType StateName date home_value new_month sales_count_price rentals inventory list_sale_price
394466 28 Cincinnati, OH Msa OH 2021-08-31 231514 2021-08-31 3071 1338 6254 336233
394466 28 Cincinnati, OH Msa OH 2021-09-30 234720 2021-09-30 3010 1348 6479 324600
394466 28 Cincinnati, OH Msa OH 2021-10-31 236772 2021-10-31 3143 1359 6452 316633
394466 28 Cincinnati, OH Msa OH 2021-11-30 238710 2021-11-30 3156 1369 6006 310300
394466 28 Cincinnati, OH Msa OH 2021-12-31 240833 2021-12-31 2941 1380 5302 306967

This is the combined Cincinnati dataset restricted to the common time span of the five datasets above; it has 48 observations and 12 variables.

cin_combine<- cin_inventory%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_home_value[,7:8], by="new_month")
# good alternative library for descriptive statistics
cin_combine%>%
  head(5)%>%
  kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2018-2021)")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Combined dataset of home information in Cincinnati–(2018-2021)
RegionID SizeRank RegionName RegionType StateName date inventory new_month list_sale_price rentals sales_count_price home_value
394466 28 Cincinnati, OH Msa OH 2018-01-31 8017 2018-01-31 218300 1087 2100 167537
394466 28 Cincinnati, OH Msa OH 2018-02-28 7636 2018-02-28 219967 1091 1797 168013
394466 28 Cincinnati, OH Msa OH 2018-03-31 7846 2018-03-31 231600 1095 2765 168827
394466 28 Cincinnati, OH Msa OH 2018-04-30 8252 2018-04-30 248267 1099 3206 170175
394466 28 Cincinnati, OH Msa OH 2018-05-31 8957 2018-05-31 261600 1104 3278 172092

This is a subset of the combined dataset that omits the city metadata columns.

cin_sub<- cin_combine[, 7:12]
# good alternative library for descriptive statistics

cin_sub%>%
  tail(5)%>%
  kbl(caption = "Table: Subset dataset of home information in Cincinnati")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Subset dataset of home information in Cincinnati
inventory new_month list_sale_price rentals sales_count_price home_value
6254 2021-08-31 336233 1338 3071 231514
6479 2021-09-30 324600 1348 3010 234720
6452 2021-10-31 316633 1359 3143 236772
6006 2021-11-30 310300 1369 3156 238710
5302 2021-12-31 306967 1380 2941 240833

1.4 EDA - Full combined dataset

home_value_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,home_value))+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2000 - December 2021')

inventory_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,inventory))+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
   labs(
    title  = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,list_sale_price))+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
   labs(
    title  = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,rentals))+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title  = 'Rentals in Cincinnati',
    subtitle = 'January 2014 - December 2021')

sales_count_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,sales_count_price))+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
   labs(
    title  = 'Sales count and price cut in Cincinnati',
    subtitle = 'February 2008 - December 2021')

grid.arrange(home_value_full_plot,
          inventory_full_plot,
          list_and_sale_price_full_plot,
          rentals_full_plot,
          sales_count_price_full_plot, 
          ncol=2, nrow =3)

1.4.1 EDA - Time Trend - Combined Data (2018-2021)

home_value

home_value_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,home_value))+
  geom_smooth(aes(new_month,home_value),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2018 - December 2021')

home_value_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                home_value, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Home Value in Cincinnati by Year' )

grid.arrange(home_value_plot, home_value_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

#boxplot of home value
cin_home_value%>%ggplot(aes(year(new_month), 
                           home_value, group=factor(year(new_month)),
                           colour=factor(year(new_month)))) +
  geom_boxplot()+
  labs(title= 'Boxplot of Home Value in Cincinnati by Year' )

inventory

inventory_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,inventory))+
  geom_smooth(aes(new_month,inventory),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
  labs(
    title = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

inventory_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                inventory, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Inventory in Cincinnati by Year' )

grid.arrange(inventory_plot, inventory_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

list_sale_price

list_and_sale_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,list_sale_price))+
  geom_smooth(aes(new_month,list_sale_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
  labs(
    title = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                list_sale_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'List and sale price in Cincinnati by Year' )

grid.arrange(list_and_sale_price_plot, list_and_sale_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

rentals

rentals_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,rentals))+
  geom_smooth(aes(new_month,rentals),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title = 'Rentals in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                rentals, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Rentals in Cincinnati by Year' )

grid.arrange(rentals_plot, rentals_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

sales_count_price

sales_count_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,sales_count_price))+
  geom_smooth(aes(new_month,sales_count_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
  labs(
    title = 'Sales count and price cut in Cincinnati',
    subtitle = 'January 2018 - December 2021')

sales_count_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                sales_count_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Sales count and price cut in Cincinnati by Year' )

grid.arrange(sales_count_price_plot, sales_count_price_year_plot, ncol=1, nrow =2)
## `geom_smooth()` using formula 'y ~ x'

1.5 Initial data analysis

As the EDA suggests, home value and rentals seem to increase over time and show the same pattern within each year (2018-2021). Similarly, list and sale price shows an increasing trend over the years but varies within each year (this may be seasonal). In contrast, inventory seems to decrease both over time and within each year. Lastly, for sales count and price cut, the variation tends to decrease slightly over the years.

Now we want to model the linear trend and see if it’s consistent with our observations from the visualization.

# Test whether there is a true linear trend over time
mod1 = lm(home_value~new_month,data=cin_combine_full)
summary(mod1)

mod2 = lm(inventory~new_month,data=cin_combine_full)
summary(mod2)

mod3 = lm(list_sale_price~new_month,data=cin_combine_full)
summary(mod3)


mod4 = lm(rentals~new_month,data=cin_combine_full)
summary(mod4)

mod5 = lm(sales_count_price~new_month,data=cin_combine_full)
summary(mod5)

The outcome table:

tab_model(mod1, mod2,mod3,mod4,mod5)
home value:        (Intercept) 67293.58 (CI 53729.25 – 80857.90, p<0.001); new month 5.95 (CI 5.05 – 6.84, p<0.001); Observations 264; R2 / adj. R2 = 0.396 / 0.393
inventory:         (Intercept) 62976.81 (CI 50644.46 – 75309.16, p<0.001); new month -3.02 (CI -3.70 – -2.35, p<0.001); Observations 48; R2 / adj. R2 = 0.639 / 0.631
list sale price:   (Intercept) -1130447.61 (CI -1357341.28 – -903553.93, p<0.001); new month 77.53 (CI 65.12 – 89.94, p<0.001); Observations 48; R2 / adj. R2 = 0.775 / 0.770
rentals:           (Intercept) -1332.23 (CI -1420.22 – -1244.23, p<0.001); new month 0.14 (CI 0.13 – 0.14, p<0.001); Observations 96; R2 / adj. R2 = 0.970 / 0.970
sales count price: (Intercept) -4112.02 (CI -4975.12 – -3248.91, p<0.001); new month 0.38 (CI 0.33 – 0.44, p<0.001); Observations 167; R2 / adj. R2 = 0.560 / 0.557

Most of the linear regression output is consistent with the over-time trends we observed in the plots above: home value, rentals, and list and sale price increase over time and have positive coefficients (positive linear association), while inventory decreases over time and has a negative coefficient (negative linear association). Lastly, although the variation in sales count and price cut tends to decrease slightly over the years, it has a positive coefficient, indicating a positive linear association.

2 ARIMA

From this part on, we will start model development focusing on the home value time series in Cincinnati:

Load data:

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_home_value%>%
  head(5)%>%
  kbl(caption = "Table: Cincinnati Home Value data")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Cincinnati Home Value data
RegionID SizeRank RegionName RegionType StateName date home_value new_month
394466 28 Cincinnati, OH Msa OH 2000-01-31 129379 2000-01-31
394466 28 Cincinnati, OH Msa OH 2000-02-29 129364 2000-02-29
394466 28 Cincinnati, OH Msa OH 2000-03-31 129290 2000-03-31
394466 28 Cincinnati, OH Msa OH 2000-04-30 129780 2000-04-30
394466 28 Cincinnati, OH Msa OH 2000-05-31 130454 2000-05-31

2.1 Assess data generating process

We want to look at the plot and assess the data generating process.

# Plot the Data
cin_home_value%>% ggplot()+
  geom_line(aes(new_month,home_value))+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2000 - December 2021')

The plot above indicates non-stationarity, so we look at 6-month rolling mean and standard deviation plots:

homevalue_roll <- cin_home_value %>%
  mutate(
    home_mean = zoo::rollmean(
      home_value, 
      k = 6, 
      fill = NA),
    close_sd = zoo::rollapply(
      home_value, 
      FUN = sd, 
      width = 6, 
      fill = NA)
  )
homevalue_rollmean <- homevalue_roll %>%
  ggplot() +
  geom_line(aes(new_month, home_mean)) +
  theme_bw() +
  ggtitle("homevalue Mean over Time (6 month rolling window)") +
  ylab("home value (rolling mean)") +
  xlab("Month")
homevalue_rollsd <- homevalue_roll %>%
  ggplot() +
  geom_line(aes(new_month, close_sd)) +
  theme_bw() +
  ggtitle("homevalue SD over Time (6 month rolling window)") +
  ylab("home value (rolling sd)") +
  xlab("Month")

grid.arrange(homevalue_rollmean,homevalue_rollsd,ncol=1, nrow =2)

The rolling plots above mainly imply mean non-stationarity and only slight, very limited variance non-stationarity. This may be because our data are monthly (monthly averages, so there is less variation). We want to confirm that the data are only mean non-stationary and not variance non-stationary, so we start with variance-stabilizing transformations (log/Box-Cox):

2.2 Transformation

2.2.1 - Variance non-stationary transformation

  • Log transformation
  • Box-Cox transformation
cinhome_var_transform = cin_home_value %>%
  mutate(home_log = log1p(home_value),
         home_box = BoxCox(home_value,lambda='auto'))
homevalue_log_trans <- cinhome_var_transform %>%
  ggplot() +
  geom_line(aes(new_month, home_log)) +
  theme_bw() +
  ggtitle("Homevalue in Cincinnati over Time (log transform)") +
  ylab("home value(log)") +
  xlab("Month")
homevalue_box_trans <- cinhome_var_transform %>%
  ggplot() +
  geom_line(aes(new_month, home_box)) +
  theme_bw() +
  ggtitle("Homevalue in Cincinnati over Time (box-cox transform)") +
  ylab("home value(box-cox)") +
  xlab("Month")

grid.arrange(homevalue_log_trans,homevalue_box_trans,ncol=1, nrow =2)

We notice from the plots that, after the variance transformations, the series still appears non-stationary, and the log and Box-Cox transformations make little difference. Since the non-stationarity did not improve much, we use ADF/KPSS tests to verify:

  • ADF test: (null: non-stationary) p-value <0.05 indicates stationary.
  • KPSS test: (Null: stationary) p-value <0.05 indicates non- stationary.
a1<-adf.test(cinhome_var_transform$home_log) # Non-stationary
k1<-kpss.test(cinhome_var_transform$home_log) # Non-stationary

a2<-adf.test(cinhome_var_transform$home_box) # Non-stationary
k2<-kpss.test(cinhome_var_transform$home_box) # Non-stationary

test<- matrix(c("ADF","Log Transform",a1$p.value,"Non-stationary",
                "KPSS","Log Transform",k1$p.value,"Non-stationary",
                "ADF","Box-cox Transform",a2$p.value,"Non-stationary",
                "KPSS","Box-cox Transform",k2$p.value,"Non-stationary"),
              ncol=4,byrow=TRUE)

test %>%
  kbl(caption = "Table: ADF/KPSS test on Log vs. Box-cox transform") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: ADF/KPSS test on Log vs. Box-cox transform
ADF Log Transform 0.99 Non-stationary
KPSS Log Transform 0.01 Non-stationary
ADF Box-cox Transform 0.99 Non-stationary
KPSS Box-cox Transform 0.01 Non-stationary

The plots and test results both show that the transformed data are still non-stationary. Now we focus on mean non-stationarity and start with first-order differencing:

2.2.2 - Mean non-stationary transformation

  • First Difference
  • First Difference, Log transform
  • First Difference, Box transform
cinhome_mean_transform = cin_home_value %>%
  mutate(home_log = log1p(home_value),
         home_box = BoxCox(home_value,lambda='auto'),
         home_diff= home_value-lag(home_value),
         home_diff_log = home_log - lag(home_log),
         home_diff_box = home_box - lag(home_box))%>%
  drop_na()

#first difference
d1<-cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff))+
      theme_bw()+
      ggtitle("homevalue First Difference: 2000-2021")+
      ylab("Homevalue (Difference))")+
      xlab("Month")

#First order difference on log transformed 
d2<- cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff_log))+
      theme_bw()+
      ggtitle("homevalue First Difference + Log tranformation: 2000-2021")+
      ylab("Homevalue (Difference + log))")+
      xlab("Month")

#First order difference on box transformed
d3<- cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff_box))+
      theme_bw()+
      ggtitle("homevalue First Difference + Box-Cox tranformation: 2000-2021")+
      ylab("Homevalue (Difference + Box-cox))")+
      xlab("Month")

grid.arrange(d1,d2,d3,ncol=1, nrow =3)

We notice a similar pattern from differencing alone and from differencing on top of the variance transformations: the series still seem non-stationary and a trend is visible. We use ADF/KPSS tests to verify:

# only difference
ad1<-adf.test(cinhome_mean_transform$home_diff) # Non-stationary
kd1<-kpss.test(cinhome_mean_transform$home_diff) # Non-stationary

#difference+log
adl1<-adf.test(cinhome_mean_transform$home_diff_log) # Non-stationary
kdl1<-kpss.test(cinhome_mean_transform$home_diff_log) # Non-stationary

#difference +box
adb1<-adf.test(cinhome_mean_transform$home_diff_box) # Non-stationary
kdb1<-kpss.test(cinhome_mean_transform$home_diff_box) # Non-stationary

test2<- matrix(c("ADF","Difference",ad1$p.value,"Non-stationary",
                 "KPSS","Difference",kd1$p.value,"Non-stationary",
                 "ADF","Difference+log",adl1$p.value,"Non-stationary",
                 "KPSS","Difference+Log",kdl1$p.value,"Non-stationary",
                 "ADF","Difference+Box-cox",adb1$p.value,"Non-stationary",
                 "KPSS","Difference+Box-cox",kdb1$p.value,"Non-stationary"),
               ncol=4,byrow=TRUE)
test2 %>%
  kbl(caption = " Table: ADF/KPSS test on First Difference & Log/ Box-cox Transform+Difference") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: ADF/KPSS test on First Difference & Log/ Box-cox Transform+Difference
ADF Difference 0.788741858628823 Non-stationary
KPSS Difference 0.01 Non-stationary
ADF Difference+log 0.55797735934197 Non-stationary
KPSS Difference+Log 0.01 Non-stationary
ADF Difference+Box-cox 0.43768833056367 Non-stationary
KPSS Difference+Box-cox 0.01 Non-stationary

Results from the ADF/KPSS tests for the first difference (alone) and the first difference + log/Box-Cox transformations all imply non-stationarity. We need to take a second-order difference:

  • Second-order Difference
  • Second-order Difference, Log transform
  • Second-order Difference, Box transform
#second order difference
cinhome_second_diff<- cinhome_mean_transform%>%
  mutate(home_second_diff=home_diff-lag(home_diff),
         home_second_diff_log=home_diff_log-lag(home_diff_log),
         home_second_diff_box=home_diff_box-lag(home_diff_box))%>%
  drop_na()
#another way to take the second difference
# diff(diff(cin_home_value$home_value))

Now we plot the second-order differencing and check:

#second order difference on log transformed 
#second difference
sd1<-cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference: 2000-2021")+
      ylab("Homevalue (2nd-Difference))")+
      xlab("Month")

#second order difference on log transformed 
sd2<- cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff_log))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference + Log tranformation: 2000-2021")+
      ylab("Homevalue (2nd-Difference + log))")+
      xlab("Month")

#second order difference on box transformed
sd3<- cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff_box))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference + Box-Cox tranformation: 2000-2021")+
      ylab("Homevalue (2nd-Difference + Box-cox))")+
      xlab("Month")

grid.arrange(sd1,sd2,sd3,ncol=1, nrow =3)

The plots above are now more mean-reverting, although we do notice the variation changing over time. Now we can check whether the differenced and transformed data are stationary using ADF/KPSS tests:

# only 2nd difference
ad2<-adf.test(cinhome_second_diff$home_second_diff) # stationary
kd2<-kpss.test(cinhome_second_diff$home_second_diff) # stationary

#second difference+log
adl2<-adf.test(cinhome_second_diff$home_second_diff_log) # stationary
kdl2<-kpss.test(cinhome_second_diff$home_second_diff_log) # stationary

#second difference +box
adb2<-adf.test(cinhome_second_diff$home_second_diff_box) # stationary
kdb2<-kpss.test(cinhome_second_diff$home_second_diff_box) # stationary

test3<- matrix(c("ADF","Difference",ad2$p.value,"Stationary",
                 "KPSS","Difference",kd2$p.value,"Stationary",
                 "ADF","Difference+log",adl2$p.value,"Stationary",
                 "KPSS","Difference+Log",kdl2$p.value,"Stationary",
                 "ADF","Difference+Box-cox",adb2$p.value,"Stationary",
                 "KPSS","Difference+Box-cox",kdb2$p.value,"Stationary"),
               ncol=4,byrow=TRUE)
test3 %>%
  kbl(caption = "Table: ADF/KPSS test on 2nd Difference vs. 2nd Difference + Log/ Box-cox Transform") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: ADF/KPSS test on 2nd Difference vs. 2nd Difference + Log/ Box-cox Transform
ADF Difference 0.01 Stationary
KPSS Difference 0.1 Stationary
ADF Difference+log 0.01 Stationary
KPSS Difference+Log 0.1 Stationary
ADF Difference+Box-cox 0.01 Stationary
KPSS Difference+Box-cox 0.1 Stationary

2.2.3 ACF/PACF plot

The tests all suggest that the second-order difference, with or without a log/Box-Cox transformation, is stationary. We will plot the ACF and PACF of the second-differenced data and decide on the parameters of the ARIMA model:

# Plot ACF/PACF of second differenced
acf(cinhome_second_diff$home_second_diff)

pacf(cinhome_second_diff$home_second_diff)

2.2.4 Seasonality

The home value series is seasonally adjusted, and the ACF/PACF plots above are consistent with this.

We observed a trend in the earlier data-generating plot. In the ACF/PACF plots we only notice some wave patterns, but no obvious seasonality or fixed within-year frequency (there may be some cyclical pattern). In real estate markets, some months certainly sell better than others; however, barring events such as a housing market crash, home value is unlikely to fall back to its previous seasonal level.

We can double-check our assumption of no seasonality in the home_value series using decomposition:

#make home value a ts dataframe
home<- ts(cin_home_value$home_value,frequency = 12, start = c(2000, 1))

# home %>% decompose(type="multiplicative") %>%
#   autoplot() + xlab("date") +
#   ggtitle("Classical multiplicative decomposition
#     of electrical equipment index")


#decompose-- additive
home_decompose_ad = decompose(home,type='additive')
plot(home_decompose_ad)

#decompose -- multiplicative
home_decompose_mul = decompose(home,type='multiplicative')
plot(home_decompose_mul)
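
As an additional, more formal check (a minimal sketch, assuming the forecast package already used above for BoxCox and auto.arima), unit-root heuristics can suggest how many seasonal and ordinary differences the monthly series needs:

# Seasonal and ordinary differences suggested for the monthly ts object created above
nsdiffs(home)   # 0 would be consistent with no remaining seasonality
ndiffs(home)    # compare with the two ordinary differences found via ADF/KPSS above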

2.3 ARIMA Modeling

From the ADF/KPSS test results and the ACF/PACF plots above, we can confirm that the data are only mean non-stationary and need a second-order difference; a variance transformation is not required. Thus, we will build models using differencing alone, which handles the mean non-stationarity. Additionally, from the ACF/PACF plots we can assume the following ARIMA model for:

  • home_value (original data): an AR process with second-order differencing, so d=2, p=18 (AR process), and q=0. Other significant lags are 3, 4, 6, 12, and 15; we can also try these orders in ARIMA models.
#use original parameter: home_value
mod1 = arima(cinhome_second_diff$home_value,order=c(18,2,0))
mod2 = arima(cinhome_second_diff$home_value,order=c(3,2,0))
mod3 = arima(cinhome_second_diff$home_value,order=c(4,2,0))
mod4 = arima(cinhome_second_diff$home_value,order=c(6,2,0))
mod5 = arima(cinhome_second_diff$home_value,order=c(12,2,0))
mod6 = arima(cinhome_second_diff$home_value,order=c(15,2,0))
mod7 = arima(cinhome_second_diff$home_value,order=c(18,2,1))

model_summary<- matrix(
  c("Model","ARIMA(p,I,q)","Parameter","sigma^2","log Likelihood","AIC","BIC",
    "1","ARIMA(18,2,0)","Home value",mod1$sigma2,mod1$loglik,mod1$aic,BIC(mod1),
    "2","ARIMA(3,2,0)","Home value",mod2$sigma2,mod2$loglik,mod2$aic,BIC(mod2),
    "3","ARIMA(4,2,0)","Home value",mod3$sigma2,mod3$loglik,mod3$aic,BIC(mod3),
    "4","ARIMA(6,2,0)","Home value",mod4$sigma2,mod4$loglik,mod4$aic,BIC(mod4),
    "5","ARIMA(12,2,0)","Home value",mod5$sigma2,mod5$loglik,mod5$aic,BIC(mod5),
    "6","ARIMA(15,2,0)","Home value",mod6$sigma2,mod6$loglik,mod6$aic,BIC(mod6),
    "7","ARIMA(18,2,1)","Home value",mod7$sigma2,mod7$loglik, mod7$aic,BIC(mod7)),
  ncol=7,byrow=TRUE)

model_summary %>%
  kbl(caption = "Table: ARIMA Model output: AIC vs BIC") %>%
  kable_classic(full_width = F, html_font = "Cambria")%>%
  row_spec(1, bold = T)
Table: ARIMA Model output: AIC vs BIC
Model ARIMA(p,I,q) Parameter sigma^2 log Likelihood AIC BIC
1 ARIMA(18,2,0) Home value 35339.0625546411 -1733.42177827751 3504.84355655501 3572.49650754431
2 ARIMA(3,2,0) Home value 52017.2052625915 -1780.92673285475 3569.8534657095 3584.09619223356
3 ARIMA(4,2,0) Home value 49859.9552256206 -1775.51214578929 3561.02429157858 3578.82769973366
4 ARIMA(6,2,0) Home value 46951.6410027904 -1767.87902494625 3549.7580498925 3574.68282130961
5 ARIMA(12,2,0) Home value 42855.4641643852 -1756.5986295862 3539.19725917239 3585.4861203756
6 ARIMA(15,2,0) Home value 39563.6255128961 -1746.94105520269 3525.88211040538 3582.85301650163
7 ARIMA(18,2,1) Home value 35325.8882484904 -1733.39468462616 3506.78936925232 3578.00300187263

We can tell from the model output that Model 1, ARIMA(18,2,0) on home value, performs best, as it has the smallest AIC as well as the smallest BIC.
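
The same comparison can be made programmatically (a minimal sketch over the model objects fitted above; the helper data frame is only for display):

# Rank the candidate ARIMA fits by information criteria
candidates = list(mod1, mod2, mod3, mod4, mod5, mod6, mod7)
data.frame(model = paste0("mod", 1:7),
           AIC = sapply(candidates, AIC),
           BIC = sapply(candidates, BIC)) %>%
  arrange(AIC)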

2.3.1 Auto.arima

#auto.arima for home value
mod1_auto = auto.arima(cinhome_second_diff$home_value,max.p=20,max.q=20,max.order=20,stepwise=FALSE,approximation=FALSE,stationary=FALSE)

The auto.arima output above suggests that the best model for home value is ARIMA(18,2,0).
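
To inspect the selected specification directly, we can print the fitted object (a minimal check; summary() on the auto.arima object also reports the coefficients, sigma^2, information criteria, and training-set accuracy):

mod1_auto
summary(mod1_auto)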

2.4 Diagnostics

  • Check residuals
  • Box-Ljung Test

Now we want to look at residuals on our best model.

# Diagnostics from the Model ---------------------
# Examine Residuals - Box-Ljung Test says residuals have no autocorrelation at lag 1/5
checkresiduals(mod1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
## 
## Model df: 18.   Total lags used: 21
checkresiduals(mod1_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
## 
## Model df: 18.   Total lags used: 21
Box.test(mod1$residuals,type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1$residuals,type = 'Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1$residuals,type = 'Ljung-Box',lag=10)
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652

Neither model indicates an autocorrelation problem: for both mod1_auto and Model 1 on home value – ARIMA(18,2,0) – the Ljung-Box p-values are above 0.05, so we fail to reject the null hypothesis of no residual autocorrelation.
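
Note that checkresiduals() adjusts the Ljung-Box degrees of freedom for the 18 estimated AR coefficients (df = 21 - 18 = 3). The same adjustment can be made manually with Box.test via its fitdf argument (a minimal sketch; the 24-lag choice is an assumption):

# Ljung-Box at 24 lags, subtracting the 18 fitted parameters from the degrees of freedom
Box.test(mod1$residuals, type = 'Ljung-Box', lag = 24, fitdf = 18)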

Box.test(mod1_auto$residuals,type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=10)
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652
in_sample_pred = cinhome_second_diff$home_value-mod1_auto$residuals
#transfer for box data 
#in_sample_pred = InvBoxCox(in_sample_pred,lambda=best_lambda)

# RMSE Calculation
# In-sample RMSE: an average error of roughly 188, which is small relative to home values
RMSE = sqrt(mean((in_sample_pred - cinhome_second_diff$home_value)^2,na.rm=T))
RMSE
## [1] 187.6045

2.5 Forecasting

Now we want to predict using our best model and get predictions for 12 steps (a year) ahead:

# Get predictions for 12 steps ahead
prediction = predict(mod1_auto,n.ahead=12)

# Create prediction dataset for the next 12 months
pred_data = data.frame(
  pred = prediction$pred,
  se = prediction$se,
  new_month =  seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=12)
  )


pred_merge = cinhome_second_diff %>% 
  mutate(in_sample_pred = in_sample_pred) %>%
  full_join(pred_data) %>%
  mutate(
    pred_high = pred+2*se, # approximate 95% prediction interval bounds
    pred_low = pred - 2*se,
    pred = pred,
    error = pred - home_value
  )

# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>% 
  mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
  ggplot()+
  geom_line(aes(ymd(new_month),home_value))+
  geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
  geom_line(aes(ymd(new_month),pred),color='blue')+
  geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)

3 Facebook Prophet model

3.1 Basic model

We are still focusing on the monthly home value data for Cincinnati. To fit a Prophet model, we need to rename our variables: the date variable must be named ds and the time series variable (home_value) must be named y.

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

#change to prophet format data
prophet_data = cin_home_value %>% 
    rename(ds = new_month, # Have to name our date variable "ds"
    y = home_value)  # Have to name our time series "y"

prophet_data%>%
  head(5)%>%
  kbl(caption = "Table: Prophet Format Data: Cincinnati Home Value")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Prophet Format Data: Cincinnati Home Value
RegionID SizeRank RegionName RegionType StateName date y ds
394466 28 Cincinnati, OH Msa OH 2000-01-31 129379 2000-01-31
394466 28 Cincinnati, OH Msa OH 2000-02-29 129364 2000-02-29
394466 28 Cincinnati, OH Msa OH 2000-03-31 129290 2000-03-31
394466 28 Cincinnati, OH Msa OH 2000-04-30 129780 2000-04-30
394466 28 Cincinnati, OH Msa OH 2000-05-31 130454 2000-05-31

Prophet is usually used for daily data; fitting monthly data is also possible, but then we are implicitly asking for daily forecasts. We will start with a basic Prophet forecast using a daily frequency.

We took the months before “2017-08-31” as the training set (approximately 80%) and “2017-08-31” onward as the test set (approximately 20%). After modeling, we can draw an interactive forecast plot.

Additionally, we will plot the forecast home value against the actual home value for comparison.

train1 = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test1 = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model1 = prophet(train1)
future1 = make_future_dataframe(model1,periods = 1583)#freq=daily; 53 month ahead
forecast1 = predict(model1,future1)

#plot interactive forecast plot
dyplot.prophet(model1,forecast1)
#plot actual vs forecast line for comparison
ggplot()+
  geom_line(data=forecast1,aes(x=as.Date(ds),y=yhat,col="blue"))+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")

As noted above, the output looks strange: the forecast is not smooth, and it generates daily rather than monthly forecasts. Now we will change the frequency to monthly.

train = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model = prophet(train)
future = make_future_dataframe(model,periods = 53,freq = 'month')
forecast = predict(model,future)

#plot interactive forecast plot
dyplot.prophet(model,forecast)
#plot actual vs forecast line for comparison
ggplot()+
  geom_line(data=forecast,aes(x=as.Date(ds),y=yhat,col="blue"))+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")

The plots now look smoother. However, we can see that the predicted home value is clearly lower than the actual home value after “2017-08-31”. Thus, we will look into the impact of trend and seasonality on home_value:

3.1.1 Decomposition of the elements

prophet_plot_components(model,forecast)

We can see from the decomposition plot above that there is a clear trend but seasonality is less clear. This may be because the home value dataset is seasonally adjusted and only includes the middle price tier of homes. However, there may be a cyclical component that follows economic patterns (e.g., the 2008 financial crisis, the COVID-19 pandemic). Now let’s check the changepoints:

3.1.2 Examination of the changepoints

Specifying changepoints allows us to control the flexibility of the fit. We can examine the changepoints identified for the “trend” component of the home value series in the plot above. There are at least 5 changepoints (1 around 2005, 2 around 2010, 1 around 2015, 1 around 2019):

model_changepoint = prophet(train,n.changepoints=5)
future_changepoint = make_future_dataframe(model_changepoint,periods = 53,freq = 'month')
forecast_changepoint = predict(model_changepoint,future_changepoint)
plot(model_changepoint,forecast_changepoint)+
  add_changepoints_to_plot(model_changepoint)+
  theme_bw()+xlab("Date")+ylab("Home value")

prophet_plot_components(model_changepoint,forecast_changepoint)

We notice from the trend plot that setting the number of changepoints to 5 roughly captures the trend changes. If we set a higher number of changepoints, there is a risk of overfitting.
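
Alternatively, instead of fixing the number of changepoints, the trend flexibility can be tuned through the changepoint prior (a minimal sketch; the 0.01 value is an illustrative assumption, not a tuned choice):

# Smaller changepoint.prior.scale gives a stiffer trend and reduces the risk of overfitting
model_flex = prophet(train, changepoint.prior.scale = 0.01)
future_flex = make_future_dataframe(model_flex, periods = 53, freq = 'month')
forecast_flex = predict(model_flex, future_flex)
plot(model_flex, forecast_flex)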

3.1.3 Saturation Points

From the forecast output above, the predicted values are all above 0, so our model does not strictly need a saturating point. However, Prophet requires a maximum capacity whenever logistic growth with a minimum is used, so we set reasonable minimum and maximum values: the Cincinnati median home value is unlikely to go below $0 or above $0.3 million.

train_saturation = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test_saturation = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model_saturation = prophet(train_saturation)
future_saturation = make_future_dataframe(model_saturation,periods = 53,freq = 'month')#freq=monthly; 53 month ahead

# Set "floor" in training data
train_saturation$floor = 0
train_saturation$cap=300000
future_saturation$floor = 0
future_saturation$cap=300000

# Set floor in forecast data
future_saturation$floor = 0
future_saturation$cap=300000
sat_model = prophet(train_saturation,growth='logistic')
sat_forecast = predict(sat_model,future_saturation)
plot(sat_model,sat_forecast)+ylim(0,300000)+
theme_bw()+xlab("Date")+ylab("Home Value")

From the plot, we notice that all home values fall within the range of $0.1-0.3 million, so we do not actually need to specify saturation points.

3.1.4 Identification and assessment of seasonality

From the decomposition above, we noticed no yearly seasonality. This may be because this value is a smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. Now we will try to identify daily and weekly seasonality, as well as additive/multiplicative seasonality:

  • Daily/weekly seasonality: we have monthly data, but we can examine models with different parameter settings. We start with model_season, setting the daily.seasonality and weekly.seasonality options to TRUE to check for daily and weekly seasonality:
model_season = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
future_season = make_future_dataframe(model_season,periods = 1583)#freq=daily; 53 month ahead
forecast_season = predict(model_season,future_season)

#plot interactive forecast plot
dyplot.prophet(model_season,forecast_season)
#decompose
prophet_plot_components(model_season,forecast_season)

Daily and weekly seasonality are also not very obvious. Now we will look at additive versus multiplicative seasonality:

  • additive seasonality
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)

prophet_plot_components(additive,add_fcst)

  • multiplicative seasonality
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)

prophet_plot_components(multi,multi_fcst)

The decomposition plots for additive and multiplicative seasonality appear similar: seasonality is not obvious. Thus, we likely have only trend and no seasonality.

3.1.5 Assessment of holidays

We also want to try including holidays. The holiday effects might not be very accurate, as we are examining monthly rather than daily data:

#Can use built-in holidays, or specify our own
model_holiday = prophet(train,fit=FALSE,daily.seasonality = TRUE, weekly.seasonality = TRUE)
model_holiday = add_country_holidays(model_holiday,country_name = 'US')
model_holiday = fit.prophet(model_holiday,train)
forecast_holiday = predict(model_holiday,future1)

dyplot.prophet(model_holiday,forecast_holiday)
prophet_plot_components(model_holiday,forecast_holiday)

The fitted line looks very good. Additionally, the plots indicate a few important days over the years, which may correspond to the changepoints in the trend. Now we can look into those days:

forecast_holiday %>%
  filter(holidays != 0) %>%
  select_if(~ !is.numeric(.) || sum(.) != 0)%>%
  select(ds,holidays,`Memorial Day`,`New Year's Day (Observed)`)%>%
  kbl(caption = "Table: Important Holidays")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Important Holidays
ds holidays Memorial Day New Year’s Day (Observed)
2004-05-31 1267.9140 1267.914 0.0000
2004-12-31 -399.0302 0.000 -399.0302
2010-05-31 1267.9140 1267.914 0.0000
2010-12-31 -399.0302 0.000 -399.0302
2018-05-28 1267.9140 1267.914 0.0000
2019-05-27 1267.9140 1267.914 0.0000
2020-05-25 1267.9140 1267.914 0.0000
2021-05-31 1267.9140 1267.914 0.0000

The returned result is reasonable:

The important positive holiday effect is Memorial Day, which falls in May when the market is in its peak season and buyers are active. Conversely, the negatively associated holiday is New Year’s Day, when the housing market is in its slow season. We can also see decreases in 2004 and 2010; around 2010-2011 the housing market crashed following the 2008 financial crisis.

forecast_holiday %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = `Christmas Day`:`Washington's Birthday`,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
ggplot() +
  geom_col(aes(effect,holiday))+
  theme_bw()

3.2 Outlier detection

anom_train = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
anom_test = prophet_data %>% 
  filter(ds>=ymd("2017-08-31"))
anom_model = prophet(anom_train)
anom_future = make_future_dataframe(anom_model,periods = 53,freq = 'month')
forecast = predict(anom_model,anom_future)
forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2017-08-31")) %>% 
  mutate(y = anom_test$y) %>% 
  mutate(anomaly = if_else(y>yhat_upper | y<yhat_lower,TRUE,as.logical(NA)))
forecast_plot_data %>% 
  ggplot()+
    geom_line(aes(ds,y))+
    geom_point(aes(ds,y,color=anomaly))+
    geom_line(aes(ds,yhat),color='red')+
    geom_ribbon(aes(ds,ymin=yhat_lower,ymax=yhat_upper),alpha=0.4,fill='blue') + 
  scale_colour_discrete(na.translate = F)

We can see that the actual values in the test set are much higher than the predicted values, so all points are flagged as anomalies. However, we cannot delete these data, as they were generated during COVID, an anomalous period for the business trend.
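
As a quick tally (a minimal sketch using the forecast_plot_data frame built above), we can count how many test-period points fall outside the prediction interval:

# Number of flagged anomalies out of all test-period observations
forecast_plot_data %>%
  summarize(n_anomaly = sum(anomaly, na.rm = TRUE), n_total = n())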

4 Model Comparison and Validation

4.1 ARIMA

arima_train = cinhome_second_diff %>%
  filter(new_month<ymd("2017-08-31"))
arima_test = cinhome_second_diff %>%
  filter(new_month>=ymd("2017-08-31"))

train_mod1 = arima(arima_train$home_value,order=c(18,2,0))

test_pred1 = predict(train_mod1,53)$pred

error1 = arima_test$home_value - test_pred1
# Get predictions for 53 steps ahead
prediction = predict(train_mod1,n.ahead=53)

# Create prediction dataset for the next 53 months
pred_data = data.frame(
  pred = prediction$pred,
  se = prediction$se,
  new_month =  seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=53)
  )


pred_merge = cinhome_second_diff %>% 
  mutate(in_sample_pred = in_sample_pred) %>%
  full_join(pred_data) %>%
  mutate(
    pred_high = pred+2*se, # approximate 95% prediction interval bounds
    pred_low = pred - 2*se,
    pred = pred,
    error = pred - home_value
  )
## Joining, by = "new_month"
# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>% 
  mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
  ggplot()+
  geom_line(aes(ymd(new_month),home_value))+
  geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
  geom_line(aes(ymd(new_month),pred),color='blue')+
  geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")
## Warning: Removed 52 row(s) containing missing values (geom_path).
## Warning: Removed 52 rows containing missing values (geom_point).
## Warning: Removed 260 row(s) containing missing values (geom_path).

Out of sample RMSE:

out_RMSE = sqrt(mean((arima_test$home_value - test_pred1)^2))
out_RMSE
## [1] 16330.02
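
For comparison with the Prophet cross-validation metrics reported below, we can compute MAE and MAPE on the same held-out window (a minimal sketch using the objects above):

# Out-of-sample MAE and MAPE for the ARIMA(18,2,0) forecasts
out_MAE  = mean(abs(arima_test$home_value - test_pred1))
out_MAPE = mean(abs(arima_test$home_value - test_pred1) / arima_test$home_value)
out_MAE
out_MAPE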


4.2 Prophet Rolling window Cross-Validation

We want to use rolling-window cross-validation to assess model performance at meaningful horizons for the data (e.g., one year out for monthly data), using various metrics (e.g., RMSE, MAE, MAPE).

  • Cross Validation on holiday model
df.cv_holiday <- cross_validation(model_holiday, initial = 4*365,horizon = 365, units = 'days')

unique(df.cv_holiday$cutoff)%>%
  kbl(caption = "Table: Holidays model CV cutoff")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Holidays model CV cutoff
x
2004-02-02 12:00:00
2004-08-03 00:00:00
2005-02-01 12:00:00
2005-08-03 00:00:00
2006-02-01 12:00:00
2006-08-03 00:00:00
2007-02-01 12:00:00
2007-08-03 00:00:00
2008-02-01 12:00:00
2008-08-02 00:00:00
2009-01-31 12:00:00
2009-08-02 00:00:00
2010-01-31 12:00:00
2010-08-02 00:00:00
2011-01-31 12:00:00
2011-08-02 00:00:00
2012-01-31 12:00:00
2012-08-01 00:00:00
2013-01-30 12:00:00
2013-08-01 00:00:00
2014-01-30 12:00:00
2014-08-01 00:00:00
2015-01-30 12:00:00
2015-08-01 00:00:00
2016-01-30 12:00:00
2016-07-31 00:00:00
df.cv_holiday %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Home value")+
  scale_color_discrete(name = 'Cutoff')

performance_metrics(df.cv_holiday)%>%
  kbl(caption = "Table: Holidays model CV performance metrics")%>%
  kable_classic(full_width = F, html_font = "Cambria")
Table: Holidays model CV performance metrics
horizon mse rmse mae mape mdape smape coverage
1380 hours 7085812 2661.919 1847.714 0.0127661 0.0088902 0.0128621 0.2016129
1392 hours 6830350 2613.494 1828.086 0.0125604 0.0088902 0.0126436 0.2258065
1404 hours 7929313 2815.904 2004.096 0.0137685 0.0089556 0.0138474 0.2258065
1416 hours 8326729 2885.607 2134.955 0.0147409 0.0099573 0.0148185 0.2580645
1428 hours 8925242 2987.514 2227.183 0.0154213 0.0099573 0.0155174 0.2661290
1440 hours 9782313 3127.669 2321.821 0.0161352 0.0109473 0.0162752 0.2661290
1452 hours 9628944 3103.054 2311.548 0.0160471 0.0109473 0.0161895 0.2580645
1464 hours 9701966 3114.798 2331.786 0.0161688 0.0121212 0.0163127 0.2580645
2100 hours 8151873 2855.149 2072.723 0.0142187 0.0109473 0.0143113 0.3225806
2124 hours 10653239 3263.930 2488.605 0.0170862 0.0125826 0.0171767 0.2500000
2136 hours 10717861 3273.815 2522.003 0.0172941 0.0137316 0.0173834 0.2661290
2148 hours 11607770 3407.018 2606.955 0.0180729 0.0128247 0.0182391 0.2903226
2160 hours 11352351 3369.325 2607.621 0.0180847 0.0137316 0.0182502 0.2661290
2172 hours 11216270 3349.070 2586.079 0.0179203 0.0128247 0.0180888 0.2580645
2184 hours 11792074 3433.959 2619.637 0.0181315 0.0128247 0.0183149 0.2580645
2208 hours 11505619 3391.993 2605.975 0.0179717 0.0128247 0.0181463 0.2500000
2844 hours 10081178 3175.087 2346.764 0.0160457 0.0128247 0.0161706 0.3225806
2856 hours 10113834 3180.225 2369.322 0.0161690 0.0128247 0.0162876 0.3145161
2868 hours 10650577 3263.522 2476.660 0.0169122 0.0128247 0.0170360 0.2903226
2880 hours 10977307 3313.202 2621.293 0.0180028 0.0151552 0.0181192 0.2419355
2892 hours 12358409 3515.453 2777.737 0.0191477 0.0151552 0.0193032 0.2338710
2904 hours 14357527 3789.133 2931.750 0.0202899 0.0172276 0.0205209 0.2258065
2916 hours 14253307 3775.355 2913.183 0.0201432 0.0150671 0.0203787 0.2258065
2928 hours 14524051 3811.043 2968.882 0.0204882 0.0172276 0.0207292 0.2258065
3564 hours 12279925 3504.272 2647.313 0.0180714 0.0150671 0.0182375 0.2903226
3588 hours 15215440 3900.697 3084.814 0.0211192 0.0181413 0.0212932 0.2177419
3600 hours 15354601 3918.495 3130.387 0.0214060 0.0182110 0.0215772 0.2096774
3612 hours 17190227 4146.110 3253.759 0.0224584 0.0182110 0.0227317 0.2258065
3624 hours 16918182 4113.172 3218.015 0.0222109 0.0187418 0.0224809 0.2580645
3636 hours 16781197 4096.486 3194.597 0.0220248 0.0152658 0.0223010 0.2580645
3648 hours 17269927 4155.710 3217.739 0.0221652 0.0156254 0.0224531 0.2258065
3672 hours 17037972 4127.708 3227.050 0.0221259 0.0156254 0.0224043 0.2258065
4308 hours 14889876 3858.740 2929.773 0.0199010 0.0156254 0.0201010 0.3225806
4332 hours 17594367 4194.564 3322.505 0.0226795 0.0203274 0.0228883 0.2500000
4344 hours 15710286 3963.620 3077.467 0.0209027 0.0172238 0.0210921 0.2903226
4356 hours 19101619 4370.540 3391.115 0.0233096 0.0187418 0.0236241 0.2741935
4368 hours 17969227 4239.013 3358.235 0.0230029 0.0222755 0.0232267 0.2661290
4380 hours 17227279 4150.576 3303.528 0.0225634 0.0222755 0.0227665 0.2580645
4392 hours 19851147 4455.463 3518.704 0.0241879 0.0175517 0.0245044 0.2258065
4416 hours 20588467 4537.452 3617.151 0.0247918 0.0234997 0.0251231 0.2258065
5016 hours 18019475 4244.935 3322.038 0.0225760 0.0175517 0.0228247 0.2903226
5040 hours 19584840 4425.476 3542.102 0.0240430 0.0234997 0.0242296 0.2741935
5052 hours 18301392 4278.013 3338.617 0.0225972 0.0197556 0.0227711 0.3225806
5064 hours 21050377 4588.069 3610.164 0.0247866 0.0234997 0.0250675 0.2903226
5076 hours 21696987 4658.003 3789.650 0.0259998 0.0259075 0.0262356 0.2419355
5088 hours 20711486 4550.987 3754.801 0.0255614 0.0259075 0.0257638 0.2258065
5100 hours 24205695 4919.928 3995.778 0.0273792 0.0259075 0.0277396 0.2258065
5124 hours 24254011 4924.836 4004.360 0.0274321 0.0259075 0.0277935 0.2258065
5760 hours 21769420 4665.771 3698.630 0.0251265 0.0197556 0.0254030 0.2903226
5772 hours 20421540 4519.020 3499.653 0.0237404 0.0197556 0.0239631 0.3440860
5784 hours 21911436 4680.965 3639.825 0.0247012 0.0191914 0.0249144 0.3306452
5796 hours 23601327 4858.120 3958.128 0.0269945 0.0259075 0.0272247 0.2500000
5808 hours 25180261 5017.994 4119.196 0.0281903 0.0223888 0.0284723 0.2258065
5820 hours 27535914 5247.467 4224.078 0.0290345 0.0223888 0.0294426 0.2258065
5832 hours 27644336 5257.788 4235.336 0.0290310 0.0223888 0.0294559 0.2258065
5844 hours 27326546 5227.480 4203.571 0.0288354 0.0209749 0.0292542 0.2258065
6480 hours 23858964 4884.564 3851.119 0.0261895 0.0205783 0.0264908 0.2661290
6504 hours 26782552 5175.186 4169.212 0.0282877 0.0248724 0.0285277 0.2473118
6516 hours 26994411 5195.615 4210.219 0.0285457 0.0260384 0.0287818 0.2500000
6528 hours 29529574 5434.112 4456.319 0.0305953 0.0224780 0.0310035 0.2016129
6540 hours 29242776 5407.659 4454.195 0.0306165 0.0260384 0.0310294 0.1935484
6552 hours 29074404 5392.069 4442.977 0.0303074 0.0260384 0.0307263 0.1935484
6564 hours 30811671 5550.826 4527.305 0.0308395 0.0260384 0.0313125 0.1935484
6588 hours 29796426 5458.610 4469.971 0.0303514 0.0252500 0.0307951 0.1935484
7224 hours 26723551 5169.483 4124.258 0.0278534 0.0252500 0.0281878 0.2580645
7236 hours 26469106 5144.814 4052.574 0.0273652 0.0252500 0.0276596 0.2795699
7248 hours 27369799 5231.615 4161.627 0.0281096 0.0224780 0.0284085 0.2661290
7260 hours 28774453 5364.182 4437.877 0.0301684 0.0295566 0.0304749 0.1854839
7272 hours 31675690 5628.116 4639.708 0.0316485 0.0273066 0.0320463 0.1612903
7284 hours 36247338 6020.576 4872.729 0.0333665 0.0295566 0.0339473 0.1612903
7296 hours 36729861 6060.517 4903.854 0.0334686 0.0295566 0.0340743 0.1612903
7308 hours 36430203 6035.744 4890.075 0.0333955 0.0289497 0.0339958 0.1612903
7944 hours 31867621 5645.141 4469.368 0.0302531 0.0252500 0.0306949 0.2258065
7968 hours 34731632 5893.355 4790.511 0.0323881 0.0301323 0.0327692 0.2150538
7980 hours 35127034 5926.806 4856.017 0.0328047 0.0319223 0.0331780 0.2096774
7992 hours 39436339 6279.836 5174.038 0.0353709 0.0289497 0.0359697 0.1693548
8004 hours 39089330 6252.146 5143.341 0.0352034 0.0319223 0.0357977 0.1612903
8016 hours 38828669 6231.265 5147.085 0.0349462 0.0319223 0.0355429 0.1612903
8028 hours 40680158 6378.100 5251.867 0.0356294 0.0319223 0.0362751 0.1612903
8052 hours 39581249 6291.363 5210.728 0.0352145 0.0319223 0.0358262 0.1612903
8688 hours 35425308 5951.916 4810.196 0.0323424 0.0319223 0.0328037 0.2258065
8712 hours 38865763 6234.241 5151.744 0.0347855 0.0349362 0.0352124 0.2150538
8724 hours 36237637 6019.770 4868.439 0.0327822 0.0264664 0.0331871 0.2338710
8736 hours 42574661 6524.926 5351.314 0.0365124 0.0349362 0.0371695 0.1774194
8748 hours 39604089 6293.178 5186.128 0.0352838 0.0332404 0.0357510 0.1935484
8760 hours 39307861 6269.598 5196.775 0.0350580 0.0332404 0.0355287 0.1935484
plot_cross_validation_metric(df.cv_holiday, metric = 'rmse')

It does not look very good. Now let’s compare the other models:

  • Model Comparison

  • Model 1: changepoint in trend model with freq=‘monthly’, changepoint=5;

  • Model 2: seasonality model with freq=‘daily’ and week/daily seasonality=TRUE;

  • Model 3: Additive seasonality model;

  • Model 4: Multiplicative seasonality model.

#mod1: trend changepoint model, capped at 5 changepoints
mod1 = prophet(train, n.changepoints = 5)
mod1_future = make_future_dataframe(mod1, periods = 53, freq = 'month')  # 53 months ahead
mod1_forecast = predict(mod1, mod1_future)

# cross-validation: ~4-year initial window, 1-year horizon
df_cv_mod1 <- cross_validation(mod1, initial = 4*365, horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv_mod1) %>% 
  mutate(model = 'mod1')

#mod2: daily and weekly seasonality enabled
mod2 = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
mod2_future = make_future_dataframe(mod2, periods = 1583)  # freq = daily; roughly 53 months ahead
mod2_forecast = predict(mod2, mod2_future)

df_cv_mod2 <- cross_validation(mod2, initial = 4*365, horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv_mod2) %>% 
  mutate(model = "mod2")

# compare mod1 vs. mod2 RMSE across the cross-validation horizon
metrics1 %>% 
  bind_rows(metrics2) %>% 
  ggplot() +
  geom_line(aes(horizon, rmse, color = model))

#mod3: additive seasonality
mod3 = prophet(train, seasonality.mode = 'additive')
mod3_forecast = predict(mod3, future)
df_cv_mod3 <- cross_validation(mod3, initial = 4*365, horizon = 365, units = 'days')
metrics3 = performance_metrics(df_cv_mod3) %>% 
  mutate(model = 'mod3')

#mod4: multiplicative seasonality
mod4 = prophet(train, seasonality.mode = 'multiplicative')
forecast4 = predict(mod4, future)
df_cv_mod4 <- cross_validation(mod4, initial = 4*365, horizon = 365, units = 'days')
metrics4 = performance_metrics(df_cv_mod4) %>% 
  mutate(model = "mod4")

# compare mod3 vs. mod4 RMSE across the cross-validation horizon
metrics3 %>% 
  bind_rows(metrics4) %>% 
  ggplot() +
  geom_line(aes(horizon, rmse, color = model))

performance_metrics(df_cv_mod1) %>%
  kbl(caption = "Table: Model 1 CV performance metrics") %>%
  kable_classic(full_width = F, html_font = "Cambria")
plot_cross_validation_metric(df_cv_mod1, metric = 'rmse')

Cross-validation results for this monthly series are not very reasonable, likely because the cross-validation windows (initial, period, and horizon) are specified in daily units while the data are monthly. Even so, all four models seem to fit reasonably well and show similar RMSE and other performance metrics.
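
One possible remedy, sketched below with illustrative values (not run for this report; the object name df_cv_mod1_sparse is hypothetical), is to space the cross-validation cutoffs further apart with the period argument so that each fold advances roughly half a year rather than the daily default:

df_cv_mod1_sparse <- cross_validation(
  mod1,
  initial = 4*365,  # ~4 years of training data
  period  = 180,    # move the cutoff ~6 months between folds
  horizon = 365,    # evaluate one year ahead
  units   = 'days'
)
performance_metrics(df_cv_mod1_sparse, rolling_window = 0.1)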

Out-of-sample RMSE:

RMSE = sqrt(mean((test$y - mod1_forecast$yhat)^2))
## Warning in test$y - mod1_forecast$yhat: longer object length is not a multiple
## of shorter object length
RMSE
## [1] 44973.13
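
The recycling warning above signals a problem: mod1_forecast contains fitted values for the training period as well as the forecast horizon, so it is longer than test and R silently recycles the shorter vector. A more careful computation would first align the forecast to the test window; a sketch (assuming test holds the hold-out observations in Prophet’s ds/y format and that its dates match those generated by make_future_dataframe()) is:

# keep only the forecast rows whose dates appear in the test set
mod1_test_pred <- semi_join(mod1_forecast, test, by = "ds") %>%
  arrange(ds)
RMSE_test <- sqrt(mean((test$y - mod1_test_pred$yhat)^2))
RMSE_test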

4.2.1 Conclusion:

From our project, the ARIMA model performs better than the Prophet model with 5 changepoints: the ARIMA model has a much lower out-of-sample RMSE (16330.02 vs. 44973.13). Two likely reasons are, first, that our data are monthly while Prophet is designed mainly for daily data with strong seasonal patterns, and second, that we did not add an external regressor to the Prophet model, which could change the outcome.
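
If we were to revisit the Prophet model, external information can be supplied through add_regressor(); below is a minimal sketch under the assumption that a data frame train_reg carries an extra inventory column (the column, the data frames, and future_inventory are all hypothetical, not objects built in this report):

# hypothetical sketch: Cincinnati housing inventory as an extra regressor
mod_reg <- prophet(n.changepoints = 5)           # build an unfit model
mod_reg <- add_regressor(mod_reg, 'inventory')   # register the extra column
mod_reg <- fit.prophet(mod_reg, train_reg)       # train_reg: ds, y, inventory
future_reg <- make_future_dataframe(mod_reg, periods = 12, freq = 'month')
# regressor values must also be supplied for the forecast horizon
future_reg$inventory <- c(train_reg$inventory, future_inventory)
forecast_reg <- predict(mod_reg, future_reg)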
