1 Data Info and Exploratory Data Analysis (EDA)

1.1 Datasets Description

The Zillow Home Price Data are time series datasets available online via this link. The datasets selected are updated monthly:

Home Values – Zillow Home Values Index (ZHVI) that tells us the typical home value in a given geography (metro area, city, ZIP code, etc.), now and over time.).
Home Values Forescasts: is the one-year forecast of ZHVI and is created using the all homes, mid-tier cut of ZHVI.
Rentals: a smoothed measure of the typical observed market rate rent acreoss a given region.
Inventory: The count of unique listings that were active at any time in a given month.
LIST AND SALE PRICES:The median price at which homes across various geographies were listed and sold.
SALES COUNT AND PRICE CUTS: is the estimated number of unique properties that sold during the month after accounting for the latency between when sales occur and when they are reported.

1.2 Project Proposal

These monthly time series datasets from Zillow were selected as they are consistent with my interest of working on a project of “Forecasting Home Value in Cincinnati”.

Experts in real estates had been concerned about the cycle of housing market as below:

image

On top of that in 2020, as COVID-19 pandemic spreads the world and policies such as lock-down and social distancing coming out, the predictions on housing market had been very pessimistic. Experts and media made forecasts and implied a shrink in US proprietary prices and housing market as the GDP decreases and economic struggled to recover.

However, the housing market had unprecedented increase since late 2020, rising by nearly 20% over last year (2021). In 2022, how will the housing market perform? Is the demand still strong or should we concern about home values decrease or even crash of the market? The abnormal pattern in 2021 may make it difficult to forecast as it is not a normal trend and will increase the expectation of sellers. We want to forecast these time series to see if the increase trend is a one-time thing or it will continue in 2022 as well.

1.3 Data preparation

Import datasets

We loaded five separate datasets. These are all wide dataset that need to be transform to long dataset later:

home_value has 908 obs(regions) and 269 variables (monthly; Jan/2000- Dec/2021).
Rentals: has 102 obs(regions) and 99 variables (monthly; Jan/2014- Dec/2021).
Inventory: has 917 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).
LIST AND SALE PRICES:has 95 obs(regions) and 53 variables (monthly; Jan/2018- Dec/2021).
SALES COUNT AND PRICE CUTS: has 95 obs(regions) and 172 variables (monthly; Feb/2008- Dec/2021).

home_value<- read_csv("data/home_values.csv")
inventory<- read_csv("data/inventory.csv")
list_and_sale_price<- read_csv("data/list_and_sale_price.csv")
rentals<- read_csv("data/rentals.csv")
sales_count_and_price_cuts<- read_csv("data/sales_count_and_price_cuts.csv")

1.3.1 Transformed datasets

Data for Cincinnati, OH

Separate datasets

We can learn from the above datasets that the city Cincinnati, OH was given a unique region id “394466”. The size of the city ranked 28 among 900 and more cities (regions). Additionally, the region type of the city is Metropolitan statistical areas (MSA).

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_inventory<- inventory%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="inventory")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_list_and_sale_price<- list_and_sale_price%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:53,names_to="date",values_to="list_sale_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_rentals<- rentals%>%
  filter(RegionID==394466)%>%
  pivot_longer(4:99,names_to="date",values_to="rentals")%>%
  mutate(month= as.Date(as.yearmon(date)))%>%
  mutate(new_month=as.Date(month) + months(1) - days(1))

cin_sales_count_and_price_cuts<- sales_count_and_price_cuts%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:172,names_to="date",values_to="sales_count_price")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

Combined dataset

This is the Cincinnati full combined data of the five separate datasets with different time length and the dimension is 264 observations and 12 columns. Note: for inventory, list price, rentals,sales have shorter time period than home value data.

cin_combine_full<- cin_home_value%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_inventory[,7:8], by="new_month")%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")

# good alternative library for descriptive statistics
cin_combine_full%>%
  tail(5)%>%
  kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2000-2021)")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Combined dataset of home information in Cincinnati–(2000-2021)
RegionID	SizeRank	RegionName	RegionType	StateName	date	home_value	new_month	sales_count_price	rentals	inventory	list_sale_price
394466	28	Cincinnati, OH	Msa	OH	2021-08-31	231514	2021-08-31	3071	1338	6254	336233
394466	28	Cincinnati, OH	Msa	OH	2021-09-30	234720	2021-09-30	3010	1348	6479	324600
394466	28	Cincinnati, OH	Msa	OH	2021-10-31	236772	2021-10-31	3143	1359	6452	316633
394466	28	Cincinnati, OH	Msa	OH	2021-11-30	238710	2021-11-30	3156	1369	6006	310300
394466	28	Cincinnati, OH	Msa	OH	2021-12-31	240833	2021-12-31	2941	1380	5302	306967

This is the Cincinnati combined data of the five separate datasets above with common time length and it has 48 obs and 12 variables.

cin_combine<- cin_inventory%>%
  left_join(cin_list_and_sale_price[,7:8], by="new_month")%>%
  left_join(cin_rentals[,c(5,7)], by="new_month")%>%
  left_join(cin_sales_count_and_price_cuts[,7:8], by="new_month")%>%
  left_join(cin_home_value[,7:8], by="new_month")

# good alternative library for descriptive statistics
cin_combine%>%
  head(5)%>%
  kbl(caption = "Table: Combined dataset of home information in Cincinnati--(2018-2021)")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Combined dataset of home information in Cincinnati–(2018-2021)
RegionID	SizeRank	RegionName	RegionType	StateName	date	inventory	new_month	list_sale_price	rentals	sales_count_price	home_value
394466	28	Cincinnati, OH	Msa	OH	2018-01-31	8017	2018-01-31	218300	1087	2100	167537
394466	28	Cincinnati, OH	Msa	OH	2018-02-28	7636	2018-02-28	219967	1091	1797	168013
394466	28	Cincinnati, OH	Msa	OH	2018-03-31	7846	2018-03-31	231600	1095	2765	168827
394466	28	Cincinnati, OH	Msa	OH	2018-04-30	8252	2018-04-30	248267	1099	3206	170175
394466	28	Cincinnati, OH	Msa	OH	2018-05-31	8957	2018-05-31	261600	1104	3278	172092

This is a subset of combined dataset which omited info about the city.

cin_sub<- cin_combine[, 7:12]

# good alternative library for descriptive statistics

cin_sub%>%
  tail(5)%>%
  kbl(caption = "Table: Subset dataset of home information in Cincinnati")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Subset dataset of home information in Cincinnati
inventory	new_month	list_sale_price	rentals	sales_count_price	home_value
6254	2021-08-31	336233	1338	3071	231514
6479	2021-09-30	324600	1348	3010	234720
6452	2021-10-31	316633	1359	3143	236772
6006	2021-11-30	310300	1369	3156	238710
5302	2021-12-31	306967	1380	2941	240833

1.4 EDA -Full combined dataset

home_value_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,home_value))+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2000 - December 2021')

inventory_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,inventory))+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
   labs(
    title  = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,list_sale_price))+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
   labs(
    title  = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,rentals))+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title  = 'Rentals in Cincinnati',
    subtitle = 'January 2014 - December 2021')

sales_count_price_full_plot = ggplot(cin_combine_full)+
  geom_line(aes(new_month,sales_count_price))+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
   labs(
    title  = 'Sales count and price cut in Cincinnati',
    subtitle = 'Feburary 2008 - December 2021')

grid.arrange(home_value_full_plot,
          inventory_full_plot,
          list_and_sale_price_full_plot,
          rentals_full_plot,
          sales_count_price_full_plot, 
          ncol=2, nrow =3)

1.4.1 EDA -Time Trend -combined Data (2018-2021)

home_value

home_value_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,home_value))+
  geom_smooth(aes(new_month,home_value),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2018 - December 2021')

home_value_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                home_value, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Home Value in Cincinnati by Year' )

grid.arrange(home_value_plot, home_value_year_plot, ncol=1, nrow =2)

## `geom_smooth()` using formula 'y ~ x'

#boxplot of home value
cin_home_value%>%ggplot(aes(year(new_month), 
                           home_value, group=factor(year(new_month)),
                           colour=factor(year(new_month)))) +
  geom_boxplot()+
  labs(title= 'Boxplot of Home Value in Cincinnati by Year' )

inventory

inventory_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,inventory))+
  geom_smooth(aes(new_month,inventory),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Inventory")+
  labs(
    title = 'Inventory in Cincinnati',
    subtitle = 'January 2018 - December 2021')

inventory_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                inventory, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Inventory in Cincinnati by Year' )

grid.arrange(inventory_plot, inventory_year_plot, ncol=1, nrow =2)

## `geom_smooth()` using formula 'y ~ x'

list_sale_price

list_and_sale_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,list_sale_price))+
  geom_smooth(aes(new_month,list_sale_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("List and sale price")+
  labs(
    title = 'List and sale price in Cincinnati',
    subtitle = 'January 2018 - December 2021')

list_and_sale_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                list_sale_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'List and sale price in Cincinnati by Year' )

grid.arrange(list_and_sale_price_plot, list_and_sale_price_year_plot, ncol=1, nrow =2)

## `geom_smooth()` using formula 'y ~ x'

`rentals``

rentals_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,rentals))+
  geom_smooth(aes(new_month,rentals),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Rentals")+
  labs(
    title = 'Rentals in Cincinnati',
    subtitle = 'January 2018 - December 2021')

rentals_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                rentals, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Rentals in Cincinnati by Year' )

grid.arrange(rentals_plot, rentals_year_plot, ncol=1, nrow =2)

## `geom_smooth()` using formula 'y ~ x'

`sales_count_price``

sales_count_price_plot = ggplot(cin_combine)+
  geom_line(aes(new_month,sales_count_price))+
  geom_smooth(aes(new_month,sales_count_price),method='lm',color='red')+
  theme_bw()+
  xlab("Month")+
  ylab("Sales count and price cut")+
  labs(
    title = 'Sales count and price cut in Cincinnati',
    subtitle = 'January 2018 - December 2021')

sales_count_price_year_plot<- cin_sub%>%
  ggplot(aes(month(new_month, label=TRUE, abbr=TRUE), 
                sales_count_price, group=factor(year(new_month)), colour=factor(year(new_month)))) +
  geom_line() +
  geom_point() +
  labs(x="Month", colour="Year") +
  theme_classic()+
  labs(title= 'Sales count and price cut in Cincinnati by Year' )

grid.arrange(sales_count_price_plot, sales_count_price_year_plot, ncol=1, nrow =2)

## `geom_smooth()` using formula 'y ~ x'

1.5 Initial data analysis

As implied in the EDA part, home value , and rentalsseem to increase over time and indicated the same for each year (2018-2021); Similarly, for list and sale price, there’s an increasing trend over the years but a variation within each year (this maybe the seasonal); On the contrary, inventory seems to decreases overtime as well as yearly; Lastly, for sales count and price cut the variation tends to slightly decrease over the years.

Now we want to model the linear trend and see if it’s consistent with our observations from the visualization.

# Test whether there is a true linaer trend over time
mod1 = lm(home_value~new_month,data=cin_combine_full)
summary(mod1)

mod2 = lm(inventory~new_month,data=cin_combine_full)
summary(mod2)

mod3 = lm(list_sale_price~new_month,data=cin_combine_full)
summary(mod3)


mod4 = lm(rentals~new_month,data=cin_combine_full)
summary(mod4)

mod5 = lm(sales_count_price~new_month,data=cin_combine_full)
summary(mod5)

The outcome table:

tab_model(mod1, mod2,mod3,mod4,mod5)

	home value			inventory			list sale price			rentals			sales count price
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	67293.58	53729.25 – 80857.90	<0.001	62976.81	50644.46 – 75309.16	<0.001	-1130447.61	-1357341.28 – -903553.93	<0.001	-1332.23	-1420.22 – -1244.23	<0.001	-4112.02	-4975.12 – -3248.91	<0.001
new month	5.95	5.05 – 6.84	<0.001	-3.02	-3.70 – -2.35	<0.001	77.53	65.12 – 89.94	<0.001	0.14	0.13 – 0.14	<0.001	0.38	0.33 – 0.44	<0.001
Observations	264			48			48			96			167
R² / R² adjusted	0.396 / 0.393			0.639 / 0.631			0.775 / 0.770			0.970 / 0.970			0.560 / 0.557

We can see from most of the Linear regression model output is consistent with the overtime trends we observed from the visualization plots above: home value , rentals, and list and sale price seem to increase over time and have positive coefficients (positive linear association); On the contrary, inventory seems to decreases overtime and has a negative coefficient (negative linear association); Lastly, for sales count and price cut the variation tends to slightly decrease over the years, however, it has a positive coefficient indicating positive linear association.

2 ARIMA

From this part, we will start model development focusing on time series of **home value data** in Cincinnati:

Load data:

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

cin_home_value%>%
  head(5)%>%
  kbl(caption = "Table: Cincinnati Home Value data")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Cincinnati Home Value data
RegionID	SizeRank	RegionName	RegionType	StateName	date	home_value	new_month
394466	28	Cincinnati, OH	Msa	OH	2000-01-31	129379	2000-01-31
394466	28	Cincinnati, OH	Msa	OH	2000-02-29	129364	2000-02-29
394466	28	Cincinnati, OH	Msa	OH	2000-03-31	129290	2000-03-31
394466	28	Cincinnati, OH	Msa	OH	2000-04-30	129780	2000-04-30
394466	28	Cincinnati, OH	Msa	OH	2000-05-31	130454	2000-05-31

2.1 Assess data generating process

We want to look at the plot and access Data generating process.

# Plot the Data
cin_home_value%>% ggplot()+
  geom_line(aes(new_month,home_value))+
  theme_bw()+
  xlab("Month")+
  ylab("Home Value")+
  labs(
    title = 'Home Value in Cincinnati',
    subtitle = 'January 2000 - December 2021')

As the plot above indicates non-stationarity. We want to look at 6 months rolling sd plot:

homevalue_roll <- cin_home_value %>%
  mutate(
    home_mean = zoo::rollmean(
      home_value, 
      k = 6, 
      fill = NA),
    close_sd = zoo::rollapply(
      home_value, 
      FUN = sd, 
      width = 6, 
      fill = NA)
  )
homevalue_rollmean <- homevalue_roll %>%
  ggplot() +
  geom_line(aes(new_month, home_value)) +
  theme_bw() +
  ggtitle("homevalue Mean over Time (6 month rolling window)") +
  ylab("home value") +
  xlab("Month")
homevalue_rollsd <- homevalue_roll %>%
  ggplot() +
  geom_line(aes(new_month, home_value)) +
  theme_bw() +
  ggtitle("Differenced homevalue SD over Time (6 month rolling window)") +
  ylab("home value") +
  xlab("Month")

grid.arrange(homevalue_rollmean,homevalue_rollsd,ncol=1, nrow =2)

The rolling plots above mainly implies the mean non-stationary and slightly/very limited variance non-stationary. It might because of our data is monthly data (monthly average, the variation is less). We want to confirm the data only has mean non-stationary and no variance non-stationary. Thus we start with variance non-stationary transformation (Log/box-cox):

2.2 Transformation

2.2.1 - Variance non-stationary transformation

Log transformation
Box-cox transfomation

cinhome_var_transform = cin_home_value %>%
  mutate(home_log = log1p(home_value),
         home_box = BoxCox(home_value,lambda='auto'))

homevalue_log_trans <- cinhome_var_transform %>%
  ggplot() +
  geom_line(aes(new_month, home_log)) +
  theme_bw() +
  ggtitle("Homevalue in Cincinnati over Time (log transform)") +
  ylab("home value(log)") +
  xlab("Month")
homevalue_box_trans <- cinhome_var_transform %>%
  ggplot() +
  geom_line(aes(new_month, home_box)) +
  theme_bw() +
  ggtitle("Homevalue in Cincinnati over Time (box-cox transform)") +
  ylab("home value(box-cox)") +
  xlab("Month")

grid.arrange(homevalue_log_trans,homevalue_box_trans,ncol=1, nrow =2)

We notice from the plots that after variance transformation, they still seem to be non-stationary. Log/ Box-cox transformation do not have much difference. Besides, the non-stationary did not improve much, we want to use ADF/KPSS test to verify:

ADF test: (null: non-stationary) p-value <0.05 indicates stationary.
KPSS test: (Null: stationary) p-value <0.05 indicates non- stationary.

a1<-adf.test(cinhome_var_transform$home_log) # Non-stationary
k1<-kpss.test(cinhome_var_transform$home_log) # Non-stationary

a2<-adf.test(cinhome_var_transform$home_box) # Non-stationary
k2<-kpss.test(cinhome_var_transform$home_box) # Non-stationary

test<- matrix(c("ADF","Log Transform",a1$p.value,"Non-stationary",
                "KPSS","Log Transform",k1$p.value,"Non-stationary",
                "ADF","Box-cox Transform",a2$p.value,"Non-stationary",
                "KPSS","Box-cox Transform",k2$p.value,"Non-stationary"),
              ncol=4,byrow=TRUE)

test %>%
  kbl(caption = "Table: ADF/KPSS test on Log vs. Box-cox transform") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: ADF/KPSS test on Log vs. Box-cox transform
ADF	Log Transform	0.99	Non-stationary
KPSS	Log Transform	0.01	Non-stationary
ADF	Box-cox Transform	0.99	Non-stationary
KPSS	Box-cox Transform	0.01	Non-stationary

The plots and tests results both showed that the transformed data is still non-stationary. Now we will focus on mean non-stationary transformation and will start with first order differencing:

2.2.2 - Mean non-stationary transformation

First Difference
First Difference, Log transform
First Difference, Box transform

cinhome_mean_transform = cin_home_value %>%
  mutate(home_log = log1p(home_value),
         home_box = BoxCox(home_value,lambda='auto'),
         home_diff= home_value-lag(home_value),
         home_diff_log = home_log - lag(home_log),
         home_diff_box = home_box - lag(home_box))%>%
  drop_na()

#first difference
d1<-cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff))+
      theme_bw()+
      ggtitle("homevalue First Difference: 2000-2021")+
      ylab("Homevalue (Difference))")+
      xlab("Month")

#First order difference on log transformed 
d2<- cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff_log))+
      theme_bw()+
      ggtitle("homevalue First Difference + Log tranformation: 2000-2021")+
      ylab("Homevalue (Difference + log))")+
      xlab("Month")

#First order difference on box transformed
d3<- cinhome_mean_transform %>%
  ggplot()+
      geom_line(aes(new_month,home_diff_box))+
      theme_bw()+
      ggtitle("homevalue First Difference + Box-Cox tranformation: 2000-2021")+
      ylab("Homevalue (Difference + Box-cox))")+
      xlab("Month")

grid.arrange(d1,d2,d3,ncol=1, nrow =3)

We noticed similar pattern from transformation of mean non-stationary (differencing) alone and differencing on top of variance-transform. They still seem to be non-stationary and we can see a trend. we want to use ADF/KPSS test to verify:

# only difference
ad1<-adf.test(cinhome_mean_transform$home_diff) # Non-stationary
kd1<-kpss.test(cinhome_mean_transform$home_diff) # Non-stationary

#difference+log
adl1<-adf.test(cinhome_mean_transform$home_diff_log) # Non-stationary
kdl1<-kpss.test(cinhome_mean_transform$home_diff_log) # Non-stationary

#difference +box
adb1<-adf.test(cinhome_mean_transform$home_diff_box) # Non-stationary
kdb1<-kpss.test(cinhome_mean_transform$home_diff_box) # Non-stationary

test2<- matrix(c("ADF","Difference",ad1$p.value,"Non-stationary",
                 "KPSS","Difference",kd1$p.value,"Non-stationary",
                 "ADF","Difference+log",adl1$p.value,"Non-stationary",
                 "KPSS","Difference+Log",kdl1$p.value,"Non-stationary",
                 "ADF","Difference+Box-cox",adb1$p.value,"Non-stationary",
                 "KPSS","Difference+Box-cox",kdb1$p.value,"Non-stationary"),
               ncol=4,byrow=TRUE)
test2 %>%
  kbl(caption = " Table: ADF/KPSS test on First Difference & Log/ Box-cox Transform+Difference") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: ADF/KPSS test on First Difference & Log/ Box-cox Transform+Difference
ADF	Difference	0.788741858628823	Non-stationary
KPSS	Difference	0.01	Non-stationary
ADF	Difference+log	0.55797735934197	Non-stationary
KPSS	Difference+Log	0.01	Non-stationary
ADF	Difference+Box-cox	0.43768833056367	Non-stationary
KPSS	Difference+Box-cox	0.01	Non-stationary

Results from ADF/KPSS tests for First order difference (alone) and First order difference + Log/ Box-cox transformation all implies non-stationarity. We need to take second-order difference:

Second-order Difference
Second-order Difference, Log transform
Second-order Difference, Box transform

#second order difference
cinhome_second_diff<- cinhome_mean_transform%>%
  mutate(home_second_diff=home_diff-lag(home_diff),
         home_second_diff_log=home_diff_log-lag(home_diff_log),
         home_second_diff_box=home_diff_box-lag(home_diff_box))%>%
  drop_na()
#other ways to second differernce
# diff(diff(cin_home_value$home_value))

Now we plot the second-order differencing and check:

#second order difference on log transformed 
#second difference
sd1<-cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference: 2000-2021")+
      ylab("Homevalue (2nd-Difference))")+
      xlab("Month")

#second order difference on log transformed 
sd2<- cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff_log))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference + Log tranformation: 2000-2021")+
      ylab("Homevalue (2nd-Difference + log))")+
      xlab("Month")

#second order difference on box transformed
sd3<- cinhome_second_diff %>%
  ggplot()+
      geom_line(aes(new_month,home_second_diff_box))+
      theme_bw()+
      ggtitle("Homevalue Second order Difference + Box-Cox tranformation: 2000-2021")+
      ylab("Homevalue (2nd-Difference + Box-cox))")+
      xlab("Month")

grid.arrange(sd1,sd2,sd3,ncol=1, nrow =3)

The plots above is now more mean-reverting. However, we do notice variation changes over the time. Now we can check if the differenced and transformed data are stationary using ADF/KPSS Test:

# only 2nd difference
ad2<-adf.test(cinhome_second_diff$home_second_diff) # stationary
kd2<-kpss.test(cinhome_second_diff$home_second_diff) # stationary

#second difference+log
adl2<-adf.test(cinhome_second_diff$home_second_diff_log) # stationary
kdl2<-kpss.test(cinhome_second_diff$home_second_diff_log) # stationary

#second difference +box
adb2<-adf.test(cinhome_second_diff$home_second_diff_box) # stationary
kdb2<-kpss.test(cinhome_second_diff$home_second_diff_box) # stationary

test3<- matrix(c("ADF","Difference",ad2$p.value,"Stationary",
                 "KPSS","Difference",kd2$p.value,"Stationary",
                 "ADF","Difference+log",adl2$p.value,"Stationary",
                 "KPSS","Difference+Log",kdl2$p.value,"Stationary",
                 "ADF","Difference+Box-cox",adb2$p.value,"Stationary",
                 "KPSS","Difference+Box-cox",kdb2$p.value,"Stationary"),
               ncol=4,byrow=TRUE)
test3 %>%
  kbl(caption = "Table: ADF/KPSS test on 2nd Difference vs. 2nd Difference + Log/ Box-cox Transform") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: ADF/KPSS test on 2nd Difference vs. 2nd Difference + Log/ Box-cox Transform
ADF	Difference	0.01	Stationary
KPSS	Difference	0.1	Stationary
ADF	Difference+log	0.01	Stationary
KPSS	Difference+Log	0.1	Stationary
ADF	Difference+Box-cox	0.01	Stationary
KPSS	Difference+Box-cox	0.1	Stationary

2.2.3 ACF/PACF plot

As the tests all suggests that Second Ordered Difference or Second order difference + log/box-cox transformation are stationary. We will plot ACF & PACF of second order difference transformed data and decide on the parameters of ARIMA model:

# Plot ACF/PACF of second differenced
acf(cinhome_second_diff$home_second_diff)

pacf(cinhome_second_diff$home_second_diff)

### Seasonality

Home value is seasonality-adjusted. We can also conclude such findings from the ACF/PACF plot above.

As we observed trend from the previous data generating plot. In this ACF/PACF plot we can only notice some wave pattern, but can not notice very obvious seasonality/ fixed frequency within a year (there might be some cyclicality pattern). For real estate markets, there’s definitely months that sell better than others. However, unless circumstances like housing market crashes, the home value is usually unlike to decrease back to it’s previous seasonal values.

We can also double check and confirm our assumptions of no seasonality for this home_value time series data using decomposition:

#make home value a ts dataframe
home<- ts(cin_home_value$home_value,frequency = 12, start = c(2000, 1))

# home %>% decompose(type="multiplicative") %>%
#   autoplot() + xlab("date") +
#   ggtitle("Classical multiplicative decomposition
#     of electrical equipment index")


#decompose-- additive
home_decompose_ad = decompose(home,type='additive')
plot(home_decompose_ad)

#decompose -- multiplicative
home_decompose_mul = decompose(home,type='multiplicative')
plot(home_decompose_mul)

2.3 ARIMA Modeling

From the ADF/KPSS tests result and the ACF/PACF plots above, we can confirm that the data is only mean non-stationary, needs second order difference. As for variance transformation, we do not need to worry about. Thus, we will only build model using difference transformation, which dealt with mean non-stationary. Addictionally, from the acf/pacf plots above, we can assume that ARIMA model for:

home_value(Original data): is AR process with 2 order difference. So we have i=2, and p=18 (it’s AR process), q=0. Other significant lags are: 3,4,6,12,15. maybe we can try these in Arima models.

#use original parameter: home_value
mod1 = arima(cinhome_second_diff$home_value,order=c(18,2,0))
mod2 = arima(cinhome_second_diff$home_value,order=c(3,2,0))
mod3 = arima(cinhome_second_diff$home_value,order=c(4,2,0))
mod4 = arima(cinhome_second_diff$home_value,order=c(6,2,0))
mod5 = arima(cinhome_second_diff$home_value,order=c(12,2,0))
mod6 = arima(cinhome_second_diff$home_value,order=c(15,2,0))
mod7 = arima(cinhome_second_diff$home_value,order=c(18,2,1))

model_summary<- matrix(
  c("Model","ARIMA(p,I,q)","Parameter","sigma^2","log Likelihood","AIC","BIC",
    "1","ARIMA(18,2,0)","Home value",mod1$sigma2,mod1$loglik,mod1$aic,BIC(mod1),
    "2","ARIMA(3,2,0)","Home value",mod2$sigma2,mod2$loglik,mod2$aic,BIC(mod2),
    "3","ARIMA(4,2,0)","Home value",mod3$sigma2,mod3$loglik,mod3$aic,BIC(mod3),
    "4","ARIMA(6,2,0)","Home value",mod4$sigma2,mod4$loglik,mod4$aic,BIC(mod4),
    "5","ARIMA(12,2,0)","Home value",mod5$sigma2,mod5$loglik,mod5$aic,BIC(mod5),
    "6","ARIMA(15,2,0)","Home value",mod6$sigma2,mod6$loglik,mod6$aic,BIC(mod6),
    "7","ARIMA(18,2,1)","Home value",mod7$sigma2,mod7$loglik, mod7$aic,BIC(mod7)),
  ncol=7,byrow=TRUE)

model_summary %>%
  kbl(caption = "Table: ARIMA Model output: AIC vs BIC") %>%
  kable_classic(full_width = F, html_font = "Cambria")%>%
  row_spec(1, bold = T)

Table: ARIMA Model output: AIC vs BIC
Model	ARIMA(p,I,q)	Parameter	sigma^2	log Likelihood	AIC	BIC
1	ARIMA(18,2,0)	Home value	35339.0625546411	-1733.42177827751	3504.84355655501	3572.49650754431
2	ARIMA(3,2,0)	Home value	52017.2052625915	-1780.92673285475	3569.8534657095	3584.09619223356
3	ARIMA(4,2,0)	Home value	49859.9552256206	-1775.51214578929	3561.02429157858	3578.82769973366
4	ARIMA(6,2,0)	Home value	46951.6410027904	-1767.87902494625	3549.7580498925	3574.68282130961
5	ARIMA(12,2,0)	Home value	42855.4641643852	-1756.5986295862	3539.19725917239	3585.4861203756
6	ARIMA(15,2,0)	Home value	39563.6255128961	-1746.94105520269	3525.88211040538	3582.85301650163
7	ARIMA(18,2,1)	Home value	35325.8882484904	-1733.39468462616	3506.78936925232	3578.00300187263

We can tell from the models output that Model 1: ARIMA(18,2,0) on home value performs best as it has the smallest AIC as well as BIC.

2.3.1 Auto.arima

#auto.arima for home value
mod1_auto = auto.arima(cinhome_second_diff$home_value,max.p=20,max.q=20,max.order=20,stepwise=FALSE,approximation=FALSE,stationary=FALSE)

From the auto. arima output above, the output suggest the best model for home value is ARIMA(18,2,0).

2.4 Diagnosits

Check residuals
Box-Ljung Test

Now we want to look at residuals on our best model.

# Diagnostics from the Model ---------------------
# Examine Residuals - Box-Ljung Test says residuals have no autocorrelation at lag 1/5
checkresiduals(mod1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
## 
## Model df: 18.   Total lags used: 21

checkresiduals(mod1_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(18,2,0)
## Q* = 7.7663, df = 3, p-value = 0.0511
## 
## Model df: 18.   Total lags used: 21

Box.test(mod1$residuals,type = 'Ljung-Box')

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123

Box.test(mod1$residuals,type = 'Ljung-Box',lag=1)

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123

Box.test(mod1$residuals,type = 'Ljung-Box',lag=10)

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652

Both model indicated autocorrelation problem, the mod1_autoor Model 1 on home value – ARIMA(18,2,0) did not indicate any autocorrelation.

Box.test(mod1_auto$residuals,type = 'Ljung-Box')

## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123

Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=1)

## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123

Box.test(mod1_auto$residuals,type = 'Ljung-Box',lag=10)

## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 5.369, df = 10, p-value = 0.8652

in_sample_pred = cinhome_second_diff$home_value-mod1_auto$residuals
#transfer for box data 
#in_sample_pred = InvBoxCox(in_sample_pred,lambda=best_lambda)

# RMSE Calculation
# Average Error of 13, which is pretty good
RMSE = sqrt(mean((in_sample_pred - cinhome_second_diff$home_value)^2,na.rm=T))
RMSE

## [1] 187.6045

2.5 Forecasting

Now we want to predict using our best model and get predictions for 12 steps (a year) ahead:

# Get Predictions for 5 steps ahead
prediction = predict(mod1_auto,n.ahead=12)

# Create Prediction Dataset for next 5 month
pred_data = data.frame(
  pred = prediction$pred,
  se = prediction$se,
  new_month =  seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=12)
  )


pred_merge = cinhome_second_diff %>% 
  mutate(in_sample_pred = in_sample_pred) %>%
  full_join(pred_data) %>%
  mutate(
    pred_high = pred+2*se, # Transform back into regular values after creating CIs
    pred_low = pred - 2*se,
    pred = pred,
    error = pred - home_value
  )

# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>% 
  mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
  ggplot()+
  geom_line(aes(ymd(new_month),home_value))+
  geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
  geom_line(aes(ymd(new_month),pred),color='blue')+
  geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)

3 Facebook Prophet model

3.1 Basic model

We are still focusing on home value monthly data in Cincinnati. To perform Prophet Model, we need to change name to our dataset: name our date variable –ds; name our time series variable (home_value) –y.

cin_home_value<- home_value%>%
  filter(RegionID==394466)%>%
  pivot_longer(6:269,names_to="date",values_to="home_value")%>%
  mutate(new_month= as.Date(date, "%Y-%m-%d"))

#change to prophet format data
prophet_data = cin_home_value %>% 
    rename(ds = new_month, # Have to name our date variable "ds"
    y = home_value)  # Have to name our time series "y"

prophet_data%>%
  head(5)%>%
  kbl(caption = "Table: Prophet Format Data: Cincinnati Home Value")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Prophet Format Data: Cincinnati Home Value
RegionID	SizeRank	RegionName	RegionType	StateName	date	y	ds
394466	28	Cincinnati, OH	Msa	OH	2000-01-31	129379	2000-01-31
394466	28	Cincinnati, OH	Msa	OH	2000-02-29	129364	2000-02-29
394466	28	Cincinnati, OH	Msa	OH	2000-03-31	129290	2000-03-31
394466	28	Cincinnati, OH	Msa	OH	2000-04-30	129780	2000-04-30
394466	28	Cincinnati, OH	Msa	OH	2000-05-31	130454	2000-05-31

Prophet model is usually used for daily data, fitting monthly data in this model is also applicable but we are asking for daily forecasts in this case. We will start with a basic forecasting using Prophet with frequency as daily.

We took months before “2017-08-31” as train dataset (approximately 80%) and time at “2017-08-31” and after as test dataset (approximately 20%). We can plot an interactive plot for forecasting after modeling.

Additionally, we will plot lines of forecast home value with actual Home value for comparison.

train1 = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test1 = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model1 = prophet(train1)
future1 = make_future_dataframe(model1,periods = 1583)#freq=daily; 53 month ahead
forecast1 = predict(model1,future1)

#plot interactive forecast plot
dyplot.prophet(model1,forecast1)

#plot actual vs forecast line for comparision
ggplot()+
  geom_line(data=forecast1,aes(x=as.Date(ds),y=yhat,col="blue"))+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")

As we noticed from above that the output is strange as the forecast is not smooth and it generates daily instead of monthly forecast. Now we will change the frequency to month.

train = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model = prophet(train)
future = make_future_dataframe(model,periods = 53,freq = 'month')
forecast = predict(model,future)

#plot interactive forecast plot
dyplot.prophet(model,forecast)

#plot actual vs forecast line for comparision
ggplot()+
  geom_line(data=forecast,aes(x=as.Date(ds),y=yhat,col="blue"))+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")

The plots looks more smooth. However, we can notice from the plots that predicted home value is clearly lower than actual home value after the time of “2017-08-31”. Thus, we will look into trend and seasonality impact on home_value:

3.1.1 Decomposition of the elements

prophet_plot_components(model,forecast)

We can see from the decompose plot above that there’s a clear trend but seasonality is less clear. It might because this home value dataset is seasonally adjusted and only includes the middle price tier of homes. However, there might be cyclical component that follows the economic patterns (eg: 2008 financial crisis; 2019 COVID, etc). Now let’s check the change points:

3.1.2 Examination of the changepoints

The input of changepoints allowed us to increase the fit and its flexibility. We can examine the changepoints identified for the “trend” part of the time series from home value plot above. There is at least 5 change point (1 around 2005, 2 around 2010, 1 around 2015, 1 around 2019):

model_changepoint = prophet(train,n.changepoints=5)
future_changepoint = make_future_dataframe(model,periods = 53,freq = 'month')
forecast_changepoint = predict(model_changepoint,future_changepoint)
plot(model_changepoint,forecast_changepoint)+
  add_changepoints_to_plot(model_changepoint)+
  theme_bw()+xlab("Date")+ylab("Home value")

prophet_plot_components(model_changepoint,forecast_changepoint)

We noticed from the trend plot that setting changepoint 5 roghtly captured trend changes. If we set higher changepoint number, there’s chance of overfitting.

3.1.3 Saturation Points

From the forecast output above, the predicted values are all above 0. In our model, we do not need to take into account a saturating maximum point as the home value. However, we must also specify a maximum capacity if using logistic growth with a minimum. Thus, we could set a reasonable minimun and maximum value: usually home value ( Cincinnati median home value) is unlikely to go below 0 dollar and go above $0.3 million.

train_saturation = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
test_saturation = prophet_data %>%
  filter(ds>=ymd("2017-08-31"))
model_saturation = prophet(train_saturation)
future_saturation = make_future_dataframe(model_saturation,periods = 53,freq = 'month')#freq=monthly; 53 month ahead

# Set "floor" in training data
train_saturation$floor = 0
train_saturation$cap=300000
future_saturation$floor = 0
future_saturation$cap=300000

# Set floor in forecast data
future_saturation$floor = 0
future_saturation$cap=300000
sat_model = prophet(train_saturation,growth='logistic')
sat_forecast = predict(sat_model,future_saturation)
plot(sat_model,sat_forecast)+ylim(0,300000)+
theme_bw()+xlab("Date")+ylab("Home Value")

From the plot, we noticed that all the home values are within the range of (0.1,0.3) million. Therefore, we do not need to specify the saturating points.

3.1.4 Identification and assessment of seasonality

From above decomposition, we noticed that we do not have yearly seasonality. It might because this value is a smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. Now we will try to identity daily, weekly, as well as additibe/ multiplicative seasonality:

Daily/weekly seasonality We have monthly data but we can examine different models with different parameters or setting: We start with model_season with daily.seasonality and weekly.seasonality option set to TURE to check daily and weekly seasonality:

model_season = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
future_season = make_future_dataframe(model_season,periods = 1583)#freq=daily; 53 month ahead
forecast_season = predict(model_season,future_season)

#plot interactive forecast plot
dyplot.prophet(model_season,forecast_season)

#decompose
prophet_plot_components(model_season,forecast_season)

Daily or weekly seasonality also not very obvious. Now we will dive into additive/ multiplicative seasonality:

additive seasonality

additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)

prophet_plot_components(additive,add_fcst)

multiplicative seasonality

multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)

prophet_plot_components(multi,multi_fcst)

Decompose plots of additive or multiplicative seasonality appear to be similar: seasonality is not obvious. Thus, we might only have trend and no seasonality.

3.1.5 Assessment of holidays

We want to try with inclusion of holidays. The result might not be very accurate with important holidays as we are examining monthly rather than daily data:

#Can use built-in holidays, or specify our own
model_holiday = prophet(train,fit=FALSE,daily.seasonality = TRUE, weekly.seasonality = TRUE)
model_holiday = add_country_holidays(model_holiday,country_name = 'US')
model_holiday = fit.prophet(model_holiday,train)
forecast_holiday = predict(model_holiday,future1)

dyplot.prophet(model_holiday,forecast_holiday)

prophet_plot_components(model_holiday,forecast_holiday)

The fitted line looks very well. Additionally, the plots indicated a few import days over years. These days maybe corresponding to our changing points in trend. Now we can look into those days:

forecast_holiday %>%
  filter(holidays != 0) %>%
  select_if(~ !is.numeric(.) || sum(.) != 0)%>%
  select(ds,holidays,`Memorial Day`,`New Year's Day (Observed)`)%>%
  kbl(caption = "Table: Important Holidays")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Important Holidays
ds	holidays	Memorial Day	New Year’s Day (Observed)
2004-05-31	1267.9140	1267.914	0.0000
2004-12-31	-399.0302	0.000	-399.0302
2010-05-31	1267.9140	1267.914	0.0000
2010-12-31	-399.0302	0.000	-399.0302
2018-05-28	1267.9140	1267.914	0.0000
2019-05-27	1267.9140	1267.914	0.0000
2020-05-25	1267.9140	1267.914	0.0000
2021-05-31	1267.9140	1267.914	0.0000

We may notice that the returning result is reasonable:

Important positive influential holiday is Memorial day which is in May when the market is in peak season and buyers are active. On the contrary, negative correlated holiday is New Year’s day, which is when the housing market is in slow season. Moreover, we can see decrease in 2004 and 2010. As for 2010-2011, we had the housing market crash after the 2008 financial crisis.

forecast_holiday %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = `Christmas Day`:`Washington's Birthday`,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
ggplot() +
  geom_col(aes(effect,holiday))+
  theme_bw()

3.2 Outlier detection

anom_train = prophet_data %>% 
  filter(ds<ymd("2017-08-31"))
anom_test = prophet_data %>% 
  filter(ds>=ymd("2017-08-31"))
anom_model = prophet(anom_train)
anom_future = make_future_dataframe(anom_model,periods = 53,freq = 'month')
forecast = predict(anom_model,anom_future)

forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2017-08-31")) %>% 
  mutate(y = anom_test$y) %>% 
  mutate(anomaly = if_else(y>yhat_upper | y<yhat_lower,TRUE,as.logical(NA)))
forecast_plot_data %>% 
  ggplot()+
    geom_line(aes(ds,y))+
    geom_point(aes(ds,y,color=anomaly))+
    geom_line(aes(ds,yhat),color='red')+
    geom_ribbon(aes(ds,ymin=yhat_lower,ymax=yhat_upper),alpha=0.4,fill='blue') + 
  scale_colour_discrete(na.translate = F)

We can see that our actual values in test dataset are much higher than the predicted values. Therefore, all points are detected anomaly. However, we can not delete these data, as they are generated under COVID situation which is an anomaly period of time and business trend.

4 Model Comparision and Validation

4.1 ARIMA

arima_train = cinhome_second_diff %>%
  filter(new_month<ymd("2017-08-31"))
arima_test = cinhome_second_diff %>%
  filter(new_month>=ymd("2017-08-31"))

train_mod1 = arima(arima_train$home_value,order=c(18,2,0))

test_pred1 = predict(train_mod1,53)$pred

error1 = arima_test$home_value - test_pred1

# Get Predictions for 5 steps ahead
prediction = predict(train_mod1,n.ahead=53)

# Create Prediction Dataset for next 5 month
pred_data = data.frame(
  pred = prediction$pred,
  se = prediction$se,
  new_month =  seq(as.Date(max(cinhome_second_diff$new_month)),by="month",length.out=53)
  )


pred_merge = cinhome_second_diff %>% 
  mutate(in_sample_pred = in_sample_pred) %>%
  full_join(pred_data) %>%
  mutate(
    pred_high = pred+2*se, # Transform back into regular values after creating CIs
    pred_low = pred - 2*se,
    pred = pred,
    error = pred - home_value
  )

## Joining, by = "new_month"

# Forecast Actually Looks Pretty good! (If we didn't have covid...)
pred_merge %>% 
  mutate(pred = if_else(is.na(pred) & !is.na(lead(pred)),in_sample_pred,pred)) %>% # This code connects the forecast lines in GGPLOT
  ggplot()+
  geom_line(aes(ymd(new_month),home_value))+
  geom_point(aes(ymd(new_month),in_sample_pred),color='blue')+
  geom_line(aes(ymd(new_month),pred),color='blue')+
  geom_ribbon(aes(x=ymd(new_month),ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)+
  geom_line(data=cin_home_value,aes(x=as.Date(new_month),y=home_value,col="red"))+
  scale_color_identity(name = "Legend",      #for manual legend
                       breaks = c("red","blue"),
                       labels = c("Actual","Forecast"),
                       guide = "legend")+
  labs(title = "Cincy Home Value: Actual value vs Prophet basic forecast",
       x = "Date",
       y = "Cincinnati Home Value")

## Warning: Removed 52 row(s) containing missing values (geom_path).

## Warning: Removed 52 rows containing missing values (geom_point).

## Warning: Removed 260 row(s) containing missing values (geom_path).

Out of sample RMSE:

out_RMSE = sqrt(mean((arima_test$home_value - test_pred1)^2))
out_RMSE

## [1] 16330.02

Out of sample RMSE

4.2 Prophet Rolling window Cross-Validation

We want to use rolling window cross-validation to assess performance of the model at meaningful thresholds depending on the data (eg: 1 year out for monthly data) and then use various metrics (eg: RMSE,MAE,MAPE).

Cross Validation on holiday model

df.cv_holiday <- cross_validation(model_holiday, initial = 4*365,horizon = 365, units = 'days')

unique(df.cv_holiday$cutoff)%>%
  kbl(caption = "Table: Holidays model CV cutoff")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Holidays model CV cutoff
x
2004-02-02 12:00:00
2004-08-03 00:00:00
2005-02-01 12:00:00
2005-08-03 00:00:00
2006-02-01 12:00:00
2006-08-03 00:00:00
2007-02-01 12:00:00
2007-08-03 00:00:00
2008-02-01 12:00:00
2008-08-02 00:00:00
2009-01-31 12:00:00
2009-08-02 00:00:00
2010-01-31 12:00:00
2010-08-02 00:00:00
2011-01-31 12:00:00
2011-08-02 00:00:00
2012-01-31 12:00:00
2012-08-01 00:00:00
2013-01-30 12:00:00
2013-08-01 00:00:00
2014-01-30 12:00:00
2014-08-01 00:00:00
2015-01-30 12:00:00
2015-08-01 00:00:00
2016-01-30 12:00:00
2016-07-31 00:00:00

df.cv_holiday %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Home value")+
  scale_color_discrete(name = 'Cutoff')

performance_metrics(df.cv_holiday)%>%
  kbl(caption = "Table: Holidays model CV performance metrics")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Holidays model CV performance metrics
horizon	mse	rmse	mae	mape	mdape	smape	coverage
1380 hours	7085812	2661.919	1847.714	0.0127661	0.0088902	0.0128621	0.2016129
1392 hours	6830350	2613.494	1828.086	0.0125604	0.0088902	0.0126436	0.2258065
1404 hours	7929313	2815.904	2004.096	0.0137685	0.0089556	0.0138474	0.2258065
1416 hours	8326729	2885.607	2134.955	0.0147409	0.0099573	0.0148185	0.2580645
1428 hours	8925242	2987.514	2227.183	0.0154213	0.0099573	0.0155174	0.2661290
1440 hours	9782313	3127.669	2321.821	0.0161352	0.0109473	0.0162752	0.2661290
1452 hours	9628944	3103.054	2311.548	0.0160471	0.0109473	0.0161895	0.2580645
1464 hours	9701966	3114.798	2331.786	0.0161688	0.0121212	0.0163127	0.2580645
2100 hours	8151873	2855.149	2072.723	0.0142187	0.0109473	0.0143113	0.3225806
2124 hours	10653239	3263.930	2488.605	0.0170862	0.0125826	0.0171767	0.2500000
2136 hours	10717861	3273.815	2522.003	0.0172941	0.0137316	0.0173834	0.2661290
2148 hours	11607770	3407.018	2606.955	0.0180729	0.0128247	0.0182391	0.2903226
2160 hours	11352351	3369.325	2607.621	0.0180847	0.0137316	0.0182502	0.2661290
2172 hours	11216270	3349.070	2586.079	0.0179203	0.0128247	0.0180888	0.2580645
2184 hours	11792074	3433.959	2619.637	0.0181315	0.0128247	0.0183149	0.2580645
2208 hours	11505619	3391.993	2605.975	0.0179717	0.0128247	0.0181463	0.2500000
2844 hours	10081178	3175.087	2346.764	0.0160457	0.0128247	0.0161706	0.3225806
2856 hours	10113834	3180.225	2369.322	0.0161690	0.0128247	0.0162876	0.3145161
2868 hours	10650577	3263.522	2476.660	0.0169122	0.0128247	0.0170360	0.2903226
2880 hours	10977307	3313.202	2621.293	0.0180028	0.0151552	0.0181192	0.2419355
2892 hours	12358409	3515.453	2777.737	0.0191477	0.0151552	0.0193032	0.2338710
2904 hours	14357527	3789.133	2931.750	0.0202899	0.0172276	0.0205209	0.2258065
2916 hours	14253307	3775.355	2913.183	0.0201432	0.0150671	0.0203787	0.2258065
2928 hours	14524051	3811.043	2968.882	0.0204882	0.0172276	0.0207292	0.2258065
3564 hours	12279925	3504.272	2647.313	0.0180714	0.0150671	0.0182375	0.2903226
3588 hours	15215440	3900.697	3084.814	0.0211192	0.0181413	0.0212932	0.2177419
3600 hours	15354601	3918.495	3130.387	0.0214060	0.0182110	0.0215772	0.2096774
3612 hours	17190227	4146.110	3253.759	0.0224584	0.0182110	0.0227317	0.2258065
3624 hours	16918182	4113.172	3218.015	0.0222109	0.0187418	0.0224809	0.2580645
3636 hours	16781197	4096.486	3194.597	0.0220248	0.0152658	0.0223010	0.2580645
3648 hours	17269927	4155.710	3217.739	0.0221652	0.0156254	0.0224531	0.2258065
3672 hours	17037972	4127.708	3227.050	0.0221259	0.0156254	0.0224043	0.2258065
4308 hours	14889876	3858.740	2929.773	0.0199010	0.0156254	0.0201010	0.3225806
4332 hours	17594367	4194.564	3322.505	0.0226795	0.0203274	0.0228883	0.2500000
4344 hours	15710286	3963.620	3077.467	0.0209027	0.0172238	0.0210921	0.2903226
4356 hours	19101619	4370.540	3391.115	0.0233096	0.0187418	0.0236241	0.2741935
4368 hours	17969227	4239.013	3358.235	0.0230029	0.0222755	0.0232267	0.2661290
4380 hours	17227279	4150.576	3303.528	0.0225634	0.0222755	0.0227665	0.2580645
4392 hours	19851147	4455.463	3518.704	0.0241879	0.0175517	0.0245044	0.2258065
4416 hours	20588467	4537.452	3617.151	0.0247918	0.0234997	0.0251231	0.2258065
5016 hours	18019475	4244.935	3322.038	0.0225760	0.0175517	0.0228247	0.2903226
5040 hours	19584840	4425.476	3542.102	0.0240430	0.0234997	0.0242296	0.2741935
5052 hours	18301392	4278.013	3338.617	0.0225972	0.0197556	0.0227711	0.3225806
5064 hours	21050377	4588.069	3610.164	0.0247866	0.0234997	0.0250675	0.2903226
5076 hours	21696987	4658.003	3789.650	0.0259998	0.0259075	0.0262356	0.2419355
5088 hours	20711486	4550.987	3754.801	0.0255614	0.0259075	0.0257638	0.2258065
5100 hours	24205695	4919.928	3995.778	0.0273792	0.0259075	0.0277396	0.2258065
5124 hours	24254011	4924.836	4004.360	0.0274321	0.0259075	0.0277935	0.2258065
5760 hours	21769420	4665.771	3698.630	0.0251265	0.0197556	0.0254030	0.2903226
5772 hours	20421540	4519.020	3499.653	0.0237404	0.0197556	0.0239631	0.3440860
5784 hours	21911436	4680.965	3639.825	0.0247012	0.0191914	0.0249144	0.3306452
5796 hours	23601327	4858.120	3958.128	0.0269945	0.0259075	0.0272247	0.2500000
5808 hours	25180261	5017.994	4119.196	0.0281903	0.0223888	0.0284723	0.2258065
5820 hours	27535914	5247.467	4224.078	0.0290345	0.0223888	0.0294426	0.2258065
5832 hours	27644336	5257.788	4235.336	0.0290310	0.0223888	0.0294559	0.2258065
5844 hours	27326546	5227.480	4203.571	0.0288354	0.0209749	0.0292542	0.2258065
6480 hours	23858964	4884.564	3851.119	0.0261895	0.0205783	0.0264908	0.2661290
6504 hours	26782552	5175.186	4169.212	0.0282877	0.0248724	0.0285277	0.2473118
6516 hours	26994411	5195.615	4210.219	0.0285457	0.0260384	0.0287818	0.2500000
6528 hours	29529574	5434.112	4456.319	0.0305953	0.0224780	0.0310035	0.2016129
6540 hours	29242776	5407.659	4454.195	0.0306165	0.0260384	0.0310294	0.1935484
6552 hours	29074404	5392.069	4442.977	0.0303074	0.0260384	0.0307263	0.1935484
6564 hours	30811671	5550.826	4527.305	0.0308395	0.0260384	0.0313125	0.1935484
6588 hours	29796426	5458.610	4469.971	0.0303514	0.0252500	0.0307951	0.1935484
7224 hours	26723551	5169.483	4124.258	0.0278534	0.0252500	0.0281878	0.2580645
7236 hours	26469106	5144.814	4052.574	0.0273652	0.0252500	0.0276596	0.2795699
7248 hours	27369799	5231.615	4161.627	0.0281096	0.0224780	0.0284085	0.2661290
7260 hours	28774453	5364.182	4437.877	0.0301684	0.0295566	0.0304749	0.1854839
7272 hours	31675690	5628.116	4639.708	0.0316485	0.0273066	0.0320463	0.1612903
7284 hours	36247338	6020.576	4872.729	0.0333665	0.0295566	0.0339473	0.1612903
7296 hours	36729861	6060.517	4903.854	0.0334686	0.0295566	0.0340743	0.1612903
7308 hours	36430203	6035.744	4890.075	0.0333955	0.0289497	0.0339958	0.1612903
7944 hours	31867621	5645.141	4469.368	0.0302531	0.0252500	0.0306949	0.2258065
7968 hours	34731632	5893.355	4790.511	0.0323881	0.0301323	0.0327692	0.2150538
7980 hours	35127034	5926.806	4856.017	0.0328047	0.0319223	0.0331780	0.2096774
7992 hours	39436339	6279.836	5174.038	0.0353709	0.0289497	0.0359697	0.1693548
8004 hours	39089330	6252.146	5143.341	0.0352034	0.0319223	0.0357977	0.1612903
8016 hours	38828669	6231.265	5147.085	0.0349462	0.0319223	0.0355429	0.1612903
8028 hours	40680158	6378.100	5251.867	0.0356294	0.0319223	0.0362751	0.1612903
8052 hours	39581249	6291.363	5210.728	0.0352145	0.0319223	0.0358262	0.1612903
8688 hours	35425308	5951.916	4810.196	0.0323424	0.0319223	0.0328037	0.2258065
8712 hours	38865763	6234.241	5151.744	0.0347855	0.0349362	0.0352124	0.2150538
8724 hours	36237637	6019.770	4868.439	0.0327822	0.0264664	0.0331871	0.2338710
8736 hours	42574661	6524.926	5351.314	0.0365124	0.0349362	0.0371695	0.1774194
8748 hours	39604089	6293.178	5186.128	0.0352838	0.0332404	0.0357510	0.1935484
8760 hours	39307861	6269.598	5196.775	0.0350580	0.0332404	0.0355287	0.1935484

plot_cross_validation_metric(df.cv_holiday, metric = 'rmse')

It does not look very well. Now Let’s compare other models:

Model Comparision
Model 1: changepoint in trend model with freq=‘monthly’, changepoint=5;
Model 2: seasonality model with freq=‘daily’ and week/daily seasonality=TRUE;
Model 3: Additive seasonality model;
Model 4: Multiplicative seasonality model.

#mod1
mod1 = prophet(train,n.changepoints=5)
mod1_future = make_future_dataframe(mod1,periods = 53,freq = 'month')
mod1_forecast = predict(mod1,mod1_future)

df_cv_mod1 <- cross_validation(mod1, initial = 4*365,horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv_mod1) %>% 
  mutate(model = 'mod1')

#mod2  
mod2 = prophet(train, daily.seasonality = TRUE, weekly.seasonality = TRUE)
mod2_future = make_future_dataframe(mod2,periods = 1583)#freq=daily; 53 month ahead
mod2_forecast = predict(mod2,mod2_future)

df_cv_mod2 <- cross_validation(mod2, initial = 4*365,horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv_mod2) %>% 
  mutate(model = "mod2")
metrics1 %>% 
bind_rows(metrics2) %>% 
ggplot()+
geom_line(aes(horizon,rmse,color=model))

#mod3
mod3 = prophet(train,seasonality.mode='additive')
mod3_forecast = predict(mod3,future)
df_cv_mod3 <- cross_validation(mod3, initial = 4*365,horizon = 365, units = 'days')
metrics3 = performance_metrics(df_cv_mod3) %>% 
  mutate(model = 'mod3')

#mod4
mod4 = prophet(train,seasonality.mode='multiplicative')
forecast4 = predict(mod4,future)
df_cv_mod4 <- cross_validation(mod4, initial = 4*365,horizon = 365, units = 'days')
metrics4 = performance_metrics(df_cv_mod4) %>% 
  mutate(model = "mod4")
metrics3 %>% 
bind_rows(metrics4) %>% 
ggplot()+
geom_line(aes(horizon,rmse,color=model))

performance_metrics(df.cv_holiday)%>%
  kbl(caption = "Table: Model 1 CV performance metrics")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table: Model 1 CV performance metrics
horizon	mse	rmse	mae	mape	mdape	smape	coverage
1380 hours	7085812	2661.919	1847.714	0.0127661	0.0088902	0.0128621	0.2016129
1392 hours	6830350	2613.494	1828.086	0.0125604	0.0088902	0.0126436	0.2258065
1404 hours	7929313	2815.904	2004.096	0.0137685	0.0089556	0.0138474	0.2258065
1416 hours	8326729	2885.607	2134.955	0.0147409	0.0099573	0.0148185	0.2580645
1428 hours	8925242	2987.514	2227.183	0.0154213	0.0099573	0.0155174	0.2661290
1440 hours	9782313	3127.669	2321.821	0.0161352	0.0109473	0.0162752	0.2661290
1452 hours	9628944	3103.054	2311.548	0.0160471	0.0109473	0.0161895	0.2580645
1464 hours	9701966	3114.798	2331.786	0.0161688	0.0121212	0.0163127	0.2580645
2100 hours	8151873	2855.149	2072.723	0.0142187	0.0109473	0.0143113	0.3225806
2124 hours	10653239	3263.930	2488.605	0.0170862	0.0125826	0.0171767	0.2500000
2136 hours	10717861	3273.815	2522.003	0.0172941	0.0137316	0.0173834	0.2661290
2148 hours	11607770	3407.018	2606.955	0.0180729	0.0128247	0.0182391	0.2903226
2160 hours	11352351	3369.325	2607.621	0.0180847	0.0137316	0.0182502	0.2661290
2172 hours	11216270	3349.070	2586.079	0.0179203	0.0128247	0.0180888	0.2580645
2184 hours	11792074	3433.959	2619.637	0.0181315	0.0128247	0.0183149	0.2580645
2208 hours	11505619	3391.993	2605.975	0.0179717	0.0128247	0.0181463	0.2500000
2844 hours	10081178	3175.087	2346.764	0.0160457	0.0128247	0.0161706	0.3225806
2856 hours	10113834	3180.225	2369.322	0.0161690	0.0128247	0.0162876	0.3145161
2868 hours	10650577	3263.522	2476.660	0.0169122	0.0128247	0.0170360	0.2903226
2880 hours	10977307	3313.202	2621.293	0.0180028	0.0151552	0.0181192	0.2419355
2892 hours	12358409	3515.453	2777.737	0.0191477	0.0151552	0.0193032	0.2338710
2904 hours	14357527	3789.133	2931.750	0.0202899	0.0172276	0.0205209	0.2258065
2916 hours	14253307	3775.355	2913.183	0.0201432	0.0150671	0.0203787	0.2258065
2928 hours	14524051	3811.043	2968.882	0.0204882	0.0172276	0.0207292	0.2258065
3564 hours	12279925	3504.272	2647.313	0.0180714	0.0150671	0.0182375	0.2903226
3588 hours	15215440	3900.697	3084.814	0.0211192	0.0181413	0.0212932	0.2177419
3600 hours	15354601	3918.495	3130.387	0.0214060	0.0182110	0.0215772	0.2096774
3612 hours	17190227	4146.110	3253.759	0.0224584	0.0182110	0.0227317	0.2258065
3624 hours	16918182	4113.172	3218.015	0.0222109	0.0187418	0.0224809	0.2580645
3636 hours	16781197	4096.486	3194.597	0.0220248	0.0152658	0.0223010	0.2580645
3648 hours	17269927	4155.710	3217.739	0.0221652	0.0156254	0.0224531	0.2258065
3672 hours	17037972	4127.708	3227.050	0.0221259	0.0156254	0.0224043	0.2258065
4308 hours	14889876	3858.740	2929.773	0.0199010	0.0156254	0.0201010	0.3225806
4332 hours	17594367	4194.564	3322.505	0.0226795	0.0203274	0.0228883	0.2500000
4344 hours	15710286	3963.620	3077.467	0.0209027	0.0172238	0.0210921	0.2903226
4356 hours	19101619	4370.540	3391.115	0.0233096	0.0187418	0.0236241	0.2741935
4368 hours	17969227	4239.013	3358.235	0.0230029	0.0222755	0.0232267	0.2661290
4380 hours	17227279	4150.576	3303.528	0.0225634	0.0222755	0.0227665	0.2580645
4392 hours	19851147	4455.463	3518.704	0.0241879	0.0175517	0.0245044	0.2258065
4416 hours	20588467	4537.452	3617.151	0.0247918	0.0234997	0.0251231	0.2258065
5016 hours	18019475	4244.935	3322.038	0.0225760	0.0175517	0.0228247	0.2903226
5040 hours	19584840	4425.476	3542.102	0.0240430	0.0234997	0.0242296	0.2741935
5052 hours	18301392	4278.013	3338.617	0.0225972	0.0197556	0.0227711	0.3225806
5064 hours	21050377	4588.069	3610.164	0.0247866	0.0234997	0.0250675	0.2903226
5076 hours	21696987	4658.003	3789.650	0.0259998	0.0259075	0.0262356	0.2419355
5088 hours	20711486	4550.987	3754.801	0.0255614	0.0259075	0.0257638	0.2258065
5100 hours	24205695	4919.928	3995.778	0.0273792	0.0259075	0.0277396	0.2258065
5124 hours	24254011	4924.836	4004.360	0.0274321	0.0259075	0.0277935	0.2258065
5760 hours	21769420	4665.771	3698.630	0.0251265	0.0197556	0.0254030	0.2903226
5772 hours	20421540	4519.020	3499.653	0.0237404	0.0197556	0.0239631	0.3440860
5784 hours	21911436	4680.965	3639.825	0.0247012	0.0191914	0.0249144	0.3306452
5796 hours	23601327	4858.120	3958.128	0.0269945	0.0259075	0.0272247	0.2500000
5808 hours	25180261	5017.994	4119.196	0.0281903	0.0223888	0.0284723	0.2258065
5820 hours	27535914	5247.467	4224.078	0.0290345	0.0223888	0.0294426	0.2258065
5832 hours	27644336	5257.788	4235.336	0.0290310	0.0223888	0.0294559	0.2258065
5844 hours	27326546	5227.480	4203.571	0.0288354	0.0209749	0.0292542	0.2258065
6480 hours	23858964	4884.564	3851.119	0.0261895	0.0205783	0.0264908	0.2661290
6504 hours	26782552	5175.186	4169.212	0.0282877	0.0248724	0.0285277	0.2473118
6516 hours	26994411	5195.615	4210.219	0.0285457	0.0260384	0.0287818	0.2500000
6528 hours	29529574	5434.112	4456.319	0.0305953	0.0224780	0.0310035	0.2016129
6540 hours	29242776	5407.659	4454.195	0.0306165	0.0260384	0.0310294	0.1935484
6552 hours	29074404	5392.069	4442.977	0.0303074	0.0260384	0.0307263	0.1935484
6564 hours	30811671	5550.826	4527.305	0.0308395	0.0260384	0.0313125	0.1935484
6588 hours	29796426	5458.610	4469.971	0.0303514	0.0252500	0.0307951	0.1935484
7224 hours	26723551	5169.483	4124.258	0.0278534	0.0252500	0.0281878	0.2580645
7236 hours	26469106	5144.814	4052.574	0.0273652	0.0252500	0.0276596	0.2795699
7248 hours	27369799	5231.615	4161.627	0.0281096	0.0224780	0.0284085	0.2661290
7260 hours	28774453	5364.182	4437.877	0.0301684	0.0295566	0.0304749	0.1854839
7272 hours	31675690	5628.116	4639.708	0.0316485	0.0273066	0.0320463	0.1612903
7284 hours	36247338	6020.576	4872.729	0.0333665	0.0295566	0.0339473	0.1612903
7296 hours	36729861	6060.517	4903.854	0.0334686	0.0295566	0.0340743	0.1612903
7308 hours	36430203	6035.744	4890.075	0.0333955	0.0289497	0.0339958	0.1612903
7944 hours	31867621	5645.141	4469.368	0.0302531	0.0252500	0.0306949	0.2258065
7968 hours	34731632	5893.355	4790.511	0.0323881	0.0301323	0.0327692	0.2150538
7980 hours	35127034	5926.806	4856.017	0.0328047	0.0319223	0.0331780	0.2096774
7992 hours	39436339	6279.836	5174.038	0.0353709	0.0289497	0.0359697	0.1693548
8004 hours	39089330	6252.146	5143.341	0.0352034	0.0319223	0.0357977	0.1612903
8016 hours	38828669	6231.265	5147.085	0.0349462	0.0319223	0.0355429	0.1612903
8028 hours	40680158	6378.100	5251.867	0.0356294	0.0319223	0.0362751	0.1612903
8052 hours	39581249	6291.363	5210.728	0.0352145	0.0319223	0.0358262	0.1612903
8688 hours	35425308	5951.916	4810.196	0.0323424	0.0319223	0.0328037	0.2258065
8712 hours	38865763	6234.241	5151.744	0.0347855	0.0349362	0.0352124	0.2150538
8724 hours	36237637	6019.770	4868.439	0.0327822	0.0264664	0.0331871	0.2338710
8736 hours	42574661	6524.926	5351.314	0.0365124	0.0349362	0.0371695	0.1774194
8748 hours	39604089	6293.178	5186.128	0.0352838	0.0332404	0.0357510	0.1935484
8760 hours	39307861	6269.598	5196.775	0.0350580	0.0332404	0.0355287	0.1935484

plot_cross_validation_metric(df_cv_mod1, metric = 'rmse')

Cross-validation results for monthly data is not very reasonable. Maybe because we set the CV freq to ‘Daily’. All models seems to fit pretty well and have similar RMSE as well as other performing metrics.