Property Price Forecasting

1. Introduction

Buying a property involves a substantial financial commitment. To invest in the right property, it is important to understand the factors that drive its price and, more importantly, what its forecast price will be in the future. Many agencies and individual agents are in the business of buying and selling property. One of the biggest players in the real estate industry is Zillow, which deals in buying, selling, and renting homes.

1.1 Zillow’s Business Model

Zillow Group, Inc., or simply Zillow, is an American online real estate marketplace company founded in 2006 by Rich Barton and Lloyd Frink. Its services include advertising for agents and real estate professionals; a home-buying (flipping) business, which the company has announced it will shut down, selling off the on-hand inventory after incurring heavy losses; and a marketplace to advertise homes for sale and rent.

Zillow also lists properties that are not on the market and provides estimates of their value, which it commonly refers to as the Zestimate.

1.2 Zestimate Methodology/Assumptions

The ZHVI will now be calculated using a new set of assumptions:

  1. The average Zestimate within some range of home values determines the index level.
  2. Monthly changes in the index are now calculated using a weighted mean of the appreciation of individual homes.
  3. For any geography or cut, index appreciation can now be interpreted as the market’s total appreciation.

The above assumptions are used to estimate property prices, which are then provided to users of the Zillow website and app. It is also important for investors and buyers to be aware of past trends as well as future projections to get a better feel for property prices. This project addresses both: the data currently on hand gives us the past trends, and our further analysis will provide the future projections.
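
To make assumption 2 concrete, below is a minimal R sketch of a weighted-mean appreciation calculation on hypothetical Zestimate data (the column names and figures are illustrative only, not Zillow's actual method):

# Illustrative only: monthly index appreciation as a weighted mean of
# individual home appreciation, weighted by prior-month value.
library(dplyr)

homes <- tibble::tibble(
  zestimate_prev = c(250000, 410000, 180000),  # hypothetical prior-month Zestimates
  zestimate_curr = c(255000, 418200, 181800)   # hypothetical current-month Zestimates
)

homes %>%
  mutate(appreciation = zestimate_curr / zestimate_prev - 1) %>%
  summarise(index_change = weighted.mean(appreciation, w = zestimate_prev))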

1.3 Data Available and Future Methodology

Currently we have ZHVI data available at the zip-code level across the United States from January 2000 to November 2021. Using this data, I will try to forecast prices at least 2 years into the future. For this project, I will be using the 4-bedroom time series data for the region around Tampa Bay.

To restrict the analysis to a specific area, I will choose a zip code in the area around Tampa, FL. As initial steps, I will focus on data cleaning and data manipulation.

2. Data Source

The data set has been sourced from Zillow's ZHVI research data page: https://www.zillow.com/research/data/.

3. Packages Required

library(fpp3)
library(tidyverse)
library(knitr)
library(lubridate)
library(patchwork)
library(maps)
library(forecast)
library(tseries)
library(sarima)
library(kableExtra)
library(reshape2)
library(dplyr)
library(prophet)

Table 1: Package Summaries

Package Version Summary
tidyverse 1.3.1 A compilation of R tools for data manipulation and visualization
knitr 1.36 Dynamic report generation in R, especially for displaying tables
lubridate 1.8.0 A package that makes handling dates easier
patchwork 1.1.1 A package for combining multiple ggplot2 plots into a single figure
maps 3.4.0 Used to prepare and display geographical maps
forecast 8.15 Forecasting functions for time series modeling
tseries 0.10-49 Time series analysis
sarima 0.8.5 Simulation and prediction with seasonal ARIMA models
kableExtra 1.3.4 Construct complex tables with 'kable' and pipe syntax
reshape2 1.4.4 Flexibly reshape data

4. Data Preparation

In this section, our data preparation process is described. This includes importing the data, handling missing values, and refactoring several of the variables. Each variable will also be briefly described, including the data gathering steps.

4.1 Variables

Table 2: Variable Summary

Column Name Description
RegionID Unique ID for the region
SizeRank Ranking based on the size of the region
RegionName The zip code of the region
RegionType Type of region; always "Zip" here
StateName State abbreviation
City The city name for the housing data
Metro The metro area around the region
CountyName The county name for the region
Month columns Monthly prices for the region since January 2000

4.2 Data Import

First, we need to import the data.

# Read in Zip_zhvi_4bdrm.csv (ZHVI, 4-bedroom homes, by zip code)
Four_Bedroom<-read.csv("Zip_zhvi_4bdrm.csv",header = TRUE, stringsAsFactors = FALSE)

# Check the structure of the data
str(Four_Bedroom, list.len=12)
## 'data.frame':    26619 obs. of  273 variables:
##  $ RegionID   : int  61639 84654 61637 91982 84616 91940 61616 91733 93144 84640 ...
##  $ SizeRank   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ RegionName : int  10025 60657 10023 77494 60614 77449 10002 77084 79936 60640 ...
##  $ RegionType : chr  "Zip" "Zip" "Zip" "Zip" ...
##  $ StateName  : chr  "NY" "IL" "NY" "TX" ...
##  $ State      : chr  "NY" "IL" "NY" "TX" ...
##  $ City       : chr  "New York" "Chicago" "New York" "Katy" ...
##  $ Metro      : chr  "New York-Newark-Jersey City" "Chicago-Naperville-Elgin" "New York-Newark-Jersey City" "Houston-The Woodlands-Sugar Land" ...
##  $ CountyName : chr  "New York County" "Cook County" "New York County" "Harris County" ...
##  $ X2000.01.31: num  1616896 520117 NA 220137 630771 ...
##  $ X2000.02.29: num  1561368 521933 1564476 220152 633028 ...
##  $ X2000.03.31: num  1531088 525219 1564714 221038 635739 ...
##   [list output truncated]

It looks like the ZHVI dataset was imported correctly. However, there are clear issues that require correction during data preparation. For example, the monthly value dates appear as individual variables in wide format; these need to be tidied into long format. In addition, a character "X" has been prepended to each date column name and needs to be removed.

5. Data Cleaning and Transformation

Every date column in the imported data has been prefixed with an "X"; we need to remove it to bring the dates back to their standard format.

names(Four_Bedroom) <- sub("^X", "", names(Four_Bedroom))
head(Four_Bedroom[,c(1:11)],10)
##    RegionID SizeRank RegionName RegionType StateName State     City
## 1     61639        0      10025        Zip        NY    NY New York
## 2     84654        1      60657        Zip        IL    IL  Chicago
## 3     61637        2      10023        Zip        NY    NY New York
## 4     91982        3      77494        Zip        TX    TX     Katy
## 5     84616        4      60614        Zip        IL    IL  Chicago
## 6     91940        5      77449        Zip        TX    TX     Katy
## 7     61616        6      10002        Zip        NY    NY New York
## 8     91733        7      77084        Zip        TX    TX  Houston
## 9     93144        8      79936        Zip        TX    TX  El Paso
## 10    84640        9      60640        Zip        IL    IL  Chicago
##                               Metro      CountyName 2000.01.31 2000.02.29
## 1       New York-Newark-Jersey City New York County    1616896    1561368
## 2          Chicago-Naperville-Elgin     Cook County     520117     521933
## 3       New York-Newark-Jersey City New York County         NA    1564476
## 4  Houston-The Woodlands-Sugar Land   Harris County     220137     220152
## 5          Chicago-Naperville-Elgin     Cook County     630771     633028
## 6  Houston-The Woodlands-Sugar Land   Harris County     123383     123367
## 7       New York-Newark-Jersey City New York County         NA         NA
## 8  Houston-The Woodlands-Sugar Land   Harris County     128259     128127
## 9                           El Paso  El Paso County     115560     115606
## 10         Chicago-Naperville-Elgin     Cook County     427818     431305

The data exists in wide format. We first need to convert it into long format so that further analysis of the prices can be performed conveniently and appropriately.

Four_Bedroom_long<-Four_Bedroom %>% 
  pivot_longer(-c(RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName),
    names_to = "Monthly",
    values_to = "Price"
  ) 
str(Four_Bedroom_long)
## tibble [7,027,416 × 11] (S3: tbl_df/tbl/data.frame)
##  $ RegionID  : int [1:7027416] 61639 61639 61639 61639 61639 61639 61639 61639 61639 61639 ...
##  $ SizeRank  : int [1:7027416] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RegionName: int [1:7027416] 10025 10025 10025 10025 10025 10025 10025 10025 10025 10025 ...
##  $ RegionType: chr [1:7027416] "Zip" "Zip" "Zip" "Zip" ...
##  $ StateName : chr [1:7027416] "NY" "NY" "NY" "NY" ...
##  $ State     : chr [1:7027416] "NY" "NY" "NY" "NY" ...
##  $ City      : chr [1:7027416] "New York" "New York" "New York" "New York" ...
##  $ Metro     : chr [1:7027416] "New York-Newark-Jersey City" "New York-Newark-Jersey City" "New York-Newark-Jersey City" "New York-Newark-Jersey City" ...
##  $ CountyName: chr [1:7027416] "New York County" "New York County" "New York County" "New York County" ...
##  $ Monthly   : chr [1:7027416] "2000.01.31" "2000.02.29" "2000.03.31" "2000.04.30" ...
##  $ Price     : num [1:7027416] 1616896 1561368 1531088 1508281 1533773 ...

5.1 Date Conversion

Four_Bedroom_Clean1 <- Four_Bedroom_long %>%
            mutate(Monthly_parsed = as.Date(Monthly,"%Y.%m.%d"))
# Ensure the Monthly column is stored as character (it already is, since stringsAsFactors = FALSE)
Four_Bedroom_long[["Monthly"]]<- as.character(Four_Bedroom_long$Monthly)
summary(Four_Bedroom_long)
##     RegionID         SizeRank       RegionName     RegionType       
##  Min.   : 58196   Min.   :    0   Min.   : 1001   Length:7027416    
##  1st Qu.: 68437   1st Qu.: 6734   1st Qu.:25039   Class :character  
##  Median : 78622   Median :13555   Median :47382   Mode  :character  
##  Mean   : 80140   Mean   :14184   Mean   :47959                     
##  3rd Qu.: 88570   3rd Qu.:21109   3rd Qu.:70085                     
##  Max.   :753844   Max.   :34430   Max.   :99901                     
##                                                                     
##   StateName            State               City              Metro          
##  Length:7027416     Length:7027416     Length:7027416     Length:7027416    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   CountyName          Monthly              Price         
##  Length:7027416     Length:7027416     Min.   :    9043  
##  Class :character   Class :character   1st Qu.:  136178  
##  Mode  :character   Mode  :character   Median :  202624  
##                                        Mean   :  279140  
##                                        3rd Qu.:  317109  
##                                        Max.   :10006952  
##                                        NA's   :1580283

The summary above shows an issue with the Price column: it contains 1,580,283 NA values.

Before taking any action on these NAs, we first filter to the zip code we want to analyze and then continue with the data cleaning. For this analysis, we filter on a zip code in Brandon, FL.
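
As an aside, since fpp3 is loaded, the same long data could also be indexed by year-month as a tsibble; a sketch (not used in the analysis below):

# Index the cleaned long data by year-month, one series per RegionID
Four_Bedroom_tsbl <- Four_Bedroom_Clean1 %>%
  mutate(Month = yearmonth(Monthly_parsed)) %>%
  as_tsibble(index = Month, key = RegionID)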

6. Time Series Data Selection

For time series modelling, I will focus on a specific region in the US to understand how housing prices have varied over time in that region. This will also help us analyze the factors contributing to the price changes there.

For this case study, I have chosen the city of Brandon, near Tampa. The main reason for choosing this city is that I have visited it multiple times over the years and have observed some drastic changes in the region in terms of development and population growth.

So, let’s begin with our analysis.

6.1 Filtering of Targeted Zipcode

Four_Bedroom_Tampa <- Four_Bedroom_Clean1 %>%
  filter(RegionName %in% c(33511))
str(Four_Bedroom_Tampa)
## tibble [264 × 12] (S3: tbl_df/tbl/data.frame)
##  $ RegionID      : int [1:264] 72653 72653 72653 72653 72653 72653 72653 72653 72653 72653 ...
##  $ SizeRank      : int [1:264] 476 476 476 476 476 476 476 476 476 476 ...
##  $ RegionName    : int [1:264] 33511 33511 33511 33511 33511 33511 33511 33511 33511 33511 ...
##  $ RegionType    : chr [1:264] "Zip" "Zip" "Zip" "Zip" ...
##  $ StateName     : chr [1:264] "FL" "FL" "FL" "FL" ...
##  $ State         : chr [1:264] "FL" "FL" "FL" "FL" ...
##  $ City          : chr [1:264] "Brandon" "Brandon" "Brandon" "Brandon" ...
##  $ Metro         : chr [1:264] "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" "Tampa-St. Petersburg-Clearwater" ...
##  $ CountyName    : chr [1:264] "Hillsborough County" "Hillsborough County" "Hillsborough County" "Hillsborough County" ...
##  $ Monthly       : chr [1:264] "2000.01.31" "2000.02.29" "2000.03.31" "2000.04.30" ...
##  $ Price         : num [1:264] 139153 139852 139941 140537 139994 ...
##  $ Monthly_parsed: Date[1:264], format: "2000-01-31" "2000-02-29" ...

6.2 Data Exploration Check

colSums(is.na(Four_Bedroom_Tampa))  # To find the number of NA's in the data frame.
##       RegionID       SizeRank     RegionName     RegionType      StateName 
##              0              0              0              0              0 
##          State           City          Metro     CountyName        Monthly 
##              0              0              0              0              0 
##          Price Monthly_parsed 
##              0              0
# Tabulate the identifier columns to verify the filter returned a single, consistent region.


table(Four_Bedroom_Tampa$RegionID)
## 
## 72653 
##   264
table(Four_Bedroom_Tampa$SizeRank)
## 
## 476 
## 264
table(Four_Bedroom_Tampa$RegionName)
## 
## 33511 
##   264
table(Four_Bedroom_Tampa$StateName)
## 
##  FL 
## 264
table(Four_Bedroom_Tampa$State)
## 
##  FL 
## 264
table(Four_Bedroom_Tampa$City)
## 
## Brandon 
##     264
table(Four_Bedroom_Tampa$CountyName)
## 
## Hillsborough County 
##                 264

6.3 Time Series Data Frame Selection

zill_ts_df <- Four_Bedroom_Tampa[,c('Monthly_parsed','Price')]
sum(is.na(zill_ts_df))
## [1] 0

7. Analysing through Plots

7.1 Line Chart

ggplot(data = Four_Bedroom_Tampa ,aes(x = Monthly_parsed, y=Price/1000)) +
  geom_line(color = "red") +
  xlab("Year") +
  ylab ("Price in K")+
  ggtitle("House Prices over time in Brandon Region of Tampa Bay") +
  scale_x_date(date_labels = "%Y", breaks = "2 year") +
  theme(axis.text.x = element_text(angle = 90,hjust = 1))

7.2 Box Plot

boxplot(Four_Bedroom_Tampa$Price)

The boxplot above suggests a few outliers in Price, but none extreme enough to warrant removal from the data set.
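
To quantify this, a standard 1.5 × IQR check can count the points beyond the whiskers; a quick sketch:

# Count prices outside the 1.5 * IQR whiskers of the boxplot
q <- quantile(Four_Bedroom_Tampa$Price, c(0.25, 0.75))
iqr <- diff(q)
sum(Four_Bedroom_Tampa$Price < q[1] - 1.5 * iqr |
    Four_Bedroom_Tampa$Price > q[2] + 1.5 * iqr)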

8. Modelling

8.1 Regression Modelling

model<-lm(RegionName ~ Price, data = Four_Bedroom_Tampa)
summary(model)
## 
## Call:
## lm(formula = RegionName ~ Price, data = Four_Bedroom_Tampa)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.284e-09 -1.150e-12  6.320e-12  1.103e-11  1.608e-11 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 3.351e+04  2.134e-11 1.571e+15   <2e-16 ***
## Price       1.452e-16  9.584e-17 1.515e+00    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.984e-11 on 262 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4998 
## F-statistic: 263.7 on 1 and 262 DF,  p-value: < 2.2e-16

This regression is degenerate: RegionName is the constant zip code 33511, so the fit carries no real information (note the essentially zero residuals). We turn instead to time series methods, calculating and plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) for Brandon. These help determine the parameters required to fit an ARIMA model to the time series data.

Assess the behavior/data-generating process of your time-series using the techniques discussed in class. From your understanding of the process, would you expect the time-series to be mean stationary? Variance stationary? Why or why not?

8.2 Auto Correlation Plot with Lags

We analyze the autocorrelation of the time series at increasing lags.

8.3 ACF and PACF Plots

# Monthly series starting January 2000
ts1 <- ts(Four_Bedroom_Tampa$Price, start = c(2000, 1), frequency = 12)
par(mfrow=c(1,2))
acf(ts1, lag.max = 50)
pacf(ts1,lag.max = 10)

Answer: The ACF plot shows a slow, dampening decay, and the PACF cuts off after the first lag, suggesting an AR(1)-type process. Based on this, the data-generating process appears non-stationary: I would expect the series to be mean non-stationary and variance stationary.
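
As a compact alternative view, forecast::ggtsdisplay() draws the series together with its ACF and PACF in a single panel; a sketch using the ts1 object above:

# Series, ACF, and PACF in one display
forecast::ggtsdisplay(ts1, lag.max = 50)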

8.4 Density Plots

lag_df = Four_Bedroom_Tampa[,c("Monthly_parsed","Price")] %>%
  mutate(lag_price = lag(Price)) %>%
  drop_na()

lag_cor = cor(lag_df$Price, lag_df$lag_price)

ts_plot = ggplot(lag_df)+
    geom_line(aes(Monthly_parsed,Price))+
    theme_bw()+
    ggtitle(paste0("Correlation = ", lag_cor))

dens_plot = ggplot(lag_df)+
    geom_density(aes(Price),alpha=0.4,fill='blue')+ 
    geom_density(aes(lag_price),alpha=0.4,fill='red')+
    theme_bw()

ts_plot / dens_plot  # display both plots, stacked with patchwork

These plots indicate that the time series is not mean reverting and that its trend is roughly additive in nature.

Four_Bedroom_Tampa <- Four_Bedroom_Tampa %>%
  mutate(lag_price = lag(Price),
         lag_price2 = lag(Price, 2)) %>%
  drop_na()

cor(Four_Bedroom_Tampa$Price,Four_Bedroom_Tampa$lag_price)
## [1] 0.9987113
cor(Four_Bedroom_Tampa$Price,Four_Bedroom_Tampa$lag_price2)
## [1] 0.994927

The correlation values above show that the current price is more closely related to the price one period prior than to the price two periods prior.

9. Timeseries Analysis (Subsetting)

Let's analyze the last 10 years of data, rather than the full 20, to get a better picture of the recent trend.

Four_Bedroom_Tampa %>%
  filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd('2021-12-31')) %>%
  ggplot(aes(x = Monthly_parsed, y=Price/1000)) +
  geom_line(color = "red") +
  xlab("Year") +
  ylab ("Price in K")+
  ggtitle("House Prices over time in Brandon Region of Tampa Bay", subtitle = "(2012-2021)") +
  scale_x_date(date_labels = "%Y", breaks = "1 year") +
  theme(axis.text.x = element_text(angle = 90,hjust = 1))

The house price series is not mean reverting: it looks mean non-stationary and additive in nature. The variance appears stationary from 2012 through 2020, after which it increases.

9.1 Stationary Vs Non-Stationary

Four_Bedroom_Tampa %>%
  arrange(Monthly_parsed) %>%
  filter(Monthly_parsed>ymd("2012-01-31")) %>%
  mutate(return = (Price - lag(Price))/lag(Price)) %>%
  drop_na() %>%
  ggplot()+
    geom_line(aes(Monthly_parsed,return))+
    theme_bw()+
    ggtitle("Monthly Change in House Price Over Time", subtitle = "(2012-2021)")+
    ylab("Montlhly House Price")+
    xlab("Date")

Looking at the plot above, the house price level over time is non-stationary, but the monthly growth in house prices is roughly stationary if we set aside the pandemic period, during which the monthly change also appears non-stationary.

10. ARIMA Modelling

model1<- arima(Four_Bedroom_Tampa$Price, order = c(1,0,0), method = "ML")
model2<- arima(Four_Bedroom_Tampa$Price, order = c(2,0,0), method = "ML")
summary(model1)
## 
## Call:
## arima(x = Four_Bedroom_Tampa$Price, order = c(1, 0, 0), method = "ML")
## 
## Coefficients:
##          ar1  intercept
##       0.9997   222568.3
## s.e.  0.0004   136423.5
## 
## sigma^2 estimated as 8217066:  log likelihood = -2461.28,  aic = 4928.55
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE      MAPE      MASE      ACF1
## Training set 926.3237 2866.542 2048.481 0.3726319 0.8847368 0.9996889 0.9369554
summary(model2)
## 
## Call:
## arima(x = Four_Bedroom_Tampa$Price, order = c(2, 0, 0), method = "ML")
## 
## Coefficients:
##          ar1      ar2  intercept
##       1.9634  -0.9641   299628.9
## s.e.  0.0181   0.0183   131921.0
## 
## sigma^2 estimated as 677107:  log likelihood = -2132.35,  aic = 4272.69
## 
## Training set error measures:
##                      ME     RMSE      MAE          MPE      MAPE      MASE
## Training set -0.7191751 825.8193 586.1337 -0.003108578 0.2631556 0.2860418
##                  ACF1
## Training set 0.105912

Model 2 has a substantially lower AIC than Model 1, so Model 2 is preferred.

10.1 BIC and AIC Calculations

print(BIC(model1,model2))
##        df      BIC
## model1  3 4939.258
## model2  4 4286.964

BIC also suggests that Model 2 is better than Model 1.
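
For completeness, the corresponding AIC comparison works the same way (stats::AIC accepts multiple arima fits):

print(AIC(model1, model2))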

10.2 Auto ARIMA

aa = Four_Bedroom_Tampa[,c("Monthly_parsed","Price")] %>%
  arrange(Monthly_parsed) %>%
  filter(Monthly_parsed>ymd("2012-01-31")) %>%
  mutate(return = (Price - lag(Price))/lag(Price)) %>%
  drop_na()

forecast::auto.arima(aa$return)
## Series: aa$return 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 1.087e-05:  log likelihood=502.63
## AIC=-1003.26   AICc=-1003.23   BIC=-1000.5

10.3 Rolling Mean Check

zill_data_roll <- Four_Bedroom_Tampa %>%
 filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
  mutate(
    price_mean = zoo::rollmean(
      Price, 
      k = 6, 
      fill = NA),
    price_sd = zoo::rollapply(
      Price, 
      FUN = sd, 
      width = 6, 
      fill = NA)
  )
zill_data_rollmean <- zill_data_roll %>%
  ggplot() +
  geom_line(aes(Monthly_parsed, price_mean)) +
  theme_bw() +
  ggtitle("House Price Mean over Time (6 month rolling window)") +
  ylab("Housing Price") +
  xlab("Time")

zill_data_rollsd <- zill_data_roll %>%
  ggplot() +
  geom_line(aes(Monthly_parsed, price_sd)) +
  theme_bw() +
  ggtitle("Housing Price SD over Time (6 month rolling window)") +
  ylab("Housing Price SD") +
  xlab("Time")

zill_data_rollmean / zill_data_rollsd  # stack the two ggplots with patchwork

If the data appear to be variance non-stationary, transform your time-series using a natural log or Box-Cox transformation.

11. Log and Box-Cox Transformations

Now let's try some transformations of the data and see whether they provide better results.

zill_data_log = Four_Bedroom_Tampa %>%
  filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
  mutate(price_log = log1p(Price))


zill_data_boxcox = Four_Bedroom_Tampa %>%
  filter(Monthly_parsed>=ymd('2012-01-31') & Monthly_parsed<=ymd("2021-12-31")) %>%
  mutate(price_boxcox = forecast::BoxCox(Price,lambda='auto'))

log_plot <- zill_data_log %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_log))+
      theme_bw()+
      ggtitle("House Prices over Time (Log)")+
      ylab("Housing Price (Log)")+
      xlab("Time")

boxcox_plot <- zill_data_boxcox %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_boxcox))+
      theme_bw()+
      ggtitle("Housing Price over Time (Box-Cox Transformation)")+
      ylab("Housing Price (Box-Cox Transformation)")+
      xlab("Time")

log_plot / boxcox_plot  # stack the two ggplots with patchwork

These transformations do not change the picture much, since the original series already appears variance stationary.
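
If one wants to inspect the λ chosen by the automatic search, forecast::BoxCox.lambda() returns it directly; a sketch:

# Lambda selected for the Brandon price series (Guerrero method by default)
forecast::BoxCox.lambda(Four_Bedroom_Tampa$Price)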

11.1 First Difference Plot (Checking Mean Stationarity)

zill_data_diff = Four_Bedroom_Tampa %>%
  mutate(price_diff = Price - lag(Price))

zill_data_diff %>% 
  select(Monthly_parsed,Price,price_diff) %>% 
  head()
## # A tibble: 6 × 3
##   Monthly_parsed  Price price_diff
##   <date>          <dbl>      <dbl>
## 1 2000-03-31     139941         NA
## 2 2000-04-30     140537        596
## 3 2000-05-31     139994       -543
## 4 2000-06-30     139775       -219
## 5 2000-07-31     139821         46
## 6 2000-08-31     140500        679
zill_data_diff %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_diff))+
      theme_bw()+
      ggtitle("Housing Price (First Difference)")+
      ylab("Housing Price (Difference))")+
      xlab("Time")
## Warning: Removed 1 row(s) containing missing values (geom_path).

11.2 First Difference of Log Values

zill_data_diff = Four_Bedroom_Tampa %>%
  mutate(price_log = log1p(Price)) %>% 
  mutate(price_diff = price_log - lag(price_log))

zill_data_diff %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_diff))+
      theme_bw()+
      ggtitle("Housing Price (Log; First Difference)")+
      ylab("Log Housing Price (Difference))")+
      xlab("Time")
## Warning: Removed 1 row(s) containing missing values (geom_path).

11.3 First Difference and Log First Difference, 2012-2021

# First Price Difference
zill_data_diff = Four_Bedroom_Tampa %>%
  filter(Monthly_parsed>=ymd("2012-01-31") & Monthly_parsed <= ymd("2021-12-31")) %>%
  mutate(price_diff = Price - lag(Price)) %>%
drop_na()

zill_data_diff %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_diff))+
      theme_bw()+
      ggtitle("Housing Price (First Difference)")+
      ylab("Housing Price (Difference))")+
      xlab("Time")

# First Price Difference in Log

zill_data_log_diff = Four_Bedroom_Tampa %>%
  filter(Monthly_parsed>=ymd("2012-01-31") & Monthly_parsed <= ymd("2021-12-31")) %>%
  mutate(price_log = log1p(Price)) %>% 
  mutate(price_diff = price_log - lag(price_log)) %>%
  drop_na()

zill_data_log_diff %>%
  ggplot()+
      geom_line(aes(Monthly_parsed,price_diff))+
      theme_bw()+
      ggtitle("Housing Price (Log; First Difference)")+
      ylab("Log Housing Price (Difference))")+
      xlab("Time")

Conduct and interpret an Augmented Dickey Fuller and KPSS test for stationarity. Do the results suggest that the process is mean stationary? If the process is non-stationary, calculate the first difference of the data.

11.4 Augmented Dickey-Fuller Test

library(tseries) # Use tseries package for adf.test function

adf.test(zill_data_diff$Price) # Raw Price Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zill_data_diff$Price
## Dickey-Fuller = -2.8206, Lag order = 4, p-value = 0.2358
## alternative hypothesis: stationary
adf.test(zill_data_diff$price_diff) # First Difference
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zill_data_diff$price_diff
## Dickey-Fuller = -1.3887, Lag order = 4, p-value = 0.8304
## alternative hypothesis: stationary
adf.test(zill_data_log_diff$price_log) # Log Price Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zill_data_log_diff$price_log
## Dickey-Fuller = -3.3303, Lag order = 4, p-value = 0.06955
## alternative hypothesis: stationary
adf.test(zill_data_log_diff$price_diff) # Differenced Log Price Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zill_data_log_diff$price_diff
## Dickey-Fuller = -1.9341, Lag order = 4, p-value = 0.6039
## alternative hypothesis: stationary
  • ADF, raw price: p-value > 0.05, so we cannot reject the null hypothesis; the series is non-stationary.
  • ADF, first difference: p-value > 0.05, so we cannot reject the null hypothesis; the series is non-stationary.
  • ADF, log price: p-value > 0.05, so we cannot reject the null hypothesis; the series is non-stationary.
  • ADF, differenced log price: p-value > 0.05, so we cannot reject the null hypothesis; the series is non-stationary.

11.5 KPSS Test

kpss.test(zill_data_diff$Price) # Raw Price Value
## Warning in kpss.test(zill_data_diff$Price): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  zill_data_diff$Price
## KPSS Level = 2.299, Truncation lag parameter = 4, p-value = 0.01
kpss.test(zill_data_diff$price_diff) # First Difference
## 
##  KPSS Test for Level Stationarity
## 
## data:  zill_data_diff$price_diff
## KPSS Level = 0.70956, Truncation lag parameter = 4, p-value = 0.01268
kpss.test(zill_data_log_diff$price_log) # Log Price Value
## Warning in kpss.test(zill_data_log_diff$price_log): p-value smaller than printed
## p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  zill_data_log_diff$price_log
## KPSS Level = 2.3643, Truncation lag parameter = 4, p-value = 0.01
kpss.test(zill_data_log_diff$price_diff) # Differenced Log Price Value
## 
##  KPSS Test for Level Stationarity
## 
## data:  zill_data_log_diff$price_diff
## KPSS Level = 0.42909, Truncation lag parameter = 4, p-value = 0.06462
  • KPSS, raw price: p-value < 0.05, so we reject the null hypothesis; the series is non-stationary.
  • KPSS, first difference: p-value < 0.05, so we reject the null hypothesis; the series is non-stationary.
  • KPSS, log price: p-value < 0.05, so we reject the null hypothesis; the series is non-stationary.
  • KPSS, differenced log price: p-value > 0.05, so we cannot reject the null hypothesis; the series is stationary (see the ndiffs() cross-check below).
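
As a cross-check on the two tests, forecast::ndiffs() reports how many differences each test suggests for a series; a sketch on the raw prices:

# Number of differences recommended by each test for the raw price series
forecast::ndiffs(zill_data_diff$Price, test = "adf")
forecast::ndiffs(zill_data_diff$Price, test = "kpss")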

Looking visually at the plots generated through the various methods, there does not seem to be any seasonal factor affecting house prices. The major driver of house prices is sudden, unexpected financial stress in the market. For example, the global financial crisis of 2007-2008 resulted in a drop in housing prices from 2009 to 2012. After 2012, prices were fairly steady until 2020, when the global pandemic triggered a surge in housing prices, since there was minimal impact on white-collar jobs. Another possible reason is that remote working led people to invest in remote and preferred places around the US, which may have pushed prices up.

11.6 Arima Modeling with Lag Values (ACF/PACF)

ts1 <- ts(zill_data_diff$Price)
ts2 <- ts(zill_data_diff$price_diff)
ts3 <- ts(zill_data_log_diff$price_log)
ts4 <- ts(zill_data_log_diff$price_diff)

par(mfrow=c(2,2))
plot.ts(ts1)
plot.ts(ts2)
plot.ts(ts3)
plot.ts(ts4)

acf(ts1)
pacf(ts1)

acf(ts2)
pacf(ts2)

acf(ts3)
pacf(ts3)

acf(ts4)
pacf(ts4)

We can observe a dampening effect in the ACF plots for almost every transformation, and the PACF plots suggest a first-order (AR(1)) lag structure.

BIC values for the candidate orders:

BIC(
arima(zill_data_log_diff$price_log, order=c(0,0,1)),
arima(zill_data_log_diff$price_log, order=c(0,1,1)),
arima(zill_data_log_diff$price_log, order=c(0,1,2)),
arima(zill_data_log_diff$price_log, order=c(0,1,3)),
arima(zill_data_log_diff$price_log, order=c(1,1,0)),
arima(zill_data_log_diff$price_log, order=c(1,1,1)),
arima(zill_data_log_diff$price_log, order=c(2,1,0)),
arima(zill_data_log_diff$price_log, order=c(3,1,0))
)
## Warning in BIC.default(arima(zill_data_log_diff$price_log, order = c(0, : models
## are not all fitted to the same number of observations
##                                                         df        BIC
## arima(zill_data_log_diff$price_log, order = c(0, 0, 1))  3  -181.8744
## arima(zill_data_log_diff$price_log, order = c(0, 1, 1))  2  -848.9640
## arima(zill_data_log_diff$price_log, order = c(0, 1, 2))  3  -949.3890
## arima(zill_data_log_diff$price_log, order = c(0, 1, 3))  4  -981.8349
## arima(zill_data_log_diff$price_log, order = c(1, 1, 0))  2 -1006.6656
## arima(zill_data_log_diff$price_log, order = c(1, 1, 1))  3 -1003.2269
## arima(zill_data_log_diff$price_log, order = c(2, 1, 0))  3 -1003.2888
## arima(zill_data_log_diff$price_log, order = c(3, 1, 0))  4  -998.5988

Note the warning: models with different orders of differencing are fitted to different numbers of observations, so their BIC values are only roughly comparable. Among the candidates, ARIMA(1,1,0), i.e. an AR(1) on the differenced series, has the lowest BIC.

11.7 Auto ARIMA Model for Lag Values

auto.arima(zill_data_log_diff$price_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: zill_data_log_diff$price_log 
## ARIMA(1,2,2) 
## 
## Coefficients:
##           ar1     ma1     ma2
##       -0.5158  1.0415  0.9664
## s.e.   0.0871  0.0595  0.0998
## 
## sigma^2 estimated as 1.1e-05:  log likelihood=515.96
## AIC=-1023.92   AICc=-1023.57   BIC=-1012.87

The auto.arima search selects ARIMA(1,2,2) for the log price series. Below we proceed with the closely related ARIMA(1,1,2) specification on the raw prices.

Fit several ARIMA models to your time-series variable. Which model is the “best” according to the AIC? According to the BIC?

best_model <- arima(Four_Bedroom_Tampa$Price, order = c(1,1,2) )
forecast::checkresiduals(best_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 45.678, df = 7, p-value = 1.01e-07
## 
## Model df: 3.   Total lags used: 10

The residual plots look broadly like white noise, but the Ljung-Box test is highly significant (p ≈ 1e-07), so some autocorrelation remains in the residuals.
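
Given that residual autocorrelation, one option is to try a richer specification and re-run the diagnostics; a sketch (not necessarily a better model):

# Re-check residuals after adding an AR term while keeping the MA terms
alt_model <- arima(Four_Bedroom_Tampa$Price, order = c(2, 1, 2))
forecast::checkresiduals(alt_model)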

12. Model Selection

12.1 RMSE Value for ARIMA

best_mod = arima(zill_data_log_diff$price_log,order=c(1,1,2))

resid = best_mod$residuals

pred = zill_data_log_diff$price_log - resid  # fitted = observed - residual

ggplot()+
  geom_line(aes(zill_data_log_diff$Monthly_parsed,zill_data_log_diff$price_log))+
  geom_line(aes(zill_data_log_diff$Monthly_parsed,pred),color='blue',alpha=0.4)+
  theme_bw()+
  xlab("Date")+
  ylab("Log Price Diff Housing")

RMSE = sqrt(mean((expm1(pred) - expm1(zill_data_log_diff$price_log))^2,na.rm=T))
RMSE
## [1] 747.3862

The in-sample RMSE is about 747 (in price units), and the fitted values track the trend in the data closely.

12.2 Residual ACF and PACF

par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)

The ACF and PACF of the residuals show only modest remaining autocorrelation; the Ljung-Box tests below quantify it.

12.3 Ljung-Box Test

Box.test(resid,type='Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 11.114, df = 1, p-value = 0.000857
Box.test(resid,type='Ljung-Box',lag=5)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 12.199, df = 5, p-value = 0.03215
Box.test(resid,type='Ljung-Box',lag=10)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 13.597, df = 10, p-value = 0.1922
Box.test(resid,type='Ljung-Box',lag=20)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 21.677, df = 20, p-value = 0.3583
  • At lags 1 and 5 we can reject the null hypothesis, meaning some residual autocorrelation remains; at lags 10 and 20 we cannot.

13. Forecasting

best_mod %>%
  forecast(h=6) %>% 
  autoplot(ylab = "Price Log")

13.1 Forecast after Reverting the Transformation

prediction = predict(best_mod,n.ahead=6)

# The six months following the last observation (lubridate %m+%)
future_dates = max(zill_data_log_diff$Monthly_parsed) %m+% months(1:6)

pred_data = data.frame(
  pred = prediction,
  Monthly_parsed = future_dates
)

pred_merge = Four_Bedroom_Tampa %>% 
  full_join(pred_data) %>%
  filter(Monthly_parsed>ymd("2012-01-31") & Monthly_parsed<=ymd("2022-06-30")) %>%
  mutate(
    pred_high = expm1(pred.pred+2*pred.se),
    pred_low = expm1(pred.pred - 2*pred.se),
    pred.pred = expm1(pred.pred),
    error = pred.pred - Price
    )
## Joining, by = "Monthly_parsed"
pred_merge %>% 
  ggplot()+
  geom_line(aes(Monthly_parsed,Price))+
  geom_line(aes(Monthly_parsed,pred.pred),color='blue')+
  geom_ribbon(aes(x=Monthly_parsed,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)

Note: You can use auto.arima() to find the “best” model, but you should also compare manually specified ARIMA fits.

14. Facebook Prophet Modelling

14.1 Fitting a Basic Model

library(prophet)
prophet_data = Four_Bedroom_Tampa %>% 
    rename(ds = Monthly_parsed, # Have to name our date variable "ds"
    y = Price)  # Have to name our time series "y"

train = prophet_data %>% 
  filter(ds<ymd("2020-01-01"))

test = prophet_data %>%
  filter(ds>=ymd("2020-01-01"))

model = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = 730)

forecast = predict(model,future)

14.2 Plotting Forecast

plot(model,forecast)+
ylab("Price")+xlab("Date")+theme_bw()

14.3 Interactive Charts

dyplot.prophet(model,forecast)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

14.4 Visualizing the Components

Very similar to general time series decomposition

prophet_plot_components(model,forecast)

  • The overall trend shows a gradual increase in price from 2018 to 2022.
  • The yearly component shows some seasonality: prices go up in summer around June, drop during the fall, and spike again toward the end of the year.

14.5 Plotting Changepoints

plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Price")

14.6 Changepoint Hyperparameters

A greater number of changepoints and more flexibility may improve performance, but may also increase the risk of overfitting.

# Number of Changepoints
model = prophet(train,n.changepoints=50)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future)

plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Price")

The identified changepoints do not appear very significant, since the monthly data shows only gradual increases and decreases in housing prices.

prophet_plot_components(model,forecast)
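
Trend flexibility can also be tuned through the changepoint prior scale (prophet's changepoint.prior.scale argument, default 0.05); a sketch with a deliberately loose prior:

# A larger prior scale lets the trend bend more at each changepoint
flexible = prophet(train, changepoint.prior.scale = 0.5)
flexible_fcst = predict(flexible, future)
plot(flexible, flexible_fcst) + add_changepoints_to_plot(flexible) +
  theme_bw() + xlab("Date") + ylab("Price")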

14.7 Non-Scaled Forecast

forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2020-01-01"))

forecast_not_scaled = ggplot()+
geom_line(aes(test$ds,test$y))+
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red')

forecast_not_scaled

The plot above shows that the prediction falls well short of the actual prices in the test period.

15. Prediction

three_yr_future = make_future_dataframe(model,periods = 1100)
three_yr_forecast = predict(model,three_yr_future)
plot(model,three_yr_forecast)+theme_bw()+xlab("Date")+ylab("Price")

We cannot conclude much from the three-year forecast: it increases steadily, in contrast to the reality during COVID, when housing prices rose sharply.

Looking at the three-year trend above, the forecast does not need to account for any saturation point (maximum or minimum); for reference, a saturating-growth sketch follows.
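
If a saturating forecast were ever needed, Prophet's logistic growth mode uses a cap column on both the training and future frames; a sketch with a purely hypothetical ceiling:

# Hypothetical cap of $600,000; logistic growth saturates toward the cap
train_cap = train %>% mutate(cap = 600000)
saturating = prophet(train_cap, growth = 'logistic')
future_cap = make_future_dataframe(saturating, periods = 1100) %>%
  mutate(cap = 600000)
saturating_fcst = predict(saturating, future_cap)
plot(saturating, saturating_fcst) + theme_bw() + xlab("Date") + ylab("Price")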

15.1 Component Model Prediction

prophet_plot_components(model,forecast)

The plot above shows the seasonality, with a sharp decline in prices during March and then an increase during the summer months.

15.2 Seasonality in Prophet

Prophet supports both additive and multiplicative seasonality.

additive = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
add_fcst = predict(additive,future)

plot(additive,add_fcst)

15.3 Additive Seasonality

prophet_plot_components(additive,add_fcst)

15.4 Multiplicative Seasonality

multi = prophet(train,seasonality.mode = 'multiplicative')
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future)

plot(multi,multi_fcst)

prophet_plot_components(multi, multi_fcst)

15.5 Specifying Holidays for Daily Data

Holidays have no impact on this housing data, since the series is monthly rather than daily.

16. Assessment of Prophet Models

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2020-01-01"))

RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
MAE = mean(abs(test$y - forecast_metric_data$yhat))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
## Warning in test$y - forecast_metric_data$yhat: longer object length is not a
## multiple of shorter object length
## Warning in (test$y - forecast_metric_data$yhat)/test$y: longer object length is
## not a multiple of shorter object length
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 42188.7"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 30525.93"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.09"
  • RMSE is higher than MAE, as expected, since RMSE penalizes large errors more heavily.
  • MAPE is about 9%, meaning our predictions are off by roughly 9% on average. Note the length-mismatch warnings above: the forecast rows are daily while the test data is monthly, so these metrics should be read with caution; a monthly future data frame avoids the misalignment (see the sketch below).
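
The warnings arise because make_future_dataframe() defaults to daily periods while our series is monthly; building the future frame at monthly frequency keeps the forecast rows aligned one-to-one with the test set. A sketch (assuming prophet's freq argument accepts "month"):

# 24 monthly periods (2 years) instead of 730 daily ones
future_monthly = make_future_dataframe(model, periods = 24, freq = "month")
forecast_monthly = predict(model, future_monthly)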

16.1 Cross-Validation with Prophet

df.cv <- cross_validation(model, initial = 2920, period = 365, horizon = 365, units = "days")
## Making 11 forecasts with cutoffs between 2009-01-02 and 2018-12-31
head(df.cv)
## # A tibble: 6 × 6
##        y ds                     yhat yhat_lower yhat_upper cutoff             
##    <dbl> <dttm>                <dbl>      <dbl>      <dbl> <dttm>             
## 1 206999 2009-01-31 00:00:00 209510.    207065.    211890. 2009-01-02 00:00:00
## 2 204291 2009-02-28 00:00:00 206775.    204376.    209036. 2009-01-02 00:00:00
## 3 201840 2009-03-31 00:00:00 203442.    200783.    206162. 2009-01-02 00:00:00
## 4 198242 2009-04-30 00:00:00 200184.    197353.    202907. 2009-01-02 00:00:00
## 5 195884 2009-05-31 00:00:00 196736.    193641.    199828. 2009-01-02 00:00:00
## 6 192684 2009-06-30 00:00:00 193250.    189825.    196810. 2009-01-02 00:00:00
unique(df.cv$cutoff)
##  [1] "2009-01-02 GMT" "2010-01-02 GMT" "2011-01-02 GMT" "2012-01-02 GMT"
##  [5] "2013-01-01 GMT" "2014-01-01 GMT" "2015-01-01 GMT" "2016-01-01 GMT"
##  [9] "2016-12-31 GMT" "2017-12-31 GMT" "2018-12-31 GMT"

16.2 Cross-Val Actual vs Pred

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Price")+
  scale_color_discrete(name = 'Cutoff')

performance_metrics(df.cv)
##     horizon       mse      rmse       mae       mape      mdape      smape
## 1   57 days 105825408 10287.148  7194.786 0.03823971 0.01865136 0.03981570
## 2   58 days 124209984 11144.953  7907.724 0.04242600 0.01451994 0.04433405
## 3   59 days  93099180  9648.792  6182.004 0.03257739 0.00943273 0.03399875
## 4   88 days 129665891 11387.093  7602.062 0.04061261 0.01939580 0.04260262
## 5   89 days 139050169 11791.954  8016.831 0.04285878 0.01557659 0.04500396
## 6   90 days 114628506 10706.470  6899.224 0.03622710 0.01421974 0.03797148
## 7  118 days 156552866 12512.109  8343.723 0.04440717 0.01639912 0.04680881
## 8  119 days 167146696 12928.523  8944.088 0.04759875 0.02099539 0.05015176
## 9  120 days 138380635 11763.530  7741.450 0.04034309 0.01830325 0.04242291
## 10 149 days 190773923 13812.093  9193.089 0.04858329 0.02099539 0.05149475
## 11 150 days 203931897 14280.473  9988.467 0.05270238 0.02099539 0.05575262
## 12 151 days 167858618 12956.026  8409.832 0.04361577 0.01706310 0.04609178
## 13 179 days 231412077 15212.234  9982.783 0.05268921 0.01782473 0.05619120
## 14 180 days 246346622 15695.433 10865.039 0.05721455 0.03389750 0.06088714
## 15 181 days 203321416 14259.082  9097.765 0.04709150 0.01759668 0.05009397
## 16 210 days 276242686 16620.550 10954.344 0.05779244 0.02340230 0.06199901
## 17 211 days 290264013 17037.136 11889.490 0.06253203 0.03566445 0.06686004
## 18 212 days 237510576 15411.378  9848.494 0.05106021 0.02131809 0.05457992
## 19 241 days 319683278 17879.689 11932.112 0.06317913 0.03044333 0.06809612
## 20 242 days 331490983 18206.894 12745.518 0.06717952 0.03980312 0.07217087
## 21 243 days 269618146 16420.053 10495.483 0.05465120 0.02416739 0.05869990
## 22 271 days 355762724 18861.673 12817.504 0.06826580 0.03589756 0.07382762
## 23 272 days 371115120 19264.348 13575.832 0.07169865 0.03922339 0.07734059
## 24 273 days 301032603 17350.291 11203.925 0.05846624 0.03027716 0.06303050
## 25 302 days 399699656 19992.490 13926.812 0.07443759 0.04464477 0.08077389
## 26 303 days 430241322 20742.259 14824.596 0.07824909 0.04464477 0.08483365
## 27 304 days 350221970 18714.218 12368.340 0.06454354 0.03770889 0.06988204
## 28 332 days 458612270 21415.235 15403.817 0.08245176 0.05077511 0.08978582
## 29 333 days 498794063 22333.698 16299.472 0.08580852 0.05077511 0.09343648
## 30 334 days 403613576 20090.136 13598.821 0.07085551 0.03877666 0.07700207
## 31 363 days 529904666 23019.658 16966.091 0.09082214 0.05742924 0.09933773
## 32 364 days 566437309 23799.943 17724.813 0.09329064 0.05742924 0.10201734
## 33 365 days 453030016 21284.502 14704.084 0.07680833 0.03858939 0.08381069
##     coverage
## 1  0.3653846
## 2  0.4230769
## 3  0.4871795
## 4  0.4615385
## 5  0.4615385
## 6  0.5000000
## 7  0.5128205
## 8  0.4230769
## 9  0.4230769
## 10 0.4102564
## 11 0.3076923
## 12 0.3846154
## 13 0.4102564
## 14 0.3076923
## 15 0.3846154
## 16 0.4102564
## 17 0.3076923
## 18 0.4615385
## 19 0.4871795
## 20 0.3846154
## 21 0.5000000
## 22 0.4102564
## 23 0.3076923
## 24 0.4230769
## 25 0.2820513
## 26 0.2307692
## 27 0.3461538
## 28 0.2307692
## 29 0.2307692
## 30 0.3461538
## 31 0.2307692
## 32 0.3076923
## 33 0.4230769
plot_cross_validation_metric(df.cv, metric = 'rmse')