1.0 Introduction.

This document builds a forecast of median real estate prices in the United States. Median housing prices are a vital indicator of the health of the real estate market and of economic activity within a region. By accurately forecasting these prices, stakeholders such as homebuyers, sellers, and policymakers can make informed decisions.

The dataset used in this analysis contains individual property sale records from 2001 to 2022, from which we derive monthly median sale prices. We will leverage the tidyverse, together with the fable family of time series packages, to explore the data, identify patterns, and build robust time series models for forecasting.

Our objectives in this analysis include:

  1. Exploratory Data Analysis (EDA): We will begin by loading and visualizing the data set to understand the underlying trends, seasonality, and any potential outliers or anomalies.

  2. Model Selection: We will experiment with various specifications of ARIMA models to find the best model that captures the patterns in the data effectively.

  3. Model Evaluation: We will assess the performance of our chosen model using appropriate accuracy metrics such as RMSE, MAPE, MASE, and RMSSE.

  4. Forecasting: Finally, we will generate forecasts for future median housing prices based on our selected model, providing valuable insights.

1.1 Prepare Workspace.

# clean workspace
rm(list = ls())

# adjust notation
options(scipen = 999, digits = 10)

# Set the seed for reproducibility
set.seed(123)

# Prepare needed libraries
packages <- c("ggplot2",     # Visualizations 
              "dplyr",       # Data manipulation
              "tidyr",       # Data formatting 
              "knitr",       # markdown output formatting
              "tsibble",
              "lubridate",
              "fpp3",
              "fabletools",
              "fable",
              "foreign"
              )

for (i in seq_along(packages)) {
  if (!packages[i] %in% rownames(installed.packages())) {
    install.packages(packages[i],
                     repos = "http://cran.rstudio.com/",
                     dependencies = TRUE)
  }
  library(packages[i], character.only = TRUE)
}
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:tsibble':
## 
##     interval
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.2.1     ✔ fable       0.3.4
## ✔ tsibbledata 0.4.1     ✔ fabletools  0.4.2
## ✔ feasts      0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()     masks base::date()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ tsibble::intersect()  masks base::intersect()
## ✖ lubridate::interval() masks tsibble::interval()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ tsibble::setdiff()    masks base::setdiff()
## ✖ tsibble::union()      masks base::union()

1.2 Load Data.

# set working directory for chunk
setwd("~/Documents/Rpubs_Projects")

# Read Data
df <- read.csv("Real_Estate_Sales_2001-2021_GL.csv")

# view the structure of the data set
str(df)
## 'data.frame':    1054159 obs. of  14 variables:
##  $ Serial.Number   : int  2020348 20002 210317 200212 200243 200377 200109 2020180 2020313 200097 ...
##  $ List.Year       : int  2020 2020 2021 2020 2020 2020 2020 2020 2020 2020 ...
##  $ Date.Recorded   : chr  "09/13/2021" "10/02/2020" "07/05/2022" "03/09/2021" ...
##  $ Town            : chr  "Ansonia" "Ashford" "Avon" "Avon" ...
##  $ Address         : chr  "230 WAKELEE AVE" "390 TURNPIKE RD" "53 COTSWOLD WAY" "5 CHESTNUT DRIVE" ...
##  $ Assessed.Value  : num  150500 253000 329730 130400 619290 ...
##  $ Sale.Amount     : num  325000 430000 805000 179900 890000 ...
##  $ Sales.Ratio     : num  0.463 0.588 0.41 0.725 0.696 ...
##  $ Property.Type   : chr  "Commercial" "Residential" "Residential" "Residential" ...
##  $ Residential.Type: chr  "" "Single Family" "Single Family" "Condo" ...
##  $ Non.Use.Code    : chr  "" "" "" "" ...
##  $ Assessor.Remarks: chr  "" "" "" "" ...
##  $ OPM.remarks     : chr  "" "" "" "" ...
##  $ Location        : chr  "" "" "POINT (-72.846365959 41.781677018)" "" ...

2.0 Data Cleaning.

# check for missing values coded as NA
colSums(is.na(df))
##    Serial.Number        List.Year    Date.Recorded             Town 
##                0                0                0                0 
##          Address   Assessed.Value      Sale.Amount      Sales.Ratio 
##                0                0                0                0 
##    Property.Type Residential.Type     Non.Use.Code Assessor.Remarks 
##                0                0                0                0 
##      OPM.remarks         Location 
##                0                0

The is.na() function indicates that there are no missing values, but from str() we can see that there is missing data in the form of blank strings. This means we will have to check for blank values represented as "" in the data frame.

# check for blank spaces 
missing_values <- apply(df == "", 2, sum)
print(missing_values)
##    Serial.Number        List.Year    Date.Recorded             Town 
##                0                0                2                0 
##          Address   Assessed.Value      Sale.Amount      Sales.Ratio 
##               51                0                0                0 
##    Property.Type Residential.Type     Non.Use.Code Assessor.Remarks 
##           382446           393884           751917           892681 
##      OPM.remarks         Location 
##          1042595           799516
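
Most of the blanks fall in columns we will not use. If they mattered, one option (not applied here, just a sketch; df_blank_as_na is only a placeholder name) would be to recode blank strings as NA so that is.na() and na.omit() catch them:

# sketch: recode blank strings as NA across all character columns (not run in this workflow)
df_blank_as_na <- df |>
  mutate(across(where(is.character), ~ na_if(.x, "")))

colSums(is.na(df_blank_as_na))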

Cool, now we will aggregate our data up to the monthly level. We will use the median sale amount as the variable of interest, because the mean sale price is too volatile, with tons of noise and outliers.

# summarize to day and drop columns
df |>
  select(Date.Recorded, Sale.Amount)  |>
  group_by(Date.Recorded) |>
  summarise(Sale.Amount = median(Sale.Amount))-> df

# create proper date format 
df$Date.Recorded <- as.Date(df$Date.Recorded, format = "%m/%d/%Y")

2.1 Create tsibble.

# create month column 
df |>
  mutate(month = yearmonth(Date.Recorded)) -> monthlysales

# drop missing data 
monthlysales <- na.omit(monthlysales)

# create tsibble & aggregate to month using median
monthlysales |>
  group_by(month) |>
  summarise(sales = median(Sale.Amount, na.rm = TRUE)) |>
  as_tsibble(index = month) -> rst

# Drop the first three rows because they are affected by missing data
rst <- rst[4:255,]

In this step, we create a tsibble (time series tibble) for easier analysis. We ensure there are no missing values and use the median sale amount for each month. We then drop the first three rows, which are affected by missing data, to maintain a clean dataset.
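
Because tsibbles treat missing months as implicit gaps rather than NA rows, it is also worth confirming that the monthly index is regular. A quick sketch using tsibble's gap helpers:

# sketch: confirm the monthly index has no implicit gaps
has_gaps(rst)

# if any gaps were reported, fill_gaps() would make the missing months explicit
# rst |> fill_gaps()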

3.0 Explore Data.

# plot data 
rst |>
  ggplot()+
  geom_line(aes(x = month, y = sales))+
  labs(title = "Median Realestate Sales Price")

The data appears to be non-stationary, with an upward trend. The variance, however, looks roughly constant; no section of the time series stands out to me as more variable than the rest, so we will not need any transformations. Lastly, there is some seasonality in the data that we should investigate further.
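
Since the constant-variance judgment is visual, one way to double-check whether a transformation is warranted is the Guerrero feature, which suggests a Box-Cox lambda; a value near 1 supports skipping the transformation. A sketch:

# sketch: estimate a Box-Cox lambda; lambda close to 1 means little transformation is needed
rst |>
  features(sales, features = guerrero)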

3.1 Stationarity.

# KPSS test for stationarity
rst |>
  features(sales, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.11        0.01
# Auto correlation function 
rst |>
  ACF(sales) |>
  autoplot()+
  labs(title = "ACF of Median Price ")

# Partial autocorrelation function 
rst |>
  PACF(sales) |>
  autoplot()+
  labs(title = "PACF of Median Price ")

This time series is clearly non-stationary. The autocorrelation plot shows slowly decaying lags, indicating a trend in the data. Combined with the KPSS test result (p-value of 0.01, rejecting the null of stationarity), this shows that the data is non-stationary and will require differencing.
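
As a cross-check on the visual and KPSS evidence, the unit-root features can suggest how many regular and seasonal differences are needed. A sketch:

# sketch: suggested number of first differences
rst |>
  features(sales, unitroot_ndiffs)

# sketch: suggested number of seasonal differences
rst |>
  features(sales, unitroot_nsdiffs)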

3.2 Decomposition.

# Fit STL decomposition
rst |>
  model(
    STL(sales)
  ) -> dcmp_stl

# Plot STL decomposition
components(dcmp_stl) |>
  autoplot() +
  labs(title = "STL Decomposition of Monthly Median Sale Price")

# Extract the decomposition components
dcmp_comp <- components(dcmp_stl)

# Seasonal plot
rst |>
  gg_season(sales) +
  labs(title = "Seasonal Plot of Median Price")

There is a seasonal pattern in the data, with a noticeable peak in the median price around the middle of each year. It would be beneficial to work with seasonally adjusted data moving forward to account for this pattern; most economic time series used for analysis are seasonally adjusted, which provides further justification for this choice.

3.3 Plot Seasonally Adjusted Data.

dcmp_comp |>
  select(-.model) -> dcmp_comp

dcmp_comp |>
  ggplot()+
  geom_line(aes(x = month, y = season_adjust), color = "green3")+
  labs(title = "Seasonally adjusted Median Sale Price")

# KPSS test on the seasonally adjusted series
dcmp_comp |>
  features(season_adjust, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.14        0.01

The seasonally adjusted series is still non-stationary (the KPSS p-value remains 0.01), so we take a first difference next.

3.4 Plot Differenced Data.

# Plot differenced data
dcmp_comp |>
  ggplot() +
  geom_line(aes(x = month, y = difference(season_adjust)), color = "green3") +
  labs(title = "First Differenced Median Sale Price")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

# KPSS test for stationarity on the differenced series
dcmp_comp |>
  features(difference(season_adjust), unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.265         0.1

3.5 Autocorrelation.

# ACF 
dcmp_comp |>
  ACF(difference(season_adjust)) |>
  autoplot()+
  labs(title = "ACF of Diffrenced Median Price")

# PACF
dcmp_comp |>
  PACF(difference(season_adjust)) |>
  autoplot()+
  labs(title = "PACF of Diffrenced Median Price")

We plot the ACF and PACF of the differenced, seasonally adjusted data to identify candidate ARIMA orders from the correlation structure after differencing. The KPSS test on the differenced series confirms that it is now stationary, which is crucial before modeling, and suggests that a single first difference is sufficient.

4.0 Modeling.

To build and validate our time series model, we split the dataset into training and test sets. The training set consists of the first 228 observations of the series, and the last 24 months serve as the test set. This approach is crude, but it allows us to train the model on a substantial portion of the data while keeping a separate test set for unbiased evaluation.
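
For a less crude evaluation, time series cross-validation with a rolling origin could be used instead of a single split. A minimal sketch on the seasonally adjusted series (the .init and .step values are arbitrary choices):

# sketch: rolling-origin cross-validation of the chosen ARIMA specification
dcmp_comp |>
  stretch_tsibble(.init = 120, .step = 12) |>
  model(Ar = ARIMA(season_adjust ~ pdq(2,1,0) + PDQ(2,1,0))) |>
  forecast(h = 12) |>
  accuracy(dcmp_comp)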

4.1 Fit to Training Data.

# create training set

# non-seasonally adjusted training set
train_rst <- rst[1:228, ]

# seasonally adjusted training set
train_sa <- dcmp_comp[1:228, ]

# fit candidate models to the training data
fit <- train_sa |>
  model(
    AA    = ARIMA(season_adjust, stepwise = FALSE),
    Arima = ARIMA(season_adjust ~ pdq(2,1,1) + PDQ(1,1,1)),
    Ar    = ARIMA(season_adjust ~ pdq(2,1,0) + PDQ(2,1,0))
  )

# forecast over the test period
fit |>
  forecast(h = 24) -> fit_t


fit_t |>
  autoplot(dcmp_comp)+
  labs(title = "Forecast with Training Data")

# inspect the order selected by the automatic ARIMA search
print(fit[[1]][[1]]$fit$spec)
##     p d q P D Q constant period
## 581 3 1 1 2 0 0    FALSE     12

4.2 Model Evaluation.

# find most accurate model 
fit_accuracy <- accuracy(fit_t, dcmp_comp) |>
  arrange(.model) |>
  select(.model, .type, RMSE, MAPE,MASE, RMSSE)

fit_accuracy
## # A tibble: 3 × 6
##   .model .type   RMSE  MAPE  MASE RMSSE
##   <chr>  <chr>  <dbl> <dbl> <dbl> <dbl>
## 1 AA     Test  13309.  3.45 0.647 0.668
## 2 Ar     Test  10214.  2.42 0.447 0.512
## 3 Arima  Test  26122.  7.13 1.33  1.31

The best performing model is the (Ar) autoregressive model, the (AA) automatic ARIMA model was second best, and the manually specified ARIMA model performed the worst. Our Ar model does not have a moving average component; it uses only autoregressive lags and a first difference. Importantly, the Ar and AA models do better than a one-step naive forecast, as indicated by RMSSE values below 1.
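
To make the comparison with a naive benchmark explicit, rather than relying on RMSSE alone, benchmark models could be fitted on the same training set. A sketch:

# sketch: explicit naive and drift benchmarks on the training set
train_sa |>
  model(
    naive = NAIVE(season_adjust),
    drift = RW(season_adjust ~ drift())
  ) |>
  forecast(h = 24) |>
  accuracy(dcmp_comp)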

# Extract residuals
fit |>
  select(Ar) |>
  gg_tsresiduals()+
  labs(title = "Residuals of Ar Model")

# check the location of the roots
gg_arma(fit |> select(Ar))

Moving forward with the AR model, we inspect the residuals and examine the roots of our model. Starting with the residuals, they appear to be close to normally distributed and resemble white noise. This suggests that our model is capturing the underlying structure of the data well and is unbiased. However, there is a concern regarding a change in the amplitude of the residuals, with the first few observations being very small. This could indicate inconsistent variance in the data or potential overfitting of the model.
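
A formal complement to the visual residual check is a Ljung-Box test on the innovation residuals; a sketch (the lag of 24 and dof of 4, for the four AR coefficients, are judgment calls):

# sketch: Ljung-Box test for remaining autocorrelation in the Ar model's residuals
fit |>
  select(Ar) |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 4)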

Next, we inspect the inverted roots of the AR model. All roots lie within the unit circle, ensuring the stability of the model. However, many of the roots are close to 1, which is problematic as it suggests the model is near the boundary of non-stationarity. Fortunately, none of the roots lie on the unit circle, which would indicate an unstable forecast and non-stationary model. Overall, while the model is stable, the proximity of the roots to 1 warrants caution and further scrutiny.

5.0 Forecast with Best Model.

# refit the chosen specification to the full seasonally adjusted series
fc_fit <- dcmp_comp |>
  model(
    final = ARIMA(season_adjust ~ pdq(2,1,0) + PDQ(2,1,0))
  )

# forecast the next 24 months
fc_fit |>
  forecast(h = 24) -> fc

fc |>
  autoplot(dcmp_comp) +
  labs(title = "Seasonally Adjusted Median Real Estate Price")

In the end, I believe this forecast is quite good. However, since I obtained this data set from Kaggle, I’m not sure if it is truly representative, so this forecast should not be used for inference.

Time series forecasting is a blend of art and science, and many decisions come down to judgment calls. If you’ve made it to this point in the project and noticed any errors, please leave a comment. I would greatly appreciate constructive feedback to help improve my skills.