This document builds a forecast of median real estate prices in the United States. Median housing prices are a vital indicator of the real estate market's health and of economic activity within a region. By accurately forecasting these prices, stakeholders such as homebuyers, sellers, and policymakers can make more informed decisions.
The data set used in this analysis contains historical median real estate prices from 2001 to 2022. We will leverage the tidyverse, and in particular the fable package, to explore the data, identify patterns, and build robust time series models for forecasting.
Our objectives in this analysis include:
Exploratory Data Analysis (EDA): We will begin by loading and visualizing the data set to understand the underlying trends, seasonality, and any potential outliers or anomalies.
Model Selection: We will experiment with various specifications of ARIMA models to find the best model that captures the patterns in the data effectively.
Model Evaluation: We will assess the performance of our chosen model using appropriate metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or others.
Forecasting: Finally, we will generate forecasts of future median housing prices based on our selected model, providing valuable insights.
# clean workspace
rm(list = ls())
# adjust notation
options(scipen = 999, digits = 10)
# Set the seed for reproducibility
set.seed(123)
# Prepare needed libraries
packages <- c("ggplot2",    # Visualizations
              "dplyr",      # Data manipulation
              "tidyr",      # Data formatting
              "knitr",      # Markdown output formatting
              "tsibble",    # Tidy temporal data frames
              "lubridate",  # Date handling
              "fpp3",       # Tidy forecasting meta-package
              "fabletools", # Modelling framework tools
              "fable",      # Forecasting models
              "foreign"     # Reading data from other software
              )
# Install any missing packages, then load them all
for (i in seq_along(packages)) {
  if (!packages[i] %in% rownames(installed.packages())) {
    install.packages(packages[i],
                     repos = "http://cran.rstudio.com/",
                     dependencies = TRUE)
  }
  library(packages[i], character.only = TRUE)
}
# set working directory for chunk
setwd("~/Documents/Rpubs_Projects")
# Read Data
df <- read.csv("Real_Estate_Sales_2001-2021_GL.csv")
# view the structure of the data set
str(df)
## 'data.frame': 1054159 obs. of 14 variables:
## $ Serial.Number : int 2020348 20002 210317 200212 200243 200377 200109 2020180 2020313 200097 ...
## $ List.Year : int 2020 2020 2021 2020 2020 2020 2020 2020 2020 2020 ...
## $ Date.Recorded : chr "09/13/2021" "10/02/2020" "07/05/2022" "03/09/2021" ...
## $ Town : chr "Ansonia" "Ashford" "Avon" "Avon" ...
## $ Address : chr "230 WAKELEE AVE" "390 TURNPIKE RD" "53 COTSWOLD WAY" "5 CHESTNUT DRIVE" ...
## $ Assessed.Value : num 150500 253000 329730 130400 619290 ...
## $ Sale.Amount : num 325000 430000 805000 179900 890000 ...
## $ Sales.Ratio : num 0.463 0.588 0.41 0.725 0.696 ...
## $ Property.Type : chr "Commercial" "Residential" "Residential" "Residential" ...
## $ Residential.Type: chr "" "Single Family" "Single Family" "Condo" ...
## $ Non.Use.Code : chr "" "" "" "" ...
## $ Assessor.Remarks: chr "" "" "" "" ...
## $ OPM.remarks : chr "" "" "" "" ...
## $ Location : chr "" "" "POINT (-72.846365959 41.781677018)" "" ...
# check missing values with na
colSums(is.na(df))
## Serial.Number List.Year Date.Recorded Town
## 0 0 0 0
## Address Assessed.Value Sale.Amount Sales.Ratio
## 0 0 0 0
## Property.Type Residential.Type Non.Use.Code Assessor.Remarks
## 0 0 0 0
## OPM.remarks Location
## 0 0
The is.na() function indicates that there are no missing values, but the str() output shows that there is missing data in the form of blank strings. This means we will have to check for blank values, represented as "", in the data frame.
# check for blank spaces
missing_values <- apply(df == "", 2, sum)
print(missing_values)
## Serial.Number List.Year Date.Recorded Town
## 0 0 2 0
## Address Assessed.Value Sale.Amount Sales.Ratio
## 51 0 0 0
## Property.Type Residential.Type Non.Use.Code Assessor.Remarks
## 382446 393884 751917 892681
## OPM.remarks Location
## 1042595 799516
Now we will aggregate our data up to the monthly level, using the median sale amount as the variable of interest. The mean sale price is too volatile, with plenty of noise and outliers, so the median gives a cleaner signal.
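To make the case for the median concrete, a quick comparison of mean and median monthly prices can be sketched as below. This chunk is illustrative only and assumes the raw df as read above, before the aggregation that follows.
# Sketch: compare mean vs median monthly sale prices on the raw data
df |>
  mutate(month = yearmonth(as.Date(Date.Recorded, format = "%m/%d/%Y"))) |>
  filter(!is.na(month)) |>
  group_by(month) |>
  summarise(mean_price = mean(Sale.Amount),
            median_price = median(Sale.Amount)) |>
  pivot_longer(-month, names_to = "statistic", values_to = "price") |>
  ggplot(aes(x = month, y = price, colour = statistic)) +
  geom_line() +
  labs(title = "Mean vs Median Monthly Sale Price", x = "Month", y = "Price")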
# keep only the needed columns and summarise to the daily median sale amount
df |>
  select(Date.Recorded, Sale.Amount) |>
  group_by(Date.Recorded) |>
  summarise(Sale.Amount = median(Sale.Amount)) -> df
# create proper date format
df$Date.Recorded <- as.Date(df$Date.Recorded, format = "%m/%d/%Y")
# create month column
df |>
  mutate(month = yearmonth(Date.Recorded)) -> monthlysales
# drop missing data
monthlysales <- na.omit(monthlysales)
# create tsibble & aggregate to month using the median
monthlysales |>
  group_by(month) |>
  summarise(sales = median(Sale.Amount, na.rm = TRUE)) |>
  as_tsibble(index = month) -> rst
# Drop the first three rows because of missing data
rst <- rst[4:255,]
In this step, we create a tsibble (time series tibble) for easier analysis. We ensure there are no missing values and use the median sale amount for each month. We then drop the first three rows, which are affected by missing data, to maintain a clean data set.
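As a quick sanity check on the new tsibble, tsibble's gap helpers can confirm that the monthly index is complete; a brief sketch:
# check the monthly index for implicit gaps
has_gaps(rst)   # expect .gaps = FALSE
scan_gaps(rst)  # lists any missing months, if present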
# plot data
rst |>
  ggplot() +
  geom_line(aes(x = month, y = sales)) +
  labs(title = "Median Real Estate Sales Price")
The data appears to be non-stationary with an upward trend. The variance looks roughly constant; no section of the series jumps out as noticeably more volatile, so we will not need any variance-stabilising transformation. Lastly, there is some seasonality in the data that we should investigate further.
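As a cross-check on the judgment that no transformation is needed, the Guerrero method can estimate a Box-Cox lambda; a value near 1 supports leaving the series untransformed. A small sketch:
# estimate a Box-Cox lambda via the Guerrero method
rst |>
  features(sales, features = guerrero)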
# KPSS test for stationarity
rst |>
features(sales, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.11 0.01
# Auto correlation function
rst |>
ACF(sales) |>
autoplot()+
labs(title = "ACF of Median Price ")
# Partial autocorrelation function
rst |>
PACF(sales) |>
autoplot()+
labs(title = "PACF of Median Price ")
This time series is clearly non-stationary. The autocorrelation function shows slowly decaying lags, indicating a trend in the data. Combined with the KPSS test result (a p-value of 0.01, rejecting the null hypothesis of stationarity), this confirms that the data is non-stationary and will require differencing.
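The number of differences can also be checked directly with the unit-root features; a sketch (the seasonal counterpart is included for completeness):
# suggested number of first differences and seasonal differences
rst |>
  features(sales, list(unitroot_ndiffs, unitroot_nsdiffs))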
# Fit STL decomposition
rst |>
model(
STL(sales)) -> dcmp_stl
# Plot STL decomposition
components(dcmp_stl) |>
autoplot() +
labs(title = "STL Decomposition of Monthly Median Sale Price")
dcmp_comp <- components(dcmp_stl)
rst |>
gg_season(sales)+
labs(title= "Seasonal Plot of Median Price")
There is a seasonal pattern in the data, with a noticeable peak in median price in the middle of each year. It would be beneficial to work with seasonally adjusted data moving forward to account for this pattern. Additionally, most economic time series used for analysis are seasonally adjusted, providing further justification for using seasonally adjusted data in our analysis.
dcmp_comp |>
select(-.model) -> dcmp_comp
dcmp_comp |>
ggplot()+
geom_line(aes(x = month, y = season_adjust), color = "green3")+
labs(title = "Seasonally adjusted Median Sale Price")
dcmp_comp|>
features(season_adjust, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.14 0.01
# Plot differenced data
dcmp_comp |>
  ggplot() +
  geom_line(aes(x = month, y = difference(season_adjust)), color = "green3") +
  labs(title = "First-Differenced Median Sale Price")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
# KPSS test for stationarity of the differenced series
dcmp_comp |>
features(difference(season_adjust), unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.265 0.1
# ACF
dcmp_comp |>
ACF(difference(season_adjust)) |>
autoplot()+
labs(title = "ACF of Diffrenced Median Price")
# PACF
dcmp_comp |>
PACF(difference(season_adjust)) |>
autoplot()+
labs(title = "PACF of Diffrenced Median Price")
We perform the KPSS test again on the differenced, seasonally adjusted data to confirm that it is now stationary, which is crucial before modelling, and we plot its ACF and PACF to identify candidate ARIMA orders from the correlation structure after differencing. A single first difference appears to be sufficient, as the KPSS test no longer rejects stationarity (p-value of 0.1).
To build and validate our time series model, we split the data set into training and test sets. The training set contains the first 228 observations of the series, leaving the final 24 months as the test set. This approach is crude, but it lets us train the models on a substantial portion of the data while keeping a separate hold-out set for unbiased evaluation.
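As an aside, a more robust alternative to a single split is rolling-origin cross-validation via stretch_tsibble(). The sketch below is illustrative only; the window sizes (.init, .step) and horizon are arbitrary choices, and it uses automatic ARIMA selection on each fold.
# Sketch: rolling-origin cross-validation on the seasonally adjusted series
dcmp_comp |>
  stretch_tsibble(.init = 120, .step = 12) |>
  filter(.id != max(.id)) |>   # drop the final fold, which has no future data to score against
  model(cv_arima = ARIMA(season_adjust)) |>
  forecast(h = 12) |>
  accuracy(dcmp_comp)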
# create training sets
# non-seasonally adjusted
train_rst <- rst[1:228,]
# seasonally adjusted
train_sa <- dcmp_comp[1:228,]
# fit candidate models to the training data
fit <- train_sa |>
  model(
    AA    = ARIMA(season_adjust, stepwise = FALSE),              # automatic order selection
    Arima = ARIMA(season_adjust ~ pdq(2, 1, 1) + PDQ(1, 1, 1)),  # manually specified ARIMA
    Ar    = ARIMA(season_adjust ~ pdq(2, 1, 0) + PDQ(2, 1, 0))   # pure AR terms with differencing
  )
fit |>
forecast(h = 24) -> fit_t
fit_t |>
autoplot(dcmp_comp)+
labs(title = "Forecast with Training Data")
# inspect the order selected by the automatic ARIMA
print(fit[[1]][[1]]$fit$spec)
## p d q P D Q constant period
## 581 3 1 1 2 0 0 FALSE 12
# find the most accurate model on the test set
fit_accuracy <- accuracy(fit_t, dcmp_comp) |>
  arrange(.model) |>
  select(.model, .type, RMSE, MAPE, MASE, RMSSE)
fit_accuracy
## # A tibble: 3 × 6
## .model .type RMSE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AA Test 13309. 3.45 0.647 0.668
## 2 Ar Test 10214. 2.42 0.447 0.512
## 3 Arima Test 26122. 7.13 1.33 1.31
The best-performing model is the (Ar) autoregressive model, the (AA) automatic ARIMA model is second best, and the manually specified (Arima) model performs worst. The Ar model has no moving average component; it uses only autoregressive lags with differencing. Importantly, the Ar and AA models beat a naive benchmark forecast, as indicated by their RMSSE values being below 1.
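In-sample information criteria offer another view of the same comparison, though they are computed on the training fit and are only strictly comparable between models with the same amount of differencing (here the Arima and Ar specifications). A sketch:
# compare in-sample information criteria across the fitted models
glance(fit) |>
  select(.model, AICc, BIC)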
# Plot residual diagnostics for the Ar model
fit |>
  select(Ar) |>
  gg_tsresiduals() +
  labs(title = "Residuals of Ar Model")
# check the location of the roots
gg_arma(fit |> select(Ar))
Moving forward with the AR model, we inspect the residuals and examine the roots of our model. Starting with the residuals, they appear to be close to normally distributed and resemble white noise. This suggests that our model is capturing the underlying structure of the data well and is unbiased. However, there is a concern regarding a change in the amplitude of the residuals, with the first few observations being very small. This could indicate inconsistent variance in the data or potential overfitting of the model.
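To back the visual impression with a formal check, a Ljung-Box test can be applied to the innovation residuals of the Ar model. The lag and degrees-of-freedom choices below (24 lags; 4 estimated AR coefficients) are illustrative assumptions:
# Ljung-Box test on the Ar model's innovation residuals
augment(fit) |>
  filter(.model == "Ar") |>
  features(.innov, ljung_box, lag = 24, dof = 4)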
Next, we inspect the inverted roots of the AR model. All roots lie within the unit circle, ensuring the stability of the model. However, many of the roots are close to 1, which is problematic as it suggests the model is near the boundary of non-stationarity. Fortunately, none of the roots lie on the unit circle, which would indicate an unstable forecast and non-stationary model. Overall, while the model is stable, the proximity of the roots to 1 warrants caution and further scrutiny.
fc_fit <- dcmp_comp |>
  model(
    final = ARIMA(season_adjust ~ pdq(2, 1, 0) + PDQ(2, 1, 0))
  )
fc_fit |>
forecast(h = 24) -> fc
fc |>
autoplot(dcmp_comp)+
labs(title = "Seasonal Adjusted Median Real Estate Price")
In the end, I believe this forecast is quite good. However, since I obtained this data set from Kaggle, I’m not sure if it is truly representative, so this forecast should not be used for inference.
Time series forecasting is a blend of art and science, and many decisions come down to judgment calls. If you’ve made it to this point in the project and noticed any errors, please leave a comment. I would greatly appreciate constructive feedback to help improve my skills.