Introduction
For this project, we will build a time series, decompose it, and use
the decomposition for forecasting. The data set I found for this project
records the sale prices of homes sold in the United States from 2007 to
2019, a span of a little over 12 years.
We will aggregate these sales into a monthly time series by computing
the average home sale price for each month from 2007 to 2019. We will
then decompose the monthly series to look for trend and seasonal
patterns, and use the decomposition to forecast future prices.
Data Description
I found this data set on kaggle.com on the following webpage: https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv
This data set looks at the sale prices for homes that have been
recorded from 2007 to 2019. The first observation was collected on
February 6, 2007, and the final observation in the data set was
collected on July 26, 2019. The sale prices are given in US dollars, and
were collected to look at the trends amongst the sale prices of homes in
this particular region.
We will read in the data set from GitHub and call it “home”.
home <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/raw_sales.csv")
The original data set contains several variables, including the
postcode of the house, the property type, and the number of bedrooms.
For this project we only need two of them: the sale price and the date
on which the house was sold. We will average the sale prices by month to
create a monthly time series and examine it for notable patterns.
Variables
The two variables we are interested in for this time series project
are:
datesold: The date on which the house was sold, given in the format
“year-month-day time”, for example “2007-02-07 00:00:00”. It identifies
when each sale occurred. We will reduce this variable to the year and
month of the sale in order to create a monthly time series.
price: The price for which the house was sold, given in US dollars.
This is the quantitative variable whose trends we want to observe over
time. We will replace it with the average sale price for each month of
each year in order to create a monthly time series.
We will create a monthly time series and analyze the trends and
patterns in the average home sale price by month from 2007 to 2019.
Exploratory Data Analysis
Before we create our time series object, let’s ensure that there are
no missing values in the original data set, “home”.
colSums(is.na(home))
datesold postcode price propertyType bedrooms
0 0 0 0 0
As we can see, there are zero missing values in our data set, so that
means we do not have to worry about filling in any missing observations.
We can proceed with creating our time series object.
Time Series Data Preparation
We will adjust the data to create our monthly time series. This will
allow us to analyze the monthly trends in home sale prices over the
years of which this data was collected.
We will start by creating a new data set called “home1” with just the
variables of datesold and price.
# Keep only the sale date and sale price columns.
home1 <- home[, c("datesold", "price")]
The monthly series will hold the average home sale price for each
month covered by the data set.
The datesold variable is given in the format “Year-Month-Day
hour:minute:second”. While scrolling through the data set entries, I
noticed that the time portion of every entry was 00:00:00. Since this
constant carries no time-of-day information, we will drop it, which
makes the variable easier to interpret and easier to aggregate by month.
This leaves a “Year-Month-Day” format for the entries of the datesold
variable.
home1$datesold <- as.Date(ymd_hms(home1$datesold))
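Rather than scrolling through the entries, the midnight observation can also be checked programmatically. The following is a minimal sketch in base R on a few made-up timestamps in the same format; the vector `stamps` is hypothetical and not taken from the data set. The same check could be applied to the datesold column before the time portion is dropped.

```r
# Hypothetical timestamps in the same format as datesold (not real data).
stamps <- c("2007-02-07 00:00:00", "2007-02-27 00:00:00", "2007-03-07 00:00:00")

# Parse as date-times in UTC so no time-zone shift can move the clock time.
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# TRUE only if every entry has a time portion of exactly midnight.
all_midnight <- all(format(parsed, "%H:%M:%S") == "00:00:00")

# With no time-of-day information to lose, keep just the calendar date.
dates <- as.Date(parsed)
year_month <- format(dates, "%Y-%m")

print(all_midnight)
print(year_month)
```

If `all_midnight` came back FALSE on the real column, dropping the time portion would discard information, so the check is worth running before the conversion.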
Next, we will calculate the average home sale price per month to
build the data set for our monthly time series. The datesold variable
will now be given in the format “Year-Month”, and the price variable
will hold the average home sale price for that month. We will store the
result in a new data set called “home2”.
home2 <- home1 %>%
  mutate(datesold = format(datesold, "%Y-%m")) %>%
  group_by(datesold) %>%
  summarise(price = mean(price, na.rm = TRUE))
The revised data set has exactly 150 observations, one monthly average
for each month from February 2007 through July 2019. This covers more
than twelve full years, which is enough to estimate an annual seasonal
pattern, so we do not need to drop any observations.
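A monthly time series object assumes one value for every calendar month, so it is worth confirming that no month is missing from the summary. The sketch below is a base R illustration on a small made-up summary table (the data frame `monthly` and its prices are placeholders); it lists any missing months and also counts the calendar months from February 2007 through July 2019.

```r
# Hypothetical monthly summary with one month deliberately left out.
monthly <- data.frame(
  datesold = c("2007-02", "2007-03", "2007-05"),
  price    = c(500000, 520000, 510000),
  stringsAsFactors = FALSE
)

# Every calendar month between the first and last month in the summary.
first_month <- as.Date(paste0(min(monthly$datesold), "-01"))
last_month  <- as.Date(paste0(max(monthly$datesold), "-01"))
expected    <- format(seq(first_month, last_month, by = "month"), "%Y-%m")

# Months with no sales would show up here; ts() cannot represent such gaps.
missing_months <- setdiff(expected, monthly$datesold)
print(missing_months)

# Number of calendar months from February 2007 through July 2019.
n_months <- length(seq(as.Date("2007-02-01"), as.Date("2019-07-01"), by = "month"))
print(n_months)
```

For the real data, an empty `missing_months` together with 150 rows would confirm that the monthly averages form an unbroken sequence.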
Time Series Plot
Now, let’s create a time series object from our data and plot it to
help visualize any potential trends that may be occurring within this
time series object. We will call this time series object “home.ts”.
Since we are creating a monthly time series looking at the average home
sale prices by month, our frequency will be 12.
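# Caution: the first monthly average is for February 2007, so start = c(2007, 2)
# would align the time index with the data. The output in this report was
# produced with start = c(2007, 1), which labels every month one period early.
home.ts <- ts(home2$price, start = c(2007, 1), frequency = 12)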
plot(home.ts, main="Monthly Average Home Sale Prices from 2007 to 2019", xlab = "Year", ylab = "Price")
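Because the first observation in the data is from February 2007, the `start` argument determines which calendar month each value is attached to. The sketch below uses 150 placeholder values (not the real prices) to show how the choice of `start` moves the last month of the series.

```r
# 150 placeholder values standing in for the monthly averages (not real data).
x <- seq_len(150)

# Index beginning in February 2007, the month of the first observation.
aligned <- ts(x, start = c(2007, 2), frequency = 12)

# Index beginning in January 2007: every label is one month early.
shifted <- ts(x, start = c(2007, 1), frequency = 12)

print(end(aligned))  # 2019 7
print(end(shifted))  # 2019 6
```

The values themselves are unchanged by `start`; only the dates attached to them, and therefore the dates attached to any forecasts, are affected.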

Looking at the time series plot, several patterns stand out in the
monthly average home sale prices from 2007 to 2019. There appears to be
some seasonality, visible in the peaks and drops that seem to repeat at
regular intervals. We will look further into this potential seasonality
with the decomposition methods.
Additionally, there appears to be a major peak and drop around the
start of the time series at around 2007 to 2008. This reaches much
higher than any other portion of the time series and is evidence of
something unusual which could be going on at this point. It is possible
that there was some major event occurring within the housing market at
this point which could have led to a massive rise and then a sudden fall
in the average sale prices of homes at the time.
Also, home sale prices appear to have increased gradually from the
start of the time series in 2007 to its end in 2019. Apart from the
early spike, the average home sale price rises fairly steadily over the
course of the series. The seasonal component could be additive or
multiplicative. In this case it looks more multiplicative, because the
seasonal variation appears to grow as the price level rises rather than
staying the same size every year.
Forecasting with Classical Decomposing
First, let’s begin with looking at a classical decomposition of our
time series data.
# Classical Decomposition
cls.decomp = decompose(home.ts)
par(mar = c(2,2,2,2))
plot(cls.decomp, xlab = "Year")
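By default, `decompose()` fits an additive model. Since the time series plot suggested seasonal swings that grow with the price level, a multiplicative decomposition is a natural alternative to compare. The sketch below is only an illustration on a synthetic series (the level and seasonal factor are made up); it shows the `type` argument and checks that the multiplicative components multiply back to the series.

```r
# Synthetic monthly series (made up): a rising level times a seasonal factor,
# so the seasonal swings grow as the level grows.
months <- 0:71
level  <- 400000 + 2500 * months
season <- 1 + 0.08 * sin(2 * pi * months / 12)
x      <- ts(level * season, start = c(2007, 2), frequency = 12)

add.decomp  <- decompose(x)                           # default: type = "additive"
mult.decomp <- decompose(x, type = "multiplicative")

# Multiplicative seasonal factors are scaled to average 1.
print(round(mult.decomp$figure, 3))

# trend * seasonal * random reproduces the series wherever the trend is defined
# (the moving-average trend is NA for the first and last six months).
ok      <- !is.na(mult.decomp$trend)
rebuilt <- (mult.decomp$trend * mult.decomp$seasonal * mult.decomp$random)[ok]
print(all.equal(as.numeric(rebuilt), as.numeric(x)[ok]))
```

On the real series, plotting both decompositions side by side and comparing the size of the random component is one way to judge which form fits better.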

As we can see in our plot, most of the house prices range between
$300,000 and $800,000.
There appears to be something interesting going on at the beginning
of the time series plot. There is a very sharp peak around 2007 to 2008,
with a major spike much higher than anything around it. This is in fact
the observation with the highest price in the entire time series plot,
reaching up to $800,000, so this is something of interest from the data.
After doing some research, there was a major financial crisis which
occurred from 2007 to 2008 which marked a huge economic recession. It
would make sense that a big financial crisis like this would also have a
significant effect on the housing market, so this is likely why there is
such a sharp and sudden peak in home sale prices around 2007 to
2008.
There does appear to be a seasonal pattern in the data, as seen in the
peaks and drops that recur throughout the plot. The pattern appears to
repeat each year, with a high point and a low point within each
twelve-month period.
A seasonal trend would make sense for a data set relating to house
sale prices as the time of year could affect the number of individuals
buying a new home, and thus sale prices rising to meet this larger
demand. For instance, times of year such as the summer may see an
increase of buyers looking for a new house due to the ideal weather and
the timing of summer break making it ideal to buy a house before the
next school year begins if the buyers have children. This greater demand
could lead to a rise in the sales prices of the homes, and could be a
potential explanation for this seasonal trend we see in the time series
plot.
Additionally, there appears to be an overall increase in home sale
prices from 2007 to 2019. Looking at the trend curve, we can see that
there has been an overall, gradual increase in the home sale prices from
the start of this data set’s collection to the end of its collection.
So, we can conclude that the home sale prices have gradually increased
from what they were in 2007 to what they became in 2019.
Forecasting with STL Decomposing
Next, let’s use STL decomposition on our time series data.
# STL Decomposition
stl.decomp = stl(home.ts, s.window = "periodic")
par(mar = c(2,2,2,2))
plot(stl.decomp)

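The STL decomposition method has several advantages over the classical
decomposition method we used previously. STL uses locally estimated
scatterplot smoothing, also known as LOESS, to estimate the trend and
seasonal components, so it can follow non-linear trends more flexibly
than the moving average used in classical decomposition, and it produces
trend estimates for the first and last months of the series as well.
Note that stl() fits an additive decomposition, and that
s.window = "periodic" holds the seasonal pattern fixed across years.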
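Two properties of `stl()` are easy to check on a small example: with `s.window = "periodic"` the seasonal component is identical in every year, and because the decomposition is additive, a series with multiplicative seasonality is commonly decomposed on the log scale. The sketch below uses a synthetic series (made up for illustration, not the housing data).

```r
# Synthetic monthly series (made up) with seasonal swings that grow over time.
months <- 0:119
x <- ts((400000 + 2000 * months) * (1 + 0.08 * sin(2 * pi * months / 12)),
        start = c(2007, 2), frequency = 12)

# s.window = "periodic": one seasonal pattern, repeated identically every year.
fit.periodic <- stl(x, s.window = "periodic")
seas <- as.numeric(fit.periodic$time.series[, "seasonal"])
print(isTRUE(all.equal(seas[1:12], seas[13:24])))

# stl() is additive, so multiplicative seasonality is usually handled by
# decomposing log(x): the three components then add up to log(x).
fit.log <- stl(log(x), s.window = "periodic")
print(isTRUE(all.equal(as.numeric(rowSums(fit.log$time.series)),
                       as.numeric(log(x)))))
```

Components estimated on the log scale can be mapped back with `exp()`, which turns the additive pieces into multiplicative factors.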
These plots created with the STL decomposition method allow us to
observe some of the trends and patterns of our time series data.
Overall, these plots generated by the STL decomposition provide similar
findings to those created in the previous classical decomposition
method. We see a similar overall trend, with various peaks and falls
over the course of the data being collected from 2007 to 2019 by monthly
averages of the home sale prices.
We also see a similar seasonal pattern that repeats each year, which
again supports annual seasonality. As in the classical decomposition,
there is a notable peak at the start of the series around 2007 to 2008,
which is the highest monthly average sale price in the entire data set.
The trend curve also shows the same gradual increase in average home
sale prices from 2007 to 2019. The patterns found by the STL
decomposition therefore agree with what we observed previously.
Training and Testing Data Sets
Next, let’s create training and testing data sets to evaluate
forecasts of the time series.
We will hold out the last seven observations as the testing data set
and try four different sizes for the training data set, labeled n = 100,
n = 70, n = 50, and n = 35. Each training set ends at the observation
just before the testing period. Note that the index ranges in the code
below, which start at observations 43, 73, 93, and 108, each contain one
more observation than the label (101, 71, 51, and 36).
n0 = length(home.ts)
# The testing data set will be the last 7 observations.
test.data = home.ts[(n0-6):n0]
# First index of each training window; every window ends at n0 - 7.
first = c(43, 73, 93, 108)
# Rebuild each window as a monthly ts. Subsetting with [ ] drops the time
# index, so the start time is read from home.ts rather than typed in by hand.
train1.ts = ts(home.ts[first[1]:(n0-7)], frequency = 12, start = time(home.ts)[first[1]])
train2.ts = ts(home.ts[first[2]:(n0-7)], frequency = 12, start = time(home.ts)[first[2]])
train3.ts = ts(home.ts[first[3]:(n0-7)], frequency = 12, start = time(home.ts)[first[3]])
train4.ts = ts(home.ts[first[4]:(n0-7)], frequency = 12, start = time(home.ts)[first[4]])
# STL decomposition of each training series.
stl1 = stl(train1.ts, s.window = 12)
stl2 = stl(train2.ts, s.window = 12)
stl3 = stl(train3.ts, s.window = 12)
stl4 = stl(train4.ts, s.window = 12)
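An alternative way to cut training and testing sets is `window()`, which keeps the dates attached to each piece. The sketch below is a base R illustration on a placeholder series; `split_train` is a hypothetical helper written for this example, not a function from any package, and the series values are not real prices.

```r
# Placeholder monthly series: 150 values starting February 2007 (not real data).
x <- ts(seq_len(150), start = c(2007, 2), frequency = 12)
n <- length(x)
h <- 7   # size of the testing set

# Keep the last n.train observations before the testing period, with dates.
# split_train is a hypothetical helper, not part of any package.
split_train <- function(series, n.train, h) {
  last  <- length(series) - h
  first <- last - n.train + 1
  window(series, start = time(series)[first], end = time(series)[last])
}

train <- split_train(x, n.train = 100, h = h)
test  <- window(x, start = time(x)[n - h + 1])

print(length(train)); print(start(train)); print(end(train))
print(length(test));  print(start(test))
```

Writing the split in terms of the desired training size, rather than a starting index, makes it harder to end up with one observation more or fewer than intended.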
Next, let’s create the objects for forecasting.
# Forecasting
fcst1 = forecast(stl1,h = 7, method = "naive")
fcst2 = forecast(stl2,h = 7, method = "naive")
fcst3 = forecast(stl3,h = 7, method = "naive")
fcst4 = forecast(stl4,h = 7, method = "naive")
We will perform error analysis to determine which of the four
training data set sizes performs best. We will look at both the mean
squared error (MSE) and the mean absolute prediction error (MAPE) of the
forecasts on the testing data. Here each prediction error is expressed
as a fraction of the forecast value, so a MAPE of 0.04 means the
forecasts were off by about 4% of the forecast value on average.
PE1 = (test.data-fcst1$mean)/fcst1$mean
PE2 = (test.data-fcst2$mean)/fcst2$mean
PE3 = (test.data-fcst3$mean)/fcst3$mean
PE4 = (test.data-fcst4$mean)/fcst4$mean
# Mean Absolute Prediction Errors.
MAPE1 = mean(abs(PE1))
MAPE2 = mean(abs(PE2))
MAPE3 = mean(abs(PE3))
MAPE4 = mean(abs(PE4))
E1 = test.data-fcst1$mean
E2 = test.data-fcst2$mean
E3 = test.data-fcst3$mean
E4 = test.data-fcst4$mean
# Mean Squared Errors.
MSE1 = mean(E1^2)
MSE2 = mean(E2^2)
MSE3 = mean(E3^2)
MSE4 = mean(E4^2)
MSE = c(MSE1, MSE2, MSE3, MSE4)
MAPE = c(MAPE1, MAPE2, MAPE3, MAPE4)
accuracy = cbind(MSE = MSE, MAPE = MAPE)
row.names(accuracy) = c("n = 100", "n = 70", "n = 50", "n = 35")
kable(accuracy, caption = "Error Comparison of the Forecast Results with Different Training Data Set Sizes")
Error Comparison of the Forecast Results with Different Training Data Set Sizes

|         |        MSE |      MAPE |
|:--------|-----------:|----------:|
| n = 100 | 1206113411 | 0.0421218 |
| n = 70  | 1246521262 | 0.0443203 |
| n = 50  | 1365242537 | 0.0486566 |
| n = 35  | 1292631907 | 0.0484141 |
As we can see, the training data set with a size of n = 100 has the
lowest of both the MSE and the MAPE out of the four potential training
data set sizes that were looked at. This suggests that this size is the
most ideal out of the potential options as it reduces the errors the
most out of the four options that were considered.
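The two error measures can be illustrated on a pair of toy numbers. The sketch below uses made-up actual and forecast values, and shows the prediction error relative to the forecast (the definition used in this report) next to the more common percentage error relative to the actual value.

```r
# Toy numbers (made up) standing in for actual and forecast prices.
actual <- c(100, 200)
fcst   <- c(110, 180)

errors <- actual - fcst

# Mean squared error: average of the squared forecast errors.
mse <- mean(errors^2)

# Mean absolute prediction error relative to the forecast, as in this report.
mape.fcst <- mean(abs(errors / fcst))

# The more common MAPE divides by the actual values instead.
mape.actual <- mean(abs(errors / actual))

print(mse)          # 250
print(mape.actual)  # 0.1
print(mape.fcst)
```

The two versions of the MAPE are usually close when forecasts are reasonably accurate, but they are not identical, so it helps to state which denominator is used when reporting the value.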
We will create a visualization to compare these forecast errors and
confirm the finding that a sample size of 100 observations reduces the
errors most out of the potential sample sizes that were considered for
the training data set of the time series.
First, let’s look at a plot of the mean squared errors (MSE) for each
of the four potential sample sizes.
# Plot of the MSE.
plot(1:4, MSE, type = "b", col="darkorchid", ylab = "Errors", xlab = "Sample Size",
main = "MSE", axes = FALSE)
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
axis(1, at = 1:4, label = labs)
axis(2)

As we can see, n = 100 observations has the lowest MSE out of the
four potential sample sizes, indicating that it is the ideal choice,
because it reduces the errors.
We will also look at the plot of the mean absolute prediction errors
(MAPE) for each of the four potential training data set sizes.
# Plot of the MAPE.
plot(1:4, MAPE, type = "b", col="darkorchid", ylab = "Errors", xlab = "Sample Size",
main = "MAPE", axes = FALSE)
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
axis(1, at = 1:4, label = labs)
axis(2)

Once again, the sample size of n = 100 observations shows the lowest
value of the MAPE, showing that it is the best choice out of the four
sample sizes, because it gives the smallest errors.
As both plots show, n = 100 observations gives the lowest MSE and
MAPE. The errors rise as the training set shrinks to 70 and then 50
observations, although the smallest set (n = 35) does slightly better
than n = 50 on both measures. The differences are modest: the MAPE
ranges from about 4.2% to 4.9%. Among the four options considered, a
training data set of 100 observations yields the best performance.
So, we can conclude that from the choices of a training data set with
100, 70, 50, or 35 observations, the best choice is the training data
set with 100 observations. This choice of 100 observations results in
the lowest errors, and therefore, provides the best performance for
forecasting our time series.
Forecasting the Upcoming Months
We will now forecast the months that follow the end of the data
collection, using an exponential smoothing method to forecast twelve
months of monthly average home sale prices. The data run through July
2019. The forecast output below is labeled July 2019 through June 2020,
because the time index of home.ts was set to start in January 2007 while
the first observation is from February 2007. Each label is therefore one
month early, and the twelve forecasts actually correspond to August 2019
through July 2020.
We stated previously that the seasonality in this time series
appeared to be multiplicative, since the seasonal variation seemed to
increase over time along with the gradual increase in monthly average
home sale prices. We will reflect this in the exponential smoothing
model by using the "MAM" specification, which has multiplicative errors,
an additive trend, and multiplicative seasonality.
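# model = "MAM": multiplicative error (M), additive trend (A), multiplicative seasonality (M).
ets <- ets(home.ts, model = "MAM")
# Forecast the next 12 months from the fitted model.
forecast <- forecast(ets, h = 12)
plot(forecast, main = "Forecast of the Average Home Sale Prices \n for the Next 12 Months")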

As we can see, this plot illustrates the forecast for the next twelve
months after the end of the time series. This forecast provides an
estimate for what the average home sale prices will look like for the
next twelve months.
We can see what the specific estimated home sale prices are for each
of the next twelve months after July 2019.
forecast$mean
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec
2019                                                       633256.4 669327.5 658924.8 691939.7 671194.7 624725.1
2020 648920.1 671477.6 728994.7 658684.8 665938.5 660994.9
We can see that the highest forecasted price is for the month labeled
March 2020, with a forecasted average home sale price of $728,994.70.
The lowest forecasted price is for the month labeled December 2019, with
a forecasted average home sale price of $624,725.10.
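How the forecast months are dated depends only on the time index of the series that was modeled. The sketch below illustrates this with base R's `HoltWinters()` rather than `ets()`, which is a related but different exponential smoothing implementation swapped in here so the example needs no extra packages. The series is synthetic, and the smoothing parameters are arbitrary fixed values chosen only to keep the sketch deterministic.

```r
# Synthetic monthly series (made up): 150 values starting February 2007,
# so the last observation is July 2019.
months <- 0:149
x <- ts((400000 + 2000 * months) * (1 + 0.08 * sin(2 * pi * months / 12)),
        start = c(2007, 2), frequency = 12)

# Holt-Winters smoothing with additive trend and multiplicative seasonality.
# The smoothing parameters are fixed at arbitrary values for this sketch;
# ets() would estimate its own parameters from the data.
fit <- HoltWinters(x, alpha = 0.3, beta = 0.1, gamma = 0.2,
                   seasonal = "multiplicative")

# Twelve-step-ahead forecasts, dated from the month after the last observation.
pred <- predict(fit, n.ahead = 12)
print(start(pred))
print(end(pred))
```

With the index starting in February 2007, the twelve forecasts run from August 2019 through July 2020, the twelve months that follow a series ending in July 2019.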
Conclusion
In this project, we created a time series of the monthly average home
sale prices collected from 2007 to 2019 and analyzed its trends and
patterns. One notable finding was a seasonal pattern that appeared to
repeat annually, with prices moving through highs and lows over the
course of each year. Both the classical decomposition and the STL
decomposition showed this seasonal pattern.
Another notable finding in this time series project was that there
appeared to be an overall increase in the average home sale prices from
the beginning of the time series data collection in 2007 to the end of
the data collection in 2019. It appeared that home sale prices gradually
increased over the course of this 12 year period of the data collection.
The time series plots as well as the decomposition plots for the time
series showed evidence of this increase in the average home sale prices
from 2007 to 2019.
Additionally, we compared four training data set sizes, n = 100, 70,
50, and 35 observations, to determine which would be best to use. The
size of n = 100 observations was the best of the four, because it
produced the lowest MSE and the lowest MAPE on the testing data.
We also used an exponential smoothing method to forecast the monthly
average home sale price for the twelve months after the end of the time
series data collection.
Recommendations
Some recommendations I would make for future projects include:
There appeared to be an annual seasonality in this time series of
house sale prices from 2007 to 2019. Some potential explanations were
given for this seasonal pattern, such as more individuals buying houses
in the summer, leading to an increase in sale prices. To strengthen the
evidence, future projects could check whether this pattern is common in
other records of house sale prices.
Future projects could try additional training data set sizes to see
whether another size works better than the four tried here. We found
that n = 100 observations gave the lowest errors of the four, but a size
we did not test might reduce the errors further.
Overall, this time series project provided insight into the patterns
and trends in monthly average home sale prices from 2007 to 2019. The
seasonality and trend that were observed show that home sale prices go
through cycles of change, with various highs and lows over time.
---
title: "House Sale Prices from 2007 to 2019: Time Series Forecasting with Decomposition"
author: "Josie Gallop"
date: "2024-11-18"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("lubridate")) {
   install.packages("lubridate")
   library(lubridate)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("forecast")) {
   install.packages("forecast")
   library(forecast)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will build a time series, decompose it, and use the decomposition for forecasting. The data set I found for this project records the sale prices of homes sold in the United States from 2007 to 2019, a span of a little over 12 years.

We will aggregate these sales into a monthly time series by computing the average home sale price for each month from 2007 to 2019. We will then decompose the monthly series to look for trend and seasonal patterns, and use the decomposition to forecast future prices.



## Data Description

I found this data set on kaggle.com on the following webpage:
https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv

This data set looks at the sale prices for homes that have been recorded from 2007 to 2019. The first observation was collected on February 6, 2007, and the final observation in the data set was collected on July 26, 2019. The sale prices are given in US dollars, and were collected to look at the trends amongst the sale prices of homes in this particular region. 

We will read in the data set from GitHub and call it "home".

```{r}
home <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/raw_sales.csv")
```

The original data set contains several variables, including the postcode of the house, the property type, and the number of bedrooms. For this project we only need two of them: the sale price and the date on which the house was sold. We will average the sale prices by month to create a monthly time series and examine it for notable patterns.


## Variables

The two variables which we are interested in looking at for this time series project include:

* datesold: The date on which the house was sold, given in the format "year-month-day time", for example "2007-02-07 00:00:00". It identifies when each sale occurred. We will reduce this variable to the year and month of the sale in order to create a monthly time series.

* price: The price for which the house was sold, given in US dollars. This is the quantitative variable whose trends we want to observe over time. We will replace it with the average sale price for each month of each year in order to create a monthly time series.

We will create a monthly time series and analyze the trends and patterns in the average home sale price by month from 2007 to 2019.




# Exploratory Data Analysis

Before we create our time series object, let's ensure that there are no missing values in the original data set, "home". 

```{r}
colSums(is.na(home))
```

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with creating our time series object. 



## Time Series Data Preparation

We will adjust the data to create our monthly time series. This will allow us to analyze the monthly trends in home sale prices over the years of which this data was collected. 

We will start by creating a new data set called "home1" with just the variables of datesold and price. 

```{r}
# Keep only the sale date and sale price columns.
home1 <- home[, c("datesold", "price")]
```

The monthly series will hold the average home sale price for each month covered by the data set.

The datesold variable is given in the format "Year-Month-Day hour:minute:second". While scrolling through the data set entries, I noticed that the time portion of every entry was 00:00:00. Since this constant carries no time-of-day information, we will drop it, which makes the variable easier to interpret and easier to aggregate by month. This leaves a "Year-Month-Day" format for the entries of the datesold variable.

```{r}
home1$datesold <- as.Date(ymd_hms(home1$datesold))
```
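
Rather than relying on scrolling, the midnight timestamps can also be checked in code before the time component is discarded. The following is only a minimal sketch: it assumes the timestamps are character strings in the "YYYY-MM-DD HH:MM:SS" layout described above, and the vector `stamps` is a made-up stand-in for `home1$datesold`.

```r
# Hypothetical stand-in for home1$datesold (illustrative values only).
stamps <- c("2007-02-07 00:00:00", "2007-02-27 00:00:00", "2019-07-26 00:00:00")

# Characters 12 to 19 hold the "HH:MM:SS" part of each timestamp.
all.midnight <- all(substr(stamps, 12, 19) == "00:00:00")

# Parse only the leading "YYYY-MM-DD" part as a Date.
dates <- as.Date(substr(stamps, 1, 10), format = "%Y-%m-%d")
```

If `all.midnight` came back `FALSE` on the real data, the time component would carry information and dropping it would deserve a second look.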

Next, we will calculate the average home sale price per month in order to create a data set which we can use to build a monthly time series. The variable datesold will now be given in the format "Year-Month", and the price variable will represent the average home sale price for that specific month. We will create a new data set called "home2" to store this updated version of the data. 

```{r}
home2 <- home1 %>%
  mutate(datesold = format(datesold, "%Y-%m")) %>%
  group_by(datesold) %>%
  summarise(price = mean(price, na.rm = TRUE))
```

The revised data set contains exactly 150 observations, one monthly average for each of the 150 months from February 2007 through July 2019. No months are missing, so we do not need to drop or fill in any observations, and 150 monthly values give us about twelve and a half years of data to work with. 
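
Because `ts()` assumes consecutive, evenly spaced observations, a month with no recorded sales would silently shift every later value. The sketch below shows one way to check for such gaps. The vector `ym` is a made-up stand-in for `home2$datesold`, with one month removed on purpose so the check has something to find.

```r
# Hypothetical stand-in for home2$datesold: "YYYY-MM" labels, one per row.
ym <- c("2007-02", "2007-03", "2007-04", "2007-06")  # "2007-05" is deliberately missing

# Build every month between the first and last label.
first <- as.Date(paste0(min(ym), "-01"))
last  <- as.Date(paste0(max(ym), "-01"))
expected <- format(seq(first, last, by = "month"), "%Y-%m")

# Months with no recorded sales; an empty result means the series is gap-free.
missing.months <- setdiff(expected, ym)
```

On a complete series, `missing.months` would be an empty character vector.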



## Time Series Plot

Now, let's create a time series object from our data and plot it to help visualize any potential trends that may be occurring within this time series object. We will call this time series object "home.ts". Since we are creating a monthly time series looking at the average home sale prices by month, our frequency will be 12. 

```{r}
# The first monthly average is for February 2007, so the series starts in period 2 of 2007.
home.ts <- ts(home2$price, start = c(2007, 2), frequency = 12)
plot(home.ts, main="Monthly Average Home Sale Prices from 2007 to 2019", xlab = "Year", ylab = "Price")
```

Looking at the time series plot, several patterns stand out. There appears to be some seasonality, seen in the peaks and drops which seem to repeat after a fixed period of time. We will look further into this potential seasonality with the decomposition methods. 

Additionally, there appears to be a major peak and drop around the start of the time series at around 2007 to 2008. This reaches much higher than any other portion of the time series and is evidence of something unusual which could be going on at this point. It is possible that there was some major event occurring within the housing market at this point which could have led to a massive rise and then a sudden fall in the average sale prices of homes at the time.

Also, home sale prices appear to have risen gradually from the start of the series in 2007 to its end in 2019. The series could follow either an additive or a multiplicative structure. Here a multiplicative structure looks more plausible, because the seasonal swings appear to grow as the level of the series rises rather than staying the same size from year to year. 
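
The visual impression that the seasonal swings grow with the level of the series can be backed up with a rough numerical check. The sketch below is an illustration only: it uses the built-in `AirPassengers` series as a stand-in for `home.ts`, and the correlation it computes is an informal diagnostic rather than a formal test.

```r
# AirPassengers is a built-in monthly series, used here only as a stand-in for home.ts.
x <- AirPassengers

# Level and spread of the series within each block of twelve months.
yearly.mean  <- aggregate(x, nfrequency = 1, FUN = mean)
yearly.range <- aggregate(x, nfrequency = 1, FUN = function(v) diff(range(v)))

# A strong positive correlation suggests the swings scale with the level
# (multiplicative behaviour); a value near zero is consistent with an additive structure.
level.spread.cor <- cor(as.numeric(yearly.mean), as.numeric(yearly.range))
```

If the correlation is high, taking logs of the series or using a multiplicative model are the usual ways to account for it.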



# Forecasting with Classical Decomposing

First, let's begin with looking at a classical decomposition of our time series data. 

```{r}
# Classical Decomposition
cls.decomp = decompose(home.ts)
par(mar = c(2,2,2,2))
plot(cls.decomp, xlab = "Year")
```

As we can see in our plot, most of the house prices range between $300,000 and $800,000. 

Something interesting happens at the beginning of the series. There is a very sharp peak around 2007 to 2008, much higher than anything around it. This is the highest value in the entire series, reaching about $800,000. This period coincides with the financial crisis of 2007 to 2008, which marked a major economic recession. It would make sense for a crisis of this size to affect the housing market as well, so it is a plausible explanation for the sharp and sudden peak in average sale prices around that time. 

There is also evidence of seasonality in the data, seen in the regular peaks and drops in the plot. The seasonal pattern repeats every year, with one set of peaks and troughs occurring within each twelve-month period. 

A seasonal pattern would make sense for house sale prices, since the time of year can affect how many people are looking to buy, and prices can rise to meet greater demand. For instance, summer may bring more buyers because of the weather and because families with children may want to move before the next school year begins. This greater demand could push sale prices up and is one potential explanation for the seasonal pattern we see. 

Additionally, the trend curve shows an overall, gradual increase in home sale prices from the start of the data collection in 2007 to its end in 2019. 
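
For readers curious where the trend curve comes from, the sketch below reproduces it by hand. In a monthly series, `decompose()` estimates the trend with a centred moving average over thirteen months in which the two end months get half weight. The built-in `AirPassengers` series is used as a stand-in for `home.ts`.

```r
x <- AirPassengers  # built-in monthly series, a stand-in for home.ts

# Centred 2x12 moving average: half weights on the two end months.
w <- c(0.5, rep(1, 11), 0.5) / 12
trend.manual <- stats::filter(x, w, sides = 2)

# Trend reported by decompose() for comparison.
trend.decomp <- decompose(x)$trend

# The first and last six months have no trend estimate.
n.missing <- sum(is.na(trend.manual))
```

The missing values at both ends are one limitation of the classical method: the trend cannot be estimated for the first and last six months of the series.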




# Forecasting with STL Decomposing 

Next, let's use STL decomposition on our time series data.

```{r}
# STL Decomposition
stl.decomp = stl(home.ts,  s.window = "periodic")
par(mar = c(2,2,2,2))
plot(stl.decomp)
```

STL decomposition offers several advantages over the classical decomposition used above. It uses locally estimated scatterplot smoothing (LOESS) to estimate the trend and seasonal components, so it can follow non-linear trends, it produces trend estimates for every month including the first and last few, and it can optionally down-weight outliers. Note that with `s.window = "periodic"`, as used here, the seasonal pattern is held fixed across years, just as in the classical method. 
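
The effect of the `s.window` argument can be seen by comparing a fixed and an evolving seasonal component. This is a sketch under assumptions: the built-in `AirPassengers` series (on a log scale) stands in for `home.ts`, and the window of 13 is just one reasonable choice, not a recommendation for this data.

```r
x <- log(AirPassengers)  # built-in series on a log scale, a stand-in for home.ts

# Fixed seasonal pattern: the same twelve monthly effects are reused every year.
fit.fixed  <- stl(x, s.window = "periodic")
seas.fixed <- fit.fixed$time.series[, "seasonal"]

# Evolving seasonal pattern: monthly effects are smoothed across years,
# and robust = TRUE down-weights unusual observations.
fit.evolving  <- stl(x, s.window = 13, robust = TRUE)
seas.evolving <- fit.evolving$time.series[, "seasonal"]

# Compare the first and last year of each seasonal component.
fixed.change    <- max(abs(seas.fixed[1:12] - seas.fixed[133:144]))
evolving.change <- max(abs(seas.evolving[1:12] - seas.evolving[133:144]))
```

With `"periodic"` the first and last year are identical, while a numeric window lets the seasonal shape drift over time. The robust option may be worth considering for a series with an unusual spike like the one around 2007 to 2008.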

The STL plots lead to findings similar to those from the classical decomposition. 

We again see a seasonal pattern that repeats each year, which matches the idea of annual seasonality. We also again see a notable peak near the start of the series, around 2007 to 2008, which is the highest monthly average sale price in the entire data set. Finally, the trend curve shows the same gradual increase in average home sale prices from 2007 to 2019. 

Since the two decomposition methods agree on these patterns, we can be more confident that they are real features of the data rather than artifacts of one particular method. 




# Training and Testing Data Sets

Next, let's look into creating a training and a testing data set to forecast the time series model. 

We will use the last seven observations as our testing data set, and we will try four candidate sizes for the training data set: n = 100, n = 70, n = 50, and n = 35. In each case the training data set consists of the observations immediately before the testing data set. 


```{r}
n0 = length(home.ts)
tt = time(home.ts)   # time index of the full series

train.data1 = home.ts[43:(n0-7)]
train.data2 = home.ts[73:(n0-7)]
train.data3 = home.ts[93:(n0-7)]
train.data4 = home.ts[108:(n0-7)]

# The testing data set will be the last 7 observations.
test.data = home.ts[(n0-6):n0]

# Creating the four potential training data sets as time series objects.
# Each start time is read from the full series at the first index of the slice,
# so the month labels line up with the original data.
train1.ts = ts(train.data1, frequency = 12, start = tt[43])
train2.ts = ts(train.data2, frequency = 12, start = tt[73])
train3.ts = ts(train.data3, frequency = 12, start = tt[93])
train4.ts = ts(train.data4, frequency = 12, start = tt[108])

stl1 = stl(train1.ts, s.window = 12)
stl2 = stl(train2.ts, s.window = 12)
stl3 = stl(train3.ts, s.window = 12)
stl4 = stl(train4.ts, s.window = 12)
```
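
One detail worth knowing about the split above is that indexing a `ts` object with square brackets returns a plain vector without dates. The sketch below shows `window()` as an alternative that keeps the time index. It uses the built-in `AirPassengers` series as a stand-in for `home.ts`, and the sizes are chosen only for illustration.

```r
x <- AirPassengers  # built-in monthly series (Jan 1949 to Dec 1960), a stand-in for home.ts
n <- length(x)

# Plain indexing returns a bare vector and drops the time index.
test.plain <- x[(n - 6):n]

# window() keeps the time index, so start dates do not have to be typed by hand.
tt <- time(x)
train.ts <- window(x, start = tt[n - 106], end = tt[n - 7])  # 100 observations
test.ts  <- window(x, start = tt[n - 6])                      # last 7 observations
```

Both approaches select the same values. The difference is only in whether the month labels travel with them.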

Next, let's create the objects for forecasting.

```{r}
# Forecasting
fcst1 = forecast(stl1,h = 7, method = "naive")
fcst2 = forecast(stl2,h = 7, method = "naive")
fcst3 = forecast(stl3,h = 7, method = "naive")
fcst4 = forecast(stl4,h = 7, method = "naive")
```
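
To make the forecasting step less of a black box, the sketch below builds the same kind of point forecast by hand using only base R. To my understanding, a naive forecast from an STL fit repeats the last seasonally adjusted value and adds back the seasonal effects from the most recent year, but this is a simplified illustration and not the forecast package's actual implementation. It does not reproduce prediction intervals. The built-in `AirPassengers` series stands in for a training set.

```r
x <- log(AirPassengers)  # built-in monthly series, a stand-in for a training set
h <- 7                   # forecast horizon, matching the seven test months

fit  <- stl(x, s.window = 13)
seas <- fit$time.series[, "seasonal"]

# Seasonally adjusted series: the data with the seasonal component removed.
adjusted <- x - seas

# Naive forecast of the adjusted series: repeat its last value.
adjusted.fc <- rep(as.numeric(tail(adjusted, 1)), h)

# Seasonal forecast: reuse the seasonal effects from the last twelve months.
seasonal.fc <- rep(as.numeric(tail(seas, 12)), length.out = h)

# Combined point forecasts for the next h months.
point.fc <- adjusted.fc + seasonal.fc
```

Seen this way, the forecasts differ from one another only through the seasonal effects, which is why the quality of the seasonal estimate matters for the comparison that follows.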

We will perform an error analysis to determine which of the four candidate training set sizes works best. We will look at two measures computed on the seven test observations: the mean squared error (MSE), which is the average of the squared differences between the actual and forecast values, and the mean absolute prediction error (MAPE), which here is the average of the absolute differences expressed as a proportion of the forecast value. 

```{r}
PE1 = (test.data-fcst1$mean)/fcst1$mean
PE2 = (test.data-fcst2$mean)/fcst2$mean
PE3 = (test.data-fcst3$mean)/fcst3$mean
PE4 = (test.data-fcst4$mean)/fcst4$mean

# Mean Absolute Prediction Errors.
MAPE1 = mean(abs(PE1))
MAPE2 = mean(abs(PE2))
MAPE3 = mean(abs(PE3))
MAPE4 = mean(abs(PE4))

E1 = test.data-fcst1$mean
E2 = test.data-fcst2$mean
E3 = test.data-fcst3$mean
E4 = test.data-fcst4$mean

# Mean Squared Errors.
MSE1 = mean(E1^2)
MSE2 = mean(E2^2)
MSE3 = mean(E3^2)
MSE4 = mean(E4^2)

MSE = c(MSE1, MSE2, MSE3, MSE4)
MAPE = c(MAPE1, MAPE2, MAPE3, MAPE4)
accuracy = cbind(MSE = MSE, MAPE = MAPE)
row.names(accuracy) = c("n = 100", "n = 70", "n = 50", "n = 35")
kable(accuracy, caption = "Error Comparison of the Forecast Results with Different Training Data Set Sizes")
```

As we can see, the training data set with a size of n = 100 has the lowest MSE and the lowest MAPE of the four candidates. This suggests that it is the best of the options considered. 
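
The two error measures can also be wrapped in small helper functions, which makes the scaling choice explicit. This is a sketch, and the function names `mse` and `mape` and the argument `relative.to` are my own. The report divides by the forecast value, whereas the more common mean absolute percentage error divides by the actual value, so the helper supports both.

```r
# Mean squared error of the forecasts.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Mean absolute error relative to a chosen scale:
# relative.to = "forecast" mirrors the calculation in this report,
# relative.to = "actual" gives the usual mean absolute percentage error.
mape <- function(actual, predicted, relative.to = c("forecast", "actual")) {
  relative.to <- match.arg(relative.to)
  denom <- if (relative.to == "forecast") predicted else actual
  mean(abs((actual - predicted) / denom))
}

# Made-up numbers, for illustration only.
actual    <- c(100, 200)
predicted <- c(110, 180)
mse(actual, predicted)                           # (10^2 + 20^2) / 2 = 250
mape(actual, predicted, relative.to = "actual")  # (10/100 + 20/200) / 2 = 0.1
```

The two scalings usually rank models the same way, but the values differ slightly, so it helps to state which one is being reported.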

We will create a visualization to compare these forecast errors and confirm the finding that a sample size of 100 observations reduces the errors most out of the potential sample sizes that were considered for the training data set of the time series.

First, let's look at a plot of the mean squared errors (MSE) for each of the four potential sample sizes. 

```{r}
# Plot of the MSE.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
plot(1:4, MSE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MSE", axes = FALSE)
axis(1, at = 1:4, labels = labs)
axis(2)
```

As we can see, n = 100 observations has the lowest MSE of the four candidate sizes. 

We will also look at the plot of the mean absolute prediction errors (MAPE) for each of the four potential training data set sizes. 

```{r}
# Plot of the MAPE.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
plot(1:4, MAPE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MAPE", axes = FALSE)
axis(1, at = 1:4, labels = labs)
axis(2)
```

Once again, the sample size of n = 100 observations shows the lowest value, this time for the MAPE. 

Both plots tell the same story: n = 100 has the lowest errors, and as the training data set shrinks, both the MSE and the MAPE increase, reaching values much greater than those for 100 observations. 

So, we can conclude that among training data sets of 100, 70, 50, or 35 observations, the one with 100 observations is the best choice, since it gives the lowest forecast errors on the test data. 



# Forecasting the Upcoming Months

We will use forecasting to provide an estimate for the upcoming months after the end of the time series data collection. This time series collected data up until July 2019, so we will forecast the estimated home sale prices for the months which follow the end of this data collection.

We will use an exponential smoothing method for this forecasting, producing forecasts of the monthly average home sale price for the next twelve months. Since the series ends in July 2019, the forecasts cover August 2019 through July 2020.

We stated previously that the series appears to follow a multiplicative structure: the seasonal variation seemed to grow over time along with the gradual increase in monthly average sale prices. We reflect this in the exponential smoothing model by choosing `model = "MAM"`, which specifies multiplicative errors, an additive trend, and multiplicative seasonality. 

```{r}
ets <- ets(home.ts, model = "MAM")
forecast <- forecast(ets, h = 12)
plot(forecast, main = "Forecast of the Average Home Sale Prices \n for the Next 12 Months")
```

As we can see, this plot illustrates the forecast for the next twelve months after the end of the time series. This forecast provides an estimate for what the average home sale prices will look like for the next twelve months. 
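
As a rough cross-check that needs only base R, a Holt-Winters model with multiplicative seasonality can be fitted in a similar spirit. This is a related but different technique from `ets()`, so its forecasts should not be expected to match exactly. The sketch uses the built-in `AirPassengers` series as a stand-in for `home.ts`.

```r
x <- AirPassengers  # built-in monthly series, a stand-in for home.ts

# Holt-Winters smoothing with an additive trend and multiplicative seasonality.
hw.fit <- HoltWinters(x, seasonal = "multiplicative")

# Point forecasts for the twelve months after the end of the series.
hw.fc <- predict(hw.fit, n.ahead = 12)
```

If two such methods give broadly similar forecasts, that is some reassurance that the result is not driven by one particular model choice.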

We can see what the specific estimated home sale prices are for each of the next twelve months after July 2019. 

```{r}
forecast$mean 
```

We can see that the month with the highest forecasted price is March 2020 with a forecasted average home sale price of $728,994.70. The month with the lowest forecasted price is December 2019 with a forecasted average home sale price of $624,725.10. 
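
Instead of reading the highest and lowest months off the printed output, they can be looked up in code. The sketch below uses made-up forecast values for illustration only, and `month.label` is a hypothetical helper of my own. In the report, the series would be `forecast$mean`.

```r
# Made-up forecast values, for illustration only.
fc <- ts(c(5, 7, 9, 6, 4, 2, 3, 8, 6, 5, 4, 3),
         start = c(2019, 8), frequency = 12)

# Turn a position in the series into a "Mon YYYY" label.
month.label <- function(series, i) {
  yr <- floor(time(series)[i] + 1e-8)  # small offset guards against rounding just below a year
  paste(month.abb[cycle(series)[i]], yr)
}

highest <- month.label(fc, which.max(fc))
lowest  <- month.label(fc, which.min(fc))
```

This avoids misreading a month when the printed table wraps across calendar years.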



# Conclusion

In this project, we created a time series of the monthly average home sale prices collected from 2007 to 2019 and analyzed its trends and patterns. One notable finding was a seasonal pattern that appeared to repeat annually, with sale prices going through highs and lows over the course of each year. Both the classical decomposition and the STL decomposition confirmed the presence of this seasonal pattern. 

Another notable finding was an overall increase in the average home sale prices from the beginning of the data collection in 2007 to its end in 2019. Prices rose gradually over this 12-year period, as seen in both the time series plot and the trend curves of the decomposition plots.

Additionally, we compared four potential sizes for the training data set: n = 100, 70, 50, and 35 observations. The size of n = 100 observations was the best of these four options, as it had the lowest values for both the MSE and the MAPE on the test data. 

We also used an exponential smoothing method to forecast the average home sale prices for the twelve months following the end of the series. 


## Recommendations

Some recommendations I would make for future projects include:

* The series appeared to show annual seasonality in house sale prices over the span of 2007 to 2019. Some potential explanations were offered, such as more people buying houses in the summer, leading to an increase in sale prices. Future projects could investigate whether this is indeed a common pattern in records of house sale prices. 

* Future projects could try additional training data set sizes. We compared four sizes and found that n = 100 observations gave the lowest errors, but another size that we did not try might give even better accuracy.  
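
As a starting point for the second recommendation, the comparison of training set sizes can be written as a loop over many candidates. This is only a sketch under assumptions: it uses the built-in `AirPassengers` series (on a log scale) as a stand-in for `home.ts`, and `stl.naive` is a hypothetical helper that builds a simple naive forecast from an STL fit in base R, not the forecast package's own method.

```r
x <- log(AirPassengers)   # built-in monthly series, a stand-in for home.ts
h <- 7                    # number of test observations
n <- length(x)
tt <- time(x)
test <- as.numeric(x[(n - h + 1):n])

# Hypothetical helper: last seasonally adjusted value plus last year's seasonal effects.
stl.naive <- function(train, h) {
  seas <- stl(train, s.window = 13)$time.series[, "seasonal"]
  as.numeric(tail(train - seas, 1)) + rep(as.numeric(tail(seas, 12)), length.out = h)
}

# Candidate training set sizes, each ending right before the test set.
sizes <- seq(36, 132, by = 12)
mse.by.size <- sapply(sizes, function(k) {
  train <- window(x, start = tt[n - h - k + 1], end = tt[n - h])
  mean((test - stl.naive(train, h))^2)
})
names(mse.by.size) <- paste("n =", sizes)
```

Plotting `mse.by.size` against `sizes` would show whether the error keeps falling as more history is included or levels off at some point.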


Overall, this time series project provided insight into some of the patterns and trends in the monthly average home sale prices from 2007 to 2019. The seasonality and trends which were observed show that home sale prices go through cycles of change, with various highs and lows over time. 



# References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set. 

Holdings, H., & James, T. (2019, August 12). House Property Sales Time Series. Kaggle. https://www.kaggle.com/datasets/htagholdings/property-sales/data 
