1 Introduction

For this project, we will build a monthly time series, decompose it, and use the decomposition for forecasting. The data set I found for this project records the sale prices of homes sold in the United States from 2007 to 2019, a span of roughly 12 years.

From these records we will compute the average home sale price for each month from 2007 to 2019. The resulting monthly series lets us look for trend and seasonal patterns, which we will separate out through decomposition and then use as the basis for forecasting.

1.1 Data Description

I found this data set on kaggle.com on the following webpage: https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv

This data set looks at the sale prices for homes that have been recorded from 2007 to 2019. The first observation was collected on February 6, 2007, and the final observation in the data set was collected on July 26, 2019. The sale prices are given in US dollars, and were collected to look at the trends amongst the sale prices of homes in this particular region.

We will read the data set in from GitHub and call it “home”.

home <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/raw_sales.csv")

The original data set also records each home’s postcode, property type, and number of bedrooms. For this project, however, we only need two variables: the sale price and the date on which the home was sold. We will average the sale prices within each month to build the monthly time series that we then analyze for notable patterns.

1.2 Variables

The two variables which we are interested in looking at for this time series project include:

  • datesold: The date on which the house was sold, given in the format “year-month-day time”, for example, “2007-02-07 00:00:00”. This variable identifies when each sale took place. We will reduce it to the year and month of the sale in order to build the monthly time series.

  • price: The price for which the house was sold, given in US dollars. This is the quantitative variable whose behavior over time we want to study. We will replace it with the average sale price for each month of each year in order to build the monthly time series.

We will create a monthly time series and analyze the trends and patterns in the average home sale price by month from 2007 to 2019.

2 Exploratory Data Analysis

Before we create our time series object, let’s ensure that there are no missing values in the original data set, “home”.

colSums(is.na(home))
    datesold     postcode        price propertyType     bedrooms 
           0            0            0            0            0 

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with creating our time series object.

2.1 Time Series Data Preparation

We will reshape the data into a monthly time series so that we can analyze how home sale prices move from month to month over the years covered by the data.

We will start by creating a new data set called “home1” with just the variables of datesold and price.

# Keep only the two variables needed for the time series.
home1 <- home[, c("datesold", "price")]

The monthly time series will hold the average home sale price for each month covered by the data set, which lets us look for trends and patterns in the monthly averages from 2007 to 2019.

The datesold variable stores dates in the format “Year-Month-Day hour:minute:second”. While scrolling through the entries, I noticed that the time portion of every entry is 00:00:00. Since the time is identical for every observation, it is most likely a placeholder rather than a recorded time of sale and carries no information, so we will drop it. This makes the variable easier to interpret and simplifies the monthly averaging.

We start by dropping the time component from the date of each sale, which leaves the entries of datesold in a “Year-Month-Day” format.

home1$datesold <- as.Date(ymd_hms(home1$datesold))

Next, we will calculate the average home sale price per month to build the data set for our monthly time series. The datesold variable will now be given in the format “Year-Month”, and the price variable will hold the average home sale price for that month. We will store this updated version of the data in a new data set called “home2”.

home2 <- home1 %>%
  mutate(datesold = format(datesold, "%Y-%m")) %>%
  group_by(datesold) %>%
  summarise(price = mean(price, na.rm = TRUE))

The revised data set contains exactly 150 monthly averages, which matches the number of months from February 2007 through July 2019, so the series has no gaps and we do not need to drop or fill in any observations. With more than twelve years of monthly data, the series is long enough for seasonal decomposition.
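The check that the monthly averages form an unbroken run can also be done in code. The following is a minimal sketch in base R on a small made-up data frame; the `sales` object and its values are hypothetical and are not the project data. It averages the prices by month and then compares the observed months against a complete monthly sequence.

```r
# Hypothetical toy data: sale dates and prices (not the project data).
sales <- data.frame(
  datesold = as.Date(c("2007-02-07", "2007-02-27", "2007-03-09",
                       "2007-05-02", "2007-05-20")),
  price    = c(500000, 300000, 450000, 600000, 400000)
)

# Average price per calendar month, keyed by a "Year-Month" label.
sales$month <- format(sales$datesold, "%Y-%m")
monthly <- aggregate(price ~ month, data = sales, FUN = mean)

# Build the complete run of months between the first and last observed
# month, then list any month that has no sales at all.
first <- as.Date(paste0(min(monthly$month), "-01"))
last  <- as.Date(paste0(max(monthly$month), "-01"))
all_months <- format(seq(first, last, by = "month"), "%Y-%m")
missing_months <- setdiff(all_months, monthly$month)

monthly
missing_months
```

A ts object assumes equally spaced observations, so an empty `missing_months` is what would justify passing the monthly averages straight to ts(). In this toy example April 2007 has no sales and is reported as missing.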

2.2 Time Series Plot

Now, let’s create a time series object from our data and plot it to help visualize any potential trends that may be occurring within this time series object. We will call this time series object “home.ts”. Since we are creating a monthly time series looking at the average home sale prices by month, our frequency will be 12.

# The first monthly average is for February 2007, so the series starts in period 2 of 2007.
home.ts <- ts(home2$price, start = c(2007, 2), frequency = 12)
plot(home.ts, main="Monthly Average Home Sale Prices from 2007 to 2019", xlab = "Year", ylab = "Price")
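Because the first sale in the data is from February 2007 and the last is from July 2019, the start argument of ts() determines whether the month labels line up with the calendar. As a small sketch, using 150 placeholder values rather than the real averages (the names `feb_start` and `jan_start` are illustrative), the end date implied by each choice of start can be checked with end():

```r
# 150 placeholder values standing in for the monthly averages.
x <- seq_len(150)

# Series labeled as starting in February 2007 (period 2 of 12).
feb_start <- ts(x, start = c(2007, 2), frequency = 12)
# Series labeled as starting in January 2007, for comparison.
jan_start <- ts(x, start = c(2007, 1), frequency = 12)

# Year and month attached to the last observation in each case.
end(feb_start)
end(jan_start)
```

Only the February start places the last of the 150 observations in July 2019; a January start labels it June 2019, and every label in between is one month early.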

Looking at the time series plot, several patterns stand out. There appears to be some seasonality, visible in the peaks and drops that seem to repeat at regular intervals. We will look further into this with the decomposition methods.

Additionally, there is a major peak followed by a drop near the start of the series, around 2007 to 2008. It reaches much higher than any other part of the series, which suggests that something unusual was happening at that point. It is possible that a major event in the housing market led to a sharp rise and then a sudden fall in average sale prices.

Also, home sale prices appear to have gradually increased from the start of the series in 2007 to its end in 2019. The remaining question is whether the seasonal component is additive or multiplicative. Here it looks closer to multiplicative, since the seasonal variation appears to have grown more pronounced over time rather than staying the same size from year to year.

3 Forecasting with Classical Decomposing

First, let’s begin with looking at a classical decomposition of our time series data.

# Classical Decomposition
cls.decomp = decompose(home.ts)
par(mar = c(2,2,2,2))
plot(cls.decomp, xlab = "Year")

As we can see in our plot, most of the monthly average prices range between $300,000 and $800,000.

Something interesting happens at the beginning of the series. There is a very sharp spike around 2007 to 2008, much higher than anything around it. It is the highest value in the entire series, reaching about $800,000. The timing coincides with the financial crisis of 2007 to 2008, which marked a major economic recession. A crisis of that scale would plausibly affect the housing market, so it may explain the sharp and sudden peak, although the plot alone cannot confirm the cause.

The seasonal panel shows a pattern of peaks and drops that repeats every year, with the highest and lowest points falling at the same positions within each twelve-month period. Note that classical decomposition assumes the seasonal pattern is identical from year to year, so this panel repeats exactly by construction.

A seasonal pattern would make sense for house sale prices, because the time of year can affect how many people are looking to buy, and prices may rise to meet higher demand. For instance, summer may bring more buyers because of the good weather, and families with children may want to move during summer break, before the next school year begins. This greater demand could raise sale prices and is one potential explanation for the seasonal pattern in the plot.

Additionally, the trend curve shows an overall, gradual increase in home sale prices from the start of the data in 2007 to its end in 2019.
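Since the seasonal swings in the plot seem to grow with the level of the series, a multiplicative classical decomposition may also be worth examining. The sketch below is only an illustration on the built-in AirPassengers data, not on the home price series; it relies on the type argument of decompose(), which defaults to "additive".

```r
# AirPassengers ships with base R; its seasonal swings grow with the level.
add.decomp  <- decompose(AirPassengers, type = "additive")
mult.decomp <- decompose(AirPassengers, type = "multiplicative")

# Additive seasonal effects are in the units of the data and average to 0;
# multiplicative seasonal effects are ratios that average to 1.
round(add.decomp$figure, 1)
round(mult.decomp$figure, 3)
```

In the multiplicative case, a seasonal figure of, say, 1.2 would mean that the month runs about 20% above the trend, which is easier to compare across years when the overall price level is rising.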

4 Forecasting with STL Decomposing

Next, let’s use STL decomposition on our time series data.

# STL Decomposition
stl.decomp = stl(home.ts,  s.window = "periodic")
par(mar = c(2,2,2,2))
plot(stl.decomp)

STL decomposition offers several advantages over the classical method used above. STL uses locally estimated scatterplot smoothing (LOESS) to estimate the trend, so it can follow non-linear trends, and unlike the moving average used in classical decomposition it provides trend estimates all the way to both ends of the series. With s.window = "periodic", as used here, the seasonal component is held fixed from year to year.
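One practical difference between the two methods is how they treat the ends of the series. As an illustrative sketch on the built-in AirPassengers data (not the home price series), we can count the missing trend values that each method leaves behind; the object names `cls` and `stl.fit` are illustrative.

```r
# Classical decomposition estimates the trend with a centered moving
# average, so the first and last six months have no trend estimate.
cls <- decompose(AirPassengers)
sum(is.na(cls$trend))

# STL estimates the trend with loess and covers the whole series.
stl.fit <- stl(AirPassengers, s.window = "periodic")
sum(is.na(stl.fit$time.series[, "trend"]))
```

The missing values at the end of the classical trend matter for forecasting, because the most recent months are exactly the ones a forecast extends from.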

These plots let us examine the trend and seasonal patterns in the series. Overall, the STL decomposition gives findings similar to those of the classical decomposition: the trend component follows a similar path, with various rises and falls across the monthly averages from 2007 to 2019.

The seasonal component again repeats every year, which is consistent with annual seasonality. As in the classical decomposition, there is a notable peak near the start of the series, around 2007 to 2008, which is the highest average sale price in the entire series. The trend curve also shows the same gradual increase in average home sale prices from 2007 to 2019. These patterns agree with what we observed previously.

5 Training and Testing Data Sets

Next, let’s look into creating a training and a testing data set to forecast the time series model.

We will use the last seven observations as the testing data set and try four different sizes for the training data set: nominally n = 100, n = 70, n = 50, and n = 35 observations, each ending immediately before the testing period. Note that the index ranges in the code below are inclusive at both ends, so with 150 observations each training set actually contains one more observation than its label (101, 71, 51, and 36).

data = home.ts
n0 = length(data)
train.data1 = home.ts[43:(n0-7)]
train.data2 = home.ts[73:(n0-7)]
train.data3 = home.ts[93:(n0-7)]
train.data4 = home.ts[108:(n0-7)]

# The testing data set will be the last 7 observations.
test.data = home.ts[(n0-6):n0]

# Creating the four potential training data set sizes.
# start = c(year, month) of the first observation in each training set,
# counting forward from February 2007 as observation 1.
train1.ts = ts(train.data1, frequency = 12, start = c(2010, 8))
train2.ts = ts(train.data2, frequency = 12, start = c(2013, 2))
train3.ts = ts(train.data3, frequency = 12, start = c(2014, 10))
train4.ts = ts(train.data4, frequency = 12, start = c(2016, 1))

stl1 = stl(train1.ts, s.window = 12)
stl2 = stl(train2.ts, s.window = 12)
stl3 = stl(train3.ts, s.window = 12)
stl4 = stl(train4.ts, s.window = 12)
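The index-based split above returns plain numeric vectors, so the dates have to be re-attached by hand when the training sets are turned back into ts objects. As an alternative sketch, window() keeps the time index attached, so no start value needs to be typed in. This example uses the built-in AirPassengers series as a stand-in for home.ts, and the names `train.ts` and `test.ts` are illustrative.

```r
# Built-in monthly series used as a stand-in for home.ts.
x  <- AirPassengers
n0 <- length(x)

# Hold out the last 7 observations for testing.
test.ts <- window(x, start = time(x)[n0 - 6])

# Keep exactly 100 observations immediately before the test period.
train.ts <- window(x, start = time(x)[n0 - 106], end = time(x)[n0 - 7])

length(train.ts)
start(train.ts)   # the start date is carried over from the original series
```

Counting the endpoints explicitly, as in `n0 - 106` through `n0 - 7`, also makes it easy to confirm that the training set has exactly the intended number of observations.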

Next, let’s create the objects for forecasting.

# Forecasting
fcst1 = forecast(stl1,h = 7, method = "naive")
fcst2 = forecast(stl2,h = 7, method = "naive")
fcst3 = forecast(stl3,h = 7, method = "naive")
fcst4 = forecast(stl4,h = 7, method = "naive")

We will perform an error analysis to determine which of the four training set sizes works best. We will look at both the mean absolute percentage error (MAPE), computed here relative to the forecast values, and the mean squared error (MSE) on the testing data.

PE1 = (test.data-fcst1$mean)/fcst1$mean
PE2 = (test.data-fcst2$mean)/fcst2$mean
PE3 = (test.data-fcst3$mean)/fcst3$mean
PE4 = (test.data-fcst4$mean)/fcst4$mean

# Mean Absolute Prediction Errors.
MAPE1 = mean(abs(PE1))
MAPE2 = mean(abs(PE2))
MAPE3 = mean(abs(PE3))
MAPE4 = mean(abs(PE4))

E1 = test.data-fcst1$mean
E2 = test.data-fcst2$mean
E3 = test.data-fcst3$mean
E4 = test.data-fcst4$mean

# Mean Squared Errors.
MSE1 = mean(E1^2)
MSE2 = mean(E2^2)
MSE3 = mean(E3^2)
MSE4 = mean(E4^2)

MSE = c(MSE1, MSE2, MSE3, MSE4)
MAPE = c(MAPE1, MAPE2, MAPE3, MAPE4)
accuracy = cbind(MSE = MSE, MAPE = MAPE)
row.names(accuracy) = c("n = 100", "n = 70", "n = 50", "n = 35")
kable(accuracy, caption = "Error Comparison of the Forecast Results with Different Training Data Set Sizes")
Error Comparison of the Forecast Results with Different Training Data Set Sizes

             MSE          MAPE
n = 100      1206113411   0.0421218
n = 70       1246521262   0.0443203
n = 50       1365242537   0.0486566
n = 35       1292631907   0.0484141

As the table shows, the training set labeled n = 100 has both the lowest MSE and the lowest MAPE of the four candidates, which suggests it is the best of the options considered.
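To make the two error measures concrete, here is a minimal sketch on made-up numbers; the vectors `actual` and `pred` are hypothetical and are not the project results. It computes the MSE, the percentage error relative to the forecast as in the code above, and the more common MAPE definition that divides by the actual value.

```r
# Hypothetical actual and predicted values (not the project results).
actual <- c(100, 200)
pred   <- c(110, 180)

err <- actual - pred

# Mean squared error, in squared units of the data.
mse <- mean(err^2)

# Mean absolute percentage error relative to the forecast value.
mape_forecast <- mean(abs(err / pred))

# Mean absolute percentage error relative to the actual value,
# which is the more common definition.
mape_actual <- mean(abs(err / actual))

c(MSE = mse, MAPE_forecast = mape_forecast, MAPE_actual = mape_actual)
```

Because the MSE is in squared dollars, its square root is easier to interpret: an MSE of 1206113411 corresponds to a typical forecast error of roughly $34,700.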

We will plot these forecast errors to compare the four candidate training set sizes visually.

First, let’s look at a plot of the mean squared errors (MSE) for each of the four potential sample sizes.

# Plot of the MSE.
plot(1:4, MSE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MSE", axes = FALSE)
# Label the x-axis with the training set sizes.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
axis(1, at = 1:4, labels = labs)
axis(2)

As we can see, n = 100 has the lowest MSE of the four sizes.

We will also look at the plot of the mean absolute prediction errors (MAPE) for each of the four potential training data set sizes.

# Plot of the MAPE.
plot(1:4, MAPE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MAPE", axes = FALSE)
# Label the x-axis with the training set sizes.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
axis(1, at = 1:4, labels = labs)
axis(2)

Once again, n = 100 has the lowest MAPE of the four sizes.

In both plots, n = 100 gives the lowest error. The errors do not grow steadily as the training set shrinks, however: n = 50 has the highest MSE and MAPE, while n = 35 does slightly better than n = 50 on both measures. The differences are also modest, with the MAPE ranging only from about 4.2% to 4.9%.

So, among training sets of 100, 70, 50, or 35 observations, the one with 100 observations gives the lowest errors on the testing data and therefore the best forecasting performance of the options considered.

6 Forecasting the Upcoming Months

We will now forecast the months that follow the end of the data. The data run through July 2019, so we will estimate the average home sale prices for the months after that.

We will use an exponential smoothing method to forecast the next twelve monthly averages. Since the data end in July 2019, these twelve months are August 2019 through July 2020.

We stated previously that the seasonality in this series appeared to be multiplicative, since the seasonal variation seemed to grow over time along with the gradual increase in monthly average prices. We will reflect this in the exponential smoothing model: the model string "MAM" specifies multiplicative errors, an additive trend, and multiplicative seasonality.

# Use object names that do not mask the ets() and forecast() functions.
ets.fit <- ets(home.ts, model = "MAM")
ets.fcst <- forecast(ets.fit, h = 12)
plot(ets.fcst, main = "Forecast of the Average Home Sale Prices \n for the Next 12 Months")

This plot shows the forecast of the average home sale price for the twelve months after the end of the time series.

The forecast for each of the twelve months is listed below. The month labels in this printed output were generated with the series starting in January 2007; because the first monthly average is for February 2007, each value belongs to the month after its label, so the twelve values cover August 2019 through July 2020.

ets.fcst$mean 
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2019                                                       633256.4 669327.5
2020 648920.1 671477.6 728994.7 658684.8 665938.5 660994.9                  
          Sep      Oct      Nov      Dec
2019 658924.8 691939.7 671194.7 624725.1
2020                                    

The highest forecasted value is $728,994.70, labeled March 2020 in the output, and the lowest is $624,725.10, labeled December 2019. Since the first monthly average is for February 2007 while the printed labels count from January 2007, these values correspond to April 2020 and January 2020, respectively.
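Because the month labels of a forecast are inherited from the time index of the series it was fitted to, it can help to confirm where a forecast starts. The sketch below is an illustration on the built-in AirPassengers data, and it swaps in base R's HoltWinters() with multiplicative seasonality instead of forecast::ets(), so it is a related exponential smoothing method rather than the model fitted above. The names `hw.fit` and `hw.fcst` are illustrative.

```r
# Base R Holt-Winters smoothing with multiplicative seasonality, fitted to
# the built-in AirPassengers series (monthly, ending December 1960).
hw.fit  <- HoltWinters(AirPassengers, seasonal = "multiplicative")
hw.fcst <- predict(hw.fit, n.ahead = 12)

# The forecast labels continue from the series' own time index.
end(AirPassengers)
start(hw.fcst)
```

The first forecast is labeled with the month immediately after the last observation, so any offset in the start of the original series carries over to every forecast label.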

7 Conclusion

In this project, we created a time series of monthly average home sale prices from 2007 to 2019 and analyzed its trends and patterns. One notable finding was a seasonal pattern that appeared to repeat annually, with prices moving through highs and lows over the course of each year. Both the classical decomposition and the STL decomposition showed this seasonal pattern.

Another notable finding was an overall increase in average home sale prices from the beginning of the data in 2007 to the end in 2019. Prices rose gradually over this 12-year period, as seen in both the time series plot and the trend components of the decompositions.

Additionally, we compared four training set sizes (n = 100, 70, 50, and 35 observations) to determine which would be best for forecasting. The size n = 100 was the best of the four, with the lowest MSE and the lowest MAPE on the testing data.

We also used exponential smoothing to forecast the average home sale price for the twelve months following the end of the data in July 2019.

7.1 Recommendations

Some recommendations I would make for future projects include:

  • Overall, there appeared to be annual seasonality in the sale prices of houses from 2007 to 2019. We suggested some possible explanations, such as more buyers in the summer leading to an increase in sale prices. Future projects could test whether this pattern is common in other records of house sale prices, which would strengthen the evidence for it.

  • Future projects could try additional training data set sizes. We compared four sizes and found that n = 100 observations gave the lowest errors, but a size we did not try might forecast even more accurately.

Overall, this time series project provided insight into the patterns and trends in monthly average home sale prices from 2007 to 2019. The seasonality and trend we observed show that home sale prices go through recurring cycles of highs and lows over time.

8 References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set.

Holdings, H., & James, T. (2019, August 12). House Property Sales Time Series. Kaggle. https://www.kaggle.com/datasets/htagholdings/property-sales/data

---
title: "House Sale Prices from 2007 to 2019: Time Series Forecasting with Decomposition"
author: "Josie Gallop"
date: "2024-11-18"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("lubridate")) {
   install.packages("lubridate")
   library(lubridate)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("forecast")) {
   install.packages("forecast")
   library(forecast)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will build a monthly time series, decompose it, and use the decomposition for forecasting. The data set I found for this project records the sale prices of homes sold in the United States from 2007 to 2019, a span of roughly 12 years.

From these records we will compute the average home sale price for each month from 2007 to 2019. The resulting monthly series lets us look for trend and seasonal patterns, which we will separate out through decomposition and then use as the basis for forecasting.



## Data Description

I found this data set on kaggle.com on the following webpage:
https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv

This data set looks at the sale prices for homes that have been recorded from 2007 to 2019. The first observation was collected on February 6, 2007, and the final observation in the data set was collected on July 26, 2019. The sale prices are given in US dollars, and were collected to look at the trends amongst the sale prices of homes in this particular region. 

We will read the data set in from GitHub and call it "home".

```{r}
home <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/raw_sales.csv")
```

The original data set also records each home's postcode, property type, and number of bedrooms. For this project, however, we only need two variables: the sale price and the date on which the home was sold. We will average the sale prices within each month to build the monthly time series that we then analyze for notable patterns.


## Variables

The two variables which we are interested in looking at for this time series project include:

* datesold: The date on which the house was sold, given in the format "year-month-day time", for example, "2007-02-07 00:00:00". This variable identifies when each sale took place. We will reduce it to the year and month of the sale in order to build the monthly time series.

* price: The price for which the house was sold, given in US dollars. This is the quantitative variable whose behavior over time we want to study. We will replace it with the average sale price for each month of each year in order to build the monthly time series.

We will create a monthly time series and analyze the trends and patterns in the average home sale price by month from 2007 to 2019.




# Exploratory Data Analysis

Before we create our time series object, let's ensure that there are no missing values in the original data set, "home". 

```{r}
colSums(is.na(home))
```

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with creating our time series object. 



## Time Series Data Prepartion

We will adjust the data to create our monthly time series. This will allow us to analyze the monthly trends in home sale prices over the years in which the data was collected.

We will start by creating a new data set called "home1" with just the variables of datesold and price. 

```{r}
# Keep only the sale date and the sale price, with clean column names.
home1 <- data.frame(datesold = home$datesold, price = home$price)
```

The datesold variable is given in the format "Year-Month-Day hour:minute:second". While scrolling through the data set, I noticed that the time portion of every entry is 00:00:00. Since this time is identical for every observation, it carries no information beyond the date (it most likely acts as a placeholder rather than an actual sale time), so we will drop it to make the variable easier to interpret and to make it easier to compute the monthly averages.
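The observation about the time stamps can also be checked in code rather than by scrolling. Below is a minimal sketch on a small made-up vector (`toy.datesold` is hypothetical; in this report the same check would be applied to the datesold column), assuming the strings follow the "YYYY-MM-DD HH:MM:SS" layout.

```{r}
# Hypothetical example values in the same layout as datesold.
toy.datesold <- c("2007-02-07 00:00:00", "2007-02-27 00:00:00", "2007-03-07 00:00:00")

# Characters 12 to 19 hold the "HH:MM:SS" part of each string.
toy.times <- substr(toy.datesold, 12, 19)

# TRUE only if every entry has a midnight time stamp.
all.midnight <- all(toy.times == "00:00:00")
all.midnight

# In that case, keeping only the first 10 characters loses no information.
toy.dates <- as.Date(substr(toy.datesold, 1, 10))
toy.dates
```

If `all.midnight` came back FALSE for the real data, the time portion would deserve a closer look before being dropped.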

We will start by dropping the time component from the date on which each home sale occurred. This will leave us with just a "Year-Month-Day" format for the entries of the datesold variable.

```{r}
home1$datesold <- as.Date(ymd_hms(home1$datesold))
```

Next, we will calculate the average home sale price per month in order to create a data set which we can use for a monthly time series. The variable datesold will now be given in the format "Year-Month" and the price variable will represent the average home sale price for that specific month. We will create a new data set called "home2" to store this updated version of the data.

```{r}
home2 <- home1 %>%
  mutate(datesold = format(datesold, "%Y-%m")) %>%
  group_by(datesold) %>%
  summarise(price = mean(price, na.rm = TRUE))
```

There are exactly 150 observations in this revised data set, which matches the number of months from February 2007 through July 2019, so there are no gaps in the monthly sequence and we do not need to drop any observations. This is a comfortable length for a monthly time series, since it covers more than twelve full years.
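Since a time series object assumes equally spaced observations, it is worth confirming that no month is missing. The sketch below shows one way to do this on a small made-up vector of month labels (`toy.months` is hypothetical, with March 2007 deliberately left out); in this report the same idea would be applied to the "Year-Month" labels in home2.

```{r}
# Hypothetical month labels in "YYYY-MM" form, with March 2007 deliberately missing.
toy.months <- c("2007-02", "2007-04", "2007-05")

# Build the complete run of months between the first and last label.
first.month <- as.Date(paste0(min(toy.months), "-01"))
last.month  <- as.Date(paste0(max(toy.months), "-01"))
all.months  <- format(seq(first.month, last.month, by = "month"), "%Y-%m")

# Months that should be present but are not.
missing.months <- setdiff(all.months, toy.months)
missing.months

# Number of months from February 2007 to July 2019, inclusive.
n.months.expected <- length(seq(as.Date("2007-02-01"), as.Date("2019-07-01"), by = "month"))
n.months.expected
```

An empty result for `missing.months` on the real labels would mean the monthly sequence is complete.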



## Time Series Plot

Now, let's create a time series object from our data and plot it to help visualize any potential trends that may be occurring within this time series object. We will call this time series object "home.ts". Since we are creating a monthly time series looking at the average home sale prices by month, our frequency will be 12. 

```{r}
# The first monthly average is for February 2007, so the series starts at c(2007, 2).
home.ts <- ts(home2$price, start = c(2007, 2), frequency = 12)
plot(home.ts, main = "Monthly Average Home Sale Prices from 2007 to 2019", xlab = "Year", ylab = "Price")
```

Looking at the time series plot, several patterns stand out. There appears to be some seasonality, visible as peaks and drops that repeat after a fixed period of time. We will look further into this potential seasonality with the decomposition methods.

Additionally, there is a major peak and drop near the start of the time series, around 2007 to 2008. It reaches much higher than any other portion of the series, which suggests that something unusual was going on at this point. It is possible that a major event in the housing market led to a sharp rise and then a sudden fall in the average sale prices at the time.

Also, home sale prices appear to have increased gradually from the start of the time series in 2007 to its end in 2019. The seasonal component could be either additive or multiplicative. In this case it looks closer to multiplicative, since the seasonal swings appear to grow as the level of the series rises rather than staying the same size in every year.
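One informal way to judge whether seasonality is additive or multiplicative is to compare the size of the within-year swings across years, on the original scale and on the log scale. The sketch below uses a simulated series, not the housing data; all `toy.*` names and the simulation settings are assumptions made for illustration.

```{r}
# Simulated monthly values: a rising level multiplied by a seasonal factor.
toy.level    <- seq(100, 300, length.out = 120)
toy.seasonal <- 1 + 0.2 * sin(2 * pi * (1:120) / 12)
toy.vals     <- toy.level * toy.seasonal
toy.year     <- rep(2010:2019, each = 12)

# Within-year spread (max minus min) on the original and on the log scale.
spread.raw <- tapply(toy.vals, toy.year, function(x) diff(range(x)))
spread.log <- tapply(log(toy.vals), toy.year, function(x) diff(range(x)))

round(spread.raw, 1)   # grows with the level, as expected for multiplicative seasonality
round(spread.log, 3)   # changes far less after taking logs
```

If the swings on the original scale stayed about the same size from year to year, an additive seasonal component would be the more natural choice.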



# Forecasting with Classical Decomposing

First, let's begin with looking at a classical decomposition of our time series data. 

```{r}
# Classical Decomposition
cls.decomp = decompose(home.ts)
par(mar = c(2,2,2,2))
plot(cls.decomp, xlab = "Year")
```

As we can see in our plot, most of the monthly average prices range between $300,000 and $800,000. Note that `decompose()` fits an additive model unless `type = "multiplicative"` is specified, so the components shown here add up to the observed series.

There appears to be something interesting going on at the beginning of the time series. There is a very sharp peak around 2007 to 2008, much higher than anything around it. This is in fact the highest value in the entire time series, reaching up to $800,000. A major financial crisis and economic recession occurred from 2007 to 2008, and it would make sense for a crisis like this to have a significant effect on the housing market, so this may be why there is such a sharp and sudden peak in home sale prices around that time. It is also possible that the earliest months contain relatively few sales, which would make their averages more volatile; this could be checked by counting the sales in each month.

There also appears to be a seasonal pattern in the data, as seen by the peaks and drops within the plot. This pattern appears to repeat each year, with a high point and a low point occurring within each yearly period.

A seasonal pattern would make sense for house sale prices, since the time of year could affect the number of individuals buying a new home, with sale prices rising to meet the larger demand. For instance, the summer may see an increase in buyers due to the better weather and the timing of summer break, which makes it convenient for buyers with children to move before the next school year begins. This greater demand could lead to a rise in sale prices, and could be a potential explanation for the seasonal pattern we see in the time series plot.

Additionally, the trend curve shows an overall, gradual increase in home sale prices from the start of the data collection in 2007 to its end in 2019.
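To make the mechanics of the classical decomposition more concrete, the following sketch applies `decompose()` to a simulated monthly series (not the housing data; `sim.ts` and its settings are assumptions) and checks how the components relate to the original values.

```{r}
# Simulated monthly series: linear trend plus a fixed seasonal wave plus noise.
set.seed(321)
sim.ts <- ts(200 + 2 * (1:96) + 15 * sin(2 * pi * (1:96) / 12) + rnorm(96, sd = 3),
             start = c(2010, 1), frequency = 12)

sim.cls <- decompose(sim.ts)   # additive by default

# The trend is a centred 12-month moving average, so 6 values are lost at each end.
sum(is.na(sim.cls$trend))

# The three components add back up to the observed series wherever the trend exists.
sim.recon <- sim.cls$trend + sim.cls$seasonal + sim.cls$random
max(abs(sim.recon - sim.ts), na.rm = TRUE)

# The same 12 seasonal values are reused in every year.
sim.cls$figure
```

The missing trend values at both ends are one reason the first and last few months of a classical decomposition plot should be read with care.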




# Forecasting with STL Decomposing 

Next, let's use STL decomposition on our time series data.

```{r}
# STL Decomposition
stl.decomp = stl(home.ts,  s.window = "periodic")
par(mar = c(2,2,2,2))
plot(stl.decomp)
```

The STL decomposition method has several advantages over the classical decomposition method we used previously. It uses locally estimated scatterplot smoothing, also known as LOESS, to estimate the components. This lets it follow non-linear trends more flexibly, and it gives trend estimates all the way to both ends of the series, whereas the moving-average trend in the classical method is undefined for the first and last six months. With `s.window = "periodic"`, the seasonal pattern is held identical in every year.

Overall, the plots generated by the STL decomposition lead to findings similar to those from the classical decomposition. We see a similar overall trend, with various rises and falls in the monthly averages of the home sale prices from 2007 to 2019.

We also see a similar seasonal pattern that repeats each year, which once again matches the idea of an annual seasonality. As in the classical decomposition, there is a notable peak near the start of the time series, around 2007 to 2008, which contains the highest monthly average in the entire data set. Furthermore, the trend curve again shows a gradual increase in the home sale prices from the start of the time series in 2007 to its end in 2019. These patterns match up with what was observed previously.
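The properties of the STL output described above can be checked on a simulated monthly series. This is only a sketch (the series `sim.ts` is made up and is not the housing data), using base R's `stl()` with the same `s.window = "periodic"` setting.

```{r}
# Simulated monthly series: linear trend plus a fixed seasonal wave plus noise.
set.seed(321)
sim.ts <- ts(200 + 2 * (1:96) + 15 * sin(2 * pi * (1:96) / 12) + rnorm(96, sd = 3),
             start = c(2010, 1), frequency = 12)

sim.stl  <- stl(sim.ts, s.window = "periodic")
sim.comp <- sim.stl$time.series   # columns: seasonal, trend, remainder

# The LOESS trend has no missing values at the ends of the series.
sim.na.trend <- sum(is.na(sim.comp[, "trend"]))
sim.na.trend

# Seasonal + trend + remainder reproduces the observed series.
sim.stl.err <- max(abs(rowSums(sim.comp) - as.numeric(sim.ts)))
sim.stl.err

# With s.window = "periodic", every year has the same seasonal values.
sim.seas <- matrix(as.numeric(sim.comp[, "seasonal"]), nrow = 12)
sim.seas.diff <- max(abs(sim.seas - sim.seas[, 1]))
sim.seas.diff
```

Choosing a numeric `s.window` instead would let the seasonal pattern change gradually from year to year.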




# Training and Testing Data Sets

Next, let's create training and testing data sets to evaluate forecasts of the time series.

We will use the last seven observations (January to July 2019) as our testing data set, and we will try four different sizes for the training data set: n = 100, n = 70, n = 50, and n = 35. Each training data set is made up of the n observations immediately before the testing period.


```{r}
data = home.ts
n0 = length(data)   # 150 monthly averages, February 2007 to July 2019

# Every training set ends at observation n0 - 7 (December 2018).
# The first index is chosen so that each set has exactly n observations.
train.data1 = home.ts[(n0-106):(n0-7)]   # n = 100
train.data2 = home.ts[(n0-76):(n0-7)]    # n = 70
train.data3 = home.ts[(n0-56):(n0-7)]    # n = 50
train.data4 = home.ts[(n0-41):(n0-7)]    # n = 35

# The testing data set will be the last 7 observations.
test.data = home.ts[(n0-6):n0]

# Creating the four potential training data set sizes.
# start = c(year, month) of the first training observation, assuming n0 = 150.
train1.ts = ts(train.data1, frequency = 12, start = c(2010, 9))
train2.ts = ts(train.data2, frequency = 12, start = c(2013, 3))
train3.ts = ts(train.data3, frequency = 12, start = c(2014, 11))
train4.ts = ts(train.data4, frequency = 12, start = c(2016, 2))

stl1 = stl(train1.ts, s.window = 12)
stl2 = stl(train2.ts, s.window = 12)
stl3 = stl(train3.ts, s.window = 12)
stl4 = stl(train4.ts, s.window = 12)
```
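As an alternative to indexing by position, base R's `window()` can split a series by calendar date and keeps the time attributes automatically. The sketch below uses a simulated series with the same span as our data (150 months starting in February 2007); `toy.full`, `toy.train`, and `toy.test` are hypothetical names and the values are random.

```{r}
# Simulated monthly series: 150 months starting in February 2007.
set.seed(321)
toy.full <- ts(cumsum(rnorm(150)) + 100, start = c(2007, 2), frequency = 12)

# Hold out the final 7 months (January to July 2019) as the testing set.
toy.test <- window(toy.full, start = c(2019, 1))

# Training set of the 100 months that end in December 2018.
toy.train <- window(toy.full, start = c(2010, 9), end = c(2018, 12))

length(toy.test)    # 7
length(toy.train)   # 100
start(toy.train)    # 2010 9
end(toy.train)      # 2018 12
```

Because the dates are stated explicitly, this style makes it easy to see at a glance which months each training set covers.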

Next, let's create the objects for forecasting.

```{r}
# Forecasting
fcst1 = forecast(stl1,h = 7, method = "naive")
fcst2 = forecast(stl2,h = 7, method = "naive")
fcst3 = forecast(stl3,h = 7, method = "naive")
fcst4 = forecast(stl4,h = 7, method = "naive")
```

We will perform error analysis to help us determine which of the four training data set sizes works best. We will look at both the mean absolute percentage error (MAPE) and the mean squared error (MSE) of the forecasts on the testing data. Note that the percentage errors below are computed relative to the forecasted values.

```{r}
PE1 = (test.data-fcst1$mean)/fcst1$mean
PE2 = (test.data-fcst2$mean)/fcst2$mean
PE3 = (test.data-fcst3$mean)/fcst3$mean
PE4 = (test.data-fcst4$mean)/fcst4$mean

# Mean Absolute Prediction Errors.
MAPE1 = mean(abs(PE1))
MAPE2 = mean(abs(PE2))
MAPE3 = mean(abs(PE3))
MAPE4 = mean(abs(PE4))

E1 = test.data-fcst1$mean
E2 = test.data-fcst2$mean
E3 = test.data-fcst3$mean
E4 = test.data-fcst4$mean

# Mean Squared Errors.
MSE1 = mean(E1^2)
MSE2 = mean(E2^2)
MSE3 = mean(E3^2)
MSE4 = mean(E4^2)

MSE = c(MSE1, MSE2, MSE3, MSE4)
MAPE = c(MAPE1, MAPE2, MAPE3, MAPE4)
accuracy = cbind(MSE = MSE, MAPE = MAPE)
row.names(accuracy) = c("n = 100", "n = 70", "n = 50", "n = 35")
kable(accuracy, caption = "Error Comparison of the Forecast Results with Different Training Data Set Sizes")
```

As we can see, the training data set with a size of n = 100 has both the lowest MSE and the lowest MAPE out of the four sizes that were looked at. This suggests that it is the best of these four options, at least for this seven-month testing period.
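To show exactly what these two error measures compute, here is a small worked example with made-up numbers (the `toy.*` values are hypothetical and chosen only to make the arithmetic easy to follow). It also contrasts the forecast-scaled percentage error used above with the more common version that divides by the actual values.

```{r}
# Hypothetical actual and forecasted values.
toy.actual   <- c(100, 200, 400)
toy.forecast <- c(110, 180, 400)

toy.err <- toy.actual - toy.forecast          # -10, 20, 0

# Mean squared error: (100 + 400 + 0) / 3.
toy.mse <- mean(toy.err^2)

# Percentage errors scaled by the forecast, as in the code above.
toy.mape.fcst <- mean(abs(toy.err / toy.forecast))

# The more common definition scales by the actual values instead.
toy.mape.actual <- mean(abs(toy.err / toy.actual))

c(MSE = toy.mse, MAPE.forecast = toy.mape.fcst, MAPE.actual = toy.mape.actual)
```

The two versions of the MAPE are usually close, but they can rank forecasts differently when the forecasts are far from the actual values.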

We will plot these forecast errors to compare the four training data set sizes visually.

First, let's look at a plot of the mean squared errors (MSE) for each of the four sizes.

```{r}
# Plot of the MSE.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
plot(1:4, MSE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MSE", axes = FALSE)
axis(1, at = 1:4, labels = labs)
axis(2)
```

As we can see, n = 100 observations has the lowest MSE out of the four sizes.

We will also look at the plot of the mean absolute prediction errors (MAPE) for each of the four potential training data set sizes. 

```{r}
# Plot of the MAPE.
labs = c("n = 100", "n = 70", "n = 50", "n = 35")
plot(1:4, MAPE, type = "b", col = "darkorchid", ylab = "Errors", xlab = "Sample Size",
     main = "MAPE", axes = FALSE)
axis(1, at = 1:4, labels = labs)
axis(2)
```

Once again, n = 100 observations shows the lowest value, this time for the MAPE.

In both plots, the errors increase as the training data set shrinks, and the smaller sizes produce errors much greater than those for 100 observations.

So, out of the training data sets with 100, 70, 50, or 35 observations, the best choice is the one with 100 observations. It gives the lowest MSE and MAPE, and therefore the best forecasting performance on our testing data.
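The same comparison can be written as a loop, which makes it easier to try other sizes. The sketch below is self-contained and uses only base R: the series is simulated (not the housing data), and `stl.naive()` is a hand-rolled helper written for this illustration that carries the last seasonally adjusted value forward and adds back the most recent year of seasonal values. It is meant as a rough stand-in for an STL-plus-naive forecast, not a reproduction of the forecast objects created above.

```{r}
# Simulated series standing in for the monthly averages.
set.seed(321)
toy.full <- ts(300 + 1.5 * (1:150) + 20 * sin(2 * pi * (1:150) / 12) + rnorm(150, sd = 5),
               start = c(2007, 2), frequency = 12)
toy.h <- 7
toy.n <- length(toy.full)
toy.test <- as.numeric(toy.full[(toy.n - toy.h + 1):toy.n])

# Hypothetical helper: naive forecast of the seasonally adjusted series,
# plus the seasonal values from the last 12 training months.
stl.naive <- function(train, h) {
  comp        <- stl(train, s.window = "periodic")$time.series
  adjusted    <- as.numeric(train) - as.numeric(comp[, "seasonal"])
  last.season <- as.numeric(tail(comp[, "seasonal"], 12))
  tail(adjusted, 1) + last.season[((seq_len(h) - 1) %% 12) + 1]
}

toy.sizes <- c(100, 70, 50, 35)
toy.mse.by.size <- sapply(toy.sizes, function(n) {
  idx   <- (toy.n - toy.h - n + 1):(toy.n - toy.h)   # the n months just before the test set
  train <- ts(toy.full[idx], start = time(toy.full)[idx[1]], frequency = 12)
  mean((toy.test - stl.naive(train, toy.h))^2)
})
names(toy.mse.by.size) <- paste("n =", toy.sizes)
toy.mse.by.size
```

Adding more candidate sizes only requires extending `toy.sizes`, as long as each training set still covers more than two full years.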



# Forecasting the Upcoming Months

We will use forecasting to provide an estimate for the upcoming months after the end of the time series data collection. This time series collected data up until July 2019, so we will forecast the estimated home sale prices for the months which follow the end of this data collection.

We will use an exponential smoothing method for this forecasting. This will allow us to forecast the next twelve months of average home sale prices by month. Since our time series stops at July 2019, the forecast covers August 2019 through July 2020.

We stated previously that the seasonality in this time series appeared to be multiplicative, since the seasonal variation seemed to grow along with the overall gradual increase in monthly average home sale prices. We will reflect this in the exponential smoothing forecast by choosing the "MAM" model, which combines multiplicative errors, an additive trend, and multiplicative seasonality.

```{r}
# "MAM": multiplicative errors, additive trend, multiplicative seasonality.
# The objects are named home.ets and home.fcst so that they do not mask
# the ets() and forecast() functions.
home.ets <- ets(home.ts, model = "MAM")
home.fcst <- forecast(home.ets, h = 12)
plot(home.fcst, main = "Forecast of the Average Home Sale Prices \n for the Next 12 Months")
```

As we can see, this plot illustrates the forecast for the twelve months after the end of the time series, giving an estimate of what the average home sale prices will look like over that period.

We can also print the specific estimated average home sale price for each of these twelve months.

```{r}
home.fcst$mean
```

Reading the values from this output, the highest forecasted average home sale price is $728,994.70 and the lowest is $624,725.10. The month attached to each value is shown in the printed output, and these labels depend on the start date supplied when the time series object was created.
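Rather than reading the highest and lowest months off the printed output by eye, they can be extracted in code. The sketch below uses a made-up series of twelve forecast values (`toy.fcst` is hypothetical, as is the helper `month.label()`); the same approach would apply to the `mean` element of a forecast object, which is also a time series.

```{r}
# Hypothetical 12-month forecast values starting in August 2019.
toy.fcst <- ts(c(640, 655, 610, 625, 630, 650, 670, 690, 700, 680, 660, 645),
               start = c(2019, 8), frequency = 12)

# Hypothetical helper: turn a position in the series into a "Month Year" label.
month.label <- function(x, i) {
  paste(month.name[cycle(x)[i]], floor(time(x)[i] + 1e-8))
}

toy.high <- month.label(toy.fcst, which.max(toy.fcst))
toy.low  <- month.label(toy.fcst, which.min(toy.fcst))

toy.high   # "April 2020"
toy.low    # "October 2019"
```

This keeps the reported months in step with the time series object, so the labels stay correct if the series or its start date changes.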



# Conclusion

In this project, we created a time series of the monthly average home sale prices from 2007 to 2019 and analyzed its trends and patterns. One notable finding was a seasonal pattern that appeared to repeat annually, with the home sale prices going through highs and lows over the course of each year. Both the classical decomposition method and the STL decomposition method showed this seasonal pattern.

Another notable finding was an overall increase in the average home sale prices from the beginning of the data collection in 2007 to its end in 2019. The time series plot and the trend curves in the decomposition plots both showed this gradual increase over the 12-year period.

Additionally, we compared four potential sizes for the training data set: n = 100, 70, 50, and 35 observations. The size of n = 100 observations was the best of these four options, because it had the lowest values of both the MSE and the MAPE on the testing data.

We also used exponential smoothing to forecast the twelve months after the end of the time series data collection. This provided predictions for the average home sale prices up through July of 2020.


## Recommendations

Some recommendations I would make for future projects include:

* Overall, there appeared to be an annual seasonality in this time series of home sale prices from 2007 to 2019. Some potential explanations were given for this seasonal pattern, such as more individuals buying houses in the summer, leading to an increase in sale prices. To strengthen the evidence, future projects could investigate whether this is indeed a common pattern in records of house sale prices.

* Future projects could try additional training data set sizes to see if another size works even better than the ones we tried in this project. We looked at four potential sizes and found that n = 100 observations gave the lowest errors, but other sizes, or more than one testing period, might lead to even better accuracy.


Overall, this time series project provided insight into some of the patterns and trends in the monthly average home sale prices from 2007 to 2019. The seasonality and trends which were observed show that home sale prices go through cycles of change, with various highs and lows over time.



# References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set. 

Holdings, H., & James, T. (2019, August 12). House Property Sales Time Series. Kaggle. https://www.kaggle.com/datasets/htagholdings/property-sales/data 
