1 Introduction and Background

The dataset in use today originally stems from the extremely popular real estate website, Zillow. It is a compilation of Zillow’s home value index rankings broken down in a month-by-month and state-by-state manner, beginning in January of 2000 and being most recently updated in September of 2025.

Technically speaking, the dataset’s structure is fairly straightforward. There are 51 variables, with the first day of each month acting as a date object within R in YYYY-MM-DD format, and every state in the nation’s home value index getting its own variable, stored as a double (units are $USD). There are 228 missing values across the dataset, with the vast majority of them coming from North Dakota, Montana, Wyoming and New Mexico.

For my analysis, I will be specifically focusing on the typical home values in the state of New York, and constraining my study to the 15 year time window of January 2010 to January of 2025. I will not be primarily concerned with associative and external exploration as to why the changes in home prices have occurred in the manner which they have. Rather I will be applying a longitudinal perspective and developing a number of time series models with the hope of creating a model that can reliably, based on the historical data at hand, predict future New York home values.

2 Plot of the Data

Below we see a plot of the standard New York home’s value over the past 15 years. From this plot, there is one obvious takeaway that instantly comes to the viewer’s mind. That being, there is a clear, noticeable and approximately consistent increase in home values throughout the duration between January of 2010 and January of 2025. To quantify; the home value index for New York in January of 2025 was about $496,846 per home, while in January of 2010 that same metric was only $267,153. This is a staggering increase of about 85.98%.

Time_Series = ts(TS_Data$NY, frequency = 12, start = c(2010,1))
  # frequency set to 12 since our dataset is composed of years, which are 12-month periods
  # keyword start ensures that the axes on graph below will be accurate, that ts function recognizes that the first observation is the first of 12 of value 2010

# using xaxt = "n" will override the graph's default axis. I want the x intervals to be more frequent than every 5 years
plot((Time_Series/1000), xaxt="n",
     main = "New York State's Home Value Index, 2010 - 2025",
     xlab = "Year",
     ylab = "Value (1000's $USD)")
axis(1,
     at = seq(2010, 2025, by = 1),
     labels = seq(2010, 2025, by = 1))

3 Decomposition Analysis

Before creating time series models of varying sizes and using them to forecast future values, I performed decomposition on the dataset. Decomposition is basically the process of breaking down the makeup of a time series into its different key components. These components being the seasonal patterns, overall trend, and random error (sometimes called remainder) that exist within the data. Below are visuals from two decomposition processes; classical and STL.

3.1 Classical Decomposition

Here is a display of the New York home value index data from 2010 to 2025, post classical decomposition. There are a few noticeable takeaways. Looking at the trend component, we see a clear rise in home values over the 15-year period. This is not surprising, as it was also clearly visible in our original plot above. Regarding seasonality, we see consistent oscillation at approximately the same interval length throughout the dataset. On average throughout the years, New York homes had their highest values in the Fall (~ $340,676) and their lowest in the Spring (~ 332,837). Finally, when it comes to the random error component of the decomposed time series, we see a spike in the remainder’s magnitude after 2020. This can most likely be attributed to the economic uncertainty related to the COVID pandemic.

Classic_Decomp = decompose(Time_Series)
plot(Classic_Decomp)

  Summer = filter(Data, Season == "Summer")
  Fall = filter(Data, Season == "Fall")
  Winter = filter(Data, Season == "Winter")
  Spring = filter(Data, Season == "Spring")
    
  invisible(round(mean(Summer$NY)))
  invisible(round(mean(Fall$NY)))
  invisible(round(mean(Winter$NY)))
  invisible(round(mean(Spring$NY)))

3.2 Seasonal and Trend LOESS (STL) Decomposition

The STL decomposition of the data tells practically the same story of the classical decomposition. The only large difference being that this decomposition shows a much larger magnitude of seasonal oscillation from about 2020 to 2025, while the classical decomposition shows equivalent seasonal oscillation throughout.

Generally speaking, STL decomposition is considered a more robust procedure than its classical counterpart, as STL is more effective in dealing with outliers and non-linearity in data. The STL method uses LOESS (modern locally estimated scatterplot smoothing), while the classical method uses a more basic moving average. That is why we see variation in the seasonality of the data in our STL visual.

Going forward, I used the STL method when formulating the time series models.

STL_Decomp = stl(Time_Series, s.window = 12)
  # the s.window argument essentially tells the stl function how to estimate the seasonal component. Since I put the number 12, that is basically telling the function that every 12 observations (months) the seasonal pattern (year) should be reset to re-estimate for the next series of observations. If I was trying to estimate the seasonal component for summer/fall/winter/spring, I would set s.window = to 3

plot(STL_Decomp,
     main = "STL Decomposition of Time Series")

4 Creating Training and Testing Sets

Since it is bad statistical practice to test a model on the same data which it is trained on, I devoted the data’s ten most recent observations (April of 2024 to January of 2025) for post-creation testing purposes. For the training data, I used R’s runif function to randomly select four sample sizes (n) to assign to each of the four training subsets of the data. All four of the training datasets ended with the observation on March of 2024, and respectively began (n - 1) months after January of 2010. I created the training datasets (1 through 4) in sequential order of descending sample size (set 1 has the most observations, set 2 has the second most, etc…).

After subsetting the data into these four separate training sets, I performed STL decomposition on each one and then forecasted each model’s next ten observations (the same ten which make up the known testing dataset). For each model I used the drift method for forecasting, as the drift accounts for historical increases and decreases in a response variable over time, something that is very prevalent in this dataset.

# Training dataset of 171 observations, most we want to take out is about 30.
# use runif function for random selection, to select four numbers between 1 and 30, our training subsets of the data will go from that randomly selected observation point up to point 171 (March of 2024)
   # sort(round(runif(4, min = 1, max = 30)))
    # Results are 9, 12, 16, 25

Testing_Data = TS_Data[172:181,]

# Training set 1: 
Training_Data1 = TS_Data[9:171, ]

# Training set 2: 
Training_Data2 = TS_Data[12:171, ]

# Training set 3: 
Training_Data3 = TS_Data[16:171, ]

# Training set 4: 
Training_Data4 = TS_Data[25:171, ]


### Now that training and testing subsets of data have been created, make time series objects for each

TS_Train1 = ts(Training_Data1$NY, frequency = 12, start = c(2010, 9))
TS_Train2 = ts(Training_Data2$NY, frequency = 12, start = c(2010, 12))
TS_Train3 = ts(Training_Data3$NY, frequency = 12, start = c(2011, 4))
TS_Train4 = ts(Training_Data4$NY, frequency = 12, start = c(2012, 1))

### going to use the stl() function to decompose each of the training dataset time series objects, then use those decomposed objects for forecasting

STL_Training1 = stl(TS_Train1, s.window = 12)
STL_Training2 = stl(TS_Train2, s.window = 12)
STL_Training3 = stl(TS_Train3, s.window = 12)
STL_Training4 = stl(TS_Train4, s.window = 12)

### Forecast calculations with decomposed training set time series objects

  # Set h = 10 as there is 10 observations in the testing dataset, so we can do a one-to-one comparison between our training forecasts for each time series model and our observed values in the testing dataset (April 2024 to January 2025)
Train_Set1_Forecasts = forecast(STL_Training1, h = 10, method = "rwdrift")
Train_Set2_Forecasts = forecast(STL_Training2, h = 10, method = "rwdrift")
Train_Set3_Forecasts = forecast(STL_Training3, h = 10, method = "rwdrift")
Train_Set4_Forecasts = forecast(STL_Training4, h = 10, method = "rwdrift")

4.1 Examining each Training Set’s Forecasted Values

Below I created both a table to quantify each training set’s forecasts as well as a graph to visualize their performances relative to both one another and the true New York home values provided in the original dataset. As we can see, all of the models underestimated the rise in home values that would take place from August of 2024 to January of 2025.

  # Use the $mean selection to see the forecasted predicted values at particular points
TSMean1 = data.frame(Train_Set1_Forecasts$mean)
  colnames(TSMean1) = "Errors"
TSMean2 = data.frame(Train_Set2_Forecasts$mean)
  colnames(TSMean2) = "Errors"
TSMean3 = data.frame(Train_Set3_Forecasts$mean)
  colnames(TSMean3) = "Errors"
TSMean4 = data.frame(Train_Set4_Forecasts$mean)
  colnames(TSMean4) = "Errors"
Years = data.frame(TS_Data$Date[172:181])

Forecasted_Values = cbind(Years, TSMean1, TSMean2, TSMean3, TSMean4, Testing_Data[,2])
colnames(Forecasted_Values) = c("Date", "TS 1", "TS 2", "TS 3", "TS 4", "True Values")
kable(Forecasted_Values, caption = "Forecasted Values Per Training Set")
Forecasted Values Per Training Set
Date TS 1 TS 2 TS 3 TS 4 True Values
2024-04-01 471873.8 471906.1 471970.9 472416.2 474674.4
2024-05-01 473687.4 473749.5 473884.2 474484.1 478114.6
2024-06-01 475432.6 475524.3 475729.0 476477.5 480880.8
2024-07-01 476944.1 477081.9 477258.2 478134.8 483467.0
2024-08-01 478031.1 478215.0 478421.9 479375.2 486631.5
2024-09-01 478733.7 478963.6 479201.1 480193.7 489556.9
2024-10-01 479461.4 479739.6 480047.0 481079.5 492028.3
2024-11-01 480458.1 480784.5 481130.9 482225.5 493899.7
2024-12-01 481615.7 481990.3 482375.8 483545.4 495692.6
2025-01-01 482804.7 483219.2 483708.2 485002.2 496846.3
List_Of_Dates = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01", "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01", "2024-12-01", "2025-01-01"))

TS1 = data.frame(List_Of_Dates, as.double(TSMean1$Errors))
TS2 = data.frame(List_Of_Dates, as.double(TSMean2$Errors))
TS3 = data.frame(List_Of_Dates, as.double(TSMean3$Errors))
TS4 = data.frame(List_Of_Dates, as.double(TSMean4$Errors))

plot(Testing_Data, col = "blue", lwd = 2,
     main = "Forecasted Vs True New York Home Values \n (August 2024 - January 2025)",
     xlab = "Date",
     ylab = "Value($USD)")
  lines(Testing_Data, col = "blue", lwd = 2)
  lines(TS1, col = "red", lwd = 2)
    points(TS1, col = "red")
  lines(TS2, col = "black", lwd = 2)
    points(TS2, col = "black")
  lines(TS3, col = "green", lwd = 2)
    points(TS3, col = "green")
  lines(TS4, col = "purple", lwd = 2)
    points(TS4, col = "purple")
    legend("topleft",
       legend = c("True Values", "Time Series 1", "Time Series 2",
                  "Time Series 3", "Time Series 4"),
       col = c("blue", "red", "black", "green", "purple"),
       lty = 1, pch = 16, lwd = 2)

4.2 Training Set Forecast Error Analysis

As we saw above, there was a consistent underestimation by each of the training set time series. However, there appeared to be a linear and negative correlation between the sample size of the training set, and that set’s forecasts relative to the true values. Training set 4 had the smallest sample size (beginning at January of 2012) and was the most accurate.

A numeric breakdown of each training sets forecasts’ accuracy is below. The metrics I used were mean error (ME), mean square error (MSE) and mean absolute prediction error (MAPE). As seen in the visuals above, as the sample size of our time series training set decreased, its accuracy increased.

This is likely due to non-linear rise in New York home values, especially from 2018 to 2025. The smaller time series training sets were less influenced by the lesser home values of the early parts of the 2010s.

####### Define Error Objects
TS_Forecasts1_Errors = Testing_Data[,2] - TSMean1
TS_Forecasts2_Errors = Testing_Data[,2] - TSMean2
TS_Forecasts3_Errors = Testing_Data[,2] - TSMean3
TS_Forecasts4_Errors = Testing_Data[,2] - TSMean4

####### ME
TS1_ME = sum(TS_Forecasts1_Errors)/nrow(TS_Forecasts1_Errors)
TS2_ME = sum(TS_Forecasts2_Errors)/nrow(TS_Forecasts2_Errors)
TS3_ME = sum(TS_Forecasts3_Errors)/nrow(TS_Forecasts3_Errors)
TS4_ME = sum(TS_Forecasts4_Errors)/nrow(TS_Forecasts4_Errors)

####### MSE
TS1_MSE = sum(TS_Forecasts1_Errors^2)/nrow(TS_Forecasts1_Errors)
TS2_MSE = sum(TS_Forecasts2_Errors^2)/nrow(TS_Forecasts2_Errors)
TS3_MSE = sum(TS_Forecasts3_Errors^2)/nrow(TS_Forecasts3_Errors)
TS4_MSE = sum(TS_Forecasts4_Errors^2)/nrow(TS_Forecasts4_Errors)

####### MAPE
TS1_MAPE = (sum(abs((TS_Forecasts1_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts1_Errors)
TS2_MAPE = (sum(abs((TS_Forecasts2_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts2_Errors)
TS3_MAPE = (sum(abs((TS_Forecasts3_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts3_Errors)
TS4_MAPE = (sum(abs((TS_Forecasts4_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts4_Errors)

##### Table to Visualize
All_MEs = rbind(TS1_ME, TS2_ME, TS3_ME, TS4_ME)
All_MSEs = rbind(TS1_MSE, TS2_MSE, TS3_MSE, TS4_MSE)
All_MAPEs = rbind(TS1_MAPE, TS2_MAPE, TS3_MAPE, TS4_MAPE)

Forecast_Errors = cbind(All_MEs, All_MSEs, All_MAPEs)
colnames(Forecast_Errors) = c("ME", "MSE", "MAPE")
rownames(Forecast_Errors) = c("TS1", "TS2", "TS3", "TS4")
kable(Forecast_Errors, caption = "Model Forecasts' Accuracy Measures")
Model Forecasts’ Accuracy Measures
ME MSE MAPE
TS1 9274.949 102471188 1.891789
TS2 9061.826 97571016 1.848419
TS3 8806.477 92147504 1.796344
TS4 7885.797 75042461 1.608069

5 Conclusion

Ultimately, when looking at New York state’s home values over the past 15 years, forecasting monthly index values via STL-decomposed training sets did not yield particularly accurate results. Despite training our forecast algorithm with the drift method, which is meant to take into account historical increases and decreases, all four of our time series underestimated the true values of New York’s home value index from August of 2024 to January of 2025.

In a future analysis, it would possibly be valuable during the training set process, to widen the scope of sample size selection. Doing so would allow us to see at which point does constraining the time interval of observations for the training data become detrimental. As in this instance, with the maximum number of observations coming out being set to 30, we saw a directly negative relationship between sample size of the training set and forecasts’ accuracy.

---
title: "New York State's Monthly Home Value Index (Since 2010)"
author: "Chris Bahm"
date: "2025-11-21"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(forecast)) {library(forecast)} 

if (!require(lubridate)) {library(lubridate)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
```

# Introduction and Background 
The dataset in use today originally stems from the extremely popular real estate website, Zillow. It is a compilation of Zillow's home value index rankings broken down in a month-by-month and state-by-state manner, beginning in January of 2000 and being most recently updated in September of 2025.

Technically speaking, the dataset's structure is fairly straightforward. There are 51 variables, with the first day of each month acting as a date object within R in YYYY-MM-DD format, and every state in the nation's home value index getting its own variable, stored as a double **(units are $USD)**. There are 228 missing values across the dataset, with the vast majority of them coming from North Dakota, Montana, Wyoming and New Mexico.

For my analysis, I will be specifically focusing on the typical home values in the state of New York, and constraining my study to the 15 year time window of January 2010 to January of 2025. I will not be primarily concerned with associative and external exploration as to *why* the changes in home prices have occurred in the manner which they have. Rather I will be applying a longitudinal perspective and developing a number of time series models with the hope of creating a model that can reliably, based on the historical data at hand, predict future New York home values.

```{r Data Loading and Cleaning, include=FALSE}
Data_Url= "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/Zillow_Data.csv"

Original_Data = read_delim(Data_Url, delim = ",", show_col_types = FALSE)
sum(is.na(Original_Data)) # = 228 missing values

  # Breakdown of missing values per state
colSums(is.na(Original_Data))[colSums(is.na(Original_Data)) > 0]

# Will use the last 15 years of data, 15 years * 12 months = 180 observations

Data = Original_Data[121:301,]
colnames(Data)[1] = "Date"
glimpse(Data)
  # All variables formatted right, Month is date object and the rest are doubles

  # Going to conduct univariate analysis, with New York's average home value index (in $USD) as the response in this time series

Data = Data[ ,c("Date", "New York")]
colnames(Data)[2] = "NY"
  # ^ For coding convenience

# Want to label my data by month, season and year.
  # Used AI for season assignment below
Data = Data %>%
  mutate(
    Year = year(Date),
    Month = month(Date, label = TRUE, abbr = TRUE),
    Season = case_when(
      month(Date) %in% c(12, 1, 2) ~ "Winter",
      month(Date) %in% c(3, 4, 5)  ~ "Spring",
      month(Date) %in% c(6, 7, 8)  ~ "Summer",
      month(Date) %in% c(9, 10, 11) ~ "Fall"
    )
  )

TS_Data = Data[,1:2]
```

# Plot of the Data
Below we see a plot of the standard New York home's value over the past 15 years. From this plot, there is one obvious takeaway that instantly comes to the viewer's mind. That being, there is a clear, noticeable and approximately consistent increase in home values throughout the duration between January of 2010 and January of 2025. To quantify; the home value index for New York in January of 2025 was about $496,846 per home, while in January of 2010 that same metric was only $267,153. This is a staggering increase of about `r round((((496846 - 267153)/267153) * 100), 2)`%. 
```{r, fig.align ='center'}
Time_Series = ts(TS_Data$NY, frequency = 12, start = c(2010,1))
  # frequency set to 12 since our dataset is composed of years, which are 12-month periods
  # keyword start ensures that the axes on graph below will be accurate, that ts function recognizes that the first observation is the first of 12 of value 2010

# using xaxt = "n" will override the graph's default axis. I want the x intervals to be more frequent than every 5 years
plot((Time_Series/1000), xaxt="n",
     main = "New York State's Home Value Index, 2010 - 2025",
     xlab = "Year",
     ylab = "Value (1000's $USD)")
axis(1,
     at = seq(2010, 2025, by = 1),
     labels = seq(2010, 2025, by = 1))
```


# Decomposition Analysis
Before creating time series models of varying sizes and using them to forecast future values, I performed decomposition on the dataset. Decomposition is basically the process of breaking down the makeup of a time series into its different key components. These components being the seasonal patterns, overall trend, and random error (sometimes called remainder) that exist within the data. Below are visuals from two decomposition processes; classical and STL.

## Classical Decomposition
Here is a display of the New York home value index data from 2010 to 2025, post classical decomposition. There are a few noticeable takeaways. Looking at the trend component, we see a clear rise in home values over the 15-year period. This is not surprising, as it was also clearly visible in our original plot above. Regarding seasonality, we see consistent oscillation at approximately the same interval length throughout the dataset. On average throughout the years, New York homes had their highest values in the Fall (~ $340,676) and their lowest in the Spring (~ 332,837). Finally, when it comes to the random error component of the decomposed time series, we see a spike in the remainder's magnitude after 2020. This can most likely be attributed to the economic uncertainty related to the COVID pandemic.
```{r, fig.align ='center'}
Classic_Decomp = decompose(Time_Series)
plot(Classic_Decomp)

  Summer = filter(Data, Season == "Summer")
  Fall = filter(Data, Season == "Fall")
  Winter = filter(Data, Season == "Winter")
  Spring = filter(Data, Season == "Spring")
    
  invisible(round(mean(Summer$NY)))
  invisible(round(mean(Fall$NY)))
  invisible(round(mean(Winter$NY)))
  invisible(round(mean(Spring$NY)))
```

## Seasonal and Trend LOESS (STL) Decomposition
The STL decomposition of the data tells practically the same story of the classical decomposition. The only large difference being that this decomposition shows a much larger magnitude of seasonal oscillation from about 2020 to 2025, while the classical decomposition shows equivalent seasonal oscillation throughout.

Generally speaking, STL decomposition is considered a more robust procedure than its classical counterpart, as STL is more effective in dealing with outliers and non-linearity in data. The STL method uses LOESS (modern locally estimated scatterplot smoothing), while the classical method uses a more basic moving average. That is why we see variation in the seasonality of the data in our STL visual.

Going forward, I used the STL method when formulating the time series models.

```{r, fig.align ='center'}
STL_Decomp = stl(Time_Series, s.window = 12)
  # the s.window argument essentially tells the stl function how to estimate the seasonal component. Since I put the number 12, that is basically telling the function that every 12 observations (months) the seasonal pattern (year) should be reset to re-estimate for the next series of observations. If I was trying to estimate the seasonal component for summer/fall/winter/spring, I would set s.window = to 3

plot(STL_Decomp,
     main = "STL Decomposition of Time Series")
```

# Creating Training and Testing Sets
Since it is bad statistical practice to test a model on the same data which it is trained on, I devoted the data's ten most recent observations (April of 2024 to January of 2025) for post-creation testing purposes. For the training data, I used R's runif function to randomly select four sample sizes (*n*) to assign to each of the four training subsets of the data. All four of the training datasets ended with the observation on March of 2024, and respectively began (*n* - 1) months after January of 2010. I created the training datasets (1 through 4) in sequential order of descending sample size (set 1 has the most observations, set 2 has the second most, etc...).

After subsetting the data into these four separate training sets, I performed STL decomposition on each one and then forecasted each model's next ten observations (the same ten which make up the known testing dataset). For each model I used the drift method for forecasting, as the drift accounts for historical increases and decreases in a response variable over time, something that is very prevalent in this dataset.
```{r, Making Training Sets}
# Training dataset of 171 observations, most we want to take out is about 30.
# use runif function for random selection, to select four numbers between 1 and 30, our training subsets of the data will go from that randomly selected observation point up to point 171 (March of 2024)
   # sort(round(runif(4, min = 1, max = 30)))
    # Results are 9, 12, 16, 25

Testing_Data = TS_Data[172:181,]

# Training set 1: 
Training_Data1 = TS_Data[9:171, ]

# Training set 2: 
Training_Data2 = TS_Data[12:171, ]

# Training set 3: 
Training_Data3 = TS_Data[16:171, ]

# Training set 4: 
Training_Data4 = TS_Data[25:171, ]


### Now that training and testing subsets of data have been created, make time series objects for each

TS_Train1 = ts(Training_Data1$NY, frequency = 12, start = c(2010, 9))
TS_Train2 = ts(Training_Data2$NY, frequency = 12, start = c(2010, 12))
TS_Train3 = ts(Training_Data3$NY, frequency = 12, start = c(2011, 4))
TS_Train4 = ts(Training_Data4$NY, frequency = 12, start = c(2012, 1))

### going to use the stl() function to decompose each of the training dataset time series objects, then use those decomposed objects for forecasting

STL_Training1 = stl(TS_Train1, s.window = 12)
STL_Training2 = stl(TS_Train2, s.window = 12)
STL_Training3 = stl(TS_Train3, s.window = 12)
STL_Training4 = stl(TS_Train4, s.window = 12)

### Forecast calculations with decomposed training set time series objects

  # Set h = 10 as there is 10 observations in the testing dataset, so we can do a one-to-one comparison between our training forecasts for each time series model and our observed values in the testing dataset (April 2024 to January 2025)
Train_Set1_Forecasts = forecast(STL_Training1, h = 10, method = "rwdrift")
Train_Set2_Forecasts = forecast(STL_Training2, h = 10, method = "rwdrift")
Train_Set3_Forecasts = forecast(STL_Training3, h = 10, method = "rwdrift")
Train_Set4_Forecasts = forecast(STL_Training4, h = 10, method = "rwdrift")
```

## Examining each Training Set's Forecasted Values
Below I created both a table to quantify each training set's forecasts as well as a graph to visualize their performances relative to both one another and the **true** New York home values provided in the original dataset. As we can see, all of the models underestimated the rise in home values that would take place from August of 2024 to January of 2025.
```{r, fig.align='center'}
  # Use the $mean selection to see the forecasted predicted values at particular points
TSMean1 = data.frame(Train_Set1_Forecasts$mean)
  colnames(TSMean1) = "Errors"
TSMean2 = data.frame(Train_Set2_Forecasts$mean)
  colnames(TSMean2) = "Errors"
TSMean3 = data.frame(Train_Set3_Forecasts$mean)
  colnames(TSMean3) = "Errors"
TSMean4 = data.frame(Train_Set4_Forecasts$mean)
  colnames(TSMean4) = "Errors"
Years = data.frame(TS_Data$Date[172:181])

Forecasted_Values = cbind(Years, TSMean1, TSMean2, TSMean3, TSMean4, Testing_Data[,2])
colnames(Forecasted_Values) = c("Date", "TS 1", "TS 2", "TS 3", "TS 4", "True Values")
kable(Forecasted_Values, caption = "Forecasted Values Per Training Set")

List_Of_Dates = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01", "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01", "2024-12-01", "2025-01-01"))

TS1 = data.frame(List_Of_Dates, as.double(TSMean1$Errors))
TS2 = data.frame(List_Of_Dates, as.double(TSMean2$Errors))
TS3 = data.frame(List_Of_Dates, as.double(TSMean3$Errors))
TS4 = data.frame(List_Of_Dates, as.double(TSMean4$Errors))

plot(Testing_Data, col = "blue", lwd = 2,
     main = "Forecasted Vs True New York Home Values \n (August 2024 - January 2025)",
     xlab = "Date",
     ylab = "Value($USD)")
  lines(Testing_Data, col = "blue", lwd = 2)
  lines(TS1, col = "red", lwd = 2)
    points(TS1, col = "red")
  lines(TS2, col = "black", lwd = 2)
    points(TS2, col = "black")
  lines(TS3, col = "green", lwd = 2)
    points(TS3, col = "green")
  lines(TS4, col = "purple", lwd = 2)
    points(TS4, col = "purple")
    legend("topleft",
       legend = c("True Values", "Time Series 1", "Time Series 2",
                  "Time Series 3", "Time Series 4"),
       col = c("blue", "red", "black", "green", "purple"),
       lty = 1, pch = 16, lwd = 2)
```

## Training Set Forecast Error Analysis
As we saw above, there was a consistent underestimation by each of the training set time series. However, there appeared to be a linear and negative correlation between the sample size of the training set, and that set's forecasts relative to the true values. Training set 4 had the smallest sample size (beginning at January of 2012) and was the most accurate.

A numeric breakdown of each training sets forecasts' accuracy is below. The metrics I used were mean error (ME), mean square error (MSE) and mean absolute prediction error (MAPE). As seen in the visuals above, as the sample size of our time series training set decreased, its accuracy increased. 

This is likely due to non-linear rise in New York home values, especially from 2018 to 2025. The smaller time series training sets were less influenced by the lesser home values of the early parts of the 2010s.
```{r, Error Analysis}
####### Define Error Objects
TS_Forecasts1_Errors = Testing_Data[,2] - TSMean1
TS_Forecasts2_Errors = Testing_Data[,2] - TSMean2
TS_Forecasts3_Errors = Testing_Data[,2] - TSMean3
TS_Forecasts4_Errors = Testing_Data[,2] - TSMean4

####### ME
TS1_ME = sum(TS_Forecasts1_Errors)/nrow(TS_Forecasts1_Errors)
TS2_ME = sum(TS_Forecasts2_Errors)/nrow(TS_Forecasts2_Errors)
TS3_ME = sum(TS_Forecasts3_Errors)/nrow(TS_Forecasts3_Errors)
TS4_ME = sum(TS_Forecasts4_Errors)/nrow(TS_Forecasts4_Errors)

####### MSE
TS1_MSE = sum(TS_Forecasts1_Errors^2)/nrow(TS_Forecasts1_Errors)
TS2_MSE = sum(TS_Forecasts2_Errors^2)/nrow(TS_Forecasts2_Errors)
TS3_MSE = sum(TS_Forecasts3_Errors^2)/nrow(TS_Forecasts3_Errors)
TS4_MSE = sum(TS_Forecasts4_Errors^2)/nrow(TS_Forecasts4_Errors)

####### MAPE
TS1_MAPE = (sum(abs((TS_Forecasts1_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts1_Errors)
TS2_MAPE = (sum(abs((TS_Forecasts2_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts2_Errors)
TS3_MAPE = (sum(abs((TS_Forecasts3_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts3_Errors)
TS4_MAPE = (sum(abs((TS_Forecasts4_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts4_Errors)

##### Table to Visualize
All_MEs = rbind(TS1_ME, TS2_ME, TS3_ME, TS4_ME)
All_MSEs = rbind(TS1_MSE, TS2_MSE, TS3_MSE, TS4_MSE)
All_MAPEs = rbind(TS1_MAPE, TS2_MAPE, TS3_MAPE, TS4_MAPE)

Forecast_Errors = cbind(All_MEs, All_MSEs, All_MAPEs)
colnames(Forecast_Errors) = c("ME", "MSE", "MAPE")
rownames(Forecast_Errors) = c("TS1", "TS2", "TS3", "TS4")
kable(Forecast_Errors, caption = "Model Forecasts' Accuracy Measures")
```

# Conclusion
Ultimately, when looking at New York state's home values over the past 15 years, forecasting monthly index values via STL-decomposed training sets did not yield particularly accurate results. Despite training our forecast algorithm with the drift method, which is meant to take into account historical increases and decreases, all four of our time series underestimated the true values of New York's home value index from August of 2024 to January of 2025.

In a future analysis, it would possibly be valuable during the training set process, to widen the scope of sample size selection. Doing so would allow us to see at which point does constraining the time interval of observations for the training data become detrimental. As in this instance, with the maximum number of observations coming out being set to 30, we saw a directly negative relationship between sample size of the training set and forecasts' accuracy.

# References
Original Dataset Source:

- https://www.kaggle.com/datasets/robikscube/zillow-home-value-index

Dataset Download Link via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/Zillow_Data.csv