Introduction and
Background
The dataset in use today originally stems from the extremely popular
real estate website, Zillow. It is a compilation of Zillow’s home value
index rankings broken down in a month-by-month and state-by-state
manner, beginning in January of 2000 and being most recently updated in
September of 2025.
Technically speaking, the dataset’s structure is fairly
straightforward. There are 51 variables, with the first day of each
month acting as a date object within R in YYYY-MM-DD format, and every
state in the nation’s home value index getting its own variable, stored
as a double (units are $USD). There are 228 missing
values across the dataset, with the vast majority of them coming from
North Dakota, Montana, Wyoming and New Mexico.
For my analysis, I will be specifically focusing on the typical home
values in the state of New York, and constraining my study to the 15
year time window of January 2010 to January of 2025. I will not be
primarily concerned with associative and external exploration as to
why the changes in home prices have occurred in the manner
which they have. Rather I will be applying a longitudinal perspective
and developing a number of time series models with the hope of creating
a model that can reliably, based on the historical data at hand, predict
future New York home values.
Plot of the Data
Below we see a plot of the standard New York home’s value over the
past 15 years. From this plot, there is one obvious takeaway that
instantly comes to the viewer’s mind. That being, there is a clear,
noticeable and approximately consistent increase in home values
throughout the duration between January of 2010 and January of 2025. To
quantify; the home value index for New York in January of 2025 was about
$496,846 per home, while in January of 2010 that same metric was only
$267,153. This is a staggering increase of about 85.98%.
Time_Series = ts(TS_Data$NY, frequency = 12, start = c(2010,1))
# frequency set to 12 since our dataset is composed of years, which are 12-month periods
# keyword start ensures that the axes on graph below will be accurate, that ts function recognizes that the first observation is the first of 12 of value 2010
# using xaxt = "n" will override the graph's default axis. I want the x intervals to be more frequent than every 5 years
plot((Time_Series/1000), xaxt="n",
main = "New York State's Home Value Index, 2010 - 2025",
xlab = "Year",
ylab = "Value (1000's $USD)")
axis(1,
at = seq(2010, 2025, by = 1),
labels = seq(2010, 2025, by = 1))

Decomposition
Analysis
Before creating time series models of varying sizes and using them to
forecast future values, I performed decomposition on the dataset.
Decomposition is basically the process of breaking down the makeup of a
time series into its different key components. These components being
the seasonal patterns, overall trend, and random error (sometimes called
remainder) that exist within the data. Below are visuals from two
decomposition processes; classical and STL.
Classical
Decomposition
Here is a display of the New York home value index data from 2010 to
2025, post classical decomposition. There are a few noticeable
takeaways. Looking at the trend component, we see a clear rise in home
values over the 15-year period. This is not surprising, as it was also
clearly visible in our original plot above. Regarding seasonality, we
see consistent oscillation at approximately the same interval length
throughout the dataset. On average throughout the years, New York homes
had their highest values in the Fall (~ $340,676) and their lowest in
the Spring (~ 332,837). Finally, when it comes to the random error
component of the decomposed time series, we see a spike in the
remainder’s magnitude after 2020. This can most likely be attributed to
the economic uncertainty related to the COVID pandemic.
Classic_Decomp = decompose(Time_Series)
plot(Classic_Decomp)

Summer = filter(Data, Season == "Summer")
Fall = filter(Data, Season == "Fall")
Winter = filter(Data, Season == "Winter")
Spring = filter(Data, Season == "Spring")
invisible(round(mean(Summer$NY)))
invisible(round(mean(Fall$NY)))
invisible(round(mean(Winter$NY)))
invisible(round(mean(Spring$NY)))
Seasonal and Trend
LOESS (STL) Decomposition
The STL decomposition of the data tells practically the same story of
the classical decomposition. The only large difference being that this
decomposition shows a much larger magnitude of seasonal oscillation from
about 2020 to 2025, while the classical decomposition shows equivalent
seasonal oscillation throughout.
Generally speaking, STL decomposition is considered a more robust
procedure than its classical counterpart, as STL is more effective in
dealing with outliers and non-linearity in data. The STL method uses
LOESS (modern locally estimated scatterplot smoothing), while the
classical method uses a more basic moving average. That is why we see
variation in the seasonality of the data in our STL visual.
Going forward, I used the STL method when formulating the time series
models.
STL_Decomp = stl(Time_Series, s.window = 12)
# the s.window argument essentially tells the stl function how to estimate the seasonal component. Since I put the number 12, that is basically telling the function that every 12 observations (months) the seasonal pattern (year) should be reset to re-estimate for the next series of observations. If I was trying to estimate the seasonal component for summer/fall/winter/spring, I would set s.window = to 3
plot(STL_Decomp,
main = "STL Decomposition of Time Series")

Creating Training and
Testing Sets
Since it is bad statistical practice to test a model on the same data
which it is trained on, I devoted the data’s ten most recent
observations (April of 2024 to January of 2025) for post-creation
testing purposes. For the training data, I used R’s runif function to
randomly select four sample sizes (n) to assign to each of the
four training subsets of the data. All four of the training datasets
ended with the observation on March of 2024, and respectively began
(n - 1) months after January of 2010. I created the training
datasets (1 through 4) in sequential order of descending sample size
(set 1 has the most observations, set 2 has the second most, etc…).
After subsetting the data into these four separate training sets, I
performed STL decomposition on each one and then forecasted each model’s
next ten observations (the same ten which make up the known testing
dataset). For each model I used the drift method for forecasting, as the
drift accounts for historical increases and decreases in a response
variable over time, something that is very prevalent in this
dataset.
# Training dataset of 171 observations, most we want to take out is about 30.
# use runif function for random selection, to select four numbers between 1 and 30, our training subsets of the data will go from that randomly selected observation point up to point 171 (March of 2024)
# sort(round(runif(4, min = 1, max = 30)))
# Results are 9, 12, 16, 25
Testing_Data = TS_Data[172:181,]
# Training set 1:
Training_Data1 = TS_Data[9:171, ]
# Training set 2:
Training_Data2 = TS_Data[12:171, ]
# Training set 3:
Training_Data3 = TS_Data[16:171, ]
# Training set 4:
Training_Data4 = TS_Data[25:171, ]
### Now that training and testing subsets of data have been created, make time series objects for each
TS_Train1 = ts(Training_Data1$NY, frequency = 12, start = c(2010, 9))
TS_Train2 = ts(Training_Data2$NY, frequency = 12, start = c(2010, 12))
TS_Train3 = ts(Training_Data3$NY, frequency = 12, start = c(2011, 4))
TS_Train4 = ts(Training_Data4$NY, frequency = 12, start = c(2012, 1))
### going to use the stl() function to decompose each of the training dataset time series objects, then use those decomposed objects for forecasting
STL_Training1 = stl(TS_Train1, s.window = 12)
STL_Training2 = stl(TS_Train2, s.window = 12)
STL_Training3 = stl(TS_Train3, s.window = 12)
STL_Training4 = stl(TS_Train4, s.window = 12)
### Forecast calculations with decomposed training set time series objects
# Set h = 10 as there is 10 observations in the testing dataset, so we can do a one-to-one comparison between our training forecasts for each time series model and our observed values in the testing dataset (April 2024 to January 2025)
Train_Set1_Forecasts = forecast(STL_Training1, h = 10, method = "rwdrift")
Train_Set2_Forecasts = forecast(STL_Training2, h = 10, method = "rwdrift")
Train_Set3_Forecasts = forecast(STL_Training3, h = 10, method = "rwdrift")
Train_Set4_Forecasts = forecast(STL_Training4, h = 10, method = "rwdrift")
Examining each
Training Set’s Forecasted Values
Below I created both a table to quantify each training set’s
forecasts as well as a graph to visualize their performances relative to
both one another and the true New York home values
provided in the original dataset. As we can see, all of the models
underestimated the rise in home values that would take place from August
of 2024 to January of 2025.
# Use the $mean selection to see the forecasted predicted values at particular points
TSMean1 = data.frame(Train_Set1_Forecasts$mean)
colnames(TSMean1) = "Errors"
TSMean2 = data.frame(Train_Set2_Forecasts$mean)
colnames(TSMean2) = "Errors"
TSMean3 = data.frame(Train_Set3_Forecasts$mean)
colnames(TSMean3) = "Errors"
TSMean4 = data.frame(Train_Set4_Forecasts$mean)
colnames(TSMean4) = "Errors"
Years = data.frame(TS_Data$Date[172:181])
Forecasted_Values = cbind(Years, TSMean1, TSMean2, TSMean3, TSMean4, Testing_Data[,2])
colnames(Forecasted_Values) = c("Date", "TS 1", "TS 2", "TS 3", "TS 4", "True Values")
kable(Forecasted_Values, caption = "Forecasted Values Per Training Set")
Forecasted Values Per Training Set
| 2024-04-01 |
471873.8 |
471906.1 |
471970.9 |
472416.2 |
474674.4 |
| 2024-05-01 |
473687.4 |
473749.5 |
473884.2 |
474484.1 |
478114.6 |
| 2024-06-01 |
475432.6 |
475524.3 |
475729.0 |
476477.5 |
480880.8 |
| 2024-07-01 |
476944.1 |
477081.9 |
477258.2 |
478134.8 |
483467.0 |
| 2024-08-01 |
478031.1 |
478215.0 |
478421.9 |
479375.2 |
486631.5 |
| 2024-09-01 |
478733.7 |
478963.6 |
479201.1 |
480193.7 |
489556.9 |
| 2024-10-01 |
479461.4 |
479739.6 |
480047.0 |
481079.5 |
492028.3 |
| 2024-11-01 |
480458.1 |
480784.5 |
481130.9 |
482225.5 |
493899.7 |
| 2024-12-01 |
481615.7 |
481990.3 |
482375.8 |
483545.4 |
495692.6 |
| 2025-01-01 |
482804.7 |
483219.2 |
483708.2 |
485002.2 |
496846.3 |
List_Of_Dates = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01", "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01", "2024-12-01", "2025-01-01"))
TS1 = data.frame(List_Of_Dates, as.double(TSMean1$Errors))
TS2 = data.frame(List_Of_Dates, as.double(TSMean2$Errors))
TS3 = data.frame(List_Of_Dates, as.double(TSMean3$Errors))
TS4 = data.frame(List_Of_Dates, as.double(TSMean4$Errors))
plot(Testing_Data, col = "blue", lwd = 2,
main = "Forecasted Vs True New York Home Values \n (August 2024 - January 2025)",
xlab = "Date",
ylab = "Value($USD)")
lines(Testing_Data, col = "blue", lwd = 2)
lines(TS1, col = "red", lwd = 2)
points(TS1, col = "red")
lines(TS2, col = "black", lwd = 2)
points(TS2, col = "black")
lines(TS3, col = "green", lwd = 2)
points(TS3, col = "green")
lines(TS4, col = "purple", lwd = 2)
points(TS4, col = "purple")
legend("topleft",
legend = c("True Values", "Time Series 1", "Time Series 2",
"Time Series 3", "Time Series 4"),
col = c("blue", "red", "black", "green", "purple"),
lty = 1, pch = 16, lwd = 2)

Training Set Forecast
Error Analysis
As we saw above, there was a consistent underestimation by each of
the training set time series. However, there appeared to be a linear and
negative correlation between the sample size of the training set, and
that set’s forecasts relative to the true values. Training set 4 had the
smallest sample size (beginning at January of 2012) and was the most
accurate.
A numeric breakdown of each training sets forecasts’ accuracy is
below. The metrics I used were mean error (ME), mean square error (MSE)
and mean absolute prediction error (MAPE). As seen in the visuals above,
as the sample size of our time series training set decreased, its
accuracy increased.
This is likely due to non-linear rise in New York home values,
especially from 2018 to 2025. The smaller time series training sets were
less influenced by the lesser home values of the early parts of the
2010s.
####### Define Error Objects
TS_Forecasts1_Errors = Testing_Data[,2] - TSMean1
TS_Forecasts2_Errors = Testing_Data[,2] - TSMean2
TS_Forecasts3_Errors = Testing_Data[,2] - TSMean3
TS_Forecasts4_Errors = Testing_Data[,2] - TSMean4
####### ME
TS1_ME = sum(TS_Forecasts1_Errors)/nrow(TS_Forecasts1_Errors)
TS2_ME = sum(TS_Forecasts2_Errors)/nrow(TS_Forecasts2_Errors)
TS3_ME = sum(TS_Forecasts3_Errors)/nrow(TS_Forecasts3_Errors)
TS4_ME = sum(TS_Forecasts4_Errors)/nrow(TS_Forecasts4_Errors)
####### MSE
TS1_MSE = sum(TS_Forecasts1_Errors^2)/nrow(TS_Forecasts1_Errors)
TS2_MSE = sum(TS_Forecasts2_Errors^2)/nrow(TS_Forecasts2_Errors)
TS3_MSE = sum(TS_Forecasts3_Errors^2)/nrow(TS_Forecasts3_Errors)
TS4_MSE = sum(TS_Forecasts4_Errors^2)/nrow(TS_Forecasts4_Errors)
####### MAPE
TS1_MAPE = (sum(abs((TS_Forecasts1_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts1_Errors)
TS2_MAPE = (sum(abs((TS_Forecasts2_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts2_Errors)
TS3_MAPE = (sum(abs((TS_Forecasts3_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts3_Errors)
TS4_MAPE = (sum(abs((TS_Forecasts4_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts4_Errors)
##### Table to Visualize
All_MEs = rbind(TS1_ME, TS2_ME, TS3_ME, TS4_ME)
All_MSEs = rbind(TS1_MSE, TS2_MSE, TS3_MSE, TS4_MSE)
All_MAPEs = rbind(TS1_MAPE, TS2_MAPE, TS3_MAPE, TS4_MAPE)
Forecast_Errors = cbind(All_MEs, All_MSEs, All_MAPEs)
colnames(Forecast_Errors) = c("ME", "MSE", "MAPE")
rownames(Forecast_Errors) = c("TS1", "TS2", "TS3", "TS4")
kable(Forecast_Errors, caption = "Model Forecasts' Accuracy Measures")
Model Forecasts’ Accuracy Measures
| TS1 |
9274.949 |
102471188 |
1.891789 |
| TS2 |
9061.826 |
97571016 |
1.848419 |
| TS3 |
8806.477 |
92147504 |
1.796344 |
| TS4 |
7885.797 |
75042461 |
1.608069 |
Conclusion
Ultimately, when looking at New York state’s home values over the
past 15 years, forecasting monthly index values via STL-decomposed
training sets did not yield particularly accurate results. Despite
training our forecast algorithm with the drift method, which is meant to
take into account historical increases and decreases, all four of our
time series underestimated the true values of New York’s home value
index from August of 2024 to January of 2025.
In a future analysis, it would possibly be valuable during the
training set process, to widen the scope of sample size selection. Doing
so would allow us to see at which point does constraining the time
interval of observations for the training data become detrimental. As in
this instance, with the maximum number of observations coming out being
set to 30, we saw a directly negative relationship between sample size
of the training set and forecasts’ accuracy.
References
Original Dataset Source:
Dataset Download Link via Github:
---
title: "New York State's Monthly Home Value Index (Since 2010)"
author: "Chris Bahm"
date: "2025-11-21"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(forecast)) {library(forecast)} 

if (!require(lubridate)) {library(lubridate)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
```

# Introduction and Background 
The dataset in use today originally stems from the extremely popular real estate website, Zillow. It is a compilation of Zillow's home value index rankings broken down in a month-by-month and state-by-state manner, beginning in January of 2000 and being most recently updated in September of 2025.

Technically speaking, the dataset's structure is fairly straightforward. There are 51 variables, with the first day of each month acting as a date object within R in YYYY-MM-DD format, and every state in the nation's home value index getting its own variable, stored as a double **(units are $USD)**. There are 228 missing values across the dataset, with the vast majority of them coming from North Dakota, Montana, Wyoming and New Mexico.

For my analysis, I will be specifically focusing on the typical home values in the state of New York, and constraining my study to the 15 year time window of January 2010 to January of 2025. I will not be primarily concerned with associative and external exploration as to *why* the changes in home prices have occurred in the manner which they have. Rather I will be applying a longitudinal perspective and developing a number of time series models with the hope of creating a model that can reliably, based on the historical data at hand, predict future New York home values.

```{r Data Loading and Cleaning, include=FALSE}
Data_Url= "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/Zillow_Data.csv"

Original_Data = read_delim(Data_Url, delim = ",", show_col_types = FALSE)
sum(is.na(Original_Data)) # = 228 missing values

  # Breakdown of missing values per state
colSums(is.na(Original_Data))[colSums(is.na(Original_Data)) > 0]

# Will use the last 15 years of data, 15 years * 12 months = 180 observations

Data = Original_Data[121:301,]
colnames(Data)[1] = "Date"
glimpse(Data)
  # All variables formatted right, Month is date object and the rest are doubles

  # Going to conduct univariate analysis, with New York's average home value index (in $USD) as the response in this time series

Data = Data[ ,c("Date", "New York")]
colnames(Data)[2] = "NY"
  # ^ For coding convenience

# Want to label my data by month, season and year.
  # Used AI for season assignment below
Data = Data %>%
  mutate(
    Year = year(Date),
    Month = month(Date, label = TRUE, abbr = TRUE),
    Season = case_when(
      month(Date) %in% c(12, 1, 2) ~ "Winter",
      month(Date) %in% c(3, 4, 5)  ~ "Spring",
      month(Date) %in% c(6, 7, 8)  ~ "Summer",
      month(Date) %in% c(9, 10, 11) ~ "Fall"
    )
  )

TS_Data = Data[,1:2]
```

# Plot of the Data
Below we see a plot of the standard New York home's value over the past 15 years. From this plot, there is one obvious takeaway that instantly comes to the viewer's mind. That being, there is a clear, noticeable and approximately consistent increase in home values throughout the duration between January of 2010 and January of 2025. To quantify; the home value index for New York in January of 2025 was about $496,846 per home, while in January of 2010 that same metric was only $267,153. This is a staggering increase of about `r round((((496846 - 267153)/267153) * 100), 2)`%. 
```{r, fig.align ='center'}
Time_Series = ts(TS_Data$NY, frequency = 12, start = c(2010,1))
  # frequency set to 12 since our dataset is composed of years, which are 12-month periods
  # keyword start ensures that the axes on graph below will be accurate, that ts function recognizes that the first observation is the first of 12 of value 2010

# using xaxt = "n" will override the graph's default axis. I want the x intervals to be more frequent than every 5 years
plot((Time_Series/1000), xaxt="n",
     main = "New York State's Home Value Index, 2010 - 2025",
     xlab = "Year",
     ylab = "Value (1000's $USD)")
axis(1,
     at = seq(2010, 2025, by = 1),
     labels = seq(2010, 2025, by = 1))
```


# Decomposition Analysis
Before creating time series models of varying sizes and using them to forecast future values, I performed decomposition on the dataset. Decomposition is basically the process of breaking down the makeup of a time series into its different key components. These components being the seasonal patterns, overall trend, and random error (sometimes called remainder) that exist within the data. Below are visuals from two decomposition processes; classical and STL.

## Classical Decomposition
Here is a display of the New York home value index data from 2010 to 2025, post classical decomposition. There are a few noticeable takeaways. Looking at the trend component, we see a clear rise in home values over the 15-year period. This is not surprising, as it was also clearly visible in our original plot above. Regarding seasonality, we see consistent oscillation at approximately the same interval length throughout the dataset. On average throughout the years, New York homes had their highest values in the Fall (~ $340,676) and their lowest in the Spring (~ 332,837). Finally, when it comes to the random error component of the decomposed time series, we see a spike in the remainder's magnitude after 2020. This can most likely be attributed to the economic uncertainty related to the COVID pandemic.
```{r, fig.align ='center'}
Classic_Decomp = decompose(Time_Series)
plot(Classic_Decomp)

  Summer = filter(Data, Season == "Summer")
  Fall = filter(Data, Season == "Fall")
  Winter = filter(Data, Season == "Winter")
  Spring = filter(Data, Season == "Spring")
    
  invisible(round(mean(Summer$NY)))
  invisible(round(mean(Fall$NY)))
  invisible(round(mean(Winter$NY)))
  invisible(round(mean(Spring$NY)))
```

## Seasonal and Trend LOESS (STL) Decomposition
The STL decomposition of the data tells practically the same story of the classical decomposition. The only large difference being that this decomposition shows a much larger magnitude of seasonal oscillation from about 2020 to 2025, while the classical decomposition shows equivalent seasonal oscillation throughout.

Generally speaking, STL decomposition is considered a more robust procedure than its classical counterpart, as STL is more effective in dealing with outliers and non-linearity in data. The STL method uses LOESS (modern locally estimated scatterplot smoothing), while the classical method uses a more basic moving average. That is why we see variation in the seasonality of the data in our STL visual.

Going forward, I used the STL method when formulating the time series models.

```{r, fig.align ='center'}
STL_Decomp = stl(Time_Series, s.window = 12)
  # the s.window argument essentially tells the stl function how to estimate the seasonal component. Since I put the number 12, that is basically telling the function that every 12 observations (months) the seasonal pattern (year) should be reset to re-estimate for the next series of observations. If I was trying to estimate the seasonal component for summer/fall/winter/spring, I would set s.window = to 3

plot(STL_Decomp,
     main = "STL Decomposition of Time Series")
```

# Creating Training and Testing Sets
Since it is bad statistical practice to test a model on the same data which it is trained on, I devoted the data's ten most recent observations (April of 2024 to January of 2025) for post-creation testing purposes. For the training data, I used R's runif function to randomly select four sample sizes (*n*) to assign to each of the four training subsets of the data. All four of the training datasets ended with the observation on March of 2024, and respectively began (*n* - 1) months after January of 2010. I created the training datasets (1 through 4) in sequential order of descending sample size (set 1 has the most observations, set 2 has the second most, etc...).

After subsetting the data into these four separate training sets, I performed STL decomposition on each one and then forecasted each model's next ten observations (the same ten which make up the known testing dataset). For each model I used the drift method for forecasting, as the drift accounts for historical increases and decreases in a response variable over time, something that is very prevalent in this dataset.
```{r, Making Training Sets}
# Training dataset of 171 observations, most we want to take out is about 30.
# use runif function for random selection, to select four numbers between 1 and 30, our training subsets of the data will go from that randomly selected observation point up to point 171 (March of 2024)
   # sort(round(runif(4, min = 1, max = 30)))
    # Results are 9, 12, 16, 25

Testing_Data = TS_Data[172:181,]

# Training set 1: 
Training_Data1 = TS_Data[9:171, ]

# Training set 2: 
Training_Data2 = TS_Data[12:171, ]

# Training set 3: 
Training_Data3 = TS_Data[16:171, ]

# Training set 4: 
Training_Data4 = TS_Data[25:171, ]


### Now that training and testing subsets of data have been created, make time series objects for each

TS_Train1 = ts(Training_Data1$NY, frequency = 12, start = c(2010, 9))
TS_Train2 = ts(Training_Data2$NY, frequency = 12, start = c(2010, 12))
TS_Train3 = ts(Training_Data3$NY, frequency = 12, start = c(2011, 4))
TS_Train4 = ts(Training_Data4$NY, frequency = 12, start = c(2012, 1))

### going to use the stl() function to decompose each of the training dataset time series objects, then use those decomposed objects for forecasting

STL_Training1 = stl(TS_Train1, s.window = 12)
STL_Training2 = stl(TS_Train2, s.window = 12)
STL_Training3 = stl(TS_Train3, s.window = 12)
STL_Training4 = stl(TS_Train4, s.window = 12)

### Forecast calculations with decomposed training set time series objects

  # Set h = 10 as there is 10 observations in the testing dataset, so we can do a one-to-one comparison between our training forecasts for each time series model and our observed values in the testing dataset (April 2024 to January 2025)
Train_Set1_Forecasts = forecast(STL_Training1, h = 10, method = "rwdrift")
Train_Set2_Forecasts = forecast(STL_Training2, h = 10, method = "rwdrift")
Train_Set3_Forecasts = forecast(STL_Training3, h = 10, method = "rwdrift")
Train_Set4_Forecasts = forecast(STL_Training4, h = 10, method = "rwdrift")
```

## Examining each Training Set's Forecasted Values
Below I created both a table to quantify each training set's forecasts as well as a graph to visualize their performances relative to both one another and the **true** New York home values provided in the original dataset. As we can see, all of the models underestimated the rise in home values that would take place from August of 2024 to January of 2025.
```{r, fig.align='center'}
  # Use the $mean selection to see the forecasted predicted values at particular points
TSMean1 = data.frame(Train_Set1_Forecasts$mean)
  colnames(TSMean1) = "Errors"
TSMean2 = data.frame(Train_Set2_Forecasts$mean)
  colnames(TSMean2) = "Errors"
TSMean3 = data.frame(Train_Set3_Forecasts$mean)
  colnames(TSMean3) = "Errors"
TSMean4 = data.frame(Train_Set4_Forecasts$mean)
  colnames(TSMean4) = "Errors"
Years = data.frame(TS_Data$Date[172:181])

Forecasted_Values = cbind(Years, TSMean1, TSMean2, TSMean3, TSMean4, Testing_Data[,2])
colnames(Forecasted_Values) = c("Date", "TS 1", "TS 2", "TS 3", "TS 4", "True Values")
kable(Forecasted_Values, caption = "Forecasted Values Per Training Set")

List_Of_Dates = as.Date(c("2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01", "2024-08-01", "2024-09-01", "2024-10-01", "2024-11-01", "2024-12-01", "2025-01-01"))

TS1 = data.frame(List_Of_Dates, as.double(TSMean1$Errors))
TS2 = data.frame(List_Of_Dates, as.double(TSMean2$Errors))
TS3 = data.frame(List_Of_Dates, as.double(TSMean3$Errors))
TS4 = data.frame(List_Of_Dates, as.double(TSMean4$Errors))

plot(Testing_Data, col = "blue", lwd = 2,
     main = "Forecasted Vs True New York Home Values \n (August 2024 - January 2025)",
     xlab = "Date",
     ylab = "Value($USD)")
  lines(Testing_Data, col = "blue", lwd = 2)
  lines(TS1, col = "red", lwd = 2)
    points(TS1, col = "red")
  lines(TS2, col = "black", lwd = 2)
    points(TS2, col = "black")
  lines(TS3, col = "green", lwd = 2)
    points(TS3, col = "green")
  lines(TS4, col = "purple", lwd = 2)
    points(TS4, col = "purple")
    legend("topleft",
       legend = c("True Values", "Time Series 1", "Time Series 2",
                  "Time Series 3", "Time Series 4"),
       col = c("blue", "red", "black", "green", "purple"),
       lty = 1, pch = 16, lwd = 2)
```

## Training Set Forecast Error Analysis
As we saw above, there was a consistent underestimation by each of the training set time series. However, there appeared to be a linear and negative correlation between the sample size of the training set, and that set's forecasts relative to the true values. Training set 4 had the smallest sample size (beginning at January of 2012) and was the most accurate.

A numeric breakdown of each training sets forecasts' accuracy is below. The metrics I used were mean error (ME), mean square error (MSE) and mean absolute prediction error (MAPE). As seen in the visuals above, as the sample size of our time series training set decreased, its accuracy increased. 

This is likely due to non-linear rise in New York home values, especially from 2018 to 2025. The smaller time series training sets were less influenced by the lesser home values of the early parts of the 2010s.
```{r, Error Analysis}
####### Define Error Objects
TS_Forecasts1_Errors = Testing_Data[,2] - TSMean1
TS_Forecasts2_Errors = Testing_Data[,2] - TSMean2
TS_Forecasts3_Errors = Testing_Data[,2] - TSMean3
TS_Forecasts4_Errors = Testing_Data[,2] - TSMean4

####### ME
TS1_ME = sum(TS_Forecasts1_Errors)/nrow(TS_Forecasts1_Errors)
TS2_ME = sum(TS_Forecasts2_Errors)/nrow(TS_Forecasts2_Errors)
TS3_ME = sum(TS_Forecasts3_Errors)/nrow(TS_Forecasts3_Errors)
TS4_ME = sum(TS_Forecasts4_Errors)/nrow(TS_Forecasts4_Errors)

####### MSE
TS1_MSE = sum(TS_Forecasts1_Errors^2)/nrow(TS_Forecasts1_Errors)
TS2_MSE = sum(TS_Forecasts2_Errors^2)/nrow(TS_Forecasts2_Errors)
TS3_MSE = sum(TS_Forecasts3_Errors^2)/nrow(TS_Forecasts3_Errors)
TS4_MSE = sum(TS_Forecasts4_Errors^2)/nrow(TS_Forecasts4_Errors)

####### MAPE
TS1_MAPE = (sum(abs((TS_Forecasts1_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts1_Errors)
TS2_MAPE = (sum(abs((TS_Forecasts2_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts2_Errors)
TS3_MAPE = (sum(abs((TS_Forecasts3_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts3_Errors)
TS4_MAPE = (sum(abs((TS_Forecasts4_Errors/Testing_Data[,2])*100)))/nrow(TS_Forecasts4_Errors)

##### Table to Visualize
All_MEs = rbind(TS1_ME, TS2_ME, TS3_ME, TS4_ME)
All_MSEs = rbind(TS1_MSE, TS2_MSE, TS3_MSE, TS4_MSE)
All_MAPEs = rbind(TS1_MAPE, TS2_MAPE, TS3_MAPE, TS4_MAPE)

Forecast_Errors = cbind(All_MEs, All_MSEs, All_MAPEs)
colnames(Forecast_Errors) = c("ME", "MSE", "MAPE")
rownames(Forecast_Errors) = c("TS1", "TS2", "TS3", "TS4")
kable(Forecast_Errors, caption = "Model Forecasts' Accuracy Measures")
```

# Conclusion
Ultimately, when looking at New York state's home values over the past 15 years, forecasting monthly index values via STL-decomposed training sets did not yield particularly accurate results. Despite training our forecast algorithm with the drift method, which is meant to take into account historical increases and decreases, all four of our time series underestimated the true values of New York's home value index from August of 2024 to January of 2025.

In a future analysis, it would possibly be valuable during the training set process, to widen the scope of sample size selection. Doing so would allow us to see at which point does constraining the time interval of observations for the training data become detrimental. As in this instance, with the maximum number of observations coming out being set to 30, we saw a directly negative relationship between sample size of the training set and forecasts' accuracy.

# References
Original Dataset Source:

- https://www.kaggle.com/datasets/robikscube/zillow-home-value-index

Dataset Download Link via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/Zillow_Data.csv