Introduction and
Background
Here we have a dataset composed of historical records of the New York
Stock Exchange’s most famous index, the S&P 500. Made up of the top
500 companies for which trades across the United States are made, the
S&P 500 (or S&P), was believed to have had a market cap of just
over $57 trillion as of this past August. Since it’s
inception, the S&P has been used as a political benchmark for the
growth and overall health of the United States’ economy.
For my analysis, I wanted to specifically focus on what the opening
price of the S&P was at the first trading day of each year, which
ranged from January 2nd to January 4th depending on if the New Years’
holiday coincided with the weekend. On top of observing any possible
historical trends, I also used this data as grounds to test multiple
different time series modeling methods. A breakdown of the dataset’s
original structure is below.
|
Name
|
Meaning
|
Data_Type
|
|
Date
|
Date for that observation; YYYY-MM-DD form
|
Date
|
|
Open
|
S&P’s opening market price
|
double
|
|
High
|
That day’s highest recorded price
|
double
|
|
Low
|
That day’s lowest recorded price
|
double
|
|
Close
|
S&P’s closing market price
|
double
|
|
Adj Close
|
S&P’s closing market price, accounting for corporate activity*
|
double
|
|
Volume
|
Total number of shares traded in the market that day
|
double
|
- Corporate activity primarily includes, but is not limited to,
dividend payouts and stock splits
Model Creation
For my time series modeling experiments, I chose to divide my dataset
into separate portions, with the larger chunk being used for model
training and the later part being used for model testing. More
specifically, the S&P’s year-beginning market open from 1928 to 2010
I dedicated to training. While the observations from
2011 to 2020 will be used to test my model’s
accuracy.
I created four different time series models - Moving Average, Naive,
Seasonal Naive and Drift - using the training data. Each model was set
to have a forecast horizon (number of future intervals to predict) of
10. A table summary of each model’s predictions for the ten years after
our training data’s end is below.
##### Moving Average Model
Moving_Average = round(meanf(TS_Training_Data, h=10)$mean, 4)
# argument h represents the forecast horizon, which is the number of intervals into the future we are predicting values for
# Setting h=10 since we used 10 for the sample size of the testing data
##### Naive Model
Naive = round(naive(TS_Training_Data, h=10)$mean, 4)
##### Seasonal Naive Model
Seasonal_Naive = round(snaive(TS_Training_Data, h=10)$mean, 4)
##### Drift Model
Drift = round(rwf(TS_Training_Data, h=10, drift=TRUE)$mean, 4)
# function rwf stands for "random walk with drift"
# Note that we MUST set drift = TRUE in order for it to be a proper drift model
Year = c("2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020")
Baseline_Models = cbind(Year, Moving_Average, Naive, Seasonal_Naive, Drift)
colnames(Baseline_Models) = c("Year", "Moving Average", "Naive", "Seasonal Naive", "Drift")
kable(Baseline_Models, caption = "Model Forecasting Projections")
Model Forecasting Projections
| 2011 |
283.2408 |
1116.5601 |
1116.5601 |
1129.9601 |
| 2012 |
283.2408 |
1116.5601 |
1116.5601 |
1143.3601 |
| 2013 |
283.2408 |
1116.5601 |
1116.5601 |
1156.7601 |
| 2014 |
283.2408 |
1116.5601 |
1116.5601 |
1170.1601 |
| 2015 |
283.2408 |
1116.5601 |
1116.5601 |
1183.5601 |
| 2016 |
283.2408 |
1116.5601 |
1116.5601 |
1196.9601 |
| 2017 |
283.2408 |
1116.5601 |
1116.5601 |
1210.3601 |
| 2018 |
283.2408 |
1116.5601 |
1116.5601 |
1223.7601 |
| 2019 |
283.2408 |
1116.5601 |
1116.5601 |
1237.1601 |
| 2020 |
283.2408 |
1116.5601 |
1116.5601 |
1250.5601 |
Looking at our summary table, we see a few interesting takeaways.
First, we see that the year-to-year predicted values from the
moving average model are far beneath those of the other
3 models. This is because the moving average method operates on the
assumption that all future values will be equal to the mean of all
previous values. A large amount of our training data (58 of the 73)
years, had a market open below $200.
We also notice there is total equality between the naive and
seasonal naive models. The naive method assumes that all future
observations will have equal value to that of the most previous
observation (which explains why the predicted value is the same for each
year in the string of forecasts). Meanwhile, the seasonal naive approach
applies the premise of the naive method to data that operates in a
cyclical manner (think frequencies of weather related events tied to
seasonal changes). In the case of our dataset, each S&P market
opening is occurring at approximately the same day every year,
making the naive and seasonal naive model results practically
identical.
Lastly, we see that the drift model is the only one in which
forecasted values differ from year to year. Unlike the other three
methods, the drift method does account for historical averages in rate
of increase and decrease as well as averages in overall value as well.
Given that the S&P has always gained value when looked at from a
long term perspective, it is not surprising that our drift forecasts
show an average increase of about 13 to 14 USD in projected market
opening from one year to the next.
Model Forecast
Visualizations
After examining the model forecasts year-by-year, I decided to graph
each model’s predictions alongside the known historical data to allow
for a quick examination of how accurate each model was. The first graph
consists of the entire scope of this dataset’s observations (1929 to
2020), while the second graph I constrained to this century’s market
opens. It should be noted that, in both graphs, the representation of
the naive model forecasts can be difficult to see as they perfectly
overlap with those of the seasonal naive model.
DF_Predictions = cbind(2011:2020, data.frame(Moving_Average), data.frame(Naive), data.frame(Seasonal_Naive), data.frame(Drift))
colnames(DF_Predictions) = c("Year", "Moving Average", "Naive", "Seasonal Naive", "Drift")
Full_Graph = (ggplot() +
geom_line(data = First_Day, mapping = aes (x = Year, y = Open)) +
geom_point(data = First_Day, mapping = aes(x = Year, y = Open)) +
labs(x = "Year", y = "S&P 500 Open ($)", title ="S&P 500 First Market Open of the Year") +
geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average, color = "Moving Average")) +
geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Naive)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Naive, color = "Naive")) +
geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Seasonal Naive`)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Seasonal_Naive, color = "Seasonal Naive")) +
geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Drift`)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Drift, color = "Drift")) +
scale_x_continuous(
limits = c(1925, 2020),
breaks = seq(1930, 2020, 10)
) +
scale_y_continuous(
limits = c(0, 3250),
breaks = seq(0, 3500, 500)
))
Full_Graph

########################
Reduced_Graph = (ggplot() +
geom_line(data = First_Day, mapping = aes (x = Year, y = Open)) +
geom_point(data = First_Day, mapping = aes(x = Year, y = Open)) +
labs(x = "Year", y = "S&P 500 Open ($)", title ="S&P 500 First Market Open of the Year (2000 - 2020)") +
geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average, color = "Moving Average",)) +
geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Naive)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Naive, color = "Naive")) +
geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Seasonal Naive`)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Seasonal_Naive, color = "Seasonal Naive")) +
geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Drift`)) +
geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Drift, color = "Drift")) +
scale_x_continuous(
limits = c(2000, 2020),
breaks = seq(2000, 2020, 4)
) +
scale_y_continuous(
limits = c(250, 3250),
breaks = seq(250, 3250, 500)
)
)
Reduced_Graph
We can see from our two graphs that, like seen in the summary table
above, the moving average model’s forecasts are well beneath those of
the other three models. That is because of the much greater influence
that the first roughly 50 years of the S&P’s opening market price
has on the moving average’s calculation than it does on the naive,
seasonal naive or drift. The average opening price from 1928 to 1978 was
only about $45.64.
Additionally, while our other three models’ forecasts resemble that
of the index’ true opening value far greater than those of our moving
average model, they also consistently underestimated. The drift model’s
predicted values were the most accurate, which is not suprising given
that, as previously said, the drift approach does account for historical
trends in increase or decrease. The reason the drift forecasts’ trend
was in the proper direction (increasing over time), but at a rate far
lesser than the true rate of market open increase is due to the
extraordinary weakness of the S&P opening’s upward trend throughout
the first five decades or so of our dataset.
Model Accuracy
Measures
Finally, I wanted to quantify the strength of my four models using
some standard measures of modeling accuracy. Below is a breakdown of
each model’s mean error (ME), mean square error (MSE) and mean absolute
prediction error (MAPE). Mean error and mean square error are considered
measures of absolute error, and are a good reference statistic to
summarize how “off” our models’ forecasts were on average. Meanwhile,
MAPE is called a measure of relative error, as it tells us on average
how “off” our models’ forecasts were as a percent.
None of the metrics calculated in the table below are particularly
surprising. They are in-line with our previous graphical displays and
forecast summary table. Our drift model is the most accurate, our naive
and seasonal naive models are slightly worse, and the moving average is
significantly less useful than the others.
# Will find mean error (ME), mean squared error (MSE) and mean absolute prediction error (MAPE) for each of the 4 baseline models. Then cbind them all into a table and output with kable function.
### ME first
Moving_Average_Errors = data.frame(DF_Predictions$`Moving Average` - Testing_Data$Open)
Moving_Average_ME = sum(Moving_Average_Errors)/nrow(Moving_Average_Errors)
Naive_Errors = data.frame(DF_Predictions$Naive - Testing_Data$Open)
Naive_ME = sum(Naive_Errors)/nrow(Naive_Errors)
Seasonal_Naive_Errors = data.frame(DF_Predictions$`Seasonal Naive` - Testing_Data$Open)
Seasonal_Naive_ME = sum(Seasonal_Naive_Errors)/nrow(Seasonal_Naive_Errors)
Drift_Errors = data.frame(DF_Predictions$Drift - Testing_Data$Open)
Drift_ME = sum(Drift_Errors)/nrow(Drift_Errors)
All_Model_MEs = rbind(Moving_Average_ME, Naive_ME, Seasonal_Naive_ME, Drift_ME)
### MSE second
# Create squared error data frames
Moving_Average_SQErrors = data.frame(Moving_Average_Errors^2)
Naive_SQErrors = data.frame(Naive_Errors^2)
Seasonal_Naive_SQErrors = data.frame(Seasonal_Naive_Errors^2)
Drift_SQErrors = data.frame(Drift_Errors^2)
# Use squared error data frames to create MSE objects
Moving_Average_MSE = sum(Moving_Average_SQErrors)/nrow(Moving_Average_SQErrors)
Naive_MSE = sum(Naive_SQErrors)/nrow(Naive_SQErrors)
Seasonal_Naive_MSE = sum(Seasonal_Naive_SQErrors)/nrow(Seasonal_Naive_SQErrors)
Drift_MSE = sum(Drift_SQErrors)/nrow(Drift_SQErrors)
All_Model_MSEs = rbind(Moving_Average_MSE, Naive_MSE, Seasonal_Naive_MSE, Drift_MSE)
#MAPE third
Moving_Average_PEI = data.frame((Moving_Average_Errors/Testing_Data$Open)*100)
Moving_Average_MAPE = round(sum(abs(Moving_Average_PEI))/nrow(Moving_Average_PEI),4)
Naive_PEI = data.frame((Naive_Errors/Testing_Data$Open)*100)
Naive_MAPE = round(sum(abs(Naive_PEI))/nrow(Naive_PEI),4)
Seasonal_Naive_PEI = data.frame((Seasonal_Naive_Errors/Testing_Data$Open)*100)
Seasonal_Naive_MAPE = round(sum(abs(Seasonal_Naive_PEI))/nrow(Seasonal_Naive_PEI),4)
Drift_PEI = data.frame((Drift_Errors/Testing_Data$Open)*100)
Drift_MAPE = round(sum(abs(Drift_PEI))/nrow(Drift_PEI),4)
All_Model_MAPEs = rbind(Moving_Average_MAPE, Naive_MAPE, Seasonal_Naive_MAPE, Drift_MAPE)
Model_Accuracy = cbind(All_Model_MEs, All_Model_MSEs, All_Model_MAPEs)
rownames(Model_Accuracy) = c("Moving Average", "Naive", "Seasonal Naive", "Drift")
colnames(Model_Accuracy) = c("ME", "MSE", "MAPE")
kable(Model_Accuracy, caption = "Model Accuracy Summary")
Model Accuracy Summary
| Moving Average |
-1771.0152 |
3510134 |
84.8822 |
| Naive |
-937.6959 |
1252913 |
40.4043 |
| Seasonal Naive |
-937.6959 |
1252913 |
40.4043 |
| Drift |
-863.9959 |
1076191 |
37.0768 |
Conclusion
In conclusion, we can see that the straightforward application of
these four standard and common time series modeling methods on the
historical stock market data that was examined, were not particularly
useful in forecasting future S&P 500 opening market prices. This was
largely due to the consistently low market opens throughout the first 50
or so years of the S&P index. In the future, it would be interesting
to examine the model accuracy of these methods - particularly the moving
average and drift - if the range of the training data was greatly
reduced to only include more recent decades.
References:
Original Dataset Source:
Dataset Download Link via Github:
---
title: "Analysis of Year-to-Year S&P 500 Opening Price"
author: "Chris Bahm"
date: "2025-11-14"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(forecast)) {library(forecast)} 

if (!require(lubridate)) {library(lubridate)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
```

# Introduction and Background
Here we have a dataset composed of historical records of the New York Stock Exchange's most famous index, the S&P 500. Made up of the top 500 companies for which trades across the United States are made, the S&P 500 (or S&P), was believed to have had a market cap of just over **$57 trillion** as of this past August. Since it's inception, the S&P has been used as a political benchmark for the growth and overall health of the United States' economy. 

For my analysis, I wanted to specifically focus on what the opening price of the S&P was at the first trading day of each year, which ranged from January 2nd to January 4th depending on if the New Years' holiday coincided with the weekend. On top of observing any possible historical trends, I also used this data as grounds to test multiple different time series modeling methods. A breakdown of the dataset's original structure is below.

```{r Data Loading and Cleaning, include=FALSE}
Data_Url= "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/SPX.csv"

Data = read_delim(Data_Url, delim = ",", show_col_types = FALSE)

sum(is.na(Data))
# No missing values in dataset

glimpse(Data)
# All variables formatted properly
  # Open, High, Low, Close and Adj Close all in single dollar units


# Below created a subset of the data, looking at first trading day of each year
First_Day = Data %>%
  # Extract year from each date
  mutate(Year = year(Date)) %>%
  # Filter only years 1928–2019
  filter(Year >= 1928) %>%
  # Group by year
  group_by(Year) %>%
  # Pick the row with the earliest date in that year
  slice_min(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

First_Day_Opening = First_Day[ ,c("Date", "Open")]

# Years 1928 to 2010 will be for model training
Training_Data = First_Day_Opening[1:83,]

# Years 2011 to 2020 will be for model testing
Testing_Data = First_Day_Opening[84:93,]

# We create the time series object with the ts() function, coming from forecast package
TS_Training_Data = ts(Training_Data$Open,
                      start = 1928,
                      end = 2010,
                      frequency = 1)

invisible(TS_Training_Data) 
```

```{r Variable Table, echo=FALSE}
library(knitr)
#glimpse(Data)

Var_Table = data.frame(
  Name = c("Date",
           "Open", 
           "High", 
           "Low",
           "Close", 
           "Adj Close", 
           "Volume"),
  
  Meaning = c("Date for that observation; YYYY-MM-DD form", 
              "S&P's opening market price",
              "That day's highest recorded price", 
              "That day's lowest recorded price", 
              "S&P's closing market price",  
              "S&P's closing market price, accounting for corporate activity*",
              "Total number of shares traded in the market that day"), 
              
  
  Data_Type = c("Date", "double", "double", "double", "double", "double", "double"))

kable(Var_Table) %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
```
* Corporate activity primarily includes, but is not limited to, dividend payouts and stock splits

# Model Creation
For my time series modeling experiments, I chose to divide my dataset into separate portions, with the larger chunk being used for model training and the later part being used for model testing. More specifically, the S&P's year-beginning market open from 1928 to 2010 I dedicated to **training**. While the observations from 2011 to 2020 will be used to **test** my model's accuracy.

I created four different time series models - Moving Average, Naive, Seasonal Naive and Drift - using the training data. Each model was set to have a forecast horizon (number of future intervals to predict) of 10. A table summary of each model's predictions for the ten years after our training data's end is below.
```{r, Baseline Forecasting Model Creation}
##### Moving Average Model
Moving_Average = round(meanf(TS_Training_Data, h=10)$mean, 4)
  # argument h represents the forecast horizon, which is the number of intervals into the future we are predicting values for
  # Setting h=10 since we used 10 for the sample size of the testing data

##### Naive Model
Naive = round(naive(TS_Training_Data, h=10)$mean, 4)

##### Seasonal Naive Model
Seasonal_Naive = round(snaive(TS_Training_Data, h=10)$mean, 4)

##### Drift Model
Drift = round(rwf(TS_Training_Data, h=10, drift=TRUE)$mean, 4)
  # function rwf stands for "random walk with drift"
  # Note that we MUST set drift = TRUE in order for it to be a proper drift model

Year = c("2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020")

Baseline_Models = cbind(Year, Moving_Average, Naive, Seasonal_Naive, Drift)

colnames(Baseline_Models) = c("Year", "Moving Average", "Naive", "Seasonal Naive", "Drift")

kable(Baseline_Models, caption = "Model Forecasting Projections")
```
Looking at our summary table, we see a few interesting takeaways. 

- First, we see that the year-to-year predicted values from the moving average model are **far** beneath those of the other 3 models. This is because the moving average method operates on the assumption that all future values will be equal to the mean of all previous values. A large amount of our training data (58 of the 73) years, had a market open below $200.

- We also notice there is total equality between the naive and seasonal naive models. The naive method assumes that all future observations will have equal value to that of the most previous observation (which explains why the predicted value is the same for each year in the string of forecasts). Meanwhile, the seasonal naive approach applies the premise of the naive method to data that operates in a cyclical manner (think frequencies of weather related events tied to seasonal changes). In the case of our dataset, each S&P market opening is occurring at *approximately* the same day every year, making the naive and seasonal naive model results practically identical.

- Lastly, we see that the drift model is the only one in which forecasted values differ from year to year. Unlike the other three methods, the drift method does account for historical averages in rate of increase and decrease as well as averages in overall value as well. Given that the S&P has always gained value when looked at from a long term perspective, it is not surprising that our drift forecasts show an average increase of about 13 to 14 USD in projected market opening from one year to the next.

# Model Forecast Visualizations
After examining the model forecasts year-by-year, I decided to graph each model's predictions alongside the known historical data to allow for a quick examination of how accurate each model was. The first graph consists of the entire scope of this dataset's observations (1929 to 2020), while the second graph I constrained to this century's market opens. It should be noted that, in both graphs, the representation of the naive model forecasts can be difficult to see as they perfectly overlap with those of the seasonal naive model.
```{r Graphs, fig.align='center', warning=FALSE}
DF_Predictions = cbind(2011:2020, data.frame(Moving_Average), data.frame(Naive), data.frame(Seasonal_Naive), data.frame(Drift))
  colnames(DF_Predictions) = c("Year", "Moving Average", "Naive", "Seasonal Naive", "Drift")

Full_Graph = (ggplot() +
  geom_line(data = First_Day, mapping = aes (x = Year, y = Open)) +             
      geom_point(data = First_Day, mapping = aes(x = Year, y = Open)) +
      labs(x = "Year", y = "S&P 500 Open ($)", title ="S&P 500 First Market Open of the Year") + 
  geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average, color = "Moving Average")) + 
  
  geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Naive)) + 
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Naive, color = "Naive")) + 
  
  geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Seasonal Naive`)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Seasonal_Naive, color = "Seasonal Naive")) + 
  
  geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Drift`)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Drift, color = "Drift")) +
  
  scale_x_continuous(
    limits = c(1925, 2020),
    breaks = seq(1930, 2020, 10) 
  ) +

  scale_y_continuous(
    limits = c(0, 3250),
    breaks = seq(0, 3500, 500) 
  ))

Full_Graph


########################
Reduced_Graph = (ggplot() +  
  geom_line(data = First_Day, mapping = aes (x = Year, y = Open)) +
      geom_point(data = First_Day, mapping = aes(x = Year, y = Open)) +
  labs(x = "Year", y = "S&P 500 Open ($)", title ="S&P 500 First Market Open of the Year (2000 - 2020)") + 
  geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Moving_Average, color = "Moving Average",)) + 
  
  geom_line(data = DF_Predictions, mapping = aes (x = Year, y = Naive)) + 
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Naive, color = "Naive")) + 
  
  geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Seasonal Naive`)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Seasonal_Naive, color = "Seasonal Naive")) + 
  
  geom_line(data = DF_Predictions, mapping=aes(x = Year, y = `Drift`)) +
      geom_point(data = DF_Predictions, mapping = aes (x = Year, y = Drift, color = "Drift")) +
  
  scale_x_continuous(
    limits = c(2000, 2020),
    breaks = seq(2000, 2020, 4) 
  ) +
  
  scale_y_continuous(
    limits = c(250, 3250),
   breaks = seq(250, 3250, 500)
  )
  )

Reduced_Graph
```
We can see from our two graphs that, like seen in the summary table above, the moving average model's forecasts are well beneath those of the other three models. That is because of the much greater influence that the first roughly 50 years of the S&P's opening market price has on the moving average's calculation than it does on the naive, seasonal naive or drift. The average opening price from 1928 to 1978 was only about $`r round(sum(Training_Data[1:51, 2])/50, 2)`.

Additionally, while our other three models' forecasts resemble that of the index' true opening value far greater than those of our moving average model, they also consistently underestimated. The drift model's predicted values were the most accurate, which is not suprising given that, as previously said, the drift approach does account for historical trends in increase or decrease. The reason the drift forecasts' trend was in the proper direction (increasing over time), but at a rate far lesser than the true rate of market open increase is due to the extraordinary weakness of the S&P opening's upward trend throughout the first five decades or so of our dataset.

```{r, include=FALSE}
# Average S&P opening price from 1928 - 1978
sum(Training_Data[1:51, 2])/50
```

# Model Accuracy Measures
Finally, I wanted to quantify the strength of my four models using some standard measures of modeling accuracy. Below is a breakdown of each model's mean error (ME), mean square error (MSE) and mean absolute prediction error (MAPE). Mean error and mean square error are considered measures of absolute error, and are a good reference statistic to summarize how "off" our models' forecasts were on average. Meanwhile, MAPE is called a measure of relative error, as it tells us on average how "off" our models' forecasts were as a *percent*.

None of the metrics calculated in the table below are particularly surprising. They are in-line with our previous graphical displays and forecast summary table. Our drift model is the most accurate, our naive and seasonal naive models are slightly worse, and the moving average is significantly less useful than the others.
```{r}
# Will find mean error (ME), mean squared error (MSE) and mean absolute prediction error (MAPE) for each of the 4 baseline models. Then cbind them all into a table and output with kable function.

### ME first
Moving_Average_Errors = data.frame(DF_Predictions$`Moving Average` - Testing_Data$Open)
    Moving_Average_ME = sum(Moving_Average_Errors)/nrow(Moving_Average_Errors)

Naive_Errors = data.frame(DF_Predictions$Naive - Testing_Data$Open)
    Naive_ME = sum(Naive_Errors)/nrow(Naive_Errors)
    
Seasonal_Naive_Errors = data.frame(DF_Predictions$`Seasonal Naive` - Testing_Data$Open)
    Seasonal_Naive_ME = sum(Seasonal_Naive_Errors)/nrow(Seasonal_Naive_Errors)
    
Drift_Errors = data.frame(DF_Predictions$Drift - Testing_Data$Open)
    Drift_ME = sum(Drift_Errors)/nrow(Drift_Errors)

All_Model_MEs = rbind(Moving_Average_ME, Naive_ME, Seasonal_Naive_ME, Drift_ME)

### MSE second

# Create squared error data frames
Moving_Average_SQErrors = data.frame(Moving_Average_Errors^2)
Naive_SQErrors = data.frame(Naive_Errors^2)
Seasonal_Naive_SQErrors = data.frame(Seasonal_Naive_Errors^2)
Drift_SQErrors = data.frame(Drift_Errors^2)

# Use squared error data frames to create MSE objects
Moving_Average_MSE = sum(Moving_Average_SQErrors)/nrow(Moving_Average_SQErrors)
Naive_MSE = sum(Naive_SQErrors)/nrow(Naive_SQErrors)
Seasonal_Naive_MSE = sum(Seasonal_Naive_SQErrors)/nrow(Seasonal_Naive_SQErrors)
Drift_MSE = sum(Drift_SQErrors)/nrow(Drift_SQErrors)

All_Model_MSEs = rbind(Moving_Average_MSE, Naive_MSE, Seasonal_Naive_MSE, Drift_MSE)

#MAPE third
Moving_Average_PEI = data.frame((Moving_Average_Errors/Testing_Data$Open)*100)
Moving_Average_MAPE = round(sum(abs(Moving_Average_PEI))/nrow(Moving_Average_PEI),4)

Naive_PEI = data.frame((Naive_Errors/Testing_Data$Open)*100)
Naive_MAPE = round(sum(abs(Naive_PEI))/nrow(Naive_PEI),4)

Seasonal_Naive_PEI = data.frame((Seasonal_Naive_Errors/Testing_Data$Open)*100)
Seasonal_Naive_MAPE = round(sum(abs(Seasonal_Naive_PEI))/nrow(Seasonal_Naive_PEI),4)

Drift_PEI = data.frame((Drift_Errors/Testing_Data$Open)*100)
Drift_MAPE = round(sum(abs(Drift_PEI))/nrow(Drift_PEI),4)

All_Model_MAPEs = rbind(Moving_Average_MAPE, Naive_MAPE, Seasonal_Naive_MAPE, Drift_MAPE)

Model_Accuracy = cbind(All_Model_MEs, All_Model_MSEs, All_Model_MAPEs)
rownames(Model_Accuracy) = c("Moving Average", "Naive", "Seasonal Naive", "Drift")
colnames(Model_Accuracy) = c("ME", "MSE", "MAPE")
kable(Model_Accuracy, caption = "Model Accuracy Summary")
```

# Conclusion
In conclusion, we can see that the straightforward application of these four standard and common time series modeling methods on the historical stock market data that was examined, were not particularly useful in forecasting future S&P 500 opening market prices. This was largely due to the consistently low market opens throughout the first 50 or so years of the S&P index. In the future, it would be interesting to examine the model accuracy of these methods - particularly the moving average and drift - if the range of the training data was greatly reduced to only include more recent decades.

# References:

Original Dataset Source:

- https://www.kaggle.com/datasets/henryhan117/sp-500-historical-data/data

Dataset Download Link via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/SPX.csv
