1 Introduction

The purpose of this report is to study the performance of Starbucks based on its closing prices from the year 1992 until 2023. In this time period, starbucks has been boycotted on multiple occasions so this will be something to keep note of as this investigation is pursued. Ultimately, given that these boycotts ocurred, I’ll be interested to see what the forecast is for the starbucks stock and if these bocotts will have a lasting effect whether it be slowing the growth of the stock, or even causing it to lose value.

Note the the data downloaded has the value of the stock at the first of the month everymonth since 1960. This means that as I proceed with my project, I do not need to take into account trading days or anything like that.

2 Processing and GGplot

We will begin with a simple overview of our dataset to identify if anything needs to be addressed.


df <- read.csv("/Users/sumayyakheireddine/Documents/FALLUOttawa2023/4307/Group Project/SBUX.csv")

ggplot(data = df, mapping = aes(x=  Date, y = Adj.Close)) +
        geom_point() +
  ylab("Adjusted Close ") + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 40, hjust = 1, size = 8))

Here, though the dates are not visible, we can see it’s consistent for quite a while with a slightly upward trend, it dips, and then steadily rises. It remains consistent and then rises again and at some point, more recently, the trend become indistinguishable. We can look into this furhter.

df$Date <- as.Date(df$Date, format = "%Y-%m-%d")

tsib <- df |> mutate(Month = yearmonth(Date)) |>
as_tsibble(index = Month)

tsib2015<- tsib|>
  filter(yearmonth(Date) >= yearmonth("2015-01-01")) |>
   autoplot(Close)+
  labs(title = "Closing Price of SBUX", y = "$US")

tsib2015

This graph is clearer. We can see some sawtooth behavior starting around 2019. Again, we’ll keep this in mind. For the purpose of modelling and forecasting, we’ll use the entire data set. It doesn’t seem to. be consistent dips so I don’t think there’s any seasonality there.

Based on based on Pegel’s (1969) classification, it looks like this data set has an additive trend with multiplicative seasonality.

However, to double check, I’d like to plot a seasonal plot using months to see if there are any less evident trends that are consistent throughout the years.

dcmp <- tsib |>
model(stl = STL(Close))
components(dcmp) |> 
autoplot()

It seems I was right and there is in fact no seasonality in terms of months. Perhaps this is because Starbucks has become relatively inelastic to their customers. However, there is yearly seasonality which makes a lot more sense.

We can use an ACF to confirm if there are in fact any trends worth obeserving.

tsib |> ACF(Close) |> autoplot()

This ACF corroborates the trend we have identified. The autocorrelations for the smaller lags are larger and positive because because observations nearby in time are also nearby in value. The slow decrease in the ACF as the lags increase is due to the trend.

Evidently, this is not white noise.

3 Model 1 : Basic Forecasting Model

Let’s start with a simple fitted forecast.

fit <- tsib |>
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    `Seasonal naïve` = SNAIVE(Close)
  )

fc <- fit |>
forecast(h = "3 years") |>
autoplot(tsib) + labs(title = "Closing Price of SBUX", y = "$US")

fc

It looks like the seasonal naive is the best model. However, the data is not seasonal so let’s use naive instead. We can also use the accuracy function to help us pick the best of the models.

accuracy(fit) |>
arrange(.model) |>
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)

As expected, the RMSE of the naive is smallest. Let’s inspect the naive method further.

Check Residuals

tsib |>
  model(NAIVE(Close)) |>
  gg_tsresiduals()

In terms of residuals:The innovation residuals have about a zero mean suggesting the forecast is good.

In terms of the ACF: The lack of correlation suggests the forecast is good.

In terms of the histogram: It’s not necessary that the forecast be entirely normally distributed, and this only slightly skewed to the right. It is not ideal but it’s okay.

There is no data that needs to be removed.

Let’s now do a training set and test set for visualization purposes.

tsibmodel1train <- tsib |> filter(Date <= as.Date("2021-01-01"))

tsibmodel1test <-tsib |> filter(Date > as.Date("2021-01-01"))

fit <- tsibmodel1train |>
  model(
    `Seasonal naïve` = SNAIVE(Close)
  )

fcmodel1good <- fit |>
forecast(h = "2 years") |>
autoplot(tsib) + labs(title = "Closing Price of SBUX", y = "$US") +
autolayer(tsibmodel1test)
Plot variable not specified, automatically selected `.vars = Open`
fcmodel1good

About half of this data looks accruate while the left-portion is entirely outside of our interval of confidence.

4 Model 2: ETS

To decide whether to use SES, Holt, or Damped Holt, let’s find out which one has the lowest RMSE.

Here, I’ll be using the letters M for multiplicative error, A for trend, and N for no seasonality.

For this model, it will be trained on up to 2 years before the end of the data.This is chosen arbitrarily but mostly because there have been extreme fluctuations in the past 2 years that I want to see if the model will be able to predict.

Initially I used M in the seasonality, but I received this error.

Warning: 8 errors (2 unique) encountered for Holt [3] A seasonal ETS model cannot be used for this data.

Warning: 9 errors (2 unique) encountered for Damped [3] A seasonal ETS model cannot be used for this data.

Both of these errors are making me think, there is no seasonality, so I am changing that paramenter into a N.


fcmodel2train <- tsib |> 
  filter(Month <= as.Date("2021-01-01")) |>
  model(
    Holt = ETS(Close ~ error("A") + trend("A") + season("N")),
    Damped = ETS(Close ~ error("A") + trend("Ad") + season("N"))
  ) |>
  forecast(h = 36)
Warning: There was 1 warning in `filter()`.
ℹ In argument: `Month <= as.Date("2021-01-01")`.
Caused by warning:
! Incompatible methods ("<=.vctrs_vctr", "<=.Date") for "<="
  

fcmodel2train |> as_tsibble() 

fcmodel2test <- tsib |> 
  filter(Month > as.Date("2021-01-01")) |>
  as_tsibble()
Warning: There was 1 warning in `filter()`.
ℹ In argument: `Month > as.Date("2021-01-01")`.
Caused by warning:
! Incompatible methods (">.vctrs_vctr", ">.Date") for ">"
autoplot(fcmodel2train) + 
  labs(title = "Closing Price of SBUX", y = "$US") + autolayer(fcmodel2test)
Plot variable not specified, automatically selected `.vars = Open`

Unfortunately, the RMSE within accuracy function is not working. However, it seems like the damped trend is only slightly more accurate in terms of mean and confidence interval. Let’s plot the one with the lower RMSE, the Damped Holt.

fcmodel2train <- tsib |> 
  filter(Month <= as.Date("2021-01-01")) |>
  model(
    Damped = ETS(Close ~ error("A") + trend("Ad") + season("N"))
  ) |>
  forecast(h = 36)
Warning: There was 1 warning in `filter()`.
ℹ In argument: `Month <= as.Date("2021-01-01")`.
Caused by warning:
! Incompatible methods ("<=.vctrs_vctr", "<=.Date") for "<="
  

fcmodel2train |> as_tsibble(key=Month)

fcmodel2test <- tsib |> 
  filter(Month > as.Date("2021-01-01")) |>
  as_tsibble()
Warning: There was 1 warning in `filter()`.
ℹ In argument: `Month > as.Date("2021-01-01")`.
Caused by warning:
! Incompatible methods (">.vctrs_vctr", ">.Date") for ">"
autoplot(fcmodel2train) + 
  labs(title = "Closing Price of SBUX", y = "$US") +autolayer(fcmodel2test)
Plot variable not specified, automatically selected `.vars = Open`

By validating the model using our training and test set, we can see, this model is still not entirely accurate even with the 95% confidence interval. Given the forecast is around the years of Covid, this does make sense.

5 Model 3: ARIMA

This data is not stationary, so the first thing we have to do is make it stationary.

After differencing once, the mean is now around zero but the variance is still not stabilized. So I took the log. This looks better. Let’s take a look at the ACF of this as well.

tsib <- tsib |>
  mutate(closediff = difference(log(Close)))

tsib |> autoplot(closediff)


 tsib |>
   model(ARIMA(closediff)) |> gg_tsresiduals()

The residuals has mean around zero but some data points still do not feel stabilized. The low negative values at 18 in the ACF indicate some sort of complex seasonality. We’ll let the auto ARIMA take care of that later on. The distribution also looks like an almost perfect normal distribution which is a good sign.

Let’s do a KPSS test on this.

sbux_diff |>
features(difference(log(Close)), unitroot_kpss)
Error: object 'sbux_diff' not found

Since the KPSS value is 0.1, our P value is higher than 0.5 and H0 is not rejected meanign that our data is stationary.

Let’s now design the Arima model.

fcmodel3train <- tsib |> 
  filter(Date <= as.Date("2021-01-01")) |>
  model(auto = ARIMA(Close, stepwise = FALSE, approx = FALSE))

fcmodel3train |>
forecast(h=36) |>
autoplot(tsib)

Let’s break down why the auto-model with these parameters was chosen <ARIMA(3,2,3).

The ARIMA(3,2,3) incorporates three key components. Firstly, the Autoregressive (AR) component with an order (p) of 3 captures the influence of the three most recent observations on the current value. This allows the model to figure out the patterns and dependencies in the historical data. Secondly, the Integrated (I) component with an order (d) of 2 represents the differencing operations applied to the time series. With a differencing order of 2, the model aims to achieve stationarity by removing trends and seasonality through consecutive differencing. This is interesting to know because earlier, I thought that one differencing would be enough. Lastly, the Moving Average (MA) component, with an order (q) of 3, accounts for the relationship between the current value and past forecast errors. The inclusion of a moving average helps the model adapt to any residual patterns not captured by the autoregressive component.

The ACF and PACF plots might show significant correlations at lags 1 and 12 due to the autoregressive and seasonal autoregressive terms, with a decay in correlations for higher lags, while the presence of moving average terms could lead to periodic spikes in the ACF.

6 Compare the Models

We already know that the naive model is simply used a benchmark. We expect both ETS and ARIMA to perform better. Let’s compare those two as a means of concluding the report and presenting the best forecast.

Lastly, let’s compare the two models. I’m using data up to halfway through covid since I do not want to remove the event that occured, but I want to partially factor it in.

split_date <- as.Date("2020-12-31")  # Adjust the split date as needed

train <- tsib %>%
  filter(Date <= split_date)

test <- tsib %>%
  filter(Date > split_date)

# Model training
fit <- train |>
  model(
    arima = ARIMA(Close),
    ets = ETS(Close)
  )

# Forecasting
forecast_results <- fit |>
  forecast(h = nrow(test)) |>
  accuracy(tsib) |>
select(-ME, -MPE, -ACF1)

forecast_results

Here we can see that the ETS has lower values for all measures and so in this case, we should use the ETS model for our forecasting.

fit |>
select(ets) |>
forecast(h = "5 years") |>
autoplot(tsib) +
labs(title = "Closing Price of SBUX", y = "$US")

I hope to conclude that starbucks will continue to perform with the trend increasing as rapidly as it did before as both models suggest, especially given their recent stance on the Palestinian conflict. This ETS model suggests a company that has high risk and reward when investing, and usually, Starbucks has been known to be a safe stock. I know that Starbucks is trying to convince investors to keep their stocks in the company but this report may convince investors otherwise.

7 Improvement

One thing I was not able to do during this project because of my lack of understanding of CPI was to adjust for inflation. Next time I undertake a project like this, I would like to factor this in as I think my analysis would be much more accurate and closer to reality.

8 References

“Starbucks Corporation (SBUX) Stock Historical Prices & Data.” Yahoo! Finance, Yahoo!, 13 Dec. 2023, finance.yahoo.com/quote/SBUX/history?p=SBUX.

---
title: 'ADM 4307: Final Report'
author: "Sumayya Kheireddine"
date: "`r Sys.Date()`"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
    number_sections: yes
  html_document:
    toc: yes
    df_print: paged
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---

# Introduction

The purpose of this report is to study the performance of Starbucks based on its closing prices from the year 1992 until 2023. In this time period, starbucks has been boycotted on multiple occasions so this will be something to keep note of as this investigation is pursued. Ultimately, given that these boycotts ocurred, I'll be interested to see what the forecast is for the starbucks stock and if these bocotts will have a lasting effect whether it be slowing the growth of the stock, or even causing it to lose value.  

Note the the data downloaded has the value of the stock at the first of the month everymonth since 1960. This means that as I proceed with my project, I do not need to take into account trading days or anything like that.

 
# Processing and GGplot
We will begin with a simple overview of our dataset to identify if anything needs to be addressed.

```{r}
library(ggplot2)
df <- read.csv("/Users/sumayyakheireddine/Documents/FALLUOttawa2023/4307/Group Project/SBUX.csv")

ggplot(data = df, mapping = aes(x=  Date, y = Adj.Close)) +
        geom_point() +
  ylab("Adjusted Close ") + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 40, hjust = 1, size = 8))
```
Here, though the dates are not visible, we can see it's consistent for quite a while with a slightly upward trend, it dips, and then steadily rises. It remains consistent and then rises again and at some point, more recently, the trend become indistinguishable. We can look into this furhter.


```{r}
library(tsibble)
library(tibble)
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")

tsib <- df |> mutate(Month = yearmonth(Date)) |>
as_tsibble(index = Month)

tsib2015<- tsib|>
  filter(yearmonth(Date) >= yearmonth("2015-01-01")) |>
   autoplot(Close)+
  labs(title = "Closing Price of SBUX", y = "$US")

tsib2015
```

This graph is clearer. We can see some sawtooth behavior starting around 2019. Again, we'll keep this in mind. For the purpose of modelling and forecasting, we'll use the entire data set. It doesn't seem to. be consistent dips so I don't think there's any seasonality there.

Based on based on Pegel's (1969) classification, it looks like this data set has an additive trend with multiplicative seasonality.

However, to double check, I'd like to plot a seasonal plot using months to see if there are any less evident trends that are consistent throughout the years.

```{r}
dcmp <- tsib |>
model(stl = STL(Close))
components(dcmp) |> 
autoplot()
```
It seems I was right and there is in fact no seasonality in terms of months. Perhaps this is because Starbucks has become relatively inelastic to their customers. However, there is yearly seasonality which makes a lot more sense.

We can use an ACF to confirm if there are in fact any trends worth obeserving.
```{r}
tsib |> ACF(Close) |> autoplot()
```
This ACF corroborates the trend we have identified. The autocorrelations for the smaller lags are larger and positive because because observations nearby in time are also nearby in value. The slow decrease in the ACF as the lags increase is due to the trend.

Evidently, this is not white noise.

# Model 1 : Basic Forecasting Model

Let's start with a simple fitted forecast.
```{r}
fit <- tsib |>
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    `Seasonal naïve` = SNAIVE(Close)
  )

fc <- fit |>
forecast(h = "3 years") |>
autoplot(tsib) + labs(title = "Closing Price of SBUX", y = "$US")

fc
```

It looks like the seasonal naive is the best model. However, the data is not seasonal so let's use naive instead. We can also use the accuracy function to help us pick the best of the models.
```{r}
accuracy(fit) |>
arrange(.model) |>
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)
```
As expected, the RMSE of the naive is smallest. Let's inspect the naive method further.

Check Residuals

```{r}
tsib |>
  model(NAIVE(Close)) |>
  gg_tsresiduals()
```

In terms of residuals:The innovation residuals have about a zero mean suggesting the forecast is good.

In terms of the ACF: The lack of correlation suggests the forecast is good. 

In terms of the histogram: It's not necessary that the forecast be entirely normally distributed, and this only slightly skewed to the right. It is not ideal but it's okay.

There is no data that needs to be removed.

Let's now do a training set and test set for visualization purposes.

```{r}
tsibmodel1train <- tsib |> filter(Date <= as.Date("2021-01-01"))

tsibmodel1test <-tsib |> filter(Date > as.Date("2021-01-01"))

fit <- tsibmodel1train |>
  model(
    `Seasonal naïve` = SNAIVE(Close)
  )

fcmodel1good <- fit |>
forecast(h = "2 years") |>
autoplot(tsib) + labs(title = "Closing Price of SBUX", y = "$US") +
autolayer(tsibmodel1test)

fcmodel1good
```

About half of this data looks accruate while the left-portion is entirely outside of our interval of confidence.


# Model 2: ETS
To decide whether to use SES, Holt, or Damped Holt, let's find out which one has the lowest RMSE.

Here, I'll be using the letters M for multiplicative error, A for trend, and N for no seasonality.

For this model, it will be trained on up to 2 years before the end of the data.This is chosen arbitrarily but mostly because there have been extreme fluctuations in the past 2 years that I want to see if the model will be able to predict. 

Initially I used M in the seasonality, but I received this error. 

Warning: 8 errors (2 unique) encountered for Holt
[3] A seasonal ETS model cannot be used for this data.

Warning: 9 errors (2 unique) encountered for Damped
[3] A seasonal ETS model cannot be used for this data.


Both of these errors are making me think, there is no seasonality, so I am changing that paramenter into a N.

```{r}

fcmodel2train <- tsib |> 
  filter(Month <= as.Date("2021-01-01")) |>
  model(
    Holt = ETS(Close ~ error("A") + trend("A") + season("N")),
    Damped = ETS(Close ~ error("A") + trend("Ad") + season("N"))
  ) |>
  forecast(h = 36)
  

fcmodel2train |> as_tsibble() 

fcmodel2test <- tsib |> 
  filter(Month > as.Date("2021-01-01")) |>
  as_tsibble()

autoplot(fcmodel2train) + 
  labs(title = "Closing Price of SBUX", y = "$US") + autolayer(fcmodel2test)

```
Unfortunately, the RMSE within accuracy function is not working. However, it seems like the damped trend is only slightly more accurate in terms of mean and confidence interval. Let's plot the one with the lower RMSE, the Damped Holt.

```{r}
fcmodel2train <- tsib |> 
  filter(Month <= as.Date("2021-01-01")) |>
  model(
    Damped = ETS(Close ~ error("A") + trend("Ad") + season("N"))
  ) |>
  forecast(h = 36)
  

fcmodel2train |> as_tsibble(key=Month)

fcmodel2test <- tsib |> 
  filter(Month > as.Date("2021-01-01")) |>
  as_tsibble()


autoplot(fcmodel2train) + 
  labs(title = "Closing Price of SBUX", y = "$US") +autolayer(fcmodel2test)
```
By validating the model using our training and test set, we can see, this model is still not entirely accurate even with the 95% confidence interval. Given the forecast is around the years of Covid, this does make sense.

# Model 3: ARIMA

This data is not stationary, so the first thing we have to do is make it stationary.

After differencing once, the mean is now around zero but the variance is still not stabilized. So I took the log. This looks better. Let's take a look at the ACF of this as well.
```{r}
tsib <- tsib |>
  mutate(closediff = difference(log(Close)))

tsib |> autoplot(closediff)

 tsib |>
   model(ARIMA(closediff)) |> gg_tsresiduals()
```
The residuals has mean around zero but some data points still do not feel stabilized. The low negative values at 18 in the ACF indicate some sort of complex seasonality. We'll let the auto ARIMA take care of that later on. The distribution also looks like an almost perfect normal distribution which is a good sign.


Let's do a KPSS test on this.
```{r}
sbux_diff |>
features(difference(log(Close)), unitroot_kpss)
```
Since the KPSS value is 0.1, our P value is higher than 0.5 and H0 is not rejected meanign that our data is stationary. 


Let's now design the Arima model.
```{r}
fcmodel3train <- tsib |> 
  filter(Date <= as.Date("2021-01-01")) |>
  model(auto = ARIMA(Close, stepwise = FALSE, approx = FALSE))

fcmodel3train |>
forecast(h=36) |>
autoplot(tsib)
```

Let's break down why the auto-model with these parameters was chosen <ARIMA(3,2,3). 

The ARIMA(3,2,3) incorporates three key components. Firstly, the Autoregressive (AR) component with an order (p) of 3 captures the influence of the three most recent observations on the current value. This allows the model to figure out the patterns and dependencies in the historical data. Secondly, the Integrated (I) component with an order (d) of 2 represents the differencing operations applied to the time series. With a differencing order of 2, the model aims to achieve stationarity by removing trends and seasonality through consecutive differencing. This is interesting to know because earlier, I thought that one differencing would be enough. 
Lastly, the Moving Average (MA) component, with an order (q) of 3, accounts for the relationship between the current value and past forecast errors. The inclusion of a moving average helps the model adapt to any residual patterns not captured by the autoregressive component.


The ACF and PACF plots might show significant correlations at lags 1 and 12 due to the autoregressive and seasonal autoregressive terms, with a decay in correlations for higher lags, while the presence of moving average terms could lead to periodic spikes in the ACF.



# Compare the Models

We already know that the naive model is simply used a benchmark. We expect both ETS and ARIMA to perform better. Let's compare those two as a means of concluding the report and presenting the best forecast. 


Lastly, let's compare the two models. I'm using data up to halfway through covid since I do not want to remove the event that occured, but I want to partially factor it in.
```{r}
split_date <- as.Date("2020-12-31") 

train <- tsib |>
  filter(Date <= split_date)

test <- tsib |>
  filter(Date > split_date)

fit <- train |>
  model(
    arima = ARIMA(Close),
    ets = ETS(Close)
  )

forecast_results <- fit |>
  forecast(h = nrow(test)) |>
  accuracy(tsib) |>
select(-ME, -MPE, -ACF1)

forecast_results
```
Here we can see that the ETS has lower values for all measures and so in this case, we should use the ETS model for our forecasting. 

```{r}
fit |>
select(ets) |>
forecast(h = "5 years") |>
autoplot(tsib) +
labs(title = "Closing Price of SBUX", y = "$US")
```

I hope to conclude that starbucks will continue to perform with the trend increasing as rapidly as it did before as both models suggest, especially given their recent stance on the Palestinian conflict. This ETS model suggests a company that has high risk and reward when investing, and usually, Starbucks has been known to be a safe stock. I know that Starbucks is trying to convince investors to keep their stocks in the company but this report may convince investors otherwise.


# Improvement
One thing I was not able to do during this project because of my lack of understanding of CPI was to adjust for inflation. Next time I undertake a project like this, I would like to factor this in as I think my analysis would be much more accurate and closer to reality. 

# References
“Starbucks Corporation (SBUX) Stock Historical Prices &amp; Data.” Yahoo! Finance, Yahoo!, 13 Dec. 2023, finance.yahoo.com/quote/SBUX/history?p=SBUX. 
