Instructions

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .pdf file showing your run code.

library(dplyr)
library(stringr)
library(fpp3)
library(cowplot)

5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

i

Australian Population (global_economy)

df_aus <- global_economy %>%
            filter(Country == "Australia")

head(df_aus)
 aus_plot1 <- df_aus%>%
  autoplot(Population)+
  labs(title= "Australian Population")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is an upward trend")

aus_fit <- df_aus %>%
            # no filter needed
            model(RW(Population ~ drift()))

aus_fc <- aus_fit %>%
            forecast(h = 10)

aus_plot2<- aus_fc %>% 
  autoplot(df_aus)



plot_grid(aus_plot1, aus_plot2, ncol = 2)

Alternatively the plotting can be done with one function

df_aus%>%
  model(RW(Population ~ drift()))%>%
    forecast(h = 10)%>%
      autoplot(df_aus)

In section 5.2 Drift method is explained to “allow the forecasts to increase or decrease over time”. Since the data did not show high seasonality and was not economic or financial, I did not use Naïve method and seasonal naïve. The example shown also uses random walk forecast in conjunction with drift (refer below)

                    bricks |> model(RW(Bricks ~ drift()))

ii

Bricks (aus_production)

head(aus_production)
brick_plot1<-aus_production%>%
                autoplot(Bricks)+
                labs(title= "Bricks Production")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is obvious seasonality but not trend")

brick_plot2<-aus_production%>%
# Warning: Removed 20 rows containing missing values (`geom_line()`).
# therefore filter is added
  filter(!is.na(Bricks))%>%
  model(SNAIVE(Bricks~lag("year")))%>%
    forecast(h = 10)%>%
      autoplot(aus_production)+
# aus_production added twice to ensure visual has full plot
  labs(title= "Bricks Production Forecast")

plot_grid(brick_plot1,brick_plot2, ncol = 1)
Warning: Removed 20 rows containing missing values (`geom_line()`).Warning: Removed 20 rows containing missing values (`geom_line()`).

There was obvious seasonality despite the fact that annually there was no trend. Regardless SNAIVE(y) was used as it seemed to be the best fit.

iii

NSW Lambs (aus_livestock)

cat(paste(unique(aus_livestock$State), collapse = "\n"))
Australian Capital Territory
New South Wales
Northern Territory
Queensland
South Australia
Tasmania
Victoria
Western Australia
df_lambs<-aus_livestock%>%
  filter(State == "New South Wales", str_detect(Animal,"Lambs"))
head(df_lambs)
lambs_plot1 <- df_lambs%>%
                autoplot()+
                labs(title= "New South Wales Count")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is no obvious seasonality or trend")
Plot variable not specified, automatically selected `.vars = Count`
lambs_plot2<-df_lambs%>%
   model(NAIVE(Count))%>%
     forecast(h = 10)%>%
       autoplot(df_lambs)+
   labs(title= "New South Wales Forecast")

No clear seasonality or trend so NAIVE(y) was most appropriate

iv

Household wealth (hh_budget).

head(hh_budget)
wealth_plot1<-hh_budget%>%
                autoplot(Wealth, show.legend= FALSE)+
                  facet_grid(Country~., scales = "free", space = "free_y")

wealth_plot2<- hh_budget%>%
                model(RW(Wealth~drift()))%>%
                  forecast(h=5)%>%
                    autoplot(hh_budget)

I genuinely considered the NAIVE(y) model, because the data was a time series regarding finance. I also debated the seasonal aspect and considered SNAIVE(y), noting the dip in wealth at certain intervals for certain countries. However, I think the trend is primarily upward, with exception of years that tie in with the recent recessions, therefore RW(y ~ drift()) or Drift method was used.

v

Australian takeaway food turnover (aus_retail).

cat(paste(unique(aus_retail$State), collapse = "\n"))
Australian Capital Territory
New South Wales
Northern Territory
Queensland
South Australia
Tasmania
Victoria
Western Australia
head(aus_retail)
aus_retail%>%
  filter(str_detect(Industry,"takeaway"))%>%
  autoplot(Turnover)+
  scale_color_discrete(name = "State", labels = unique(aus_retail$State))+
  labs(title = "Turnover (Australian takeaway) by State")

aus_retail %>%
  filter(str_detect(Industry,"takeaway")) %>%
  model(RW(Turnover ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(aus_retail)+
   facet_wrap(~State, scales = "free")

Just like the

          Australian Population (`global_economy`)
          

The data for all states showed an upward trend. So modeling each state using the Drift method made the mose sense.

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

a.

Produce a time plot of the series.

cat(paste(unique(gafa_stock$Symbol), collapse = "\n"))
AAPL
AMZN
FB
GOOG
distinct(gafa_stock, year = lubridate::year(Date))
df_fb <- gafa_stock %>%
  filter(Symbol == "FB")

head(df_fb)
# Re-index based on trading days
FB_stock <- df_fb %>%
# already filtered
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

FB_stock%>%
  autoplot(Close)+
  labs(y = '$US', title = 'The Facebook Daily Closing Stock Price')

NA
NA

b.

Produce forecasts using the drift method and plot them.

AS PER Example: Google’s daily closing stock price

# Filter the year of interest
FB_2015 <- FB_stock %>% filter(year(Date) == 2015)
# Fit the models
FB_fit <- FB_2015 |>
  model(
    Drift = NAIVE(Close ~ drift())
  )
# Produce forecasts for the trading days in January 2016
FB_jan_2016 <- FB_stock |>
  filter(yearmonth(Date) == yearmonth("2016 Jan"))
FB_fc <- FB_fit |>
  forecast(new_data = FB_jan_2016)
# Plot the forecasts
FB_fc |>
  autoplot(FB_2015, level = NULL) +
  autolayer(FB_jan_2016, Close, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily closing stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))

c.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

FB_fc%>% 
  autoplot(FB_2015, level = NULL) +
  geom_line(data = slice(FB_2015, range(cumsum(!is.na(Close)))),
                         aes(y=Close), linetype = 'dashed')

d.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

FB_fit2 <- FB_2015 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close)
  )
# to make the forecasts for the trading days in January 2016
FB_jan_2016 <- FB_stock %>%
  filter(yearmonth(Date) == yearmonth("2016 Jan"))

FB_fc2 <- FB_fit2 %>%
  forecast(new_data = FB_jan_2016)
# Plotting
FB_fc2 %>%
  autoplot(FB_2015, level = NULL) +
  autolayer(FB_jan_2016, Close, colour = "green") +
  labs(y = "$USD",
       title = "FB  Closing Stock Prices (Daily)",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "The Forecast"))

Naive I believe is the most accurate, b/c there is no seasonality and Naive works best with financial data according to the textbook

        "This method works remarkably well for many economic and financial time series."

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

What do you conclude?

I don’t believe this is white noise. This does not appear to be an unpredictable sequence of numbers, rather it seems to have a constant variance, and .resid that centers around 0. For the lag value of 4 in the acf plot, it is larger than normal but also follows the occurrence every 4 quarters. So the model can be improved, it is not likely white noise and therefore seasonal naïve method is valid.

5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

** Note**: df_aus established in 5.1 for Australian exports

i

Australian Exports series from global_economy

# Define and estimate a model
df_aus_fit <- df_aus %>% 
  model(NAIVE(Exports))

# Look at the residuals
df_aus_fit %>% 
  gg_tsresiduals()


# Look at some forecasts
df_aus_fit %>% 
  forecast() %>% autoplot(df_aus)

aus_aug |> features(.innov, box_pierce, lag = 10)

aus_aug |> features(.innov, ljung_box, lag = 10)
NA

What do you conclude?

BP p-value of 0.1481135 and LB p-value of 0.08963678 suggests there’s not significant autocorrelation, since they’re both greater than 0.05. There’s also constant variation. I cannot differentiate this from white noise and I’m unsure if the model could be improved.

ii

# Define and estimate a model
aus_prod_fit <- aus_production %>% 
  model(NAIVE(Bricks))

# Look at the residuals
aus_prod_fit %>% 
  gg_tsresiduals()


# Look at some forecasts
aus_prod_fit %>% 
  forecast() %>% autoplot(aus_production)

aus_prod_aug<-aus_prod_fit%>%
          augment()

aus_prod_aug |> features(.innov, box_pierce, lag = 10)

aus_prod_aug |> features(.innov, ljung_box, lag = 10)
NA

The acf being consistently past the dotted line suggests significant periodic autocorrelation and seasonality. The evidence of autocorrelation is supported by the BP and LB statistic.

5.7

For your retail time series (from Exercise 7 in Section 2.10):

a.

Create a training dataset consisting of observations before 2011 using

set.seed(123)


myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

b.

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c. 

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

d. 

Check the residuals.

fit |> gg_tsresiduals()

Do the residuals appear to be uncorrelated and normally distributed?

The data appears to be correlated, but have a constant variation changing from positive to negative at lag 10 indicating heteroscedasticity. I also does not appear to be normally distributed or center around 0.

e.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)

f.

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()
fc |> accuracy(myseries)
NA

The forecast models does not perform well on the test data and the errors for training seem smaller in comparison.

g.

How sensitive are the accuracy measures to the amount of training data used?

The measures are highly sensitive to the quantity of data involved, which may be negatively impacted the measures.

---
title: 'DATA 624: PREDICTIVE ANALYTICS HW3'
author: "Gabriel Campos"
date: "Last edited `r format(Sys.time(), '%B %d, %Y')`"
output:
  html_notebook: default
  geometry: left=0.5cm,right=0.5cm,top=1cm,bottom=2cm
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
urlcolor: blue
---

# Instructions

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.  Please submit your [Rpubs link](https://rpubs.com/gcampos100/DATA_624_HW3) as well as your .pdf file showing your run code.


```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(stringr)
library(fpp3)
library(cowplot)
```

# 5.1

Produce forecasts for the following series using whichever of `NAIVE(y)`, `SNAIVE(y)` or `RW(y ~ drift())` is more appropriate in each case:

## i

Australian Population (`global_economy`)

```{r}
df_aus <- global_economy %>%
            filter(Country == "Australia")

head(df_aus)
```

```{r}
 aus_plot1 <- df_aus%>%
  autoplot(Population)+
  labs(title= "Australian Population")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is an upward trend")

aus_fit <- df_aus %>%
            # no filter needed
            model(RW(Population ~ drift()))

aus_fc <- aus_fit %>%
            forecast(h = 10)

aus_plot2<- aus_fc %>% 
  autoplot(df_aus)



plot_grid(aus_plot1, aus_plot2, ncol = 2)
```

Alternatively the plotting can be done with one function 

```{r}
df_aus%>%
  model(RW(Population ~ drift()))%>%
    forecast(h = 10)%>%
      autoplot(df_aus)
```


In section 5.2 Drift method is explained to "allow the forecasts to increase or decrease over time". Since the data did not show high seasonality and was not economic or financial, I did not use `Naïve` method and `seasonal naïve`. The example shown also uses random walk forecast in conjunction with drift (refer below)

                        bricks |> model(RW(Bricks ~ drift()))

## ii

Bricks (`aus_production`)

```{r}
head(aus_production)
```


```{r}
brick_plot1<-aus_production%>%
                autoplot(Bricks)+
                labs(title= "Bricks Production")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is obvious seasonality but not trend")

brick_plot2<-aus_production%>%
# Warning: Removed 20 rows containing missing values (`geom_line()`).
# therefore filter is added
  filter(!is.na(Bricks))%>%
  model(SNAIVE(Bricks~lag("year")))%>%
    forecast(h = 10)%>%
      autoplot(aus_production)+
# aus_production added twice to ensure visual has full plot
  labs(title= "Bricks Production Forecast")

plot_grid(brick_plot1,brick_plot2, ncol = 1)
```
There was obvious seasonality despite the fact that annually there was no trend. Regardless `SNAIVE(y)` was used as it seemed to be the best fit.

## iii

NSW Lambs (`aus_livestock`)

```{r}
cat(paste(unique(aus_livestock$State), collapse = "\n"))

```


```{r}
df_lambs<-aus_livestock%>%
  filter(State == "New South Wales", str_detect(Animal,"Lambs"))
head(df_lambs)
```


```{r}
lambs_plot1 <- df_lambs%>%
                autoplot()+
                labs(title= "New South Wales Count")+
  annotate("text", x = Inf, y = -Inf, hjust = 1, vjust = -1,
           label = "There is no obvious seasonality or trend")

lambs_plot2<-df_lambs%>%
   model(NAIVE(Count))%>%
     forecast(h = 10)%>%
       autoplot(df_lambs)+
   labs(title= "New South Wales Forecast")
```

```{r, echo=FALSE}
lambs_plot1
```

```{r, echo=FALSE}
lambs_plot2
```


No clear seasonality or trend so `NAIVE(y)` was most appropriate

## iv

Household wealth (`hh_budget`).

```{r}
head(hh_budget)
```

```{r}
wealth_plot1<-hh_budget%>%
                autoplot(Wealth, show.legend= FALSE)+
                  facet_grid(Country~., scales = "free", space = "free_y")

wealth_plot2<- hh_budget%>%
                model(RW(Wealth~drift()))%>%
                  forecast(h=5)%>%
                    autoplot(hh_budget)

```

```{r, echo=FALSE}
wealth_plot1
```

```{r, echo=FALSE}
wealth_plot2
```

I genuinely considered the `NAIVE(y)` model, because the data was a time series regarding finance. I also debated the seasonal aspect and considered `SNAIVE(y)`, noting the dip in wealth at certain intervals for certain countries. However, I think the trend is primarily upward, with exception of years that tie in with the recent recessions, therefore `RW(y ~ drift())` or Drift method was used.

## v

Australian takeaway food turnover (`aus_retail`).

```{r}
cat(paste(unique(aus_retail$State), collapse = "\n"))
```


```{r}
head(aus_retail)
```

```{r}
aus_retail%>%
  filter(str_detect(Industry,"takeaway"))%>%
  autoplot(Turnover)+
  scale_color_discrete(name = "State", labels = unique(aus_retail$State))+
  labs(title = "Turnover (Australian takeaway) by State")
```

```{r, fig.height=5}
aus_retail %>%
  filter(str_detect(Industry,"takeaway")) %>%
  model(RW(Turnover ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(aus_retail)+
   facet_wrap(~State, scales = "free")
```

Just like the 
    
              Australian Population (`global_economy`)
              
The data for all states showed an upward trend. So modeling each state using the Drift method made the mose sense.

# 5.2

Use the Facebook stock price (data set `gafa_stock`) to do the following:

## a.

Produce a time plot of the series.

```{r}
cat(paste(unique(gafa_stock$Symbol), collapse = "\n"))
```

```{r}
distinct(gafa_stock, year = lubridate::year(Date))
```


```{r}
df_fb <- gafa_stock %>%
  filter(Symbol == "FB")

head(df_fb)
```

```{r}
# Re-index based on trading days
FB_stock <- df_fb %>%
# already filtered
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

FB_stock%>%
  autoplot(Close)+
  labs(y = '$US', title = 'The Facebook Daily Closing Stock Price')


```


## b.

Produce forecasts using the drift method and plot them.

**AS PER Example: Google’s daily closing stock price**

```{r}
# Filter the year of interest
FB_2015 <- FB_stock %>% filter(year(Date) == 2015)
# Fit the models
FB_fit <- FB_2015 |>
  model(
    Drift = NAIVE(Close ~ drift())
  )
# Produce forecasts for the trading days in January 2016
FB_jan_2016 <- FB_stock |>
  filter(yearmonth(Date) == yearmonth("2016 Jan"))
FB_fc <- FB_fit |>
  forecast(new_data = FB_jan_2016)
# Plot the forecasts
FB_fc |>
  autoplot(FB_2015, level = NULL) +
  autolayer(FB_jan_2016, Close, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily closing stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))
```


## c.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

```{r}
FB_fc%>% 
  autoplot(FB_2015, level = NULL) +
  geom_line(data = slice(FB_2015, range(cumsum(!is.na(Close)))),
                         aes(y=Close), linetype = 'dashed')
```


## d.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

```{r}
FB_fit2 <- FB_2015 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close)
  )
# to make the forecasts for the trading days in January 2016
FB_jan_2016 <- FB_stock %>%
  filter(yearmonth(Date) == yearmonth("2016 Jan"))

FB_fc2 <- FB_fit2 %>%
  forecast(new_data = FB_jan_2016)
# Plotting
FB_fc2 %>%
  autoplot(FB_2015, level = NULL) +
  autolayer(FB_jan_2016, Close, colour = "green") +
  labs(y = "$USD",
       title = "FB  Closing Stock Prices (Daily)",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "The Forecast"))
```

Naive I believe is the most accurate, b/c there is no seasonality and Naive works best with financial data according to the textbook

            "This method works remarkably well for many economic and financial time series."

# 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

```{r}
# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

```

What do you conclude?

I don't believe this is white noise. This does not appear to be an unpredictable sequence of numbers, rather it seems to have a constant variance, and `.resid` that centers around 0. For the lag value of 4 in the acf plot, it is larger than normal but also follows the occurrence every 4 quarters. So the model can be improved, it is not likely white noise and therefore seasonal naïve method is valid.

# 5.4 

Repeat the previous exercise using the Australian Exports series from `global_economy` and the Bricks series from `aus_production.` Use whichever of `NAIVE()` or `SNAIVE()` is more appropriate in each case.

** Note**: `df_aus` *established in 5.1 for Australian exports*

## i

*Australian Exports series from* `global_economy`

```{r}
# Define and estimate a model
df_aus_fit <- df_aus %>% 
  model(NAIVE(Exports))

# Look at the residuals
df_aus_fit %>% 
  gg_tsresiduals()

# Look at some forecasts
df_aus_fit %>% 
  forecast() %>% autoplot(df_aus)
```

```{r}
aus_aug<-df_aus_fit%>%
          augment()

aus_aug |> features(.innov, box_pierce, lag = 10)

aus_aug |> features(.innov, ljung_box, lag = 10)

```

What do you conclude?

 BP p-value of 0.1481135 and LB p-value of 0.08963678 suggests there's not significant autocorrelation, since they're both greater than 0.05. There's also constant variation. I cannot differentiate this from white noise and I'm unsure if the model could be improved.
 
## ii 
 
```{r}
# Define and estimate a model
aus_prod_fit <- aus_production %>% 
  model(NAIVE(Bricks))

# Look at the residuals
aus_prod_fit %>% 
  gg_tsresiduals()

# Look at some forecasts
aus_prod_fit %>% 
  forecast() %>% autoplot(aus_production)
```

```{r}
aus_prod_aug<-aus_prod_fit%>%
          augment()

aus_prod_aug |> features(.innov, box_pierce, lag = 10)

aus_prod_aug |> features(.innov, ljung_box, lag = 10)

```
The acf being consistently past the dotted line suggests significant periodic autocorrelation and seasonality. The evidence of autocorrelation is supported by the BP and LB statistic.

# 5.7

For your retail time series (from Exercise 7 in Section [2.10](https://otexts.com/fpp3/graphics-exercises.html#graphics-exercises)):

## a. 

Create a training dataset consisting of observations before 2011 using

```{r}
set.seed(123)


myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

```

## b.

Check that your data have been split appropriately by producing the following plot.

```{r}
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

```

## c. 

Fit a seasonal naïve model using SNAIVE() applied to your training data (`myseries_train`).

```{r}
fit <- myseries_train |>
  model(SNAIVE(Turnover))

```

## d. 

Check the residuals.
```{r}
fit |> gg_tsresiduals()

```

Do the residuals appear to be uncorrelated and normally distributed?

The data appears to be correlated, but have a constant variation changing from positive to negative at lag 10 indicating heteroscedasticity. I also does not appear to be normally distributed or center around 0.

## e. 

Produce forecasts for the test data

```{r}
fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
fc |> autoplot(myseries)

```

## f.

Compare the accuracy of your forecasts against the actual values.

```{r}
fit |> accuracy()
fc |> accuracy(myseries)

```

The forecast models does not perform well on the test data and the errors for training seem smaller in comparison.

## g. 

How sensitive are the accuracy measures to the amount of training data used?

The measures are highly sensitive to the quantity of data involved, which may be negatively impacted the measures.
