Assignment 1 - DATA 624

Author

Amanda Rose Knudsen

Published

February 8, 2025

Assignment 1: Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Link to Chapter 2 exercises for reference.

Loading the Data

library(tidyverse)
library(fpp3)

2.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

  • Use ? (or help()) to find out about the data in each series.

  • What is the time interval of each series?

  • Use autoplot() to produce a time plot of each series.

  • For the last plot, modify the axis labels and title.

Bricks from aus_production

After running ?aus_production in the console, I learned that the Bricks time series from aus_production provides quarterly estimates of clay brick production in millions of bricks in Australia.

Time interval: quarterly

Autoplot:

autoplot(aus_production, Bricks)
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).

Modified autoplot:

autoplot(aus_production, Bricks) +
  labs(title = "Australian Clay brick production",
       x = "Quarter",
       y = "Millions of Bricks"
       )
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).

Lynx from pelt

After running ?pelt in the console, I learned that the Lynx time series from pelt provides annual Hudson Bay Company trading records for Canadian Lynx furs from 1845 to 1935.

Time interval: Annual (yearly)

Autoplot:

autoplot(pelt, Lynx)

Modified autoplot:

autoplot(pelt, Lynx) +
  labs(title = "Annual Canadian Lynx trappings, 1845–1935",
       x = "Year",
       y = "Number of Canadian Lynx pelts traded"
       )

Close from gafa_stock

After running ?gafa_stock in the console, I learned that the Close time series from gafa_stock provides historical stock prices in $USD from 2014-2018 for Google, Amazon, Facebook and Apple. Close contains the closing price for each stock, and the series is indexed by trading day, so the observations fall on ‘irregular trading days’ (language from the R documentation for the gafa_stock tsibble).

Time interval: daily (on trading days)

Autoplot:

autoplot(gafa_stock, Close)

Modified autoplot:

autoplot(gafa_stock, Close) +
  labs(title = "Historical stock prices (2014-2018)",
       subtitle = "Stocks: Google, Amazon, Facebook and Apple",
       x = "Date",
       y = "Closing price ($USD)"
       )

Demand from vic_elec

After running ?vic_elec in the console, I learned that the Demand time series from vic_elec provides total electricity demand in MWh (operational demand) in half-hourly time intervals.

Time interval: sub-daily (half-hourly)

Autoplot:

autoplot(vic_elec, Demand)

Modified autoplot:

autoplot(vic_elec, Demand) +
  labs(title = "Half-hourly electricity demand for Victoria, Australia",
       x = "Date and Time",
       y = "Total Electricity Demand in MWh"
       )

2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

gafa_stock |> 
  group_by(Symbol) |> 
  filter(Close == max(Close)) |>
  select(Symbol, Date, Close)
# A tsibble: 4 x 3 [!]
# Key:       Symbol [4]
# Groups:    Symbol [4]
  Symbol Date       Close
  <chr>  <date>     <dbl>
1 AAPL   2018-10-03  232.
2 AMZN   2018-09-04 2040.
3 FB     2018-07-25  218.
4 GOOG   2018-07-26 1268.

Interestingly, the peak closing prices for all four stocks occurred in 2018, within the same several-month span: late July to early October.
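
As a quick check on that observation, we can pull just the peak-close dates and look at their range (a minimal sketch reusing the filter above):

gafa_stock |>
  group_by(Symbol) |>
  filter(Close == max(Close)) |>
  pull(Date) |>   # extract the four peak dates
  range()         # earliest and latest peak date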

2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

Read the data into R:

tute1 <- read_csv("tute1.csv")
Rows: 100 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (3): Sales, AdBudget, GDP
date (1): Quarter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Convert the data to time series:

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

Construct time series plots of each of the three series:

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() 

If we don’t include facet_grid(), all three series are drawn in a single panel on a shared y-axis. Here that works reasonably well because the series barely overlap, and it lets us compare them to one another on the same time scale; the trade-off is that a shared axis can hide detail when series sit on very different scales.

2.4

The USgas package contains data on the demand for natural gas in the US.

  • Install the USgas package.

  • Create a tsibble from us_total with year as the index and state as the key.

  • Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

library(USgas) 

glimpse(us_total)
Rows: 1,266
Columns: 3
$ year  <int> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ y     <int> 324158, 329134, 337270, 353614, 332693, 379343, 350345, 382367, …

The column names are all lowercase (year, state and y). We’ll rename the “y” column, which we know after running ?us_total holds the yearly total natural gas consumption, in millions of cubic feet, by state.

us_total |> 
  filter(state %in% c("Connecticut", "Maine", "Massachusetts", 
                      "New Hampshire", "Rhode Island", "Vermont")) |> 
  as_tsibble(key = state, index = year) |> 
  rename(consumption = y) -> new_england_natural_gas_consumption

new_england_natural_gas_consumption
# A tsibble: 138 x 3 [1Y]
# Key:       state [6]
    year state       consumption
   <int> <chr>             <int>
 1  1997 Connecticut      144708
 2  1998 Connecticut      131497
 3  1999 Connecticut      152237
 4  2000 Connecticut      159712
 5  2001 Connecticut      146278
 6  2002 Connecticut      177587
 7  2003 Connecticut      154075
 8  2004 Connecticut      162642
 9  2005 Connecticut      168067
10  2006 Connecticut      172682
# ℹ 128 more rows
autoplot(new_england_natural_gas_consumption, consumption) +
  labs(title = "New England Annual natural gas consumption by state",
       subtitle = "CT, ME, MA, NH, RI, VT",
       y = "Million cubic feet",
       x = "Year")

Massachusetts is by far the largest consumer of natural gas in New England, followed by Connecticut. Both states show an increasing trend over the period from 1997 to 2019.
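
To back up that reading, we could rank the New England states by consumption in the most recent year of the data (a minimal sketch; run it to confirm the ordering):

new_england_natural_gas_consumption |>
  as_tibble() |>                    # drop the tsibble structure for a plain summary table
  filter(year == max(year)) |>      # keep only the latest year on record
  arrange(desc(consumption))        # largest consumer first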

2.5

  • Download tourism.xlsx from the book website and read it into R using readxl::read_excel()
  • Create a tsibble which is identical to the tourism tsibble from the tsibble package.
  • Find what combination of Region and Purpose had the maximum number of overnight trips on average.
  • Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourism2 <- readxl::read_excel("tourism.xlsx")

We’ve named this new object tourism2 so it doesn’t get confused with the tourism tsibble we’re trying to reproduce.

tourism2_tsibble <- tourism2 |> 
  mutate(Quarter = yearquarter(Quarter)) |> 
  as_tsibble(key = c(Region, State, Purpose),index = Quarter)

tourism2_tsibble
# A tsibble: 24,320 x 5 [1Q]
# Key:       Region, State, Purpose [304]
   Quarter Region   State           Purpose  Trips
     <qtr> <chr>    <chr>           <chr>    <dbl>
 1 1998 Q1 Adelaide South Australia Business  135.
 2 1998 Q2 Adelaide South Australia Business  110.
 3 1998 Q3 Adelaide South Australia Business  166.
 4 1998 Q4 Adelaide South Australia Business  127.
 5 1999 Q1 Adelaide South Australia Business  137.
 6 1999 Q2 Adelaide South Australia Business  200.
 7 1999 Q3 Adelaide South Australia Business  169.
 8 1999 Q4 Adelaide South Australia Business  134.
 9 2000 Q1 Adelaide South Australia Business  154.
10 2000 Q2 Adelaide South Australia Business  169.
# ℹ 24,310 more rows
tourism
# A tsibble: 24,320 x 5 [1Q]
# Key:       Region, State, Purpose [304]
   Quarter Region   State           Purpose  Trips
     <qtr> <chr>    <chr>           <chr>    <dbl>
 1 1998 Q1 Adelaide South Australia Business  135.
 2 1998 Q2 Adelaide South Australia Business  110.
 3 1998 Q3 Adelaide South Australia Business  166.
 4 1998 Q4 Adelaide South Australia Business  127.
 5 1999 Q1 Adelaide South Australia Business  137.
 6 1999 Q2 Adelaide South Australia Business  200.
 7 1999 Q3 Adelaide South Australia Business  169.
 8 1999 Q4 Adelaide South Australia Business  134.
 9 2000 Q1 Adelaide South Australia Business  154.
10 2000 Q2 Adelaide South Australia Business  169.
# ℹ 24,310 more rows

Great! They look identical to me now.
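
Rather than only eyeballing the printouts, we can also compare the two objects programmatically (a minimal sketch; it should return TRUE, or else describe any remaining differences):

all.equal(tourism2_tsibble, tourism)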

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

maximum_avg_trips <- tourism2 |> 
  group_by(Region, Purpose) |>  
  summarise(avg_trips = mean(Trips, na.rm = TRUE)) |> 
  ungroup() |>
  select(Region, Purpose, avg_trips) |> 
  slice_max(avg_trips, n = 10)
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
maximum_avg_trips
# A tibble: 10 × 3
   Region          Purpose  avg_trips
   <chr>           <chr>        <dbl>
 1 Sydney          Visiting      747.
 2 Melbourne       Visiting      619.
 3 Sydney          Business      602.
 4 North Coast NSW Holiday       588.
 5 Sydney          Holiday       550.
 6 Gold Coast      Holiday       528.
 7 Melbourne       Holiday       507.
 8 South Coast     Holiday       495.
 9 Brisbane        Visiting      493.
10 Melbourne       Business      478.

We can see that the combination of the region “Sydney” and the purpose “Visiting” had the highest average number of overnight trips, at roughly 747 trips per quarter.
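
If we only want the single top combination rather than the top ten, the same pipeline with slice_max(avg_trips, n = 1) returns just that row (a minimal sketch):

tourism2 |>
  group_by(Region, Purpose) |>
  summarise(avg_trips = mean(Trips, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_trips, n = 1)   # Sydney / Visiting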

Below we create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourism2_tsibble2 <- tourism2_tsibble |> 
  select(Quarter, Trips, State) |> 
  group_by(State) |> 
  summarise(total_trips = sum(Trips), .groups = "drop") 

tourism2_tsibble2
# A tsibble: 640 x 3 [1Q]
# Key:       State [8]
   State Quarter total_trips
   <chr>   <qtr>       <dbl>
 1 ACT   1998 Q1        551.
 2 ACT   1998 Q2        416.
 3 ACT   1998 Q3        436.
 4 ACT   1998 Q4        450.
 5 ACT   1999 Q1        379.
 6 ACT   1999 Q2        558.
 7 ACT   1999 Q3        449.
 8 ACT   1999 Q4        595.
 9 ACT   2000 Q1        600.
10 ACT   2000 Q2        557.
# ℹ 630 more rows
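
Plotting these state totals gives a quick visual comparison of overall trip volume by State (a minimal sketch):

autoplot(tourism2_tsibble2, total_trips) +
  labs(title = "Total overnight trips by State",
       x = "Quarter",
       y = "Total overnight trips")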

2.8

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

  • Can you spot any seasonality, cyclicity and trend?
  • What do you learn about the series?
  • What can you say about the seasonal patterns?
  • Can you identify any unusual years?

“Total Private” Employed from us_employment

First, we’ll create a tsibble that contains just this filtered-down data.

us_employment_totalprivate <- us_employment |> 
  filter(Title == "Total Private") |> 
  select(Month, Series_ID, Employed) 

autoplot()

autoplot(us_employment_totalprivate)
Plot variable not specified, automatically selected `.vars = Employed`

gg_season()

us_employment_totalprivate |> 
  gg_season(Employed, labels = "both")

gg_subseries()

us_employment_totalprivate |> 
  gg_subseries(Employed)

gg_lag()

us_employment_totalprivate |> 
  gg_lag(Employed, geom = "point")

ACF()

us_employment_totalprivate |> 
  ACF(Employed) 
# A tsibble: 29 x 3 [1M]
# Key:       Series_ID [1]
   Series_ID          lag   acf
   <chr>         <cf_lag> <dbl>
 1 CEU0500000001       1M 0.997
 2 CEU0500000001       2M 0.993
 3 CEU0500000001       3M 0.990
 4 CEU0500000001       4M 0.986
 5 CEU0500000001       5M 0.983
 6 CEU0500000001       6M 0.980
 7 CEU0500000001       7M 0.977
 8 CEU0500000001       8M 0.974
 9 CEU0500000001       9M 0.971
10 CEU0500000001      10M 0.968
# ℹ 19 more rows
us_employment_totalprivate |> 
  ACF(Employed) |> 
  autoplot()

  • Can you spot any seasonality, cyclicity and trend? There is a clear upward trend with seasonality layered on top: hiring and layoff patterns recur in the same months across years. There are also visible cycles, with likely recessions showing up as dips in the series. The strong trend is also apparent in the autocorrelation function: when data have a trend, the autocorrelations for small lags tend to be large and positive, and here all of the correlations are strongly positive.

  • What do you learn about the series? The number of employed persons trends strongly upward across the series, apart from dips around recessions. The dataset ends before the COVID-19 global pandemic, which would likely have changed both the overall trend and the seasonal patterns considerably.

  • What can you say about the seasonal patterns? Winter brings a drop in employment, with January consistently showing the lowest employment of the year, falling off after a December uptick that corresponds with the holiday season. The summer months (June, July and August) consistently show the highest employment, a strong seasonal pattern that recurs across the years in the data.

  • Can you identify any unusual years? We can see the impact of recessions in this data, in particular the “Great Recession” of 2008-2010, a period of major job losses in the US. Other, milder recessions are also visible, such as the dot-com bust of the early 2000s (see the zoomed plot sketched below).
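
To look more closely at the Great Recession window mentioned above, one option (a minimal sketch; the exact window is a judgement call) is to zoom the plot to those years:

us_employment_totalprivate |>
  filter(year(Month) >= 2006, year(Month) <= 2012) |>   # years around the 2008-2010 downturn
  autoplot(Employed) +
  labs(title = "Total private employment around the Great Recession",
       x = "Month",
       y = "Employed")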

Bricks from aus_production

First we’ll create a filtered tsibble for just “Bricks” production.

aus_production_bricks <- aus_production |> 
  select(Bricks, Quarter) 

autoplot()

autoplot(aus_production_bricks)
Plot variable not specified, automatically selected `.vars = Bricks`
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).

gg_season()

aus_production_bricks |> 
  gg_season(Bricks, labels = "both")
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_text()`).

gg_subseries()

aus_production_bricks |> 
  gg_subseries(Bricks)
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).

gg_lag()

aus_production_bricks |> 
  gg_lag(Bricks, geom = "point")
Warning: Removed 20 rows containing missing values (gg_lag).

ACF()

aus_production_bricks |> 
  ACF(Bricks) 
# A tsibble: 22 x 2 [1Q]
        lag   acf
   <cf_lag> <dbl>
 1       1Q 0.900
 2       2Q 0.815
 3       3Q 0.813
 4       4Q 0.828
 5       5Q 0.720
 6       6Q 0.642
 7       7Q 0.655
 8       8Q 0.692
 9       9Q 0.609
10      10Q 0.556
# ℹ 12 more rows
aus_production_bricks |> 
  ACF(Bricks) |> 
  autoplot()

  • Can you spot any seasonality, cyclicity and trend? We can observe an increasing trend from the start of the series until the early 1980s, after which production slowly trends downward from that peak. Quarterly seasonality is also visible, and the ACF shows stronger correlation at the smaller lags.

  • What do you learn about the series? We learn a few things, including that Australian brick production peaked in the early 1980s. We also see a consistent within-year pattern: production is highest in Q2-Q3 and lowest in Q1, perhaps reflecting the Q1 summer holiday period in Australia.

  • What can you say about the seasonal patterns? As noted above, production peaks seasonally in Q2-Q3, while Q1 is consistently the lowest-production quarter in the seasonal patterns observed.

  • Can you identify any unusual years? 1983-1984 show a sharp drop in Australian brick production (see the zoomed plot sketched below).
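
A zoomed plot of the early 1980s (a minimal sketch) makes that drop easier to see:

aus_production_bricks |>
  filter_index("1980 Q1" ~ "1985 Q4") |>   # restrict to the quarters around the drop
  autoplot(Bricks) +
  labs(title = "Australian clay brick production, 1980-1985",
       x = "Quarter",
       y = "Millions of bricks")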

Hare from pelt

First we’ll create a filtered tsibble for just “Hare” pelts in the pelt tsibble.

hare_pelt <- pelt |> 
  select(Year, Hare) 

autoplot()

autoplot(hare_pelt)
Plot variable not specified, automatically selected `.vars = Hare`

gg_season() This plot does not work for this tsibble because the data is reported annually. There are no “seasons” within a year, and when we attempt to use gg_season() an error is returned, indicating that there must be more than one observation per year to use gg_season().
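
For reference, the failing call looks like the following; wrapping it in try() lets the document keep rendering while still showing that gg_season() rejects annual data (a minimal sketch):

try(hare_pelt |> gg_season(Hare))   # errors: annual data has no sub-yearly seasonal period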

gg_subseries()

hare_pelt |> 
  gg_subseries(Hare)

gg_lag()

hare_pelt |> 
  gg_lag(Hare, geom = "point")

ACF()

hare_pelt |> 
  ACF(Hare) 
# A tsibble: 19 x 2 [1Y]
        lag     acf
   <cf_lag>   <dbl>
 1       1Y  0.658 
 2       2Y  0.214 
 3       3Y -0.155 
 4       4Y -0.401 
 5       5Y -0.493 
 6       6Y -0.401 
 7       7Y -0.168 
 8       8Y  0.113 
 9       9Y  0.307 
10      10Y  0.340 
11      11Y  0.296 
12      12Y  0.206 
13      13Y  0.0372
14      14Y -0.153 
15      15Y -0.285 
16      16Y -0.295 
17      17Y -0.202 
18      18Y -0.0676
19      19Y  0.0956
hare_pelt |> 
  ACF(Hare) |> 
  autoplot()

  • Can you spot any seasonality, cyclicity and trend? Since the data is annual, we cannot observe seasonality; however, we can observe a cyclical pattern of fluctuations over time. The peaks in the 1800s are higher overall than those later in the dataset. The cyclical pattern also shows up in the ACF, where the correlations rise and fall at roughly the cycle’s frequency.

  • What do you learn about the series? The series shows a recurring cycle, likely tied to the population dynamics of the Snowshoe Hare (the species traded here), and perhaps also to cycles in fashion and demand. The peaks and troughs recur at a roughly regular spacing. The ACF plot is especially telling: the autocorrelations rise and fall in waves as the lag increases, the signature of cyclic behaviour.

  • What can you say about the seasonal patterns? As noted previously, the data is annual, so we cannot observe seasonality in the standard sense, which requires more than one observation per year. We can see cyclical patterns, though, and this does not appear to be a white noise series: many spikes lie outside the 95% bounds indicated by the blue lines.

  • Can you identify any unusual years? 1863 stands out for its intense peak: the highest number of hare pelts traded by the Hudson Bay Company in the series, a level never reached again (confirmed in the sketch below).
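
We can confirm which year holds that peak directly from the data (a minimal sketch):

pelt |>
  filter(Hare == max(Hare))   # returns the single year with the most Hare pelts traded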

“H02” Cost from PBS

First we’ll create a filtered tsibble for just “H02” from the PBS tsibble.

PBS_H02 <- PBS |> 
  filter(ATC2 == "H02") |> 
  select(Month, Concession, Type, Cost) |> 
  summarise(TotalCost = sum(Cost))

autoplot()

autoplot(PBS_H02)
Plot variable not specified, automatically selected `.vars = TotalCost`

gg_season()

PBS_H02 |> 
  gg_season(TotalCost, labels = "both")

gg_subseries()

PBS_H02 |> 
  gg_subseries(TotalCost)

gg_lag()

PBS_H02 |> 
  gg_lag(TotalCost, geom = "point")

ACF()

PBS_H02 |> 
  ACF(TotalCost) 
# A tsibble: 23 x 2 [1M]
        lag    acf
   <cf_lag>  <dbl>
 1       1M 0.755 
 2       2M 0.558 
 3       3M 0.386 
 4       4M 0.228 
 5       5M 0.126 
 6       6M 0.0874
 7       7M 0.116 
 8       8M 0.203 
 9       9M 0.335 
10      10M 0.479 
# ℹ 13 more rows
PBS_H02 |> 
  ACF(TotalCost) |> 
  autoplot()

  • Can you spot any seasonality, cyclicity and trend? We can see a general upward trend over time. We can also see seasonality in the sense that there is strong correlation between January in one year and January in the prior year.
  • What do you learn about the series? We learn that there is repetition across years: strong seasonal patterns tied to the month of the year. We also observe an increasing trend over time.
  • What can you say about the seasonal patterns? There is a repeated seasonal pattern in which the total cost is lowest in the early part of the year, particularly in February. That month appears as the trough across years in the data, and December-January are the peaks.
  • Can you identify any unusual years? 2008 appears unusual; however, we are uncertain how the rest of that year turned out because the data cuts off mid-way through it. That said, it is one of the only years in the dataset where February is not the lowest month: March is lower than February in 2008, which is unusual compared to the other years in the tsibble (the monthly figures for 2008 are printed in the sketch below).
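
To check the 2008 pattern described above, we could print the monthly totals for that year (a minimal sketch):

PBS_H02 |>
  filter(year(Month) == 2008)   # monthly TotalCost values for 2008; compare February and March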

Barrels from us_gasoline

autoplot()

autoplot(us_gasoline)
Plot variable not specified, automatically selected `.vars = Barrels`

gg_season()

us_gasoline |> 
  gg_season(Barrels, labels = "both")

gg_subseries()

us_gasoline |> 
  gg_subseries(Barrels)

gg_lag()

us_gasoline |> 
  gg_lag(Barrels, geom = "point")

ACF()

us_gasoline |> 
  ACF(Barrels) 
# A tsibble: 31 x 2 [1W]
        lag   acf
   <cf_lag> <dbl>
 1       1W 0.893
 2       2W 0.882
 3       3W 0.873
 4       4W 0.866
 5       5W 0.847
 6       6W 0.844
 7       7W 0.832
 8       8W 0.831
 9       9W 0.822
10      10W 0.808
# ℹ 21 more rows
us_gasoline |> 
  ACF(Barrels) |> 
  autoplot()

  • Can you spot any seasonality, cyclicity and trend? There is a general increasing trend over time, visible both in the plot of the data and in the autocorrelation function, where the strongest correlations are at the smaller lags.

  • What do you learn about the series? There is strong positive correlation at the smaller lags: nearby weeks are most similar to one another, and while the larger lags are still positively correlated, the correlation weakens as the lag grows. We can also see a period of lower supply in the years after the Great Recession, and since roughly 2012 the series has trended upward again.

  • What can you say about the seasonal patterns? Looking across the weeks of the year, the early weeks tend to have the lowest levels and the middle of the year tends to have the highest.

  • Can you identify any unusual years? The period around 2011 stands out, when the series did not continue its long upward trend (see the zoomed plot sketched below).
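
Zooming the plot to the years around that stretch (a minimal sketch; the exact window is a judgement call) makes the break in the long upward trend easier to see:

us_gasoline |>
  filter_index("2006 W01" ~ "2013 W52") |>   # weeks around the post-recession dip
  autoplot(Barrels) +
  labs(title = "US finished motor gasoline supplied, 2006-2013",
       x = "Week",
       y = "Million barrels per day")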