library(tidyverse)
library(fpp3)
Assignment 1 - DATA 624
Assignment 1: Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Link to Chapter 2 exercises for reference.
Loading the Data
2.1
Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.
Use ? (or help()) to find out about the data in each series.
What is the time interval of each series?
Use autoplot() to produce a time plot of each series.
For the last plot, modify the axis labels and title.
Bricks from aus_production
After running ?aus_production in the console, I learned that the Bricks time series from aus_production provides quarterly estimates of clay brick production in millions of bricks in Australia.
Time interval: quarterly
Autoplot:
autoplot(aus_production, Bricks)
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
Modified autoplot:
autoplot(aus_production, Bricks) +
labs(title = "Australian Clay brick production",
x = "Quarter",
y = "Millions of Bricks"
)
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
Lynx from pelt
After running ?pelt in the console, I learned that the Lynx time series from pelt provides annual Hudson Bay Company trading records for Canadian Lynx furs from 1845 to 1935.
Time interval: Annual (yearly)
Autoplot:
autoplot(pelt, Lynx)
Modified autoplot:
autoplot(pelt, Lynx) +
labs(title = "Annual Canadian Lynx trappings, 1821–1934",
x = "Year",
y = "Number of Canadian Lynx pelts traded"
)
Close from gafa_stock
After running ?gafa_stock in the console, I learned that the Close time series from gafa_stock provides historical stock prices in $USD from 2014-2018 for Google, Amazon, Facebook and Apple, where Close contains the closing price for the stock on ‘irregular trading days’ (language from the R documentation for the gafa_stock tsibble).
Time interval: daily (on trading days)
Autoplot:
autoplot(gafa_stock, Close)
Modified autoplot:
autoplot(gafa_stock, Close) +
labs(title = "Historical stock prices (2014-2018)",
subtitle = "Stocks: Google, Amazon, Facebook and Apple",
x = "Date",
y = "Closing price ($USD)"
)
Demand from vic_elec
After running ?vic_elec in the console, I learned that the Demand time series from vic_elec provides total electricity demand in MWh (operational demand) in half-hourly time intervals.
Time interval: sub-daily (half-hourly)
Autoplot:
autoplot(vic_elec, Demand)
Modified autoplot:
autoplot(vic_elec, Demand) +
labs(title = "Half-hourly electricity demand for Victoria, Australia",
x = "Date and Time",
y = "Total Electricity Demand in MWh"
)
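Because half-hourly data is so dense, the full plot mostly shows the annual envelope. Below is a quick sketch (the one-week window is my choice) that zooms in to make the within-day pattern visible, using filter_index() from the tsibble package that fpp3 loads:
# Sketch: zoom in on one week to see the daily demand cycle.
vic_elec |>
  filter_index("2014-07-01" ~ "2014-07-07") |>
  autoplot(Demand) +
  labs(title = "One week of half-hourly electricity demand, Victoria",
       x = "Date and Time",
       y = "Total Electricity Demand in MWh")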
2.2
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
gafa_stock |>
group_by(Symbol) |>
filter(Close==max(Close)) |>
select(Symbol, Date, Close)
# A tsibble: 4 x 3 [!]
# Key: Symbol [4]
# Groups: Symbol [4]
Symbol Date Close
<chr> <date> <dbl>
1 AAPL 2018-10-03 232.
2 AMZN 2018-09-04 2040.
3 FB 2018-07-25 218.
4 GOOG 2018-07-26 1268.
Interestingly, all four peak closing prices fell in 2018, within the same span of a few months: late July to early October.
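To see those peaks in context, here is a sketch (the plot styling and the peaks object are my additions) that re-runs the filter and marks each peak on the 2018 closing prices:
# Sketch: mark each stock's peak closing day on its 2018 price series.
peaks <- gafa_stock |>
  group_by(Symbol) |>
  filter(Close == max(Close)) |>
  select(Symbol, Date, Close)
gafa_stock |>
  filter(year(Date) == 2018) |>
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  geom_point(data = peaks, colour = "red") +
  facet_wrap(~Symbol, scales = "free_y") +
  labs(title = "2018 closing prices with peak days marked")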
2.3
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
Read the data into R:
tute1 <- read_csv("tute1.csv")
Rows: 100 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Sales, AdBudget, GDP
date (1): Quarter
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Convert the data to time series:
mytimeseries <- tute1 |>
mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(index = Quarter)
Construct time series plots of each of the three series:
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
Check what happens when you don’t include facet_grid().
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()
If we don’t include facet_grid() we see all three series drawn in a single panel. Here that works reasonably well because the series barely overlap and we can compare them on the same time scale, but the shared y-axis compresses the series with smaller values, which can hide their variation.
2.4
The USgas package contains data on the demand for natural gas in the US.
Install the USgas package.
Create a tsibble from us_total with year as the index and state as the key.
Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
library(USgas)
glimpse(us_total)
Rows: 1,266
Columns: 3
$ year <int> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ y <int> 324158, 329134, 337270, 353614, 332693, 379343, 350345, 382367, …
We’ll have to rename the “y” column, which (after running ?us_total) we know is the yearly total natural gas consumption, in millions of cubic feet, by state.
us_total |>
filter(state %in% c("Connecticut", "Maine", "Massachusetts",
"New Hampshire", "Rhode Island", "Vermont")) |>
as_tsibble(key = state, index = year) |>
rename(consumption = y) -> new_england_natural_gas_consumption
new_england_natural_gas_consumption
# A tsibble: 138 x 3 [1Y]
# Key: state [6]
year state consumption
<int> <chr> <int>
1 1997 Connecticut 144708
2 1998 Connecticut 131497
3 1999 Connecticut 152237
4 2000 Connecticut 159712
5 2001 Connecticut 146278
6 2002 Connecticut 177587
7 2003 Connecticut 154075
8 2004 Connecticut 162642
9 2005 Connecticut 168067
10 2006 Connecticut 172682
# ℹ 128 more rows
autoplot(new_england_natural_gas_consumption, consumption) +
labs(title = "New England Annual natural gas consumption by state",
subtitle = "CT, ME, MA, NH, RI, VT",
y = "Million cubic feet",
x = "Year")Massachusetts is by far the largest consumer of natural gas in New England, colloed by Connecticut. Both states show an increasing trend over the time period from 1997 to 2019.
2.5
- Download tourism.xlsx from the book website and read it into R using readxl::read_excel()
- Create a tsibble which is identical to the tourism tsibble from the tsibble package.
- Find what combination of Region and Purpose had the maximum number of overnight trips on average.
- Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
tourism2 <- readxl::read_excel("tourism.xlsx")
We name this one tourism2 to avoid confusion with the packaged tourism tsibble we’re trying to replicate.
tourism2_tsibble <- tourism2 |>
mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(key = c(Region, State, Purpose),index = Quarter)
tourism2_tsibble
# A tsibble: 24,320 x 5 [1Q]
# Key: Region, State, Purpose [304]
Quarter Region State Purpose Trips
<qtr> <chr> <chr> <chr> <dbl>
1 1998 Q1 Adelaide South Australia Business 135.
2 1998 Q2 Adelaide South Australia Business 110.
3 1998 Q3 Adelaide South Australia Business 166.
4 1998 Q4 Adelaide South Australia Business 127.
5 1999 Q1 Adelaide South Australia Business 137.
6 1999 Q2 Adelaide South Australia Business 200.
7 1999 Q3 Adelaide South Australia Business 169.
8 1999 Q4 Adelaide South Australia Business 134.
9 2000 Q1 Adelaide South Australia Business 154.
10 2000 Q2 Adelaide South Australia Business 169.
# ℹ 24,310 more rows
tourism
# A tsibble: 24,320 x 5 [1Q]
# Key: Region, State, Purpose [304]
Quarter Region State Purpose Trips
<qtr> <chr> <chr> <chr> <dbl>
1 1998 Q1 Adelaide South Australia Business 135.
2 1998 Q2 Adelaide South Australia Business 110.
3 1998 Q3 Adelaide South Australia Business 166.
4 1998 Q4 Adelaide South Australia Business 127.
5 1999 Q1 Adelaide South Australia Business 137.
6 1999 Q2 Adelaide South Australia Business 200.
7 1999 Q3 Adelaide South Australia Business 169.
8 1999 Q4 Adelaide South Australia Business 134.
9 2000 Q1 Adelaide South Australia Business 154.
10 2000 Q2 Adelaide South Australia Business 169.
# ℹ 24,310 more rows
Great! The printouts look identical.
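To be thorough, a programmatic check is more convincing than eyeballing the printouts; a sketch using base R comparisons:
# Sketch: confirm the rebuilt tsibble matches the packaged one.
identical(tourism2_tsibble, tourism)  # strict: TRUE only if exactly equal
all.equal(tourism2_tsibble, tourism)  # TRUE, or a description of any differences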
Find what combination of Region and Purpose had the maximum number of overnight trips on average.
maximum_avg_trips <- tourism2 |>
group_by(Region, Purpose) |>
summarise(avg_trips = mean(Trips, na.rm = TRUE)) |>
ungroup() |>
select(Region, Purpose, avg_trips) |>
slice_max(avg_trips, n = 10)
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
maximum_avg_trips
# A tibble: 10 × 3
Region Purpose avg_trips
<chr> <chr> <dbl>
1 Sydney Visiting 747.
2 Melbourne Visiting 619.
3 Sydney Business 602.
4 North Coast NSW Holiday 588.
5 Sydney Holiday 550.
6 Gold Coast Holiday 528.
7 Melbourne Holiday 507.
8 South Coast Holiday 495.
9 Brisbane Visiting 493.
10 Melbourne Business 478.
We can see that the combination of region “Sydney” and purpose “Visiting” had the maximum number of overnight trips on average, at about 747 trips.
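If we only want that single top combination rather than the top ten, the same pipeline with slice_max(n = 1) returns just one row (the .groups argument is my addition to silence the grouping message):
# Sketch: pull only the top Region/Purpose combination.
tourism2 |>
  group_by(Region, Purpose) |>
  summarise(avg_trips = mean(Trips, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_trips, n = 1)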
Below we create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
tourism2_tsibble2 <- tourism2_tsibble |>
select(Quarter, Trips, State) |>
group_by(State) |>
summarise(total_trips = sum(Trips), .groups = "drop")
tourism2_tsibble2
# A tsibble: 640 x 3 [1Q]
# Key: State [8]
State Quarter total_trips
<chr> <qtr> <dbl>
1 ACT 1998 Q1 551.
2 ACT 1998 Q2 416.
3 ACT 1998 Q3 436.
4 ACT 1998 Q4 450.
5 ACT 1999 Q1 379.
6 ACT 1999 Q2 558.
7 ACT 1999 Q3 449.
8 ACT 1999 Q4 595.
9 ACT 2000 Q1 600.
10 ACT 2000 Q2 557.
# ℹ 630 more rows
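As a quick sanity check (not required by the exercise), the new state-level tsibble can be plotted directly:
# Sketch: visualize total trips by state.
autoplot(tourism2_tsibble2, total_trips) +
  labs(title = "Total overnight trips by state",
       y = "Total trips", x = "Quarter")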
2.8
Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.
- Can you spot any seasonality, cyclicity and trend?
- What do you learn about the series?
- What can you say about the seasonal patterns?
- Can you identify any unusual years?
“Total Private” Employed from us_employment
First, we’ll create a tsibble that contains just this filtered-down data.
us_employment_totalprivate <- us_employment |>
filter(Title == "Total Private") |>
select(Month, Series_ID, Employed)
autoplot()
autoplot(us_employment_totalprivate)
Plot variable not specified, automatically selected `.vars = Employed`
gg_season()
us_employment_totalprivate |>
gg_season(Employed, labels = "both")
gg_subseries()
us_employment_totalprivate |>
gg_subseries(Employed)
gg_lag()
us_employment_totalprivate |>
gg_lag(Employed, geom = "point")
ACF()
us_employment_totalprivate |>
ACF(Employed)
# A tsibble: 29 x 3 [1M]
# Key: Series_ID [1]
Series_ID lag acf
<chr> <cf_lag> <dbl>
1 CEU0500000001 1M 0.997
2 CEU0500000001 2M 0.993
3 CEU0500000001 3M 0.990
4 CEU0500000001 4M 0.986
5 CEU0500000001 5M 0.983
6 CEU0500000001 6M 0.980
7 CEU0500000001 7M 0.977
8 CEU0500000001 8M 0.974
9 CEU0500000001 9M 0.971
10 CEU0500000001 10M 0.968
# ℹ 19 more rows
us_employment_totalprivate |>
ACF(Employed) |>
autoplot()
Can you spot any seasonality, cyclicity and trend? There is a clear upward trend with seasonality layered on top: hiring and layoff patterns recur across years. Cyclic behaviour is also visible, with dips that likely correspond to recessions. The ACF reflects the strong trend: when data have a trend, the autocorrelations for small lags tend to be large and positive, and here all of the correlations are strongly positive.
What do you learn about the series? The number of employed persons trends steadily upward across the series, apart from recession-era dips. This dataset ends prior to the COVID-19 pandemic, which likely would have changed both the overall trend and the seasonal patterns considerably.
What can you say about the seasonal patterns? Winter brings a dip in employment: January is consistently the lowest month each year, dropping off after a December uptick that corresponds to the holiday season. The summer months (June, July and August) consistently show the highest employment numbers, a strong seasonal pattern that recurs across years.
Can you identify any unusual years? We can see the impact of recessions in this data, in particular the 2008-2010 “Great Recession”, a period of major job losses in the US. Other recessions are also visible (though less extreme), such as the dot-com bust of the early 2000s.
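To look at the Great Recession more closely, a sketch (the window boundaries are my choice) using tsibble's filter_index():
# Sketch: zoom in on the 2007-2011 window to see the recession dip.
us_employment_totalprivate |>
  filter_index("2007 Jan" ~ "2011 Dec") |>
  autoplot(Employed) +
  labs(title = "Total Private employment around the Great Recession")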
Bricks from aus_production
First we’ll create a filtered tsibble for just “Bricks” production.
aus_production_bricks <- aus_production |>
select(Bricks, Quarter)
autoplot()
autoplot(aus_production_bricks)
Plot variable not specified, automatically selected `.vars = Bricks`
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
gg_season()
aus_production_bricks |>
gg_season(Bricks, labels = "both")
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_text()`).
gg_subseries()
aus_production_bricks |>
gg_subseries(Bricks)
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).
gg_lag()
aus_production_bricks |>
gg_lag(Bricks, geom = "point")
Warning: Removed 20 rows containing missing values (gg_lag).
ACF()
aus_production_bricks |>
ACF(Bricks)
# A tsibble: 22 x 2 [1Q]
lag acf
<cf_lag> <dbl>
1 1Q 0.900
2 2Q 0.815
3 3Q 0.813
4 4Q 0.828
5 5Q 0.720
6 6Q 0.642
7 7Q 0.655
8 8Q 0.692
9 9Q 0.609
10 10Q 0.556
# ℹ 12 more rows
aus_production_bricks |>
ACF(Bricks) |>
autoplot()
Can you spot any seasonality, cyclicity and trend? There is an increasing trend from the start of the series until the early 1980s, followed by a slow decline from that peak. Quarterly seasonality is also visible. The ACF shows stronger correlations at the smaller lags, consistent with the trend.
What do you learn about the series? We learn that Australian brick production peaked in the early 1980s, and that production is consistently highest in the middle quarters of the year, which may reflect drier weather in parts of the country.
What can you say about the seasonal patterns? As noted above, production peaks seasonally in Q2-Q3, while Q1 is consistently the lowest-production quarter.
Can you identify any unusual years? 1983-1984 show a significant drop in Australian brick production.
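To check that drop more closely, a sketch (the window is my choice) zooming in on the early 1980s:
# Sketch: zoom in on the early-1980s collapse in brick production.
aus_production_bricks |>
  filter_index("1980 Q1" ~ "1986 Q4") |>
  autoplot(Bricks) +
  labs(title = "Clay brick production, 1980-1986")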
Hare from pelt
First we’ll create a filtered tsibble for just “Hare” pelts in the pelt tsibble.
hare_pelt <- pelt |>
select(Year, Hare)
autoplot()
autoplot(hare_pelt)
Plot variable not specified, automatically selected `.vars = Hare`
gg_season()
This plot does not work for this tsibble because the data is reported annually. There are no “seasons”, and when we attempt to use gg_season() an error is returned, telling us that the data must contain more than one observation per year to use gg_season().
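A sketch of that attempt, wrapped in tryCatch() so the document still knits while showing the error message:
# Sketch: gg_season() errors on annual data; capture the message instead.
tryCatch(
  print(hare_pelt |> gg_season(Hare)),
  error = function(e) message("gg_season() failed: ", conditionMessage(e))
)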
gg_subseries()
hare_pelt |>
gg_subseries(Hare)
gg_lag()
hare_pelt |>
gg_lag(Hare, geom = "point")
ACF()
hare_pelt |>
ACF(Hare)
# A tsibble: 19 x 2 [1Y]
lag acf
<cf_lag> <dbl>
1 1Y 0.658
2 2Y 0.214
3 3Y -0.155
4 4Y -0.401
5 5Y -0.493
6 6Y -0.401
7 7Y -0.168
8 8Y 0.113
9 9Y 0.307
10 10Y 0.340
11 11Y 0.296
12 12Y 0.206
13 13Y 0.0372
14 14Y -0.153
15 15Y -0.285
16 16Y -0.295
17 17Y -0.202
18 18Y -0.0676
19 19Y 0.0956
hare_pelt |>
ACF(Hare) |>
autoplot()
Can you spot any seasonality, cyclicity and trend? Since the data is annual, we cannot observe seasonality; however, we can observe cyclical fluctuations over time, with a period of roughly nine to ten years (the ACF is negative around lag 5 and positive again around lags 9-10). The peaks in the 1800s are also higher overall than those later in the series. The cyclical pattern shows up clearly in the ACF as correlations at the cycle frequency.
What do you learn about the series? The cycle likely corresponds to the population cycle of the Snowshoe Hare (and its predator-prey dynamics with the lynx), or perhaps to cycles in fashion and demand for furs. It is interesting that the peaks and troughs recur so regularly. The ACF autoplot is particularly telling: the autocorrelations oscillate in waves, the classic signature of cyclic data.
What can you say about the seasonal patterns? As noted above, this data is annual, so standard seasonality (which requires more than one observation per year) cannot be observed. We can see cyclical patterns, though, and this does not appear to be a white noise series: many spikes fall outside the 95% bounds indicated by the blue dashed lines in the ACF plot.
Can you identify any unusual years? 1863 stands out for its intense peak - a year of more hare pelts traded by the Hudson Bay Company than any other, a level never reached again in the series.
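To confirm the peak year from the data rather than the plot, the same filter idiom as in 2.2:
# Sketch: find the year with the most Hare pelts traded.
hare_pelt |>
  filter(Hare == max(Hare))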
“H02” Cost from PBS
First we’ll create a filtered tsibble for just “H02” from the PBS tsibble.
PBS_H02 <- PBS |>
filter(ATC2 == "H02") |>
select(Month, Concession, Type, Cost) |>
summarise(TotalCost = sum(Cost))
autoplot()
autoplot(PBS_H02)
Plot variable not specified, automatically selected `.vars = TotalCost`
gg_season()
PBS_H02 |>
gg_season(TotalCost, labels = "both")
gg_subseries()
PBS_H02 |>
gg_subseries(TotalCost)
gg_lag()
PBS_H02 |>
gg_lag(TotalCost, geom = "point")
ACF()
PBS_H02 |>
ACF(TotalCost)
# A tsibble: 23 x 2 [1M]
lag acf
<cf_lag> <dbl>
1 1M 0.755
2 2M 0.558
3 3M 0.386
4 4M 0.228
5 5M 0.126
6 6M 0.0874
7 7M 0.116
8 8M 0.203
9 9M 0.335
10 10M 0.479
# ℹ 13 more rows
PBS_H02 |>
ACF(TotalCost) |>
autoplot()
- Can you spot any seasonality, cyclicity and trend? We can see a general upward trend over time, along with clear seasonality: there is strong correlation between a given month in one year and the same month in the prior year.
- What do you learn about the series? There is strong repetition across years - seasonal patterns tied to the month of the year - along with an increasing trend over time.
- What can you say about the seasonal patterns? There is a repeated seasonal pattern in which total cost is lowest early in the year, particularly in February, which appears as the trough across years, while December-January are the peaks.
- Can you identify any unusual years? 2008 appears unusual, although we cannot be sure how the rest of the year turned out because the data cuts off mid-way through it. That said, 2008 is one of the only years in the dataset where February is not the trough: March is lower than February that year, unlike the other years in the tsibble.
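To check the 2008 claim numerically, a sketch filtering the monthly totals to that year (year() comes from lubridate, loaded with the tidyverse):
# Sketch: inspect 2008's monthly totals; March should sit below February.
PBS_H02 |>
  filter(year(Month) == 2008)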
Barrels from us_gasoline
autoplot()
autoplot(us_gasoline)
Plot variable not specified, automatically selected `.vars = Barrels`
gg_season()
us_gasoline |>
gg_season(Barrels, labels = "both")
gg_subseries()
us_gasoline |>
gg_subseries(Barrels)
gg_lag()
us_gasoline |>
gg_lag(Barrels, geom = "point")
ACF()
us_gasoline |>
ACF(Barrels)
# A tsibble: 31 x 2 [1W]
lag acf
<cf_lag> <dbl>
1 1W 0.893
2 2W 0.882
3 3W 0.873
4 4W 0.866
5 5W 0.847
6 6W 0.844
7 7W 0.832
8 8W 0.831
9 9W 0.822
10 10W 0.808
# ℹ 21 more rows
us_gasoline |>
ACF(Barrels) |>
autoplot()
Can you spot any seasonality, cyclicity and trend? There is a general increasing trend over time, visible both in the time plot and in the ACF, where the strongest correlations are at the smallest lags.
What do you learn about the series? The strong correlation at small lags tells us that observations close together in time are similar (the smaller lags show stronger positive correlation than the larger lags, which are still positive but weaker). We can also see lower volumes in the period after the Great Recession, and an upward trend again since roughly 2012.
What can you say about the seasonal patterns? Looking at the weeks across years, the early weeks of a year tend to have the lowest volumes, while the middle of the year tends to have the highest.
Can you identify any unusual years? The period around 2011 is unusual: the series flattened rather than continuing its steady upward trend.
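A zoomed-in sketch around that period (the window is my choice) makes the flattening easier to see:
# Sketch: zoom in on the post-recession plateau in weekly gasoline supply.
us_gasoline |>
  filter_index("2007 W01" ~ "2013 W52") |>
  autoplot(Barrels) +
  labs(title = "US finished motor gasoline supplied, 2007-2013")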