This assignment answers exercises from the Forecasting: Principles and Practice textbook, 3rd edition, Chapter 2, Section 10 (Exercises). The prompt for each question is included before its respective code/answers.
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(readxl)
library(tsibble)
library(dplyr)
library(ggplot2)
Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.
Use ? (or help()) to find out about the data in each series. What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.
The code below first selects the Bricks column from the aus_production data set and previews the head to see the data formatting. Next, the data documentation is explored with help(). Bricks is defined as ‘Clay brick production in millions of bricks.’ The interval of the data is quarterly. A basic autoplot() is produced by passing the data set and variable.
bricks_raw <- aus_production |> select(Bricks)
head(bricks_raw)
## # A tsibble: 6 x 2 [1Q]
## Bricks Quarter
## <dbl> <qtr>
## 1 189 1956 Q1
## 2 204 1956 Q2
## 3 208 1956 Q3
## 4 197 1956 Q4
## 5 187 1957 Q1
## 6 214 1957 Q2
help(aus_production)
# Bricks: Clay brick production in millions of bricks.
interval(aus_production)
## <interval[1]>
## [1] 1Q
# Quarter
autoplot(aus_production, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
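The warning reflects the missing Bricks values at the end of the series. If the warning is a concern, the NA rows can be dropped explicitly first; a small optional tweak, not part of the exercise:
# Optional: drop the missing Bricks observations before plotting
aus_production |>
  filter(!is.na(Bricks)) |>
  autoplot(Bricks)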
The code below first selects the Lynx column from the pelt data set and previews the head to see the data formatting. Next, the data documentation is explored with help(). Lynx is defined as ‘The number of Canadian Lynx pelts traded.’ The interval of the data is annual. A basic autoplot() is produced by passing the data set and variable.
lynx_raw <- pelt |> select(Lynx)
head(lynx_raw)
## # A tsibble: 6 x 2 [1Y]
## Lynx Year
## <dbl> <dbl>
## 1 30090 1845
## 2 45150 1846
## 3 49150 1847
## 4 39520 1848
## 5 21230 1849
## 6 8420 1850
help(pelt)
# Lynx: The number of Canadian Lynx pelts traded.
interval(pelt)
## <interval[1]>
## [1] 1Y
# Year
autoplot(pelt, Lynx)
The code below first selects the Close column from the gafa_stock data set and previews the head to see the data formatting. Next, the data documentation is explored with help(). Close is defined as ‘The closing price for the stock.’ The interval() method returns ‘!’, meaning the intervals are irregular. To find out what intervals exist in the data, I take the difference between each row’s date and the previous row’s date using lag(Date) inside mutate(). The distinct intervals are 1, 2, 3, and 4 days (trading days skip weekends and holidays), plus -1824 days, which is an artifact of lag() crossing from the last date of one stock symbol to the first date of the next rather than a misalignment of the data set. A basic autoplot() follows.
close_raw <- gafa_stock |> select(Close)
head(close_raw)
## # A tsibble: 6 x 3 [!]
## # Key: Symbol [1]
## Close Date Symbol
## <dbl> <date> <chr>
## 1 79.0 2014-01-02 AAPL
## 2 77.3 2014-01-03 AAPL
## 3 77.7 2014-01-06 AAPL
## 4 77.1 2014-01-07 AAPL
## 5 77.6 2014-01-08 AAPL
## 6 76.6 2014-01-09 AAPL
help(gafa_stock)
# Close: The closing price for the stock.
interval(gafa_stock)
## <interval[1]>
## [1] !
# Returns '!' which suggests irregular intervals. Check what unique intervals there are:
gafa_stock |>
mutate(diff = Date - lag(Date)) |>
distinct(diff) |>
drop_na()
## # A tibble: 5 × 1
## diff
## <drtn>
## 1 1 days
## 2 3 days
## 3 4 days
## 4 2 days
## 5 -1824 days
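Because gafa_stock is keyed by Symbol, the -1824 day value comes from lag(Date) crossing a symbol boundary. A quick check (my addition, not part of the exercise) computes the gaps within each symbol, so only the positive trading-day gaps should remain:
gafa_stock |>
  group_by(Symbol) |>                 # keep date differences within each stock
  mutate(diff = Date - lag(Date)) |>
  ungroup() |>
  distinct(diff) |>
  drop_na()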
autoplot(gafa_stock, Close)
The code below first selects the Demand column from the vic_elec data set and previews the head to see the data formatting. Next, the data documentation is explored with help(). Demand is defined as ‘Total electricity demand in MWh.’ The interval of the data is 30 minutes. autoplot() is used with a descriptive title, x-axis label, and y-axis label. The plot is quite dense, since observations arrive every 30 minutes and the data spans several years.
demand_raw <- vic_elec |> select(Demand)
head(demand_raw)
## # A tsibble: 6 x 2 [30m] <Australia/Melbourne>
## Demand Time
## <dbl> <dttm>
## 1 4383. 2012-01-01 00:00:00
## 2 4263. 2012-01-01 00:30:00
## 3 4049. 2012-01-01 01:00:00
## 4 3878. 2012-01-01 01:30:00
## 5 4036. 2012-01-01 02:00:00
## 6 3866. 2012-01-01 02:30:00
help(vic_elec)
# Demand: Total electricity demand in MWh.
interval(vic_elec)
## <interval[1]>
## [1] 30m
# 30 minutes
autoplot(vic_elec, Demand) +
ggtitle("Electricity Demand") +
ylab("Demand (MWh)") +
xlab("Year")
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
The peak closing price for each stock is found by the calculations below. First, gafa_stock is grouped by Symbol. The data is then filtered to the rows where Close equals each group’s maximum, only the Symbol, Date, and Close columns are kept, and the result is ordered. The closing date and price at each of the four stocks’ peaks are shown below.
# Find the peak closing price for each stock
peak_days <- gafa_stock |>
group_by(Symbol) |>
filter(Close == max(Close, na.rm = TRUE)) |>
select(Symbol, Date, Close) |>
arrange(Symbol, Date)
print(peak_days)
## # A tsibble: 4 x 3 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Close
## <chr> <date> <dbl>
## 1 AAPL 2018-10-03 232.
## 2 AMZN 2018-09-04 2040.
## 3 FB 2018-07-25 218.
## 4 GOOG 2018-07-26 1268.
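An equivalent approach (my addition, not the textbook’s) uses dplyr’s slice_max() after dropping the tsibble structure; it should return the same four rows:
gafa_stock |>
  as_tibble() |>                 # drop the tsibble index so rows can be sliced freely
  group_by(Symbol) |>
  slice_max(Close, n = 1) |>     # keep the row with the highest Close per stock
  select(Symbol, Date, Close)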
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
After opening the data in Excel and looking at the columns and values, the data is read into this R Markdown document. The logic below is given by the textbook: tute1.csv, stored in this project’s repository, is read in and the data is previewed using the head() function.
# Use read_csv for tute1.csv and preview data
tute1 <- readr::read_csv("tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Sales, AdBudget, GDP
## date (1): Quarter
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tute1)
## # A tibble: 6 × 4
## Quarter Sales AdBudget GDP
## <date> <dbl> <dbl> <dbl>
## 1 1981-03-01 1020. 659. 252.
## 2 1981-06-01 889. 589 291.
## 3 1981-09-01 795 512. 291.
## 4 1981-12-01 1004. 614. 292.
## 5 1982-03-01 1058. 647. 279.
## 6 1982-06-01 944. 602 254
Next, following the textbook, the data is converted to a tsibble.
mytimeseries <- tute1 |>
mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(index = Quarter)
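A quick optional check (my addition) confirms the index was recognised as quarterly:
interval(mytimeseries)   # should report 1Q once Quarter is a yearquarter index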
Then time series plots of each of the three series are constructed.
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
Without the facet_grid() line, all of the series are drawn on the same graph. Each measures something different, with different ranges of values, so on a shared y-axis the changes in GDP across time appear flattened.
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()
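If the series must share one panel despite their different scales, one option (my addition, not from the textbook) is to index each series to 100 at the first quarter before plotting:
mytimeseries |>
  pivot_longer(-Quarter) |>
  group_by(name) |>
  mutate(indexed = 100 * value / first(value)) |>  # rescale each series to start at 100
  ggplot(aes(x = Quarter, y = indexed, colour = name)) +
  geom_line()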
The USgas package contains data on the demand for natural gas in the US.
This logic loads the USgas package, with the command for installing that package commented out. The tsibble us_gas_tsibble is created by passing the us_total data through as_tsibble(), assigning year as the index and state as the key. The New England states are saved in the vector new_england_states. Then autoplot() plots only those New England states, with a title, axis labels, and a facet wrap.
# install.packages("USgas")
library(USgas)
us_gas_tsibble <- us_total |>
as_tsibble(index = year, key = state)
new_england_states <- c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")
us_gas_tsibble |>
filter(state %in% new_england_states) |>
autoplot(y) +
ggtitle("Annual Natural Gas Consumption in New England") +
ylab("Consumption") +
xlab("Year") +
facet_wrap(~ state, scales = "free_y")
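A quick structural check (my addition) confirms the key and interval of the new tsibble:
key_vars(us_gas_tsibble)   # should list "state"
interval(us_gas_tsibble)   # should report 1Y (annual)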
Data is read in using read_excel(). The data structure, column names, and data types are checked against the tourism tsibble from the tsibble package, the new tsibble is created to mirror that structure, and the two are compared with all.equal(). For part (c), the combination of Region and Purpose with the maximum number of overnight trips on average is sought via group_by(), summarise(), arrange(), and slice_head(). Note that because summarise() on a tsibble keeps the Quarter index, the result (Melbourne, Visiting) is actually the single busiest quarter rather than the highest average across all quarters; a corrected sketch follows the output below. For part (d), a new tsibble is created which combines the Purposes and Regions, leaving total trips. This is done by indexing by Quarter and using summarise() on Trips; a by-State variant, as the exercise asks, is sketched after that output.
tourism_data <- readxl::read_excel("tourism.xlsx")
head(tourism)
## # A tsibble: 6 x 5 [1Q]
## # Key: Region, State, Purpose [1]
## Quarter Region State Purpose Trips
## <qtr> <chr> <chr> <chr> <dbl>
## 1 1998 Q1 Adelaide South Australia Business 135.
## 2 1998 Q2 Adelaide South Australia Business 110.
## 3 1998 Q3 Adelaide South Australia Business 166.
## 4 1998 Q4 Adelaide South Australia Business 127.
## 5 1999 Q1 Adelaide South Australia Business 137.
## 6 1999 Q2 Adelaide South Australia Business 200.
tourism_tsibble <- tourism_data |>
mutate(Quarter = yearquarter(Quarter)) |> # Ensure Quarter is in year-quarter format
as_tsibble(index = Quarter, key = c(Region, State, Purpose)) # Convert to tsibble
all.equal(tourism_tsibble, tourism)
## [1] TRUE
max_trips <- tourism_tsibble |>
group_by(Region, Purpose) |>
summarise(avg_trips = mean(Trips, na.rm = TRUE), .groups = "drop") |>
arrange(desc(avg_trips)) |>
slice_head(n = 1)
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Region`, `Purpose`, `Quarter` first.
print(max_trips)
## # A tsibble: 1 x 4 [1Q]
## # Key: Region, Purpose [1]
## Region Purpose Quarter avg_trips
## <chr> <chr> <qtr> <dbl>
## 1 Melbourne Visiting 2017 Q4 985.
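Because summarise() on a tsibble keeps the Quarter index, avg_trips above is a per-quarter value and the result is the single busiest quarter. To average across all quarters, the index can be dropped first; a sketch of the corrected calculation (my addition, output not shown):
tourism_tsibble |>
  as_tibble() |>                                     # drop the index so the mean spans all quarters
  group_by(Region, Purpose) |>
  summarise(avg_trips = mean(Trips, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_trips, n = 1)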
state_trips_tsibble <- tourism_tsibble |>
index_by(Quarter) |>
summarise(Total_Trips = sum(Trips, na.rm = TRUE)) |>
as_tsibble(index = Quarter)
print(state_trips_tsibble)
## # A tsibble: 80 x 2 [1Q]
## Quarter Total_Trips
## <qtr> <dbl>
## 1 1998 Q1 23182.
## 2 1998 Q2 20323.
## 3 1998 Q3 19827.
## 4 1998 Q4 20830.
## 5 1999 Q1 22087.
## 6 1999 Q2 21458.
## 7 1999 Q3 19914.
## 8 1999 Q4 20028.
## 9 2000 Q1 22339.
## 10 2000 Q2 19941.
## # ℹ 70 more rows
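As written, the summary above also collapses the State dimension. To keep total trips by State as the exercise asks, State can be added as a grouping before summarising; a sketch (my addition, output not shown):
state_trips_by_state <- tourism_tsibble |>
  group_by(State) |>
  summarise(Total_Trips = sum(Trips, na.rm = TRUE))  # one total per State per Quarter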
Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.
Can you spot any seasonality, cyclicity and trend? What do you learn about the series? What can you say about the seasonal patterns? Can you identify any unusual years?
All of the data sets are loaded below, and each analysis is broken out by data set.
data("us_employment")
data("aus_production")
data("pelt")
data("PBS")
data("us_gasoline")
The graphs below help answer the prompt questions.
Can you spot any seasonality, cyclicity and trend? -> The Seasonality of Employment graph shows small seasonal bumps: slight increases in early summer, with the series holding steady or dipping in other months. Cyclicity comes from major economic events. The upward slope of the plot shows a positive overall trend.
What do you learn about the series? -> Overall, private employment grows steadily, with mild seasonality and occasional recession-driven interruptions.
What can you say about the seasonal patterns? -> The seasonality is mild, with increases in early summer and decreases in other months.
Can you identify any unusual years? -> There is an unusual dip in employment around 2008-2010. This is of course from the Great Recession.
employment_ts <- us_employment |> filter(Title == "Total Private")
# autoplot()
autoplot(employment_ts, Employed) +
ggtitle("Total Private Employment Over Time") +
ylab("Employment (thousands)") + xlab("Year")
# gg_season()
gg_season(employment_ts, Employed) + ggtitle("Seasonality of Employment")
# gg_subseries()
gg_subseries(employment_ts, Employed) + ggtitle("Employment Subseries Plot")
# gg_lag()
gg_lag(employment_ts, Employed, lags = 1:12) + ggtitle("Lag Plot for Employment")
# ACF()
employment_ts |> ACF(Employed) |> autoplot() + ggtitle("ACF of Employment")
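To inspect the recession dip more closely, the series can be filtered to the surrounding decade; a small optional addition, not required by the exercise:
employment_ts |>
  filter(year(Month) >= 2005, year(Month) <= 2015) |>  # zoom in around the Great Recession
  autoplot(Employed) +
  ggtitle("Total Private Employment, 2005-2015")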
The graphs below help answer the prompt questions.
Can you spot any seasonality, cyclicity and trend? -> Brick production shows strong seasonality, with Q1 typically the lowest quarter of each year. The trend increases until around 1980 and turns downward after, with fluctuations throughout.
What can you say about the seasonal patterns? -> Brick production dips in Q1, rises into Q2, and changes variably from Q2 to Q3 and from Q3 to Q4.
Can you identify any unusual years? -> There are unusual dips in brick production around 1975 and 1983-1984.
# autoplot()
autoplot(aus_production, Bricks) +
ggtitle("Brick Production Over Time") +
ylab("Millions of Bricks") + xlab("Year")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
# gg_season()
gg_season(aus_production, Bricks) + ggtitle("Seasonality of Brick Production")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
# gg_subseries()
gg_subseries(aus_production, Bricks) + ggtitle("Brick Production Subseries Plot")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
# gg_lag()
gg_lag(aus_production, Bricks, lags = 1:12) + ggtitle("Lag Plot for Bricks")
## Warning: Removed 20 rows containing missing values (gg_lag).
# ACF()
aus_production |> ACF(Bricks) |> autoplot() + ggtitle("ACF of Bricks")
The graphs below help answer the prompt questions.
Can you spot any seasonality, cyclicity and trend? -> A seasonal plot could not be created, as the data does not support that drill-down. The data shows strong cyclic behavior of roughly 10-11 years, as seen in the ACF plot. The trend is fairly flat, with no clear increase or decrease over time.
What can you say about the seasonal patterns? -> Nothing can be said about seasonality! The data is annual rather than monthly or quarterly, so seasonality is not observable in this context.
Can you identify any unusual years? -> The cycle peaks in the mid-1860s and mid-1880s show much larger spikes in pelts traded than other peak years.
# autoplot()
autoplot(pelt, Hare) +
ggtitle("Hare Population Over Time") +
ylab("Number of Pelts") + xlab("Year")
# gg_season()
# gg_season() does not work here because the annual data has no recognized seasonal period:
# gg_season(pelt, Hare) + ggtitle("Seasonality of Hare Population")
# gg_subseries()
gg_subseries(pelt, Hare) + ggtitle("Hare Population Subseries Plot")
# gg_lag()
gg_lag(pelt, Hare, lags = 1:12) + ggtitle("Lag Plot for Hare Population")
# ACF()
pelt |> ACF(Hare) |> autoplot() + ggtitle("ACF of Hare Population")
The graphs below help answer the prompt questions.
Can you spot any seasonality, cyclicity and trend? -> Seasonality, cyclicity and trend all vary across the different series here. The safety net series show big drops in February and slow increases starting in May each year. Concessional co-payments increase seasonally in spring and decrease in other months, while general co-payments show no clear seasonality. All series have cycles lasting around 12 years. The trend is positive for concessional co-payments; general payments look relatively steady, with a slight downward trend for the general safety net.
What can you say about the seasonal patterns? -> As stated above, the safety net series drop sharply in February and climb slowly from May each year; concessional co-payments rise in spring and fall in other months; general co-payments show no clear seasonal pattern.
Can you identify any unusual years? -> Pre-1995 data shows higher values than the post-1995 trends would suggest.
# H02 data filters differently
PBS_H02 <- PBS |> filter(ATC2 == "H02")
glimpse(PBS_H02)
## Rows: 816
## Columns: 9
## Key: Concession, Type, ATC1, ATC2 [4]
## $ Month <mth> 1991 Jul, 1991 Aug, 1991 Sep, 1991 Oct, 1991 Nov, 1991 Dec,…
## $ Concession <chr> "Concessional", "Concessional", "Concessional", "Concession…
## $ Type <chr> "Co-payments", "Co-payments", "Co-payments", "Co-payments",…
## $ ATC1 <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H",…
## $ ATC1_desc <chr> "Systemic hormonal preparations, excl. sex hormones and ins…
## $ ATC2 <chr> "H02", "H02", "H02", "H02", "H02", "H02", "H02", "H02", "H0…
## $ ATC2_desc <chr> "CORTICOSTEROIDS FOR SYSTEMIC USE", "CORTICOSTEROIDS FOR SY…
## $ Scripts <dbl> 63261, 53528, 52822, 54016, 49281, 51798, 42436, 52913, 629…
## $ Cost <dbl> 317384.0, 269891.0, 269703.0, 280418.0, 268070.0, 277139.0,…
# autoplot()
autoplot(PBS_H02, Cost) +
ggtitle("H02 Prescription Costs Over Time") +
ylab("Cost ($AUD)") +
xlab("Year")
# gg_season()
gg_season(PBS_H02, Cost) + ggtitle("Seasonality of H02 Costs")
# gg_subseries()
gg_subseries(PBS_H02, Cost) + ggtitle("H02 Cost Subseries Plot")
# gg_lag()
PBS_H02_summarized <- PBS_H02 |>
index_by(Month) |> # Group by time index
summarise(Cost = sum(Cost, na.rm = TRUE))
gg_lag(PBS_H02_summarized, Cost, lags = 1:12) +
ggtitle("Lag Plot for H02 Costs (Summed Across Groups)")
# ACF()
PBS_H02 |> ACF(Cost) |> autoplot() + ggtitle("ACF of H02 Costs")
The graphs below help answer the prompt questions.
Can you spot any seasonality, cyclicity and trend? -> This data is quite volatile. Seasonally, consumption increases in the summer months and decreases in the winter months. The cycles are not clear; there do not seem to be any regular cycles in the data. The data trends upward overall, but has some decreases around and after 2009.
What can you say about the seasonal patterns? -> Seasonality roughly follows increases in the summer months and decreases in the winter months.
Can you identify any unusual years? -> The decreases around and after 2009 are likely due to the Great Recession.
# autoplot()
autoplot(us_gasoline, Barrels) +
ggtitle("Gasoline Consumption Over Time") +
ylab("Millions of Barrels") + xlab("Year")
# gg_season()
gg_season(us_gasoline, Barrels) + ggtitle("Seasonality of Gasoline Consumption")
# gg_subseries()
gg_subseries(us_gasoline, Barrels) + ggtitle("Gasoline Consumption Subseries Plot")
# gg_lag()
gg_lag(us_gasoline, Barrels, lags = 1:12) + ggtitle("Lag Plot for Gasoline Consumption")
# ACF()
us_gasoline |> ACF(Barrels) |> autoplot() + ggtitle("ACF of Gasoline Consumption")