DATA624 - Homework 1

library(fpp3)
library(cowplot)
library(knitr)
library(httr)

Exercise 2.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

data(aus_production, pelt, gafa_stock, vic_elec)

Use ? (or help()) to find out about the data in each series.

#To prevent the markdown file from opening Web pages for the help files, eval is set to false here.
?aus_production
?pelt
?gafa_stock
?vic_elec

What is the time interval of each series?

The data in the aus_production data set was measured quarterly (although the help file seems to mistakenly call the data set a half-hourly tsibble). The data in the pelt data set was measured annually. The data in the gafa_stock data set was measured daily (excluding days the stock market wasn’t open). The data in the vic_elec data set was measured half-hourly.

Use autoplot() to produce a time plot of each series.

p1 <- autoplot(aus_production, Bricks)
p2 <- autoplot(pelt, Lynx)
p3 <- autoplot(gafa_stock, Close)
p4 <- autoplot(vic_elec, Demand)
plot_grid(p1, p2, p3, p4)

For the last plot, modify the axis labels and title.

p4 <- p4 +
    labs(title = "Electricity Demand Over Time for Victoria, Australia",
         subtitle = "Source: Australian Energy Market Operator",
         x = "Half-Hour",
         y = "Total Demand in MWh")
p4

Exercise 2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

gafa_stock_max <- gafa_stock |>
    group_by(Symbol) |>
    filter(Close == max(Close))
kable(gafa_stock_max, format="simple")

Symbol	Date	Open	High	Low	Close	Adj_Close	Volume
AAPL	2018-10-03	230.05	233.470	229.78	232.07	230.2755	28654800
AMZN	2018-09-04	2026.50	2050.500	2013.00	2039.51	2039.5100	5721100
FB	2018-07-25	215.72	218.620	214.27	217.50	217.5000	58954200
GOOG	2018-07-26	1251.00	1269.771	1249.02	1268.33	1268.3300	2405600

Exercise 2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

You can read the data into R with the following script:

fn1 <- "https://raw.githubusercontent.com/geedoubledee/data624_homework1/main/tute1.csv"
tute1 <- readr::read_csv(fn1)

Convert the data to time series

mytimeseries <- tute1 |>
    mutate(Quarter = yearquarter(Quarter)) |>
    as_tsibble(index = Quarter)

Construct time series plots of each of the three series

mytimeseries |>
    pivot_longer(-Quarter) |>
    ggplot(aes(x = Quarter, y = value, colour = name)) +
    geom_line() +
    facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

mytimeseries |>
    pivot_longer(-Quarter) |>
    ggplot(aes(x = Quarter, y = value, colour = name)) +
    geom_line()

When you don’t include facet_grid(), the AdBudget, GDP, and Sales columns of data are plotted together rather than separately. Since none of the value ranges of these columns overlap, it is visually a little weird to plot them together. You can see more detailed information for each category when you plot them separately, and it doesn’t really take up more room than plotting them together anyway, so I think plotting them separately is preferred here.

Exercise 2.4

The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.

library(USgas)

Create a tsibble from us_total with year as the index and state as the key.

data(us_total)
us_total_ts <- us_total |>
    as_tsibble(key = state, index = year)

Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

new_england <- c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")
us_total_ts |>
    filter(state %in% new_england) |>
    mutate(y = y / 1000) |>
    autoplot(y) -> p5
p5 <- p5 +
    labs(title = "New England Total Natural Gas Consumption",
         subtitle = "Source: https://www.eia.gov/",
         x = "Year",
         y = "Billion Cubic Feet")
p5

Exercise 2.5

Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

fn2 <- "https://github.com/geedoubledee/data624_homework1/raw/main/tourism.xlsx"
temp <- tempfile(fileext = ".xlsx")
req <- GET(fn2, authenticate(Sys.getenv("GITHUB_PAT"), ""),
           write_disk(path = temp))
tourism_duplicate <- readxl::read_excel(temp)
unlink(temp)

Create a tsibble which is identical to the tourism tsibble from the tsibble package.

data(tourism)
keys <- c("Region", "State", "Purpose")
tourism_duplicate <- tourism_duplicate |>
    mutate(Quarter = yearquarter(Quarter)) |>
    as_tsibble(key = all_of(keys), index = Quarter)

After creating the new tsibble object, we check to see if it is identical to the original.

identical(tourism, tourism_duplicate)

## [1] FALSE

The new tsibble object we created is not considered identical to the original, and we can narrow down why.

check <- as.matrix(tourism == tourism_duplicate)
tourism_duplicate[!check]

## [1]   7.545404 450.050853  37.947745

tourism[!check]

## [1]   7.545404 450.050853  37.947745

The same three values in the Trips column of both objects look identical, but are not considered so computationally. We can confirm the Trips column in both objects meets the all.equal standard for double precision floating numbers, however.

all.equal(tourism$Trips, tourism_duplicate$Trips)

## [1] TRUE

Thus, we have functionally identical objects.

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

tourism_duplicate_max_avg <- as_tibble(tourism_duplicate) |>
    group_by(Region, Purpose) |>
    summarize(Avg_Trips = mean(Trips)) |>
    ungroup() |>
    filter(Avg_Trips == max(Avg_Trips))
kable(tourism_duplicate_max_avg, format="simple")

Region	Purpose	Avg_Trips
Sydney	Visiting	747.27

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourism_duplicate_by_state <- as_tibble(tourism_duplicate) |>
    select(-Region, -Purpose) |>
    group_by(Quarter, State) |>
    summarize(Trips = sum(Trips)) |>
    as_tsibble(index = Quarter, key = State)
kable(head(tourism_duplicate_by_state, n = 10), format="simple")

Quarter	State	Trips
1998 Q1	ACT	551.0019
1998 Q2	ACT	416.0256
1998 Q3	ACT	436.0290
1998 Q4	ACT	449.7984
1999 Q1	ACT	378.5728
1999 Q2	ACT	558.1781
1999 Q3	ACT	448.9012
1999 Q4	ACT	594.8254
2000 Q1	ACT	599.6685
2000 Q2	ACT	557.1351

Exercise 2.8

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), and ACF() to explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Can you spot any seasonality, cyclicity and trend?
What do you learn about the series?
What can you say about the seasonal patterns?
Can you identify any unusual years?

data(us_employment)
data(PBS)
data(us_gasoline)
total_priv <- us_employment |>
    filter(Title == "Total Private") 
p6 <- total_priv |>
    autoplot(Employed)
p7 <- total_priv |>
    gg_season(Employed, period = "year")
p6

p7

The “Total Private” Employed data from the us_employment dataset exhibits a general upward trend. Seasonally, the lowest value for the year is generally recorded in January. Many companies use the calendar year or a close approximation to it as their fiscal year, so it makes sense that headcount decisions for the next year might finalize in December, explaining why there are always fewer people employed at the beginning of one year than there were at the end of the previous.

After the January low point, the value tends to increase until June. Then the value usually levels off until the end of the year. Positions are probably being filled more in the first half of the year than in the second half. In my experience, company executives usually feel optimistic about the year in Q1 and Q2 and are less optimistic by Q3 and Q4. Hiring pauses are normal in the second half of the year, when there’s not a lot of time left to hit sales goals and so forth, so keeping costs down becomes more important, and that often means waiting until Q1 or Q2 of the following year to start hiring again.

There were 5-year cycles between 1940-1960 where values would level off, then dip down after a growth period, eliminating about half the growth that occurred up until then. The growth/expansion cycles after that are of different lengths and varying impact.

There is an outlier year, 1945, where there is a huge drop in the value between August and September.

p8 <- autoplot(aus_production, Bricks)
p9 <- gg_season(aus_production, Bricks, period = "year")
p10 <- gg_subseries(aus_production, Bricks)
p11 <- gg_lag(aus_production, Bricks, geom = "point")
p8

p9

p10

p11

In the Bricks data from the aus_production dataset, we see that there was an upward trend until the early 1980s, at which point a downward trend begins.

Seasonally, Q3 is generally the busiest period, and Q1 and Q4 are generally the least busy. There are some outlier years in which Q2 is busier than Q3:

remove <- c("Beer", "Tobacco", "Cement", "Electricity", "Gas")
aus_production_tbl <- aus_production |>
    as_tibble() |>
    separate_wider_delim(Quarter, delim = " ", names = c("Year", "Quarter")) |>
    select(-all_of(remove)) |>
    pivot_wider(names_from = Quarter, values_from = Bricks) |>
    filter(Q2 > Q3)
kable(aus_production_tbl, format = "simple")

Year	Q1	Q2	Q3	Q4
1974	478	526	518	417
1978	421	487	470	482
1982	501	560	512	412
1989	510	571	556	509
1990	458	510	494	460
1991	372	436	422	423
1995	430	457	417	370
2000	416	447	421	379

Cyclically, beginning in the 1970s, there are big dips around every five years or so that eliminate much of the growth gained in that five-yearish period. The dips are big enough such that since the general downward trend began in the early 1980s, the growth peaks after the dips never surpass previous growth peaks, even when the dips are not as big as previous dips were.

p12 <- autoplot(pelt, Hare)
p13 <- gg_lag(pelt, Hare, geom = "point")
p14 <- pelt |>
    ACF(Hare) |>
    autoplot()
p12

p13

p14

The Hare data from the pelt dataset does not exhibit an upward or downward trend. There appear to be 10-year cycles of rising to a peak and falling to a valley. 1863/1864 and 1885/1886 were exceptional peaks. Plotting the ACF makes it more clear that peaks are 10 years apart, and valleys are too.

ho2 <- PBS |>
    filter(ATC2 == "H02")
p15 <- ho2 |>
    autoplot(Cost)
p16 <- ho2 |>
    gg_season(Cost, period = "year")
p15

p16

In the PBS dataset, I think we’re seeing the concessional safety net and the general safety net both have payment maximums that patients can hit, typically reducing cost to almost nothing between February/March to May. Costs then begin to rise again around June until November, when they rise drastically through January, after which they decline drastically. Concessional copayments on the other hand are lower at the end and beginning of the year and larger in the middle. I can’t see a pattern with general copayments, but they’re much lower than their counterparts. General safety net costs trended downward and have leveled off. Concessional copayments and safety net costs have trended upward.

p17 <- autoplot(us_gasoline, Barrels)
p18 <- gg_season(us_gasoline, Barrels, period = "year")
p17

p18

The us_gasoline dataset exhibits an upward trend. I can’t discern any other patterns unfortunately despite looking at a few different plots to see if they highlighted anything more interesting (that I could see and understand anyway).