library(fpp3)
library(zoo) #used to convert the Quarter field in exercise 2.3
library(USgas)

Background

The purpose of this assignment was to explore the Time Series Graphics exercises from Forecasting: Principles and Practice.


2.1

Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent. Use autoplot() to plot some of the series in these data sets. What is the time interval of each series?

First we apply the help function to each series to explore their description, format, and details.

help(PBS) #monthly Medicare Australia prescription data
help(gafa_stock) #stock prices 2014-2018 for Google, Amazon, Facebook, Apple
help(vic_elec) #half-hourly electricity demand for Victoria, Australia
help(pelt) #pelt trading records of the Hudson Bay Company, 1845-1935

We then use autoplot() to plot the series in each data set and interpret time intervals:

#autoplot(PBS, Concession) # did not work --> list of variables
#autoplot(PBS, Type) # did not work --> list of variables
#autoplot(PBS, ATC1) # did not work --> list of variables
#autoplot(PBS, ATC2) # did not work --> list of variables
autoplot(PBS, Scripts)

#autoplot(PBS, Cost)

#PBS$Month

When we consider the PBS data set, autoplot() cannot plot the key columns (Concession, Type, ATC1, ATC2) because they are categorical key variables rather than measured series. Plotting a measured variable such as Scripts does work, and exploring the Month index of the dataset shows that the time interval is monthly.
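Rather than reading the interval off each plot, we can also ask the tsibble itself; as a quick check, tsibble's interval() helper reports it directly for all four data sets:

interval(PBS)        # 1M  --> monthly
interval(gafa_stock) # !   --> irregular (daily trading days only)
interval(vic_elec)   # 30m --> half-hourly
interval(pelt)       # 1Y  --> annual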

autoplot(gafa_stock, Open) # 2014 - 2018, daily (trading days only)

#autoplot(gafa_stock, High)
#autoplot(gafa_stock, Low)
#autoplot(gafa_stock, Close)
#autoplot(gafa_stock, Adj_Close)
#autoplot(gafa_stock, Volume)

Applying autoplot() to the gafa_stock dataset, we see daily (trading-day) stock prices plotted from 2014 to 2018. There are clear upward trends in the Open price for each of these stocks, with Amazon overtaking Google from late 2017 onward.

autoplot(vic_elec, Demand) # 2012 - 2015, half-hourly

#autoplot(vic_elec, Temperature) # 2012 - 2015, half-hourly
#autoplot(vic_elec, Holiday) # 2012 - 2015, half-hourly; Holiday is TRUE/FALSE so the plot is not informative

Applying autoplot() to the vic_elec dataset, we see Demand data from 2012 to 2015 in 30-minute increments with a clear seasonal / sinusoidal wave pattern in electricity demand.

autoplot(pelt, Hare)

#autoplot(pelt, Lynx)

Applying autoplot() to the pelt dataset, we see Hare pelt counts plotted from 1845 through 1935 in one-year increments, with clear cyclic wave patterns whose exact period would require a closer look to pin down. The largest peak appears to fall in the 1860s, around the years of the US Civil War.

2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

gafa_stock %>%
    group_by(Symbol) %>%
    filter(Close == max(Close)) 
## # A tsibble: 4 x 8 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800
## 2 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100
## 3 FB     2018-07-25  216.  219.  214.  218.      218. 58954200
## 4 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600

From above we see that each of the four stock’s peak closing prices occurred in 2018.
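If only the peak dates themselves are of interest, the same pipeline can be trimmed with select(); a small follow-up sketch:

gafa_stock %>%
    group_by(Symbol) %>%
    filter(Close == max(Close)) %>%
    select(Symbol, Date, Close)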

2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

Read the data into R, convert the data to time series, construct time series plots of each of the three series, and then check what happens when you don’t include facet_grid() using the provided script.

To start we read in the provided csv and convert the data to time series. More specifically, we convert the Quarter field to a proper time index and then convert the data to a tsibble. Because the provided conversion script did not work on this file, I first converted the Quarter field using zoo’s as.yearmon() and then proceeded with the provided script:

#Read the data in
tute1 <- readr::read_csv("tute1.csv")
#view(tute1)

#Convert the data to time series
##the text's script DID NOT WORK so I adapted the Quarter conversion
tute1$Quarter <- as.yearmon(as.Date( tute1$Quarter, "%m/%d/%Y"))
#tute1$Quarter <- as.yearqtr(as.Date( tute1$Quarter, "%m/%d/%Y" ))
#tute1

mytimeseries <- tute1 %>%
  mutate(Quarter = yearmonth(Quarter)) %>%
  as_tsibble(index = Quarter)
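Assuming the Quarter column holds the same m/d/Y date strings that the zoo-based conversion above relies on, an alternative sketch is to convert straight to a quarterly index with yearquarter(), skipping the zoo step:

#alternative conversion (sketch): straight to a quarterly index
mytimeseries_alt <- readr::read_csv("tute1.csv") %>%
  mutate(Quarter = yearquarter(as.Date(Quarter, "%m/%d/%Y"))) %>%
  as_tsibble(index = Quarter)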

After verifying that the data was in our desired form, we construct our time series plots of the three series, first on separate plots and then on the same plot:

#Construct time series plots of each of the three series
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

#run the script again without facet_grid
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

What we observe when we incorporate facet_grid() is that each series is plotted in its own panel with a y-axis scale suited to its values, so we gain more insight than when all of the series share a single plot with a single scale.

2.4

The USgas package contains data on the demand for natural gas in the US.

  • Install the USgas package.
  • Create a tsibble from us_total with year as the index and state as the key.
  • Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

Upon installing the USgas package, we proceed to create a tsibble with year as the index and state as the key:

#install `USgas` package and view input data
#view(us_total)

#create with year as index and state as key
mytimeseries2 <- us_total %>%
  mutate(year = year(as.Date(as.character(year), format = "%Y"))) %>%
  as_tsibble(index = year, key = state)

#mytimeseries2 #verify conversion

#filter for New England area states
ne_gas_consumption <- mytimeseries2 %>%
    filter(state %in% c('Connecticut', 'Maine', 'Massachusetts',
                        'New Hampshire', 'Rhode Island', 'Vermont'))

#ne_gas_consumption #verify

We then proceed to plot the annual natural gas consumption by state for the New England area, using our filtered tsibble and the ggplot() function:

#plot annual natural gas consumption by state for the New England area
NE_plot <- ggplot(ne_gas_consumption, aes(x = year, y = y, colour = state)) +
  geom_line()

print(NE_plot + labs(title = "Annual Gas Consumption in New England", y = "Gas Consumption", x = "Year"))
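The same figure can also be produced with autoplot() on the filtered tsibble, which colours the series by the key (state) automatically; a minimal sketch:

autoplot(ne_gas_consumption, y) +
  labs(title = "Annual Gas Consumption in New England", y = "Gas Consumption", x = "Year")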

There are a number of noteworthy observations from the output:

  • Massachusetts has the highest gas consumption and Vermont the lowest, which makes sense when we consider the difference in population. There may also be differences in regulations and energy sources between states that come into play.
  • Connecticut had the greatest increase in consumption from before 2000 until ~2018 while Rhode Island appears to have had the largest decrease.
  • Maine’s consumption appears erratic from the late 1990s until ~2005 and then steadies out. This might be due to changes in energy sources, regulations, or other policy or infrastructure developments within the state.

2.5

  • Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
  • Create a tsibble which is identical to the tourism tsibble from the tsibble package.
  • Find what combination of Region and Purpose had the maximum number of overnight trips on average.
  • Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

We read in the Excel file using read_excel() and then verify the entries within the tourism tsibble to note the differences between it and our tourism_xlsx table:

#read in the Excel file using read_excel()
tourism_xlsx <- readxl::read_excel("tourism.xlsx")

#create a tsibble identical to the tourism tsibble
#tourism
#tourism_xlsx

They appear to be identical aside from the format and type of the Quarter column.

I converted the data to time series by first converting the Quarter column to the form expected by the yearquarter() function and then creating a tsibble object with three keys and one index:

#Convert the data to time series
tourism_xlsx$Quarter <- yearquarter(as.Date(tourism_xlsx$Quarter))
tourism_tsibble <- tourism_xlsx %>%
    as_tsibble(key = c(Region, State, Purpose), index = Quarter)
#tourism_tsibble

The result is identical to the provided tourism tsibble.
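As a quick programmatic check (a sketch; all.equal() may still flag minor attribute or column-order differences even when the underlying data match):

all.equal(tourism_tsibble, tourism)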

From this point, we proceed to find which combination of Region and Purpose had the maximum number of overnight trips on average.

tourism_xlsx %>%
    select(Region, Purpose, Trips) %>%
    group_by(Region, Purpose) %>%
    summarise(Trips = mean(Trips)) %>% #average Trips per Region-Purpose pair
    filter(Trips == max(Trips)) -> max_avg_trips
#summarise() drops the Purpose grouping, so the filter keeps the top Purpose within each Region

#sort descending so the overall maximum appears first
max_avg_trips[order(-max_avg_trips$Trips),]
## # A tibble: 76 x 3
## # Groups:   Region [76]
##    Region                 Purpose  Trips
##    <chr>                  <chr>    <dbl>
##  1 Sydney                 Visiting  747.
##  2 Melbourne              Visiting  619.
##  3 North Coast NSW        Holiday   588.
##  4 Gold Coast             Holiday   528.
##  5 South Coast            Holiday   495.
##  6 Brisbane               Visiting  493.
##  7 Sunshine Coast         Holiday   436.
##  8 Hunter                 Holiday   319.
##  9 Australia's South West Holiday   309.
## 10 Experience Perth       Visiting  291.
## # ... with 66 more rows

We see that Sydney-Visiting had the maximum number of overnight trips on average.
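To return only that single top combination rather than the per-Region maxima, a small sketch using summarise() with dropped groups and slice_max() (dplyr >= 1.0):

tourism_xlsx %>%
    group_by(Region, Purpose) %>%
    summarise(Trips = mean(Trips), .groups = "drop") %>%
    slice_max(Trips, n = 1)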

Next we create a new tsibble which combines Purpose and Region and just has total trips by State:

#Combine Purpose and Region
##group by Region, Purpose and then concatenate these columns
grouped_tourism <- tourism_xlsx %>% group_by(Region, Purpose)
grouped_tourism$Region.Purpose <- paste(grouped_tourism$Region, grouped_tourism$Purpose) 

simple_tourism <- grouped_tourism[-c(2,4)] #drop Region, Purpose
simple_tourism <- simple_tourism[,c(1,2,4,3)] #reorder columns
#simple_tourism #verify

#calculate total trips by State
simple_tourism %>%
    group_by(Quarter,State) %>%
    mutate(Trips = sum(Trips)) -> trips_by_state

#Quarter is already a yearquarter from the conversion above, so this re-conversion is a no-op kept for safety
trips_by_state$Quarter <- yearquarter(as.Date(tourism_xlsx$Quarter))

#create new tsibble
trips_by_state_tsibble <- trips_by_state %>%
    as_tsibble(key = c(State, Region.Purpose), index = Quarter)

trips_by_state_tsibble
## # A tsibble: 24,320 x 4 [1Q]
## # Key:       State, Region.Purpose [304]
## # Groups:    State @ Quarter [640]
##    Quarter State Region.Purpose    Trips
##      <qtr> <chr> <chr>             <dbl>
##  1 1998 Q1 ACT   Canberra Business  551.
##  2 1998 Q2 ACT   Canberra Business  416.
##  3 1998 Q3 ACT   Canberra Business  436.
##  4 1998 Q4 ACT   Canberra Business  450.
##  5 1999 Q1 ACT   Canberra Business  379.
##  6 1999 Q2 ACT   Canberra Business  558.
##  7 1999 Q3 ACT   Canberra Business  449.
##  8 1999 Q4 ACT   Canberra Business  595.
##  9 2000 Q1 ACT   Canberra Business  600.
## 10 2000 Q2 ACT   Canberra Business  557.
## # ... with 24,310 more rows

I was a little unsure regarding the exact desired output, but for the sake of maintaining a tsibble (time series table) output, I:

  • combined Purpose and Region,
  • grouped by Region and Purpose and then concatenated these fields,
  • dropped Region and Purpose from further consideration,
  • re-ordered the fields in a more sensible manner,
  • calculated total trips by State, and then
  • created the tsibble shown above (an alternative, more aggregated reading of the task is sketched below).
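Alternatively, if the task is read as wanting quarterly totals per State with Region and Purpose aggregated away entirely, tsibble’s summarise() keeps the Quarter index, so the following sketch yields one row per State per Quarter:

#alternative (sketch): total trips by State, aggregating over Region and Purpose
state_trips <- tourism_tsibble %>%
    group_by(State) %>%
    summarise(Trips = sum(Trips))
#state_trips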

2.8

Monthly Australian retail data is provided in aus_retail.

  • Select one of the time series and set your own seed value.

  • Explore your chosen retail time series using the following functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()

  • Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

set.seed(333)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries #Series ID = A3349773T, 441 rows
## # A tsibble: 441 x 5 [1M]
## # Key:       State, Industry [1]
##    State                   Industry                `Series ID`    Month Turnover
##    <chr>                   <chr>                   <chr>          <mth>    <dbl>
##  1 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Apr      2.3
##  2 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 May      2.5
##  3 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Jun      2.3
##  4 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Jul      2.6
##  5 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Aug      2.6
##  6 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Sep      2.6
##  7 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Oct      2.7
##  8 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Nov      3  
##  9 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1982 Dec      4.9
## 10 Australian Capital Ter~ Newspaper and book ret~ A3349773T   1983 Jan      2.5
## # ... with 431 more rows
autoplot(myseries)

For the A3349773T series, autoplot() produces a monthly plot of Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, drawn as one continuous line. This time plot lets us see how Turnover evolves over the years under consideration.

What we gather from the plot is that Turnover climbed consistently through the nineties, dipped shortly before the year 2000, peaked in or near 2000, stayed relatively high through the first decade of the 2000s (reaching a second peak near 2008), and then dropped again shortly after 2010.

The peaks may be tied to the dot-com bubble (ca. 2000) and the mortgage crisis (ca. 2008) in the US; in such an interconnected international market, with the US being so influential as one of the leading consumer economies, the ripples were felt in Australia.

gg_season(myseries)

For the A3349773T series, gg_season() produces a seasonal plot of monthly Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, with each year drawn in its own hue across the months of the year. This lets us see the ties between Turnover and the month of the year more clearly.

What we gather from the plot is that there appears to be a tie between higher Turnover and the months of February, March, July and August, and that December is the month with the highest Turnover.
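A polar variant of the seasonal plot (gg_season() takes a polar argument) can make the December spike easier to see; a quick sketch:

gg_season(myseries, Turnover, polar = TRUE)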

gg_subseries(myseries)

For the A3349773T series, gg_subseries() produces a plot of monthly Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, with one panel per month. It is a seasonal subseries plot that facets the time series by each season of the seasonal period and marks each month’s mean with a horizontal line.

From the plot, we can compare Turnover month by month within each year as well as the average Turnover between months across all years under consideration:

  • December has the highest average Turnover.
  • Each panel highlights a climb in Turnover from 1990 to 2000, relatively high Turnover maintained from 2000 to 2010, and a drop in Turnover after 2010.

gg_lag(myseries)

For the A3349773T series, gg_lag() plots the series against lagged versions of itself, with one panel per lag and the observations coloured by month (the seasonal period) to highlight how the seasons are correlated.

It’s difficult to draw much of value from this plot, but we do see that certain hues / months sit higher on the Turnover scale, suggesting that those months (e.g. December) have higher relative Turnover than the rest.
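Using the point geometry and more lags can make the lag plot a little easier to read; a sketch:

gg_lag(myseries, Turnover, geom = "point", lags = 1:12)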

myseries %>%
    ACF(Turnover) %>%
    autoplot()

For the A3349773T series, ACF() %>% autoplot() plots the autocorrelation function: the correlation of the series with lagged copies of itself, and how that correlation changes as the lag increases.

From the plot, we gather that the autocorrelations at lags 1, 2, 3, 5, and 12 are the highest. The large correlations at the short lags reflect the trend in the series, while the spike at lag 12 points to yearly seasonality: Turnover in a given month is strongly related to Turnover in the same month one year earlier.
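Extending the number of lags shown (ACF() accepts a lag_max argument) makes the yearly pattern easier to confirm, since peaks recurring every 12 lags would indicate annual seasonality. A sketch:

myseries %>%
    ACF(Turnover, lag_max = 48) %>%
    autoplot()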


As a quick summary of three main points from the above plots:

  • December has the highest average Turnover.
  • There’s a relationship between higher Turnover and the months of February, March, July and August. December is the month with the highest Turnover.
  • Turnover climbed from 1990 to 2000, stayed relatively high from 2000 to 2010, and dropped after 2010.

In short, there are clear seasonal patterns and trends in the Australian Turnover data.