DATA 624 Homework 1

library(fpp3)

## Warning: package 'fpp3' was built under R version 3.6.3

## -- Attaching packages ------------------------------------------------------ fpp3 0.4.0 --

## v tibble      3.1.1     v tsibble     1.1.1
## v dplyr       1.0.6     v tsibbledata 0.4.0
## v tidyr       1.1.3     v feasts      0.2.2
## v lubridate   1.7.4     v fable       0.3.0
## v ggplot2     3.3.5

## Warning: package 'tibble' was built under R version 3.6.3

## Warning: package 'dplyr' was built under R version 3.6.3

## Warning: package 'tidyr' was built under R version 3.6.3

## Warning: package 'fable' was built under R version 3.6.3

## -- Conflicts ----------------------------------------------------------- fpp3_conflicts --
## x lubridate::date()       masks base::date()
## x dplyr::filter()         masks stats::filter()
## x tsibble::intersect()    masks base::intersect()
## x tsibble::interval()     masks lubridate::interval()
## x dplyr::lag()            masks stats::lag()
## x tsibble::new_interval() masks lubridate::new_interval()
## x tsibble::setdiff()      masks base::setdiff()
## x tsibble::union()        masks base::union()

library(ggplot2)

Exercise 2.10.1

Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.

gafa_stock: Historical stock prices from 2014-2018 for Google, Amazon, Facebook and Apple. All prices are in $USD.

PBS: Monthly Medicare Australia prescription data from July 1991 to June 2008. It contains the total number of scripts and cost of the scripts in $AUD.

vic_elec: Half-hourly electricity demand data for Victoria, Australia from 2012-2014. It includes total electricity demand in MW, the temperature in Melbourne, and an indicator for whether a given day is a public holiday.

a. Use autoplot() to plot some of the series in these data sets.

As the output below shows, autoplot() cannot determine a regular interval for gafa_stock. The prices are recorded on trading days only, so the data have an irregular interval.

autoplot(gafa_stock)

## Plot variable not specified, automatically selected `.vars = Open`

To use autoplot() with the PBS data, I borrowed some of the code from the book to subset the data first; without subsetting, the plot was not what we expected. This could be due to the large number of categories and medicines in this data set.

PBS %>%
  filter(ATC2 == "A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  mutate(Cost = TotalC / 1e6) -> a10
autoplot(a10)

## Plot variable not specified, automatically selected `.vars = TotalC`

The plot shows the seasonal pattern of electricity demand in Victoria, Australia. Demand clearly increases at the beginning of each year, which is summer in Australia.

autoplot(vic_elec)

## Plot variable not specified, automatically selected `.vars = Demand`

b. What is the time interval of each series?

gafa_stock: Irregular interval. PBS: Monthly. vic_elec: Half-hourly.

Exercise 2.10.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

For this exercise we must also use group_by() to group the data by Symbol, so that the peak closing price is found separately for each of the four stocks. Using filter() alone, without group_by(), would return only the day on which the highest closing price in the whole data set occurred, regardless of stock.

gafa_stock %>% group_by(Symbol) %>% filter(Close == max(Close))

## # A tsibble: 4 x 8 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800
## 2 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100
## 3 FB     2018-07-25  216.  219.  214.  218.      218. 58954200
## 4 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600
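The effect of group_by() on filter() can be seen on a small made-up table. This is only a sketch: the symbols and prices below are hypothetical and are not taken from gafa_stock, and a plain tibble is used instead of a tsibble.

```r
library(dplyr)

# Made-up closing prices for two hypothetical symbols
prices <- tibble(
  Symbol = c("AAA", "AAA", "BBB", "BBB"),
  Date   = as.Date(c("2018-01-02", "2018-01-03", "2018-01-02", "2018-01-03")),
  Close  = c(10, 12, 105, 101)
)

# Without grouping, max(Close) is taken over the whole table: one row in total
overall_peak <- prices %>%
  filter(Close == max(Close))

# With grouping, max(Close) is evaluated within each Symbol: one row per symbol
peak_by_symbol <- prices %>%
  group_by(Symbol) %>%
  filter(Close == max(Close)) %>%
  ungroup()

overall_peak
peak_by_symbol
```

If two days tie for the peak within a symbol, both rows are returned, since filter() keeps every row that satisfies the condition.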

Exercise 2.10.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

a. You can read the data into R with the following script:

tute1 <- readr::read_csv("tute1.csv")

## Parsed with column specification:
## cols(
##   Quarter = col_date(format = ""),
##   Sales = col_double(),
##   AdBudget = col_double(),
##   GDP = col_double()
## )

View(tute1)

b. Convert the data to time series

mytimeseries <- tute1 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter)

c. Construct time series plots of each of the three series

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

Without facet_grid(), all three time series are drawn in the same panel on a common scale, which makes the variation within each series harder to see. In the faceted plot the three series appear to share a very similar seasonal pattern, whereas in the single-panel plot they look quite different.

Exercise 2.10.4

The USgas package contains data on the demand for natural gas in the US.

a. Install the USgas package.

library(USgas)

b. Create a tsibble from us_total with year as the index and state as the key.

I used the examples from the book to construct the tsibble from us_total and made year the index and state the key.

mytimeseries2 <- us_total %>%
  as_tsibble(key = state, index = year)

c. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

I used some of the code examples from the book to filter the data for the requested states and plot them in separate panels with different scales. I also split the states into two groups (two ggplot() calls) because, even with facet_grid(), the plotted lines looked misleading and did not represent the data accurately. Perhaps the panels were too small when all six states were fitted into one facet_grid() display.

We can observe that natural gas consumption in Connecticut and Massachusetts has been growing more or less steadily. Maine reached its peak consumption in the early 2000s and has been slowly decreasing since, and New Hampshire shows a similar trend after reaching its peak around 2004. Vermont had more or less constant consumption until 2010, when it began an upward trend. Rhode Island has had some ups and downs in its gas consumption, with its peak in the late 1990s, its lowest point in 2005, and further ups and downs in the subsequent years.

mytimeseries2 %>%
  filter(state %in% c('Maine', 'Vermont', 'New Hampshire')) %>%
  ggplot(aes(x = year, y = y, colour = state)) +
  geom_line() +
  facet_grid(state ~ ., scales = "free_y")

mytimeseries2 %>%
  filter(state %in% c('Massachusetts', 'Connecticut', 'Rhode Island')) %>%
  ggplot(aes(x = year, y = y, colour = state)) +
  geom_line() +
  facet_grid(state ~ ., scales = "free_y")
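A note on filtering by several values: comparing a column to a vector with == recycles the vector element by element, whereas %in% tests membership in the set. The base R sketch below uses a made-up state vector (not us_total) to show how the two differ; with ==, rows can be dropped silently, which may make plotted lines look wrong.

```r
# Made-up column of state names, repeated as it would be across years
state  <- rep(c("Maine", "Vermont", "New Hampshire", "Texas"), times = 3)
wanted <- c("Maine", "Vermont", "New Hampshire")

# `==` recycles `wanted` along `state`, so a row matches only if the
# state happens to line up with the recycled position
matches_eq <- state == wanted

# `%in%` asks, for every row, whether the state is one of `wanted`
matches_in <- state %in% wanted

sum(matches_eq)  # rows kept by ==
sum(matches_in)  # rows kept by %in%
```

Here == keeps only 3 of the 9 rows that belong to the wanted states, while %in% keeps all 9.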

Exercise 2.10.5

a. Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

tourism <- readxl::read_excel("tourism.xlsx")

b. Create a tsibble which is identical to the tourism tsibble from the tsibble package.

I first used the help function to explore what the tourism tsibble looks like, and then I followed the book's examples on creating tsibbles, identifying the “index” column and the “key” columns.

tour_tsibble <- tourism %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(key = c(Region, State, Purpose), index = Quarter)
tour_tsibble

## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ... with 24,310 more rows

c. Find what combination of Region and Purpose had the maximum number of overnight trips on average.

For this exercise I built a sequence of operations with the pipe operator. First, group_by() and summarise() compute the mean of Trips by Region and Purpose. Then ungroup() is applied before filter(), so that the maximum is taken over the whole table and a single row is returned; without ungroup(), filter() would return the maximum for every Region and Purpose combination. Note that summarise() on a tsibble keeps the Quarter index, so the mean is taken within each quarter and the row below is the largest single-quarter value rather than an average over time.

tour_tsibble %>% group_by(Region, Purpose) %>%
 summarise(Trips = mean(Trips)) %>%
 ungroup() %>%
 filter(Trips == max(Trips))

## # A tsibble: 1 x 4 [1Q]
## # Key:       Region, Purpose [1]
##   Region    Purpose  Quarter Trips
##   <chr>     <chr>      <qtr> <dbl>
## 1 Melbourne Visiting 2017 Q4  985.
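One way to average over all quarters is to drop the tsibble structure with as_tibble() before grouping. The sketch below uses a small made-up tsibble (the regions and trip counts are hypothetical) to contrast the two behaviours; applying the same pattern to tour_tsibble is left as an assumption and no result is claimed for it here.

```r
library(dplyr)
library(tsibble)

# Made-up data: two regions, one purpose, two quarters each
toy <- tibble(
  Quarter = rep(yearquarter(c("2020 Q1", "2020 Q2")), times = 2),
  Region  = rep(c("North", "South"), each = 2),
  Purpose = "Holiday",
  Trips   = c(10, 100, 60, 70)
) %>%
  as_tsibble(key = c(Region, Purpose), index = Quarter)

# On a tsibble, summarise() keeps the index: one row per Region, Purpose and Quarter
per_quarter <- toy %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = mean(Trips))

# On a plain tibble, the mean is taken over all quarters within each group
overall <- toy %>%
  as_tibble() %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = mean(Trips), .groups = "drop") %>%
  filter(Trips == max(Trips))

per_quarter
overall
```

In the toy data the largest single quarter belongs to North (100 trips), but the highest average belongs to South (65 trips), so the two approaches can give different answers.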

d. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

In this exercise I understood that we only needed the total number of trips by State, which implies that Purposes and Regions are already combined when calculating this number. Thus, I used group_by() on State and then summarise() to sum all the trips, and saved the result as a new tsibble under a new name. The result is the total number of trips by State and by the index (Quarter).

new_tour_tsibble <- tour_tsibble %>%
 group_by(State) %>% 
 summarise(Trips = sum(Trips))
new_tour_tsibble

## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
##    State Quarter Trips
##    <chr>   <qtr> <dbl>
##  1 ACT   1998 Q1  551.
##  2 ACT   1998 Q2  416.
##  3 ACT   1998 Q3  436.
##  4 ACT   1998 Q4  450.
##  5 ACT   1999 Q1  379.
##  6 ACT   1999 Q2  558.
##  7 ACT   1999 Q3  449.
##  8 ACT   1999 Q4  595.
##  9 ACT   2000 Q1  600.
## 10 ACT   2000 Q2  557.
## # ... with 630 more rows

Exercise 2.10.8

Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):

set.seed(123)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Explore your chosen retail time series using the following functions:

autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()

autoplot(myseries, Turnover)

gg_season(myseries, Turnover)

gg_subseries(myseries, Turnover)

gg_lag(myseries, Turnover, geom = 'point')

myseries %>% ACF(Turnover) %>% autoplot()

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

We can clearly see seasonality in the autoplot() graph above. A similar pattern repeats each year: turnover peaks at the end of the year and then drops back down. This could be due to what we discussed in class, where retailers make the majority of their sales during November and December (the holiday season, at least in this part of the world).

Seasonality is also evident in the gg_season() and gg_subseries() plots.

I can’t say that I see a cycle in the data, but there is a clear upward trend in turnover. The lag plots show a strong positive relationship, and the autocorrelation plot shows both trend and seasonality. As our textbook explains with respect to autocorrelation, “The slow decrease in the ACF as the lags increase is due to the trend, while the ‘scalloped’ shape is due to the seasonality.”
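The textbook's description can be illustrated on a synthetic series. The sketch below is not the retail data: it builds a made-up monthly series from a linear trend plus a 12-month sine wave, and uses base R's acf() as a stand-in for the ACF() function from feasts.

```r
# Synthetic monthly series: linear trend plus a 12-month seasonal wave
month_index <- 1:120
x <- 0.05 * month_index + 2 * sin(2 * pi * month_index / 12)

# Sample autocorrelations for lags 0 to 24; r[k + 1] holds lag k
r <- as.numeric(acf(x, lag.max = 24, plot = FALSE)$acf)

# Compare the seasonal lags (12, 24) with the half-season lags (6, 18)
round(r[c(7, 13, 19, 25)], 2)
```

The autocorrelations at lags 12 and 24 are higher than those at lags 6 and 18, which gives the scalloped shape, and the value at lag 24 is lower than at lag 12, which is the slow decay associated with the trend.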