library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.4
## ✔ dplyr       1.1.3     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.1
## ✔ lubridate   1.9.3     ✔ fable       0.3.3
## ✔ ggplot2     3.4.4     ✔ fabletools  0.3.4
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(tsibble)

Exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8

2.1 Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

Use ? (or help()) to find out about the data in each series.

What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.
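Each data set's documentation can be pulled up directly from the console, for example:

?aus_production
?pelt
?gafa_stock
?vic_elec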

aus_production
## # A tsibble: 218 x 7 [1Q]
##    Quarter  Beer Tobacco Bricks Cement Electricity   Gas
##      <qtr> <dbl>   <dbl>  <dbl>  <dbl>       <dbl> <dbl>
##  1 1956 Q1   284    5225    189    465        3923     5
##  2 1956 Q2   213    5178    204    532        4436     6
##  3 1956 Q3   227    5297    208    561        4806     7
##  4 1956 Q4   308    5681    197    570        4418     6
##  5 1957 Q1   262    5577    187    529        4339     5
##  6 1957 Q2   228    5651    214    604        4811     7
##  7 1957 Q3   236    5317    227    603        5259     7
##  8 1957 Q4   320    6152    222    582        4735     6
##  9 1958 Q1   272    5758    199    554        4608     5
## 10 1958 Q2   233    5641    229    620        5196     7
## # ℹ 208 more rows

The Australian production tsibble shows quarterly data for six manufactured commodities, provided by the Australian government, spanning more than 50 years. Brick production is measured in millions of bricks per quarter.

autoplot(aus_production, Bricks) +
    labs(title = 'Australian Brick Production over Time',
         x = 'Quarter', y = 'Bricks (millions)')
## Warning: Removed 20 rows containing missing values (`geom_line()`).

There is definitely some seasonality in brick production, and from the mid-1950s to the early 1980s there was a clear increasing trend despite a few drops. Some cycles may have occurred from the mid-1980s onward.
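The warning about removed rows comes from the missing Bricks values at the end of the series; a quick count confirms this (a minimal check, converting to a tibble first so the count ignores the time index):

aus_production |>
    as_tibble() |>
    filter(is.na(Bricks)) |>
    count()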

pelt
## # A tsibble: 91 x 3 [1Y]
##     Year  Hare  Lynx
##    <dbl> <dbl> <dbl>
##  1  1845 19580 30090
##  2  1846 19600 45150
##  3  1847 19610 49150
##  4  1848 11990 39520
##  5  1849 28040 21230
##  6  1850 58000  8420
##  7  1851 74600  5560
##  8  1852 75090  5080
##  9  1853 88480 10170
## 10  1854 61280 19600
## # ℹ 81 more rows

The time series shows yearly trading counts for two types of furs (snowshoe hare and Canadian lynx) sold by the Hudson Bay Company over a 91-year period (1845 - 1935).

autoplot(pelt, Lynx) +
    labs(title = 'Total Lynx Furs Traded from 1845 - 1935',
         subtitle = 'by Hudson Bay Company',
         x = 'Year', y = 'Canadian lynx pelts traded')

It’s not clear that a long-term trend exists here. The recurring rises and falls in the lynx series are not seasonality, since annual data cannot have a seasonal period; they are better described as cycles.
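Though not required by the exercise, a quick autocorrelation plot makes the roughly ten-year cycle in the lynx series easier to see:

pelt |> ACF(Lynx) |> autoplot()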

gafa_stock
## # A tsibble: 5,032 x 8 [!]
## # Key:       Symbol [4]
##    Symbol Date        Open  High   Low Close Adj_Close    Volume
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
##  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
##  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
##  3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
##  4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
##  5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
##  6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
##  7 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000
##  8 AAPL   2014-01-13  75.7  77.5  75.7  76.5      64.9  94623200
##  9 AAPL   2014-01-14  76.9  78.1  76.8  78.1      66.1  83140400
## 10 AAPL   2014-01-15  79.1  80.0  78.8  79.6      67.5  97909700
## # ℹ 5,022 more rows

The tsibble contains daily price data for four popular tech stocks, uniquely identified by Symbol. Since the stock market is not open every day, the [!] in the header warns that the interval is irregular: the trading-day observations from 2014 - 2018 contain gaps.
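If a regular index is ever needed (for modelling, say), one workaround used later in fpp3 is to re-index by trading day; a minimal sketch for a single stock, where day is simply a new column name chosen here:

gafa_stock |>
    filter(Symbol == 'AAPL') |>
    mutate(day = row_number()) |>
    update_tsibble(index = day, regular = TRUE)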

autoplot(gafa_stock, Close) +
    labs(title = 'Closing Price of Stocks from 2014 - 2018',
         x = 'Date', y = 'Closing Price (USD)')

The trends for Amazon and Google are clearly increasing for the vast majority of the window. All four of these stocks belonged to the widely followed “FAANG” group and experienced large price gains during the 2010s.
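Because the four stocks trade at very different price levels, separating the panels with free y-scales can make each trend easier to compare (a sketch, reusing the faceting idea from question 2.3 below):

gafa_stock |>
    autoplot(Close) +
    facet_grid(Symbol ~ ., scales = "free_y")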

summary(vic_elec)
##       Time                         Demand      Temperature   
##  Min.   :2012-01-01 00:00:00   Min.   :2858   Min.   : 1.50  
##  1st Qu.:2012-09-30 22:52:30   1st Qu.:3969   1st Qu.:12.30  
##  Median :2013-07-01 22:45:00   Median :4635   Median :15.40  
##  Mean   :2013-07-01 22:45:00   Mean   :4665   Mean   :16.27  
##  3rd Qu.:2014-04-01 23:37:30   3rd Qu.:5244   3rd Qu.:19.40  
##  Max.   :2014-12-31 23:30:00   Max.   :9345   Max.   :43.20  
##       Date             Holiday       
##  Min.   :2012-01-01   Mode :logical  
##  1st Qu.:2012-09-30   FALSE:51120    
##  Median :2013-07-01   TRUE :1488     
##  Mean   :2013-07-01                  
##  3rd Qu.:2014-04-01                  
##  Max.   :2014-12-31
vic_elec
## # A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
##    Time                Demand Temperature Date       Holiday
##    <dttm>               <dbl>       <dbl> <date>     <lgl>  
##  1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
##  2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
##  3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
##  4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
##  5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
##  6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
##  7 2012-01-01 03:00:00  3694.        20.1 2012-01-01 TRUE   
##  8 2012-01-01 03:30:00  3562.        19.6 2012-01-01 TRUE   
##  9 2012-01-01 04:00:00  3433.        19.1 2012-01-01 TRUE   
## 10 2012-01-01 04:30:00  3359.        19.0 2012-01-01 TRUE   
## # ℹ 52,598 more rows

This tsibble shows the electricity demand at 30-minute intervals for the state of Victoria, Australia. Demand is measured in MWh for each half-hour period.

autoplot(vic_elec, Demand) +
    labs(title = 'Total Electricity Demand in Victoria, Australia',
         x = 'Time', y = 'Demand (MWh)')

Given the half-hourly interval, the observations are heavily compressed when viewed across such a wide period of time, although there are clear seasonal swings, likely driven by periods of peak heat or cold.
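Zooming in on a single month with filter_index() shows the daily pattern hidden at the full scale (a quick sketch; January 2012 is an arbitrary choice):

vic_elec |>
    filter_index("2012-01") |>
    autoplot(Demand) +
    labs(title = 'Victorian Electricity Demand, January 2012', y = 'Demand (MWh)')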

2.2: Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

gafa_stock |>
    group_by(Symbol) |>
    mutate(max_price = rank(-Close)) |>
    filter(max_price == 1)
## # A tsibble: 4 x 9 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume max_price
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>     <dbl>
## 1 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800         1
## 2 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100         1
## 3 FB     2018-07-25  216.  219.  214.  218.      218. 58954200         1
## 4 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600         1
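An equivalent approach filters on the maximum close directly; a sketch (note that if a stock hit its peak close on more than one day, every such day would be returned):

gafa_stock |>
    group_by(Symbol) |>
    filter(Close == max(Close)) |>
    select(Symbol, Date, Close)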

2.3 Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

You can read the data into R with the following script:

tute1 <- readr::read_csv("https://otexts.com/fpp3/extrafiles/tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (3): Sales, AdBudget, GDP
## date (1): Quarter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(tute1)
## Rows: 100
## Columns: 4
## $ Quarter  <date> 1981-03-01, 1981-06-01, 1981-09-01, 1981-12-01, 1982-03-01, …
## $ Sales    <dbl> 1020.2, 889.2, 795.0, 1003.9, 1057.7, 944.4, 778.5, 932.5, 99…
## $ AdBudget <dbl> 659.2, 589.0, 512.5, 614.1, 647.2, 602.0, 530.7, 608.4, 637.9…
## $ GDP      <dbl> 251.8, 290.9, 290.8, 292.4, 279.1, 254.0, 295.6, 271.7, 259.6…

Convert the data to time series

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

Construct time series plots of each of the three series

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

facet_grid() separates the plot into one panel per series. Pivoting longer stacks the three numeric columns into a single value column, with each series labelled by the default name column, which drives both the facets and the colours. Without the faceting, all of the lines would be drawn in a single plot.

Check what happens when you don’t include facet_grid().

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

As expected, the lines all appear in one graph sharing a single y-axis. It is odd that GDP is so much lower in value than the other series, given that Sales and AdBudget are for only one small company.

2.4: The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.

library(USgas)

Create a tsibble from us_total with year as the index and state as the key.

Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

gas_ts <- us_total |>
    as_tsibble(key = state, index = year)

gas_ts |>
    filter(state %in% c('Maine', 'Vermont', 'New Hampshire',
                        'Massachusetts', 'Connecticut', 'Rhode Island')) |>
    autoplot(.vars = y) +
    labs(title = 'US New England Gas Consumption',
         x = 'Year', y = 'Million Cubic Feet')

2.5: Download tourism.xlsx from the book website and read it into R using readxl::read_excel(). Create a tsibble which is identical to the tourism tsibble from the tsibble package.

tour_input <- readxl::read_excel('/Users/JoshForster/Desktop/Masters_Data_Sci/Data624/tourism.xlsx')
# local copy of https://OTexts.com/fpp3/extrafiles/tourism.xlsx
head(tour_input)
## # A tibble: 6 × 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.
tourism
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows
tourism_df <- tour_input |>
    mutate(Quarter = yearquarter(Quarter)) |>
    as_tsibble(index = Quarter, key = c(Region, State, Purpose))

tourism_df
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows

The only changes needed from the initial Excel input were converting the Quarter variable with yearquarter() so it could serve as the index, and setting the same key columns (Region, State, Purpose) as tourism.
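A quick equality check confirms the rebuilt tsibble matches the packaged one (a sketch; all.equal() is more tolerant of attribute ordering than identical()):

all.equal(tourism_df, tourism)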

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

tourism_df |>
    as_tibble() |>
    group_by(Region, Purpose) |>
    summarise(avg_trips = mean(Trips)) |>
    ungroup() |>
    arrange(-avg_trips) |>
    slice(1)
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
## # A tibble: 1 × 3
##   Region Purpose  avg_trips
##   <chr>  <chr>        <dbl>
## 1 Sydney Visiting      747.

Visiting trips to Sydney were the most popular combination, averaging roughly 747 thousand overnight trips per quarter.
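The same answer can be reached a little more directly with slice_max(), which avoids the arrange/slice pair (a sketch; .groups = 'drop' just silences the summarise message):

tourism_df |>
    as_tibble() |>
    group_by(Region, Purpose) |>
    summarise(avg_trips = mean(Trips), .groups = 'drop') |>
    slice_max(avg_trips, n = 1)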

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

state_tour <- tour_input |>
    mutate(Quarter = yearquarter(Quarter)) |>
    group_by(State, Quarter) |>
    summarise(Total_Trips = sum(Trips)) |>
    as_tsibble(index = Quarter, key = State)
## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
state_tour
## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
## # Groups:    State [8]
##    State Quarter Total_Trips
##    <chr>   <qtr>       <dbl>
##  1 ACT   1998 Q1        551.
##  2 ACT   1998 Q2        416.
##  3 ACT   1998 Q3        436.
##  4 ACT   1998 Q4        450.
##  5 ACT   1999 Q1        379.
##  6 ACT   1999 Q2        558.
##  7 ACT   1999 Q3        449.
##  8 ACT   1999 Q4        595.
##  9 ACT   2000 Q1        600.
## 10 ACT   2000 Q2        557.
## # ℹ 630 more rows
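Though not asked for, a quick plot of the new tsibble is a useful sanity check on the aggregation:

state_tour |>
    autoplot(Total_Trips) +
    labs(title = 'Total Overnight Trips by State', y = 'Trips (thousands)')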

2.8: Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Can you spot any seasonality, cyclicity and trend? What do you learn about the series? What can you say about the seasonal patterns? Can you identify any unusual years?

us_employment |>
    filter(Title == 'Total Private') |>
    autoplot(Employed) +
    labs(title = 'Private Employed Persons in the US')

The autoplot of privately employed persons shows a strong upward trend from the 1960s onward, along with clear seasonality and occasional cyclic downturns in employment.

us_employment |> filter(Title=='Total Private') |> gg_season(Employed)

It’s a bit challenging to discern a clear pattern across the months, but there may be a slight increase towards November and December, possibly aligning with the hiring of temporary retail workers for the holiday season.

us_employment |>
    filter(Title == 'Total Private') |>
    filter_index("2000-01" ~ .) |>
    select(-Series_ID) |>
    gg_subseries(Employed)

Given the wide time frame of the data set, the subseries patterns only become visible after taking a smaller slice of time. In this window, the summer months have a slightly higher average employed total than the rest of the months, a fact that was not discernible in the previous graphics.

us_employment |>
    filter(Title == 'Total Private') |>
    gg_lag(Employed, lags = 1:24) +
    theme(axis.text.x = element_text(angle = 90))

The shortest lags show a very strong correlation, and each additional lag introduces a bit more variability, although the linear relationship remains very strong throughout.

us_employment |>
    filter(Title == 'Total Private') |>
    ACF(Employed) |>
    autoplot() +
    labs(title = "US Employed Autocorrelation")

As in the lag plot, the lagged values remain strongly correlated as the number of months in the lag increases. Perhaps this is due to long economic cycles that have persistently increased employment over time: across the entire data set the series generally moves upward, interrupted only by temporary recessions.

aus_production |> autoplot(Bricks)
## Warning: Removed 20 rows containing missing values (`geom_line()`).

At this high level there is clear seasonality in the plot, although the exact quarterly pattern is harder to pin down. The mid-1950s through the early 1980s show a strong upward trend, which ultimately reverses into a downward one. The cycles after 1980 likely track the broader economy: housing construction is typically highly sensitive to economic cycles, and brick production, as an input to homes and buildings, would be expected to follow the same relationship. Two especially pronounced drop-offs in production occur in the late 1970s and the early 1980s.

aus_production |> gg_season(Bricks)
## Warning: Removed 20 rows containing missing values (`geom_line()`).

The seasonal pattern isn’t uniform for brick production across the years, but in many years Q2 and Q3 show the highest output. There are plenty of exceptions, so it is by no means a consistent rule.

aus_production |> gg_subseries(Bricks)
## Warning: Removed 5 rows containing missing values (`geom_line()`).

This graphic does a better job of identifying differences across the quarters over the full period available in the series. Q2 and Q3 appear slightly higher on average for brick production, while the cycles present in the data affect every quarter; the seasonal behaviour is generally the same across all four quarters.

aus_production |> gg_lag(Bricks,geom = "point")
## Warning: Removed 20 rows containing missing values (gg_lag).

With each additional lag the general pattern is a weakening correlation, though lags on a regular cadence (seasonal multiples) may be slightly stronger; this is not obvious from the lag plot alone.

aus_production |> ACF(Bricks) |> autoplot()

The seasonality of a full cycle is more evident in the autocorrelation plot: crests occur at every full seasonal period (four quarters), giving slightly higher lagged correlations at those intervals even as the overall strength of the relationship decays with increasing lag.

pelt |> autoplot(Hare) + labs(title='Hudson Bay Company Snowshoe Hare trade')

This view of the data doesn’t show a dramatic trend over the available period. There are regular rises and falls every few years, although their length is not immediately obvious from this plot alone; because the data are annual, these fluctuations are cycles rather than seasonality. The early 1860s and mid-1880s show substantially higher numbers of hare pelts traded.

pelt |> filter_index('1900' ~ .) |> gg_subseries(Hare)

The subseries plot is not very useful for this series, even after cutting to a much smaller window: with annual data there is no within-year season to split on, so the cyclic rises and falls every few years remain easier to see in the time plot.

gg_lag(pelt,Hare,geom='point')

The lag plot doesn’t show strong linear relationships at most lags, which is consistent with the lack of clear patterns in the time plot.

ACF(pelt,Hare) |> autoplot()

The autocorrelation plot does a better job of showing the structure: the correlations oscillate like a sine curve, moving up and down roughly every half decade, while many lags show little or no linear relationship. Despite this fairly regular cadence, the pattern reflects cycles rather than seasonality, since seasonality requires a fixed calendar period, which annual data cannot have.

PBS |> filter(ATC2=='H02') |> autoplot(Cost)

There is clear seasonality occurring across all of the keyed groups of payments/costs. Concessional payments have a clearer upward trend, whereas the trend is harder to discern for the other payment costs.

PBS |> filter(ATC2=='H02') |> gg_season(Cost)

The different types of costs have distinct seasonal patterns that are much clearer in this graphic. Concessional and General Safety Net costs show large spikes around the end and beginning of the year, which likely corresponds with some incentive to purchase these medicines around that time; perhaps individuals buy a yearly supply then. Co-payments don’t exhibit as clear a pattern, although costs increase around the April and May time frame without being so dramatically concentrated.

PBS |> filter(ATC2=='H02') |> gg_subseries(Cost)

This graphic displays similar seasonal patterns to gg_season(), although it is easier to see some apparent cycles for co-payments occurring at the end of the period. The effect is more pronounced for concessional safety net costs, while general safety net expenses are much more muted in comparison.

PBS |>
    filter(ATC2 == 'H02', Concession == 'Concessional', Type == 'Co-payments') |>
    gg_lag(Cost, geom = 'point', lags = 1:12)

As with other monthly time series, the lag of a full 12 months is the most highly correlated.

PBS |>
    filter(ATC2 == 'H02', Concession == 'Concessional', Type == 'Co-payments') |>
    ACF(Cost) |>
    autoplot()

The seasonality is more evident in this graphic, although it primarily makes the same point: 12-month intervals show much higher lagged correlations.

autoplot(us_gasoline) + labs(title='Barrels of Gasoline Consumed')
## Plot variable not specified, automatically selected `.vars = Barrels`

At first glance the autoplot shows a strong upward trend from 1991 through roughly 2008, and there appears to be some seasonality in the data. A potential cycle occurs after the 2008 financial crisis, although the trend ultimately turned somewhat upward again.

us_gasoline |> gg_season(Barrels)

It’s rather hard to tell much of anything, although peak consumption might occur around June; further graphics may identify this pattern more clearly than gg_season().

us_gasoline |> gg_subseries(Barrels)

Even when tested with smaller time frames, this plot does not have enough space to show patterns: weekly data produces one narrow panel per week of the year.

us_gasoline |> gg_lag(Barrels,lags=52,geom='point')

It’s not immediately apparent how to interpret the individual weekly lags, but there are clearly strong linear lagged relationships across the data set.

us_gasoline |> ACF(Barrels,lag_max=75) |> autoplot() + labs(title='Autocorrelation Plot of US Gasoline Consumption')

As with the other data sets, there is higher lagged correlation after a full pass through the seasonal period, which in this case is 52 weeks. Even aside from that bump, the correlations across all lags remain fairly strong, which speaks to the persistent trend running through the series.