HW1DATA624

R Markdown

Below I load the necessary packages

We are tasked with completing questions 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from chapter 2 of “Forecasting: Principles and Practice” Third Edition by Rob Hyndman and George Athanasopoulos.

Exercise 2.1

Bricks

We are tasked with exploring four different time series. Below I start with the Bricks time series from aus_production. I used the ? function to get a description of the data set. The data set is quarterly production data of selected commodities in Australia, captured is a tsibble.

bricks <- aus_production |> select(Quarter, Bricks)
head(bricks)

## # A tsibble: 6 x 2 [1Q]
##   Quarter Bricks
##     <qtr>  <dbl>
## 1 1956 Q1    189
## 2 1956 Q2    204
## 3 1956 Q3    208
## 4 1956 Q4    197
## 5 1957 Q1    187
## 6 1957 Q2    214

I use autoplot to create a plot of the time series

bricks |> autoplot() + labs(title = 'Quarterly Bricks Production')

## Plot variable not specified, automatically selected `.vars = Bricks`

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

Lynx

I used the ? function to get a description of the data set. The data set captures trading records from the Hudson Bay Company for the Snowshoe Hare and Canadian Lynx furs from 1845 to 1935

lynx <- pelt |> select(Year, Lynx)
head(lynx)

## # A tsibble: 6 x 2 [1Y]
##    Year  Lynx
##   <dbl> <dbl>
## 1  1845 30090
## 2  1846 45150
## 3  1847 49150
## 4  1848 39520
## 5  1849 21230
## 6  1850  8420

The data is captured annually

I use autoplot to create a plot of the time series

lynx |> autoplot() + labs(title = 'Annual Lynx furs trading record')

## Plot variable not specified, automatically selected `.vars = Lynx`

Close

I used the ? function to get a description of the data set. The data set captures historical stock prices from 2014-2018 for Google, Amazon, Facebook, and Apple in $USD.

head(gafa_stock)

## # A tsibble: 6 x 8 [!]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close    Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
## 2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
## 3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
## 4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
## 5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
## 6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200

The data is captured daily on irregular trading days (certain days don’t have trading activity, such as weekends or holidays).

I filter for the Closing price. There are four time series in this date set, the closing price for each of the four stocks.

close <- gafa_stock |> select(Date, Close)
head(close)

## # A tsibble: 6 x 3 [!]
## # Key:       Symbol [1]
##   Date       Close Symbol
##   <date>     <dbl> <chr> 
## 1 2014-01-02  79.0 AAPL  
## 2 2014-01-03  77.3 AAPL  
## 3 2014-01-06  77.7 AAPL  
## 4 2014-01-07  77.1 AAPL  
## 5 2014-01-08  77.6 AAPL  
## 6 2014-01-09  76.6 AAPL

I use autoplot to create a plot of the time series

close |> autoplot() + labs(title = 'Daily close price')

## Plot variable not specified, automatically selected `.vars = Close`

Given the different prices, it would be more appropriate to separate the time series onto different graphs:

close |> ggplot(aes(x = Date, y= Close, color= Symbol)) + geom_line() + facet_grid(Symbol ~ ., scales = 'free_y') + labs(title = 'Daily close price')

Demand

I used the ? function to get a description of the data set. The data set contains half-hourly electricity demand data for Victoria Australia. The time interval is 30 minutes.

demand <- vic_elec |> select(Time, Demand)
head(demand)

## # A tsibble: 6 x 2 [30m] <Australia/Melbourne>
##   Time                Demand
##   <dttm>               <dbl>
## 1 2012-01-01 00:00:00  4383.
## 2 2012-01-01 00:30:00  4263.
## 3 2012-01-01 01:00:00  4049.
## 4 2012-01-01 01:30:00  3878.
## 5 2012-01-01 02:00:00  4036.
## 6 2012-01-01 02:30:00  3866.

I use autoplot to create a plot of the time series

demand |> autoplot() + labs(title = 'Half Hour Electricity demand for Victoria, Australia')

## Plot variable not specified, automatically selected `.vars = Demand`

Exercise 2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

It should be noted that while the question asks us to find the day of the week corresponding to the maximum value of Close, it would be more accurate to view this for Adj_Close, which captures dividends, stock splits, etc.

unique(gafa_stock$Symbol)

## [1] "AAPL" "AMZN" "FB"   "GOOG"

Google

googleMax <- gafa_stock |> filter(Symbol=='GOOG') |> slice_max(Close) 
weekdays(as.Date(googleMax$Date))

## [1] "Thursday"

Apple

aapleMax <- gafa_stock |> filter(Symbol=='AAPL') |> slice_max(Close) 
weekdays(as.Date(aapleMax$Date))

## [1] "Wednesday"

FaceBook

facebookMax <- gafa_stock |> filter(Symbol=='FB') |> slice_max(Close) 
weekdays(as.Date(facebookMax$Date))

## [1] "Wednesday"

Amazon

amazonMax <- gafa_stock |> filter(Symbol=='AMZN') |> slice_max(Close) 
weekdays(as.Date(amazonMax$Date))

## [1] "Tuesday"

Exercise 2.3

We are tasked with importing a csv file that has quarterly data for a company and performing data analysis.

Note that the file is saved in my working directory. To reproduce this code, the user would need to save the file in their working directory.

tute1 <- readr::read_csv("tute1.csv")

## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (3): Sales, AdBudget, GDP
## date (1): Quarter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(tute1)

Converting the data to time series

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

Constructing time series plots

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

When we don’t include fact_grid, we get the below result - the time series are all graphed on one graph.

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

Exercise 2.4

library(USgas)

We are tasked with creating a tsibble from us_total with year as the index and state as the key.

First, I examine the structure of us_total

head(us_total)

##   year   state      y
## 1 1997 Alabama 324158
## 2 1998 Alabama 329134
## 3 1999 Alabama 337270
## 4 2000 Alabama 353614
## 5 2001 Alabama 332693
## 6 2002 Alabama 379343

Next, I create a tsibble

gasTsibble <- us_total |> as_tsibble(index = year, key = state)

head(gasTsibble)

## # A tsibble: 6 x 3 [1Y]
## # Key:       state [1]
##    year state        y
##   <int> <chr>    <int>
## 1  1997 Alabama 324158
## 2  1998 Alabama 329134
## 3  1999 Alabama 337270
## 4  2000 Alabama 353614
## 5  2001 Alabama 332693
## 6  2002 Alabama 379343

Next, we have to plot gas consumption for the states: Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island

newEnglandGas <- gasTsibble |> filter(state== c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island'))

newEnglandGas |> ggplot(aes(x = year, y= y, color = state)) +
  geom_line() + 
  facet_grid(state ~ ., scales = "free_y")

Exercise 2.5

We are instructed to download an excel file that contains tourism data and import it via the read_excel function.

I downloaded this file into my working directory. To reproduce this code would require the person executing the code to save the excel file in their working directory before running the code.

excelTourism <- readxl::read_excel('tourism.xlsx')

head(excelTourism)

## # A tibble: 6 × 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.

We are then instructed to create a tsibble that is identical to the tourism tsibble in the tsibble package. First, I check what the tourism tsibble looks like

head(tourism)

## # A tsibble: 6 x 5 [1Q]
## # Key:       Region, State, Purpose [1]
##   Quarter Region   State           Purpose  Trips
##     <qtr> <chr>    <chr>           <chr>    <dbl>
## 1 1998 Q1 Adelaide South Australia Business  135.
## 2 1998 Q2 Adelaide South Australia Business  110.
## 3 1998 Q3 Adelaide South Australia Business  166.
## 4 1998 Q4 Adelaide South Australia Business  127.
## 5 1999 Q1 Adelaide South Australia Business  137.
## 6 1999 Q2 Adelaide South Australia Business  200.

Next, I convert my excelTourism data frame to a tsibble that matches the above. Before I can covert the data frame, I need to change the Quarter column into a tsibble recognized date structure.

excelTourism$Quarter <- yearquarter(excelTourism$Quarter)

head(excelTourism)

## # A tibble: 6 × 5
##   Quarter Region   State           Purpose  Trips
##     <qtr> <chr>    <chr>           <chr>    <dbl>
## 1 1998 Q1 Adelaide South Australia Business  135.
## 2 1998 Q2 Adelaide South Australia Business  110.
## 3 1998 Q3 Adelaide South Australia Business  166.
## 4 1998 Q4 Adelaide South Australia Business  127.
## 5 1999 Q1 Adelaide South Australia Business  137.
## 6 1999 Q2 Adelaide South Australia Business  200.

Next, I convert the data frame to a tsibble

excelTourismTsibble <- as_tsibble(excelTourism, key = c('Region', 'State','Purpose'),
                                  index = Quarter)

head(excelTourismTsibble)

## # A tsibble: 6 x 5 [1Q]
## # Key:       Region, State, Purpose [1]
##   Quarter Region   State           Purpose  Trips
##     <qtr> <chr>    <chr>           <chr>    <dbl>
## 1 1998 Q1 Adelaide South Australia Business  135.
## 2 1998 Q2 Adelaide South Australia Business  110.
## 3 1998 Q3 Adelaide South Australia Business  166.
## 4 1998 Q4 Adelaide South Australia Business  127.
## 5 1999 Q1 Adelaide South Australia Business  137.
## 6 1999 Q2 Adelaide South Australia Business  200.

Next, we have to find what combination of Region and Purpose had the maximum number of overnight trips on average. I use the dataframe for this as opposed to the tsibble because I want to find the average for each region by type across time, so I no longer need the temporal structure.

excelTourism |> group_by(Region, Purpose) |> summarise(averageTrips = mean(Trips)) |> ungroup() |> head(n = 10)

## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

## # A tibble: 10 × 3
##    Region         Purpose  averageTrips
##    <chr>          <chr>           <dbl>
##  1 Adelaide       Business       156.  
##  2 Adelaide       Holiday        157.  
##  3 Adelaide       Other           56.6 
##  4 Adelaide       Visiting       205.  
##  5 Adelaide Hills Business         2.66
##  6 Adelaide Hills Holiday         10.5 
##  7 Adelaide Hills Other            1.40
##  8 Adelaide Hills Visiting        14.2 
##  9 Alice Springs  Business        14.6 
## 10 Alice Springs  Holiday         31.9

excelTourism |> group_by(Region, Purpose) |> summarise(averageTrips = mean(Trips)) |> ungroup() |>slice_max(averageTrips)

## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

## # A tibble: 1 × 3
##   Region Purpose  averageTrips
##   <chr>  <chr>           <dbl>
## 1 Sydney Visiting         747.

The maximum number of overnight trips on average was to Sydney for Vising Purposes

Lastly, we are instructed to create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

totalTripsByStateTsibble <- excelTourismTsibble |> group_by(State) |> summarise(total_Trips = sum(Trips))

totalTripsByStateTsibble |> head(n=10)

## # A tsibble: 10 x 3 [1Q]
## # Key:       State [1]
##    State Quarter total_Trips
##    <chr>   <qtr>       <dbl>
##  1 ACT   1998 Q1        551.
##  2 ACT   1998 Q2        416.
##  3 ACT   1998 Q3        436.
##  4 ACT   1998 Q4        450.
##  5 ACT   1999 Q1        379.
##  6 ACT   1999 Q2        558.
##  7 ACT   1999 Q3        449.
##  8 ACT   1999 Q4        595.
##  9 ACT   2000 Q1        600.
## 10 ACT   2000 Q2        557.

Exercise 2.8

We are instructed to explore various features for a variety of data sets

Total Private from Employed in us_employment

totalPrivate <- us_employment |> filter(Title == 'Total Private')

head(totalPrivate)

## # A tsibble: 6 x 4 [1M]
## # Key:       Series_ID [1]
##      Month Series_ID     Title         Employed
##      <mth> <chr>         <chr>            <dbl>
## 1 1939 Jan CEU0500000001 Total Private    25338
## 2 1939 Feb CEU0500000001 Total Private    25447
## 3 1939 Mar CEU0500000001 Total Private    25833
## 4 1939 Apr CEU0500000001 Total Private    25801
## 5 1939 May CEU0500000001 Total Private    26113
## 6 1939 Jun CEU0500000001 Total Private    26485

Autoplot of Total Private

totalPrivate |> autoplot()

## Plot variable not specified, automatically selected `.vars = Employed`

gg_season of Total Private

totalPrivate |> gg_season(Employed)

gg_subseries of Total Private

totalPrivate |> gg_subseries(Employed)

gg_lag of Total Private

totalPrivate |> gg_lag(Employed, geom = 'point')

ACF - Total Private

totalPrivate |> ACF(Employed) |> autoplot()

Based on the above plots, I answer the below questions

Can you spot any seasonality, cyclicity and trend?
- There is a positive trend in the employment data with seasonality, evidenced by the first and second plots. There seems to be some periods of cyclicity, where the upward trend pauses before continuing further.
What do you learn about the series?
- I learned that US total employment exhibits seasonality and the time series has a positive trend.
What can you say about the seasonal patterns?
- There seems to be quarterly seasonality. Looking at the ACF plot, there are descending peaks on every second lag.
Can you identify any unusual years?
- The most unusual year seems to be 2008, which has a much larger decrease in total employment than other years. This coincides with the Global Financial Crisis of 2008.

Bricks from aus_production

head(bricks)

## # A tsibble: 6 x 2 [1Q]
##   Quarter Bricks
##     <qtr>  <dbl>
## 1 1956 Q1    189
## 2 1956 Q2    204
## 3 1956 Q3    208
## 4 1956 Q4    197
## 5 1957 Q1    187
## 6 1957 Q2    214

Autoplot of Bricks

bricks |> autoplot()

## Plot variable not specified, automatically selected `.vars = Bricks`

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_season of Bricks

bricks |> gg_season(Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_subseries of Bricks

bricks |> gg_subseries(Bricks)

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_lag of Bricks

bricks |> gg_lag(Bricks, geom = 'point')

## Warning: Removed 20 rows containing missing values (gg_lag).

ACF - Bricks

bricks |> ACF(Bricks) |> autoplot()

Based on the above plots, I answer the below questions

Can you spot any seasonality, cyclicity and trend?
- The time series exhibits all three. There is both an upward trend and a downward trend, and seasonality within both trends. The downward trends seems to change towards an upward trend - these changes in trends indicate cyclicity.
What do you learn about the series?
- Brick production in Australia, as of the mid-2000s has not reached its previous highs. There may be macro reasons for this (slowing construction, change in construction material, etc).
What can you say about the seasonal patterns?
- Seasonality appears on a four month basis - the ACF plot shows descending highs on multiples of four. Q2 and Q3 are the highest production quarters, while Q4 and Q1 are lower.
Can you identify any unusual years?
- There is an usually large drop in the early to mid 1980s - there may have been an external macro factor such as an economic recession/depression.

Hare from pelt

hare <- pelt |> select(Year, Hare) 
hare <- as_tsibble(hare, index = Year)

head(hare)

## # A tsibble: 6 x 2 [1Y]
##    Year  Hare
##   <dbl> <dbl>
## 1  1845 19580
## 2  1846 19600
## 3  1847 19610
## 4  1848 11990
## 5  1849 28040
## 6  1850 58000

Autoplot of Hare

hare |> autoplot(Hare)

gg_season of Hare

gg_season won’t work for this tsibble because the data is on a yearly interval. The function is meant to demonstrate seasonality within a year which we cannot do since we only have yearly data.

gg_subseries of Hare

given data is on a yearly basis, the below chart is similar to the autoplot chart, though it does display the average annual value.

hare |> gg_subseries(Hare)

gg_lag of Hare

hare |> gg_lag(Hare, geom = 'point')

ACF - Hare

hare |> ACF(Hare) |> autoplot()

Based on the above plots, I answer the below questions

Can you spot any seasonality, cyclicity and trend?
- There is no trend for the data and no seasonality (the data is on annual basis); however, there is cyclicity, demonstrated by the peaks and valleys that take multiple years to manifest.
What do you learn about the series?
- There is cyclicity in the time series for hares captured. As suggested by the book, it corresponds with the life span and breeding cycles of the animals.
What can you say about the seasonal patterns?
- N/A - data is annual
Can you identify any unusual years?
- The early to mid 1860s and early to mid 1880s had highs that were never surpassed before or after.

H02 Cost from PBS

h02 <- PBS |> filter(ATC2 =='H02')


head(h02)

## # A tsibble: 6 x 9 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [1]
##      Month Concession   Type      ATC1  ATC1_desc ATC2  ATC2_desc Scripts   Cost
##      <mth> <chr>        <chr>     <chr> <chr>     <chr> <chr>       <dbl>  <dbl>
## 1 1991 Jul Concessional Co-payme… H     Systemic… H02   CORTICOS…   63261 317384
## 2 1991 Aug Concessional Co-payme… H     Systemic… H02   CORTICOS…   53528 269891
## 3 1991 Sep Concessional Co-payme… H     Systemic… H02   CORTICOS…   52822 269703
## 4 1991 Oct Concessional Co-payme… H     Systemic… H02   CORTICOS…   54016 280418
## 5 1991 Nov Concessional Co-payme… H     Systemic… H02   CORTICOS…   49281 268070
## 6 1991 Dec Concessional Co-payme… H     Systemic… H02   CORTICOS…   51798 277139

Since the question is asking to analyze H02 without considering Concession and Type, I will tally up the total costs for H02s across categories

h02Total <- h02 |> summarise(totalCost = sum(Cost)) 

head(h02Total)

## # A tsibble: 6 x 2 [1M]
##      Month totalCost
##      <mth>     <dbl>
## 1 1991 Jul    429795
## 2 1991 Aug    400906
## 3 1991 Sep    432159
## 4 1991 Oct    492543
## 5 1991 Nov    502369
## 6 1991 Dec    602652

Autoplot of H02

h02Total |> autoplot(totalCost)

gg_season of H02

h02Total |> gg_season(totalCost)

gg_subseries of H02

h02Total |> gg_subseries(totalCost)

gg_lag of H02

h02Total |> gg_lag(totalCost, geom = 'point')

ACF - H02

h02Total |> ACF(totalCost) |> autoplot()

Based on the above plots, I answer the below questions

Can you spot any seasonality, cyclicity and trend?
- There is a positive trend is h02 costs with seasonality - prices tend to consistently drop around February and climbing higher thereafter until the following February
What do you learn about the series?
- The time series has a positive trend and seasonality.
What can you say about the seasonal patterns?
- Prices tend to consistently drop around February and climbing higher thereafter until the following February
Can you identify any unusual years?
- There is a peak in 1993/1994 that is higher than prior years and not surpassed until several years later

Barrels from us_gasoline

head(us_gasoline)

## # A tsibble: 6 x 2 [1W]
##       Week Barrels
##     <week>   <dbl>
## 1 1991 W06    6.62
## 2 1991 W07    6.43
## 3 1991 W08    6.58
## 4 1991 W09    7.22
## 5 1991 W10    6.88
## 6 1991 W11    6.95

Autoplot of barrels

us_gasoline |> autoplot(Barrels)

gg_season of barrels

us_gasoline |> gg_season(Barrels)

gg_subseries of barrels

since the data is weekly, this graph is a bit noisy, though it does show that there tends to be a dip in us gasoline supplied in the third quarter (or so) of the year compared to the prior weeks.

us_gasoline |> gg_subseries(Barrels)

gg_lag of barrels

us_gasoline |> gg_lag(Barrels, geom = 'point')

ACF - barrels

us_gasoline |> ACF(Barrels) |> autoplot()

Based on the above plots, I answer the below questions

Can you spot any seasonality, cyclicity and trend?
- The time series has a positive trend that pauses and turns to a negative trend in the early to mid 2000s, before resuming an upward trend, indicative of both trend and cyclicity. Additionally, there is seasonality within the data (four week, or monthly basis).
What do you learn about the series?
- US gasoline supply has seasonality, with the peak supply happening around the middle of the year (week 26 or so).
What can you say about the seasonal patterns?
- Gasoline supply peaks in the middle of the year (around week 26, which is sometime around summer) and starts declining until around week 39, before resuming an increase towards the holiday season, and then declining until spring time.
Can you identify any unusual years?
- Gasoline supply around 1994 dropped to the lowest level in the data set. Furthermore, supply around 2000/2001 dropped to a lower low than prior years.

HW1DATA624

Semyon Toybis

2024-09-02

R Markdown

Exercise 2.1

Bricks

Lynx

Close

Demand

Exercise 2.2

Google

Apple

FaceBook

Amazon

Exercise 2.3

Exercise 2.4

Exercise 2.5

Exercise 2.8

Total Private from Employed in us_employment

Bricks from aus_production

Hare from pelt

H02 Cost from PBS

Barrels from us_gasoline