library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.2
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble     1.1.6     ✔ feasts      0.4.1
## ✔ tsibbledata 0.4.1     ✔ fable       0.4.1
## Warning: package 'tsibble' was built under R version 4.4.2
## Warning: package 'tsibbledata' was built under R version 4.4.2
## Warning: package 'feasts' was built under R version 4.4.2
## Warning: package 'fabletools' was built under R version 4.4.2
## Warning: package 'fable' was built under R version 4.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

# Introduction

In this homework assignment I submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from Hyndman and Athanasopoulos' online book Forecasting: Principles and Practice (3rd edition).

## Question 2.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

- Use ? (or help()) to find out about the data in each series.
- What is the time interval of each series?
- Use autoplot() to produce a time plot of each series.
- For the last plot, modify the axis labels and title.

#help("aus_production")
#help("pelt")
#help("gafa_stock")
#help("vic_elec")

Help and time intervals:

The bricks data refers to the quarterly estimates for clay brick production in millions of bricks in Australia. The time interval here is quarterly.

The lynx data refers to the Hudson Bay Company trading records for Canadian Lynx furs from 1845 to 1935 for all areas of the company. The time interval here is yearly.

The close data refers to the historical closing prices for Google, Amazon, Facebook, and Apple from 2014 to 2018 (in USD). The time interval here is daily, although weekends and days the market is closed (holidays, etc.) are missing, so the index is irregular (shown as [!] in the tsibble output below).

The demand data refers to the total electricity demand for Victoria, Australia, in MWh. The time interval here is half-hourly.
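
The stated intervals can also be checked programmatically with tsibble's interval() function (which, as the conflicts above note, masks lubridate::interval()). A quick sketch:

#Confirming each series' time interval directly from the tsibble
interval(aus_production) # 1Q (quarterly)
interval(pelt)           # 1Y (annual)
interval(gafa_stock)     # ! (irregular: trading days only)
interval(vic_elec)       # 30m (half-hourly)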

bricks <- aus_production %>%
  select(Bricks)
head(bricks)
## # A tsibble: 6 x 2 [1Q]
##   Bricks Quarter
##    <dbl>   <qtr>
## 1    189 1956 Q1
## 2    204 1956 Q2
## 3    208 1956 Q3
## 4    197 1956 Q4
## 5    187 1957 Q1
## 6    214 1957 Q2
autoplot(bricks,Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

lynx <- pelt %>%
  select(Lynx)
head(lynx)
## # A tsibble: 6 x 2 [1Y]
##    Lynx  Year
##   <dbl> <dbl>
## 1 30090  1845
## 2 45150  1846
## 3 49150  1847
## 4 39520  1848
## 5 21230  1849
## 6  8420  1850
autoplot(lynx,Lynx)

close <- gafa_stock %>%
  select(Close)
head(close)
## # A tsibble: 6 x 3 [!]
## # Key:       Symbol [1]
##   Close Date       Symbol
##   <dbl> <date>     <chr> 
## 1  79.0 2014-01-02 AAPL  
## 2  77.3 2014-01-03 AAPL  
## 3  77.7 2014-01-06 AAPL  
## 4  77.1 2014-01-07 AAPL  
## 5  77.6 2014-01-08 AAPL  
## 6  76.6 2014-01-09 AAPL
autoplot(close,Close)

demand <- vic_elec %>%
  select(Demand)
head(demand)
## # A tsibble: 6 x 2 [30m] <Australia/Melbourne>
##   Demand Time               
##    <dbl> <dttm>             
## 1  4383. 2012-01-01 00:00:00
## 2  4263. 2012-01-01 00:30:00
## 3  4049. 2012-01-01 01:00:00
## 4  3878. 2012-01-01 01:30:00
## 5  4036. 2012-01-01 02:00:00
## 6  3866. 2012-01-01 02:30:00
autoplot(demand,Demand) +
  labs(y = "Demand (MWh)",
       x = "Time per half hour",
       title = "Electricity Demand per Half Hour")

## Question 2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

#Filtering by max(Close) to return the peak closing price for each stock
close %>%
  group_by(Symbol) %>%
  filter(Close == max(Close))
## # A tsibble: 4 x 3 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Close Date       Symbol
##   <dbl> <date>     <chr> 
## 1  232. 2018-10-03 AAPL  
## 2 2040. 2018-09-04 AMZN  
## 3  218. 2018-07-25 FB    
## 4 1268. 2018-07-26 GOOG
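
An equivalent sketch using slice_max(): converting to a plain tibble first so the usual dplyr verbs apply, then taking the top row per stock.

#Alternative: slice_max() on a grouped tibble
gafa_stock %>%
  as_tibble() %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%
  select(Symbol, Date, Close)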

## Question 2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

1. You can read the data into R with the following script:
#Reading csv
tute1 <- readr::read_csv("tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (3): Sales, AdBudget, GDP
## date (1): Quarter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(tute1)
2. Convert the data to time series
#Converting data into tsibble time series
mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)
3. Construct time series plots of each of the three series
mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

Without facet_grid(), all three variables are plotted against the same y-axis. This makes the visualization a bit harder to understand for people seeing the chart for the first time, but more importantly the shared scale distorts the apparent variability of each variable's peaks and troughs. For example, the seasonality of GDP looks much weaker when plotted without facet_grid(); in reality the peaks and troughs of GDP may be more significant than they appear on the shared axis.
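
To illustrate the distortion, one sketch (not required by the exercise) is to index each series to its own mean so all three fit one axis on a comparable scale:

#Indexing each series to its own mean before plotting on a shared axis
mytimeseries |>
  pivot_longer(-Quarter) |>
  group_by(name) |>
  mutate(indexed = value / mean(value)) |>
  ggplot(aes(x = Quarter, y = indexed, colour = name)) +
  geom_line()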

## Question 2.4

The USgas package contains data on the demand for natural gas in the US.

a. Install the USgas package.

library(USgas)
## Warning: package 'USgas' was built under R version 4.4.2
#help("usgas")

b. Create a tsibble from us_total with year as the index and state as the key.

# Creating tsibble from us_total (note as_tsibble(), not as_tibble(),
# which would silently ignore the key and index arguments)
us_total.tsibble <- us_total %>%
  as_tsibble(key = state,
             index = year)
us_total.tsibble
## # A tsibble: 1,266 x 3 [1Y]
##     year state        y
##    <int> <chr>    <int>
##  1  1997 Alabama 324158
##  2  1998 Alabama 329134
##  3  1999 Alabama 337270
##  4  2000 Alabama 353614
##  5  2001 Alabama 332693
##  6  2002 Alabama 379343
##  7  2003 Alabama 350345
##  8  2004 Alabama 382367
##  9  2005 Alabama 353156
## 10  2006 Alabama 391093
## # ℹ 1,256 more rows

c. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

#Plotting natural gas consumption by state - leaving facet_wrap() out to see the relative usage of the states on the same y-axis
us_total.tsibble %>%
  filter(state %in% c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island')) %>%
  ggplot(aes(x = year, y = y, colour = state)) +
  geom_line() +
  labs(title = "Annual Natural Gas Consumption (New England)", y = "Gas Consumption (million cubic feet)", x = "Year")
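
For comparison, a faceted sketch of the same data (free y scales) makes each state's own trajectory easier to read, at the cost of the direct same-axis comparison:

#Faceted version of the New England plot
us_total.tsibble %>%
  filter(state %in% c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island')) %>%
  ggplot(aes(x = year, y = y)) +
  geom_line() +
  facet_wrap(~state, scales = "free_y") +
  labs(title = "Annual Natural Gas Consumption (New England)", y = "Gas Consumption (million cubic feet)", x = "Year")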

## Question 2.5

1. Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
#Reading excel file into R
tourism <- readxl::read_excel("C:\\Users\\chung\\School\\624\\tourism.xlsx")
head(tourism)
## # A tibble: 6 × 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.
2. Create a tsibble which is identical to the tourism tsibble from the tsibble package.
# Creating tsibble from excel file
tourism.tsibble <- tourism %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(key = c(Region, State, Purpose),
             index = Quarter)
tourism.tsibble
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows
3. Find what combination of Region and Purpose had the maximum number of overnight trips on average.
#Finding the combination of Region and Purpose with the maximum average number of overnight trips

#Group, compute each group's mean with mutate(), arrange by highest average, and read off the top combination
tourism.tsibble %>%
  group_by(Region, Purpose) %>%
  mutate(avg_trips = mean(Trips)) %>%
  arrange(desc(avg_trips))
## # A tsibble: 24,320 x 6 [1Q]
## # Key:       Region, State, Purpose [304]
## # Groups:    Region, Purpose [304]
##    Quarter Region State           Purpose  Trips avg_trips
##      <qtr> <chr>  <chr>           <chr>    <dbl>     <dbl>
##  1 1998 Q1 Sydney New South Wales Visiting  818.      747.
##  2 1998 Q2 Sydney New South Wales Visiting  639.      747.
##  3 1998 Q3 Sydney New South Wales Visiting  674.      747.
##  4 1998 Q4 Sydney New South Wales Visiting  881.      747.
##  5 1999 Q1 Sydney New South Wales Visiting  784.      747.
##  6 1999 Q2 Sydney New South Wales Visiting  767.      747.
##  7 1999 Q3 Sydney New South Wales Visiting  681.      747.
##  8 1999 Q4 Sydney New South Wales Visiting  728.      747.
##  9 2000 Q1 Sydney New South Wales Visiting  645.      747.
## 10 2000 Q2 Sydney New South Wales Visiting  590.      747.
## # ℹ 24,310 more rows

The Region of Sydney combined with the Purpose of Visiting is the combination with the maximum average number of overnight trips, at 747.27 trips per quarter.
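
A more direct sketch that returns only the winning row: collapse the tsibble to a tibble (so summarise() is not tied to the Quarter index), average by group, and keep the maximum:

#Direct computation of the top Region/Purpose combination
tourism.tsibble %>%
  as_tibble() %>%
  group_by(Region, Purpose) %>%
  summarise(avg_trips = mean(Trips), .groups = "drop") %>%
  slice_max(avg_trips, n = 1)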

4. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
# Creating a new tsibble with total trips by State, indexed by Quarter
tourism.tsibble2 <- tourism %>%
  group_by(Quarter, State) %>%
  mutate(Quarter = yearquarter(Quarter),
         Total_Trips = sum(Trips)) %>%
  select(Quarter, State, Total_Trips) %>%
  distinct() %>%
  as_tsibble(index = Quarter,
             key = State)
tourism.tsibble2
## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
## # Groups:    State @ Quarter [640]
##    Quarter State Total_Trips
##      <qtr> <chr>       <dbl>
##  1 1998 Q1 ACT          551.
##  2 1998 Q2 ACT          416.
##  3 1998 Q3 ACT          436.
##  4 1998 Q4 ACT          450.
##  5 1999 Q1 ACT          379.
##  6 1999 Q2 ACT          558.
##  7 1999 Q3 ACT          449.
##  8 1999 Q4 ACT          595.
##  9 2000 Q1 ACT          600.
## 10 2000 Q2 ACT          557.
## # ℹ 630 more rows
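
An equivalent, more idiomatic sketch: summarising the existing tsibble while grouped by State collapses the Region and Purpose keys automatically and keeps Quarter as the index:

#Idiomatic alternative using the tsibble built earlier
tourism.tsibble %>%
  group_by(State) %>%
  summarise(Total_Trips = sum(Trips))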

## Question 2.8

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Can you spot any seasonality, cyclicity and trend? What do you learn about the series? What can you say about the seasonal patterns? Can you identify any unusual years?

total_private <- us_employment %>%
  filter(Title == "Total Private")

head(total_private)
## # A tsibble: 6 x 4 [1M]
## # Key:       Series_ID [1]
##      Month Series_ID     Title         Employed
##      <mth> <chr>         <chr>            <dbl>
## 1 1939 Jan CEU0500000001 Total Private    25338
## 2 1939 Feb CEU0500000001 Total Private    25447
## 3 1939 Mar CEU0500000001 Total Private    25833
## 4 1939 Apr CEU0500000001 Total Private    25801
## 5 1939 May CEU0500000001 Total Private    26113
## 6 1939 Jun CEU0500000001 Total Private    26485
total_private %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Employed`

total_private %>%
  gg_season()
## Plot variable not specified, automatically selected `y = Employed`

total_private %>%
  gg_subseries()
## Plot variable not specified, automatically selected `y = Employed`

total_private %>%
  gg_lag()
## Plot variable not specified, automatically selected `y = Employed`

total_private %>%
  ACF(Employed) %>%
  autoplot()

There is a strong positive trend in the employment numbers, as seen in the autoplot and the lag plots. The subseries and season plots may indicate weak seasonality, with employment rising around June, July, and August: the average line for those months in the subseries plot sits slightly above the other months, and a slight mid-year increase appears in the gg_season plot. The ACF plot, however, says little about seasonality beyond confirming the strong positive trend (large autocorrelations that decrease slowly over the lags). The years around 2008 look like outliers, probably due to the recession at the time, and the data resumes its trend afterwards. In terms of cyclicity, there seems to be some, potentially due to normal business cycles.
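
Zooming in on the window around the 2008 recession (a quick sketch) makes the dip noted above easier to see:

#Employment around the 2008 recession
total_private %>%
  filter(year(Month) >= 2006, year(Month) <= 2012) %>%
  autoplot(Employed)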

bricks %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

bricks %>%
  gg_season()
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

bricks %>%
  gg_subseries()
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

bricks %>%
  gg_lag()
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 20 rows containing missing values (gg_lag).

bricks %>%
  ACF() %>%
  autoplot()
## Response variable not specified, automatically selected `var = Bricks`

The bricks data shows an upward trend from 1960 to 1980 and then a downward trend from 1980 onward. There seems to be seasonality in which production increases from Q1 through Q3 and then decreases in Q4. The ACF plot also indicates this seasonality through the scalloped pattern that peaks at lags that are multiples of four quarters, and the gg_season plot shows the same Q1-to-Q3 rise and Q4 fall. With more data we might find that the change in trend is part of a longer-term cycle. Around 1982 there is an outlier on the low side that may warrant further research.
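
The "Removed 20 rows" warnings above come from the missing Bricks values at the end of the series; a sketch that drops them first and plots without the warnings:

#Dropping the trailing missing values before plotting
bricks %>%
  filter(!is.na(Bricks)) %>%
  autoplot(Bricks)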

hare <- pelt %>%
  select(Hare)
hare %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Hare`

#hare %>%
  #gg_season()

#gg_season() cannot be run on this data: the series is annual, so there is no seasonal period within the year to plot
hare %>%
  gg_subseries()
## Plot variable not specified, automatically selected `y = Hare`

hare %>%
  gg_lag()
## Plot variable not specified, automatically selected `y = Hare`

hare %>%
  ACF() %>%
  autoplot()
## Response variable not specified, automatically selected `var = Hare`

There is no way to assess seasonality in the hare data because the observations are annual. The time plot shows no obvious trend, but there does seem to be cyclicity, with boom and bust years. The ACF plot indicates strong cyclicity, most likely reflecting the roughly ten-year hare population cycle rather than shifts in demand. 1862 seems to be an outlier year with an unusually large volume of pelts traded.

#help(PBS)
H02 <- PBS %>% filter(ATC2 == "H02") 
H02 %>% autoplot(Cost)

H02 %>%
  gg_season(Cost)

H02 %>%
  gg_subseries(Cost)

H02 %>%
  ACF(Cost) %>%
  autoplot()

The four script series show different trends: two have positive trends and the other two seem to have no trend. This could be due to increasing patient demand for those specific scripts and not the others. In the gg_season plot there is strong seasonality for three of the scripts and seemingly no seasonality for the general co-payments scripts. The ACF plot may also indicate cyclicity, with each individual script type taking a different time horizon from peak to trough.
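
gg_lag() was left out above because it requires a single time series, while the filtered H02 data still contains four Concession/Type combinations; a sketch for one (arbitrarily chosen) combination:

#Lag plot for a single H02 series
H02 %>%
  filter(Concession == "General", Type == "Co-payments") %>%
  gg_lag(Cost)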

us_gasoline %>% autoplot(Barrels)

us_gasoline %>% gg_season(Barrels)

us_gasoline %>% gg_subseries(Barrels)

us_gasoline %>% gg_lag(Barrels)

us_gasoline %>% 
  ACF(Barrels)%>%
  autoplot()

The data shows a positive trend that flattens around the year 2000. There seems to be seasonality in which the weeks around June have a higher barrel count than the weeks before or after. There does not seem to be much cyclicity over this time horizon, although the change in trend may indicate that a longer horizon would reveal cyclic behavior. The gg_lag plots show consistent clustering, which could also indicate a stable trend and the absence of cycles.
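
Restricting to the period after the trend change (a quick sketch) shows the flattening noted above more clearly:

#Gasoline supplied from 2000 onward
us_gasoline %>%
  filter(year(Week) >= 2000) %>%
  autoplot(Barrels)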