DATA 624: Homework 1

Nakesha Fray

2025-01-28

Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5, and 2.8 from the Hyndman online Forecasting book. Submit both your RPubs link and the attached .pdf file with your code.
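All code chunks below assume the fpp3 meta-package is attached (it loads tsibble, tsibbledata, dplyr, tidyr, lubridate, ggplot2, feasts, and fable); a minimal setup chunk:

# Load the fpp3 meta-package used throughout this homework
library(fpp3)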

2.1: Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

Use ? (or help()) to find out about the data in each series.

Bricks refers to the production of clay bricks (in millions of units) in Australia from 1956 to 2010, from the tsibble aus_production.
Lynx refers to the number of Canadian Lynx pelts traded by the Hudson Bay Company from 1845 to 1935, from the tsibble pelt.
Close refers to the daily closing stock price (in USD) for Amazon, Google, Facebook, and Apple from 2014 to 2018, from the irregular-trading-day tsibble gafa_stock.
Demand refers to the total electricity demand (in MWh) in Victoria, Australia, from the tsibble vic_elec.

What is the time interval of each series?

Bricks from aus_production: quarterly (every 3 months)
Lynx from pelt: annual (every 12 months)
Close from gafa_stock: daily, on irregular trading days
Demand from vic_elec: half-hourly (every 30 minutes)

?aus_production
?pelt
?gafa_stock
?vic_elec
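The intervals can also be confirmed programmatically with tsibble's interval() helper; a quick check, assuming the tsibbles above are available:

# Read each series' interval directly from its tsibble metadata
interval(aus_production)  # quarterly
interval(pelt)            # annual
interval(gafa_stock)      # irregular ([!])
interval(vic_elec)        # 30 minutes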

Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.

autoplot(aus_production, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

autoplot(pelt, Lynx)

autoplot(gafa_stock, Close)

autoplot(vic_elec, Demand) + 
  labs(title = "Electricity Demand in MWh",
       subtitle = "Victoria, Australia",
       y = "Total electricity demand (MWh)",
       x = "Time (half-hourly, 2012-2014)")

2.2 Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

glimpse(gafa_stock)
## Rows: 5,032
## Columns: 8
## Key: Symbol [4]
## $ Symbol    <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAP…
## $ Date      <date> 2014-01-02, 2014-01-03, 2014-01-06, 2014-01-07, 2014-01-08,…
## $ Open      <dbl> 79.38286, 78.98000, 76.77857, 77.76000, 76.97285, 78.11429, …
## $ High      <dbl> 79.57571, 79.10000, 78.11429, 77.99429, 77.93714, 78.12286, …
## $ Low       <dbl> 78.86000, 77.20428, 76.22857, 76.84571, 76.95571, 76.47857, …
## $ Close     <dbl> 79.01857, 77.28286, 77.70428, 77.14857, 77.63715, 76.64571, …
## $ Adj_Close <dbl> 66.96433, 65.49342, 65.85053, 65.37959, 65.79363, 64.95345, …
## $ Volume    <dbl> 58671200, 98116900, 103152700, 79302300, 64632400, 69787200,…
peak_close_gs <- gafa_stock %>%
  select(Symbol, Date, Close) %>%
  group_by(Symbol) %>%
  filter(Close == max(Close))

peak_close_gs
## # A tsibble: 4 x 3 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date       Close
##   <chr>  <date>     <dbl>
## 1 AAPL   2018-10-03  232.
## 2 AMZN   2018-09-04 2040.
## 3 FB     2018-07-25  218.
## 4 GOOG   2018-07-26 1268.

The days that correspond to the peak closing price for each of the four stocks in gafa_stock are:

Apple: 10/3/2018 with a closing price of 232.07
Amazon: 9/4/2018 with a closing price of 2039.51
Facebook: 7/25/2018 with a closing price of 217.50
Google: 7/26/2018 with a closing price of 1268.33
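An equivalent way to pull the same rows is dplyr's slice_max(); a sketch, assuming dplyr >= 1.0.0, that drops the tsibble structure first so grouped slicing behaves like a plain tibble:

# Same result via slice_max(): the row with the largest Close per Symbol
gafa_stock %>%
  as_tibble() %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%
  select(Symbol, Date, Close)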

2.3 Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

a. You can read the data into R with the following script:

tute1 <- readr::read_csv("C:\\Users\\Kesha\\Desktop\\Spring 2024\\DATA 624\\tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (3): Sales, AdBudget, GDP
## date (1): Quarter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(tute1)

b. Convert the data to time series

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

c. Construct time series plots of each of the three series. Check what happens when you don’t include facet_grid().

facet_grid() splits the plot into a matrix of panels, one per variable (Sales, AdBudget, and GDP), within a single figure, each panel with its own y-axis scale. Without facet_grid(), the three series are drawn together in one panel that shares a single y-axis, so each series loses its own y-label and all three are rescaled against one common axis; as a result, they no longer sit in the same vertical order as in the faceted plot.

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

# Without facet_grid()
mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() 
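For comparison, facet_wrap() gives a similar one-panel-per-series layout; a minimal sketch:

# facet_wrap() stacks one panel per series; free y-scales as before
mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_wrap(~ name, ncol = 1, scales = "free_y")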

2.4 The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.

#install.packages("USgas")
library(USgas)

Create a tsibble from us_total with year as the index and state as the key.

view(us_total)

us_total_tsibble <- us_total %>%
  as_tsibble(key = state,
             index = year)

us_total_tsibble
## # A tsibble: 1,266 x 3 [1Y]
## # Key:       state [53]
##     year state        y
##    <int> <chr>    <int>
##  1  1997 Alabama 324158
##  2  1998 Alabama 329134
##  3  1999 Alabama 337270
##  4  2000 Alabama 353614
##  5  2001 Alabama 332693
##  6  2002 Alabama 379343
##  7  2003 Alabama 350345
##  8  2004 Alabama 382367
##  9  2005 Alabama 353156
## 10  2006 Alabama 391093
## # ℹ 1,256 more rows

Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

us_total_tsibble %>%
  filter(state %in% c("Maine", "Vermont", "New Hampshire", "Massachusetts",
                      "Connecticut", "Rhode Island")) %>%
  autoplot(y) + 
  labs(title = "Annual Natural Gas Consumption for the New England Area",
       y = "Gas Consumption (MMcf)",
       x = "Year")

2.5 Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

tourism1 <- readxl::read_excel("C:\\Users\\Kesha\\Desktop\\Spring 2024\\DATA 624\\tourism.xlsx")
View(tourism1)

Create a tsibble which is identical to the tourism tsibble from the tsibble package.

data(tourism)
tourism
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows
tourism2 <- tourism1 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(key = c(Region, State, Purpose),
             index = Quarter)

tourism2
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows
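A quick way to check that the reconstruction matches the packaged tsibble (all.equal() is more tolerant of attribute ordering than identical()):

# Verify the rebuilt tsibble matches tourism from the tsibble package
all.equal(tourism2, tourism)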

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

The region Sydney combined with the purpose Visiting had the maximum number of overnight trips on average, with approximately 747 trips per quarter.

tourism1 %>% 
  group_by(Region, Purpose) %>%
  summarise(trip_avg = mean(Trips)) %>%
  arrange(desc(trip_avg))
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
## # A tibble: 304 × 3
## # Groups:   Region [76]
##    Region          Purpose  trip_avg
##    <chr>           <chr>       <dbl>
##  1 Sydney          Visiting     747.
##  2 Melbourne       Visiting     619.
##  3 Sydney          Business     602.
##  4 North Coast NSW Holiday      588.
##  5 Sydney          Holiday      550.
##  6 Gold Coast      Holiday      528.
##  7 Melbourne       Holiday      507.
##  8 South Coast     Holiday      495.
##  9 Brisbane        Visiting     493.
## 10 Melbourne       Business     478.
## # ℹ 294 more rows
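To return just the single best combination instead of the full ranking, one option is slice_max() on the ungrouped summary; a sketch:

# Keep only the Region/Purpose pair with the highest average trips
tourism1 %>%
  group_by(Region, Purpose) %>%
  summarise(trip_avg = mean(Trips), .groups = "drop") %>%
  slice_max(trip_avg, n = 1)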

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourism_new <- tourism2 %>%
  group_by(State) %>%
  summarise(total_trips = sum(Trips)) 

tourism_new
## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
##    State Quarter total_trips
##    <chr>   <qtr>       <dbl>
##  1 ACT   1998 Q1        551.
##  2 ACT   1998 Q2        416.
##  3 ACT   1998 Q3        436.
##  4 ACT   1998 Q4        450.
##  5 ACT   1999 Q1        379.
##  6 ACT   1999 Q2        558.
##  7 ACT   1999 Q3        449.
##  8 ACT   1999 Q4        595.
##  9 ACT   2000 Q1        600.
## 10 ACT   2000 Q2        557.
## # ℹ 630 more rows

2.8 Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Can you spot any seasonality, cyclicity and trend?
What do you learn about the series?
What can you say about the seasonal patterns?
Can you identify any unusual years?

“Total Private” Employed from us_employment:
There does not seem to be obvious cyclicity or seasonality in this series; the gg_lag plot shows the monthly lags nearly overlapping as the series consistently increases. Looking closely at gg_season, however, there is a very slight increase over the summer months each year, which suggests there may be some mild seasonality. The overall trend is upward: a positive, roughly linear relationship between time and private-sector employment from January 1939 to June 2019. From the series, it looks like US employment would continue to increase at a predictable rate unless an event disrupts the economy, as I assume COVID-19 did in the years beyond this data. An unusual period is around 2009, where autoplot shows a very large decrease in employment, likely due to the recession in the US at that time.

Bricks from aus_production:
There does seem to be some seasonality in Australian brick production: Q1 (Jan-Mar) has much less production than Q2 (Apr-Jun) and especially Q3 (Jul-Sep), and production falls again toward Q4 (Oct-Dec). This pattern is very apparent in gg_season for almost every year. The overall trend is first upward and then downward: production increased from the late 1950s to the 1980s, apart from a large dip in the mid-1970s, and then declined, with small peaks along the way, into the 2000s. The series does not look entirely predictable, but the seasonality is fairly consistent. 1983 seems to be an unusual year, with a significant drop in brick production; from what I could find (I did not know much about Australia), the Australian economy was struggling at that time.

Hare from pelt:
The first thing I noticed about the pelt data is that, because it is annual, we cannot assess seasonality: there is no within-year resolution to show how the number of Hare pelts traded behaves across seasons. There does, however, seem to be a cyclic pattern: autoplot shows repeated intervals of peaks and dips across the years (1845-1935) of Snowshoe Hare pelt trading. The trend is this recurring rise and fall, and although the cyclicity is visible, it would be difficult to predict because the number of pelts traded peaks and dips at different levels in each cycle. An unusual period is the 1860s, when there was a dramatic increase in Snowshoe Hare pelt trading.

“H02” Cost from PBS:
For PBS, I further filtered on Concession and Type in the dataset to use gg_lag, plotting the co-payment series for both concession groups. There definitely seems to be seasonality, especially for "General - Safety Net", "Concessional - Safety Net", and "Concessional - Co-payments". For the two Safety Net series, the year starts with higher Medicare prescription costs that slowly decrease to very low levels in the spring, then pick up again through the rest of the year. "Concessional - Co-payments", on the other hand, starts low, increases in the spring, and then slowly decreases toward the end of the year. Overall, I would say there is an upward trend, with Medicare Australia prescription costs increasing across all categories. An unusual year is around 1997, when "Concessional - Co-payments" decreased somewhat.

Barrels from us_gasoline:
For us_gasoline, there does not appear to be clear seasonality or cyclicity: gg_season shows fluctuations throughout each season from 1991 to 2017 without a visible recurring cycle. In terms of the overall trend, US finished motor gasoline supplied trends upward until around 2003/2004, then slowly declines, and then picks back up again around 2013. The unusual years are around 2009, where we see a drop in product supplied; again, this could be related to the recession at that time.

#"Total Private” Employed from us_employment
us_employment1 <- us_employment %>%
   filter(Title == "Total Private")

autoplot(us_employment1, Employed)

gg_season(us_employment1, Employed)

gg_subseries(us_employment1, Employed)

gg_lag(us_employment1, Employed)

ACF(us_employment1, Employed)
## # A tsibble: 29 x 3 [1M]
## # Key:       Series_ID [1]
##    Series_ID          lag   acf
##    <chr>         <cf_lag> <dbl>
##  1 CEU0500000001       1M 0.997
##  2 CEU0500000001       2M 0.993
##  3 CEU0500000001       3M 0.990
##  4 CEU0500000001       4M 0.986
##  5 CEU0500000001       5M 0.983
##  6 CEU0500000001       6M 0.980
##  7 CEU0500000001       7M 0.977
##  8 CEU0500000001       8M 0.974
##  9 CEU0500000001       9M 0.971
## 10 CEU0500000001      10M 0.968
## # ℹ 19 more rows
# Bricks from aus_production
autoplot(aus_production, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_season(aus_production, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_subseries(aus_production, Bricks)
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_lag(aus_production, Bricks)
## Warning: Removed 20 rows containing missing values (gg_lag).

ACF(aus_production, Bricks)
## # A tsibble: 22 x 2 [1Q]
##         lag   acf
##    <cf_lag> <dbl>
##  1       1Q 0.900
##  2       2Q 0.815
##  3       3Q 0.813
##  4       4Q 0.828
##  5       5Q 0.720
##  6       6Q 0.642
##  7       7Q 0.655
##  8       8Q 0.692
##  9       9Q 0.609
## 10      10Q 0.556
## # ℹ 12 more rows
# Hare from pelt
autoplot(pelt, Hare)

# gg_season(pelt, Hare) is omitted: the data are annual, so there are no
# months or quarters for gg_season() to display seasonal patterns
gg_subseries(pelt, Hare)

gg_lag(pelt, Hare)

ACF(pelt, Hare)
## # A tsibble: 19 x 2 [1Y]
##         lag     acf
##    <cf_lag>   <dbl>
##  1       1Y  0.658 
##  2       2Y  0.214 
##  3       3Y -0.155 
##  4       4Y -0.401 
##  5       5Y -0.493 
##  6       6Y -0.401 
##  7       7Y -0.168 
##  8       8Y  0.113 
##  9       9Y  0.307 
## 10      10Y  0.340 
## 11      11Y  0.296 
## 12      12Y  0.206 
## 13      13Y  0.0372
## 14      14Y -0.153 
## 15      15Y -0.285 
## 16      16Y -0.295 
## 17      17Y -0.202 
## 18      18Y -0.0676
## 19      19Y  0.0956
# "H02" Cost from PBS
PBS1 <- PBS %>%
   filter(ATC2 == "H02")

autoplot(PBS1, Cost)

gg_season(PBS1, Cost)

gg_subseries(PBS1, Cost)

pbs_cc <- PBS1 %>%
  filter(Concession == "Concessional", Type == "Co-payments")
pbs_gc <- PBS1 %>%
  filter(Concession == "General", Type == "Co-payments")

gg_lag(pbs_cc, Cost)

gg_lag(pbs_gc, Cost)

ACF(PBS1, Cost)
## # A tsibble: 92 x 6 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [4]
##    Concession   Type        ATC1  ATC2       lag   acf
##    <chr>        <chr>       <chr> <chr> <cf_lag> <dbl>
##  1 Concessional Co-payments H     H02         1M 0.834
##  2 Concessional Co-payments H     H02         2M 0.679
##  3 Concessional Co-payments H     H02         3M 0.514
##  4 Concessional Co-payments H     H02         4M 0.352
##  5 Concessional Co-payments H     H02         5M 0.264
##  6 Concessional Co-payments H     H02         6M 0.219
##  7 Concessional Co-payments H     H02         7M 0.253
##  8 Concessional Co-payments H     H02         8M 0.337
##  9 Concessional Co-payments H     H02         9M 0.464
## 10 Concessional Co-payments H     H02        10M 0.574
## # ℹ 82 more rows
# Barrels from us_gasoline
autoplot(us_gasoline, Barrels)

gg_season(us_gasoline, Barrels)

gg_subseries(us_gasoline, Barrels)

gg_lag(us_gasoline, Barrels)

ACF(us_gasoline, Barrels)
## # A tsibble: 31 x 2 [1W]
##         lag   acf
##    <cf_lag> <dbl>
##  1       1W 0.893
##  2       2W 0.882
##  3       3W 0.873
##  4       4W 0.866
##  5       5W 0.847
##  6       6W 0.844
##  7       7W 0.832
##  8       8W 0.831
##  9       9W 0.822
## 10      10W 0.808
## # ℹ 21 more rows
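The ACF tables above print the autocorrelations as text; piping ACF() into autoplot() draws the corresponding correlogram instead, e.g. for us_gasoline:

# Plot the autocorrelation function as a correlogram
us_gasoline %>%
  ACF(Barrels) %>%
  autoplot()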
