The below are answers to exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from section 2.10 of the [Forecasting: Principles and Practice (3rd ed)] (https://otexts.com/fpp3/graphics-exercises.html). This is being completed for homework one of the DATA 624 class “Predictive Analytics.”
The exercise states: “Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, and Demand from vic_elec.
I have installed a few libraries needed for these exercises, which I’m going to load at the top for efficiency:
tsible and tsibbledata: the packages where the data lives ggplot2: common tidyverse plotting library dplyr: common data manipulation package fabletools: plotting for tsibble objects
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.3.2
## ✔ lubridate 1.9.3 ✔ fable 0.3.4
## ✔ ggplot2 3.5.1 ✔ fabletools 0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tsibble)
library(tsibbledata)
library(ggplot2)
library(dplyr)
library(fabletools)
library(readr)
data("aus_production")
data("pelt")
data("gafa_stock")
data("vic_elec")
data("us_employment")
data("PBS")
data("us_gasoline")
For each time series, the exercise asks me to answer the following questions:
What is the time interval of each series? Use autoplot() to produce a time plot of each series.
For the last time series, it asks me to modify the axis labels and title.
Time interval: every three months (a calendar year quarter) covering 1956 Q1 to 2010 Q2
Using autoplot() to produce a timeplot:
aus_production |>
filter(!is.na(Bricks)) |>
autoplot(Bricks)
Time interval: 12 months (one calendar year) from 1845 to 1935
Using autoplot() to produce a timeplot:
pelt |>
filter(!is.na(Lynx)) |>
autoplot(Lynx)
Time interval: one day interval from 01/02/2014 to 12/31/2018
Using autoplot() to produce a timeplot:
gafa_stock |>
filter(!is.na(Close)) |>
autoplot(Close)
Time interval: every 30 minutes from 1/1/2012 at 12:00am AEDT to 23:30 AEDT on 12/31/2014:
For this autoplot, I will modify the title and axis labels as part of the autoplot function:
```{r-electricity-demand} vic_elec |> filter(!is.na(Demand)) |> autoplot(Demand) + labs(title = “Electricity Demand for Victoria, Australia - 30 Minute Intervals”, x = “Date & Time”, y = “Operational Demand”)
### 2.2 - Using Filter on gafa_stock
Exercise 2.2 asks me to "use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock."
```{r-gafa-filter}
gafa_stock %>%
group_by(Symbol) %>%
filter(Close == max(Close, na.rm = TRUE))
This code will tell you the following dates have the highest close prices for each stock:
AAPL 2018-10-03 AMZN 2018-09-04 FB 2018-07-25 GOOG 2018-07-26
Exercise 2.3 has three components, each of which I’ll discuss as I do them. All of them require working with the tute1 dataset that’s provided on the bookset.
For A, I have to use the read_csv function from the tidyverse package readr to read and view the tute1.csv file that’s provided by the book website.
I loaded the file into a Google Cloud Platform bucket that’s designed for my CUNY work. I set it up so I can permission files at the file level, allowing for public access as needed while protecting the overall security of the bucket. When working programmatically in python, I use authewnticated user access to bring in the file.
gcp_tute_url <- "https://storage.googleapis.com/cuny_files/data_624_forecasting/weektwo_homework_one/tute1.csv"
gcp_tute1 <- read_csv(url(gcp_tute_url))
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Sales, AdBudget, GDP
## date (1): Quarter
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gcp_tute1
## # A tibble: 100 × 4
## Quarter Sales AdBudget GDP
## <date> <dbl> <dbl> <dbl>
## 1 1981-03-01 1020. 659. 252.
## 2 1981-06-01 889. 589 291.
## 3 1981-09-01 795 512. 291.
## 4 1981-12-01 1004. 614. 292.
## 5 1982-03-01 1058. 647. 279.
## 6 1982-06-01 944. 602 254
## 7 1982-09-01 778. 531. 296.
## 8 1982-12-01 932. 608. 272.
## 9 1983-03-01 996. 638. 260.
## 10 1983-06-01 908. 582. 280.
## # ℹ 90 more rows
This question just asks me to convert to a time series with provided code:
```{r-time-series-conversion} tute_timeseries <- gcp_tute1 |> mutate(Quarter = yearquarter(Quarter)) |> as_tsibble(index = Quarter)
#### 2.3.c - Plot Time Series
The final question asks me to plot of each of the three series in the time-converted series I made. Additionally, it suggests I check what happens when I don’t include facet_grid():
```{r-tute_timeseries-plot}
tute_timeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
Removing facet_grid combines the three lines into one chart where the lines are stacked on top of each other. I prefer it for these three since the y-axis scales for each are similar enough and it removes the strange right hand side y-axis label for the metric, which is redundant with the legend right there.
If the y-axis scales were a lot different then you would end up with a scale warped to the largest values.
```{r-tute_timeseries-plot} tute_timeseries |> pivot_longer(-Quarter) |> ggplot(aes(x = Quarter, y = value, colour = name)) + geom_line()
### 2.4 - Working With USgas Package
Exercise 2.4 also three components that involves installing and working with the USgas Package:
#### 2.4.a - Install and Load
The first component only asks me to install the package, which will naturally involve loading it as well:
```{r-install-us-gas}
install.packages("USgas")
library(USgas)
The second component is to create a tsibble fromn the us_total dataset within USGas
```{r-create-tsibble} us_total_tsibble <- us_total %>% as_tsibble(index = year, key = state)
#### 2.4.c - Plot New England Gas State Consumption
The final component is to plot the annual natural gas consumption for the New England region: Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island.
As a proud boy from Portland, Maine, I can say we have the best maple syrup in the country. This is not relevant to this exercise but is relevant to societal wellbeing as a whole.
To create the plot, I will park the New England state names in a variable, filter us_total_tsibble for it, and then create the plot. I'm using the scales package to create a y-axis that isn't in scientific notatiob
```{r-new-england-glas-plot}
install.packages("scales")
library(scales)
ne_states<- c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")
ne_filter <- us_total_tsibble %>%
filter(state %in% ne_states)
ggplot(ne_filter, aes(x = year, y = y, color = state)) +
geom_line() +
labs(title = "Annual Natural Gas Consumption in New England States",
x = "Year",
y = "Natural Gas Consumption",
color = "State") +
scale_y_continuous(labels = label_number(big.mark = ",")) +
theme_minimal()
Exercise 2.5 has four components, which involves loading and using the tourism.xlsx file provided by the book website.
For A, I have to use the read_excel() function from the readxl package to read and view the tourism-xlsx file that’s provided by the book website.
Readxl does not support reading from a url so I parked this one on my local computer and will read it from there
```{r-tourism-load} install.packages(“readxl”) library(readxl) tourism_excel <- read_excel(“/Users/kevinkirby/Desktop/github_sync/data_624/homework_one/tourism.xlsx”) tourism_excel
#### 2.5.b - Create Tsibble
For B, I need to make a tsibble that's exactly the same as the tourism tsibble from the tsibble package.
Upon inspect that package, I see the only difference is that tourism_excel has Quarter formatted as "1998-01-01" while tourism from tsibble has it as 1998 Q1. Since I need the excel file to exactly match the tssible, I will use the same Quarter time conversion that I used in 2.3.c
```{r-tourism-tissble}
tourism_excel_tsibble <- tourism_excel |>
mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(index = Quarter, key = c(Region, Purpose))
tourism_excel_tsibble
To check for differences, I can use semi_join() to perform a filtered join where there are only matches between the two, removing anything that doesn’t have a match.
```{r-excel-tsibble-match} excel_tsibble_match <- semi_join(tourism_excel_tsibble, tourism, by = c(“Region”, “Purpose”, “Quarter”, “Trips”))
#### 2.5.c - Average Maximum Overnight Trips
The third component is to "find what combination of Region and Purpose had the maximum number of overnight trips on average." To do so, I will make a new tsibble that calculates average trips by region and purpose. I can then calculate max average on that.
The answer will be:
Region Purpose Quarter avg_trips
Melbourne Visiting 2017 Q4 985.
```{r-trip-averages}
rp_trips <- excel_tsibble_match %>%
group_by(Region, Purpose) %>%
summarise(avg_trips = mean(Trips, na.rm = TRUE)) %>%
ungroup()
trip_avgs
rp_trips %>%
filter(avg_trips == max(avg_trips))
The final homework exercise is to use the following graphics functions: * autoplot() * gg_season() * gg_subseries() * gg_lag() * ACF()
With these functions, I will explore the following time series:
I will answer the following questions from the exercise under each series of graphs by dataset:
library(tidyr)
data("us_employment")
total_private <- us_employment %>%
filter(Title == "Total Private")
total_private %>%
drop_na(Employed) %>%
autoplot(Employed) +
labs(title = "Monthly Employment - 1939 to 2019",
x = "Month & Year",
y = "Employed")
total_private %>%
drop_na(Employed) %>%
gg_season(Employed) +
labs(title = "Seasonal Monthly Employment - 1939 to 2019",
x = "Month",
y = "Employed")
total_private %>%
drop_na(Employed) %>%
gg_subseries(Employed) +
labs(title = "Private Employment by Month - 1939 to 2019",
x = "Year",
y = "Employed")
total_private %>%
drop_na(Employed) %>%
gg_lag(Employed) +
labs(title = "Lag Charts for Private Employment by Month")
total_private %>%
drop_na(Employed) %>%
ACF(Employed) %>% autoplot()
These charts plot Brick production in Australia and covers 1956 Q1 to 2010 Q2
aus_production %>%
drop_na(Bricks) %>%
autoplot(Bricks) +
labs(title = "Quarterly Brick Production",
x = "Year & Quarter",
y = "Bricks Produced")
aus_production %>%
drop_na(Bricks) %>%
gg_season(Bricks) +
labs(title = "Seasonal Brick Production",
x = "Quarter",
y = "Bricks Produced")
aus_production %>%
drop_na(Bricks) %>%
gg_subseries(Bricks) +
labs(title = "Brick Production by Quarter and Year",
x = "Year",
y = "Bricks Produced")
aus_production %>%
drop_na(Bricks) %>%
gg_lag(Bricks) +
labs(title = "Lag Charts for Private Employment by Month")
aus_production %>%
drop_na(Bricks) %>%
ACF(Bricks) %>% autoplot()
These charts plot the number of Snowshoe Hare pelts traded by the Hudson Bay Company from 1845 to 1935.
pelt %>%
drop_na(Hare) %>%
autoplot(Hare) +
labs(title = "Yearly Hare Pelts Traded",
x = "Year",
y = "Pelts Traded")
pelt %>%
drop_na(Hare) %>%
gg_subseries(Hare) +
labs(title = "Hare Pelts Traded",
x = "Year",
y = "Pelts Traded")
pelt %>%
drop_na(Hare) %>%
gg_lag(Hare) +
labs(title = "Lag Charts for Hare Pelts Traded By Year")
pelt %>%
drop_na(Hare) %>%
ACF(Hare) %>% autoplot()
These charts plot Cost data from the PBS dataset that’s been filtered for ATC2 equals “H02”.
hohtwo <- PBS %>%
filter(ATC2 == "H02")
hohtwo %>%
drop_na(Cost) %>%
autoplot(Cost) +
labs(title = "Monthly HO2 Cost",
x = "Year & Month",
y = "Cost") +
scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
theme_minimal()
hohtwo %>%
drop_na(Cost) %>%
gg_season(Cost) +
labs(title = "Seasonal HO2 Cost",
x = "Month",
y = "Cost") +
scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
theme_minimal()
hohtwo %>%
drop_na(Cost) %>%
gg_subseries(Cost) +
labs(title = "HO2 Cost by Month and Year",
x = "Year",
y = "Cost") +
scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
hohtwo %>%
drop_na(Cost) %>%
ACF(Cost) %>% autoplot()
Can you spot any seasonality, cyclicity and trend? Safety net for both concessional and general have sharp drops at the start of the year that aren’t extensions of drops starting at the end of the year, suggesting some sort of calendar year driver in the drop at the start The autocorrelations show a cyclical pattern the movement between 0 and .5. Interestingly, the concessional co-payment values always stay positive while the safety net ones go negative.
What do you learn about the series? *HO2 costs have been steadily increasing over time, with sharp peaks and valleys on a predictable cycle
What can you say about the seasonal patterns?
Can you identify any unusual years?
Barrels from us_gasoline
These are charts showing millions of barrels of gasoline sold per week in the United STates between Week 6 1991 and Week 3 2017. All Barrels sold are in units of millions of barrels.
us_gasoline %>%
drop_na(Barrels) %>%
autoplot(Barrels) +
labs(title = "Weekly Barrels of Gasoline Sold in the US",
x = "Week & Year",
y = "Barrels")
us_gasoline %>%
drop_na(Barrels) %>%
gg_season(Barrels) +
labs(title = "Season Barrels of Gasoline Sold in the US",
x = "Month",
y = "Barrels")
us_gasoline %>%
drop_na(Barrels) %>%
gg_subseries(Barrels) +
labs(title = "Weekly Subseries for Barrels Sold in the US",
x = "Week & Year",
y = "Barrels")
us_gasoline %>%
drop_na(Barrels) %>%
gg_lag(Barrels) +
labs(title = "Lag Charts for Barrels Sold by Week in the US")
us_gasoline %>%
drop_na(Barrels) %>%
ACF(Barrels) %>% autoplot()