DATA624: Homework 1
library(fpp3)
Task
Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Please submit both your RPubs link and the .Rmd file with your code.
Exercises
2.1
Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.
# Assign the lazy-loaded datasets so they appear in the environment
gafa_stock <- gafa_stock
PBS <- PBS
vic_elec <- vic_elec
pelt <- pelt
The help function displays the documentation for the function or dataset in question. gafa_stock contains historical stock prices (in US dollars) from 2014-2018 for four companies: Google, Amazon, Facebook, and Apple. PBS contains monthly data on Australia's Medicare prescriptions. vic_elec represents half-hourly electricity demand for Victoria, Australia. pelt stores pelt trading records from 1845 to 1935.
# help(gafa_stock)
# help(PBS)
# help(vic_elec)
# help(pelt)
a. Use autoplot() to plot some of the series in these data sets.
gafa_stock: It appears that Amazon and Google have had a consistent upward trend year over year, whereas Facebook and Apple have remained relatively flat in terms of opening price.
autoplot(gafa_stock, .vars = Open) +
labs(title = "Daily Opening Price from 2014 - 2018")
PBS: There is a positive trend in total cost over time. There is also strong seasonality in the data.
PBS %>%
  filter(ATC2 == "A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  autoplot(.vars = TotalC) +
  labs(title = "Monthly Medicare Prescription Costs in Australia",
       y = "Total Cost")
vic_elec: The demand for electricity varies throughout the year, indicating the presence of seasonality. There appear to be spikes in electricity demand during the winter months (beginning / end of each year) as well as a noticeable increase during the summer months (mid-year).
autoplot(vic_elec, .vars = Demand) +
labs(title = "Half-hourly Electricity Demand in Victoria, Australia")
pelt: These fluctuations appear to be random, making it difficult to forecast future hare pelt trades.
autoplot(pelt, .vars = Hare) +
labs(title = "Hare Pelts Traded from 1845 to 1935")
b. What is the time interval of each series?
- gafa_stock: 1 day (trading days only; weekends are excluded)
- PBS: 1 month
- vic_elec: 30 minutes
- pelt: 1 year
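These intervals can also be confirmed programmatically with tsibble's interval(), which reports the interval attached to each series (gafa_stock prints [!] because trading days form an irregular sequence); a quick check:

# interval() reports the interval tsibble inferred for each series
interval(gafa_stock)   # [!] - irregular, trading days only
interval(PBS)          # 1M - monthly
interval(vic_elec)     # 30m - half-hourly
interval(pelt)         # 1Y - annual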
2.2
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
gafa_stock %>%
  group_by(Symbol) %>%
  filter(Close == max(Close))
## # A tsibble: 4 x 8 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Open High Low Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2018-10-03 230. 233. 230. 232. 230. 28654800
## 2 AMZN 2018-09-04 2026. 2050. 2013 2040. 2040. 5721100
## 3 FB 2018-07-25 216. 219. 214. 218. 218. 58954200
## 4 GOOG 2018-07-26 1251 1270. 1249. 1268. 1268. 2405600
- AAPL: October 3rd, 2018
- AMZN: September 4th, 2018
- FB: July 25th, 2018
- GOOG: July 26th, 2018
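An equivalent approach, sketched below, uses dplyr's slice_max() to keep the top row per group (ties would return multiple rows unless with_ties = FALSE):

# slice_max() keeps the row with the largest Close within each Symbol
gafa_stock %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1)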
2.3
Download the file tute1.csv
from the book website, open
it in Excel (or some other spreadsheet application), and review its
contents. You should find four columns of information. Columns B through
D each contain a quarterly series, labelled Sales, AdBudget and GDP.
Sales contains the quarterly sales for a small company over the period
1981-2005. AdBudget is the advertising budget and GDP is the gross
domestic product. All series have been adjusted for inflation.
a. You can read the data into R with the following script:
tute1 <- readr::read_csv("tute1.csv")
## Rows: 100 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (3): Sales, AdBudget, GDP
## date (1): Quarter
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View(tute1)
b. Convert the data to time series
mytimeseries <- tute1 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter)
c. Construct time series plots of each of the three series
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y") +
  labs(title = "Uses facet_grid()")
If facet_grid() isn't used:
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  labs(title = "Does not use facet_grid()")
The two plots differ when facet_grid() is applied versus not applied. facet_grid() divides the plot into multiple panels (here, three) so each series is shown individually, and each panel gets its own y-axis scale. Without facet_grid(), all three series share one panel and one y-axis; when the series sit on very different scales, this can hide patterns or even flatten a volatile series into what looks like a straight line.
2.4
The USgas package contains data on the demand for natural gas in the US.
a. Install the USgas package.
library(USgas)
b. Create a tsibble from us_total with year as the index and state as the key.
us_total <- us_total %>%
  as_tsibble(index = year,
             key = state)
c. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
- Looking at all states together (without facet_grid()):

new_england <- c("Maine", "Vermont", "New Hampshire", "Massachusetts",
                 "Connecticut", "Rhode Island")

us_total %>%
  filter(state %in% new_england) %>%
  ggplot(aes(x = year, y = y, color = state)) +
  geom_line()
- Looking at each state individually (with facet_grid()):

us_total %>%
  filter(state %in% new_england) %>%
  ggplot(aes(x = year, y = y, color = state)) +
  geom_line() +
  facet_grid(state ~ ., scales = "free_y")
2.5
a. Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

tourism.data <- readxl::read_excel("tourism.xlsx")
tourism
## # A tsibble: 24,320 x 5 [1Q]
## # Key: Region, State, Purpose [304]
## Quarter Region State Purpose Trips
## <qtr> <chr> <chr> <chr> <dbl>
## 1 1998 Q1 Adelaide South Australia Business 135.
## 2 1998 Q2 Adelaide South Australia Business 110.
## 3 1998 Q3 Adelaide South Australia Business 166.
## 4 1998 Q4 Adelaide South Australia Business 127.
## 5 1999 Q1 Adelaide South Australia Business 137.
## 6 1999 Q2 Adelaide South Australia Business 200.
## 7 1999 Q3 Adelaide South Australia Business 169.
## 8 1999 Q4 Adelaide South Australia Business 134.
## 9 2000 Q1 Adelaide South Australia Business 154.
## 10 2000 Q2 Adelaide South Australia Business 169.
## # ... with 24,310 more rows
b. Create a tsibble which is identical to the tourism tsibble from the tsibble package.
Notes from using help():
- The time is stored as the year followed by the quarter
- Contains region, state, purpose, and trip information
- The index will be the Quarter column
- Keys will be Region, State, and Purpose
- Trips is the measure variable
# help(tourism)
tourism.data.tsibble <- tourism.data %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter,
             key = c(Region, State, Purpose))
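To confirm the rebuilt object really matches the package version, the two can be compared directly; a minimal check, assuming both objects are in the workspace:

# all.equal() returns TRUE when the rebuilt tsibble matches the
# package's tourism tsibble in contents and structure
all.equal(tourism.data.tsibble, tourism)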
c. Find what combination of Region and Purpose had the maximum number of overnight trips on average.
The Region and Purpose combination that maximized average overnight trips was Melbourne with the purpose of Visiting.
tourism.data.tsibble %>%
  group_by(Region, Purpose) %>%
  summarise(Average_Trips = mean(Trips)) %>%
  ungroup() %>%
  filter(Average_Trips == max(Average_Trips))
## # A tsibble: 1 x 4 [1Q]
## # Key: Region, Purpose [1]
## Region Purpose Quarter Average_Trips
## <chr> <chr> <qtr> <dbl>
## 1 Melbourne Visiting 2017 Q4 985.
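One caveat: because the pipeline above runs on a tsibble, the Quarter index is retained, so mean(Trips) is computed per quarter rather than across the whole series (note the Quarter column in the output). A variant that drops the index first, sketched below, averages each combination over all quarters and may single out a different combination:

# Drop the tsibble index first so mean(Trips) averages each
# Region/Purpose combination across every quarter
tourism.data.tsibble %>%
  as_tibble() %>%
  group_by(Region, Purpose) %>%
  summarise(Average_Trips = mean(Trips), .groups = "drop") %>%
  filter(Average_Trips == max(Average_Trips))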
d. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
To create this new tsibble, I used group_by() to collect rows with the same State together, then aggregated the trips with sum() inside summarise() to get the total trips by State.
tourism.data.tsibble.2 <- tourism.data.tsibble %>%
  group_by(State) %>%
  summarise(Total_Trips = sum(Trips))

tourism.data.tsibble.2
## # A tsibble: 640 x 3 [1Q]
## # Key: State [8]
## State Quarter Total_Trips
## <chr> <qtr> <dbl>
## 1 ACT 1998 Q1 551.
## 2 ACT 1998 Q2 416.
## 3 ACT 1998 Q3 436.
## 4 ACT 1998 Q4 450.
## 5 ACT 1999 Q1 379.
## 6 ACT 1999 Q2 558.
## 7 ACT 1999 Q3 449.
## 8 ACT 1999 Q4 595.
## 9 ACT 2000 Q1 600.
## 10 ACT 2000 Q2 557.
## # ... with 630 more rows
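As a quick sanity check (not required by the exercise), plotting the new tsibble shows one total-trips series per state:

# One line per State key confirms the aggregation behaved as intended
tourism.data.tsibble.2 %>%
  autoplot(Total_Trips) +
  labs(title = "Total Quarterly Trips by State")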
2.8
Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):
set.seed(15)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
Explore your chosen retail time series using the following functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()
Autoplot
autoplot(myseries, Turnover)
gg_season
gg_season(myseries, Turnover)
gg_subseries
gg_subseries(myseries, Turnover)
gg_lag
gg_lag(myseries, Turnover, geom = "path")
gg_lag(myseries, Turnover, geom = "point")
ACF %>% autoplot
myseries %>% ACF(Turnover) %>% autoplot()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
The autoplot shows a strong upward trend in this dataset. The data also appears to be seasonal, as similar changes recur every year, especially at year end; this jump could be due to the increased retail traffic the holiday season brings in. The gg_season() and gg_subseries() plots also show the seasonality in the data, as the general shapes of the lines are similar year after year. The gg_lag() and ACF() %>% autoplot() output provide strong evidence that the data is both seasonal and trended: the gg_lag plots show a positive linear relationship, and the ACF plot has extremely high values through all 26 lags, fluctuating slightly from lag to lag. Overall, this dataset shows strong seasonal behavior with little to no cyclical behavior.
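As a compact cross-check of these observations, feasts' gg_tsdisplay() combines the time plot, a season plot, and the ACF in a single figure; a minimal sketch:

# gg_tsdisplay() shows the time plot, season plot, and ACF together,
# summarising the trend and seasonality findings in one figure
myseries %>%
  gg_tsdisplay(Turnover, plot_type = "season")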