library(fpp3)
library(zoo) #provides as.yearmon()/as.yearqtr(), used to convert Quarter in Q3
library(USgas)
The purpose of this assignment was to explore the Time Series Graphics exercises from Forecasting: Principles and Practice.
Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent. Use autoplot() to plot some of the series in these data sets. What is the time interval of each series?
First we apply the help function to each series to explore their description, format, and details.
help(PBS)        #monthly Medicare Australia prescription data
help(gafa_stock) #stock prices 2014-2018 for Google, Amazon, Facebook, Apple
help(vic_elec)   #half-hourly electricity demand for Victoria, Australia
help(pelt)       #pelt trading records of the Hudson Bay Company, 1845-1935
We then use autoplot() to plot the series in each data set and interpret the time intervals:
#autoplot(PBS, Concession) # did not work --> key variable, not a measured variable
#autoplot(PBS, Type)       # did not work --> key variable, not a measured variable
#autoplot(PBS, ATC1)       # did not work --> key variable, not a measured variable
#autoplot(PBS, ATC2)       # did not work --> key variable, not a measured variable
autoplot(PBS, Scripts)
#autoplot(PBS, Cost)
#PBS$Month
When we consider the PBS data set, autoplot() alone does not give a plot from which the time interval can easily be read, but exploring the Month index of the dataset shows that the time interval is one month.
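As a quick check on that interval (a sketch, assuming the fpp3 data sets are loaded), printing the tsibble or querying tsibble's interval() helper makes the monthly spacing explicit:
#confirm the PBS time interval directly
PBS                     #the tsibble header should read [1M], i.e. monthly
tsibble::interval(PBS)  #returns the interval object, expected to be one month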
autoplot(gafa_stock, Open) # 2014 - 2018, daily (trading days)
#autoplot(gafa_stock, High)
#autoplot(gafa_stock, Low)
#autoplot(gafa_stock, Close)
#autoplot(gafa_stock, Adj_Close)
#autoplot(gafa_stock, Volume)
Applying autoplot() to the gafa_stock dataset, we see what appears to be daily stock price data plotted from 2014 to 2018. There are clear upward trends in the Open price for each of these stocks, with Amazon overtaking Google in share price from late 2017 onward.
autoplot(vic_elec, Demand) # 2012 - 2015, half-hourly
#autoplot(vic_elec, Temperature) # 2012 - 2015, half-hourly
#autoplot(vic_elec, Holiday) # 2012 - 2015, half-hourly : NA for Boolean
Applying autoplot() to the vic_elec dataset, we see Demand data from 2012 to 2015 in 30-minute increments, with clear seasonality / sinusoidal wave patterns in electricity demand.
autoplot(pelt, Hare)
#autoplot(pelt, Lynx)
Applying autoplot() to the pelt dataset, we see Hare data plotted from 1845 through 1935 in one-year increments, with clear cyclic wave patterns whose period would need finer-grained data or a closer look to pin down precisely. The largest peak appears to occur during the American Civil War years (the 1860s).
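Rather than eyeballing the peak year from the plot, a small filter (a sketch) pins down the exact year of the largest Hare count:
#find the year of the maximum Hare pelt count instead of reading it off the plot
pelt %>%
  filter(Hare == max(Hare))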
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
gafa_stock %>%
  group_by(Symbol) %>%
  filter(Close == max(Close))
## # A tsibble: 4 x 8 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Open High Low Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2018-10-03 230. 233. 230. 232. 230. 28654800
## 2 AMZN 2018-09-04 2026. 2050. 2013 2040. 2040. 5721100
## 3 FB 2018-07-25 216. 219. 214. 218. 218. 58954200
## 4 GOOG 2018-07-26 1251 1270. 1249. 1268. 1268. 2405600
From above we see that each of the four stocks' peak closing prices occurred in 2018.
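An equivalent way to pull the same four rows, sketched with dplyr's slice_max() after dropping the tsibble class so no index bookkeeping is needed:
#alternative: take the single highest Close per stock with slice_max()
gafa_stock %>%
  as_tibble() %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1)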
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
Read the data into R, convert the data to time series, construct time series plots of each of the three series, and then check what happens when you don't include facet_grid() using the provided script.
To start, we read in the provided csv and convert the data to a time series. More specifically, we convert the Quarter field to a year-month index and then convert the data to a tsibble. Because the provided conversion script did not work as written, I first converted the Quarter field using zoo's as.yearmon() and then proceeded with the provided script:
#Read the data in
tute1 <- readr::read_csv("tute1.csv")
#view(tute1)

#Convert the data to time series
##the text's script DID NOT WORK so I adapted the Quarter conversion
tute1$Quarter <- as.yearmon(as.Date(tute1$Quarter, "%m/%d/%Y"))
#tute1$Quarter <- as.yearqtr(as.Date(tute1$Quarter, "%m/%d/%Y"))
#tute1

mytimeseries <- tute1 %>%
  mutate(Quarter = yearmonth(Quarter)) %>%
  as_tsibble(index = Quarter)
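For reference, here is a zoo-free sketch of the same conversion that goes straight to a quarterly index; it assumes (as the adapted script above does) that the Quarter column arrives as month/day/year text, and the mytimeseries_alt name is just illustrative:
#alternative conversion without zoo: parse the date text, then build a year-quarter index
mytimeseries_alt <- readr::read_csv("tute1.csv") %>%
  mutate(Quarter = yearquarter(as.Date(Quarter, "%m/%d/%Y"))) %>%
  as_tsibble(index = Quarter)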
After verifying that the data was in our desired form, we construct our time series plots of the three series, first on separate plots and then on the same plot:
#Construct time series plots of each of the three series
mytimeseries %>%
  pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
#run the script again without facet_grid
mytimeseries %>%
  pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()
What we observe when we incorporate facet_grid() is that each of our numeric fields is plotted in its own panel, with a y-axis scale suited to its values, and thus we're able to gain more insight than when we plot all of the fields on the same plot with a single shared scale.
The USgas package contains data on the demand for natural gas in the US. Install the USgas package, create a tsibble from us_total with year as the index and state as the key, and then plot the annual natural gas consumption by state for the New England area.
Upon installing the USgas package, we proceed to create a tsibble with year as the index and state as the key:
#install `USgas` package and view input data
#view(us_total)
#create with year as index and state as key
mytimeseries2 <- us_total %>%
  mutate(year = year(as.Date(as.character(us_total$year), format = "%Y"))) %>%
  as_tsibble(index = year, key = state)
#mytimeseries2 #verify conversion
#filter for New England area states
ne_gas_consumption <- mytimeseries2 %>%
  filter(state == 'Connecticut' | state == 'Maine' | state == 'Massachusetts' | state == 'New Hampshire' | state == 'Rhode Island' | state == 'Vermont')
#ne_gas_consumption #verify
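An equivalent, slightly tidier version of the same filter (a sketch) collects the six state names in a vector and uses %in%:
#same New England filter written with %in%
ne_states <- c("Connecticut", "Maine", "Massachusetts",
               "New Hampshire", "Rhode Island", "Vermont")
ne_gas_consumption <- mytimeseries2 %>%
  filter(state %in% ne_states)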
We then proceed to plot the annual natural gas consumption by state for the New England area, using our filtered tsibble and the ggplot() function:
#plot annual natural gas consumption by state for the New England area
NE_plot <- ggplot(ne_gas_consumption, aes(x = year, y = y, colour = state)) +
  geom_line()
print(NE_plot + labs(title = "Annual Gas Consumption in New England", y = "Gas Consumption", x = "Year"))
There are a number of noteworthy observations from the output, chiefly the differences in consumption levels and trends across the New England states.
Download the file tourism.xlsx from the book website and read it into R using readxl::read_excel(), then create a tsibble which is identical to the tourism tsibble from the tsibble package.
We read in the Excel file using read_excel() and then verify the entries within the tourism tsibble to note the differences between it and our tourism_xlsx table:
#read in the Excel file using read_excel()
tourism_xlsx <- readxl::read_excel("tourism.xlsx")
#create a tsibble identical to the tourism tsibble
#tourism
#tourism_xlsx
They appear to be identical aside from the format and type of the Quarter column.
I converted the data to time series by first converting the Quarter column to the proper form for the yearquarter() function, and then creating a tsibble object with three keys and one index:
#Convert the data to time series
tourism_xlsx$Quarter <- yearquarter(as.Date(tourism_xlsx$Quarter))
tourism_tsibble <- tourism_xlsx %>%
  as_tsibble(key = c(Region, State, Purpose), index = Quarter)
#tourism_tsibble
The result is identical to the provided tourism tsibble.
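As a quick sanity check on that claim (a sketch; it assumes the column order in the spreadsheet matches the packaged data), the two objects can be compared after dropping them to plain tibbles:
#compare our rebuilt tsibble against the tourism tsibble shipped with the tsibble package
all.equal(as_tibble(tourism), as_tibble(tourism_tsibble))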
From this point, we proceed to finding what combination of Region and Purpose had the maximum number of overnight trips on average.
tourism_xlsx %>%
  select(Region, Purpose, Trips) %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = mean(Trips)) %>%
  filter(Trips == max(Trips)) -> max_avg_trips
max_avg_trips[order(-max_avg_trips$Trips),]
## # A tibble: 76 x 3
## # Groups: Region [76]
## Region Purpose Trips
## <chr> <chr> <dbl>
## 1 Sydney Visiting 747.
## 2 Melbourne Visiting 619.
## 3 North Coast NSW Holiday 588.
## 4 Gold Coast Holiday 528.
## 5 South Coast Holiday 495.
## 6 Brisbane Visiting 493.
## 7 Sunshine Coast Holiday 436.
## 8 Hunter Holiday 319.
## 9 Australia's South West Holiday 309.
## 10 Experience Perth Visiting 291.
## # ... with 66 more rows
We see that Sydney-Visiting had the maximum number of overnight trips on average.
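An equivalent way to surface just that single top combination, sketched with an ungrouped summarise and arrange (the .groups argument needs dplyr 1.0 or later):
#rank every Region/Purpose combination by average trips and keep the top row
tourism_xlsx %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = mean(Trips), .groups = "drop") %>%
  arrange(desc(Trips)) %>%
  head(1)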
Next we create a new tsibble which combines Purpose and Region and just has total trips by State:
#Combine Purpose and Region
##group by Region, Purpose and then concatenate these columns
grouped_tourism <- tourism_xlsx %>% group_by(Region, Purpose)
grouped_tourism$Region.Purpose <- paste(grouped_tourism$Region, grouped_tourism$Purpose)

simple_tourism <- grouped_tourism[-c(2,4)] #drop Region, Purpose
simple_tourism <- simple_tourism[,c(1,2,4,3)] #reorder columns
#simple_tourism #verify
#calculate total trips by State
simple_tourism %>%
  group_by(Quarter, State) %>%
  mutate(Trips = sum(Trips)) -> trips_by_state

#convert Quarter to proper time series form
trips_by_state$Quarter <- yearquarter(as.Date(tourism_xlsx$Quarter))
#create new tsibble
trips_by_state_tsibble <- trips_by_state %>%
  as_tsibble(key = c(State, Region.Purpose), index = Quarter)
trips_by_state_tsibble
## # A tsibble: 24,320 x 4 [1Q]
## # Key: State, Region.Purpose [304]
## # Groups: State @ Quarter [640]
## Quarter State Region.Purpose Trips
## <qtr> <chr> <chr> <dbl>
## 1 1998 Q1 ACT Canberra Business 551.
## 2 1998 Q2 ACT Canberra Business 416.
## 3 1998 Q3 ACT Canberra Business 436.
## 4 1998 Q4 ACT Canberra Business 450.
## 5 1999 Q1 ACT Canberra Business 379.
## 6 1999 Q2 ACT Canberra Business 558.
## 7 1999 Q3 ACT Canberra Business 449.
## 8 1999 Q4 ACT Canberra Business 595.
## 9 2000 Q1 ACT Canberra Business 600.
## 10 2000 Q2 ACT Canberra Business 557.
## # ... with 24,310 more rows
I was a little unsure regarding the exact desired output but, for the sake of maintaining a tsibble (time series table) output, I:
- grouped by Region and Purpose and then concatenated these fields into a combined Region.Purpose column,
- dropped Region and Purpose from further consideration,
- calculated total trips by State, and then
- created a new tsibble keyed by State and Region.Purpose.
A more compact alternative is sketched below.
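That more compact sketch relies on the fact that summarising a keyed tsibble aggregates within each index value, so totals by State fall out in a few lines (trips_by_state_alt is just an illustrative name):
#total trips by State per Quarter, letting the tsibble handle the Quarter index
trips_by_state_alt <- tourism_tsibble %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))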
Monthly Australian retail data is provided in aus_retail.
Select one of the time series and set your own seed value.
Explore your chosen retail time series using the following functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
set.seed(333)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
myseries #Series ID = A3349773T, 441 rows
## # A tsibble: 441 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Apr 2.3
## 2 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 May 2.5
## 3 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Jun 2.3
## 4 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Jul 2.6
## 5 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Aug 2.6
## 6 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Sep 2.6
## 7 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Oct 2.7
## 8 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Nov 3
## 9 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1982 Dec 4.9
## 10 Australian Capital Ter~ Newspaper and book ret~ A3349773T 1983 Jan 2.5
## # ... with 431 more rows
autoplot(myseries)
For the A3349773T series, autoplot() produces a monthly plot of Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, drawn as one continuous line. It is a continuous time series plot, so we can see how Turnover evolves across the years under consideration.
What we gather from the plot is that Turnover climbed consistently through the nineties, dipped shortly before the year 2000, reached a peak in or near 2000, stayed relatively high through the first decade of the 2000s (reaching a second peak near 2008), and then dropped again shortly after 2010.
The peaks may be tied to the dot-com bubble (ca. 2000) and the mortgage crisis (ca. 2008) in the US; because markets are so internationally connected, and the US is one of the leading consumer economies, the ripples were felt in Australia.
gg_season(myseries)
For the A3349773T series, gg_season() produces a seasonal plot of Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, with each year drawn in a different hue across the months of the year. Because it is a seasonal plot, we can more clearly see the ties between Turnover and month of year.
What we gather from the plot is that there appears to be a tie between higher Turnover and the months of February, March, July and August, and that December is the month with the highest Turnover.
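If the December spike is hard to pick out on the rectangular seasonal plot, the same plot can be wrapped around a circle (a sketch using gg_season()'s polar argument):
#polar version of the seasonal plot to emphasise the within-year shape
gg_season(myseries, Turnover, polar = TRUE)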
gg_subseries(myseries)
For the A3349773T series, gg_subseries() produces a monthly plot of Turnover from 1982 through 2018 for Newspaper and book retailing in the Australian Capital Territory, drawn as a continuous line from year to year within each month. It is a seasonal subseries plot that facets the time series by each season (here, each month) of the seasonal period.
From the plot, we can compare the year-to-year changes within each month as well as the average Turnover across months for all years under consideration.
gg_lag(myseries)
For the A3349773T series, gg_lag() plots the time series against lagged versions of itself, with points typically coloured by the seasonal period to highlight how seasons are correlated.
It is difficult to draw much of value from this plot, but we do see that certain hues / months sit higher on the Turnover scale, so we might infer that those months (e.g. December) have a higher relative Turnover than the others.
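The lag plot may be easier to read with point geometry and a full year of lags, both of which gg_lag() accepts as arguments; this is just a sketch of that variant:
#lag scatterplots out to lag 12 (one year for monthly data), drawn as points
myseries %>%
  gg_lag(Turnover, lags = 1:12, geom = "point")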
myseries %>%
  ACF(Turnover) %>%
autoplot()
For the A3349773T series, ACF() %>% autoplot() plots the autocorrelation of the series with lagged versions of itself, showing how the correlations change with the lag.
From the plot, we gather that the autocorrelations at lags 1, 2, 3, 5, and 12 are the highest, which indicates seasonality in when turnover is highest: the series is correlated with itself in the same month every year, over the couple of months that follow, and again a few months later.
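Extending the ACF beyond the default number of lags (a sketch using ACF()'s lag_max argument) helps confirm the yearly pattern, since the seasonal spikes should recur near lags 12, 24 and 36:
#autocorrelations out to four years of monthly lags
myseries %>%
  ACF(Turnover, lag_max = 48) %>%
  autoplot()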
As a quick summary of three main points from the above plots: there is clear seasonality (December is consistently the strongest month, with higher Turnover also around February, March, July and August); there is arguably some cyclicity tied to broader economic cycles (peaks near 2000 and 2008); and there is a clear long-run trend (growth through the nineties followed by decline after 2010). In short, there is clear seasonality and trend in the Australian Turnover data.