Data 624 HW 2 (Week 3) Time Series

Alexander Ng

Due 09/19/2021

Overview

Exercises from Hyndman and Athanosopoulos, Forecasting: Principles and Practice, 3rd Edition, Chapter 3 Time series decomposition.

library(fpp3)
library(kableExtra)

Exercise 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Solution

Two challenges to plotting GDP per capita for global_economy is the quantity of data and the wide range of values. It is impractical to evaluate the 256 GDP time series graphs using facet_wrap by country. Moreover, GDP per capita can range from $120 in 2001 (Ethiopia) to $82,553 (Monaco). We use the log transformation of the GDP per capita (omitting the legend) to construct a “hairy chart”.

The plot below shows GDP per capita has risen for all countries from 1960-2017. The relative rank of countries by GDP per capita has relatively stable. While all countries have gotten wealthier, lower ranking (poorer) countries in 1960 remained lower in 2017 (with notable exceptions like China).

global_economy %>% mutate(gdpc =  GDP/Population ) %>% 
  filter(!is.na(gdpc) ) %>% select( Year, Country, gdpc) %>%
  ggplot(aes(x=Year,y=log(gdpc), color=Country)) + 
  geom_line() + theme(legend.position="none") +
  ylab("log($GDP/capita)") + ggtitle("Log GDP per capital- All Countries 1960-2017")

Monaco had the highest GDP per capita in 2014 at $185,153 for all countries and all years in the global_economy dataset.

The table below shows the top country by GDP per capita in each year. A handful of countries have held the title of highest GDP per capita of the year. The winners include United States, Kuwait, Monaco, Liechtenstein, Luxembourg and United Arab Emirates. Monaco has been the winner the most times.

# This code constructs a gdp per capita variable, removes NA observations,
# and finds the country of highest gdp per capita within each year.
global_economy %>% mutate(gdpc =  GDP/Population ) %>% 
  filter(!is.na(gdpc) ) %>% index_by(Year) %>% 
  filter( gdpc == max(gdpc) ) %>% select(Year, gdpc) %>% 
  arrange(Year) %>% 
  kable(col.names=c("Year", "GDP/capita", "Country"), digits = 0, align='ccl') %>% 
  kable_styling(bootstrap_options = c("hover", "striped"))
Year GDP/capita Country
1960 3007 United States
1961 3067 United States
1962 3244 United States
1963 3375 United States
1964 3574 United States
1965 4429 Kuwait
1966 4556 Kuwait
1967 4336 United States
1968 4696 United States
1969 5032 United States
1970 12480 Monaco
1971 13813 Monaco
1972 16734 Monaco
1973 21423 Monaco
1974 22707 Monaco
1975 28254 Monaco
1976 29698 United Arab Emirates
1977 33246 United Arab Emirates
1978 38354 Monaco
1979 45838 Monaco
1980 51529 Monaco
1981 44366 Monaco
1982 41385 Monaco
1983 38888 Monaco
1984 36382 Monaco
1985 37553 Monaco
1986 52175 Monaco
1987 63043 Monaco
1988 68434 Monaco
1989 68577 Monaco
1990 84287 Monaco
1991 83733 Monaco
1992 91654 Monaco
1993 85422 Monaco
1994 89404 Monaco
1995 101993 Monaco
1996 101329 Monaco
1997 90883 Monaco
1998 93093 Monaco
1999 91384 Monaco
2000 82535 Monaco
2001 82553 Monaco
2002 89061 Monaco
2003 108978 Monaco
2004 123382 Monaco
2005 124374 Monaco
2006 133195 Monaco
2007 167125 Monaco
2008 180640 Monaco
2009 149221 Monaco
2010 144569 Monaco
2011 162155 Monaco
2012 152000 Monaco
2013 173528 Liechtenstein
2014 185153 Monaco
2015 167591 Liechtenstein
2016 168011 Monaco
2017 104103 Luxembourg

Exercise 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

Solution

For each time series below, we plot the raw time series and then assess if a transformation is desirable.

United States GDP

The time series needs to be adjusted for both population and inflation to get a real GDP per capita.

First we illustrate the raw time series. The nominal aggregate US GDP increases approximately geometrically except during economic recessions.

global_economy %>% filter(Country == "United States") %>% 
  autoplot(GDP) + ggtitle("Raw United States GDP") + xlab("Year") + ylab("$GDP")

The US GDP needs to be transformed because the raw data does not tell us if individuals enjoyed increasing economic prosperity across time. So we define real GDP by performing 2 adjustments simultaneously in the formula below. Real GDP is based in terms of 2010 equivalent dollars. Note that \(CPI_{2010}=100\) so the R formula uses this simplification.

\[ \text{realGDP}(t) = \frac{\text{GDP}(t)}{\text{Population}(t)}\frac{CPI_{2010}}{CPI_{t}}\] We plot the real GDP per capita in green and nominal GDP per capita in blue.

global_economy %>% filter(Country == "United States") %>% 
  mutate( gdpc = GDP/Population, realgdpc = gdpc / CPI * 100) %>% 
  filter(!is.na(gdpc)) %>% 
  ggplot() + 
  geom_line(aes(x=Year,y=gdpc ) , color="blue")+ 
  geom_line(aes(x=Year,y=realgdpc), color="green") +
  ylab("Dollars/Capita") +
  ggtitle("Real vs Nominal US GDP/capita")

The effect of this transformation is that real GDP per capita has indeed increased over time. An individual’s productivity increased from about $21,000 in 1960 to over $52,000 in 2017 in 2010 dollars.

Two notable periods when nominal GDP increased but real GDP declined are noteworthy.

  • The late 1970s and early 1980s was a period of “stagflation”.
  • In 2000-2003, the Internet bubble burst.

Australian livestock

The raw count data for Australian livestock slaughter in Victoria province shows seasonality and trend. So, a decomposition using STL or X11 would be useful while a transformation or adjustment is not justified in all use cases. For example, if the intended use is to measure peak slaughter house capacity in each decade, the raw data appears sufficient.

aus_livestock %>% 
  filter( State== "Victoria", Animal == 'Bulls, bullocks and steers' ) %>% 
  autoplot(Count) + ggtitle("Victoria Cattle Slaughter - Monthly Counts")  +
  ylab("Number of animals")

Victorian Electrical Demand

The vic_elec shows a tremendous density of information because the period is half-hourly over 3 years. The data analyst must decide the purpose and nature of the analysis before transforming vic_elec. I believe vic_elec data should not be transformed by the 4 proposed transformation: calendar, population, inflation and Box-Cox but instead could be time aggregated to a time period based on the data analyst’s purpose. I illustrate with monthly aggregation.

First we plot the raw electrical demand data along with its seasonal graphs at 3 periods: daily, weekly and annually in a panel.

library(cowplot)

p_orig   = vic_elec %>% autoplot(Demand) + labs(y="MW", title="Original")
p_daily  = vic_elec %>% gg_season(Demand, period="day") + labs(y="MW", title="Hour")
p_weekly = vic_elec %>% gg_season(Demand, period="week") + labs(y="MW", title="Week")
p_yearly = vic_elec %>% gg_season(Demand, period="year") + labs(y="MW", title="Year")


plot_grid(p_orig, p_daily, p_weekly, p_yearly)

There is clear visual evidence of seasonality at the daily, monthly, and annual level.

I argue that none of the proposed transformations are appropriate without additional context.

  • Calendar adjustments like dividing by number of days in a calendar month don’t apply since the data is half hourly. We would first have to do time aggregation. Half hourly demand needs to be grouped at the daily level before applying calendar adjustment. But daily aggregation prevents us from analyzing higher frequency patterns of demand.
  • Inflation adjustment doesn’t make sense since the demand is not monetary.
  • Population adjustment could make sense provided that we could estimate the resident and non-resident population. For example, commuters and tourists use a lot of electricity but would be omitted from population statistics.
  • Box-Cox transformation does not stabilize variance because the volatility is clustered and time varying across the range of \(\lambda\) considered below. We try \(\lambda=0, 0.1, 0.5, 1.0\).
p1   = vic_elec %>% mutate(bc1 = box_cox(Demand, 0 ) ) %>% autoplot(bc1) + labs( title="Lambda=0")
p2   = vic_elec %>% mutate(bc2 = box_cox(Demand, 0.1) ) %>% autoplot(bc2) + labs( title="Lambda=0.1")
p3   = vic_elec %>% mutate(bc3 = box_cox(Demand, 0.5) ) %>% autoplot(bc3) + labs( title="Lambda=0.5")
p4   = vic_elec %>% mutate(bc4 = box_cox(Demand, 1.0) ) %>% autoplot(bc4) + labs( title="Lambda=1.0")


plot_grid(p1, p2, p3, p4)

The only sensible transformation might be to aggregate Demand for longer periods of time. We illustrate by aggregating Demand to monthly intervals.

vic_elec %>% index_by(Month = ~ yearmonth(.) ) %>% 
  summarize( Demand2 = sum(Demand)) %>% 
  select(Month, Demand2) %>% autoplot(Demand2) + 
  labs(title="Time Aggregation by Month for Victoria Electric Demand")

Gas Production

Gas Production seems to benefit from a Box-Cox transformation to stabilize the variation. We recommend a \(\lambda=0.12\).

The Guerrero algorithm estimates the best lambda (\(\lambda=0.12\)). We confirm its suitability by visual comparison of the original series and 2 other lambdas (\(\lambda=0.05, 0.40\)).

lambdaX<- aus_production %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

lambdaX
## [1] 0.1205077
library(cowplot)

p1 = autoplot(aus_production, Gas) + ggtitle("Original")
p2 = autoplot(aus_production, box_cox( Gas, lambda = 0.05)) + ggtitle("Lambda = .05")
p3 = autoplot(aus_production, box_cox( Gas, lambda = 0.12)) + ggtitle("Lambda = .12")
p4 = autoplot(aus_production, box_cox( Gas, lambda = 0.40)) + ggtitle("Lambda = .40")

plot_grid(p1, p2, p3, p4)

Exercise 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

Solution

Box-Cox will be unhelpful to stabilize the variance because canadian_gas fails to satisfy monotonicity of variation. Box-Cox transformation is most effective when the change in variation (i.e. volatility) is monotonic.

To illustrate the failure, we make two plots.
The first one shows canadian_gas and the rolling 12 month (backward) standard deviation of Volume in red.

g1 = ggplot(canadian_gas, aes(x=Month, y= Volume)) + 
  geom_line() + ggtitle("Canadian Gas monthly Volume")

# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd = slider::slide_dbl(canadian_gas$Volume, sd, .before = 12)

# Zero out the first term of the slider output which is NA for sd() calculation.
rollsd[1] = 0  

g1 + geom_line(aes(x=Month,y=rollsd), color="red")

We clearly observe that the standard deviation is elevated in the decade of 1980-1990 and lower before and after. Thus, the variation of Volume is non-monotonic over the long term.

In the second plot, we show a Box-Cox transformation and its corresponding rolling 12 month standard deviation in red. \(\lambda\) is estimated using the Guerrero method which is allowed because the series is non-zero.

lambda <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 0.3921381
# New tsibble for Box-Cox transformed gas Volume data
bcgas = canadian_gas
bcgas$Volume = box_cox(canadian_gas$Volume, lambda )

bcrollsd = slider::slide_dbl(bcgas$Volume, sd, .before = 12)

# Zero out the first term of the slider output which is NA for sd() calculation.
bcrollsd[1] = 0  


g2 = ggplot(bcgas, aes(x=Month, y= Volume)) + 
  geom_line() + ggtitle("Box Cox Canadian Gas monthly Volume")

g2 + geom_line(aes(x=Month,y=rollsd), color="red")

The resulting Box-cox transformation of Volume does not work uniformly. We see Box-Cox works well from 1960 to 1970 where variance is stabilized but badly from 1970-2000 where variation shrinks, grows and shrinks again.

Exercise 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?

Solution

In the submitted HW1, Exercise 8 of Section 2.10, we used sales(Turnover) of the Pharmaceutical, cosmetic and toiletry good retail sector of Western Australia. For this dataset, we have chosen a Box-Cox transformation with \(\lambda \approx 0.14\) estimated using the Guerrero method. We make two plots to illustrate the noticeable stabilization of variance.

set.seed(392381)
myseries <- aus_retail %>% filter(`Series ID` == sample(aus_retail$`Series ID`, 1 ))

The first plot depicts the raw data (Turnover) in black, a trend line in red and the rolling 36 month standard deviation in green. Variation is clearly low from 1980-2000 and then much higher from 2008-2020.

dcmp <- myseries %>% model( stl=STL(Turnover))

stl_info = components(dcmp) %>% as_tsibble()

# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd_sales = slider::slide_dbl(myseries$Turnover, sd, .before = 36)

# Replace the initial value of NA with 0.  The rolling std vector has same size as original vector
rollsd_sales[1] = 0 

autoplot(stl_info, Turnover) + 
  geom_line(aes(y=trend), color="red") + 
  geom_line(aes(y=rollsd_sales), color="green") +
  ggtitle("Raw - Sales Turnover - Western Australia")

In the second plot, we illustrate a Box-Cox transformation using \(\lambda = 0.14\). The same color scheme is used as in the first plot.

lambda <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 0.1414312
myseries$BCTurnover = box_cox(myseries$Turnover, lambda )

dcmpBC <- myseries %>% model( stl=STL(BCTurnover))

stl_infoBC = components(dcmpBC) %>% as_tsibble()

# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd_BC = slider::slide_dbl(myseries$BCTurnover, sd, .before = 36)

# Replace the initial value of NA with 0.  The rolling std vector has same size as original vector
rollsd_BC[1] = 0 

autoplot(stl_infoBC, BCTurnover) + 
  geom_line(aes(y=trend), color="red") + 
  geom_line(aes(y=rollsd_BC), color="green") +
  ggtitle("Box Cox - Sales Turnover - Western Australia")

In the second plot, we observe the desired stabilization of variance. The variation from 1980-2000 and from 2008-2020 are more uniform than in the first plot.

Exercise 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

Solution

Each of these data series poses challenges not covered in the text. We improvise to find the best solution using known techniques.

Tobacco Production

For the Tobacco data series in aus_production we will a Box-Cox transformation with \(\lambda= \approx .928\) chosen by Guerrero’s method. We review historical background on Australian Tobacco production, address missing data in Tobacco and compare iagnostic plots before and after Box-Cox transformation. We conclude there is limited improvement in variance stabilization.

The history of Australian tobacco production from 1960-2017 is one of decline and replacement by imports. The elimination of protective tariffs and the decline in local tabacco consumption led to an inefficient industry. In 1994-1995, state governments announced financial incentives for tobacco growers to shut down production. Roughly 40% of growers ceased production in one year. The decline continued until commercial production record keeping ceased in 2004.

Note that the final 24 rows (starting from Q3 2004) are missing data, so we truncate the end of the dataframe. This is consistent with the historical background.

We plot the original Tobacco data series in black and the rolling 24 period backward standard deviation in green.

# Omit the ending observations which are NA values for Tobacco.
#
tobacco_data = aus_production %>% filter( !is.na(Tobacco)) %>% select(Quarter, Tobacco)


rollsd = slider::slide_dbl(tobacco_data$Tobacco, sd, .before = 12)

rollsd[1] = 0  # Remove the NA at the front


autoplot(tobacco_data, Tobacco) + ggtitle("Original Tobacco data series w rolling 12 quarter volatility") +
  geom_line(aes(x=Quarter, y = rollsd), color = "green")

We observe that volatility increases in 1994-1996 due to the state sponsored buyouts.

lambda <- aus_production %>%
  filter(!is.na(Tobacco)) %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 0.9289402
tobacco_data$TobaccoBC = box_cox(tobacco_data$Tobacco, lambda)

# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.

rollsd_BC1 = slider::slide_dbl(tobacco_data$TobaccoBC, sd, .before = 12)

# Replace the initial value of NA with 0.  The rolling std vector has same size as original vector
rollsd_BC1[1] = 0 

ggplot(tobacco_data) +
  geom_line(aes(x = Quarter, y=TobaccoBC), color="red") + 
  geom_line(aes(x = Quarter, y=rollsd_BC1), color="green") +
  ggtitle("Box Cox - Tobacco Production - Australia")

By visual inspection below, variation in Tobacco production from 1956 to 1970 (the left side) has increased while the variation from 1980 to 2004 on the right side has declined.

We also analyze the STL decomposition of tobacco before and after Box-Cox.
* Both residual plots of the STL decomposition look white-noise like with known left tail outliers. * Plotting the residuals on a qqnorm plot shows limited benefit of the transformation. The variance stabilization does not improve normality of the STL residuals significantly.

tobacco_dcmp = tobacco_data %>% model( stl = STL(Tobacco)) %>% components()

tobacco_dcmp %>% autoplot(Tobacco)

ggplot(data=tobacco_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Tobacco")

tobaccobc_dcmp = tobacco_data %>% model( stl = STL(TobaccoBC)) %>% components()

tobaccobc_dcmp %>% autoplot(TobaccoBC)

ggplot(data=tobaccobc_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box-Cox Tobacco")

Economy Class Passengers

This time series is challenging due to a period of missing data and other issues. Ultimately, none of the Box-Cox transformations produce better variance stabilization before the issues addressed below. We chose a \(\lambda \approx 1.02\) as the optimal choice from Guerrero’s method but this is almost the same as not using Box-Cox.

Version 2 of this text gives a detailed list of the data quality issues in Section 2.2 which is omitted in Version 3. It states:

The time plot immediately reveals some interesting features. There was a period in 1989 when no passengers were carried – this was due to an industrial dispute. There was a period of reduced load in 1992. This was due to a trial in which some economy seats where replaced by business class seats. A large increase in passenger load occurred in the second half of 1991. There are some large dips in load around the start of each year. These are due to holiday effects. There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.

melsyd = ansett %>% 
  filter( Airports == 'MEL-SYD' , Class == 'Economy' ) 
    
melsyd %>% autoplot(Passengers)

We also observe a missing observation: 1987 W38 is missing which we backfill with the prior week’s value ‘1987 W37’ of 21911.

melsyd %>% filter_index('1987 W35' ~ '1987 W40')
## # A tsibble: 5 x 4 [1W]
## # Key:       Airports, Class [1]
##       Week Airports Class   Passengers
##     <week> <chr>    <chr>        <dbl>
## 1 1987 W35 MEL-SYD  Economy      20772
## 2 1987 W36 MEL-SYD  Economy      21642
## 3 1987 W37 MEL-SYD  Economy      21911
## 4 1987 W39 MEL-SYD  Economy      23777
## 5 1987 W40 MEL-SYD  Economy      22658
# Save the modified passenger data into a new tsibble.
#
melsyd %>% fill_gaps() %>% 
  tidyr::fill(Passengers, .direction = 'down') -> melsyd2

Next, we observe a period of consecutive weeks in 1989 with zero Passengers due to the previously mentioned labor dispute from W34 to W40. We choose to impute the missing data values because we believe the model should not predict labor stoppages. Thus, the imputed series represents a counterfactual scenario where labor stoppages don’t exist but other business events may effect the time series.

melsyd2 %>% filter(Passengers == 0 )
## # A tsibble: 7 x 4 [1W]
## # Key:       Airports, Class [1]
##       Week Airports Class   Passengers
##     <week> <chr>    <chr>        <dbl>
## 1 1989 W34 MEL-SYD  Economy          0
## 2 1989 W35 MEL-SYD  Economy          0
## 3 1989 W36 MEL-SYD  Economy          0
## 4 1989 W37 MEL-SYD  Economy          0
## 5 1989 W38 MEL-SYD  Economy          0
## 6 1989 W39 MEL-SYD  Economy          0
## 7 1989 W40 MEL-SYD  Economy          0

Let the average passenger level per week in 1988 be \(P_{1988}\) and the average passenger level per week in 1989 excluding the weeks from W34 to W40 be \(P_{1989}\). The imputed Passengers for missing week \(k\) in 1989 will be:

\[M(k,1989) = \frac{P_{1989}}{P_{1988}} M(k, 1988) \text{ for } k=34, \cdots , 40\]

This formula replicates the seasonality of the week while preserving the average level of each year’s passenger count relative to these two years.

P1988 = melsyd2 %>% filter_index('1988 W01' ~ '1988 W52') %>% 
  as_tibble() %>% summarize(mean(Passengers)) %>% magrittr::extract2(1)

P1988
## [1] 22164.71
P1989 = melsyd2 %>% 
  filter_index('1989 W01' ~ '1989 W52') %>% 
  filter( Passengers != 0) %>%
  as_tibble() %>% 
  summarize(mean(Passengers)) %>% magrittr::extract2(1)

P1989
## [1] 18997.93
melsyd2[114:120,]
## # A tsibble: 7 x 4 [1W]
## # Key:       Airports, Class [1]
##       Week Airports Class   Passengers
##     <week> <chr>    <chr>        <dbl>
## 1 1989 W34 MEL-SYD  Economy          0
## 2 1989 W35 MEL-SYD  Economy          0
## 3 1989 W36 MEL-SYD  Economy          0
## 4 1989 W37 MEL-SYD  Economy          0
## 5 1989 W38 MEL-SYD  Economy          0
## 6 1989 W39 MEL-SYD  Economy          0
## 7 1989 W40 MEL-SYD  Economy          0
melsyd2[114:120,"Passengers"] = melsyd2[62:68, "Passengers"] * (P1989)/(P1988)

melsyd2[114:120,]
## # A tsibble: 7 x 4 [1W]
## # Key:       Airports, Class [1]
##       Week Airports Class   Passengers
##     <week> <chr>    <chr>        <dbl>
## 1 1989 W34 MEL-SYD  Economy     20127.
## 2 1989 W35 MEL-SYD  Economy     20274.
## 3 1989 W36 MEL-SYD  Economy     22332.
## 4 1989 W37 MEL-SYD  Economy     22024.
## 5 1989 W38 MEL-SYD  Economy     22902.
## 6 1989 W39 MEL-SYD  Economy     12885.
## 7 1989 W40 MEL-SYD  Economy     19148.
melsyd2 %>% gg_season(Passengers)

melsyd2
## # A tsibble: 283 x 4 [1W]
## # Key:       Airports, Class [1]
##        Week Airports Class   Passengers
##      <week> <chr>    <chr>        <dbl>
##  1 1987 W26 MEL-SYD  Economy      20167
##  2 1987 W27 MEL-SYD  Economy      20161
##  3 1987 W28 MEL-SYD  Economy      19993
##  4 1987 W29 MEL-SYD  Economy      20986
##  5 1987 W30 MEL-SYD  Economy      20497
##  6 1987 W31 MEL-SYD  Economy      20770
##  7 1987 W32 MEL-SYD  Economy      21111
##  8 1987 W33 MEL-SYD  Economy      20675
##  9 1987 W34 MEL-SYD  Economy      22092
## 10 1987 W35 MEL-SYD  Economy      20772
## # … with 273 more rows
lambda<- melsyd2 %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 1.017995
melsyd2$bc  = box_cox(melsyd2$Passengers, lambda)

autoplot(melsyd2, bc)

The STL decomposition below and the resulting residuals show departure from normality on the left tail.

melsyd2bc_dcmp = melsyd2 %>% model( stl = STL(bc)) %>% components()

melsyd2bc_dcmp %>% autoplot(bc)

ggplot(data=melsyd2bc_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box-Cox Mel-Syd Passengers")

Pedestrians

Since the pedestrian dataset has many observations with zero count, we cannot use the Guerrero method to compute \(\lambda\) in the Box-Cox transformation. We also need to fill gaps in the pedestrian dataset with the prior available count as a best estimate. We do that below. We suggest \(\lambda=0.1\) as for the Box-Cox transformation but argue no parameter is satisfactory.

pedestrian %>%
  filter(Sensor=="Southern Cross Station") -> ped1

#ped1 %>% fill_gaps(Count = as.integer(median(Count))) -> pedscs
ped1 %>% fill_gaps() %>% 
  tidyr::fill(Count, .direction = 'down') -> pedscs

Note that zero counts are the most frequently observed count in the dataset as shown in the histogram below.

# Show that zero counts are frequent in the dataset.
hist(pedscs$Count, breaks = 40 )

Construct a STL decomposition of the raw series.

xcmp = pedscs %>% model( stl = STL(Count)) %>% components()

xcmp %>% autoplot(Count)

ggplot(data=xcmp) + geom_qq(aes(sample=remainder )) + 
  geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Pedestrians")

In the above components plot, the STL algorithm decomposes the time series into trend and 3 seasonal components automatically. The annual, weekly and daily components show some unevenness in plots. The annual seasonality plot shows downward spikes and the daily seasonality plot shows some uneven clumping around 2016-01. Lastly, the residuals seem significant kurtosis. There are a number of large tail events.

By trial and error, we observe 0.1 allows reducing some of these residual tails and smooths out the clumping of the annual and daily plots as shown below. However, the Q-Q plot shows significant departures from normality at the tail. So we conclude Box-Cox does not solve the challenge of modeling this time series effectively.

lambda = 0.1

xcmp2 = pedscs %>% 
  mutate( bc = box_cox(Count, lambda) ) %>% 
  model( stl = STL(bc)) %>% components()

xcmp2 %>% autoplot(bc)

ggplot(data=xcmp2) + geom_qq(aes(sample=remainder )) + 
  geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box Cox Pedestrians")

Exercise 3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) %>% 
select(Gas)
  1. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
  2. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
  3. Do the results support the graphical interpretation from part a?
  4. Compute and plot the seasonally adjusted data.
  5. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
  6. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Solution

  1. We observe both a seasonal fluctation and a trend in the gas data. The season plot shows an increase in quarters 2 and 3 and a decline in quarters 1 and 4.
gas <- tail(aus_production, 5*4) %>% 
select(Gas)

autoplot(gas, Gas) + 
  labs(title="Original: Last 20 observations of Gas Production", 
       subtitle="aus_production" )

# Seasonality is observed using gg_season plot.
gg_season(gas, Gas, period="year")

  1. A classical decomposition is calculated below in multiplicative form.
classic_dcmp <- gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() 

classic_dcmp %>%
  autoplot() +
  labs(title = 
  "Classical multiplicative decomposition: Gas",
  caption = "Austrialia Gas Production")

c. The results in b. are consistent with the diagnostic plots in a. They extend the results of a. by fitting a trend through the raw data. d. The seasonally adjusted Gas production data from the classical decomposition is contained in the column season_adjust. This is graphed below.

classic_dcmp %>% autoplot(season_adjust) + labs(caption = "Seasonally adjusted gas production")

e. We shock the gas data at observation 10 of the original dataset to study a mid-range shock.

gasXMid = gas
gasXMid[10, "Gas"] = gas[10,"Gas"] + 300

classic_dcmp_mid <- gasXMid %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() 

classic_dcmp_mid %>%
  autoplot() +
  labs(title = 
  "Classical multiplicative decomposition: Gas Mid-Outlier",
  caption = "Australia Gas Production")

classic_dcmp_mid %>% 
  autoplot(season_adjust) + 
  labs(caption = "Seasonally adjusted gas production with Mid-Outlier")

A large hump interrupts the smooth trend. We see that the outlier disrupts the seasonal pattern. Because outlier occurs at an observation in Q4, seasonality component shows an increase at Q4 instead of a decline in Q4.
Lastly, the range of the residual explodes. Previously, the residual component range about 2% from the base level of 1. When the outlier is added, the residual range increases to +20% to -20%.

  1. Now we compute the same shock impact at observation 20 (the end of the time series).
gasXEnd = gas
gasXEnd[20, "Gas"] = gas[20,"Gas"] + 300

classic_dcmp_end <- gasXEnd %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() 

classic_dcmp_end %>%
  autoplot() +
  labs(title = 
  "Classical multiplicative decomposition: Gas End-Outlier",
  caption = "Austrialia Gas Production")

classic_dcmp_end %>% 
  autoplot(season_adjust) + 
  labs(caption = "Seasonally adjusted gas production with shock at end")

The impact of an outlier at the last observation produces different effects than in the middle. First, the trend does not show a hump but rather a spike at the end. Second, the seasonal component shows an increase in Q2 but keeps the decline in Q4. In contrast, the middle of the range outlier causes an increase in Q4 but no impact on Q2. Third, the residuals show a smaller impact at the end. The decline is only 12% compared to a 40% increase in the mid-range outlier.

Exercise 3.8

Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

Solution

Our solution first examines the multiplicative versions of the X11 and classical decompositions then compares their residuals in a labeled scatterplot to identify outliers.

Our retail time series data is replicated below with the same seed as in Exercise 8, Section 2.10.

# We replicate the Time Series used in HW 1 Exercise by using the same seed.

set.seed(392381)
myseries <- aus_retail %>% 
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1 ))

The X11 decomposition is computed below.

# Import the seasonal library where X_13ARIMA_SEATS is available
library(seasonal)
x11_dcmp <- myseries %>% 
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()

autoplot(x11_dcmp ) + 
  labs( title = 
  "Decomposition of Pharmaeutical, cosmetic, toiletry turnover using X-11",
        caption = "Western Australia")

The classical decomposition is computed below.

classic_dcmp <- myseries %>%
  model(
    classical_decomposition(Turnover, type = "multiplicative")
  ) %>%
  components() 

classic_dcmp %>%
  autoplot() +
  labs(title = 
  "Classical multiplicative decomposition: Pharmaceuticals, cosmetics, toiletry",
  caption = "Western Australia")

Lastly, we examine a scatter plot of the residuals of both the X11 decomposition (horizontal axis) to the classical decomposition (vertical axis) below. As the residuals are multiplicative, the non-outliers are centered around the point \((1,1)\) rather than the origin.
We define outliers as the datapoints where either X11 or classical residual exceed a 9.5% threshold. We use ggrepel and conditional labeling to identify outliers.

library(ggrepel)
turnover_compare = classic_dcmp %>% select(Month, random)

turnover_compare$x11err = x11_dcmp$irregular

ggplot(turnover_compare, aes(x=x11err, y=random)) + 
  geom_point(alpha=0.6) + 
  ggtitle("Residuals Between X11 and Classical Decomposition") + 
  geom_label_repel(aes(label=ifelse(  abs(x11err - 1) > 0.095 |   
                               abs(random - 1) > 0.095,                                                                                                   as.character(Month), "") ) , 
                 color="red", point.padding = 0.3, box.padding = 0.3) +
  xlab("X11 residual") + ylab("Classical residual") +
  labs(caption="Outliers labeled if absolute residual exceeds 9.5%")

Observations

The X11 decomposition reveals some interesting new features that I did not expect.

  • There is significant variation in seasonality between each decade. In the decade after 2010, seasonality has declined. In 2013, 2014 seasonality declined to its lowest point in over 2 decades. This information is absent in the classical decomposition.
  • Several months appear to be outliers in X11 but not in classical decomposition. The scatter plot shows Jan 2000, Dec 2010 as outliers (below trend).
  • Some months are outliers in both models: Dec 1989 and Dec 1990 are worth further exploration.

In general, X11 seems to do a better job of identify business cycles and deviations from trend than the classical decomposition. However, digging into Australian business cycles to investigate the above outliers and seasonal variation is of scope.

Exercise 3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.19: Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.20: Seasonal component from the decomposition shown in the previous figure.

  1. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
  2. Is the recession of 1991/1992 visible in the estimated components?

Solution

  1. The decomposition of the civilian labor force shows a steady positive upward trend. The trend could be explained by population growth which may be due to changes in birth rates or immigration. In 1995, the trend seems to reach 9,000,000 workers. The seasonal component of the labor force increases in March, September and December. For example, seasonal chart shows 70,000 workforce decline relative to trend in August which may be due to holiday seasonal work or summer school.

  2. The recession of 1991/1992 is highly visible in the residuals and shows both a decline in employment followed by a sharp rebound in following months.