Exercises from Hyndman and Athanosopoulos, Forecasting: Principles and Practice, 3rd Edition, Chapter 3 Time series decomposition.
library(fpp3)
library(kableExtra)Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
Two challenges to plotting GDP per capita for global_economy is the quantity of data and the wide range of values. It is impractical to evaluate the 256 GDP time series graphs using facet_wrap by country. Moreover, GDP per capita can range from $120 in 2001 (Ethiopia) to $82,553 (Monaco). We use the log transformation of the GDP per capita (omitting the legend) to construct a “hairy chart”.
The plot below shows GDP per capita has risen for all countries from 1960-2017. The relative rank of countries by GDP per capita has relatively stable. While all countries have gotten wealthier, lower ranking (poorer) countries in 1960 remained lower in 2017 (with notable exceptions like China).
global_economy %>% mutate(gdpc = GDP/Population ) %>%
filter(!is.na(gdpc) ) %>% select( Year, Country, gdpc) %>%
ggplot(aes(x=Year,y=log(gdpc), color=Country)) +
geom_line() + theme(legend.position="none") +
ylab("log($GDP/capita)") + ggtitle("Log GDP per capital- All Countries 1960-2017")Monaco had the highest GDP per capita in 2014 at $185,153 for all countries and all years in the global_economy dataset.
The table below shows the top country by GDP per capita in each year. A handful of countries have held the title of highest GDP per capita of the year. The winners include United States, Kuwait, Monaco, Liechtenstein, Luxembourg and United Arab Emirates. Monaco has been the winner the most times.
# This code constructs a gdp per capita variable, removes NA observations,
# and finds the country of highest gdp per capita within each year.
global_economy %>% mutate(gdpc = GDP/Population ) %>%
filter(!is.na(gdpc) ) %>% index_by(Year) %>%
filter( gdpc == max(gdpc) ) %>% select(Year, gdpc) %>%
arrange(Year) %>%
kable(col.names=c("Year", "GDP/capita", "Country"), digits = 0, align='ccl') %>%
kable_styling(bootstrap_options = c("hover", "striped"))| Year | GDP/capita | Country |
|---|---|---|
| 1960 | 3007 | United States |
| 1961 | 3067 | United States |
| 1962 | 3244 | United States |
| 1963 | 3375 | United States |
| 1964 | 3574 | United States |
| 1965 | 4429 | Kuwait |
| 1966 | 4556 | Kuwait |
| 1967 | 4336 | United States |
| 1968 | 4696 | United States |
| 1969 | 5032 | United States |
| 1970 | 12480 | Monaco |
| 1971 | 13813 | Monaco |
| 1972 | 16734 | Monaco |
| 1973 | 21423 | Monaco |
| 1974 | 22707 | Monaco |
| 1975 | 28254 | Monaco |
| 1976 | 29698 | United Arab Emirates |
| 1977 | 33246 | United Arab Emirates |
| 1978 | 38354 | Monaco |
| 1979 | 45838 | Monaco |
| 1980 | 51529 | Monaco |
| 1981 | 44366 | Monaco |
| 1982 | 41385 | Monaco |
| 1983 | 38888 | Monaco |
| 1984 | 36382 | Monaco |
| 1985 | 37553 | Monaco |
| 1986 | 52175 | Monaco |
| 1987 | 63043 | Monaco |
| 1988 | 68434 | Monaco |
| 1989 | 68577 | Monaco |
| 1990 | 84287 | Monaco |
| 1991 | 83733 | Monaco |
| 1992 | 91654 | Monaco |
| 1993 | 85422 | Monaco |
| 1994 | 89404 | Monaco |
| 1995 | 101993 | Monaco |
| 1996 | 101329 | Monaco |
| 1997 | 90883 | Monaco |
| 1998 | 93093 | Monaco |
| 1999 | 91384 | Monaco |
| 2000 | 82535 | Monaco |
| 2001 | 82553 | Monaco |
| 2002 | 89061 | Monaco |
| 2003 | 108978 | Monaco |
| 2004 | 123382 | Monaco |
| 2005 | 124374 | Monaco |
| 2006 | 133195 | Monaco |
| 2007 | 167125 | Monaco |
| 2008 | 180640 | Monaco |
| 2009 | 149221 | Monaco |
| 2010 | 144569 | Monaco |
| 2011 | 162155 | Monaco |
| 2012 | 152000 | Monaco |
| 2013 | 173528 | Liechtenstein |
| 2014 | 185153 | Monaco |
| 2015 | 167591 | Liechtenstein |
| 2016 | 168011 | Monaco |
| 2017 | 104103 | Luxembourg |
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
global_economy.aus_livestock.vic_elec.aus_production.For each time series below, we plot the raw time series and then assess if a transformation is desirable.
The time series needs to be adjusted for both population and inflation to get a real GDP per capita.
First we illustrate the raw time series. The nominal aggregate US GDP increases approximately geometrically except during economic recessions.
global_economy %>% filter(Country == "United States") %>%
autoplot(GDP) + ggtitle("Raw United States GDP") + xlab("Year") + ylab("$GDP")The US GDP needs to be transformed because the raw data does not tell us if individuals enjoyed increasing economic prosperity across time. So we define real GDP by performing 2 adjustments simultaneously in the formula below. Real GDP is based in terms of 2010 equivalent dollars. Note that \(CPI_{2010}=100\) so the R formula uses this simplification.
\[ \text{realGDP}(t) = \frac{\text{GDP}(t)}{\text{Population}(t)}\frac{CPI_{2010}}{CPI_{t}}\] We plot the real GDP per capita in green and nominal GDP per capita in blue.
global_economy %>% filter(Country == "United States") %>%
mutate( gdpc = GDP/Population, realgdpc = gdpc / CPI * 100) %>%
filter(!is.na(gdpc)) %>%
ggplot() +
geom_line(aes(x=Year,y=gdpc ) , color="blue")+
geom_line(aes(x=Year,y=realgdpc), color="green") +
ylab("Dollars/Capita") +
ggtitle("Real vs Nominal US GDP/capita") The effect of this transformation is that real GDP per capita has indeed increased over time. An individual’s productivity increased from about $21,000 in 1960 to over $52,000 in 2017 in 2010 dollars.
Two notable periods when nominal GDP increased but real GDP declined are noteworthy.
The raw count data for Australian livestock slaughter in Victoria province shows seasonality and trend. So, a decomposition using STL or X11 would be useful while a transformation or adjustment is not justified in all use cases. For example, if the intended use is to measure peak slaughter house capacity in each decade, the raw data appears sufficient.
aus_livestock %>%
filter( State== "Victoria", Animal == 'Bulls, bullocks and steers' ) %>%
autoplot(Count) + ggtitle("Victoria Cattle Slaughter - Monthly Counts") +
ylab("Number of animals")The vic_elec shows a tremendous density of information because the period is half-hourly over 3 years. The data analyst must decide the purpose and nature of the analysis before transforming vic_elec. I believe vic_elec data should not be transformed by the 4 proposed transformation: calendar, population, inflation and Box-Cox but instead could be time aggregated to a time period based on the data analyst’s purpose. I illustrate with monthly aggregation.
First we plot the raw electrical demand data along with its seasonal graphs at 3 periods: daily, weekly and annually in a panel.
library(cowplot)
p_orig = vic_elec %>% autoplot(Demand) + labs(y="MW", title="Original")
p_daily = vic_elec %>% gg_season(Demand, period="day") + labs(y="MW", title="Hour")
p_weekly = vic_elec %>% gg_season(Demand, period="week") + labs(y="MW", title="Week")
p_yearly = vic_elec %>% gg_season(Demand, period="year") + labs(y="MW", title="Year")
plot_grid(p_orig, p_daily, p_weekly, p_yearly) There is clear visual evidence of seasonality at the daily, monthly, and annual level.
I argue that none of the proposed transformations are appropriate without additional context.
p1 = vic_elec %>% mutate(bc1 = box_cox(Demand, 0 ) ) %>% autoplot(bc1) + labs( title="Lambda=0")
p2 = vic_elec %>% mutate(bc2 = box_cox(Demand, 0.1) ) %>% autoplot(bc2) + labs( title="Lambda=0.1")
p3 = vic_elec %>% mutate(bc3 = box_cox(Demand, 0.5) ) %>% autoplot(bc3) + labs( title="Lambda=0.5")
p4 = vic_elec %>% mutate(bc4 = box_cox(Demand, 1.0) ) %>% autoplot(bc4) + labs( title="Lambda=1.0")
plot_grid(p1, p2, p3, p4)The only sensible transformation might be to aggregate Demand for longer periods of time. We illustrate by aggregating Demand to monthly intervals.
vic_elec %>% index_by(Month = ~ yearmonth(.) ) %>%
summarize( Demand2 = sum(Demand)) %>%
select(Month, Demand2) %>% autoplot(Demand2) +
labs(title="Time Aggregation by Month for Victoria Electric Demand")Gas Production seems to benefit from a Box-Cox transformation to stabilize the variation. We recommend a \(\lambda=0.12\).
The Guerrero algorithm estimates the best lambda (\(\lambda=0.12\)). We confirm its suitability by visual comparison of the original series and 2 other lambdas (\(\lambda=0.05, 0.40\)).
lambdaX<- aus_production %>%
features(Gas, features = guerrero) %>%
pull(lambda_guerrero)
lambdaX## [1] 0.1205077
library(cowplot)
p1 = autoplot(aus_production, Gas) + ggtitle("Original")
p2 = autoplot(aus_production, box_cox( Gas, lambda = 0.05)) + ggtitle("Lambda = .05")
p3 = autoplot(aus_production, box_cox( Gas, lambda = 0.12)) + ggtitle("Lambda = .12")
p4 = autoplot(aus_production, box_cox( Gas, lambda = 0.40)) + ggtitle("Lambda = .40")
plot_grid(p1, p2, p3, p4)Why is a Box-Cox transformation unhelpful for the canadian_gas data?
Box-Cox will be unhelpful to stabilize the variance because canadian_gas fails to satisfy monotonicity of variation. Box-Cox transformation is most effective when the change in variation (i.e. volatility) is monotonic.
To illustrate the failure, we make two plots.
The first one shows canadian_gas and the rolling 12 month (backward) standard deviation of Volume in red.
g1 = ggplot(canadian_gas, aes(x=Month, y= Volume)) +
geom_line() + ggtitle("Canadian Gas monthly Volume")
# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd = slider::slide_dbl(canadian_gas$Volume, sd, .before = 12)
# Zero out the first term of the slider output which is NA for sd() calculation.
rollsd[1] = 0
g1 + geom_line(aes(x=Month,y=rollsd), color="red")We clearly observe that the standard deviation is elevated in the decade of 1980-1990 and lower before and after. Thus, the variation of Volume is non-monotonic over the long term.
In the second plot, we show a Box-Cox transformation and its corresponding rolling 12 month standard deviation in red. \(\lambda\) is estimated using the Guerrero method which is allowed because the series is non-zero.
lambda <- canadian_gas %>%
features(Volume, features = guerrero) %>%
pull(lambda_guerrero)
lambda## [1] 0.3921381
# New tsibble for Box-Cox transformed gas Volume data
bcgas = canadian_gas
bcgas$Volume = box_cox(canadian_gas$Volume, lambda )
bcrollsd = slider::slide_dbl(bcgas$Volume, sd, .before = 12)
# Zero out the first term of the slider output which is NA for sd() calculation.
bcrollsd[1] = 0
g2 = ggplot(bcgas, aes(x=Month, y= Volume)) +
geom_line() + ggtitle("Box Cox Canadian Gas monthly Volume")
g2 + geom_line(aes(x=Month,y=rollsd), color="red")The resulting Box-cox transformation of Volume does not work uniformly. We see Box-Cox works well from 1960 to 1970 where variance is stabilized but badly from 1970-2000 where variation shrinks, grows and shrinks again.
What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?
In the submitted HW1, Exercise 8 of Section 2.10, we used sales(Turnover) of the Pharmaceutical, cosmetic and toiletry good retail sector of Western Australia. For this dataset, we have chosen a Box-Cox transformation with \(\lambda \approx 0.14\) estimated using the Guerrero method. We make two plots to illustrate the noticeable stabilization of variance.
set.seed(392381)
myseries <- aus_retail %>% filter(`Series ID` == sample(aus_retail$`Series ID`, 1 ))The first plot depicts the raw data (Turnover) in black, a trend line in red and the rolling 36 month standard deviation in green. Variation is clearly low from 1980-2000 and then much higher from 2008-2020.
dcmp <- myseries %>% model( stl=STL(Turnover))
stl_info = components(dcmp) %>% as_tsibble()
# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd_sales = slider::slide_dbl(myseries$Turnover, sd, .before = 36)
# Replace the initial value of NA with 0. The rolling std vector has same size as original vector
rollsd_sales[1] = 0
autoplot(stl_info, Turnover) +
geom_line(aes(y=trend), color="red") +
geom_line(aes(y=rollsd_sales), color="green") +
ggtitle("Raw - Sales Turnover - Western Australia")In the second plot, we illustrate a Box-Cox transformation using \(\lambda = 0.14\). The same color scheme is used as in the first plot.
lambda <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
lambda## [1] 0.1414312
myseries$BCTurnover = box_cox(myseries$Turnover, lambda )
dcmpBC <- myseries %>% model( stl=STL(BCTurnover))
stl_infoBC = components(dcmpBC) %>% as_tsibble()
# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd_BC = slider::slide_dbl(myseries$BCTurnover, sd, .before = 36)
# Replace the initial value of NA with 0. The rolling std vector has same size as original vector
rollsd_BC[1] = 0
autoplot(stl_infoBC, BCTurnover) +
geom_line(aes(y=trend), color="red") +
geom_line(aes(y=rollsd_BC), color="green") +
ggtitle("Box Cox - Sales Turnover - Western Australia")In the second plot, we observe the desired stabilization of variance. The variation from 1980-2000 and from 2008-2020 are more uniform than in the first plot.
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
aus_production,ansettpedestrianEach of these data series poses challenges not covered in the text. We improvise to find the best solution using known techniques.
For the Tobacco data series in aus_production we will a Box-Cox transformation with \(\lambda= \approx .928\) chosen by Guerrero’s method. We review historical background on Australian Tobacco production, address missing data in Tobacco and compare iagnostic plots before and after Box-Cox transformation. We conclude there is limited improvement in variance stabilization.
The history of Australian tobacco production from 1960-2017 is one of decline and replacement by imports. The elimination of protective tariffs and the decline in local tabacco consumption led to an inefficient industry. In 1994-1995, state governments announced financial incentives for tobacco growers to shut down production. Roughly 40% of growers ceased production in one year. The decline continued until commercial production record keeping ceased in 2004.
Note that the final 24 rows (starting from Q3 2004) are missing data, so we truncate the end of the dataframe. This is consistent with the historical background.
We plot the original Tobacco data series in black and the rolling 24 period backward standard deviation in green.
# Omit the ending observations which are NA values for Tobacco.
#
tobacco_data = aus_production %>% filter( !is.na(Tobacco)) %>% select(Quarter, Tobacco)
rollsd = slider::slide_dbl(tobacco_data$Tobacco, sd, .before = 12)
rollsd[1] = 0 # Remove the NA at the front
autoplot(tobacco_data, Tobacco) + ggtitle("Original Tobacco data series w rolling 12 quarter volatility") +
geom_line(aes(x=Quarter, y = rollsd), color = "green")We observe that volatility increases in 1994-1996 due to the state sponsored buyouts.
lambda <- aus_production %>%
filter(!is.na(Tobacco)) %>%
features(Tobacco, features = guerrero) %>%
pull(lambda_guerrero)
lambda## [1] 0.9289402
tobacco_data$TobaccoBC = box_cox(tobacco_data$Tobacco, lambda)
# Use the slider library to compute rolling standard deviation.
# The size of the output vector is aligned.
rollsd_BC1 = slider::slide_dbl(tobacco_data$TobaccoBC, sd, .before = 12)
# Replace the initial value of NA with 0. The rolling std vector has same size as original vector
rollsd_BC1[1] = 0
ggplot(tobacco_data) +
geom_line(aes(x = Quarter, y=TobaccoBC), color="red") +
geom_line(aes(x = Quarter, y=rollsd_BC1), color="green") +
ggtitle("Box Cox - Tobacco Production - Australia")By visual inspection below, variation in Tobacco production from 1956 to 1970 (the left side) has increased while the variation from 1980 to 2004 on the right side has declined.
We also analyze the STL decomposition of tobacco before and after Box-Cox.
* Both residual plots of the STL decomposition look white-noise like with known left tail outliers. * Plotting the residuals on a qqnorm plot shows limited benefit of the transformation. The variance stabilization does not improve normality of the STL residuals significantly.
tobacco_dcmp = tobacco_data %>% model( stl = STL(Tobacco)) %>% components()
tobacco_dcmp %>% autoplot(Tobacco)ggplot(data=tobacco_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Tobacco")tobaccobc_dcmp = tobacco_data %>% model( stl = STL(TobaccoBC)) %>% components()
tobaccobc_dcmp %>% autoplot(TobaccoBC)ggplot(data=tobaccobc_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box-Cox Tobacco")This time series is challenging due to a period of missing data and other issues. Ultimately, none of the Box-Cox transformations produce better variance stabilization before the issues addressed below. We chose a \(\lambda \approx 1.02\) as the optimal choice from Guerrero’s method but this is almost the same as not using Box-Cox.
Version 2 of this text gives a detailed list of the data quality issues in Section 2.2 which is omitted in Version 3. It states:
The time plot immediately reveals some interesting features. There was a period in 1989 when no passengers were carried – this was due to an industrial dispute. There was a period of reduced load in 1992. This was due to a trial in which some economy seats where replaced by business class seats. A large increase in passenger load occurred in the second half of 1991. There are some large dips in load around the start of each year. These are due to holiday effects. There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.
melsyd = ansett %>%
filter( Airports == 'MEL-SYD' , Class == 'Economy' )
melsyd %>% autoplot(Passengers)We also observe a missing observation: 1987 W38 is missing which we backfill with the prior week’s value ‘1987 W37’ of 21911.
melsyd %>% filter_index('1987 W35' ~ '1987 W40')## # A tsibble: 5 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1987 W35 MEL-SYD Economy 20772
## 2 1987 W36 MEL-SYD Economy 21642
## 3 1987 W37 MEL-SYD Economy 21911
## 4 1987 W39 MEL-SYD Economy 23777
## 5 1987 W40 MEL-SYD Economy 22658
# Save the modified passenger data into a new tsibble.
#
melsyd %>% fill_gaps() %>%
tidyr::fill(Passengers, .direction = 'down') -> melsyd2Next, we observe a period of consecutive weeks in 1989 with zero Passengers due to the previously mentioned labor dispute from W34 to W40. We choose to impute the missing data values because we believe the model should not predict labor stoppages. Thus, the imputed series represents a counterfactual scenario where labor stoppages don’t exist but other business events may effect the time series.
melsyd2 %>% filter(Passengers == 0 )## # A tsibble: 7 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1989 W34 MEL-SYD Economy 0
## 2 1989 W35 MEL-SYD Economy 0
## 3 1989 W36 MEL-SYD Economy 0
## 4 1989 W37 MEL-SYD Economy 0
## 5 1989 W38 MEL-SYD Economy 0
## 6 1989 W39 MEL-SYD Economy 0
## 7 1989 W40 MEL-SYD Economy 0
Let the average passenger level per week in 1988 be \(P_{1988}\) and the average passenger level per week in 1989 excluding the weeks from W34 to W40 be \(P_{1989}\). The imputed Passengers for missing week \(k\) in 1989 will be:
\[M(k,1989) = \frac{P_{1989}}{P_{1988}} M(k, 1988) \text{ for } k=34, \cdots , 40\]
This formula replicates the seasonality of the week while preserving the average level of each year’s passenger count relative to these two years.
P1988 = melsyd2 %>% filter_index('1988 W01' ~ '1988 W52') %>%
as_tibble() %>% summarize(mean(Passengers)) %>% magrittr::extract2(1)
P1988## [1] 22164.71
P1989 = melsyd2 %>%
filter_index('1989 W01' ~ '1989 W52') %>%
filter( Passengers != 0) %>%
as_tibble() %>%
summarize(mean(Passengers)) %>% magrittr::extract2(1)
P1989## [1] 18997.93
melsyd2[114:120,]## # A tsibble: 7 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1989 W34 MEL-SYD Economy 0
## 2 1989 W35 MEL-SYD Economy 0
## 3 1989 W36 MEL-SYD Economy 0
## 4 1989 W37 MEL-SYD Economy 0
## 5 1989 W38 MEL-SYD Economy 0
## 6 1989 W39 MEL-SYD Economy 0
## 7 1989 W40 MEL-SYD Economy 0
melsyd2[114:120,"Passengers"] = melsyd2[62:68, "Passengers"] * (P1989)/(P1988)
melsyd2[114:120,]## # A tsibble: 7 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1989 W34 MEL-SYD Economy 20127.
## 2 1989 W35 MEL-SYD Economy 20274.
## 3 1989 W36 MEL-SYD Economy 22332.
## 4 1989 W37 MEL-SYD Economy 22024.
## 5 1989 W38 MEL-SYD Economy 22902.
## 6 1989 W39 MEL-SYD Economy 12885.
## 7 1989 W40 MEL-SYD Economy 19148.
melsyd2 %>% gg_season(Passengers)melsyd2## # A tsibble: 283 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1987 W26 MEL-SYD Economy 20167
## 2 1987 W27 MEL-SYD Economy 20161
## 3 1987 W28 MEL-SYD Economy 19993
## 4 1987 W29 MEL-SYD Economy 20986
## 5 1987 W30 MEL-SYD Economy 20497
## 6 1987 W31 MEL-SYD Economy 20770
## 7 1987 W32 MEL-SYD Economy 21111
## 8 1987 W33 MEL-SYD Economy 20675
## 9 1987 W34 MEL-SYD Economy 22092
## 10 1987 W35 MEL-SYD Economy 20772
## # … with 273 more rows
lambda<- melsyd2 %>%
features(Passengers, features = guerrero) %>%
pull(lambda_guerrero)
lambda## [1] 1.017995
melsyd2$bc = box_cox(melsyd2$Passengers, lambda)
autoplot(melsyd2, bc)The STL decomposition below and the resulting residuals show departure from normality on the left tail.
melsyd2bc_dcmp = melsyd2 %>% model( stl = STL(bc)) %>% components()
melsyd2bc_dcmp %>% autoplot(bc)ggplot(data=melsyd2bc_dcmp) + geom_qq(aes(sample=remainder )) + geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box-Cox Mel-Syd Passengers")Since the pedestrian dataset has many observations with zero count, we cannot use the Guerrero method to compute \(\lambda\) in the Box-Cox transformation. We also need to fill gaps in the pedestrian dataset with the prior available count as a best estimate. We do that below. We suggest \(\lambda=0.1\) as for the Box-Cox transformation but argue no parameter is satisfactory.
pedestrian %>%
filter(Sensor=="Southern Cross Station") -> ped1
#ped1 %>% fill_gaps(Count = as.integer(median(Count))) -> pedscs
ped1 %>% fill_gaps() %>%
tidyr::fill(Count, .direction = 'down') -> pedscsNote that zero counts are the most frequently observed count in the dataset as shown in the histogram below.
# Show that zero counts are frequent in the dataset.
hist(pedscs$Count, breaks = 40 )Construct a STL decomposition of the raw series.
xcmp = pedscs %>% model( stl = STL(Count)) %>% components()
xcmp %>% autoplot(Count)ggplot(data=xcmp) + geom_qq(aes(sample=remainder )) +
geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Pedestrians") In the above components plot, the STL algorithm decomposes the time series into trend and 3 seasonal components automatically. The annual, weekly and daily components show some unevenness in plots. The annual seasonality plot shows downward spikes and the daily seasonality plot shows some uneven clumping around 2016-01. Lastly, the residuals seem significant kurtosis. There are a number of large tail events.
By trial and error, we observe 0.1 allows reducing some of these residual tails and smooths out the clumping of the annual and daily plots as shown below. However, the Q-Q plot shows significant departures from normality at the tail. So we conclude Box-Cox does not solve the challenge of modeling this time series effectively.
lambda = 0.1
xcmp2 = pedscs %>%
mutate( bc = box_cox(Count, lambda) ) %>%
model( stl = STL(bc)) %>% components()
xcmp2 %>% autoplot(bc)ggplot(data=xcmp2) + geom_qq(aes(sample=remainder )) +
geom_qq_line(aes(sample=remainder)) + labs(title="QQNorm of Box Cox Pedestrians")Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) %>%
select(Gas)classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.gas <- tail(aus_production, 5*4) %>%
select(Gas)
autoplot(gas, Gas) +
labs(title="Original: Last 20 observations of Gas Production",
subtitle="aus_production" )# Seasonality is observed using gg_season plot.
gg_season(gas, Gas, period="year")classic_dcmp <- gas %>%
model(
classical_decomposition(Gas, type = "multiplicative")
) %>%
components()
classic_dcmp %>%
autoplot() +
labs(title =
"Classical multiplicative decomposition: Gas",
caption = "Austrialia Gas Production") c. The results in b. are consistent with the diagnostic plots in a. They extend the results of a. by fitting a trend through the raw data. d. The seasonally adjusted Gas production data from the classical decomposition is contained in the column
season_adjust. This is graphed below.
classic_dcmp %>% autoplot(season_adjust) + labs(caption = "Seasonally adjusted gas production") e. We shock the gas data at observation 10 of the original dataset to study a mid-range shock.
gasXMid = gas
gasXMid[10, "Gas"] = gas[10,"Gas"] + 300
classic_dcmp_mid <- gasXMid %>%
model(
classical_decomposition(Gas, type = "multiplicative")
) %>%
components()
classic_dcmp_mid %>%
autoplot() +
labs(title =
"Classical multiplicative decomposition: Gas Mid-Outlier",
caption = "Australia Gas Production")classic_dcmp_mid %>%
autoplot(season_adjust) +
labs(caption = "Seasonally adjusted gas production with Mid-Outlier")A large hump interrupts the smooth trend. We see that the outlier disrupts the seasonal pattern. Because outlier occurs at an observation in Q4, seasonality component shows an increase at Q4 instead of a decline in Q4.
Lastly, the range of the residual explodes. Previously, the residual component range about 2% from the base level of 1. When the outlier is added, the residual range increases to +20% to -20%.
gasXEnd = gas
gasXEnd[20, "Gas"] = gas[20,"Gas"] + 300
classic_dcmp_end <- gasXEnd %>%
model(
classical_decomposition(Gas, type = "multiplicative")
) %>%
components()
classic_dcmp_end %>%
autoplot() +
labs(title =
"Classical multiplicative decomposition: Gas End-Outlier",
caption = "Austrialia Gas Production")classic_dcmp_end %>%
autoplot(season_adjust) +
labs(caption = "Seasonally adjusted gas production with shock at end")The impact of an outlier at the last observation produces different effects than in the middle. First, the trend does not show a hump but rather a spike at the end. Second, the seasonal component shows an increase in Q2 but keeps the decline in Q4. In contrast, the middle of the range outlier causes an increase in Q4 but no impact on Q2. Third, the residuals show a smaller impact at the end. The decline is only 12% compared to a 40% increase in the mid-range outlier.
Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
Our solution first examines the multiplicative versions of the X11 and classical decompositions then compares their residuals in a labeled scatterplot to identify outliers.
Our retail time series data is replicated below with the same seed as in Exercise 8, Section 2.10.
# We replicate the Time Series used in HW 1 Exercise by using the same seed.
set.seed(392381)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`, 1 ))The X11 decomposition is computed below.
# Import the seasonal library where X_13ARIMA_SEATS is available
library(seasonal)
x11_dcmp <- myseries %>%
model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
components()
autoplot(x11_dcmp ) +
labs( title =
"Decomposition of Pharmaeutical, cosmetic, toiletry turnover using X-11",
caption = "Western Australia")The classical decomposition is computed below.
classic_dcmp <- myseries %>%
model(
classical_decomposition(Turnover, type = "multiplicative")
) %>%
components()
classic_dcmp %>%
autoplot() +
labs(title =
"Classical multiplicative decomposition: Pharmaceuticals, cosmetics, toiletry",
caption = "Western Australia")Lastly, we examine a scatter plot of the residuals of both the X11 decomposition (horizontal axis) to the classical decomposition (vertical axis) below. As the residuals are multiplicative, the non-outliers are centered around the point \((1,1)\) rather than the origin.
We define outliers as the datapoints where either X11 or classical residual exceed a 9.5% threshold. We use ggrepel and conditional labeling to identify outliers.
library(ggrepel)
turnover_compare = classic_dcmp %>% select(Month, random)
turnover_compare$x11err = x11_dcmp$irregular
ggplot(turnover_compare, aes(x=x11err, y=random)) +
geom_point(alpha=0.6) +
ggtitle("Residuals Between X11 and Classical Decomposition") +
geom_label_repel(aes(label=ifelse( abs(x11err - 1) > 0.095 |
abs(random - 1) > 0.095, as.character(Month), "") ) ,
color="red", point.padding = 0.3, box.padding = 0.3) +
xlab("X11 residual") + ylab("Classical residual") +
labs(caption="Outliers labeled if absolute residual exceeds 9.5%")The X11 decomposition reveals some interesting new features that I did not expect.
In general, X11 seems to do a better job of identify business cycles and deviations from trend than the classical decomposition. However, digging into Australian business cycles to investigate the above outliers and seasonal variation is of scope.
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
Figure 3.19: Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
Figure 3.20: Seasonal component from the decomposition shown in the previous figure.
The decomposition of the civilian labor force shows a steady positive upward trend. The trend could be explained by population growth which may be due to changes in birth rates or immigration. In 1995, the trend seems to reach 9,000,000 workers. The seasonal component of the labor force increases in March, September and December. For example, seasonal chart shows 70,000 workforce decline relative to trend in August which may be due to holiday seasonal work or summer school.
The recession of 1991/1992 is highly visible in the residuals and shows both a decline in employment followed by a sharp rebound in following months.