library(fpp3)
library(tsibbledata)
library(tidyverse)
library(seasonal)
3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9
Consider the GDP information in global_economy
. Plot the
GDP per capita for each country over time. Which country has the highest
GDP per capita? How has this changed over time?
head(global_economy)
## # A tsibble: 6 x 9 [1Y]
## # Key: Country [1]
## Country Code Year GDP Growth CPI Imports Exports Population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan AFG 1960 537777811. NA NA 7.02 4.13 8996351
## 2 Afghanistan AFG 1961 548888896. NA NA 8.10 4.45 9166764
## 3 Afghanistan AFG 1962 546666678. NA NA 9.35 4.88 9345868
## 4 Afghanistan AFG 1963 751111191. NA NA 16.9 9.17 9533954
## 5 Afghanistan AFG 1964 800000044. NA NA 18.1 8.89 9731361
## 6 Afghanistan AFG 1965 1006666638. NA NA 21.4 11.3 9938414
# Creating a variable called GDP per capita
gdp_pc <- global_economy %>%
mutate(GDP_per_capita = GDP / Population)
# The legend was in the way so, it was removed
gdp_pc %>%
autoplot(GDP_per_capita) +
labs(title = "GDP per capita for each country over time",
y = "GDP per capita") +
guides(colour = "none")
The plot is visible now but the countries are unidentifable so I will
just filter out the top 10 gdp/capita and plot them
top10_gdp_per_cap <- gdp_pc %>%
as_tibble() %>%
arrange(desc(GDP_per_capita)) %>%
slice_head(n = 10)
top10_gdp_per_cap
## # A tibble: 10 × 10
## Country Code Year GDP Growth CPI Imports Exports Population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Monaco MCO 2014 7060236168. 7.18 NA NA NA 38132
## 2 Monaco MCO 2008 6476490406. 0.732 NA NA NA 35853
## 3 Liechtenstein LIE 2014 6657170923. NA NA NA NA 37127
## 4 Liechtenstein LIE 2013 6391735894. NA NA NA NA 36834
## 5 Monaco MCO 2013 6553372278. 9.57 NA NA NA 37971
## 6 Monaco MCO 2016 6468252212. 3.21 NA NA NA 38499
## 7 Liechtenstein LIE 2015 6268391521. NA NA NA NA 37403
## 8 Monaco MCO 2007 5867916781. 14.4 NA NA NA 35111
## 9 Liechtenstein LIE 2016 6214633651. NA NA NA NA 37666
## 10 Monaco MCO 2015 6258178995. 4.94 NA NA NA 38307
## # ℹ 1 more variable: GDP_per_capita <dbl>
Monaco and Liechtenstein seem to have the highest gdp per capita. We can filter the original tstibble to just these two countries.
top_countries <- top10_gdp_per_cap %>%
pull(Country) %>% unique()
# Filter gdp_pc back to just those countries
gdp_pc %>%
filter(Country %in% top_countries) %>%
autoplot(GDP_per_capita) +
labs(title = "GDP per capita over time (Top countries)",
y = "GDP per capita (US$)")
## Warning: Removed 22 rows containing missing values or values outside the scale range
## (`geom_line()`).
Based on this plot, Monaco has the highest GDP per capita followed by Liechtenstein. Over time, their GDP per capita has scaled drastically compared to other countries. Based on the plot with all the countires, Monaco and Liechtenstein are the only two countries that have a GDP per capita over $150,000 after 2010. All of the other groups have steadily grown in GDP per capita but none as fast as Monaco and Liechtenstein.
For each of the following series, make a graph of the data. If
transforming seems appropriate, do so and describe the effect. - United
States GDP from global_economy
. - Slaughter of Victorian
“Bulls, bullocks and steers” in aus_livestock
. - Victorian
Electricity Demand from vic_elec
. - Gas production from
aus_production
.
us_gdp <- global_economy %>%
filter(Country == "United States")
autoplot(us_gdp, GDP) +
labs(title = "US GDP", y = "US$")
This time series shows expontenial growth so performing a log transformation would be appropraite here.
# Log transform
autoplot(us_gdp, log(GDP)) +
labs(title = "US GDP (log scale)", y = "log(US$)")
The log transformed plot looks more linear, more smooth compared to the previous plot of the US GDP. In this case, a log transformation was appropriate.
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers", State == "Victoria") %>%
autoplot(Count) +
labs(title = "Victorian Slaughter of Bulls, bullocks and steers", y = "Count")
This is count data so the y-axis is fairly steady. Unlike previously where there was exponentially growth over time, count data is more stable and to scale so transformation is not necessary.
vic_elec %>%
autoplot(Demand) +
labs(title = "Victorian Electricity Demand", y = "MW")
There seems some seasonality into be a spike in electricity demand at the beginning of every year. There doesn’t seem to be much variation so transformation is not necessary here.
aus_production %>%
autoplot(Gas) +
labs(title = "Australian Gas Production", y = "Terajoules")
There is strong upward growth and changes in variance in gas production. From 1960 Q1 to 1970, there does not seem to be much change in gas production. After 1970, there is a significant increase in gas production, which may warrant a log transformation.
aus_production %>%
mutate(log_Gas = log(Gas)) %>%
autoplot(log_Gas) +
labs(title = "Log of Australian Gas Production", y = "log(Terajoules)")
Performing a log variation of gas production stabilizes and smooths the strong upward growth.
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
canadian_gas %>%
autoplot(Volume) +
labs(title = "Canadian Gas Production", y = "Volume")
# Box-Cox transformation (automatic lambda)
lambda <- canadian_gas %>%
features(Volume, features = guerrero) %>%
pull(lambda_guerrero)
canadian_gas %>%
autoplot(box_cox(Volume, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed gas production with $\\lambda$ = ",
round(lambda,2))))
The original plot shows a strong upward trend and clear seasonality. A Box–Cox transformation is unhelpful because the variance is stable. Box-Cox transformation is useful when the seasonal variation changes in the series. The plot of the Box-Cox transformation looks almost identical to the orginial plot.
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
Exercise 7 in Section 2.10 looks at Monthly Australian retail data in
aus_retail
head(aus_retail)
## # A tsibble: 6 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Australian Capital Territory Cafes, restaurants… A3349849A 1982 Apr 4.4
## 2 Australian Capital Territory Cafes, restaurants… A3349849A 1982 May 3.4
## 3 Australian Capital Territory Cafes, restaurants… A3349849A 1982 Jun 3.6
## 4 Australian Capital Territory Cafes, restaurants… A3349849A 1982 Jul 4
## 5 Australian Capital Territory Cafes, restaurants… A3349849A 1982 Aug 3.6
## 6 Australian Capital Territory Cafes, restaurants… A3349849A 1982 Sep 4.2
STATE <- "Victoria"
INDUSTRY <- "Cafes, restaurants and takeaway food services"
retail <- aus_retail %>%
filter(State == STATE, Industry == INDUSTRY)
autoplot(retail, Turnover) +
labs(title = paste(STATE, "—", INDUSTRY),
y = "Million")
I plotted one series from Exercise 7. The time plot shows a strong upward trend over time. There is a clear seasonal pattern, with spikes in sales every December. The size of these seasonal peaks increases as the series grows, indicating changing seasonal variance. A Box–Cox transformation with λ ≈ 0 (log transform) is appropriate here as it stabilizes the variance and makes the seasonal fluctuations more proportional.
For the following series, find an appropriate Box-Cox transformation
in order to stabilise the variance. Tobacco from
aus_production
, Economy class passengers between Melbourne
and Sydney from ansett
, and Pedestrian counts at Southern
Cross Station from pedestrian.
aus_production %>%
autoplot(Tobacco) +
ggtitle("Tobacco Production")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
This time plot does not show a distinct change in seasonal variance. It is hard to tell if a Box-Cox transformation will have any benefit since the variance seems constant.
lambda <- aus_production %>%
features(Tobacco, features = guerrero) %>%
pull(lambda_guerrero)
aus_production %>%
autoplot(box_cox(Tobacco, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed gas production with $\\lambda$ = ",
round(lambda,2))))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
Here we can see that the box-cox transformation had very little effect on the orginal plot. Lambda is 0.93, indicating a Box-Cox transformation is not necessary.
ansett %>%
filter(Class == "Economy", Airports == "MEL-SYD") %>%
autoplot(Passengers) +
labs(title = "Economy Class Passengers MEL–SYD", y = "Passengers")
This plot shows that variance is not constant so a log transformation would be useful here.
x <- ansett %>%
filter(Class == "Economy", Airports == "MEL-SYD")
lambda <- x %>%
features(Passengers, guerrero) %>%
pull(lambda_guerrero)
x %>%
mutate(Passengers_bc = box_cox(Passengers, lambda)) %>%
autoplot(Passengers_bc) +
labs(y = "Box–Cox(Passengers)",
title = paste0("MEL–SYD Economy (λ = ", round(lambda, 2), ")"))
The Box-Cox transformation did not improve the variation in the plot. For some reason the lambda is equal to two, which may be due to the zeros in the data.
x %>%
mutate(Passengers_log = box_cox(Passengers + 1, 0)) %>%
autoplot(Passengers_log) +
labs(title = "MEL–SYD Economy (Box–Cox λ = 0, log transform)",
y = "log(Passengers + 1)")
I forced the Box–Cox parameter to λ = 0, which corresponds to a log transformation, and added 1 passenger to each value to remove the zeros. If we exclude the couple of weeks before 1990, we can see that the transformation stabilized the variance. Thus, a log box-cox transformation is appropriate in this series.
ped <- pedestrian %>%
filter(Sensor == "Southern Cross Station")
autoplot(ped, Count) +
labs(title = "Pedestrian counts — Southern Cross Station",
y = "Count")
This time series is a bit hard to see as it is hourly data over 2 years, Let’s shorten the time span to the first four weeks in Jan 2015.
ped %>%
filter_index("2015-01-01" ~ "2015-01-28") %>%
autoplot(Count) +
labs(title = "Southern Cross pedestrians — Jan 2015 (four weeks)")
There seems to be five spikes consistently. Since this is pedestrian data from the Southern Cross Station, we can assume that the peaks are during weekdays rush hour. The variance is not constant so we can use a log box-cox transforamtion.
lambda_ped <- ped %>%
features(Count, guerrero) %>%
pull(lambda_guerrero)
ped %>%
filter_index("2015-01-01" ~ "2015-01-28") %>%
mutate(Count_bc = box_cox(Count, lambda_ped)) %>%
autoplot(Count_bc) +
labs(title = paste0("Southern Cross Pedestrian counts (λ = ",
round(lambda_ped, 2), ")"),
y = "Box–Cox(Count)")
For this pedestrian series, the plot improved slightly with a box-cox transformation.The variation in spikes during the weekends increased slightly but still not at the peaks during the weekdays.
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) |> select(Gas)
autoplot(gas, Gas) + labs(title="Gas: last 5 years", y="Terajoules")
There is a upward trend and clear seasonality. There seems to be a spike the terajoules during the summer quarters and a drop in the colder quarters.
gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
autoplot() +
labs(title = "Classical multiplicative decomposition of Gas")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
There is a clear upward trend in gas production. The seasonal plot shows troughs at the beginning of the year and peaks around the middle of the year, indicating clear seasonality.
Part a showed upward movement in gas production and clear quarterly seasonality. This multiplicative classical decomposition confirms the gradual uptrend and seasonal fluctuations. The remainder is small compared to the total gas production, suggesting that most of the composition is explained by the trend and seasonal component.
dcmp <- gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) |>
components()
# seasonally adjusted series = Gas / seasonal
autoplot(dcmp, season_adjust) +
labs(title = "Seasonally adjusted Gas production",
y = "Seasonally adjusted Gas")
gas_outlier <- gas |> mutate(Gas = if_else(row_number()==1, Gas + 300, Gas))
dcmp_outlier <- gas_outlier |>
model(classical_decomposition(Gas, type = "multiplicative")) |>
components()
autoplot(dcmp_outlier, season_adjust) +
labs(title = "Seasonally adjusted Gas (with outlier)")
I added the outlier to the first row, which caused a spike in the beginning of the data series. This outlier distorted the trend component. It no longer shows a steady increase in gas production over time. However, the seasonal component does not show much of a change.
Yes, it would make a difference. An outlier near the end of a series has a large impact on the trend of the data.Classical decomposition uses moving averages to estimate the trend. If the outlier was at the end, it would not be included in the moving average window at the end so it will cause a sudden spike or drop on the trend plot. If the outlier was in the middle, that one outlier would be averaged out with the other numbers.
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
retail <- aus_retail %>%
filter(State == "Victoria",
Industry == "Cafes, restaurants and takeaway food services")
x11_dcmp <- retail %>%
model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
components()
autoplot(x11_dcmp) +
labs(title = "X-11 decomposition of Victorian retail turnover")
The decomposed series plot shows a long term trend over time. The seasonal component shows spikes in every December. The remainder shows irregular drops in turn at the beginning for 2000, which is not shown in the trend and seasonal component.
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
The STL decomposition shows a upward trend in labor force from 1978 to 1995. The seasonal component also shows the same peak and trough around the same time, indicating strong seasonality and low variations. The remainder shows a sharp, irregular drop in 1991 to 1992. However, the drop is only -400 which is small when compared to the trend which is around 9000. There this decline does not have a significant impact on the upward trend in civilian labor.
The recession of 1991/1992 is extremely noticeable in the remainder component as a large negative deviation. However, the trend continues upward despite the recession because only a small portion of the trend is attributed to the remainder.