Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

global_economy = pd.read_csv("global_economy.csv")
df = (
    global_economy
    .dropna(subset=["GDP", "Population"])
    .assign(
        gdp_pc=lambda x: x["GDP"] / x["Population"],
        year=lambda x: pd.to_datetime(x["ds"].astype(int), format="%Y"),
    )
)
df.head()
|   | unique_id   | Code | ds   | GDP          | Growth | CPI | Imports   | Exports  | Population | gdp_pc    | year       |
|---|-------------|------|------|--------------|--------|-----|-----------|----------|------------|-----------|------------|
| 0 | Afghanistan | AFG  | 1960 | 5.377778e+08 | NaN    | NaN | 7.024793  | 4.132233 | 8996351.0  | 59.777327 | 1960-01-01 |
| 1 | Afghanistan | AFG  | 1961 | 5.488889e+08 | NaN    | NaN | 8.097166  | 4.453443 | 9166764.0  | 59.878153 | 1961-01-01 |
| 2 | Afghanistan | AFG  | 1962 | 5.466667e+08 | NaN    | NaN | 9.349593  | 4.878051 | 9345868.0  | 58.492874 | 1962-01-01 |
| 3 | Afghanistan | AFG  | 1963 | 7.511112e+08 | NaN    | NaN | 16.863910 | 9.171601 | 9533954.0  | 78.782758 | 1963-01-01 |
| 4 | Afghanistan | AFG  | 1964 | 8.000000e+08 | NaN    | NaN | 18.055555 | 8.888893 | 9731361.0  | 82.208444 | 1964-01-01 |
Plot of GDP per capita for each country over time
Code
plt.figure(figsize=(10, 5))
for country, g in df.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=0.8, alpha=0.15)
plt.title("GDP per capita over time (all countries)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.tight_layout()
plt.show()
The plot above contains too many countries to read, so the plot below shows only the top 10 countries by GDP per capita in the latest available year.
Code
latest = (
    df.sort_values("year")
    .groupby("unique_id", as_index=False)
    .tail(1)
    .sort_values("gdp_pc", ascending=False)
)
top10_countries = latest["unique_id"].head(10).tolist()
top10 = df[df["unique_id"].isin(top10_countries)].copy()

plt.figure(figsize=(10, 5))
for country, g in top10.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=2, label=country)
plt.title("GDP per capita over time (Top 10 by latest year)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.legend(loc="upper left", bbox_to_anchor=(1.02, 1))
plt.tight_layout()
plt.show()
latest.head(10)
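The tally below counts how often each country holds the highest GDP per capita in a given year. The cell that produced it is not shown above, so the following is a minimal reconstruction (the top_country column name is taken from the output).
Code
# Top-ranked country per year by GDP per capita.
yearly_top = (
    df.loc[df.groupby("year")["gdp_pc"].idxmax(), ["year", "unique_id", "gdp_pc"]]
    .rename(columns={"unique_id": "top_country"})
    .sort_values("year")
)
yearly_top["top_country"].value_counts()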
top_country
Monaco                  43
United States            8
Kuwait                   2
United Arab Emirates     2
Liechtenstein            2
Luxembourg               1
Name: count, dtype: int64
Code
plt.figure(figsize=(10, 4))
plt.plot(yearly_top["year"], yearly_top["gdp_pc"])
plt.title("GDP per capita of the top-ranked country each year")
plt.xlabel("Year")
plt.ylabel("Top GDP per capita (USD)")
plt.tight_layout()
plt.show()
yearly_top.tail(15)
|      | year       | top_country   | gdp_pc        |
|------|------------|---------------|---------------|
| 9491 | 2003-01-01 | Monaco        | 108978.488860 |
| 9492 | 2004-01-01 | Monaco        | 123382.015835 |
| 9493 | 2005-01-01 | Monaco        | 124374.268481 |
| 9494 | 2006-01-01 | Monaco        | 133195.429339 |
| 9495 | 2007-01-01 | Monaco        | 167124.740985 |
| 9496 | 2008-01-01 | Monaco        | 180640.125115 |
| 9497 | 2009-01-01 | Monaco        | 149221.361937 |
| 9498 | 2010-01-01 | Monaco        | 144569.175786 |
| 9499 | 2011-01-01 | Monaco        | 162155.498619 |
| 9500 | 2012-01-01 | Monaco        | 152000.362070 |
| 8109 | 2013-01-01 | Liechtenstein | 173528.150454 |
| 9502 | 2014-01-01 | Monaco        | 185152.527227 |
| 8111 | 2015-01-01 | Liechtenstein | 167590.608272 |
| 9504 | 2016-01-01 | Monaco        | 168010.914891 |
| 8403 | 2017-01-01 | Luxembourg    | 104103.036747 |
GDP per capita was computed by dividing GDP by Population for each country and year, which puts countries of very different sizes on a comparable scale. The results show an overall upward trend in GDP per capita at the global level, with growth rates varying widely across economies. In 2017, Luxembourg records the highest GDP per capita, at around $104,103, presumably because Monaco's figure for that year is not available. Across the whole sample, however, Monaco tops the ranking most often, recording the highest GDP per capita in 43 of the 58 years.
Exercise 3.2
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
United States GDP from global_economy.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ge = pd.read_csv("global_economy.csv")
us = ge.query("unique_id == 'United States'")[["ds", "GDP"]].dropna()
us["ds"] = pd.to_datetime(us["ds"].astype(int), format="%Y")
us["log_GDP"] = np.log(us["GDP"])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(us["ds"], us["GDP"])
ax.set_title("United States GDP")
ax.set_xlabel("Year")
ax.set_ylabel("GDP")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(us["ds"], us["log_GDP"])
ax.set_title("United States GDP (log scale)")
ax.set_xlabel("Year")
ax.set_ylabel("log(GDP)")
plt.tight_layout()
plt.show()
The graphs show that the GDP series climbs steadily, with the rate of increase itself growing over time. The curve bends upward, suggesting multiplicative growth, and the later years dominate the scale, compressing the small fluctuations of the early years. After taking logs, the series becomes close to a straight line, which is characteristic of a constant rate of growth over time. The log transformation therefore stabilizes the variance and makes the patterns easier to interpret.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Code
liv = pd.read_csv("aus_livestock.csv")
liv["ds"] = pd.to_datetime(liv["ds"])
vic_bulls = liv[
    liv["unique_id"].str.contains("Victoria")
    & liv["unique_id"].str.contains("Bulls, bullocks and steers")
].copy()
vic_bulls["log_y"] = np.log(vic_bulls["y"])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(vic_bulls["ds"], vic_bulls["y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter")
ax.set_xlabel("Date")
ax.set_ylabel("Slaughter")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(vic_bulls["ds"], vic_bulls["log_y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Slaughter)")
plt.tight_layout()
plt.show()
The original series shows strong seasonality and large fluctuations over time, with bigger swings when the level is high. There are also signs of structural change, especially from the late 1970s to the early 1980s. After the log transformation, the ups and downs are more uniform in size across the series. The transformation is appropriate here because it damps the large fluctuations that occur at high levels.
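Victorian Electricity Demand from vic_elec.
The code for this series is not shown above; the sketch below is a minimal reconstruction, assuming a vic_elec.csv with half-hourly timestamps in ds and demand in y (mirroring the column convention of the other files used here), aggregated to average daily demand as the discussion describes.
Code
# Assumed layout: vic_elec.csv with half-hourly "ds" timestamps and demand "y".
vic_elec = pd.read_csv("vic_elec.csv")
vic_elec["ds"] = pd.to_datetime(vic_elec["ds"])
daily = vic_elec.set_index("ds")["y"].resample("D").mean()
log_daily = np.log(daily)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(daily.index, daily.values)
ax.set_title("Victoria: average daily electricity demand")
ax.set_xlabel("Date")
ax.set_ylabel("Demand")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(log_daily.index, log_daily.values)
ax.set_title("Victoria: average daily electricity demand (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Demand)")
plt.tight_layout()
plt.show()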
The average demand for electricity each day has strong seasonality and lots of wiggles, with some spikes during extreme periods. However, overall, the pattern remains fairly consistent over time, with no strong evidence that the variations increase with the level of the series. The overall pattern of the data remains essentially the same after the log transformation, with just a hint of compression of the peaks. In this case, the log transformation is not strictly necessary, since the variance appears fairly stable anyway.
Gas production from aus_production.
Code
ap = pd.read_csv("aus_production.csv")
ap["ds"] = pd.to_datetime(ap["ds"])
gas = ap[["ds", "Gas"]].dropna().rename(columns={"Gas": "y"})
gas["log_y"] = np.log(gas["y"])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(gas["ds"], gas["y"])
ax.set_title("Gas production (aus_production)")
ax.set_xlabel("Date")
ax.set_ylabel("Gas")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(gas["ds"], gas["log_y"])
ax.set_title("Gas production (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Gas)")
plt.tight_layout()
plt.show()
We see that the gas production data has a sharply increasing pattern with noticeable ups and downs, which become larger as the level increases. This is indicative of multiplicative seasonality, where the variance increases with the level of the series. However, once the logarithm is taken, the ups and downs become stable over time, and the pattern becomes more linear. The use of the logarithm is appropriate for this data because it stabilizes the variance and makes the multiplicative seasonality more additive.
Exercise 3.3
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
Code
gas = pd.read_csv("canadian_gas.csv")
gas["ds"] = pd.to_datetime(gas["ds"])

plt.figure(figsize=(10, 4))
plt.plot(gas["ds"], gas["y"])
plt.title("Canadian Gas Production")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.tight_layout()
plt.show()
The figure shows that the seasonal variation in Canadian gas production does not grow steadily with the level of the series: it is small in the early years, largest through roughly the 1970s and 1980s, and smaller again toward the end, even though the level keeps rising. A Box-Cox transformation can only stabilize variation that changes monotonically with the level of the series, so no single value of λ can make the seasonal fluctuations uniform here. That is why the transformation is unhelpful for this data.
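To see this, one can apply a Box-Cox transformation and check that the seasonal variation remains uneven. A minimal sketch follows, using scipy's maximum-likelihood estimate of λ; the mid-series bulge in seasonal variation stays visible in the transformed plot.
Code
from scipy.stats import boxcox

# Fit lambda by maximum likelihood and plot the transformed series.
transformed, lam = boxcox(gas["y"].values)

plt.figure(figsize=(10, 4))
plt.plot(gas["ds"], transformed)
plt.title(f"Canadian Gas Production, Box-Cox transformed (lambda = {lam:.2f})")
plt.xlabel("Date")
plt.ylabel("Transformed gas")
plt.tight_layout()
plt.show()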
Exercise 3.4
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
The Box-Cox parameter is estimated at approximately λ = 0.26, which indicates that a moderate transformation is required. Since the estimate is closer to 0 than to 1, a simple logarithmic transformation would be a reasonable choice; however, using the Box-Cox transformation with λ ≈ 0.26 follows the data more closely.
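The λ estimate above is reported without its code. One way to obtain such an estimate is scipy's maximum-likelihood fit, sketched below under the assumption that myseries holds the retail data from Exercise 7 of Section 2.10 with a Turnover column; the MLE may differ slightly from a Guerrero-style estimate.
Code
from scipy.stats import boxcox

# "myseries" with a "Turnover" column is assumed from Exercise 7, Section 2.10.
turnover = myseries["Turnover"].dropna().values
_, lambda_retail = boxcox(turnover)  # maximum-likelihood estimate of lambda
round(lambda_retail, 2)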
Exercise 3.5
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.
Code
import pandas as pd
import numpy as np
from scipy.stats import boxcox

aus_prod = pd.read_csv("aus_production.csv")
aus_prod["ds"] = pd.to_datetime(aus_prod["ds"])
tobacco = aus_prod["Tobacco"].dropna().values
_, lambda_tobacco = boxcox(tobacco)
lambda_tobacco
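The source shows code only for the Tobacco series. A sketch for the other two series follows, assuming ansett.csv and pedestrian.csv files whose unique_id values identify the Melbourne–Sydney Economy class series and Southern Cross Station; the exact labels are assumptions mirroring the CSV convention used above.
Code
# Hypothetical file and label layout, mirroring the other CSVs in this document.
ansett = pd.read_csv("ansett.csv")
econ = ansett[
    ansett["unique_id"].str.contains("MEL-SYD")     # assumed route label
    & ansett["unique_id"].str.contains("Economy")   # assumed class label
]
y = econ["y"].dropna()
_, lambda_econ = boxcox(y[y > 0].values)  # Box-Cox requires positive values

ped = pd.read_csv("pedestrian.csv")
sc = ped[ped["unique_id"].str.contains("Southern Cross")]
y = sc["y"].dropna()
_, lambda_ped = boxcox(y[y > 0].values)

print(lambda_econ, lambda_ped)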
The Box-Cox analysis of these series indicates that Tobacco production (aus_production) has an estimated λ = 1.41, close enough to 1 that little or no transformation is required. The estimate of λ for the Economy class passengers series (ansett) is approximately 0.34, so that series calls for a moderate transformation close to a cube root. The pedestrian counts at Southern Cross Station have an estimated λ near 0, at approximately 0.15, so a logarithmic transformation is appropriate.
Exercise 3.7
Consider the last five years of the Gas data from aus_production.
gas = aus_production.loc[lambda x: x["unique_id"] == "Gas"].tail(5 * 4)
Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
Code
import pandas as pd
import matplotlib.pyplot as plt

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])
gas = aus_production[["ds", "Gas"]].dropna().tail(5 * 4)

plt.figure()
plt.plot(gas["ds"], gas["Gas"])
plt.title("Gas Production – Last 5 Years")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.show()
Over the last five years, gas production shows clear seasonality, with a pattern that repeats within each year. There is also an evident upward trend, with production increasing over time. Within such a short window, no cycles beyond the seasonality are apparent.
Use seasonal_decompose with model='multiplicative' to calculate the trend-cycle and seasonal indices.
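The decomposition code is not shown above; a minimal sketch using statsmodels follows, reusing the quarterly gas series from part (a) and passing period=4 explicitly since the index carries no frequency information.
Code
from statsmodels.tsa.seasonal import seasonal_decompose

# Quarterly data, so period=4; multiplicative model as the question specifies.
gas_ts = gas.set_index("ds")["Gas"]
result = seasonal_decompose(gas_ts, model="multiplicative", period=4)
result.plot()
plt.show()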
The multiplicative decomposition shows a smooth, increasing trend-cycle together with seasonal indices that repeat every four quarters and stay nearly constant over time, which is consistent with the multiplicative form of the model. The residuals are small relative to the level of the series.
Do the results support the graphical interpretation from part a?
Yes, the decomposition confirms the picture provided by the graph in part (a). The trend component matches the rising pattern in the plot, while the seasonal component captures the regular quarterly fluctuations. Overall, the decomposition agrees well with the visual interpretation.
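Compute and plot the seasonally adjusted data.
No code is shown for this part; a minimal sketch follows, dividing out the seasonal component, since for a multiplicative decomposition the seasonally adjusted series is observed / seasonal.
Code
# Multiplicative decomposition: seasonally adjusted = observed / seasonal.
seas_adj = result.observed / result.seasonal

plt.figure()
plt.plot(gas_ts.index, gas_ts, alpha=0.5, label="Original")
plt.plot(seas_adj.index, seas_adj, label="Seasonally adjusted")
plt.title("Gas Production – Seasonally Adjusted")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.legend()
plt.show()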
The seasonal adjustment removes the regular quarterly ups and downs while preserving the overall upward direction of the data. Once seasonally adjusted, the series looks smoother, and the overall increase in gas production becomes more evident because the seasonal swings no longer dominate.
Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
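No code is shown for this part either; the sketch below adds 300 to a middle observation (an arbitrary choice) and recomputes the seasonally adjusted series for comparison.
Code
# Add 300 to one observation in the middle of the series.
gas_outlier = gas_ts.copy()
gas_outlier.iloc[len(gas_outlier) // 2] += 300

result_out = seasonal_decompose(gas_outlier, model="multiplicative", period=4)
seas_adj_out = result_out.observed / result_out.seasonal

plt.figure()
plt.plot(seas_adj.index, seas_adj, label="Seasonally adjusted (original)")
plt.plot(seas_adj_out.index, seas_adj_out, label="Seasonally adjusted (with outlier)")
plt.title("Effect of an outlier on the seasonally adjusted data")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.legend()
plt.show()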
With one outlier in the middle of the data, the trend estimate is warped around that point and the remainder shows a large spike there. The decomposition absorbs part of the extreme value into its trend estimate, which also distorts the seasonally adjusted series near the outlier. This demonstrates the sensitivity of classical decomposition to extreme values.
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
When an outlier appears near the end of a series, it distorts the trend estimate more strongly. Decomposition methods based on moving averages have fewer observations to average over at the boundaries, so the edge estimates are especially sensitive to outliers. The trend at the end is therefore affected more severely than it would be by an outlier in the middle of the series.
Exercise 3.8
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using the MSTL method. Does it reveal any outliers, or unusual features that you had not noticed previously?
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import MSTL

# "myseries" is the retail series selected in Exercise 7 of Section 2.10.
myseries["Month"] = pd.to_datetime(myseries["Month"])
retail_ts = myseries.sort_values("Month").set_index("Month")["Turnover"]

mstl = MSTL(retail_ts, periods=12)
res = mstl.fit()
res.plot()
plt.show()
The MSTL method decomposes the retail series into three components: a smooth, steadily rising trend, a strong annual seasonal component whose magnitude increases slightly over time, and a remainder. The remainder draws attention to some unusual features, particularly the large fluctuations in 2008-2009 and 2011-2012 that were not as prominent in the initial plot of the series.
Exercise 3.9
Figures 3.13 and 3.14 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
As shown in Figures 3.13 and 3.14, the STL decomposition of Australia's monthly civilian labour force from February 1978 to August 1995 separates the original series into trend, seasonal, and remainder components.
Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?
The Australian civilian labour force increases steadily over the whole period, with growth clearly visible from 1978 to 1995. The seasonal component shows regular, predictable ups and downs each year, and its peaks and troughs are small relative to the level of the series, indicating stable seasonality. The remainder captures the short-term variation, with some notable movements away from zero.
Between 1991 and 1992, the trend part of the series shows a brief slowdown in growth, with the rate of growth leveling off. The remainder part becomes more volatile during these years, indicating unusual short-run disruptions. However, the seasonality remains much the same, implying that the recession affected the trend, not the seasonality.