DATA624 - Homework 2

Author

Anthony Josue Roman

Exercise 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

global_economy = pd.read_csv("global_economy.csv")

df = (
    global_economy
    .dropna(subset=["GDP", "Population"])
    .assign(
        gdp_pc=lambda x: x["GDP"] / x["Population"],
        year=lambda x: pd.to_datetime(x["ds"].astype(int), format="%Y")
    )
)

df.head()
unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
0 Afghanistan AFG 1960 5.377778e+08 NaN NaN 7.024793 4.132233 8996351.0 59.777327 1960-01-01
1 Afghanistan AFG 1961 5.488889e+08 NaN NaN 8.097166 4.453443 9166764.0 59.878153 1961-01-01
2 Afghanistan AFG 1962 5.466667e+08 NaN NaN 9.349593 4.878051 9345868.0 58.492874 1962-01-01
3 Afghanistan AFG 1963 7.511112e+08 NaN NaN 16.863910 9.171601 9533954.0 78.782758 1963-01-01
4 Afghanistan AFG 1964 8.000000e+08 NaN NaN 18.055555 8.888893 9731361.0 82.208444 1964-01-01

Plot of GDP per capita for each country over time

Code
plt.figure(figsize=(10, 5))

for country, g in df.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=0.8, alpha=0.15)

plt.title("GDP per capita over time (all countries)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.tight_layout()
plt.show()

The plot above shows every country at once, so the plot below restricts attention to the top 10 countries by GDP per capita in their latest available year to keep it readable.

Code
latest = (
    df.sort_values("year")
      .groupby("unique_id", as_index=False)
      .tail(1)
      .sort_values("gdp_pc", ascending=False)
)

top10_countries = latest["unique_id"].head(10).tolist()

top10 = df[df["unique_id"].isin(top10_countries)].copy()

plt.figure(figsize=(10, 5))
for country, g in top10.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=2, label=country)

plt.title("GDP per capita over time (Top 10 by latest year)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.legend(loc="upper left", bbox_to_anchor=(1.02, 1))
plt.tight_layout()
plt.show()

latest.head(10)

unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
9504 Monaco MCO 2016 6.468252e+09 3.213849 NaN NaN NaN 38499.0 168010.914891 2016-01-01
8112 Liechtenstein LIE 2016 6.214634e+09 NaN NaN NaN NaN 37666.0 164993.194162 2016-01-01
8403 Luxembourg LUX 2017 6.240446e+10 2.297941 111.413231 193.969823 230.016430 599449.0 104103.036747 2017-01-01
1329 Bermuda BMU 2013 5.573710e+09 -2.511154 NaN 29.592516 47.668268 65001.0 85748.065414 2013-01-01
8461 Macao SAR, China MAC 2017 5.036120e+10 9.096104 136.099091 32.009975 79.395298 622567.0 80892.821329 2017-01-01
13439 Switzerland CHE 2017 6.788873e+11 1.086792 98.266862 53.927473 64.976735 8466017.0 80189.696861 2017-01-01
6662 Isle of Man IMN 2016 6.592628e+09 7.400000 NaN NaN NaN 83737.0 78730.162285 2016-01-01
10491 Norway NOR 2017 3.988320e+11 1.918746 114.550719 33.051905 35.472423 5282223.0 75504.566255 2017-01-01
2483 Channel Islands CHI 2007 1.151461e+10 5.898330 NaN NaN NaN 156513.0 73569.644965 2007-01-01
6083 Iceland ISL 2017 2.390929e+10 3.642227 121.956961 42.848779 46.963832 341284.0 70056.873392 2017-01-01
Code
max_year = df["year"].max()
latest_year_df = df[df["year"] == max_year].copy()

top_latest_year = latest_year_df.sort_values("gdp_pc", ascending=False).head(10)
top_latest_year
unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
8403 Luxembourg LUX 2017 6.240446e+10 2.297941 111.413231 193.969823 230.016430 599449.0 104103.036747 2017-01-01
8461 Macao SAR, China MAC 2017 5.036120e+10 9.096104 136.099091 32.009975 79.395298 622567.0 80892.821329 2017-01-01
13439 Switzerland CHE 2017 6.788873e+11 1.086792 98.266862 53.927473 64.976735 8466017.0 80189.696861 2017-01-01
10491 Norway NOR 2017 3.988320e+11 1.918746 114.550719 33.051905 35.472423 5282223.0 75504.566255 2017-01-01
6083 Iceland ISL 2017 2.390929e+10 3.642227 121.956961 42.848779 46.963832 341284.0 70056.873392 2017-01-01
6605 Ireland IRL 2017 3.337308e+11 7.802382 105.079586 87.878319 120.012529 4813608.0 69330.690154 2017-01-01
11477 Qatar QAT 2017 1.669286e+11 1.579850 115.857023 37.257003 51.042428 2639211.0 63249.422433 2017-01-01
14483 United States USA 2017 1.939060e+13 2.273339 112.411557 NaN NaN 325719178.0 59531.661964 2017-01-01
10375 North America NAC 2017 2.104998e+13 2.348806 NaN NaN NaN 362492702.0 58070.066137 2017-01-01
12143 Singapore SGP 2017 3.239072e+11 3.618542 113.266146 149.082919 173.345205 5612253.0 57714.296631 2017-01-01
Code
yearly_top = (
    df.sort_values(["year", "gdp_pc"], ascending=[True, False])
      .groupby("year", as_index=False)
      .head(1)
      .rename(columns={"unique_id": "top_country"})
      .loc[:, ["year", "top_country", "gdp_pc"]]
)

top_counts = yearly_top["top_country"].value_counts().head(15)
top_counts
top_country
Monaco                  43
United States            8
Kuwait                   2
United Arab Emirates     2
Liechtenstein            2
Luxembourg               1
Name: count, dtype: int64
Code
plt.figure(figsize=(10, 4))
plt.plot(yearly_top["year"], yearly_top["gdp_pc"])
plt.title("GDP per capita of the top-ranked country each year")
plt.xlabel("Year")
plt.ylabel("Top GDP per capita (USD)")
plt.tight_layout()
plt.show()

yearly_top.tail(15)

year top_country gdp_pc
9491 2003-01-01 Monaco 108978.488860
9492 2004-01-01 Monaco 123382.015835
9493 2005-01-01 Monaco 124374.268481
9494 2006-01-01 Monaco 133195.429339
9495 2007-01-01 Monaco 167124.740985
9496 2008-01-01 Monaco 180640.125115
9497 2009-01-01 Monaco 149221.361937
9498 2010-01-01 Monaco 144569.175786
9499 2011-01-01 Monaco 162155.498619
9500 2012-01-01 Monaco 152000.362070
8109 2013-01-01 Liechtenstein 173528.150454
9502 2014-01-01 Monaco 185152.527227
8111 2015-01-01 Liechtenstein 167590.608272
9504 2016-01-01 Monaco 168010.914891
8403 2017-01-01 Luxembourg 104103.036747

GDP per capita was computed by dividing GDP by Population for each country and year, which makes countries and years directly comparable. The computations show an overall upward trend in GDP per capita at the global level, with growth rates varying across economies. In 2017, the latest year in the data, Luxembourg has the highest GDP per capita at around $104,103, though Monaco and Liechtenstein, which usually rank higher, have no 2017 observation. Across the full sample, Monaco tops the ranking most often, holding the highest GDP per capita in 43 of the years.
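
As a quick check on those last two points, the snippet below (assuming the df frame built above is still in scope) prints the last available year and GDP per capita for the usual front-runners, confirming that Monaco and Liechtenstein drop out of the data before 2017:

Code
# Last available observation for the countries that usually top the ranking.
# Assumes `df` from the first code block is still in scope.
for c in ["Monaco", "Liechtenstein", "Luxembourg"]:
    sub = df[df["unique_id"] == c].sort_values("year")
    print(c, sub["year"].max().year, round(sub["gdp_pc"].iloc[-1]))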

Exercise 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

  a. United States GDP from global_economy.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ge = pd.read_csv("global_economy.csv")
us = ge.query("unique_id == 'United States'")[["ds","GDP"]].dropna()
us["ds"] = pd.to_datetime(us["ds"].astype(int), format="%Y")
us["log_GDP"] = np.log(us["GDP"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(us["ds"], us["GDP"])
ax.set_title("United States GDP")
ax.set_xlabel("Year")
ax.set_ylabel("GDP")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(us["ds"], us["log_GDP"])
ax.set_title("United States GDP (log scale)")
ax.set_xlabel("Year")
ax.set_ylabel("log(GDP)")
plt.tight_layout()
plt.show()

The graphs show that GDP climbs steadily, and at an increasing rate: the curve bends upward, suggesting multiplicative growth, and the later years dominate the scale, compressing the small fluctuations of the early years. After taking logs, the series becomes close to a straight line, which is characteristic of a roughly constant growth rate over time. The log transformation therefore stabilizes the variance and makes the pattern easier to interpret.
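
To put a number on that roughly constant growth rate, one can fit a straight line to log GDP against the year; the slope approximates the average annual growth rate. A minimal sketch, reusing the us frame from above:

Code
# Fit log(GDP) ~ year with least squares; the slope of the line is
# approximately the average annual growth rate of GDP.
years = us["ds"].dt.year.to_numpy()
slope, intercept = np.polyfit(years, us["log_GDP"].to_numpy(), 1)
print(f"Implied average annual GDP growth: {slope:.2%}")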

  b. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Code
liv = pd.read_csv("aus_livestock.csv")
liv["ds"] = pd.to_datetime(liv["ds"])
vic_bulls = liv[liv["unique_id"].str.contains("Victoria") & liv["unique_id"].str.contains("Bulls, bullocks and steers")].copy()
vic_bulls["log_y"] = np.log(vic_bulls["y"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(vic_bulls["ds"], vic_bulls["y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter")
ax.set_xlabel("Date")
ax.set_ylabel("Slaughter")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(vic_bulls["ds"], vic_bulls["log_y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Slaughter)")
plt.tight_layout()
plt.show()

The original series shows strong seasonal patterns and large fluctuations, with the swings being larger when the level of the series is high; there are also signs of structural change from the late 1970s to the early 1980s. After the log transformation, the ups and downs are more uniform in size across time. The log transformation is appropriate here because it damps the large swings that occur when the level is high.
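
One way to check that the swings really are more uniform after logging is to compare a rolling standard deviation of the raw and logged series. This is only a rough diagnostic, and the 12-month window is an arbitrary choice:

Code
# Rolling standard deviation as a rough variance diagnostic:
# a flatter line for the logged series means more stable variance.
win = 12  # arbitrary window length
fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
axes[0].plot(vic_bulls["ds"], vic_bulls["y"].rolling(win).std())
axes[0].set_title("Rolling std: raw slaughter counts")
axes[1].plot(vic_bulls["ds"], vic_bulls["log_y"].rolling(win).std())
axes[1].set_title("Rolling std: logged slaughter counts")
plt.tight_layout()
plt.show()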

  c. Victorian Electricity Demand from vic_elec.
Code
ve = pd.read_csv("vic_elec.csv")
ve["ds"] = pd.to_datetime(ve["ds"])

daily = (
    ve.query("unique_id == 'Demand'")
      .set_index("ds")["y"]
      .resample("D")
      .mean()
      .reset_index()
      .rename(columns={"y":"Demand"})
)

daily["log_Demand"] = np.log(daily["Demand"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(daily["ds"], daily["Demand"])
ax.set_title("Victoria Electricity Demand (daily average)")
ax.set_xlabel("Date")
ax.set_ylabel("Demand")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(daily["ds"], daily["log_Demand"])
ax.set_title("Victoria Electricity Demand (daily average, log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Demand)")
plt.tight_layout()
plt.show()

Average daily electricity demand shows strong seasonality and considerable day-to-day variation, with spikes during extreme periods. Overall, however, the pattern is fairly consistent over time, with no strong evidence that the size of the fluctuations grows with the level of the series. After the log transformation the series looks essentially the same, with only a slight compression of the peaks, so the transformation is not strictly necessary here: the variance appears fairly stable without it.
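
That visual judgement can be backed up by letting scipy estimate a Box-Cox λ for the daily series; an estimate near 1 would confirm that little transformation is needed. A sketch, assuming demand is strictly positive:

Code
from scipy.stats import boxcox

# MLE estimate of the Box-Cox parameter; a lambda near 1 means the
# untransformed series is already close to variance-stable.
demand = daily["Demand"].dropna()
_, lam_demand = boxcox(demand[demand > 0].values)
print(f"Estimated Box-Cox lambda for daily demand: {lam_demand:.2f}")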

  d. Gas production from aus_production.
Code
ap = pd.read_csv("aus_production.csv")
ap["ds"] = pd.to_datetime(ap["ds"])
gas = ap[["ds","Gas"]].dropna().rename(columns={"Gas":"y"})
gas["log_y"] = np.log(gas["y"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(gas["ds"], gas["y"])
ax.set_title("Gas production (aus_production)")
ax.set_xlabel("Date")
ax.set_ylabel("Gas")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(gas["ds"], gas["log_y"])
ax.set_title("Gas production (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Gas)")
plt.tight_layout()
plt.show()

The gas production series rises sharply, with seasonal swings that grow as the level increases. This is indicative of multiplicative seasonality, where the variance increases with the level of the series. Once the logarithm is taken, the swings become roughly constant in size and the trend looks more linear. The log transformation is appropriate for this data because it stabilizes the variance and makes the multiplicative seasonality approximately additive.
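
Since the seasonal swings scale with the level here, it is worth estimating a Box-Cox λ for this series as well; a value near 0 would support the log choice. Same caveats as above, and assuming production is strictly positive:

Code
from scipy.stats import boxcox

# Estimate lambda for the Gas series; a value near 0 supports a log transform.
_, lam_gas = boxcox(gas["y"].values)
print(f"Estimated Box-Cox lambda for Gas: {lam_gas:.2f}")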

Exercise 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

Code
gas = pd.read_csv("canadian_gas.csv")
gas["ds"] = pd.to_datetime(gas["ds"])

plt.figure(figsize=(10,4))
plt.plot(gas["ds"], gas["y"])
plt.title("Canadian Gas Production")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.tight_layout()
plt.show()

Code
from statsmodels.tsa.seasonal import STL

# canadian_gas is a monthly series, so the seasonal period is 12
stl = STL(gas["y"], period=12)
res = stl.fit()

fig = res.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

Code
gas["log_y"] = np.log(gas["y"])

stl_log = STL(gas["log_y"], period=12)
res_log = stl_log.fit()

fig = res_log.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

The figures show why a Box-Cox transformation is unhelpful here. In the canadian_gas series, the size of the seasonal fluctuations does not grow steadily with the level: it is small in the early years, largest through the 1970s and 1980s, and smaller again afterwards, even though the level keeps rising. A Box-Cox transformation can only rescale the variance monotonically with the level of the series, so no single value of λ can stabilize variation that rises and then falls; the decomposition of the logged series above still shows a seasonal component that changes in size over time. A method that lets the seasonal component evolve, such as STL, handles this series better than any Box-Cox transformation.
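
A quick way to see this non-monotonic variance is to plot a rolling standard deviation of the logged series: if a log (or any Box-Cox) transformation worked, the line would be roughly flat. The 24-month window is an arbitrary choice:

Code
# Rolling std of log(y); a Box-Cox transform can only rescale variance
# monotonically with the level, so a rise-then-fall pattern here shows
# why no single lambda can stabilize this series.
plt.figure(figsize=(10, 3))
plt.plot(gas["ds"], gas["log_y"].rolling(24).std())
plt.title("24-month rolling std of log Canadian gas production")
plt.tight_layout()
plt.show()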

Exercise 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import lag_plot

aus_retail = pd.read_csv("aus_retail.csv")
aus_retail["Month"] = pd.to_datetime(aus_retail["Month"])

np.random.seed(123)
random_series_id = np.random.choice(aus_retail["Series ID"].unique(), 1)[0]

myseries = aus_retail.query("`Series ID` == @random_series_id").copy()
myseries = myseries.sort_values("Month")

myseries.head()
State Industry Series ID Month Turnover
45569 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-04-01 4.0
45570 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-05-01 4.0
45571 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-06-01 3.9
45572 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-07-01 4.4
45573 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-08-01 4.2
Code
plt.figure(figsize=(10,4))
plt.plot(myseries["Month"], myseries["Turnover"])
plt.title(f"Retail Series: {random_series_id}")
plt.xlabel("Month")
plt.ylabel("Turnover")
plt.tight_layout()
plt.show()

Code
decomp = seasonal_decompose(myseries["Turnover"], period=12, model="additive")
fig = decomp.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

Code
plt.figure(figsize=(4,4))
lag_plot(myseries["Turnover"])
plt.title("Lag Plot")
plt.tight_layout()
plt.show()

Code
fig, ax = plt.subplots(figsize=(8, 4))
plot_acf(myseries["Turnover"], lags=48, ax=ax)
plt.tight_layout()
plt.show()

Code
from scipy.stats import boxcox

y = myseries["Turnover"].values
y_positive = y[y > 0]

_, lambda_est = boxcox(y_positive)
lambda_est
np.float64(0.2638512890190425)

The Box-Cox parameter is estimated at approximately λ ≈ 0.26, indicating that a moderate transformation is required. Because the estimate is closer to 0 than to 1, a simple log transformation (λ = 0) would be a reasonable choice, though applying the estimated λ ≈ 0.26 directly stabilizes the variance slightly better.
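
To see the effect, the fitted λ can be applied with scipy.special.boxcox, which computes w = (y^λ − 1)/λ. A sketch reusing the objects above, and assuming turnover is strictly positive:

Code
from scipy.special import boxcox as boxcox_transform

# Apply the fitted lambda: w = (y**lam - 1) / lam
myseries["Turnover_bc"] = boxcox_transform(myseries["Turnover"].values, lambda_est)

plt.figure(figsize=(10, 4))
plt.plot(myseries["Month"], myseries["Turnover_bc"])
plt.title(f"Box-Cox transformed turnover (lambda = {lambda_est:.2f})")
plt.xlabel("Month")
plt.ylabel("Transformed turnover")
plt.tight_layout()
plt.show()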

Exercise 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Code
import pandas as pd
import numpy as np
from scipy.stats import boxcox

aus_prod = pd.read_csv("aus_production.csv")
aus_prod["ds"] = pd.to_datetime(aus_prod["ds"])

tobacco = aus_prod["Tobacco"].dropna().values

_, lambda_tobacco = boxcox(tobacco)
lambda_tobacco
np.float64(1.4081158422710829)
Code
ansett = pd.read_csv("ansett.csv")
ansett["ds"] = pd.to_datetime(ansett["ds"])

# Restrict to Economy class on the Melbourne-Sydney route; assumes the
# route is stored in an "Airports" column with value "MEL-SYD".
econ = ansett[(ansett["Class"] == "Economy") & (ansett["Airports"] == "MEL-SYD")]
econ_series = econ["y"].dropna()

if econ_series.min() <= 0:
    econ_shifted = econ_series - econ_series.min() + 1
else:
    econ_shifted = econ_series

_, lambda_econ = boxcox(econ_shifted.values)
lambda_econ
np.float64(0.34248932972035595)
Code
ped = pd.read_csv("pedestrian.csv")

# Restrict to the Southern Cross Station sensor; assumes sensor names
# are stored in the unique_id column.
sc = ped[ped["unique_id"].str.contains("Southern Cross", na=False)]
counts = sc["y"].dropna()

if counts.min() <= 0:
    counts = counts - counts.min() + 1

_, lambda_ped = boxcox(counts.values)
lambda_ped
np.float64(0.14795296616733997)

The Box-Cox analysis of the three series indicates the following. Tobacco production (aus_production) has an estimated λ ≈ 1.41; since λ = 1 corresponds to no transformation beyond a shift, little or no transformation is required. The Economy class passenger series (ansett) has λ ≈ 0.34, calling for a moderate transformation close to a cube root (λ = 1/3). The pedestrian counts at Southern Cross Station have λ ≈ 0.15, close to 0, so a logarithmic transformation is appropriate.
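
As a usage sketch, each estimated λ can be plugged into scipy.special.boxcox, which applies the transformation w = (y^λ − 1)/λ with a fixed λ (assuming the arrays from the blocks above are still in scope):

Code
from scipy.special import boxcox as bc

# Apply each estimated lambda to its (strictly positive) series.
# lambda = 1 only rescales linearly; lambda near 0 behaves like a log.
tobacco_bc = bc(tobacco, lambda_tobacco)
econ_bc = bc(econ_shifted.values, lambda_econ)
ped_bc = bc(counts.values, lambda_ped)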

Exercise 3.7

Consider the last five years of the Gas data from aus_production.

gas = aus_production.loc[lambda x: x["unique_id"] == "Gas"].tail(5 * 4)

  a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
Code
import pandas as pd
import matplotlib.pyplot as plt

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])

gas = aus_production[["ds", "Gas"]].dropna().tail(5 * 4)

plt.figure()
plt.plot(gas["ds"], gas["Gas"])
plt.title("Gas Production – Last 5 Years")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.show()

Over the last five years, gas production shows clear seasonal fluctuations, with a pattern that repeats within each year, together with an upward trend as production rises over time. Within such a short window, no cycles other than the seasonal pattern are evident.
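
The seasonal pattern can be made explicit by averaging production within each quarter of the year, a quick check on the same 20 observations:

Code
# Mean production by quarter of the year; a clear spread across quarters
# confirms the within-year seasonality visible in the plot.
quarterly_means = gas.groupby(gas["ds"].dt.quarter)["Gas"].mean()
print(quarterly_means)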

  b. Use seasonal_decompose with model='multiplicative' to calculate the trend-cycle and seasonal indices.
Code
from statsmodels.tsa.seasonal import seasonal_decompose

gas_ts = gas.set_index("ds")["Gas"]

decomp = seasonal_decompose(gas_ts, model="multiplicative", period=4)

decomp.plot()
plt.show()

trend = decomp.trend
seasonal = decomp.seasonal

The multiplicative decomposition shows a smooth, rising trend-cycle together with seasonal factors that repeat every quarter. The seasonal factors stay roughly constant as proportions of the level, which is exactly what the multiplicative form assumes, and the residuals are small relative to the level of the series.
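
The four quarterly seasonal indices can be read off directly from the seasonal component, since seasonal_decompose repeats the same factor for each quarter:

Code
# One seasonal index per quarter; factors above 1 mark high-production
# quarters, factors below 1 mark low ones.
indices = seasonal.groupby(seasonal.index.quarter).first()
print(indices)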

  c. Do the results support the graphical interpretation from part a?

Yes, the decomposition confirms the picture provided by the graph in (a). The trend component confirms the rising pattern shown in the graph, while the seasonal component reflects the periodic fluctuations every three months. Overall, the decomposition fits the visual information nicely.

  d. Compute and plot the seasonally adjusted data.
Code
gas_sa = gas_ts / seasonal

plt.figure()
plt.plot(gas_ts, label="Original")
plt.plot(gas_sa, label="Seasonally Adjusted")
plt.legend()
plt.title("Seasonally Adjusted Gas Production")
plt.show()

Seasonal adjustment removes the regular quarterly ups and downs while keeping the overall upward direction. The adjusted series is smoother, making the underlying growth in gas production more evident, and the seasonal swings no longer dominate.

  e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Code
gas_outlier = gas_ts.copy()
midpoint = len(gas_outlier) // 2
gas_outlier.iloc[midpoint] += 300

decomp_outlier = seasonal_decompose(gas_outlier, model="multiplicative", period=4)
gas_sa_outlier = gas_outlier / decomp_outlier.seasonal

decomp_outlier.plot()
plt.show()

A single outlier in the middle of the series warps the trend estimate around that point and produces a large spike in the residuals. Because the moving average used to estimate the trend absorbs part of the extreme value, the nearby seasonally adjusted values are distorted as well. This demonstrates how sensitive classical decomposition is to extreme observations.
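
The distortion can be quantified by comparing the seasonally adjusted series before and after injecting the outlier, reusing gas_sa from part d:

Code
# Largest change in the seasonally adjusted data caused by the outlier.
diff = (gas_sa_outlier - gas_sa).abs()
print(f"Largest distortion: {diff.max():.1f} at {diff.idxmax().date()}")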

  f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
Code
gas_outlier_end = gas_ts.copy()
gas_outlier_end.iloc[-1] += 300

decomp_end = seasonal_decompose(gas_outlier_end, model="multiplicative", period=4)
gas_sa_outlier_end = gas_outlier_end / decomp_end.seasonal

decomp_end.plot()
plt.show()

Yes, the position matters. When the outlier sits near the end of the series, the distortion is concentrated at the boundary: with a centred moving average, the trend is not estimated at the very ends, and the few estimable points nearby absorb most of the shock, leaving the middle of the series largely intact. The effect is arguably more damaging there, however, because it lands on the most recent seasonally adjusted values, which are the ones that matter most for monitoring and forecasting.
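
One way to compare the two cases is to measure how far each outlier moved the seasonal component away from the clean decomposition of part b:

Code
# Maximum shift in the seasonal component relative to the clean series.
shift_mid = (decomp_outlier.seasonal - decomp.seasonal).abs().max()
shift_end = (decomp_end.seasonal - decomp.seasonal).abs().max()
print(f"Seasonal shift (middle outlier): {shift_mid:.3f}")
print(f"Seasonal shift (end outlier):    {shift_end:.3f}")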

Exercise 3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using the MSTL method. Does it reveal any outliers, or unusual features that you had not noticed previously?

Code
from statsmodels.tsa.seasonal import MSTL
import matplotlib.pyplot as plt

myseries["Month"] = pd.to_datetime(myseries["Month"])

retail_ts = myseries.sort_values("Month").set_index("Month")["Turnover"]

mstl = MSTL(retail_ts, periods=12)
res = mstl.fit()

res.plot()
plt.show()

The MSTL decomposition separates the retail series into three components: a smooth, steadily rising trend; a strong annual seasonal pattern whose magnitude increases slightly over time; and the remainder. The remainder draws attention to some unusual features, particularly large fluctuations around 2008-2009 and 2011-2012, which were not as prominent in the original plot of the series.
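
Those unusual points can also be flagged programmatically by marking remainder values more than three standard deviations from zero (three is an arbitrary threshold):

Code
# Flag months where the MSTL remainder is unusually large.
resid = res.resid
threshold = 3 * resid.std()  # arbitrary cut-off
print(resid[resid.abs() > threshold])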

Exercise 3.9

Figures 3.13 and 3.14 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

As shown in Figure 3.13 and Figure 3.14, the STL decomposition of Australia’s monthly civilian labour force from February 1978 to August 1995 separates the original series into its trend, seasonal, and remainder components.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?

The Australian civilian labour force grows steadily over the whole period, with the trend component rising from 1978 to 1995. The seasonal component shows regular, predictable within-year ups and downs whose peaks and troughs remain small relative to the level of the series, indicating stable seasonality. The remainder captures the short-term variation, with some noticeable departures from zero.

Between 1991 and 1992 the trend component shows a brief slowdown, with the rate of growth levelling off, and the remainder becomes more volatile, indicating unusual short-run disruptions. The seasonal component, by contrast, is essentially unchanged. The 1991/1992 recession is therefore visible in the trend and remainder components, but not in the seasonality.