DATA624 - Homework 2

Author

Anthony Josue Roman

Exercise 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

global_economy = pd.read_csv("global_economy.csv")

df = (
    global_economy
    .dropna(subset=["GDP", "Population"])
    .assign(
        gdp_pc=lambda x: x["GDP"] / x["Population"],
        year=lambda x: pd.to_datetime(x["ds"].astype(int), format="%Y")
    )
)

df.head()
unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
0 Afghanistan AFG 1960 5.377778e+08 NaN NaN 7.024793 4.132233 8996351.0 59.777327 1960-01-01
1 Afghanistan AFG 1961 5.488889e+08 NaN NaN 8.097166 4.453443 9166764.0 59.878153 1961-01-01
2 Afghanistan AFG 1962 5.466667e+08 NaN NaN 9.349593 4.878051 9345868.0 58.492874 1962-01-01
3 Afghanistan AFG 1963 7.511112e+08 NaN NaN 16.863910 9.171601 9533954.0 78.782758 1963-01-01
4 Afghanistan AFG 1964 8.000000e+08 NaN NaN 18.055555 8.888893 9731361.0 82.208444 1964-01-01

Plot of GDP per capita for each country over time

Code
plt.figure(figsize=(10, 5))

for country, g in df.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=0.8, alpha=0.15)

plt.title("GDP per capita over time (all countries)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.tight_layout()
plt.show()

The plot above shows every country at once, so the plot below restricts attention to the top 10 countries by GDP per capita in their latest available year to keep it readable.

Code
latest = (
    df.sort_values("year")
      .groupby("unique_id", as_index=False)
      .tail(1)
      .sort_values("gdp_pc", ascending=False)
)

top10_countries = latest["unique_id"].head(10).tolist()

top10 = df[df["unique_id"].isin(top10_countries)].copy()

plt.figure(figsize=(10, 5))
for country, g in top10.sort_values("year").groupby("unique_id"):
    plt.plot(g["year"], g["gdp_pc"], linewidth=2, label=country)

plt.title("GDP per capita over time (Top 10 by latest year)")
plt.xlabel("Year")
plt.ylabel("GDP per capita (USD)")
plt.legend(loc="upper left", bbox_to_anchor=(1.02, 1))
plt.tight_layout()
plt.show()

latest.head(10)

unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
9504 Monaco MCO 2016 6.468252e+09 3.213849 NaN NaN NaN 38499.0 168010.914891 2016-01-01
8112 Liechtenstein LIE 2016 6.214634e+09 NaN NaN NaN NaN 37666.0 164993.194162 2016-01-01
8403 Luxembourg LUX 2017 6.240446e+10 2.297941 111.413231 193.969823 230.016430 599449.0 104103.036747 2017-01-01
1329 Bermuda BMU 2013 5.573710e+09 -2.511154 NaN 29.592516 47.668268 65001.0 85748.065414 2013-01-01
8461 Macao SAR, China MAC 2017 5.036120e+10 9.096104 136.099091 32.009975 79.395298 622567.0 80892.821329 2017-01-01
13439 Switzerland CHE 2017 6.788873e+11 1.086792 98.266862 53.927473 64.976735 8466017.0 80189.696861 2017-01-01
6662 Isle of Man IMN 2016 6.592628e+09 7.400000 NaN NaN NaN 83737.0 78730.162285 2016-01-01
10491 Norway NOR 2017 3.988320e+11 1.918746 114.550719 33.051905 35.472423 5282223.0 75504.566255 2017-01-01
2483 Channel Islands CHI 2007 1.151461e+10 5.898330 NaN NaN NaN 156513.0 73569.644965 2007-01-01
6083 Iceland ISL 2017 2.390929e+10 3.642227 121.956961 42.848779 46.963832 341284.0 70056.873392 2017-01-01
Code
max_year = df["year"].max()
latest_year_df = df[df["year"] == max_year].copy()

top_latest_year = latest_year_df.sort_values("gdp_pc", ascending=False).head(10)
top_latest_year
unique_id Code ds GDP Growth CPI Imports Exports Population gdp_pc year
8403 Luxembourg LUX 2017 6.240446e+10 2.297941 111.413231 193.969823 230.016430 599449.0 104103.036747 2017-01-01
8461 Macao SAR, China MAC 2017 5.036120e+10 9.096104 136.099091 32.009975 79.395298 622567.0 80892.821329 2017-01-01
13439 Switzerland CHE 2017 6.788873e+11 1.086792 98.266862 53.927473 64.976735 8466017.0 80189.696861 2017-01-01
10491 Norway NOR 2017 3.988320e+11 1.918746 114.550719 33.051905 35.472423 5282223.0 75504.566255 2017-01-01
6083 Iceland ISL 2017 2.390929e+10 3.642227 121.956961 42.848779 46.963832 341284.0 70056.873392 2017-01-01
6605 Ireland IRL 2017 3.337308e+11 7.802382 105.079586 87.878319 120.012529 4813608.0 69330.690154 2017-01-01
11477 Qatar QAT 2017 1.669286e+11 1.579850 115.857023 37.257003 51.042428 2639211.0 63249.422433 2017-01-01
14483 United States USA 2017 1.939060e+13 2.273339 112.411557 NaN NaN 325719178.0 59531.661964 2017-01-01
10375 North America NAC 2017 2.104998e+13 2.348806 NaN NaN NaN 362492702.0 58070.066137 2017-01-01
12143 Singapore SGP 2017 3.239072e+11 3.618542 113.266146 149.082919 173.345205 5612253.0 57714.296631 2017-01-01
Code
yearly_top = (
    df.sort_values(["year", "gdp_pc"], ascending=[True, False])
      .groupby("year", as_index=False)
      .head(1)
      .rename(columns={"unique_id": "top_country"})
      .loc[:, ["year", "top_country", "gdp_pc"]]
)

top_counts = yearly_top["top_country"].value_counts().head(15)
top_counts
top_country
Monaco                  43
United States            8
Kuwait                   2
United Arab Emirates     2
Liechtenstein            2
Luxembourg               1
Name: count, dtype: int64
Code
plt.figure(figsize=(10, 4))
plt.plot(yearly_top["year"], yearly_top["gdp_pc"])
plt.title("GDP per capita of the top-ranked country each year")
plt.xlabel("Year")
plt.ylabel("Top GDP per capita (USD)")
plt.tight_layout()
plt.show()

yearly_top.tail(15)

year top_country gdp_pc
9491 2003-01-01 Monaco 108978.488860
9492 2004-01-01 Monaco 123382.015835
9493 2005-01-01 Monaco 124374.268481
9494 2006-01-01 Monaco 133195.429339
9495 2007-01-01 Monaco 167124.740985
9496 2008-01-01 Monaco 180640.125115
9497 2009-01-01 Monaco 149221.361937
9498 2010-01-01 Monaco 144569.175786
9499 2011-01-01 Monaco 162155.498619
9500 2012-01-01 Monaco 152000.362070
8109 2013-01-01 Liechtenstein 173528.150454
9502 2014-01-01 Monaco 185152.527227
8111 2015-01-01 Liechtenstein 167590.608272
9504 2016-01-01 Monaco 168010.914891
8403 2017-01-01 Luxembourg 104103.036747

GDP per capita was computed by dividing GDP by Population for each country and year, which makes countries and years directly comparable. The computations show an overall upward trend in GDP per capita at the global level, with growth rates varying across economies. In 2017, the latest year in the data, Luxembourg has the highest GDP per capita at around $104,103, though Monaco and Liechtenstein, which usually rank higher, have no 2017 observation. Across the full sample, Monaco tops the ranking most often, holding the highest GDP per capita in 43 of the years.
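
As a quick check on those last two points, the snippet below (assuming the df frame built above is still in scope) prints the last available year and GDP per capita for the usual front-runners, confirming that Monaco and Liechtenstein drop out of the data before 2017:

Code
# Last available observation for the countries that usually top the ranking.
# Assumes `df` from the first code block is still in scope.
for c in ["Monaco", "Liechtenstein", "Luxembourg"]:
    sub = df[df["unique_id"] == c].sort_values("year")
    print(c, sub["year"].max().year, round(sub["gdp_pc"].iloc[-1]))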

Exercise 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

  a. United States GDP from global_economy.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ge = pd.read_csv("global_economy.csv")
us = ge.query("unique_id == 'United States'")[["ds","GDP"]].dropna()
us["ds"] = pd.to_datetime(us["ds"].astype(int), format="%Y")
us["log_GDP"] = np.log(us["GDP"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(us["ds"], us["GDP"])
ax.set_title("United States GDP")
ax.set_xlabel("Year")
ax.set_ylabel("GDP")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(us["ds"], us["log_GDP"])
ax.set_title("United States GDP (log scale)")
ax.set_xlabel("Year")
ax.set_ylabel("log(GDP)")
plt.tight_layout()
plt.show()

The graphs show that GDP climbs steadily, and at an increasing rate: the curve bends upward, suggesting multiplicative growth, and the later years dominate the scale, compressing the small fluctuations of the early years. After taking logs, the series becomes close to a straight line, which is characteristic of a roughly constant growth rate over time. The log transformation therefore stabilizes the variance and makes the pattern easier to interpret.
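
To put a number on that roughly constant growth rate, one can fit a straight line to log GDP against the year; the slope approximates the average annual growth rate. A minimal sketch, reusing the us frame from above:

Code
# Fit log(GDP) ~ year with least squares; the slope of the line is
# approximately the average annual growth rate of GDP.
years = us["ds"].dt.year.to_numpy()
slope, intercept = np.polyfit(years, us["log_GDP"].to_numpy(), 1)
print(f"Implied average annual GDP growth: {slope:.2%}")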

  b. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Code
liv = pd.read_csv("aus_livestock.csv")
liv["ds"] = pd.to_datetime(liv["ds"])
vic_bulls = liv[liv["unique_id"].str.contains("Victoria") & liv["unique_id"].str.contains("Bulls, bullocks and steers")].copy()
vic_bulls["log_y"] = np.log(vic_bulls["y"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(vic_bulls["ds"], vic_bulls["y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter")
ax.set_xlabel("Date")
ax.set_ylabel("Slaughter")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(vic_bulls["ds"], vic_bulls["log_y"])
ax.set_title("Victoria: Bulls, bullocks and steers slaughter (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Slaughter)")
plt.tight_layout()
plt.show()

The original series shows strong seasonal patterns and large fluctuations, with the swings being larger when the level of the series is high; there are also signs of structural change from the late 1970s to the early 1980s. After the log transformation, the ups and downs are more uniform in size across time. The log transformation is appropriate here because it damps the large swings that occur when the level is high.
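
One way to check that the swings really are more uniform after logging is to compare a rolling standard deviation of the raw and logged series. This is only a rough diagnostic, and the 12-month window is an arbitrary choice:

Code
# Rolling standard deviation as a rough variance diagnostic:
# a flatter line for the logged series means more stable variance.
win = 12  # arbitrary window length
fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
axes[0].plot(vic_bulls["ds"], vic_bulls["y"].rolling(win).std())
axes[0].set_title("Rolling std: raw slaughter counts")
axes[1].plot(vic_bulls["ds"], vic_bulls["log_y"].rolling(win).std())
axes[1].set_title("Rolling std: logged slaughter counts")
plt.tight_layout()
plt.show()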

  c. Victorian Electricity Demand from vic_elec.
Code
ve = pd.read_csv("vic_elec.csv")
ve["ds"] = pd.to_datetime(ve["ds"])

daily = (
    ve.query("unique_id == 'Demand'")
      .set_index("ds")["y"]
      .resample("D")
      .mean()
      .reset_index()
      .rename(columns={"y":"Demand"})
)

daily["log_Demand"] = np.log(daily["Demand"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(daily["ds"], daily["Demand"])
ax.set_title("Victoria Electricity Demand (daily average)")
ax.set_xlabel("Date")
ax.set_ylabel("Demand")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(daily["ds"], daily["log_Demand"])
ax.set_title("Victoria Electricity Demand (daily average, log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Demand)")
plt.tight_layout()
plt.show()

Average daily electricity demand shows strong seasonality and considerable day-to-day variation, with spikes during extreme periods. Overall, however, the pattern is fairly consistent over time, with no strong evidence that the size of the fluctuations grows with the level of the series. After the log transformation the series looks essentially the same, with only a slight compression of the peaks, so the transformation is not strictly necessary here: the variance appears fairly stable without it.
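
That visual judgement can be backed up by letting scipy estimate a Box-Cox λ for the daily series; an estimate near 1 would confirm that little transformation is needed. A sketch, assuming demand is strictly positive:

Code
from scipy.stats import boxcox

# MLE estimate of the Box-Cox parameter; a lambda near 1 means the
# untransformed series is already close to variance-stable.
demand = daily["Demand"].dropna()
_, lam_demand = boxcox(demand[demand > 0].values)
print(f"Estimated Box-Cox lambda for daily demand: {lam_demand:.2f}")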

  d. Gas production from aus_production.
Code
ap = pd.read_csv("aus_production.csv")
ap["ds"] = pd.to_datetime(ap["ds"])
gas = ap[["ds","Gas"]].dropna().rename(columns={"Gas":"y"})
gas["log_y"] = np.log(gas["y"])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(gas["ds"], gas["y"])
ax.set_title("Gas production (aus_production)")
ax.set_xlabel("Date")
ax.set_ylabel("Gas")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(gas["ds"], gas["log_y"])
ax.set_title("Gas production (log scale)")
ax.set_xlabel("Date")
ax.set_ylabel("log(Gas)")
plt.tight_layout()
plt.show()

The gas production series rises sharply, with seasonal swings that grow as the level increases. This is indicative of multiplicative seasonality, where the variance increases with the level of the series. Once the logarithm is taken, the swings become roughly constant in size and the trend looks more linear. The log transformation is appropriate for this data because it stabilizes the variance and makes the multiplicative seasonality approximately additive.
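
Since the seasonal swings scale with the level here, it is worth estimating a Box-Cox λ for this series as well; a value near 0 would support the log choice. Same caveats as above, and assuming production is strictly positive:

Code
from scipy.stats import boxcox

# Estimate lambda for the Gas series; a value near 0 supports a log transform.
_, lam_gas = boxcox(gas["y"].values)
print(f"Estimated Box-Cox lambda for Gas: {lam_gas:.2f}")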

Exercise 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

Code
gas = pd.read_csv("canadian_gas.csv")
gas["ds"] = pd.to_datetime(gas["ds"])

plt.figure(figsize=(10,4))
plt.plot(gas["ds"], gas["y"])
plt.title("Canadian Gas Production")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.tight_layout()
plt.show()

Code
from statsmodels.tsa.seasonal import STL

# canadian_gas is a monthly series, so the seasonal period is 12
stl = STL(gas["y"], period=12)
res = stl.fit()

fig = res.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

Code
gas["log_y"] = np.log(gas["y"])

stl_log = STL(gas["log_y"], period=12)
res_log = stl_log.fit()

fig = res_log.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

The figures show why a Box-Cox transformation is unhelpful here. In the canadian_gas series, the size of the seasonal fluctuations does not grow steadily with the level: it is small in the early years, largest through the 1970s and 1980s, and smaller again afterwards, even though the level keeps rising. A Box-Cox transformation can only rescale the variance monotonically with the level of the series, so no single value of λ can stabilize variation that rises and then falls; the decomposition of the logged series above still shows a seasonal component that changes in size over time. A method that lets the seasonal component evolve, such as STL, handles this series better than any Box-Cox transformation.
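
A quick way to see this non-monotonic variance is to plot a rolling standard deviation of the logged series: if a log (or any Box-Cox) transformation worked, the line would be roughly flat. The 24-month window is an arbitrary choice:

Code
# Rolling std of log(y); a Box-Cox transform can only rescale variance
# monotonically with the level, so a rise-then-fall pattern here shows
# why no single lambda can stabilize this series.
plt.figure(figsize=(10, 3))
plt.plot(gas["ds"], gas["log_y"].rolling(24).std())
plt.title("24-month rolling std of log Canadian gas production")
plt.tight_layout()
plt.show()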

Exercise 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import lag_plot

aus_retail = pd.read_csv("aus_retail.csv")
aus_retail["Month"] = pd.to_datetime(aus_retail["Month"])

np.random.seed(123)
random_series_id = np.random.choice(aus_retail["Series ID"].unique(), 1)[0]

myseries = aus_retail.query("`Series ID` == @random_series_id").copy()
myseries = myseries.sort_values("Month")

myseries.head()
State Industry Series ID Month Turnover
45569 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-04-01 4.0
45570 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-05-01 4.0
45571 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-06-01 3.9
45572 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-07-01 4.4
45573 Tasmania Pharmaceutical, cosmetic and toiletry goods re... A3349671C 1982-08-01 4.2
Code
plt.figure(figsize=(10,4))
plt.plot(myseries["Month"], myseries["Turnover"])
plt.title(f"Retail Series: {random_series_id}")
plt.xlabel("Month")
plt.ylabel("Turnover")
plt.tight_layout()
plt.show()

Code
decomp = seasonal_decompose(myseries["Turnover"], period=12, model="additive")
fig = decomp.plot()
fig.set_size_inches(10,6)
plt.tight_layout()
plt.show()

Code
plt.figure(figsize=(4,4))
lag_plot(myseries["Turnover"])
plt.title("Lag Plot")
plt.tight_layout()
plt.show()

Code
fig, ax = plt.subplots(figsize=(8, 4))
plot_acf(myseries["Turnover"], lags=48, ax=ax)
plt.tight_layout()
plt.show()

Code
from scipy.stats import boxcox

y = myseries["Turnover"].values
y_positive = y[y > 0]

_, lambda_est = boxcox(y_positive)
lambda_est
np.float64(0.2638512890190425)

The Box-Cox parameter is estimated at approximately λ ≈ 0.26, indicating that a moderate transformation is required. Because the estimate is closer to 0 than to 1, a simple log transformation (λ = 0) would be a reasonable choice, though applying the estimated λ ≈ 0.26 directly stabilizes the variance slightly better.
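
To see the effect, the fitted λ can be applied with scipy.special.boxcox, which computes w = (y^λ − 1)/λ. A sketch reusing the objects above, and assuming turnover is strictly positive:

Code
from scipy.special import boxcox as boxcox_transform

# Apply the fitted lambda: w = (y**lam - 1) / lam
myseries["Turnover_bc"] = boxcox_transform(myseries["Turnover"].values, lambda_est)

plt.figure(figsize=(10, 4))
plt.plot(myseries["Month"], myseries["Turnover_bc"])
plt.title(f"Box-Cox transformed turnover (lambda = {lambda_est:.2f})")
plt.xlabel("Month")
plt.ylabel("Transformed turnover")
plt.tight_layout()
plt.show()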

Exercise 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Code
import pandas as pd
import numpy as np
from scipy.stats import boxcox

aus_prod = pd.read_csv("aus_production.csv")
aus_prod["ds"] = pd.to_datetime(aus_prod["ds"])

tobacco = aus_prod["Tobacco"].dropna().values

_, lambda_tobacco = boxcox(tobacco)
lambda_tobacco
np.float64(1.4081158422710829)
Code
ansett = pd.read_csv("ansett.csv")
ansett["ds"] = pd.to_datetime(ansett["ds"])

# Restrict to Economy class on the Melbourne-Sydney route; assumes the
# route is stored in an "Airports" column with value "MEL-SYD".
econ = ansett[(ansett["Class"] == "Economy") & (ansett["Airports"] == "MEL-SYD")]
econ_series = econ["y"].dropna()

if econ_series.min() <= 0:
    econ_shifted = econ_series - econ_series.min() + 1
else:
    econ_shifted = econ_series

_, lambda_econ = boxcox(econ_shifted.values)
lambda_econ
np.float64(0.34248932972035595)
Code
ped = pd.read_csv("pedestrian.csv")

# Restrict to the Southern Cross Station sensor; assumes sensor names
# are stored in the unique_id column.
sc = ped[ped["unique_id"].str.contains("Southern Cross", na=False)]
counts = sc["y"].dropna()

if counts.min() <= 0:
    counts = counts - counts.min() + 1

_, lambda_ped = boxcox(counts.values)
lambda_ped
np.float64(0.14795296616733997)

The Box-Cox analysis of the three series indicates the following. Tobacco production (aus_production) has an estimated λ ≈ 1.41; since λ = 1 corresponds to no transformation beyond a shift, little or no transformation is required. The Economy class passenger series (ansett) has λ ≈ 0.34, calling for a moderate transformation close to a cube root (λ = 1/3). The pedestrian counts at Southern Cross Station have λ ≈ 0.15, close to 0, so a logarithmic transformation is appropriate.
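
As a usage sketch, each estimated λ can be plugged into scipy.special.boxcox, which applies the transformation w = (y^λ − 1)/λ with a fixed λ (assuming the arrays from the blocks above are still in scope):

Code
from scipy.special import boxcox as bc

# Apply each estimated lambda to its (strictly positive) series.
# lambda = 1 only rescales linearly; lambda near 0 behaves like a log.
tobacco_bc = bc(tobacco, lambda_tobacco)
econ_bc = bc(econ_shifted.values, lambda_econ)
ped_bc = bc(counts.values, lambda_ped)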

Exercise 3.7

Consider the last five years of the Gas data from aus_production.

gas = aus_production.loc[lambda x: x["unique_id"] == "Gas"].tail(5 * 4)

  a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
Code
import pandas as pd
import matplotlib.pyplot as plt

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])

gas = aus_production[["ds", "Gas"]].dropna().tail(5 * 4)

plt.figure()
plt.plot(gas["ds"], gas["Gas"])
plt.title("Gas Production – Last 5 Years")
plt.xlabel("Date")
plt.ylabel("Gas")
plt.show()

Over the last five years, gas production shows clear seasonal fluctuations, with a pattern that repeats within each year, together with an upward trend as production rises over time. Within such a short window, no cycles other than the seasonal pattern are evident.
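
The seasonal pattern can be made explicit by averaging production within each quarter of the year, a quick check on the same 20 observations:

Code
# Mean production by quarter of the year; a clear spread across quarters
# confirms the within-year seasonality visible in the plot.
quarterly_means = gas.groupby(gas["ds"].dt.quarter)["Gas"].mean()
print(quarterly_means)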

  b. Use seasonal_decompose with model='multiplicative' to calculate the trend-cycle and seasonal indices.
Code
from statsmodels.tsa.seasonal import seasonal_decompose

gas_ts = gas.set_index("ds")["Gas"]

decomp = seasonal_decompose(gas_ts, model="multiplicative", period=4)

decomp.plot()
plt.show()

trend = decomp.trend
seasonal = decomp.seasonal

The multiplicative decomposition shows a smooth, rising trend-cycle together with seasonal factors that repeat every quarter. The seasonal factors stay roughly constant as proportions of the level, which is exactly what the multiplicative form assumes, and the residuals are small relative to the level of the series.
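
The four quarterly seasonal indices can be read off directly from the seasonal component, since seasonal_decompose repeats the same factor for each quarter:

Code
# One seasonal index per quarter; factors above 1 mark high-production
# quarters, factors below 1 mark low ones.
indices = seasonal.groupby(seasonal.index.quarter).first()
print(indices)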

  c. Do the results support the graphical interpretation from part a?

Yes, the decomposition confirms the picture provided by the graph in (a). The trend component confirms the rising pattern shown in the graph, while the seasonal component reflects the periodic fluctuations every three months. Overall, the decomposition fits the visual information nicely.

  d. Compute and plot the seasonally adjusted data.
Code
gas_sa = gas_ts / seasonal

plt.figure()
plt.plot(gas_ts, label="Original")
plt.plot(gas_sa, label="Seasonally Adjusted")
plt.legend()
plt.title("Seasonally Adjusted Gas Production")
plt.show()

Seasonal adjustment removes the regular quarterly ups and downs while keeping the overall upward direction. The adjusted series is smoother, making the underlying growth in gas production more evident, and the seasonal swings no longer dominate.

  e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Code
gas_outlier = gas_ts.copy()
midpoint = len(gas_outlier) // 2
gas_outlier.iloc[midpoint] += 300

decomp_outlier = seasonal_decompose(gas_outlier, model="multiplicative", period=4)
gas_sa_outlier = gas_outlier / decomp_outlier.seasonal

decomp_outlier.plot()
plt.show()

A single outlier in the middle of the series warps the trend estimate around that point and produces a large spike in the residuals. Because the moving average used to estimate the trend absorbs part of the extreme value, the nearby seasonally adjusted values are distorted as well. This demonstrates how sensitive classical decomposition is to extreme observations.
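
The distortion can be quantified by comparing the seasonally adjusted series before and after injecting the outlier, reusing gas_sa from part d:

Code
# Largest change in the seasonally adjusted data caused by the outlier.
diff = (gas_sa_outlier - gas_sa).abs()
print(f"Largest distortion: {diff.max():.1f} at {diff.idxmax().date()}")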

  f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
Code
gas_outlier_end = gas_ts.copy()
gas_outlier_end.iloc[-1] += 300

decomp_end = seasonal_decompose(gas_outlier_end, model="multiplicative", period=4)
gas_sa_outlier_end = gas_outlier_end / decomp_end.seasonal

decomp_end.plot()
plt.show()

Yes, the position matters. When the outlier sits near the end of the series, the distortion is concentrated at the boundary: with a centred moving average, the trend is not estimated at the very ends, and the few estimable points nearby absorb most of the shock, leaving the middle of the series largely intact. The effect is arguably more damaging there, however, because it lands on the most recent seasonally adjusted values, which are the ones that matter most for monitoring and forecasting.
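
One way to compare the two cases is to measure how far each outlier moved the seasonal component away from the clean decomposition of part b:

Code
# Maximum shift in the seasonal component relative to the clean series.
shift_mid = (decomp_outlier.seasonal - decomp.seasonal).abs().max()
shift_end = (decomp_end.seasonal - decomp.seasonal).abs().max()
print(f"Seasonal shift (middle outlier): {shift_mid:.3f}")
print(f"Seasonal shift (end outlier):    {shift_end:.3f}")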

Exercise 3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using the MSTL method. Does it reveal any outliers, or unusual features that you had not noticed previously?

Code
from statsmodels.tsa.seasonal import MSTL
import matplotlib.pyplot as plt

myseries["Month"] = pd.to_datetime(myseries["Month"])

retail_ts = myseries.sort_values("Month").set_index("Month")["Turnover"]

mstl = MSTL(retail_ts, periods=12)
res = mstl.fit()

res.plot()
plt.show()

The MSTL decomposition separates the retail series into three components: a smooth, steadily rising trend; a strong annual seasonal pattern whose magnitude increases slightly over time; and the remainder. The remainder draws attention to some unusual features, particularly large fluctuations around 2008-2009 and 2011-2012, which were not as prominent in the original plot of the series.
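
Those unusual points can also be flagged programmatically by marking remainder values more than three standard deviations from zero (three is an arbitrary threshold):

Code
# Flag months where the MSTL remainder is unusually large.
resid = res.resid
threshold = 3 * resid.std()  # arbitrary cut-off
print(resid[resid.abs() > threshold])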

Exercise 3.9

Figures 3.13 and 3.14 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

As shown in Figure 3.13 and Figure 3.14, the STL decomposition of Australia’s monthly civilian labour force from February 1978 to August 1995 separates the original series into its trend, seasonal, and remainder components.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?

The Australian civilian labour force grows steadily over the whole period, with the trend component rising from 1978 to 1995. The seasonal component shows regular, predictable within-year ups and downs whose peaks and troughs remain small relative to the level of the series, indicating stable seasonality. The remainder captures the short-term variation, with some noticeable departures from zero.

Between 1991 and 1992 the trend component shows a brief slowdown, with the rate of growth levelling off, and the remainder becomes more volatile, indicating unusual short-run disruptions. The seasonal component, by contrast, is essentially unchanged. The 1991/1992 recession is therefore visible in the trend and remainder components, but not in the seasonality.