DATA624 - Homework 1

Author

Anthony Josue Roman

Exercise 2.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, GOOG_Close from gafa_stock, Demand from vic_elec.

  1. Bricks
  • Use info() to find out about the data in each series.
import pandas as pd

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])
aus_production.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   ds           218 non-null    datetime64[ns]
 1   Beer         218 non-null    int64         
 2   Tobacco      194 non-null    float64       
 3   Bricks       198 non-null    float64       
 4   Cement       218 non-null    int64         
 5   Electricity  218 non-null    int64         
 6   Gas          218 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(4)
memory usage: 12.1 KB

Q: What is the time interval of each series?

A: Quarterly. Consecutive observations are three months apart, and the plot shows the strong within-year fluctuations typical of quarterly data, together with a long-term trend. Brick production increases steadily from the late 1950s through the 1970s, peaks in the late 1970s and early 1980s, and then declines with increased volatility in later years. These patterns are consistent with construction cycles and broader economic conditions.
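
As a cross-check that does not rely on reading the plot, the spacing between consecutive timestamps can be tabulated directly; a minimal sketch using the aus_production frame loaded above (the same check can be reused for the pelt, gafa_stock, and vic_elec series in the parts that follow):

gaps = aus_production["ds"].sort_values().diff().dropna()
print(gaps.value_counts().head())   # quarterly data -> gaps of roughly 91-92 days

# pandas can often infer the frequency string directly (a quarter-start code here)
print(pd.infer_freq(aus_production["ds"].sort_values()))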

  • Use plot_series() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.
import matplotlib.pyplot as plt

bricks = aus_production[["ds", "Bricks"]].rename(columns={"Bricks": "y"}).copy()
bricks["y"] = pd.to_numeric(bricks["y"], errors="coerce")
bricks = bricks.dropna(subset=["y"]).sort_values("ds")

# Time plot drawn directly with matplotlib (an equivalent of the plot_series() view)
plt.figure(figsize=(10,4))
plt.plot(bricks["ds"], bricks["y"])
plt.title("Australian Brick Production")
plt.xlabel("Time")
plt.ylabel("Bricks")
plt.tight_layout()
plt.show()

  2. Lynx
  • Use info() to find out about the data in each series.
import pandas as pd

pelt = pd.read_csv("pelt.csv")
pelt["ds"] = pd.to_datetime(pelt["ds"])
pelt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  182 non-null    object        
 1   ds         182 non-null    datetime64[ns]
 2   y          182 non-null    int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 4.4+ KB

Q: What is the time interval of each series?

A: Annual, with one observation per year. The lynx series exhibits strong cyclical behavior, with peaks and troughs recurring over multi-year periods. The magnitude of these cycles varies through time, and because the data are annual there is no seasonal pattern.

  • Use plot_series() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

pelt["ds"] = pd.to_datetime(pelt["ds"], errors="coerce")

lynx = pelt[pelt["unique_id"] == "lynx"].dropna(subset=["ds", "y"]).copy()

plt.figure(figsize=(10,4))
plt.plot(lynx["ds"], lynx["y"])
plt.title("Lynx Trappings")
plt.xlabel("Year")
plt.ylabel("Number of Lynx")

ax = plt.gca()
ax.xaxis.set_major_locator(mdates.YearLocator(10))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

plt.tight_layout()
plt.show()

  3. GOOG_Close
  • Use info() to find out about the data in each series.
import pandas as pd

gafa = pd.read_csv("gafa_stock.csv")
gafa["ds"] = pd.to_datetime(gafa["ds"], errors="coerce")
gafa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30192 entries, 0 to 30191
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  30192 non-null  object        
 1   ds         30192 non-null  datetime64[ns]
 2   y          30192 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 707.8+ KB

Q: What is the time interval of each series?

A: Daily, on trading days only, so weekends and market holidays are absent and the spacing is slightly irregular. The Google closing price shows a clear upward trend across the sample period, with episodes of higher volatility; such volatility clustering is typical of financial series. There is no obvious seasonal pattern at the daily scale.
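
The irregular trading calendar can be confirmed by tabulating the day gaps between consecutive observations; a small sketch against the gafa frame loaded above:

g = gafa.loc[gafa["unique_id"] == "GOOG_Close", "ds"].sort_values()
print(g.diff().dt.days.value_counts().head())   # mostly 1-day gaps, with 3-day gaps at weekends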

  • Use plot_series() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

goog_close = gafa[gafa["unique_id"] == "GOOG_Close"].dropna(subset=["ds", "y"]).copy()

plt.figure(figsize=(10,4))
plt.plot(goog_close["ds"], goog_close["y"])
plt.title("Google Daily Closing Price (GOOG_Close)")
plt.xlabel("Date")
plt.ylabel("Closing Price")

ax = plt.gca()
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

plt.tight_layout()
plt.show()

  4. Demand
  • Use info() to find out about the data in each series.
import pandas as pd

vic = pd.read_csv("vic_elec.csv")
vic["ds"] = pd.to_datetime(vic["ds"], errors="coerce")
vic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105204 entries, 0 to 105203
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   ds         105204 non-null  datetime64[ns]
 1   unique_id  105204 non-null  object        
 2   y          105204 non-null  float64       
 3   Holiday    105204 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(1), object(1)
memory usage: 2.5+ MB

Q: What is the time interval of each series?

A: Half-hourly, giving 48 observations per day. Electricity demand exhibits strong and regular short-term patterns reflecting daily and weekly seasonality, with broader seasonal swings across the year. Occasional spikes and dips suggest the influence of extreme weather, holidays, or other unusual events.

  • Use plot_series() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

demand = vic[vic["unique_id"] == "Demand"].dropna(subset=["ds", "y"]).copy()

plt.figure(figsize=(10,4))
plt.plot(demand["ds"], demand["y"])
plt.title("Half-hourly Electricity Demand in Victoria")
plt.xlabel("Time")
plt.ylabel("Electricity Demand (MWh)")

ax = plt.gca()
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

plt.tight_layout()
plt.show()

Exercise 2.2

Use query() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

import pandas as pd

gafa = pd.read_csv("gafa_stock.csv")
gafa["ds"] = pd.to_datetime(gafa["ds"], errors="coerce")

close_ids = ["AAPL_Close", "AMZN_Close", "FB_Close", "GOOG_Close"]

peaks = []
for uid in close_ids:
    s = gafa[gafa["unique_id"] == uid].dropna(subset=["ds", "y"]).copy()
    idx = s["y"].idxmax()
    peaks.append({
        "Stock": uid.replace("_Close", ""),
        "Peak_Date": s.loc[idx, "ds"].date(),
        "Peak_Close": float(s.loc[idx, "y"])
    })

peaks_df = pd.DataFrame(peaks).sort_values("Stock")
peaks_df
Stock Peak_Date Peak_Close
0 AAPL 2018-10-03 232.070007
1 AMZN 2018-09-04 2039.510010
2 FB 2018-07-25 217.500000
3 GOOG 2018-07-26 1268.329956

The peak closing prices for AAPL, AMZN, FB, and GOOG all fall between July and October 2018, but on different days, so the maxima reflect stock-specific performance layered on a broader market run-up. While all four series show overall growth, the exact timing of each maximum varies with business conditions and investor expectations, showing how individual stocks can diverge even within the same market sector.
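
Since the exercise wording points at query(), the same result can be produced without an explicit loop; an alternative sketch using the gafa frame and close_ids list defined above:

closes = gafa.query("unique_id in @close_ids").dropna(subset=["ds", "y"])
peak_rows = closes.loc[closes.groupby("unique_id")["y"].idxmax(), ["unique_id", "ds", "y"]]
print(peak_rows.sort_values("unique_id"))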

Exercise 2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

  1. You can read the data into Python with the following script:
import pandas as pd

tute1 = pd.read_csv("tute1.csv")
tute1.head()
Quarter Sales AdBudget GDP
0 1981-03-01 1020.2 659.2 251.8
1 1981-06-01 889.2 589.0 290.9
2 1981-09-01 795.0 512.5 290.8
3 1981-12-01 1003.9 614.1 292.4
4 1982-03-01 1057.7 647.2 279.1
  2. Convert the data to time series
import pandas as pd

tute1["ds"] = pd.PeriodIndex(tute1["Quarter"], freq="Q").to_timestamp()
  3. Construct time series plots of each of the three series
import matplotlib.pyplot as plt

plt.figure(figsize=(10,4))
plt.plot(tute1["ds"], tute1["Sales"], label="Sales")
plt.plot(tute1["ds"], tute1["AdBudget"], label="AdBudget")
plt.plot(tute1["ds"], tute1["GDP"], label="GDP")

plt.title("Quarterly Sales, Advertising Budget, and GDP")
plt.xlabel("Time")
plt.ylabel("Value (inflation-adjusted)")
plt.legend()
plt.tight_layout()
plt.show()

The file tute1.csv contains quarterly data from 1981 to 2005, including sales for a small company, its advertising budget, and gross domestic product, all adjusted for inflation. After converting the time variable to a datetime format, time series plots were constructed for each variable to examine their overall behavior. These plots allow for visual comparison of long-term trends and variability across the three series.
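
Because the three series sit on different scales, a faceted version with one panel per series can make each pattern easier to read; a minimal sketch, reusing the tute1 frame from above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(10, 7), sharex=True)
for ax, col in zip(axes, ["Sales", "AdBudget", "GDP"]):
    ax.plot(tute1["ds"], tute1[col])
    ax.set_ylabel(col)
axes[-1].set_xlabel("Time")
fig.suptitle("Quarterly Sales, AdBudget, and GDP (one panel each)")
plt.tight_layout()
plt.show()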

Exercise 2.4

The us_total.csv contains data on the demand for natural gas in the US.

  1. Download us_total.csv from the book website and read in the CSV file using pd.read_csv().
import pandas as pd
import matplotlib.pyplot as plt

us_total = pd.read_csv("us_total.csv")
us_total.head()
ds unique_id y
0 1997 Alabama 324158
1 1998 Alabama 329134
2 1999 Alabama 337270
3 2000 Alabama 353614
4 2001 Alabama 332693
  2. Create a dataframe from us_total with year as the index.
us_total = us_total.set_index("ds")
us_total.head()
unique_id y
ds
1997 Alabama 324158
1998 Alabama 329134
1999 Alabama 337270
2000 Alabama 353614
2001 Alabama 332693
  3. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
# Re-read so that ds is an ordinary column again (part 2 moved it to the index),
# and strip any stray whitespace from the header names
us_total = pd.read_csv("us_total.csv")
us_total.columns = us_total.columns.str.strip()

new_england_states = [
    "Maine",
    "Vermont",
    "New Hampshire",
    "Massachusetts",
    "Connecticut",
    "Rhode Island"
]

ne = us_total[us_total["unique_id"].isin(new_england_states)].copy()

plt.figure(figsize=(10,4))
for state in new_england_states:
    s = ne[ne["unique_id"] == state]
    plt.plot(s["ds"], s["y"], label=state)

plt.title("Annual Natural Gas Consumption in New England")
plt.xlabel("Year")
plt.ylabel("Natural Gas Consumption")
plt.legend()
plt.tight_layout()
plt.show()

The plot shows annual natural gas consumption for New England states over time. Massachusetts and Connecticut consistently account for the largest share of regional consumption, with Massachusetts remaining the dominant consumer throughout the period. Connecticut shows a clear upward trend, particularly after the mid-2000s.

In contrast, Maine, New Hampshire, and Rhode Island exhibit more moderate usage levels with relatively smaller fluctuations, while Vermont remains the lowest consumer across all years. Overall, the region shows steady growth in natural gas demand, driven primarily by increases in the larger states.
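
An equivalent, loop-free way to draw the regional figure is to pivot to one column per state and plot the whole frame in one call; a sketch assuming the ne frame built above:

wide = ne.pivot(index="ds", columns="unique_id", values="y")
ax = wide.plot(figsize=(10, 4), title="Annual Natural Gas Consumption in New England")
ax.set_xlabel("Year")
ax.set_ylabel("Natural Gas Consumption")
plt.tight_layout()
plt.show()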

Exercise 2.5

  1. Download tourism.xlsx from the book website and read it in using pd.read_excel().
import pandas as pd
import matplotlib.pyplot as plt

tourism = pd.read_excel("tourism.xlsx", engine="calamine")
tourism.head()
Quarter Region State Purpose Trips
0 1998-01-01 Adelaide South Australia Business 135.077690
1 1998-04-01 Adelaide South Australia Business 109.987316
2 1998-07-01 Adelaide South Australia Business 166.034687
3 1998-10-01 Adelaide South Australia Business 127.160464
4 1999-01-01 Adelaide South Australia Business 137.448533
  2. Create a dataframe using tourism.xlsx.
tourism.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24320 entries, 0 to 24319
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Quarter  24320 non-null  object 
 1   Region   24320 non-null  object 
 2   State    24320 non-null  object 
 3   Purpose  24320 non-null  object 
 4   Trips    24320 non-null  float64
dtypes: float64(1), object(4)
memory usage: 950.1+ KB
  3. Find what combination of Region and Purpose had the maximum number of overnight trips on average.
avg_trips = (
    tourism
    .groupby(["Region", "Purpose"])["Trips"]
    .mean()
    .reset_index()
)

avg_trips.sort_values("Trips", ascending=False).head(1)
Region Purpose Trips
271 Sydney Visiting 747.269968
  4. Create a new dataframe which combines the Purposes and Regions, and just has total trips by State.
state_totals = (
    tourism
    .groupby("State")["Trips"]
    .sum()
    .reset_index()
)

state_totals.head()
State Trips
0 ACT 41006.585207
1 New South Wales 557367.430368
2 Northern Territory 28613.683011
3 Queensland 386642.911752
4 South Australia 118151.348155

The tourism data were grouped by Region and Purpose to compute the average number of overnight trips; the maximum average belongs to the Sydney region for the Visiting purpose (roughly 747 per quarter), consistent with a large, travel-heavy destination. The data were then aggregated by State to obtain total trips, which provides a simplified view of overall tourism demand at the state level.
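
To see how those state totals accumulate over time rather than as a single figure, the same aggregation can keep the quarterly dimension; a sketch, assuming the Quarter column parses with pd.to_datetime:

tourism["ds"] = pd.to_datetime(tourism["Quarter"], errors="coerce")
state_q = tourism.groupby(["State", "ds"], as_index=False)["Trips"].sum()

plt.figure(figsize=(10, 4))
for state, grp in state_q.groupby("State"):
    plt.plot(grp["ds"], grp["Trips"], label=state)
plt.title("Total Overnight Trips by State")
plt.xlabel("Quarter")
plt.ylabel("Trips")
plt.legend(ncol=2, fontsize=8)
plt.tight_layout()
plt.show()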

Exercise 2.8

Use the following graphics functions: plot_series(), seasonal_decompose(), lag_plot(), plot_acf() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Bricks from aus_production.csv

import pandas as pd
import matplotlib.pyplot as plt

from utilsforecast.plotting import plot_series
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"], errors="coerce")

bricks = aus_production[["ds", "Bricks"]].rename(columns={"Bricks": "y"}).dropna()
bricks["unique_id"] = "Bricks"
bricks = bricks[["unique_id", "ds", "y"]].sort_values("ds")

plot_series(bricks, id_col="unique_id", time_col="ds", target_col="y")
plt.title("aus_production: Bricks")
plt.tight_layout()
plt.show()

s = bricks.set_index("ds")["y"]
seasonal_decompose(s, model="additive", period=4).plot()
plt.suptitle("aus_production: Bricks (Seasonal decomposition)", y=1.02)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
lag_plot(s, lag=1)
plt.title("aus_production: Bricks (Lag plot, lag=1)")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
plot_acf(s, lags=40, zero=False, ax=ax)  # pass ax so plot_acf draws into this figure
ax.set_title("aus_production: Bricks (ACF)")
plt.tight_layout()
plt.show()

Static image of the aus_production: Bricks plots, included in case the figures do not render on RPubs.

“Total Private” from us_employment.csv

import pandas as pd
import matplotlib.pyplot as plt

from utilsforecast.plotting import plot_series
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

us_employment = pd.read_csv("us_employment.csv")
us_employment["ds"] = pd.to_datetime(us_employment["ds"], errors="coerce")

tp = us_employment[us_employment["unique_id"] == "Total Private"][["unique_id","ds","y"]].dropna().sort_values("ds")

plot_series(tp, id_col="unique_id", time_col="ds", target_col="y")
plt.title('us_employment: "Total Private" Employed')
plt.tight_layout()
plt.show()

s = tp.set_index("ds")["y"]
seasonal_decompose(s, model="additive", period=12).plot()
plt.suptitle('us_employment: "Total Private" (Seasonal decomposition)', y=1.02)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
lag_plot(s, lag=1)
plt.title('us_employment: "Total Private" (Lag plot, lag=1)')
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
plot_acf(s, lags=60, zero=False, ax=ax)
ax.set_title('us_employment: "Total Private" (ACF)')
plt.tight_layout()
plt.show()

Static image of the "Total Private" employment plots, included in case the figures do not render on RPubs.

Hare from pelt.csv

import pandas as pd
import matplotlib.pyplot as plt

from utilsforecast.plotting import plot_series
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

pelt = pd.read_csv("pelt.csv")
pelt["ds"] = pd.to_datetime(pelt["ds"], errors="coerce")

hare = pelt[pelt["unique_id"].str.lower() == "hare"][["unique_id","ds","y"]].dropna().sort_values("ds")

plot_series(hare, id_col="unique_id", time_col="ds", target_col="y")
plt.title("pelt: Hare")
plt.tight_layout()
plt.show()

s = hare.set_index("ds")["y"]
seasonal_decompose(s, model="additive", period=10).plot()
plt.suptitle("pelt: Hare (Decomposition, period=10)", y=1.02)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
lag_plot(s, lag=1)
plt.title("pelt: Hare (Lag plot, lag=1)")
plt.tight_layout()
plt.show()

plt.figure(figsize=(10,4))
plot_acf(s, lags=30, zero=False)
plt.title("pelt: Hare (ACF)")
plt.tight_layout()
plt.show()

Static image of the pelt: Hare plots, included in case the figures do not render on RPubs.

“H02” Cost from PBS_unparsed.csv

import pandas as pd
import matplotlib.pyplot as plt

from utilsforecast.plotting import plot_series
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

PBS = pd.read_csv("PBS_unparsed.csv")
PBS.columns = PBS.columns.str.strip()

PBS["ds"] = pd.to_datetime(PBS["ds"], errors="coerce") if "ds" in PBS.columns else pd.to_datetime(PBS["Month"], errors="coerce")

if {"unique_id", "y"}.issubset(PBS.columns):
    h02 = PBS[PBS["unique_id"] == "H02"][["unique_id","ds","y"]].dropna().sort_values("ds")
else:
    code_col = next(c for c in PBS.columns if c != "ds" and PBS[c].astype(str).str.contains(r"\bH02\b", na=False).any())
    value_col = next(c for c in PBS.columns if c not in ["ds", code_col] and pd.to_numeric(PBS[c], errors="coerce").notna().any())

    h02 = PBS.loc[PBS[code_col].astype(str).str.strip() == "H02", ["ds", value_col]].copy()
    h02["y"] = pd.to_numeric(h02[value_col], errors="coerce")
    h02 = h02.dropna(subset=["ds", "y"]).sort_values("ds")
    h02["unique_id"] = "H02"
    h02 = h02[["unique_id", "ds", "y"]]

plot_series(h02, id_col="unique_id", time_col="ds", target_col="y")
plt.title('PBS: "H02" Cost')
plt.tight_layout()
plt.show()

s = h02.set_index("ds")["y"].dropna()
seasonal_decompose(s, model="additive", period=12).plot()
plt.suptitle('PBS: "H02" (Seasonal decomposition)', y=1.02)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
lag_plot(s, lag=1)
plt.title('PBS: "H02" (Lag plot, lag=1)')
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
plot_acf(s, lags=60, zero=False, ax=ax)
ax.set_title('PBS: "H02" (ACF)')
plt.tight_layout()
plt.show()

Static image of the PBS: H02 plots, included in case the figures do not render on RPubs.

Barrels from us_gasoline.csv

import pandas as pd
import matplotlib.pyplot as plt

from utilsforecast.plotting import plot_series
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

us_gasoline = pd.read_csv("us_gasoline.csv")
us_gasoline["ds"] = pd.to_datetime(us_gasoline["ds"], errors="coerce")

# us_gasoline holds a single weekly series, so take its id directly
barrels_id = us_gasoline["unique_id"].dropna().unique()[0]
barrels = us_gasoline[us_gasoline["unique_id"] == barrels_id][["unique_id","ds","y"]].dropna().sort_values("ds")

plot_series(barrels, id_col="unique_id", time_col="ds", target_col="y")
plt.title("us_gasoline: Barrels")
plt.tight_layout()
plt.show()

s = barrels.set_index("ds")["y"].dropna()
seasonal_decompose(s, model="additive", period=52).plot()
plt.suptitle("us_gasoline: Barrels (Seasonal decomposition)", y=1.02)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
lag_plot(s, lag=1)
plt.title("us_gasoline: Barrels (Lag plot, lag=1)")
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10,4))
plot_acf(s, lags=80, zero=False, ax=ax)
ax.set_title("us_gasoline: Barrels (ACF)")
plt.tight_layout()
plt.show()

Static image of the us_gasoline: Barrels plots, included in case the figures do not render on RPubs.

  1. Can you spot any seasonality, cyclicity and trend?

Yes. Total Private Employment shows a strong long-term upward trend with noticeable cyclical downturns during recession periods, but little seasonality. Bricks, H02 cost, and US gasoline barrels all display clear seasonality, with repeating within-year patterns. Hare does not exhibit seasonality, but instead shows pronounced multi-year cycles.

  2. What do you learn about the series?

Each series exhibits different underlying structures. Employment is dominated by trend and business cycles, indicating strong persistence. Bricks and gasoline demand are strongly seasonal and autocorrelated, reflecting production and consumption patterns tied to the calendar. H02 cost shows both trend and seasonality, suggesting steadily rising costs with recurring annual behavior. The hare series is highly volatile and cyclical, with sharp rises and falls rather than steady growth.

  3. What can you say about the seasonal patterns?

Seasonality is strongest and most regular in Bricks, H02 cost, and US gasoline barrels, as confirmed by the seasonal decomposition and ACF plots showing peaks at seasonal lags. These seasonal effects appear stable over time. Total Private Employment shows only weak seasonality relative to its trend, while Hare shows no seasonal pattern at all.
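
That visual impression can be quantified with the seasonal-strength measure from the FPP text, F_s = max(0, 1 - Var(remainder) / Var(seasonal + remainder)), where values near 1 indicate strong seasonality; a sketch built on the decompositions above (the helper function is illustrative, not part of any library):

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_strength(series, period):
    # F_s = max(0, 1 - Var(R) / Var(S + R)); values near 1 mean strong seasonality
    dec = seasonal_decompose(series, model="additive", period=period)
    resid = dec.resid.dropna()
    seas_resid = (dec.seasonal + dec.resid).dropna()
    return max(0.0, 1 - np.var(resid) / np.var(seas_resid))

# e.g. quarterly Bricks; swap in the other series with their own periods
print(seasonal_strength(bricks.set_index("ds")["y"], period=4))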

  4. Can you identify any unusual years?

Yes. Total Private Employment shows sharp declines during major economic downturns, which stand out as unusual periods. US gasoline barrels displays abrupt drops during periods of reduced demand or disruption. Bricks shows irregular spikes and declines that deviate from its typical seasonal pattern. The Hare series contains extreme peak and trough years associated with population booms and crashes, making those years clearly unusual.