DATA624 HW2

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.2     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tsibbledata' was built under R version 4.2.3

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(ggplot2)

Exercise 3.7.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Answer

We will fist take a look at our data.

?global_economy

## starting httpd help server ... done

Global Economic indicators featured by the World Bank from 1960 to 2017. ‘global_economy’ is an annual tsibble with six values:

GDP: Gross domestic product (in $USD February 2019).
Growth: Annual percentage growth in GDP.
CPI: Consumer price index (base year 2010).
Imports: Imports of goods and services (% of GDP).
Exports: Exports of goods and services (% of GDP).
Population: Total population.

glimpse(global_economy)

## Rows: 15,150
## Columns: 9
## Key: Country [263]
## $ Country    <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",…
## $ Code       <fct> AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG,…
## $ Year       <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969,…
## $ GDP        <dbl> 537777811, 548888896, 546666678, 751111191, 800000044, 1006…
## $ Growth     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CPI        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Imports    <dbl> 7.024793, 8.097166, 9.349593, 16.863910, 18.055555, 21.4128…
## $ Exports    <dbl> 4.132233, 4.453443, 4.878051, 9.171601, 8.888893, 11.258279…
## $ Population <dbl> 8996351, 9166764, 9345868, 9533954, 9731361, 9938414, 10152…

We will need to calculate the GDP per Capita by dividing GDP with Population.

GDP per capita is computed as:

\[ GDP \ per \ capita = \frac{\text{GDP}}{\text{Population}} \]

global_economy<- global_economy |>
  filter(Country %in% unique(global_economy$Country)) |>
  mutate(GDP_per_capita = GDP / Population)

global_economy |>
  ggplot(aes(x = Year, y = GDP_per_capita, group = Country, color = Country)) +
  geom_line(show.legend = FALSE, alpha = 0.5) +  
  labs(title = "GDP per Capita Over Time for Each Country",
       x = "Year",
       y = "GDP per Capita ($USD)") +
  theme_minimal()

Finding the Country with the Highest GDP Per Capita Over Time

highest_gdp_overall <- global_economy |>
  filter(GDP_per_capita == max(GDP_per_capita, na.rm = TRUE)) |>
  select(Country, Year, GDP_per_capita)
highest_gdp_overall

## # A tsibble: 1 x 3 [1Y]
## # Key:       Country [1]
##   Country  Year GDP_per_capita
##   <fct>   <dbl>          <dbl>
## 1 Monaco   2014        185153.

highest_gdp_per_year <- global_economy |>
  as_tibble() |> 
  group_by(Year) |>  
  slice_max(order_by = GDP_per_capita, n = 1) |>  
  select(Year, Country, GDP_per_capita) |>  
  arrange(Year)
highest_gdp_per_year

## # A tibble: 58 × 3
## # Groups:   Year [58]
##     Year Country       GDP_per_capita
##    <dbl> <fct>                  <dbl>
##  1  1960 United States          3007.
##  2  1961 United States          3067.
##  3  1962 United States          3244.
##  4  1963 United States          3375.
##  5  1964 United States          3574.
##  6  1965 Kuwait                 4429.
##  7  1966 Kuwait                 4556.
##  8  1967 United States          4336.
##  9  1968 United States          4696.
## 10  1969 United States          5032.
## # ℹ 48 more rows

highest_gdp_per_year |>
  ggplot(aes(x = Year, y = GDP_per_capita, color = Country, group = Country)) +
  geom_line(linewidth  = 1) +
  geom_point(size = 2) +
  labs(title = "Country with Highest GDP Per Capita Over Time",
       x = "Year",
       y = "GDP per Capita ($USD)") +
  theme_minimal()

Analysis: The first plot provides an overview of GDP per capita across all countries from 1960 to 2017. Most countries show a general upward trend, indicating economic growth over time. However, there is significant variation, with some nations experiencing rapid increases while others grow more gradually.

Monaco shows a strong increasing trend in GDP per capita, reaching its highest value in 2014. The trend also experiences cycles, with noticeable dips during economic downturns, such as the 2008 financial crisis.

There is no clear seasonality, as GDP per capita is measured annually and is not influenced by seasonal patterns.

The country with the highest GDP per capita has changed over time are the United States led initially, followed by Kuwait, Luxembourg, and Monaco. This may reflect economic shifts driven by finance, trade, and resource wealth.

Exercise 3.7.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from ‘global_economy’.
Slaughter of Victorian “Bulls, bullocks and steers” in ‘aus_livestock’.
Victorian Electricity Demand from ‘vic_elec’.
Gas production from aus_production.

Answer:

United States GDP:

us_gdp <- global_economy |> 
  filter(Country == "United States")
glimpse(us_gdp)

## Rows: 58
## Columns: 10
## Key: Country [1]
## $ Country        <fct> "United States", "United States", "United States", "Uni…
## $ Code           <fct> USA, USA, USA, USA, USA, USA, USA, USA, USA, USA, USA, …
## $ Year           <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1…
## $ GDP            <dbl> 5.433000e+11, 5.633000e+11, 6.051000e+11, 6.386000e+11,…
## $ Growth         <dbl> NA, 2.3000000, 6.1000000, 4.4000000, 5.8000000, 6.40000…
## $ CPI            <dbl> 13.56306, 13.70828, 13.87261, 14.04459, 14.22421, 14.44…
## $ Imports        <dbl> 4.196576, 4.029824, 4.131549, 4.087065, 4.097404, 4.235…
## $ Exports        <dbl> 4.969630, 4.899698, 4.809122, 4.870028, 5.103529, 4.988…
## $ Population     <dbl> 180671000, 183691000, 186538000, 189242000, 191889000, …
## $ GDP_per_capita <dbl> 3007.123, 3066.563, 3243.843, 3374.515, 3573.941, 3827.…

global_economy|>
  filter(Country == 'United States')|>
  autoplot(GDP/10^12)+labs(title= "GDP, United States", y = "$US (in trillions)")

global_economy |>
  filter(Country == 'United States') |>
  ggplot(aes(x = Year, y = log(GDP))) +
  geom_line() +
  labs(title = "Log-Transformed GDP, United States", 
       y = "log($US GDP)", 
       x = "Year") +
  theme_minimal()

Analysis: The original GDP plot shows a steep upward trend, suggesting exponential growth over time. Applying a log transformation helps normalize this growth, making long-term trends easier to interpret. This transformation also reduces scale imbalances and allows fluctuations to be more comparable across different time periods, which improves trend analysis.

Victorian Livestock:

glimpse(aus_livestock)

## Rows: 29,364
## Columns: 4
## Key: Animal, State [54]
## $ Month  <mth> 1976 Jul, 1976 Aug, 1976 Sep, 1976 Oct, 1976 Nov, 1976 Dec, 197…
## $ Animal <fct> "Bulls, bullocks and steers", "Bulls, bullocks and steers", "Bu…
## $ State  <fct> Australian Capital Territory, Australian Capital Territory, Aus…
## $ Count  <dbl> 2300, 2100, 2100, 1900, 2100, 1800, 1800, 1900, 2700, 2300, 250…

vic_cattle <- aus_livestock |>
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria")

ggplot(vic_cattle, aes(x = Month, y = Count)) +
  geom_line() +
  labs(title = "Slaughter of Bulls, Bullocks, and Steers in Victoria",
       x = "Year",
       y = "Slaughter Count") +
  theme_minimal()

vic_cattle_ts <- aus_livestock |> # Convert to tsibble (time series format)
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria") |>
  as_tsibble(index = Month)
glimpse(vic_cattle_ts)

## Rows: 510
## Columns: 4
## Key: Animal, State [1]
## $ Month  <mth> 1976 Jul, 1976 Aug, 1976 Sep, 1976 Oct, 1976 Nov, 1976 Dec, 197…
## $ Animal <fct> "Bulls, bullocks and steers", "Bulls, bullocks and steers", "Bu…
## $ State  <fct> Victoria, Victoria, Victoria, Victoria, Victoria, Victoria, Vic…
## $ Count  <dbl> 109200, 94700, 95500, 94800, 94100, 98300, 93500, 102000, 10260…

vic_cattle_ts |> # Decompose the time series
  model(STL(Count ~ season(window = "periodic"))) |> 
  components() |> 
  autoplot() +
  labs(title = "Decomposition of Victorian Cattle Slaughter Time Series")

Analysis: The data shows strong seasonality (3rd panel) with clear annual patterns. However, no transformation is needed, as the variance remains stable over time.

Victorian Electricity Demand:

glimpse(vic_elec)

## Rows: 52,608
## Columns: 5
## $ Time        <dttm> 2012-01-01 00:00:00, 2012-01-01 00:30:00, 2012-01-01 01:0…
## $ Demand      <dbl> 4382.825, 4263.366, 4048.966, 3877.563, 4036.230, 3865.597…
## $ Temperature <dbl> 21.40, 21.05, 20.70, 20.55, 20.40, 20.25, 20.10, 19.60, 19…
## $ Date        <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-0…
## $ Holiday     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

The Half-hourly electricity demand for Victoria

vic_elec|>
  autoplot(Demand)+
  labs(title = "Half-hourly Electricity Demand in Victoria")+
  theme_minimal()

Let’s take a look at the daily electricity demand in Victoria.

vic_elec_daily <- vic_elec |>
  as_tibble() |>  # convert to a regular tibble
  mutate(Date = as_date(Time)) |> 
  group_by(Date) |>
  summarize(Daily_Demand = sum(Demand, na.rm = TRUE)) |>
  ungroup() |> 
  as_tsibble(index = Date)  # convert back to tsibble
#vic_elec_daily

ggplot(vic_elec_daily, aes(x = Date, y = Daily_Demand)) +
  geom_line() +
  labs(title = "Daily Electricity Demand in Victoria")+
  theme_minimal()

Let’s take a look at the monthly electricity demand in Victoria.

vic_elec_monthly <- vic_elec_daily |>
  as_tibble() |>  # convert to a regular tibble 
  mutate(Month = yearmonth(Date)) |>  # year-month format
  group_by(Month) |>
  summarize(Monthly_Demand = sum(Daily_Demand, na.rm = TRUE)) |>
  ungroup() |> 
  as_tsibble(index = Month)  # convert back to tsibble
#vic_elec_monthly

# Plot Monthly Electricity Demand
ggplot(vic_elec_monthly, aes(x = Month, y = Monthly_Demand)) +
  geom_line() +
  labs(title = "Monthly Electricity Demand in Victoria") +
  theme_minimal()

Analysis: We analyzed Victorian electricity demand at different time scales to better understand its trends and seasonality. The initial plot of half-hourly demand showed high variability, making it difficult to interpret. Aggregating to daily demand helped smooth out short-term fluctuations, while the monthly demand plot revealed a strong seasonal pattern, with peaks likely corresponding to extreme weather months. No transformation is needed, as the variance remains stable over time.

Gas Production:

glimpse(aus_production)

## Rows: 218
## Columns: 7
## $ Quarter     <qtr> 1956 Q1, 1956 Q2, 1956 Q3, 1956 Q4, 1957 Q1, 1957 Q2, 1957…
## $ Beer        <dbl> 284, 213, 227, 308, 262, 228, 236, 320, 272, 233, 237, 313…
## $ Tobacco     <dbl> 5225, 5178, 5297, 5681, 5577, 5651, 5317, 6152, 5758, 5641…
## $ Bricks      <dbl> 189, 204, 208, 197, 187, 214, 227, 222, 199, 229, 249, 234…
## $ Cement      <dbl> 465, 532, 561, 570, 529, 604, 603, 582, 554, 620, 646, 637…
## $ Electricity <dbl> 3923, 4436, 4806, 4418, 4339, 4811, 5259, 4735, 4608, 5196…
## $ Gas         <dbl> 5, 6, 7, 6, 5, 7, 7, 6, 5, 7, 8, 6, 5, 7, 8, 6, 6, 8, 8, 7…

ggplot(aus_production, aes(x = Quarter, y = Gas)) +
  geom_line() +
  labs(title = "Quarterly Gas Production in Australia") +
  theme_minimal()

A strong upward trend. data shows clear seasonal patterns. The fluctuations (seasonal variations) get larger over time. Let’s try log transformation:

ggplot(aus_production, aes(x = Quarter, y = log(Gas))) +
  geom_line() +
  labs(title = "Log-Transformed Quarterly Gas Production in Australia",
       x = "Year",
       y = "Log(Gas Production)") +
  theme_minimal()

Analysis: The quarterly gas production data in Australia shows a strong upward trend, indicating a steady increase in production over time. The data also shows clear seasonal patterns, with fluctuations that become larger as production increases. Applying a log transformation helps stabilize both the variance and trend, making the data easier to interpret. After transformation, the seasonal variations appear more consistent and uniform over time.

Exercise 3.7.3

Why is a Box-Cox transformation unhelpful for the ‘canadian_gas’ data?

Answer:

glimpse(canadian_gas)

## Rows: 542
## Columns: 2
## $ Month  <mth> 1960 Jan, 1960 Feb, 1960 Mar, 1960 Apr, 1960 May, 1960 Jun, 196…
## $ Volume <dbl> 1.4306, 1.3059, 1.4022, 1.1699, 1.1161, 1.0113, 0.9660, 0.9773,…

autoplot(canadian_gas, Volume) +
  labs(title = "Monthly Canadian Gas Production",
       y = "Gas Production (Billions of Cubic Metres)",
       x = "Year") +
  theme_minimal()

The plot shows a strong upward trend and a clear seasonal patterns, with peaks and dips occurring at regular intervals. The fluctuations increase slightly over time, but the seasonal variation is fairly stable relative to the trend. The variance does not increase dramatically.

BOx-Cox Test:

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

gas_volume <- canadian_gas$Volume
lambda <- BoxCox.lambda(gas_volume)
canadian_gas_transformed <- BoxCox(gas_volume, lambda)
canadian_gas_tsibble <- canadian_gas |> 
  mutate(Transformed_Volume = canadian_gas_transformed)

# Plot transformed data
ggplot(canadian_gas_tsibble, aes(x = Month, y = Transformed_Volume)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Canadian Gas Production")+
  theme_minimal()

Analysis: The Box-Cox transformation is unhelpful for the Canadian gas data because it follows an additive seasonal pattern rather than a multiplicative one. Since the seasonal fluctuations remain relatively stable over time, and the variance does not increase significantly, the transformation does not offer substantial benefits. Therefore, applying the Box-Cox transformation is unnecessary.

Exercise 3.7.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

Answer:

glimpse(aus_retail)

## Rows: 64,532
## Columns: 5
## Key: State, Industry [152]
## $ State       <chr> "Australian Capital Territory", "Australian Capital Territ…
## $ Industry    <chr> "Cafes, restaurants and catering services", "Cafes, restau…
## $ `Series ID` <chr> "A3349849A", "A3349849A", "A3349849A", "A3349849A", "A3349…
## $ Month       <mth> 1982 Apr, 1982 May, 1982 Jun, 1982 Jul, 1982 Aug, 1982 Sep…
## $ Turnover    <dbl> 4.4, 3.4, 3.6, 4.0, 3.6, 4.2, 4.8, 5.4, 6.9, 3.8, 4.2, 4.0…

set.seed(4321) 
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1)) #from Exercise 7 in Section 2.10
glimpse(myseries)

## Rows: 441
## Columns: 5
## Key: State, Industry [1]
## $ State       <chr> "Western Australia", "Western Australia", "Western Austral…
## $ Industry    <chr> "Furniture, floor coverings, houseware and textile goods r…
## $ `Series ID` <chr> "A3349661X", "A3349661X", "A3349661X", "A3349661X", "A3349…
## $ Month       <mth> 1982 Apr, 1982 May, 1982 Jun, 1982 Jul, 1982 Aug, 1982 Sep…
## $ Turnover    <dbl> 19.2, 21.9, 19.9, 19.3, 19.6, 19.9, 18.0, 19.0, 23.0, 16.6…

autoplot(myseries, Turnover)

turnover_values <- myseries$Turnover
lambda <- BoxCox.lambda(turnover_values)
lambda

## [1] 0.05851592

myseries_transformed <- myseries |> 
  mutate(Transformed_Turnover = BoxCox(Turnover, lambda))

ggplot(myseries_transformed, aes(x = Month, y = Transformed_Turnover)) +
  geom_line() +
  labs(y = "Transformed Turnover",
       title = paste("Box-Cox Transformed Retail Turnover (λ =", round(lambda, 2), ")")) +
  theme_minimal()

Analysis: For our selected retail data (from Exercise 7 in Section 2.10), the Box-Cox transformation with $\lambda = 0.06$ was applied to stabilize variance and stabilized the trend. Before the transformation, the data showed increasing variance, where seasonal fluctuations became larger over time. After the transformation, the trend appears more linear and seasonal fluctuations are more stable. this makes the data better suited for modeling. Also since $\lambda$ is close to 0, a log transformation would have provided a similar effect. Overall, the transformation successfully stabilized variance, improving the interpretability of the trend.

Exercise 3.7.5

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance. Tobacco from ‘aus_production’, Economy class passengers between Melbourne and Sydney from ‘ansett’, and Pedestrian counts at Southern Cross Station from ‘pedestrian’.

Answer:

Let’s check dataset structures

glimpse(aus_production)   # Tobacco data

## Rows: 218
## Columns: 7
## $ Quarter     <qtr> 1956 Q1, 1956 Q2, 1956 Q3, 1956 Q4, 1957 Q1, 1957 Q2, 1957…
## $ Beer        <dbl> 284, 213, 227, 308, 262, 228, 236, 320, 272, 233, 237, 313…
## $ Tobacco     <dbl> 5225, 5178, 5297, 5681, 5577, 5651, 5317, 6152, 5758, 5641…
## $ Bricks      <dbl> 189, 204, 208, 197, 187, 214, 227, 222, 199, 229, 249, 234…
## $ Cement      <dbl> 465, 532, 561, 570, 529, 604, 603, 582, 554, 620, 646, 637…
## $ Electricity <dbl> 3923, 4436, 4806, 4418, 4339, 4811, 5259, 4735, 4608, 5196…
## $ Gas         <dbl> 5, 6, 7, 6, 5, 7, 7, 6, 5, 7, 8, 6, 5, 7, 8, 6, 6, 8, 8, 7…

glimpse(ansett)           # Economy class passengers

## Rows: 7,407
## Columns: 4
## Key: Airports, Class [30]
## $ Week       <week> 1989 W28, 1989 W29, 1989 W30, 1989 W31, 1989 W32, 1989 W33…
## $ Airports   <chr> "ADL-PER", "ADL-PER", "ADL-PER", "ADL-PER", "ADL-PER", "ADL…
## $ Class      <chr> "Business", "Business", "Business", "Business", "Business",…
## $ Passengers <dbl> 193, 254, 185, 254, 191, 136, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23…

glimpse(pedestrian)       # Pedestrian counts

## Rows: 66,037
## Columns: 5
## Key: Sensor [4]
## $ Sensor    <chr> "Birrarung Marr", "Birrarung Marr", "Birrarung Marr", "Birra…
## $ Date_Time <dttm> 2015-01-01 00:00:00, 2015-01-01 01:00:00, 2015-01-01 02:00:…
## $ Date      <date> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,…
## $ Time      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Count     <int> 1630, 826, 567, 264, 139, 77, 44, 56, 113, 166, 250, 466, 40…

Before applying any transformation, let’s plot the original time series to check for trends, seasonality, and variance instability

Tobacco Production:

autoplot(aus_production, Tobacco) + 
  labs(title = "Tobacco Production in Australia", y = "Production (Tonnes)") +
  theme_minimal()

Observation: The series has a clear long-term trend, increasing until the late 1970s and then decreasing from the 1980s onward. The variance appears stable over time. There are fluctuations, but they do not seem to expand as the values increase. Box-Cox transformation is not required because the variance is already stable.

Economy Class Passengers (Melbourne-Sydney):

ansett_filtered <- ansett |> filter(Airports == "MEL-SYD", Class == "Economy")
autoplot(ansett_filtered, Passengers) + 
  labs(title = "Economy Class Passengers (Melbourne-Sydney)", y = "Passengers") +
  theme_minimal()

Observations: the plot shows upward trend in passengers over time. There is a significant dip around 1990. The variance does not appear to be increasing which, means fluctuations remain roughly consistent. Box-Cox transformation is not required, as variance instability is not evident.

Pedestrian counts at Southern Cross Station

Dataset containing the hourly pedestrian counts from 2015-01-01 to 2016-12-31 at 4 sensors in the city of Melbourne.

pedestrian_filtered <- pedestrian |> 
  filter(Sensor == "Southern Cross Station")
autoplot(pedestrian_filtered, Count) + 
  labs(title = "Pedestrian Counts at Southern Cross Station") +
  theme_minimal()

Observations: The plot shows a highly seasonal pattern (daily or weekly cycles). The Variance is increasing over time – peaks and troughs become more extreme. There is a clear need for stabilization since fluctuations grow larger as counts increase. A Box-Cox transformation is required to stabilize variance.

Let’s find optimal lambda using Guerrero’s method

lambda_pedestrian <- pedestrian_filtered  |> 
  features(Count, features = guerrero) |> 
  pull(lambda_guerrero)
lambda_pedestrian

## [1] -0.2501616

Now that we have $\lambda = -0.25$, we will apply the Box-Cox transformation to Pedestrian Counts to stabilize variance.

pedestrian_transformed <- pedestrian_filtered |> 
  mutate(Transformed_Count = box_cox(Count, lambda_pedestrian))

ggplot(pedestrian_transformed, aes(x = Date_Time, y = Transformed_Count)) +
  geom_line() +
  labs(y = "Transformed Count",
       title = paste("Box-Cox Transformed Pedestrian Counts (λ =", round(lambda_pedestrian, 2), ")")) +
  theme_minimal()

Observation: The Box-Cox transformation $(\lambda = 0.25)$ reduced the impact of extreme values, making fluctuations more comparable over time. The overall trend appears slightly more stable, but significant high-frequency variations remain. The transformation helped, but additional smoothing techniques, such as aggregating the data to daily or weekly counts, might further improve variance stabilization..

Let’s aggregate to weekly data to help reduce the noise and make patterns easier to identify.

pedestrian_weekly <- pedestrian_filtered |> 
  mutate(Week = yearweek(Date_Time)) |> 
  index_by(Week) |> 
  summarise(Weekly_Count = sum(Count, na.rm = TRUE))

autoplot(pedestrian_weekly, Weekly_Count) +
  labs(title = "Weekly Pedestrian Counts at Southern Cross Station",
       y = "Weekly Count") +
  theme_minimal()

lambda_weekly <- pedestrian_weekly |> 
  features(Weekly_Count, features = guerrero) |> 
  pull(lambda_guerrero)
lambda_weekly

## [1] -0.1108714

pedestrian_weekly_transformed <- pedestrian_weekly |> 
  mutate(Transformed_Weekly_Count = box_cox(Weekly_Count, lambda_weekly))

ggplot(pedestrian_weekly_transformed, aes(x = Week, y = Transformed_Weekly_Count)) +
  geom_line() +
  labs(title = paste("Box-Cox Transformed Weekly Counts (λ =", round(lambda_weekly, 2), ")"),
       y = "Transformed Weekly Count") +
  theme_minimal()

Observation: The transformation stabilized the variance compared to the original weekly plot, improving interpretability. The fluctuations are now more consistent over time, reducing the impact of extreme values. Since $\lambda$ is close to 0, a log transformation (log(Weekly_Count)) would have produced a similar effect. However, some dips—likely due to holidays or unusual events—remain prominent in the data.

Exercise 3.7.7

Consider the last five years of the Gas data from ‘aus_production’.

gas <- tail(aus_production, 5*4) |> select(Gas)

Answer:

glimpse(aus_production)

## Rows: 218
## Columns: 7
## $ Quarter     <qtr> 1956 Q1, 1956 Q2, 1956 Q3, 1956 Q4, 1957 Q1, 1957 Q2, 1957…
## $ Beer        <dbl> 284, 213, 227, 308, 262, 228, 236, 320, 272, 233, 237, 313…
## $ Tobacco     <dbl> 5225, 5178, 5297, 5681, 5577, 5651, 5317, 6152, 5758, 5641…
## $ Bricks      <dbl> 189, 204, 208, 197, 187, 214, 227, 222, 199, 229, 249, 234…
## $ Cement      <dbl> 465, 532, 561, 570, 529, 604, 603, 582, 554, 620, 646, 637…
## $ Electricity <dbl> 3923, 4436, 4806, 4418, 4339, 4811, 5259, 4735, 4608, 5196…
## $ Gas         <dbl> 5, 6, 7, 6, 5, 7, 7, 6, 5, 7, 8, 6, 5, 7, 8, 6, 6, 8, 8, 7…

- a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

autoplot(gas) +
  ggtitle("Quarterly Gas Production in Australia (Last 5 Years)") +
  xlab("Quarter") + 
  ylab("Gas Production")

## Plot variable not specified, automatically selected `.vars = Gas`

The gas production seems to be in an increasing trend over time and there are clear seasonal fluctuations within each year. The values rise and fall in a repeated pattern across quarters.

- b. Use classical_decomposition with type = multiplicative to calculate the trend-cycle and seasonal indices.

decomp <- gas |> 
  model(classical_decomposition(Gas, type = "multiplicative"))
components <- decomp |> components()
#components

autoplot(components) +
  ggtitle("Classical Multiplicative Decomposition of Gas Production")

- c. Do the results support the graphical interpretation from part a?

Yes, the decomposition results support our observations from part (a). The trend component confirms the increasing trend in gas production that we initially noticed in the raw time series plot. Additionally, the seasonal component exhibits a consistent pattern across quarters, reinforcing the presence of seasonality. Lastly, the random (residuals) component is relatively small, indicating that most of the variation in the data is effectively captured by the trend and seasonal components.

- d. Compute and plot the seasonally adjusted data.

# Check column names in decomposition output
colnames(components(decomp))

## [1] ".model"        "Quarter"       "Gas"           "trend"        
## [5] "seasonal"      "random"        "season_adjust"

# Extract decomposition components
gas_components <- components(decomp)

# Explicitly select required columns
gas_plot_data <- gas_components |> 
  select(Quarter, Gas, season_adjust)

# Plot original vs seasonally adjusted data
autoplot(gas_plot_data, Gas, colour = "gray") +
  geom_line(aes(x = Quarter, y = season_adjust), colour = "blue") +
  labs(title = "Seasonally Adjusted Gas Production",
       y = "Gas Production",
       x = "Quarter")

The gray line represents the original gas production data (with seasonal fluctuations) and the blue line represents the seasonally adjusted gas production, where the seasonal fluctuations have been removed, leaving only the trend and random variations. We can see that the seasonally adjusted line is smoother than the original, confirming that the seasonal component has been successfully removed.

- e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas_outlier <- gas |> mutate(Gas = replace(Gas, 10, Gas[10] + 300))

decomp_outlier <- gas_outlier |> # decomposition on the outlier dataset  
  model(classical_decomposition(Gas, type = "multiplicative"))
gas_components_outlier <- components(decomp_outlier)
colnames(gas_components_outlier)

## [1] ".model"        "Quarter"       "Gas"           "trend"        
## [5] "seasonal"      "random"        "season_adjust"

head(gas_components_outlier)

## # A dable: 6 x 7 [1Q]
## # Key:     .model [1]
## # :        Gas = trend * seasonal * random
##   .model                       Quarter   Gas trend seasonal random season_adjust
##   <chr>                          <qtr> <dbl> <dbl>    <dbl>  <dbl>         <dbl>
## 1 "classical_decomposition(Ga… 2005 Q3   221   NA     1.06  NA              209.
## 2 "classical_decomposition(Ga… 2005 Q4   180   NA     1.12  NA              160.
## 3 "classical_decomposition(Ga… 2006 Q1   171  200.    0.821  1.04           208.
## 4 "classical_decomposition(Ga… 2006 Q2   224  204.    0.998  1.10           224.
## 5 "classical_decomposition(Ga… 2006 Q3   233  207     1.06   1.06           220.
## 6 "classical_decomposition(Ga… 2006 Q4   192  210.    1.12   0.813          171.

ggplot(gas_components_outlier, aes(x = Quarter)) +
  geom_line(aes(y = Gas), colour = "gray") +  # Original data
  geom_line(aes(y = season_adjust), colour = "red") +  # Seasonally adjusted data
  labs(title = "Effect of Outlier on Seasonally Adjusted Gas Production",
       y = "Gas Production",
       x = "Quarter")

Observation: To examine the impact of an outlier, I modified one observation by adding 300 to the 10th quarter. The resulting plot showed that the seasonally adjusted gas production was significantly affected, with a noticeable spike in the middle of the time series. This demonstrates that an extreme value in the data can distort the seasonal adjustment, leading to an artificial shift in the trend.

- f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Now, let’s test whether placing the outlier near the end instead of the middle changes the results.

gas_outlier_last <- gas |> mutate(Gas = replace(Gas, n(), Gas[n()] + 300)) #outlier at the last quarter

decomp_outlier_last <- gas_outlier_last |> 
  model(classical_decomposition(Gas, type = "multiplicative"))
gas_components_outlier_last <- components(decomp_outlier_last)


ggplot(gas_components_outlier_last, aes(x = Quarter)) + # Plot original vs. seasonally adjusted data with outlier at the end
  geom_line(aes(y = Gas), colour = "gray") +  # Original data
  geom_line(aes(y = season_adjust), colour = "red") +  # Seasonally adjusted data
  labs(title = "Effect of Outlier at the End on Seasonally Adjusted Gas Production",
       y = "Gas Production",
       x = "Quarter")

Yes, the position of the outlier affects the seasonally adjusted data differently. When the outlier is in the middle of the series, its impact spreads across multiple time points, distorting both past and future trend estimates. In contrast, when the outlier is placed at the end, the earlier trend remains stable, and only the last data point is affected. This shows that outliers in the middle influence the overall trend more, while outliers at the end primarily affect recent observations and future projections.

Exercise 3.7.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

Answer: Let’s check and look for any visible trends, seasonality, or outliers.

set.seed(7865) # Set a seed and randomly select one series
myseries <- aus_retail |> 
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))


autoplot(myseries, Turnover) + # Plot the selected time series
  ggtitle("Selected Retail Time Series") +
  xlab("Year") + 
  ylab("Turnover")

# Apply X-11 decomposition
x11_decomp <- myseries |> 
  model(X_11 = X_13ARIMA_SEATS(Turnover))

# Extract decomposition components
x11_components <- components(x11_decomp)

# Plot decomposition results
autoplot(x11_components) +
  ggtitle("X-11 Decomposition of Retail Time Series")

Yes, the X-11 decomposition reveals patterns and unusual features in the retail time series that were not immediately obvious in the original plot. The trend shows a steady upward increase in turnover, while the seasonal component confirms strong, recurring fluctuations. The irregular component highlights potential outliers, with noticeable deviations, especially in the early 1980s and around the late 1990s to early 2000s, which may be due to economic shocks or data recording anomalies. Overall, the decomposition helps distinguish long-term trends and seasonal effects from unexpected variations, making it easier to interpret retail turnover behavior.

Exercise 3.7.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.19 - STL Decomposition

Figure 3.20 - Seasonal Patterns

- a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

Answer:

The decomposition of the Australian civilian labor force from February 1978 to August 1995 shows distinct trends and seasonal patterns. The trend component shows a steady increase in the labor force over time, indicating long-term growth. The seasonal component shows regular fluctuations, suggesting that labor force participation varies systematically throughout the year. The remainder (irregular) component captures short-term variations, including sharp deviations, which could indicate economic events or anomalies.The trend component shows large-scale changes in the labor force, with values in the thousands, while the seasonal and remainder components fluctuate on a much smaller scale, typically within ±100 or ±200. This difference in scale helps distinguish between long-term growth and short-term variations.

- b. Is the recession of 1991/1992 visible in the estimated components?

Answer:

Yes, the 1991/1992 recession is visible in the decomposition. During this time, the trend slows down, showing that labor force growth weakened. The remainder component also has bigger ups and downs, which suggests more economic uncertainty. While the seasonal pattern stays the same, the brief pause in the trend matches the economic downturn of the early 1990s.

DATA624 HW2

Farhana Akther

2025-02-16

Exercise 3.7.1

Exercise 3.7.2

Exercise 3.7.3

Exercise 3.7.4

Exercise 3.7.5

Exercise 3.7.7

Exercise 3.7.8

Exercise 3.7.9