Data624_Homework2_ahmed

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──

## ✔ tibble      3.3.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.2     ✔ feasts      0.4.2
## ✔ lubridate   1.9.4     ✔ fable       0.5.0
## ✔ ggplot2     4.0.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(dplyr)
library(ggplot2)

# Calculate GDP per capita
global_economy <- global_economy %>%
  mutate(gdp_per_capita = GDP / Population)

# Find top 10 countries by most recent GDP per capita
top_10 <- global_economy %>%
  filter(Year == max(Year)) %>%
  slice_max(gdp_per_capita, n = 10) %>%
  pull(Country)

print(paste("Top 10 countries by GDP per capita (", max(global_economy$Year), ")", sep=""))

## [1] "Top 10 countries by GDP per capita (2017)"

global_economy %>%
  filter(Year == max(Year), Country %in% top_10) %>%
  select(Country, gdp_per_capita) %>%
  arrange(desc(gdp_per_capita))

## # A tsibble: 10 x 3 [1Y]
## # Key:       Country [10]
##    Country          gdp_per_capita  Year
##    <fct>                     <dbl> <dbl>
##  1 Luxembourg              104103.  2017
##  2 Macao SAR, China         80893.  2017
##  3 Switzerland              80190.  2017
##  4 Norway                   75505.  2017
##  5 Iceland                  70057.  2017
##  6 Ireland                  69331.  2017
##  7 Qatar                    63249.  2017
##  8 United States            59532.  2017
##  9 North America            58070.  2017
## 10 Singapore                57714.  2017

# 10 countries Plot
global_economy %>%
  filter(Country %in% top_10) %>%
  ggplot(aes(x = Year, y = gdp_per_capita, color = Country)) +
  geom_line(linewidth = 1) +
  labs(
    title = "GDP per capita - Top 10 Countries",
    x = "Year",
    y = "GDP per capita ($US)"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_line()`).

# changes over time, convert to tibble
print("\nTop country by GDP per capita in different years:")

## [1] "\nTop country by GDP per capita in different years:"

global_economy %>%
  as_tibble() %>%
  filter(Year %in% c(1960, 1980, 2000, 2017)) %>%
  group_by(Year) %>%
  slice_max(gdp_per_capita, n = 1) %>%
  select(Year, Country, gdp_per_capita) %>%
  arrange(Year)

## # A tibble: 4 × 3
## # Groups:   Year [4]
##    Year Country       gdp_per_capita
##   <dbl> <fct>                  <dbl>
## 1  1960 United States          3007.
## 2  1980 Monaco                51529.
## 3  2000 Monaco                82535.
## 4  2017 Luxembourg           104103.

Luxembourg has the highest GDP per capita in 2017 at approximately $120,000. The leadership has changed over time, with Qatar leading in the 1970s-80s, then Switzerland and Norway in the 1990s, before Luxembourg took the top spot in the mid 2000s and has maintained it since.

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock. Victorian Electricity Demand from vic_elec. Gas production from aus_production.

library(fpp3)

us_gdp <- global_economy %>%
  filter(Country == "United States")

us_gdp %>% autoplot(GDP) + 
  labs(title = "US GDP - Original", y = "GDP")

# Log transformation
us_gdp %>% autoplot(log(GDP)) + 
  labs(title = "US GDP - Log Transformed", y = "log(GDP)")

print("Effect: Exponential growth becomes linear trend")

## [1] "Effect: Exponential growth becomes linear trend"

bulls <- aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria")

bulls %>% autoplot(Count) + 
  labs(title = "Victorian Bulls - Original", y = "Count")

# Box-Cox transformation
lambda_bulls <- bulls %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

bulls %>% autoplot(box_cox(Count, lambda_bulls)) + 
  labs(title = paste("Victorian Bulls - Box-Cox (λ =", round(lambda_bulls, 2), ")"), 
       y = "Transformed Count")

print("Effect: Stabilizes changing variance over time")

## [1] "Effect: Stabilizes changing variance over time"

vic_elec %>% autoplot(Demand) + 
  labs(title = "Victorian Electricity Demand", y = "Demand (MW)")

print("No transformation needed, short series, stable variance")

## [1] "No transformation needed, short series, stable variance"

gas <- aus_production %>% select(Quarter, Gas)

gas %>% autoplot(Gas) + 
  labs(title = "Gas Production - Original", y = "Gas")

# Box-Cox transformation
lambda_gas <- gas %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

gas %>% autoplot(box_cox(Gas, lambda_gas)) + 
  labs(title = paste("Gas Production - Box-Cox (λ =", round(lambda_gas, 2), ")"), 
       y = "Transformed Gas")

print("Effect: Stabilizes increasing seasonal variation")

## [1] "Effect: Stabilizes increasing seasonal variation"

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

canadian_gas

## # A tsibble: 542 x 2 [1M]
##       Month Volume
##       <mth>  <dbl>
##  1 1960 Jan  1.43 
##  2 1960 Feb  1.31 
##  3 1960 Mar  1.40 
##  4 1960 Apr  1.17 
##  5 1960 May  1.12 
##  6 1960 Jun  1.01 
##  7 1960 Jul  0.966
##  8 1960 Aug  0.977
##  9 1960 Sep  1.03 
## 10 1960 Oct  1.25 
## # ℹ 532 more rows

canadian_gas %>% autoplot(Volume) + 
  labs(title = "Canadian Gas Production - Original", y = "Volume")

# Box-Cox transformation
lambda_canada <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

canadian_gas %>% autoplot(box_cox(Volume, lambda_canada)) + 
  labs(title = paste("Canadian Gas - Box-Cox (λ =", round(lambda_canada, 2), ")"), 
       y = "Transformed Volume")

Box-Cox transformation is unhelpful for the Canadian gas data because the seasonal pattern itself is changing over time, not just the variance. Box-Cox can only stabilize variance, but it cannot fix the fundamental change in seasonal structure visible in the data. The pattern in the 2000s is completely different from the 1960s-1980s. The transformation doesn’t simplify the series because the underlying seasonal behavior has evolved due to external factors like regulatory changes.

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries |> distinct(Industry, State)

## # A tibble: 1 × 2
##   Industry                                            State             
##   <chr>                                               <chr>             
## 1 Clothing, footwear and personal accessory retailing Northern Territory

myseries |> autoplot(Turnover) +
  labs(title = "Original Retail Series", y = "Turnover")

# Calculate optimal lambda
lambda_retail <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

print(paste("Optimal lambda:", round(lambda_retail, 3)))

## [1] "Optimal lambda: 0.083"

#Box-Cox transformation
myseries |> autoplot(box_cox(Turnover, lambda_retail)) +
  labs(title = paste("Box-Cox Transformed (λ =", round(lambda_retail, 2), ")"),
       y = "Transformed Turnover")

I would use λ = 0.08, which is basically a log transformation. The raw series shows that seasonal ups and downs get bigger as retail turnover increases over time. After the Box-Cox transformation, the seasonal variation is much more stable across the whole period, which makes the pattern easier to analyze and model.

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

tobacco <- aus_production %>% select(Quarter, Tobacco)

tobacco %>% autoplot(Tobacco) + 
  labs(title = "Tobacco Production - Original")

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Box-Cox
lambda_tobacco <- tobacco %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

tobacco %>% autoplot(box_cox(Tobacco, lambda_tobacco)) +
  labs(title = paste("Tobacco - Box-Cox (λ =", round(lambda_tobacco, 2), ")"))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

print(paste("Tobacco: λ =", round(lambda_tobacco, 3)))

## [1] "Tobacco: λ = 0.926"

# Economy class Melbourne-Sydney from ansett
economy <- ansett %>%
  filter(Class == "Economy", Airports == "MEL-SYD")

economy %>% autoplot(Passengers) + 
  labs(title = "Economy Passengers MEL-SYD - Original")

# Box-Cox
lambda_economy <- economy %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

economy %>% autoplot(box_cox(Passengers, lambda_economy)) +
  labs(title = paste("Economy - Box-Cox (λ =", round(lambda_economy, 2), ")"))

print(paste("Economy: λ =", round(lambda_economy, 3)))

## [1] "Economy: λ = 2"

# Pedestrian counts at Southern Cross Station
southern_cross <- pedestrian %>%
  filter(Sensor == "Southern Cross Station")

southern_cross %>% autoplot(Count) + 
  labs(title = "Southern Cross Station - Original")

# Box-Cox
lambda_ped <- southern_cross %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

southern_cross %>% autoplot(box_cox(Count, lambda_ped)) +
  labs(title = paste("Southern Cross - Box-Cox (λ =", round(lambda_ped, 2), ")"))

print(paste("Southern Cross: λ =", round(lambda_ped, 3)))

## [1] "Southern Cross: λ = -0.25"

Answer:

Tobacco production: λ = 0.93, no transformation needed since λ is close to 1 and the variance is stable

Economy class MEL-SYD: λ = 2.0 square transformation. Economy class passenger data shows a structural break that Box-Cox cannot correct, even though λ suggests a transformation.

Southern Cross Station pedestrian counts λ = -0.25. Pedestrian counts benefit from a Box-Cox transformation because it reduces the effect of large spikes and stabilizes the variance

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)
Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
Do the results support the graphical interpretation from part a?
Compute and plot the seasonally adjusted data.
Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Does it make any difference if the outlier is near the end rather than in the middle of the time series?

# a) Get last 5 years (5*4 = 20 quarters)
gas <- tail(aus_production, 5*4) %>% select(Gas)
gas

## # A tsibble: 20 x 2 [1Q]
##      Gas Quarter
##    <dbl>   <qtr>
##  1   221 2005 Q3
##  2   180 2005 Q4
##  3   171 2006 Q1
##  4   224 2006 Q2
##  5   233 2006 Q3
##  6   192 2006 Q4
##  7   187 2007 Q1
##  8   234 2007 Q2
##  9   245 2007 Q3
## 10   205 2007 Q4
## 11   194 2008 Q1
## 12   229 2008 Q2
## 13   249 2008 Q3
## 14   203 2008 Q4
## 15   196 2009 Q1
## 16   238 2009 Q2
## 17   252 2009 Q3
## 18   210 2009 Q4
## 19   205 2010 Q1
## 20   236 2010 Q2

# b) 
gas %>% autoplot(Gas) +
  labs(title = "Gas Production - Last 5 Years",
       y = "Gas Production")

# c) 
gas_dcmp <- gas %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components()

gas_dcmp

## # A dable: 20 x 7 [1Q]
## # Key:     .model [1]
## # :        Gas = trend * seasonal * random
##    .model                      Quarter   Gas trend seasonal random season_adjust
##    <chr>                         <qtr> <dbl> <dbl>    <dbl>  <dbl>         <dbl>
##  1 "classical_decomposition(G… 2005 Q3   221   NA     1.13  NA              196.
##  2 "classical_decomposition(G… 2005 Q4   180   NA     0.925 NA              195.
##  3 "classical_decomposition(G… 2006 Q1   171  200.    0.875  0.974          195.
##  4 "classical_decomposition(G… 2006 Q2   224  204.    1.07   1.02           209.
##  5 "classical_decomposition(G… 2006 Q3   233  207     1.13   1.000          207.
##  6 "classical_decomposition(G… 2006 Q4   192  210.    0.925  0.987          208.
##  7 "classical_decomposition(G… 2007 Q1   187  213     0.875  1.00           214.
##  8 "classical_decomposition(G… 2007 Q2   234  216.    1.07   1.01           218.
##  9 "classical_decomposition(G… 2007 Q3   245  219.    1.13   0.996          218.
## 10 "classical_decomposition(G… 2007 Q4   205  219.    0.925  1.01           222.
## 11 "classical_decomposition(G… 2008 Q1   194  219.    0.875  1.01           222.
## 12 "classical_decomposition(G… 2008 Q2   229  219     1.07   0.974          213.
## 13 "classical_decomposition(G… 2008 Q3   249  219     1.13   1.01           221.
## 14 "classical_decomposition(G… 2008 Q4   203  220.    0.925  0.996          219.
## 15 "classical_decomposition(G… 2009 Q1   196  222.    0.875  1.01           224.
## 16 "classical_decomposition(G… 2009 Q2   238  223.    1.07   0.993          222.
## 17 "classical_decomposition(G… 2009 Q3   252  225.    1.13   0.994          224.
## 18 "classical_decomposition(G… 2009 Q4   210  226     0.925  1.00           227.
## 19 "classical_decomposition(G… 2010 Q1   205   NA     0.875 NA              234.
## 20 "classical_decomposition(G… 2010 Q2   236   NA     1.07  NA              220.

gas_dcmp %>% autoplot() +
  labs(title = "Classical Multiplicative Decomposition of Gas Production")

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

# d) 
print("Ye, the decomposition confirms the seasonal pattern and slight upward trend observed in the original plot, with little unexplained variation remaining.")

## [1] "Ye, the decomposition confirms the seasonal pattern and slight upward trend observed in the original plot, with little unexplained variation remaining."

# e) 
gas_dcmp %>%
  as_tsibble() %>%
  autoplot(Gas, color = "gray") +
  geom_line(aes(y = season_adjust), color = "blue") +
  labs(title = "Gas Production: Original vs Seasonally Adjusted",
       y = "Gas Production")

# f) Add outlier in the middle and recompute
gas_outlier <- gas %>%
  mutate(Gas = if_else(row_number() == 10, Gas + 300, Gas))

gas_outlier_dcmp <- gas_outlier %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components()

gas_outlier_dcmp %>% autoplot() +
  labs(title = "Classical Decomposition with Outlier (Middle)")

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Compare seasonally adjusted
gas_outlier_dcmp %>%
  as_tsibble() %>%
  autoplot(Gas, color = "gray") +
  geom_line(aes(y = season_adjust), color = "red") +
  labs(title = "With Outlier in Middle",
       y = "Gas Production")

# g) Outlier near the end
gas_outlier_end <- gas %>%
  mutate(Gas = if_else(row_number() == 18, Gas + 300, Gas))

gas_outlier_end_dcmp <- gas_outlier_end %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components()

gas_outlier_end_dcmp %>% autoplot() +
  labs(title = "Classical Decomposition with Outlier (End)")

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Compare seasonally adjusted
gas_outlier_end_dcmp %>%
  as_tsibble() %>%
  autoplot(Gas, color = "gray") +
  geom_line(aes(y = season_adjust), color = "red") +
  labs(title = "With Outlier Near End",
       y = "Gas Production")

print("Yes, an outlier in the middle affects the estimated trend and seasonally adjusted values on both sides of the outlier, while an outlier near the end mainly affects the most recent estimates.")

## [1] "Yes, an outlier in the middle affects the estimated trend and seasonally adjusted values on both sides of the outlier, while an outlier near the end mainly affects the most recent estimates."

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(12345678)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

# X-11 decomposition
x11_dcmp <- myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()

x11_dcmp %>% autoplot()

Yes, the X-11 decomposition reveals a major outlier at the very beginning of the series around 1982-1983 where the irregular component drops to around 0.85-0.88, which is a much larger deviation than anywhere else in the series. The decomposition also shows a slight dip in the trend around 1995 that might not have been obvious in the raw data. After the initial outlier, the irregular component is very stable, indicating the series is well-behaved with no other unusual features.

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

STL Decomposition plot Figure 3.19: Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Decomposition Plot 2 Figure 3.20: Seasonal component from the decomposition shown in the previous figure.

a)Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The trend shows a steady increase in the civilian labor force over time, rising from about 6.5 million to around 9 million people. The seasonal component is relatively small compared to the overall level of the series, which suggests that seasonality plays only a minor role. Most of the remaining variation is fairly small and random. However, there is a noticeable negative spike in the remainder around 1991–1992, indicating an unusually large deviation during that period.

b)Is the recession of 1991/1992 visible in the estimated components? Yes, the recession is clearly visible as a sharp negative drop in the remainder component around 1991-1992, reaching approximately -400 thousand persons.

Data624_Homework2_ahmed_hassan

Ahmed Hassan

2026-02-12