Homework 2

3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9

3.1 Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.5.2

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tsibble     1.1.6     ✔ feasts      0.4.2
## ✔ tsibbledata 0.4.1     ✔ fable       0.5.0

## Warning: package 'tsibble' was built under R version 4.5.2

## Warning: package 'tsibbledata' was built under R version 4.5.2

## Warning: package 'feasts' was built under R version 4.5.2

## Warning: package 'fabletools' was built under R version 4.5.2

## Warning: package 'fable' was built under R version 4.5.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

global <- global_economy |> mutate(Gdp_per_cap = (GDP/Population)) |> arrange(desc(Gdp_per_cap))

## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Country`, `Year` first.

print(n=1, global)

## # A tsibble: 15,150 x 10 [1Y]
## # Key:       Country [263]
##   Country Code   Year    GDP Growth   CPI Imports Exports Population Gdp_per_cap
##   <fct>   <fct> <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>       <dbl>
## 1 Monaco  MCO    2014 7.06e9   7.18    NA      NA      NA      38132     185153.
## # ℹ 15,149 more rows

#If we're just sorting for the top value, Monaco had the highest GDP per capita in 2014.

#in the 60s/70s, the US and, for a couple of years, the UAE and Kuwait were at the top. Monaco has had the highest GDP per capita most years. In 2017 it was Luxembourg and in 2015 it was Lichtenstein. These are small countries full of millionaires. 

global_max_per_year <- global_economy |> as_tibble() |> mutate(Gdp_per_cap = GDP / Population) |> group_by(Year) |> slice_max(Gdp_per_cap, n = 1) |>  ungroup()

#graph with legend removed, so we can see every country

ggplot(global, aes(x = Year, y = Gdp_per_cap, color = Country)) + geom_line() +
  theme(legend.position = "none")

## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

#Graph of just the US, UAE, Monaco, Kuwait, Luxembourg, and Lichtenstein (we're missing a lot of data from Kuwait)

global_max_6 <- global_economy |> as_tibble() |> mutate(Gdp_per_cap = GDP / Population) |> group_by(Year) |> filter(Code %in% c('MCO', 'KWT', 'USA', 'ARE', 'LIE', 'LUX') )

ggplot(global_max_6, aes(x = Year, y = Gdp_per_cap, color = Country)) + geom_line()

## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_line()`).

3.2 For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

Victorian Electricity Demand from vic_elec.

#I took this question as transform includes  mathematically adjusting
#United States GDP from global_economy.

#transformed by GDP per capita

ggplot((global |> filter(Country == "United States")), aes(x = Year, y = Gdp_per_cap, color = Country)) + geom_line()

#this number is pretty low, so it's worth trying

global_economy |> features(GDP, features = guerrero)

## Warning: 129 errors (1 unique) encountered for feature 1
## [129] missing value where TRUE/FALSE needed

## # A tibble: 263 × 2
##    Country             lambda_guerrero
##    <fct>                         <dbl>
##  1 Afghanistan                 0.0728 
##  2 Albania                    NA      
##  3 Algeria                    -0.00502
##  4 American Samoa             NA      
##  5 Andorra                    NA      
##  6 Angola                     NA      
##  7 Antigua and Barbuda        NA      
##  8 Arab World                 NA      
##  9 Argentina                  NA      
## 10 Armenia                    NA      
## # ℹ 253 more rows

global_economy |> filter(Code == 'USA') |> autoplot(box_cox(GDP, .282))

#Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

ggplot((aus_livestock |> filter(Animal == "Bulls, bullocks and steers")), aes(x = Month, y = Count, color = State)) + geom_line()

#this data should be averaged monthly to get a better idea independent of seasonality
#creating a new table with the 5-month average

aus_bull <- aus_livestock |>
    mutate(
        fivemonth = slider::slide_dbl(Count, mean,
                                   .before = 2, .after = 2, .complete = TRUE)
    )

ggplot((aus_bull |> filter(Animal == "Bulls, bullocks and steers")), aes(x = Month, y = fivemonth, color = State)) + geom_line()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

#Victorian Electricity Demand from vic_elec

#let's make each measure a 24-hour average using 23/24 time stamps on either side
#This way, we can see daily need without the noise of nighttime dips

daily_elec <- vic_elec |>
    mutate(
        daily = slider::slide_dbl(Demand, mean,
                                      .before = 23, .after = 24, .complete = TRUE)
    )

ggplot(daily_elec, aes(x = Time, y = daily, color = Temperature)) + geom_line()

## Warning: Removed 47 rows containing missing values or values outside the scale range
## (`geom_line()`).

#I would also explore weekly averages

weekly_elec <- vic_elec |>
    mutate(
        weekly = slider::slide_dbl(Demand, mean,
                                  .before = 167, .after = 168, .complete = TRUE)
    )

ggplot(weekly_elec, aes(x = Time, y = weekly, color = Temperature)) + geom_line()

## Warning: Removed 335 rows containing missing values or values outside the scale range
## (`geom_line()`).

#And here it is adjusted by temperature (I would like to know if this is a legitimate way of adjusting -- it seems like the need goes up/down exponentially with temperature, particularly when it's cold)

vic_elec_temp <- vic_elec |>
    mutate(
        temp_adj = Demand / Temperature )

ggplot(vic_elec_temp, aes(x = Time, y = temp_adj, color = Temperature)) + geom_line()

#adjusted with a box-cox
#this took a strangely long time to calcuate and doesn't make the graph much more reatable

vic_elec |> features(Demand, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1          0.0999

vic_elec |> autoplot(box_cox(Demand, .099))

#Gas production from aus_production.
ggplot(aus_production, aes(x = Quarter, y = Gas)) + geom_line() + scale_y_log10()

ggplot(aus_production, aes(x = Quarter, y = Gas)) + geom_line()

#I used a log transformation to make the pattern (particularly at the beginning) more visible

3.3 Why is a Box-Cox transformation unhelpful for the canadian_gas data?

library(latex2exp)

## Warning: package 'latex2exp' was built under R version 4.5.2

#regular line plot

ggplot(data = canadian_gas, aes(x = Month, y = Volume)) + geom_line()

#box-cox plot
lambda <- canadian_gas |> features(Volume, features = guerrero) |>
    pull(lambda_guerrero)

canadian_gas |>
    autoplot(box_cox(Volume, lambda)) +
    labs(y = "",
         title = latex2exp::TeX(paste0(
             "Transformed gas production with $\\lambda$ = ",
             round(lambda,2))))

#a box-cox compresses the scale and makes seasonal variations more homogeneous -- the variations are already homogeneous (already has a constant variance) and the two plots look very similar

#I took this code from the book's text, but re-watched the video and realized there was an easier way to do it 
canadian_gas |> features(Volume, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.577

canadian_gas |> autoplot(box_cox(Volume, .577))

3.4 What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)? aus_retail

#group all the retail sectors together 

aus_retail_2 <- aus_retail |> summarize(sumt = sum(Turnover))

#the autoplot

autoplot(aus_retail_2)

## Plot variable not specified, automatically selected `.vars = sumt`

#the seasonal variation looks about the same at about .1 (to me)

aus_retail_2 |> autoplot(box_cox(sumt, .1))

#this gives me .196, so the correct answer may be closer to .2

aus_retail_2 |> features(sumt, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.196

3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

aus_production |> features(Tobacco, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.926

#this gives .926

#checking the plot - they look even to me

aus_production  |> autoplot(box_cox(Tobacco, .926))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

Economy class passengers between Melbourne and Sydney from ansett,

class_group <- ansett |> group_by(Class) |> summarize(sum = sum(Passengers)) 

#economy passgengers = 2
class_group |> features(sum, features = guerrero)

## # A tibble: 3 × 2
##   Class    lambda_guerrero
##   <chr>              <dbl>
## 1 Business          -0.900
## 2 Economy            2.00 
## 3 First             -0.900

class_group |> filter(Class == "Economy") |> autoplot(box_cox(sum, 2))

Tobacco from aus_production,

aus_production |> features(Tobacco, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.926

aus_production |> autoplot(box_cox(Tobacco, .926))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

#and Pedestrian counts at Southern Cross Station from pedestrian.
pedestrian |> filter(Sensor == 'Southern Cross Station') |> features(Count, features = guerrero)

## # A tibble: 1 × 2
##   Sensor                 lambda_guerrero
##   <chr>                            <dbl>
## 1 Southern Cross Station          -0.250

#this...does not look great

pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(box_cox(Count, -0.250))

#but better than it was
pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(box_cox(Count, 1))

3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas) Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle? Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices. Do the results support the graphical interpretation from part a? Compute and plot the seasonally adjusted data. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier? Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas <- tail(aus_production, 5*4) |> select(Gas)

#autoplot time series
#there are very clear seasonal fluctuations, with a low in Q1, high in Q3, dip in Q4. 
autoplot(gas)

## Plot variable not specified, automatically selected `.vars = Gas`

#classical decomposition reflects the seasonality out and trend appropriately 
gas |>
    model(
        classical_decomposition(Gas, type = "multiplicative")
    ) |>
    components() |>
    autoplot()

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

#Compute and plot the seasonally adjusted data.
dcmp <- gas |>
    model(classical = classical_decomposition(Gas, type = "multiplicative")) |>
    components()

#just the trend

autoplot(dcmp, trend)

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

#seasonal indices (these repeat)

dcmp |> 
    as_tibble() |> 
    select(Quarter, seasonal) |> 
    head(4)

## # A tibble: 4 × 2
##   Quarter seasonal
##     <qtr>    <dbl>
## 1 2005 Q3    1.13 
## 2 2005 Q4    0.925
## 3 2006 Q1    0.875
## 4 2006 Q2    1.07

#Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

#in the middle
gas2 <- gas
gas2[10, "Gas"] <- 1500
gas2 |>
    model(
        classical_decomposition(Gas, type = "multiplicative")
    ) |>
    components() |>
    autoplot()

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

#at the beginning

gas[4, "Gas"] <- 1000


gas |>
    model(
        classical_decomposition(Gas, type = "multiplicative")
    ) |>
    components() |>
    autoplot()

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

#adding an outlier to the end of the series

gas[18, "Gas"] <- 1500

gas |>
    model(
        classical_decomposition(Gas, type = "multiplicative")
    ) |>
    components() |>
    autoplot()

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

#in the middle
gas2 <- gas
gas2[10, "Gas"] <- 1500
gas2 |>
    model(
        classical_decomposition(Gas, type = "multiplicative")
    ) |>
    components() |>
    autoplot()

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

#adding outliers adds spikes to the trends and skews the seasonality to very sharp peaks

#Yes, it makes a difference if the outlier is at the end or beginning of the cycle or in the middle. When the outlier is in the middle, the model seems better equipped to pull it out as random noise, though it does seem to skew the seasonal component.

3.8 Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

#using the aggregate table aus_retail_2 from earlier, which combines all sectors

#this called for a library called seasonal

library(seasonal)

## Warning: package 'seasonal' was built under R version 4.5.2

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

x11_dcmp_retail <- aus_retail_2 |>
    model(x11 = X_13ARIMA_SEATS(sumt ~ x11())) |>
    components()
autoplot(x11_dcmp_retail) +
    labs(title =
             "Decomposition of total Australian retail employment using X-11.")

#there's a high outlier in 2000/2001, and then a quick drop, the data overall is really seasonal

#here, I'll try with one sector (clothing)

aus_retail |> filter (Industry == "Clothing retailing")  |>
    model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
    components()

## # A dable: 3,456 x 9 [1M]
## # Key:     State, Industry, .model [8]
## # :        Turnover = trend * seasonal * irregular
##    State              Industry .model    Month Turnover trend seasonal irregular
##    <chr>              <chr>    <chr>     <mth>    <dbl> <dbl>    <dbl>     <dbl>
##  1 Australian Capita… Clothin… x11    1982 Apr      3.7  3.22    1.11      1.04 
##  2 Australian Capita… Clothin… x11    1982 May      3.8  3.22    1.17      1.01 
##  3 Australian Capita… Clothin… x11    1982 Jun      3.2  3.25    1.01      0.975
##  4 Australian Capita… Clothin… x11    1982 Jul      3.4  3.29    1.05      0.979
##  5 Australian Capita… Clothin… x11    1982 Aug      3.1  3.37    0.942     0.977
##  6 Australian Capita… Clothin… x11    1982 Sep      3.4  3.46    0.957     1.03 
##  7 Australian Capita… Clothin… x11    1982 Oct      3.4  3.53    0.957     1.01 
##  8 Australian Capita… Clothin… x11    1982 Nov      3.6  3.58    0.976     1.03 
##  9 Australian Capita… Clothin… x11    1982 Dec      4.7  3.61    1.29      1.01 
## 10 Australian Capita… Clothin… x11    1983 Jan      2.7  3.64    0.774     0.960
## # ℹ 3,446 more rows
## # ℹ 1 more variable: season_adjust <dbl>

autoplot(x11_dcmp_retail) +
    labs(title =
             "Decomposition of total Australian clothing retail employment using X-11.")

#It looks similar. There's in outlier in the 80s that I didn't spot before, as well as the same spike/drop in 2000/2001

3.9

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?

There’s a consistent upward trend, which accounts for most of the data. The seasonal component is consistent, but seasonality isn’t very strong, as per the scale on the lefthand side. The remainder represents random noise and includes a couple of significant dips.

The recession of 1992/1992 is visible as a remainder (outliers), and it’s significant – the scale of the bottom (remainder) graph is larger than the seasonal graph. The trend cycle also flattens out a little starting around then, indicating potentially more stagnant employment.

Homework 2

Samantha Barbaro

2026-02-08