Import Libraries

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
## 
##     interval
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.1
## ✔ readr   2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibbledata 0.4.1     ✔ fable       0.3.4
## ✔ feasts      0.3.2     ✔ fabletools  0.4.2
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.3.3
## 
## Attaching package: 'seasonal'
## 
## The following object is masked from 'package:tibble':
## 
##     view

3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

data("global_economy")
# compute GDP per capita for each country
global_economy <- global_economy %>%
  mutate(GDPperCapita = GDP / Population)
global_economy %>% autoplot(GDPperCapita, show.legend = FALSE) +
  labs(title = "GDP Per Capita", y = "USD")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Country with the highest GDP per capita in each year
gdpCap<- global_economy %>%
  index_by(Year)%>%
  filter(GDPperCapita == max(GDPperCapita, na.rm = TRUE)) %>%
  select(Country, GDPperCapita) %>%
  arrange(Year)
## Adding missing grouping variables: `Year`
# Top 5 Countries GDP per Capita
head(gdpCap  %>%
  arrange(desc(GDPperCapita)), n = 5)
# Changing data type
gdpCap$Country <- factor(gdpCap$Country)

# Graph
ggplot(gdpCap, aes(x = as.factor(Year), y = GDPperCapita, fill = Country)) +
  geom_col() +
  labs(title = "Highest GDP per Capita by Year, 1960 to 2017", x = "Year", y = "GDP per Capita (USD)", fill = "Country") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_discrete(breaks = as.character(seq(1960, 2015, by = 5)))

Monaco had the highest GDP per capita, reaching about US$185,153 in 2014. Looking at the leading country in each year, Kuwait and the United States were at the top during the 1960s and 1970s, Monaco dominated most of the subsequent years, and Luxembourg and Liechtenstein led for a few years toward the end of the period.
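
As a quick check on this claim (a small supplementary sketch using the gdpCap table built above), we can count how many years each country topped the per-capita ranking:

gdpCap %>%
  as_tibble() %>%
  ungroup() %>%
  count(Country, sort = TRUE) # number of years each country had the highest GDP per capita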

3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Victorian Electricity Demand from vic_elec.
Gas production from aus_production.

data("aus_livestock")
data("vic_elec")
data("aus_production")

United States GDP from global_economy.

global_economy %>% filter(Country =='United States') %>%
  autoplot(GDP) +
  labs(title = "USA GDP", y = "USD")

The series shows a smooth upward trend with no seasonal pattern, so a transformation is not needed.

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

aus_livestock %>% filter(Animal == 'Bulls, bullocks and steers') %>%
  filter(State == 'Victoria') %>%
  autoplot(Count) +
  labs(title = "Victorian Bulls, bullocks and Steers Slaughtered", y = 'Count')

The variation in the series does not appear to grow with its level, so a transformation is not needed.

Victorian Electricity Demand from vic_elec.

vic_elec %>% autoplot(Demand) + 
  labs(title = 'Victorian Electricity Demand', y = 'MWh')

electric <- vic_elec %>%
  mutate(Year = year(Date)) %>%
  index_by(Year) %>%
  summarise(Demand = sum(Demand))

electric %>% autoplot(Demand) + 
  labs(title = 'Yearly Victorian Electricity Demand', y = 'MWh')

Given the significant half-hourly noise in the dataset, I aggregated the electricity demand to yearly totals. This reduces the noise and highlights the downward trend in total electricity demand over the period covered by the data.
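
As an intermediate view (a supplementary sketch, not required by the exercise), the half-hourly demand can also be aggregated to daily totals; this smooths the intraday noise while keeping the annual seasonal pattern visible:

vic_elec %>%
  index_by(Day = as_date(Time)) %>%
  summarise(Demand = sum(Demand)) %>%
  autoplot(Demand) +
  labs(title = "Daily Victorian Electricity Demand", y = "MWh")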

Gas production from aus_production.

aus_production %>% autoplot(Gas) +
  labs(title = 'Gas Production', y = 'petajoules')

lambda <- aus_production %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

aus_production %>%
  autoplot(box_cox(Gas, lambda)) +
  labs(title = paste("Transformed Gas Production with \u03BB =", round(lambda, 2)), y = 'petajoules')

I opted to apply a Box-Cox transformation to better highlight the seasonal variations over the entire time period. This also reveals a consistent seasonal pattern throughout the data.

3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

canadian_gas %>% autoplot(Volume) +
  labs(title = "Canadian Gas Data", y ="Volumne in Billions of Cubic Metres")

print(range(canadian_gas$Volume))
## [1]  0.9660 19.5284

Although the volumes only span a modest range, from 0.97 to 19.53 billion cubic metres, the main reason a Box-Cox transformation is unhelpful here is that the seasonal variation does not change steadily with the level of the series: it is small in the early years, largest through the middle of the series, and smaller again toward the end. A Box-Cox transformation can only stabilise variation that grows (or shrinks) with the level, so no choice of λ gives the series constant seasonal variation.
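
To illustrate this (a supplementary sketch), we can still ask the Guerrero method for a λ and plot the transformed series; the seasonal variation in the middle of the series should still look larger than at either end:

lambda_cg <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

canadian_gas %>%
  autoplot(box_cox(Volume, lambda_cg)) +
  labs(title = paste("Box-Cox Transformed Canadian Gas with \u03BB =", round(lambda_cg, 2)))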

3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(17)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) 
myseries %>% autoplot(Turnover) +
  labs(title = "Retail Data Turnover",
       y = "$AUD (Millions)")

lambda <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)

myseries %>% autoplot(box_cox(Turnover, lambda))+
  labs(title = paste("Transformed Retail Turnover with \u03BB =", round(lambda, 2)))

Applying the Box-Cox transformation with λ = 0.14 makes the seasonal variation roughly constant across the series. A λ this close to zero behaves much like a log transformation, which suits a series whose seasonal swings grow roughly in proportion to its level, and the Guerrero feature selects the λ that makes this variation most uniform, simplifying later modelling and forecasting.
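
For reference, box_cox() applies (roughly) w = log(y) when λ = 0 and w = (y^λ − 1)/λ otherwise, so a λ of 0.14 behaves much like a log transform. A quick visual check (a supplementary sketch) is to compare the plot above with the log-transformed turnover:

myseries %>%
  autoplot(log(Turnover)) +
  labs(title = "Log-Transformed Retail Turnover (for comparison with \u03BB = 0.14)")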

3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

data("pedestrian")
data("ansett")

Tobacco from aus_production

aus_production %>% autoplot(Tobacco) +
  labs(title = "Original Tobacco Production", y = "Production in tonnes")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

lambda <- aus_production %>%
  features(Tobacco, features = guerrero)%>%
  pull(lambda_guerrero)

aus_production %>% autoplot(box_cox(Tobacco,lambda)) +
  labs(title = paste("Transformed Tobacco Production with \u03BB =", round(lambda, 2)))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

For the Tobacco data, the Guerrero method produced a λ of 0.93. Since this is close to 1, the Box-Cox transformation is nearly the identity and changes the series very little.

Economy class passengers between Melbourne and Sydney from ansett

econ <- ansett %>%
  filter(Class == "Economy",
         Airports == "MEL-SYD")

autoplot(econ, Passengers)+
  labs(title = "Economy class Passengers Between Melbourne and Sydney")

lambda <- econ %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

econ %>%
  autoplot(box_cox(Passengers, lambda)) +
  labs(title = paste("Transformed Economy Class Passengers Count with \u03BB =", round(lambda, 2)))

For the Ansett data, the Guerrero method returned a λ of about 2, so the Box-Cox transformation roughly squares the passenger counts, which helps even out the variation across the series.

Pedestrian counts at Southern Cross Station from pedestrian

pedestrian %>% filter(Sensor =='Southern Cross Station') %>% autoplot(Count)+
  labs(title = "Original Pedestrian Count")

weekly <- pedestrian %>%
  filter(Sensor == 'Southern Cross Station') %>%
  mutate(Week = yearweek(Date)) %>%
  index_by(Week) %>%
  summarise(Count = sum(Count))

weekly %>% autoplot(Count)+
  labs(title = "Weekly Pedestrian Count")

lambda <- weekly %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

weekly %>% autoplot(box_cox(Count,lambda)) +
  labs(title = paste("Transformed Weekly Pedestrian Count with \u03BB =", round(lambda, 2)))

While the Box-Cox transformation applies a power adjustment, the more significant change was aggregating the hourly pedestrian counts into weekly totals. Over the period from January 2015 to December 2016, the hourly series contains a very large number of observations, which makes the raw plot noisy; the weekly totals are far easier to read, and the Box-Cox transformation then helps stabilise the remaining variation.
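
To quantify the reduction in the number of observations (a small supplementary check):

pedestrian %>%
  filter(Sensor == 'Southern Cross Station') %>%
  nrow() # hourly observations for Southern Cross Station
nrow(weekly) # weekly totals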

3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

A. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

gas %>% autoplot(Gas) +
  labs(title = "Australia Gas Production", y = "Petajoules")

There appears to be a seasonal fluctuation in gas production, reaching its lowest point towards the end of Q4 and peaking around the middle of the year, near the end of Q2.

B. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>% model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical additive decomposition of total US retail employment")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

C. Do the results support the graphical interpretation from part a?

The results confirm Part A. The seasonal line shows a trough at the start of the year, a peak around the middle of the year, and a gradual decline as the year progresses toward its end.

D. Compute and plot the seasonally adjusted data.

gas_season <- gas %>% model(classical_decomposition(Gas, type = "multiplicative"))  
components(gas_season) %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "darkgray") +
  geom_line(aes(y=season_adjust), colour = "#3230B2") +
  labs(title = "Seasonally Adjusted Gas Production")

E. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

# create an outlier: add 300 to the observation whose value is 171
gas$Gas[gas$Gas == 171] <- gas$Gas[gas$Gas == 171] + 300

gas %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "darkgray") +
  geom_line(aes(y=season_adjust), colour = "#3230B2") +
  labs(title = "Seasonally Adjusted Data with an Outlier")

The outlier produces a clear spike in both the raw series and the seasonally adjusted series. Because seasonal adjustment removes only the seasonal component, the spike passes straight through into the adjusted data, and it also distorts the estimated trend-cycle around that point.

F. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

# move the outlier: undo the previous change (471 = 171 + 300) and add 300 to a different observation
gas$Gas[gas$Gas == 471] <- gas$Gas[gas$Gas == 471] - 300
gas$Gas[gas$Gas == 245] <- gas$Gas[gas$Gas == 245] + 300

gas %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "darkgray") +
  geom_line(aes(y=season_adjust), colour = "#3230B2") +
  labs(title = "Seasonally Adjusted Data with a Middle Outlier")

There appears to be no difference in the impact of the outlier on the trend and data; its effect seems significant regardless of its placement.

3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

x11_dcmp <- myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()
autoplot(x11_dcmp) +
  labs(title = "Decomposition of total US retail employment using X-11.")

Examining the irregular component of the decomposition, we see elevated noise at the beginning of the 1990s and again toward the end of the 2000s, which likely corresponds to the recessions in those periods. Outside of these episodes, the series shows a consistent trend and a stable seasonal pattern.
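
To examine those periods more closely (a supplementary sketch), the irregular component can be plotted on its own, following the same pattern used for the seasonally adjusted series earlier:

x11_dcmp %>%
  as_tsibble() %>%
  autoplot(irregular) +
  labs(title = "Irregular Component of the X-11 Decomposition")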

3.9

A. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

Figure 3.19 illustrates a positive, rising trend over time. The seasonal pattern shows three peaks in civilian labor force hiring each year, typically around March, September, and December. The remainder stays relatively small and stable except for the period from the late 1980s to about 1993, with a noticeable trough towards the end of 1990 and into 1991.

B. Is the recession of 1991/1992 visible in the estimated components?

In Figure 3.20, the seasonal component of the decomposition shows a sharp decline from March to August in the early 1990s. This is consistent with Figure 3.19, where the remainder component drops sharply over 1990–1991, supporting the evidence of a recession during that period.