Data 624 Homework

Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book. .

Exercise 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

The global_economy dataset contains GDP and Population.

We will calculate GDP per capita by dividing GDP by Population.

Identify the country with the highest GDP per capita overall

## # A tsibble: 7 x 3 [1Y]
## # Key:       Country [2]
##   Country        Year max_gdp_per_capita
##   <fct>         <dbl>              <dbl>
## 1 Monaco         2014            185153.
## 2 Monaco         2008            180640.
## 3 Liechtenstein  2014            179308.
## 4 Liechtenstein  2013            173528.
## 5 Monaco         2013            172589.
## 6 Monaco         2016            168011.
## 7 Liechtenstein  2015            167591.

To investigate this inquiry in greater depth and ascertain which country possesses the highest GDP per capita, we will focus on nations within the upper echelon, applying a filter for those with a GDP per capita exceeding 80,000 USD.

global_economy %>%
  filter(GDP_per_capita > 80000) %>%
  autoplot(GDP_per_capita) +
  labs(title= "GDP per capita", y = "$US")

The chart above shows that Monaco had significant GDP growth from 2000 to 2014, culminating in a record GDP per capita of 185152.5 in 2014.

monaco_gdp <- global_economy %>%
  filter(Country == "Monaco") %>%
  select(Year, GDP_per_capita)  # Select only the Year and GDP columns
monaco_gdp_ts <- monaco_gdp %>%
  as_tsibble(index = Year)
monaco_gdp_ts %>%
  gg_lag(GDP_per_capita) +
  labs(title = "Lag Plot of Monaco's GDP",
       x = "Lagged GDP",
       y = "GDP",
       caption = "Source: global_economy dataset from fpp3")

Exercise 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

- United States GDP from global_economy.

- Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

- Victorian Electricity Demand from vic_elec.

- Gas production from aus_production.

data("global_economy")

us_gdp <- global_economy %>% filter(Country == "United States")
vic_bulls <- aus_livestock %>% filter(Animal == "Bulls, bullocks and steers", State == "Victoria")
vic_elec_demand <- vic_elec
gas_production <- aus_production

1. Analyze United States GDP from global_economy

us_gdp %>%
ggplot(aes(x = Year, y = GDP)) +
geom_line() +
labs(title = "U.S. GDP Over Time",
 x = "Year",
y = "GDP (in $USD)")

The GDP series shows exponential growth, suggesting a log transformation might be useful.

us_gdp_transformed <- us_gdp %>%
  autoplot(log(GDP)) + ggtitle("Log Transformed USA GDP") 
us_gdp_transformed

The log transformation makes the growth appear more linear and stabilizes the variance.

2. Analyze Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

vic_bulls %>%
  autoplot(Count) +
  ggtitle("Slaughter of Victorian Bulls, bullocks and steers")

The series is showing more and more variation as time goes on, which hints that a log transformation could be helpful.

# Apply log transformation
vic_bulls_transformed <- vic_bulls %>%
  mutate(log_Count = log(Count))


vic_bulls_transformed %>%
  autoplot(log_Count) +
  ggtitle("Log Transformed Slaughter of Victorian Bulls, bullocks and steers")

The log transformation stabilizes the variance in the slaughter series.

3. Victorian Electricity Demand from vic_elec.

vic_elec_demand %>%
  autoplot(Demand) +
  ggtitle("Victorian Electricity Demand")

The series seems to have a pretty consistent variance, but we could try a log transformation to check if it makes the series better.

# Apply log transformation
vic_elec_demand_transformed <- vic_elec_demand %>%
  mutate(log_Demand = log(Demand))

vic_elec_demand_transformed %>%
  autoplot(log_Demand) +
  ggtitle("Log Transformed Victorian Electricity Demand")

4. Gas production from aus_production.

The log transformation doesn’t drastically change the series, but it might help in stabilizing variance slightly.

gas_production %>%
  autoplot(Gas) +
  ggtitle("Gas Production")

The gas production series shows exponential growth, suggesting a log transformation might be useful.

# log transf
gas_production_transformed <- gas_production %>%
  mutate(log_Gas = log(Gas))

# Plot
gas_production_transformed %>%
  autoplot(log_Gas) +
  ggtitle("Log Transformed Gas Production")

Using a log transformation helps to make the growth look more like a straight line and keeps the variance steady.

Exercise 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

gas_data <- canadian_gas
head(gas_data)

## # A tsibble: 6 x 2 [1M]
##      Month Volume
##      <mth>  <dbl>
## 1 1960 Jan   1.43
## 2 1960 Feb   1.31
## 3 1960 Mar   1.40
## 4 1960 Apr   1.17
## 5 1960 May   1.12
## 6 1960 Jun   1.01

# Calculate optimal lambda

lambda <- gas_data %>%
features(Volume, features = guerrero) %>%
pull(lambda_guerrero)

#Box-Cox transformation

gas_transformed <- gas_data %>%
mutate(box_cox = box_cox(Volume, lambda))

gas_transformed %>%
  autoplot() + labs(title = " Monthly Canadian gas production", x = "Months", y = "volume")

## Plot variable not specified, automatically selected `.vars = Volume`

gas_transformed %>%
  autoplot() + labs(title = "Monthly Canadian gas production Transformed", x = "Months", y = "box_cox")

## Plot variable not specified, automatically selected `.vars = Volume`

The Box-Cox transformation isn’t really useful for the canadian_gas data since it doesn’t tackle the main seasonal trends and could make things harder to interpret, all without offering any real benefits for forecasting.

Exercise 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

retail_series <- aus_retail %>%
filter(Industry == "Food retailing")
head(retail_series,7)

## # A tsibble: 7 x 5 [1M]
## # Key:       State, Industry [1]
##   State                        Industry       `Series ID`    Month Turnover
##   <chr>                        <chr>          <chr>          <mth>    <dbl>
## 1 Australian Capital Territory Food retailing A3349688X   1982 Apr     15.5
## 2 Australian Capital Territory Food retailing A3349688X   1982 May     15.1
## 3 Australian Capital Territory Food retailing A3349688X   1982 Jun     15.5
## 4 Australian Capital Territory Food retailing A3349688X   1982 Jul     16.1
## 5 Australian Capital Territory Food retailing A3349688X   1982 Aug     15.8
## 6 Australian Capital Territory Food retailing A3349688X   1982 Sep     16  
## 7 Australian Capital Territory Food retailing A3349688X   1982 Oct     16.6

retail_series%>%autoplot() + ggtitle("Australian Retail Trade Turnover")

## Plot variable not specified, automatically selected `.vars = Turnover`

We apply the guerrero method to find the best lambda value for the Box-Cox transformation. This approach picks the lambda that reduces the coefficient of variation for the transformed data.

Determine the optimal Box-Cox transformation parameter (lambda)

# Determine the optimal Box-Cox transformation parameter (lambda)

lambda <- retail_series %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)

lambda

## [1]  0.30276595  0.08786620 -0.06280256  0.07636928  0.08473322  0.09023149
## [7]  0.05068016  0.11791154

For box-cox transformation I will use lambda from the Guerrero feature with value of it’s 0.30276595.

lambda<-0.08786620

# Apply the Box-Cox transformation

retail_series_transformed <- retail_series %>%
mutate(Turnover_Transformed = box_cox(Turnover, lambda))
retail_series_transformed%>%autoplot(box_cox(Turnover,lambda)) + ggtitle("Transformed Retail Series")

By displaying the original series next to the transformed one, we can get a good look at the effects of the Box-Cox transformation. The transformed series is expected to show a more uniform variance over time than the original series does.

Exercise 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Tobacco data

tobacco_data <- aus_production %>%
select(Quarter, Tobacco)
head(tobacco_data,3)

## # A tsibble: 3 x 2 [1Q]
##   Quarter Tobacco
##     <qtr>   <dbl>
## 1 1956 Q1    5225
## 2 1956 Q2    5178
## 3 1956 Q3    5297

Economy class passengers data

economy_data <- ansett %>%
filter(Class == "Economy", Airports == "MEL-SYD") %>%
select(Week, Passengers)
head(economy_data,3)

## # A tsibble: 3 x 2 [1W]
##       Week Passengers
##     <week>      <dbl>
## 1 1987 W26      20167
## 2 1987 W27      20161
## 3 1987 W28      19993

Pedestrian data

pedestrian_data <- pedestrian %>%
filter(Sensor == "Southern Cross Station") %>%
select(Date, Count)
head(pedestrian_data,3)

## # A tsibble: 3 x 3 [1h] <Australia/Melbourne>
##   Date       Count Date_Time          
##   <date>     <int> <dttm>             
## 1 2015-01-01   746 2015-01-01 00:00:00
## 2 2015-01-01   312 2015-01-01 01:00:00
## 3 2015-01-01   180 2015-01-01 02:00:00

Calculate optimal lambda values

lambda_tobacco <- aus_production %>%
select(Tobacco) %>%
features(Tobacco, features = guerrero) %>%
pull(lambda_guerrero)

lambda_economy <- ansett %>%
filter(Class == "Economy", Airports == "MEL-SYD") %>%
select(Passengers) %>%
features(Passengers, features = guerrero) %>%
pull(lambda_guerrero)

lambda_pedestrian <- pedestrian %>%
filter(Sensor == "Southern Cross Station") %>%
select(Count) %>%
features(Count, features = guerrero) %>%
pull(lambda_guerrero)

#lambda_tobacco

lambda_tobacco

## [1] 0.9264636

create_comparison_chart <- function(data, original_col, time_col, lambda, title) {

  # Original series

original_chart <- ggplot(data, aes_string(x = time_col, y = original_col)) +
geom_line(color = 'blue') +
labs(title = paste("Original", title),
x = time_col,
 y = original_col) +
theme_minimal()
# Transformed series
 transformed_chart <- ggplot(data, aes_string(x = time_col, y = paste("box_cox(", original_col, ", ", lambda, ")", sep = ""))) +
 geom_line(color = 'red') +
labs(title = paste("Box-Cox Transformed", title),
x = time_col,
 y = paste("Transformed", original_col)) +
theme_minimal()
return(list(original = original_chart, transformed = transformed_chart))

}

# Tobacco charts
tobacco_charts <- create_comparison_chart(tobacco_data, "Tobacco", "Quarter", 
                                        lambda_tobacco, "Tobacco Production")
tobacco_charts

## $original

## 
## $transformed

A lambda value λ = 0.9264636 which is close to 1 suggests that the original data is already nearly normally distributed, and the Box-Cox transformation may not significantly alter the data’s shape. When lambda equals one, the data shifts down, but the shape does not change. If the optimal value for lambda is one, the data is already normally distributed, making the Box-Cox transformation unnecessary.

#lambda_economy

lambda_economy

## [1] 1.999927

# Economy passenger charts
economy_charts <- create_comparison_chart(economy_data, "Passengers", "Week",
                                        lambda_economy, "Economy Passengers")
economy_charts

## $original

## 
## $transformed

A lambda value λ = 1.999927 which is greater than 1 suggests that the original data is left-skewed, meaning more observations are around higher values. In this case, the Box-Cox transformation applies a power transformation with lambda > 1 to achieve a more normal distribution.

#lambda_pedestrian

lambda_pedestrian

## [1] -0.2501616

# Pedestrian charts
pedestrian_charts <- create_comparison_chart(pedestrian_data, "Count", "Date",
                                           lambda_pedestrian, "Pedestrian Count")
pedestrian_charts

## $original

## 
## $transformed

A negative lambda value λ = -0.2501616 indicates that the data is right-skewed and that the Box-Cox transformation involves an inverse or reciprocal transformation.

Exercise 3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

gas_ts <- ts(gas, frequency = 4, start = c(2017, 1)) |> as_tsibble()
head(gas_ts)

## # A tsibble: 6 x 3 [1Q]
## # Key:       key [1]
##     index key   value
##     <qtr> <chr> <dbl>
## 1 2017 Q1 Gas     221
## 2 2017 Q2 Gas     180
## 3 2017 Q3 Gas     171
## 4 2017 Q4 Gas     224
## 5 2018 Q1 Gas     233
## 6 2018 Q2 Gas     192

gas_ts %>%
  autoplot() + ggtitle("Quarterly Gas production")+

  labs(title = "Gas Production Time Series", x = "Quarter", y = "Gas Production")

## Plot variable not specified, automatically selected `.vars = value`

The graph presented above illustrates distinct seasonal fluctuations alongside a production trend in the data pertaining to Australian gas production over the past five years.

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

decomp <- gas_ts  %>%  
  model(classical_decomposition(value, type = "multiplicative"))  %>%  
  components()
decomp %>%  
  autoplot() + labs(title = "Classical multiplicative decomposition of Gas Production in Australia")

Do the results support the graphical interpretation from part a?

The components validate the conclusions drawn in Part A concerning the upward trajectory and the seasonal variations observed. Furthermore, the data does not exhibit significant randomness.

Compute and plot the seasonally adjusted data.

adjusted_gas <- gas_ts |>
  mutate(
    adj_gas = value / decomp$seasonal
  )



ggplot(adjusted_gas, aes(x = index, y = adj_gas)) +
  geom_line() +
  labs(title = "Seasonally Adjusted Gas Production",
       x = "Quarter",
       y = "Gas Production (Seasonally Adjusted)") +
  theme(plot.title = element_text(hjust = 0.5)) +  
  theme(figure.figsize = c(5, 5))

e.Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas_ts_outlier <- gas_ts
gas_ts_outlier$value[10] <- gas_ts_outlier$value[10] + 300

decomp_outlier <- gas_ts_outlier |>
  model(classical_decomposition(value, type = "multiplicative")) |>
  components()

adjusted_gas_outlier <- gas_ts_outlier |>
  mutate(
    adj_gas = value / decomp_outlier$seasonal
  )

ggplot(adjusted_gas_outlier, aes(x = index, y = adj_gas)) +
  geom_line() +
  labs(title = "Seasonally Adjusted Gas Production with Outlier",
       x = "Quarter",
       y = "Gas Production (Seasonally Adjusted)") +
  theme(plot.title = element_text(hjust = 0.5),  # Center the title
        figure.figsize = c(7, 5)) # Adjust figure size (width, height in inches)

The presence of the outlier significantly alters the seasonally adjusted data. The advantages of utilizing seasonally adjusted figures are negated, as the underlying trend becomes indiscernible. This outlier, positioned centrally within the dataset, manifests prominently at the onset of the seasonally adjusted data visualization.

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas_ts_outlier_end <- gas_ts
gas_ts_outlier_end$value[20] <- gas_ts_outlier_end$value[20] + 300

decomp_outlier_end <- gas_ts_outlier_end |>
  model(classical_decomposition(value, type = "multiplicative")) |>
  components()

adjusted_gas_outlier_end <- gas_ts_outlier_end |>
  mutate(
    adj_gas = value / decomp_outlier_end$seasonal
  )

ggplot(adjusted_gas_outlier_end, aes(x = index, y = adj_gas)) +
  geom_line() +
  labs(title = "Seasonally Adjusted Gas Production with Outlier at End",
       x = "Quarter",
       y = "Gas Production (Seasonally Adjusted)") +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the title
  theme(aspect.ratio = 1) # Ensure the plot is square (height = width)

Comparing the results with the outlier in the middle, we can determine if the outlier’s position affects the seasonally adjusted data differently. Classical decomposition methods are sensitive to outliers, and their impact can vary depending on the outlier’s location within the time series.

The effect of the outlier is that it distorts the seasonal indices, which in turn affects the seasonally adjusted data. The impact of the outlier depends on its position within the time series.

Exercise 3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

x11 <- myseries %>%
        model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()

autoplot(x11)

Exercise 3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

The figures presented illustrate the decomposition of the Australian civilian labor force from February 1978 to August 1995, employing the STL method. This decomposition categorizes the data into three distinct components: trend, seasonal, and remainder.

In Figure 3.19, the original data (value) is depicted alongside the long-term trend, seasonal variations (season_year), and the residual fluctuations (remainder). The trend component indicates a consistent upward trajectory in the labor force over the observed period. The seasonal component highlights predictable annual patterns, which likely correspond to variations in seasonal employment. The remainder component captures short-term, irregular fluctuations.

Figure 3.20 emphasizes the seasonal component, providing a clearer representation of the cyclical patterns. The interpretation of these graphs is significantly influenced by the differing scales on the y-axis between the original data and the decomposed components.

The recession of 1991/1992 is not distinctly observable within the estimated components. Although the remainder component exhibits some negative fluctuations during this timeframe, these variations are not markedly larger than other fluctuations throughout the series. The overarching trend continues to rise, thereby obscuring the recession’s impact on the decomposed components.

Data 624 Homework_2

Uliana Plotnikova

2025-02-15