DATA 624 - Homework 2

Problem 3.1
Problem 3.2
Problem 3.3
Problem 3.4
Problem 3.5
Problem 3.7
Problem 3.8
Problem 3.9

Problem 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time.

# Identify the top 10 countries based on mean per cap GDP over time
top_10_countries <- as_tibble(global_economy) %>%
  mutate(gdp_per_cap = GDP/Population) %>%
  group_by(Country) %>%
  summarize(mean_per_cap_gdp = mean(gdp_per_cap)) %>%
  arrange(desc(mean_per_cap_gdp)) %>%
  head(10) %>% 
  select(Country)

# Plot the top 10 countries per cap GDP over time
as_tibble(global_economy) %>%
  filter(Country %in% top_10_countries[[1]]) %>%
  mutate(gdp_per_cap = GDP/Population) %>%
  group_by(Country) %>%
  mutate(mean_per_cap_gdp = mean(gdp_per_cap)) %>%
  ungroup() %>%
  ggplot(aes(x=Year, y=gdp_per_cap, color=Country)) +
  geom_line() +
  labs(title = "GDP Per Capita for Top-10 Countries",
      x= "Year",
      y = "GDP Per Capita")

Which country has the highest GDP per capita? How has this changed over time?

The Netherlands has the highest GDP per capita over this period. While it wasn’t the highest from the outset, it saw an increase in the 1980s and ultimately overtook the other countries around 1990 and has maintained itself as the highest ever since.

Problem 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy

For this data, I applied an adjusted to the GDP based on inflation. Doing so doesn’t appear to have made a big difference or provide any significant information gain. However, the rate of the growth and the trend in the data is much more stable over the time series compared to a more steep increase exhibited by the pre-transformed data.

global_economy %>%
  filter(Country == 'United States') %>%
  mutate(adjusted_gdp = GDP/CPI * 100) %>%
  ggplot() +
  geom_line(aes(x=Year, y=GDP), color="red") +
  geom_line(aes(x=Year, y=adjusted_gdp), color="blue")

b. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

For this data series, I did both a calendar transformation, byt changing the data from monthly to quarterly. Additionally, in doing this, I adjusted the data to show the mean count per quarter. Finally, I viewed the graph after applying a power transformation - using an nth root of 3. Ultimately, I didn’t really see any benefit to applying these various transormations.

aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>% 
  ggplot(aes(x=Month, y=Count)) +
  geom_line()

aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>% 
  index_by(Quarter = yearquarter(Month)) %>%
  group_by(Quarter) %>%
  summarize(mean_count = mean(Count)) %>%
  ggplot(aes(x=Quarter, y=mean_count)) +
  geom_line()

aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>% 
  index_by(Quarter = yearquarter(Month)) %>%
  group_by(Quarter) %>%
  summarize(mean_count = mean(Count)) %>%
  mutate(adj_count = nthroot(mean_count,3)) %>%
  autoplot(adj_count)

aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>% 
  index_by(Quarter = yearquarter(Month)) %>%
  group_by(Quarter) %>%
  summarize(mean_count = mean(Count)) %>%
  mutate(adj_count = log(mean_count)) %>%
  autoplot(adj_count)

Victorian Electricity Demand from vic_elec

For this series, I tried several transformations. The main goal was to minimize the variance that exists in the series. Across each of the transformations, I decided that a log transformation was the most effective transformation for reducing the variance. Using this transformation the range for the data was between 8.2 and 8.7

weekly_vic_elec <- as_tibble(vic_elec) %>%
  mutate(Week = yearweek(Date)) %>%
  group_by(Week) %>%
  summarize(avg_demand = mean(Demand)) %>%
  ungroup() %>%
  as_tsibble()

## Using `Week` as index variable.

lambda <- weekly_vic_elec %>%
  features(avg_demand, feature = guerrero) %>%
  pull(lambda_guerrero)
  

weekly_vic_elec %>%
  mutate(adj_demand = log(avg_demand)) %>%
  autoplot(adj_demand)

weekly_vic_elec %>%
  autoplot(avg_demand)

weekly_vic_elec %>%
  autoplot(box_cox(avg_demand,lambda))

vic_elec %>%
  group_by(Date) %>%
  summarize(avg_demand = mean(Demand))

## # A tsibble: 52,608 x 3 [30m] <Australia/Melbourne>
## # Key:       Date [1,096]
##    Date       Time                avg_demand
##    <date>     <dttm>                   <dbl>
##  1 2012-01-01 2012-01-01 00:00:00      4383.
##  2 2012-01-01 2012-01-01 00:30:00      4263.
##  3 2012-01-01 2012-01-01 01:00:00      4049.
##  4 2012-01-01 2012-01-01 01:30:00      3878.
##  5 2012-01-01 2012-01-01 02:00:00      4036.
##  6 2012-01-01 2012-01-01 02:30:00      3866.
##  7 2012-01-01 2012-01-01 03:00:00      3694.
##  8 2012-01-01 2012-01-01 03:30:00      3562.
##  9 2012-01-01 2012-01-01 04:00:00      3433.
## 10 2012-01-01 2012-01-01 04:30:00      3359.
## # ℹ 52,598 more rows

vic_elec %>% 
  autoplot(Demand)

Gas production from aus_production

Finally, for this series, the box_cox transformation proved to be highly effective in reducing the variance exhibited in the data.

lambda <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)

aus_production %>%
  autoplot(Gas)

aus_production %>%
  autoplot(box_cox(Gas,lambda))

Problem 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

The Box-Cox transformation is unhelpful with this data because the variance in the data is not changing in a consistent way. The variance that is exhibited in the middle of the series is much ore extreme then the variance that is exhibited at the beginning and end of the series.

canadian_gas %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Volume`

lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

canadian_gas %>%
  autoplot(box_cox(Volume, lambda))

Problem 3.4

What Box-Cox transformation would you select for your retail data

I chose a box-cox transformation with a lambda of.23

set.seed(9716)

myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Turnover`

lambda <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)

myseries %>%
  autoplot(box_cox(Turnover, lambda))

Problem 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

Tobacco from aus_production,

aus_production %>%
  autoplot(Tobacco)

## Warning: Removed 24 rows containing missing values (`geom_line()`).

lambda <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)


aus_production %>%
  autoplot(box_cox(Tobacco, lambda))

## Warning: Removed 24 rows containing missing values (`geom_line()`).

Economy class passengers between Melbourne and Sydney from ansett,

as_tibble(ansett)

## # A tibble: 7,407 × 4
##        Week Airports Class    Passengers
##      <week> <chr>    <chr>         <dbl>
##  1 1989 W28 ADL-PER  Business        193
##  2 1989 W29 ADL-PER  Business        254
##  3 1989 W30 ADL-PER  Business        185
##  4 1989 W31 ADL-PER  Business        254
##  5 1989 W32 ADL-PER  Business        191
##  6 1989 W33 ADL-PER  Business        136
##  7 1989 W34 ADL-PER  Business          0
##  8 1989 W35 ADL-PER  Business          0
##  9 1989 W36 ADL-PER  Business          0
## 10 1989 W37 ADL-PER  Business          0
## # ℹ 7,397 more rows

ansett_filtered <- ansett %>%
  filter(Airports == "MEL-SYD" & Class == 'Economy')

ansett_filtered %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Passengers`

lambda <- ansett_filtered %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)


ansett_filtered %>%
  autoplot(box_cox(Passengers, lambda))

Pedestrian counts at Southern Cross Station from pedestrian

as_tibble(pedestrian) %>% select(Sensor) %>% distinct()

## # A tibble: 4 × 1
##   Sensor                       
##   <chr>                        
## 1 Birrarung Marr               
## 2 Bourke Street Mall (North)   
## 3 QV Market-Elizabeth St (West)
## 4 Southern Cross Station

pedestrian

## # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
## # Key:       Sensor [4]
##    Sensor         Date_Time           Date        Time Count
##    <chr>          <dttm>              <date>     <int> <int>
##  1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
##  2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826
##  3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567
##  4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264
##  5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139
##  6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77
##  7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44
##  8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56
##  9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113
## 10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166
## # ℹ 66,027 more rows

pedestrian_filtered <- pedestrian %>% 
  filter(Sensor == 'Southern Cross Station') 

pedestrian_filtered %>%
  autoplot(Count)

pedestrian_adj <- as_tibble(pedestrian_filtered) %>%
  group_by(Date) %>%
  summarize(total_count = sum(Count)) %>%
  tsibble()

## Using `Date` as index variable.

pedestrian_weekly <- as_tibble(pedestrian_filtered) %>%
  mutate(Week = yearweek(Date)) %>%
  group_by(Date) %>%
  summarize(daily_count = sum(Count)) %>%
  mutate(Week = yearweek(Date)) %>%
  group_by(Week) %>%
  summarize(weekly_avg = mean(daily_count)) %>%
  as_tsibble()

## Using `Week` as index variable.

lambda <- pedestrian_weekly %>%
  features(weekly_avg, features = guerrero) %>%
  pull(lambda_guerrero)

lambda2 <- pedestrian_adj %>%
  features(total_count, features = guerrero) %>%
  pull(lambda_guerrero)

pedestrian_weekly %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = weekly_avg`

pedestrian_weekly %>%
  autoplot(box_cox(weekly_avg, lambda))

Problem 3.7

gas <- tail(aus_production, 5*4) |> select(Gas)

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

The overall trend appears to be a positive trend, while there is a seasonal fluctuation where the data peaks in Q3 and is at its lowest point in Q1. This seasonality appears to persist on an annual basis.

gas %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Gas`

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  autoplot()

## Warning: Removed 2 rows containing missing values (`geom_line()`).

Do the results support the graphical interpretation from part a?

Yes, the results from the decomposition support the interpretation from part a. In the base graphic, we find that the seasonality is such that Q1 is the trough of the seasonal data, while the apex occurs in Q4. Additionally, we can see the existence of an overall positive trend that we see exhibited in the trend portion of the graph.

gas %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas)

Compute and plot the seasonally adjusted data.

gas %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color="gray") +
  geom_line(aes(y=season_adjust), color="black")

Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

The outlier seems to throw the entire series off kilter. It makes it so the seasonal adjustments are no longer valid.

gas_outlier <- gas



set.seed(9162023)
sample_index = sample(1:length(gas[[1]]),1)

gas_outlier[[sample_index,1]] = gas_outlier[[sample_index,1]] + 300
gas_outlier

## # A tsibble: 20 x 2 [1Q]
##      Gas Quarter
##    <dbl>   <qtr>
##  1   221 2005 Q3
##  2   180 2005 Q4
##  3   171 2006 Q1
##  4   224 2006 Q2
##  5   233 2006 Q3
##  6   192 2006 Q4
##  7   187 2007 Q1
##  8   534 2007 Q2
##  9   245 2007 Q3
## 10   205 2007 Q4
## 11   194 2008 Q1
## 12   229 2008 Q2
## 13   249 2008 Q3
## 14   203 2008 Q4
## 15   196 2009 Q1
## 16   238 2009 Q2
## 17   252 2009 Q3
## 18   210 2009 Q4
## 19   205 2010 Q1
## 20   236 2010 Q2

gas_outlier %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color="gray") +
  geom_line(aes(y=season_adjust), color="black")

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

No, the placement of the outlier does not seem to have a change in the decomposition of the data series.

gas_outlier_begin <- gas
gas_outlier_end <- gas

gas_outlier_begin[[5,1]] = gas[[5,1]] + 300
gas_outlier_begin

## # A tsibble: 20 x 2 [1Q]
##      Gas Quarter
##    <dbl>   <qtr>
##  1   221 2005 Q3
##  2   180 2005 Q4
##  3   171 2006 Q1
##  4   224 2006 Q2
##  5   533 2006 Q3
##  6   192 2006 Q4
##  7   187 2007 Q1
##  8   234 2007 Q2
##  9   245 2007 Q3
## 10   205 2007 Q4
## 11   194 2008 Q1
## 12   229 2008 Q2
## 13   249 2008 Q3
## 14   203 2008 Q4
## 15   196 2009 Q1
## 16   238 2009 Q2
## 17   252 2009 Q3
## 18   210 2009 Q4
## 19   205 2010 Q1
## 20   236 2010 Q2

gas_outlier_end[[15,1]] = gas[[15,1]] + 300
gas_outlier_end

## # A tsibble: 20 x 2 [1Q]
##      Gas Quarter
##    <dbl>   <qtr>
##  1   221 2005 Q3
##  2   180 2005 Q4
##  3   171 2006 Q1
##  4   224 2006 Q2
##  5   233 2006 Q3
##  6   192 2006 Q4
##  7   187 2007 Q1
##  8   234 2007 Q2
##  9   245 2007 Q3
## 10   205 2007 Q4
## 11   194 2008 Q1
## 12   229 2008 Q2
## 13   249 2008 Q3
## 14   203 2008 Q4
## 15   496 2009 Q1
## 16   238 2009 Q2
## 17   252 2009 Q3
## 18   210 2009 Q4
## 19   205 2010 Q1
## 20   236 2010 Q2

gas_outlier_begin %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color="gray") +
  geom_line(aes(y=season_adjust), color="#0072B2")

gas_outlier_end %>%
  model(classical_decomposition(Gas, type="multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color="gray") +
  geom_line(aes(y=season_adjust), color="#0072B2")

Problem 3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

The one new observation that I notice, is that the seasonality in the data appears to have a downward trend that I did not notice previously. Additionally, there is a significant outlier in the data that occurs in the early part of 2000.

myseries

## # A tsibble: 441 x 5 [1M]
## # Key:       State, Industry [1]
##    State           Industry                        `Series ID`    Month Turnover
##    <chr>           <chr>                           <chr>          <mth>    <dbl>
##  1 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Apr     65.8
##  2 New South Wales Furniture, floor coverings, ho… A3349468W   1982 May     65.8
##  3 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Jun     62.3
##  4 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Jul     68.2
##  5 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Aug     66  
##  6 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Sep     62.3
##  7 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Oct     66.2
##  8 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Nov     68.9
##  9 New South Wales Furniture, floor coverings, ho… A3349468W   1982 Dec     90.8
## 10 New South Wales Furniture, floor coverings, ho… A3349468W   1983 Jan     58  
## # ℹ 431 more rows

myseries %>%
  model(x11=X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components() %>%
  autoplot()

Problem 3.9

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

There is a positive trend in the data that has been pretty much consistent through the entirety of the time series. Additionally, the seasonal fluctuations in the data appear to be consistent and stable throughout the series, until about 1990, when the amplitude of the variations in the seasonal data appears to be increasing. There is an outlier in the data that occurs around 1991, in which there was a significant dip in the data, followed by a significant spike. These fluctuations seemed to be attributed to black-swan events - as they were not present at any other time in the series.

Is the recession of 1991/1992 visible in the estimated components?

Yes, the recessions of 1991/1992 is visible in the components. As mentioned above, it appears to be captured in the remainder portion of the data.