DATA624

3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9

Exercise 1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

This data set requires some pre-processing before we can do anything meaningful with it. First we are going to cast the Country names as characters instead of factors. We will also remove any aggregated countries/regions.

#country_names <- (as.character(unique(global_economy$Country)))
global_economy_hw <- global_economy %>% mutate(Country = as.character(Country))

global_economy_hw <- global_economy_hw |> filter(!Country %in% c("Europe & Central Asia (excluding high income)","Europe & Central Asia (IDA & IBRD countries)", "Europe & Central Asia", "Euro area", "East Asia & Pacific (excluding high income)", "East Asia & Pacific (IDA & IBRD countries)", "Early-demographic dividend", "Central Europe and the Baltics","Caribbean small states", "Arab World", "Upper middle income", "World", "Fragile and conflict affected situations", "Heavily indebted poor countries (HIPC)", "Late-demographic dividend", "Latin America & Caribbean", "Latin America & Caribbean (excluding high income)", "Latin America & the Caribbean (IDA & IBRD countries)", "Least developed countries: UN classification", "Low & middle income", "Low income", "Lower middle income", "Middle East & North Africa", "Middle East & North Africa (excluding high income)", "Middle East & North Africa (IDA & IBRD countries)", "Middle income", "North America", "OECD members", "Other small states", "Pacific island small states", "Post-demographic dividend", "Pre-demographic dividend", "Small states", "South Asia", "South Asia (IDA & IBRD)", "Sub-Saharan Africa", "Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)", "IDA & IBRD total", "IDA total", "High income", "IBRD only","IDA blend","IDA only")) |> select(Year, Country, GDP, Population)

Next we are going to shorten some country names to give us more printable versions. And then we can cast our data frame to a tsibble. Latly we are going to create an array of country names so that we can plot subsets of the countries.

global_economy_hw$Country[global_economy_hw$Country == 'St. Vincent and the Grenadines'] <- "St. Vincent"
global_economy_hw$Country[global_economy_hw$Country == 'Central African Republic'] <- "CAR"
global_economy_hw$Country[global_economy_hw$Country == "Bosnia and Herzegovina"] <- "Bosnia"
global_economy_hw$Country[global_economy_hw$Country == "British Virgin Islands"] <- "British VI"
global_economy_hw$Country[global_economy_hw$Country == "Hong Kong SAR, China"] <- "Hong Kong"
global_economy_hw$Country[global_economy_hw$Country == "Sint Maarten (Dutch part)"] <- "Sint Maarten"
global_economy_hw$Country[global_economy_hw$Country == "St. Martin (French part)" ] <- "St. Martind"
global_economy_hw$Country[global_economy_hw$Country == "Turks and Caicos Islands"] <- "Turks and Caicos"
global_economy_hw <- global_economy_hw |> as_tsibble(index = Year, key = Country)

#global_economy_hw$Country[Country=='St. Vincent and the Grenadines'] <- "St. Vincent"

country_names <- (as.character(unique(global_economy_hw$Country)))

#global_economy_hw |>
#  autoplot(GDP/Population) +
#  labs(title= "GDP per capita", y = "$US")

I had to split the countries up into 50 country blocks to get them to plot. Otherwise my key was too large to generate a plot. Also the plot would be too busy to interpret. I’m color blind and find these plots hard to read. I’m going extract to th 5 top performing countries and plot their GDP per capita at the end.

#country_names <- as.list(unique(global_economy_hw$Country))
global_economy_hw |> filter(Country %in% c(country_names[1:50])) |>
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 595 rows containing missing values or values outside the scale range
## (`geom_line()`).

temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[1:50]))

temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)

## # A tsibble: 1 x 2 [1Y]
## # Key:       Country [1]
##   Country  Year
##   <chr>   <dbl>
## 1 Bermuda  2008

global_economy_hw |> filter(Country %in% c(country_names[51:100])) |>
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 591 rows containing missing values or values outside the scale range
## (`geom_line()`).

temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[51:100]))

temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)

## # A tsibble: 1 x 2 [1Y]
## # Key:       Country [1]
##   Country      Year
##   <chr>       <dbl>
## 1 Isle of Man  2014

global_economy_hw |> filter(Country %in% c(country_names[101:150])) |> 
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 828 rows containing missing values or values outside the scale range
## (`geom_line()`).

temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[101:150]))

temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)

## # A tsibble: 1 x 2 [1Y]
## # Key:       Country [1]
##   Country  Year
##   <chr>   <dbl>
## 1 Monaco   2014

global_economy_hw |> filter(Country %in% c(country_names[151:200])) |> 
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 662 rows containing missing values or values outside the scale range
## (`geom_line()`).

temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[150:200]))

temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)

## # A tsibble: 1 x 2 [1Y]
## # Key:       Country [1]
##   Country  Year
##   <chr>   <dbl>
## 1 Norway   2013

global_economy_hw |> filter(Country %in% c(country_names[201:219])) |> 
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 270 rows containing missing values or values outside the scale range
## (`geom_line()`).

temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[201:219]))

temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)

## # A tsibble: 1 x 2 [1Y]
## # Key:       Country [1]
##   Country        Year
##   <chr>         <dbl>
## 1 United States  2017

Now we have the top 5 countries. Let’s plot them and then do our analysis. The top GDP per captia country is Monoco. It looks like a lot of countries in the above plots experienced a drop in GDP around 1985 and 2000 and the experienced a recovery. Monaco shows the same trend just at a overall higher level than the other countries of the world. Monaco is a bit of an outlier in some respects, it has a very small population with most wealth concentrated within the royal family.

global_economy_hw |> filter(Country %in% c('Bermuda', 'Isle of Man', 'Norway', 'Monaco', 'United States')) |> 
 autoplot(GDP/Population) +
 labs(title= "GDP per capita", y = "$US")

## Warning: Removed 51 rows containing missing values or values outside the scale range
## (`geom_line()`).

Exercise 2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Victorian Electricity Demand from vic_elec.
Gas production from aus_production.

United States GDP in global_economy

I think that a log transformation might help this plot

global_economy_hw |> filter(Country %in% c('United States')) |>
 autoplot(GDP)

The log transformation transforms the exponential function into a linear function. It preserves the relationship in the original data that shows that the GDP doubled between ~1985 and 2000.

global_economy_hw |> filter(Country %in% c('United States')) |> mutate(log_GDP = log(GDP)) |>
 autoplot(log_GDP)

Count in aus_livestock

aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |> autoplot()

## Plot variable not specified, automatically selected `.vars = Count`

There is a lot of differences in the variance in this data, I wanted to use a Box-Cox transformation but really struggled to get the ‘box_cox’ function to work on grouped data (there are 8 states and each has a separate lambda). I ended up applying a log transformation which is a Box-Cox of lambda = 0.

This did reduce the heteroskedasticity in the individual time series plots.

#lambda <- aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |> features(Count, features = guerrero) |>
#  pull(lambda_guerrero)

aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |>
  autoplot(log(Count))

Demand in vic_elec

vic_elec |> autoplot()

## Plot variable not specified, automatically selected `.vars = Demand`

I used the Box-Cox transformation and ended up with a lambda of 0.1, there still exists a lot of variation in this data.

lambda <- vic_elec |> features(Demand, features = guerrero) |>
  pull(lambda_guerrero)

vic_elec |> autoplot(box_cox(Demand, lambda))

Gas production from aus_production

I’m going to apply a Box-Cox to see if I can reduce the variance across the series

aus_production |> autoplot(Gas)

The variance is not perfect but better

lambda <- aus_production |> features(Gas, features = guerrero) |>
  pull(lambda_guerrero)

aus_production |> autoplot(box_cox(Gas, lambda))

Exercise 3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

canadian_gas |> autoplot()

## Plot variable not specified, automatically selected `.vars = Volume`

Interestingly, the Box-Cox does nothing to improve the variation in the time series.

lambda <- canadian_gas |> features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

canadian_gas |> autoplot(box_cox(Volume, lambda))

Exercise 4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(555)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |> autoplot()

## Plot variable not specified, automatically selected `.vars = Turnover`

I chose a Box-Cox with a lambda of 0.08. I’m starting to think that I should choose 0.10 as my default lambda…

lambda <- myseries |> features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

myseries |> autoplot(box_cox(Turnover, lambda))

Exercise 5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Tobacco from aus_production

aus_production |> autoplot(Tobacco)

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

The Box-Cox does very little for this series, the lambda is 0.92 and the series looks unchanged.

lambda <- aus_production |> features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)

aus_production |> autoplot(box_cox(Tobacco, lambda))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

Economy class passengers between Melbourne and Sydney from ansett

ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> autoplot()

## Plot variable not specified, automatically selected `.vars = Passengers`

This is a Box-Cox transformation with a lambda of 2

lambda <- ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> features(Passengers, features = guerrero) |>
  pull(lambda_guerrero)

ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> autoplot(box_cox(Passengers, lambda))

Pedestrian counts at Southern Cross Station from pedestrian

pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(Count)

lambda <- pedestrian |> filter(Sensor == 'Southern Cross Station') |> features(Count, features = guerrero) |>
  pull(lambda_guerrero)

pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(box_cox(Count, lambda))

Exercise 7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
Do the results support the graphical interpretation from part a?
Compute and plot the seasonally adjusted data.
Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Does it make any difference if the outlier is near the end rather than in the middle of the time series?

First we plot the series data. You can see that there is both a quarterly seasonality with a peak in q3 and positive trend.

gas <- tail(aus_production, 5*4) |> select(Gas)
gas |> autoplot()

## Plot variable not specified, automatically selected `.vars = Gas`

Second we use classical decomposition and results support the analysis from part a.

gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) |>
  components() |>
  autoplot() +
  labs(title = "Classical additive decomposition of gas production from aus_production")

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Plot of seasonaly adjusted data

dcmp <- gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) 

components(dcmp) |>
  as_tsibble() |>
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=trend), colour = "#D55E00") +
  labs(
    y = "Gas in petajoules",
    title = "AUS gas production"
  )

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

Make one value and outlier and then recalculate the seasonality adjusted data and plot. We see that classical decomposition is very sensitive to outliers.

bad_gas <- gas
bad_gas$Gas[10] <- bad_gas$Gas[10] + 200

dcmp <- bad_gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) 

components(dcmp) |>
  as_tsibble() |>
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=trend), colour = "#D55E00") +
  labs(
    y = "Gas in petajoules",
    title = "AUS gas production"
  )

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

If the outlier had been near the end of the series it has less impact on the seasonal adjustment because it is used in fewer adjustments/

bad_gas <- gas
bad_gas$Gas[20] <- bad_gas$Gas[20] + 200

dcmp <- bad_gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) 

components(dcmp) |>
  as_tsibble() |>
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=trend), colour = "#D55E00") +
  labs(
    y = "Gas in petajoules",
    title = "AUS gas production"
  )

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

Exercise 8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(555)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |> autoplot()

## Plot variable not specified, automatically selected `.vars = Turnover`

I’m getting an error trying to use the ‘x13binary’ package. I’ve some debugging can’t currently resolve if for this homework problem.

#x11_dcmp <- myseries |>
#  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
#  components()

#autoplot(x11_dcmp) +
# labs(title =
#   "Decomposition of retail turnover using X-11.")

Exercise 9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

There is a positive trend cycle across the entire series. There is a seasonal component that is consistent across the series. The seasonal component is small compared to the trend and even the remainder. March, September and December are the highest months for the civilian labour force across the series. December makes sense for holiday seasonal labour and September makes sense (at least in the US) that people are able to return to the workforce when their kids start school. The size of the remainder is surprising.

Is the recession of 1991/1992 visible in the estimated components?

The recession of 1991/1992 is all in the remainder. Which as an anomoly in the trend is what you would want.

DATA624_HW2

William Aiken

9/13/2024