Question 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

global_economy <- global_economy
ggplot(global_economy)+
  geom_line(aes(x = Year,y = GDP,color = Country),show.legend = FALSE)+
  theme(legend.position = "none")

## Warning: Removed 3242 row(s) containing missing values (geom_path).

countries_only <- global_economy[complete.cases(global_economy),]
countries_only$Country[which(countries_only$GDP == max(countries_only$GDP))]

## [1] United States
## 263 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe

countries_only$GDP[which(countries_only$GDP == max(countries_only$GDP))]

## [1] 1.862448e+13

ggplot(countries_only)+
  geom_line(aes(x = Year,y = GDP,color = Country),show.legend = FALSE)+
  theme(legend.position = "none")

The USA has the highest GDP of any country, and its GDP has exceeded that of every other country in each year since 1960.

Per capita, the picture is a bit different.

countries_only$per_capita <- countries_only$GDP / countries_only$Population
ggplot(countries_only)+
  geom_line(aes(x = Year,y = per_capita,color = Country),show.legend = FALSE)+
  theme(legend.position = "none")

countries_only$Country[which(countries_only$per_capita == max(countries_only$per_capita))]

## [1] Luxembourg
## 263 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe

countries_only$per_capita[which(countries_only$per_capita == max(countries_only$per_capita))]

## [1] 119225.4

Among the rows that survive the complete.cases() filter, Luxembourg has the highest GDP per capita, and it has taken off in the last 25 years. In 1980, the GDP per capita in Luxembourg was indistinguishable from that of the other countries in the plot.
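The max() lookup above returns only the single largest value across all years. To see how the leader changes over time, one option is to pick the top country within each year. The sketch below uses base R on a small made-up data frame; the country names and numbers are hypothetical and are not taken from global_economy.

```r
# Toy data: hypothetical values, NOT taken from global_economy.
toy <- data.frame(
  Country = rep(c("A", "B", "C"), each = 3),
  Year = rep(c(1980, 2000, 2017), times = 3),
  GDP = c(100, 200, 300, 50, 400, 900, 80, 90, 100),
  Population = c(10, 10, 10, 5, 5, 5, 4, 4, 4)
)
toy$per_capita <- toy$GDP / toy$Population

# For each year, keep the row with the largest per-capita value.
leaders <- do.call(rbind, lapply(split(toy, toy$Year), function(d) {
  d[which.max(d$per_capita), c("Year", "Country", "per_capita")]
}))
rownames(leaders) <- NULL
print(leaders)
```

The same pattern should carry over to a data frame built from countries_only, provided it has Year, Country, and per_capita columns.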

Question 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

Victorian Electricity Demand from vic_elec.

Gas production from aus_production.

us <- subset(global_economy,Code == "USA")

autoplot(us)

## Plot variable not specified, automatically selected `.vars = GDP`

The livestock subset below keeps every state rather than only Victoria, so to make the states with lower counts visible I applied a log transform. The transform compresses the scale: the series appear much less variable and the counts appear closer to one another than they really are, which makes the smaller series easier to read.

livestock <- aus_livestock

bulls_bullocks <- subset(livestock,Animal == "Bulls, bullocks and steers")

autoplot(bulls_bullocks)

## Plot variable not specified, automatically selected `.vars = Count`

bulls_bullocks$Count <- log(bulls_bullocks$Count)

autoplot(bulls_bullocks)

## Plot variable not specified, automatically selected `.vars = Count`
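The compression that the log transform applies to the counts can be illustrated with a minimal base R sketch. The two vectors are hypothetical stand-ins for a small and a large series and are not real aus_livestock values.

```r
# Hypothetical counts for a small and a large series (not real data).
small <- c(1000, 1200, 900)
large <- c(100000, 120000, 90000)

# Raw scale: the spread of the large series dwarfs that of the small one.
range_raw <- c(small = diff(range(small)), large = diff(range(large)))

# Log scale: both series vary by the same ratio, so their spreads match.
range_log <- c(small = diff(range(log(small))), large = diff(range(log(large))))

print(range_raw)
print(range_log)
```

On the log scale, equal percentage changes take up equal vertical space, which is why the smaller series become readable while absolute differences between series look smaller than they are.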

autoplot(vic_elec)

## Plot variable not specified, automatically selected `.vars = Demand`

autoplot(aus_production,Gas)

Question 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

can_gas <- canadian_gas

lambda <- can_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(can_gas$Volume,lambda)

autoplot(can_gas)

## Plot variable not specified, automatically selected `.vars = Volume`

autoplot(can_gas,transform)

The Box-Cox transformation hardly changes anything here because the variance of the series does not grow steadily with its level, which is the pattern that a log or power transformation corrects. Box-Cox transformations are most useful when the variance increases or decreases markedly as the level of the series changes.
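As a contrast, here is a minimal simulated case where a transformation does help. The series is synthetic (it is not canadian_gas), and spread_ratio is a hypothetical helper written for this sketch that compares the variability of period-to-period changes in the second half of a series against the first half.

```r
set.seed(1)
n <- 200
idx <- seq_len(n)

# Simulated series whose fluctuations scale with its level (multiplicative noise).
x <- exp(0.02 * idx + rnorm(n, sd = 0.1))

# Ratio of the spread of period-to-period changes: second half vs first half.
spread_ratio <- function(y) {
  d <- diff(y)
  half <- length(d) %/% 2
  sd(d[(half + 1):length(d)]) / sd(d[1:half])
}

ratio_raw <- spread_ratio(x)       # well above 1: variability grows with level
ratio_log <- spread_ratio(log(x))  # near 1: the log (lambda = 0) stabilises it

print(c(raw = ratio_raw, log = ratio_log))
```

When the variability does not track the level in this way, no single lambda can even it out.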

Question 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?

For the retail data, I would use the Guerrero method to choose lambda, which gives a value of about 0.17 for this series. The transformed data make the lower turnover values much easier to see and show more clearly where growth was fastest: the 1990s saw a much steeper increase in turnover than the early 2000s did.

set.seed(34)

aus_retail_series <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

#retail <- aus_retail

lambda <- aus_retail_series %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(aus_retail_series$Turnover,lambda)

autoplot(aus_retail_series)

## Plot variable not specified, automatically selected `.vars = Turnover`

autoplot(aus_retail_series,transform)

lambda

## [1] 0.1704623

Question 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

autoplot(aus_production,Tobacco)

## Warning: Removed 24 row(s) containing missing values (geom_path).

set.seed(34)

lambda <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(aus_production$Tobacco,lambda)

autoplot(aus_production,transform)

## Warning: Removed 24 row(s) containing missing values (geom_path).

lambda

## [1] 0.9264636

transform <- box_cox(aus_production$Tobacco,0.01)

autoplot(aus_production,transform)

## Warning: Removed 24 row(s) containing missing values (geom_path).

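With a lambda of about 0.93, which is close to 1, the Box-Cox transformation for Tobacco is nearly linear and reveals little about the data.

Forcing lambda down to 0.01 also leaves the shape of the series essentially unchanged, so beyond rescaling the data the transformation has a negligible effect whether lambda is high or low.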
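To see why a lambda near 1 changes so little, it helps to write out the standard Box-Cox formula for positive data. The function name box_cox_sketch and the input values are assumptions made for illustration; the values are not the Tobacco data.

```r
# Box-Cox transform for positive data: log(x) when lambda is 0,
# otherwise (x^lambda - 1) / lambda.
box_cox_sketch <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical positive values standing in for a series.
x <- seq(4000, 8000, by = 500)

near_one <- box_cox_sketch(x, 0.93)
identity_shift <- box_cox_sketch(x, 1)  # exactly x - 1

# Correlation with the raw values shows how close to linear the transform is.
print(cor(x, near_one))
```

With lambda equal to 1 the transform is only a shift by one unit, so a value such as 0.93 bends the series only slightly.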

mel_syd <- subset(ansett,Airports == "MEL-SYD")

mel_syd_eco <- subset(mel_syd, Class == "Economy")

autoplot(mel_syd_eco,Passengers)

set.seed(34)

lambda <- mel_syd_eco %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(mel_syd_eco$Passengers,lambda)

autoplot(mel_syd_eco,transform)

lambda

## [1] 1.999927

transform <- box_cox(mel_syd_eco$Passengers,0.25)

autoplot(mel_syd_eco,transform)

For the Melbourne-Sydney data, the Guerrero estimate (a lambda of about 2.00) is heavily distorted by the weeks in the late 1980s when passenger numbers dropped precipitously to zero.

We can take it a step further and replace those zero values with the median of the series.

mel_syd_eco$Passengers[mel_syd_eco$Passengers == 0] <- median(mel_syd_eco$Passengers)

lambda <- mel_syd_eco %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(mel_syd_eco$Passengers,lambda)

autoplot(mel_syd_eco,transform)

lambda

## [1] 0.1540033

Replacing the zero values does not accurately represent what happened in 1989, but it does make the rest of the data easier to visualize. The automatically chosen lambda fell from about 2.00 to 0.15 after the replacement.
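One variation on the replacement step, sketched here in base R on a made-up vector, is to take the median of the non-zero observations only, so that the zeros do not pull the fill value down. The numbers are hypothetical and are not ansett data.

```r
# Hypothetical weekly passenger counts with a run of zeros (not real data).
passengers <- c(30, 32, 0, 0, 31, 35, 0, 33)

is_zero <- passengers == 0

# Median including the zeros vs median of the non-zero weeks only.
median_all <- median(passengers)
median_nonzero <- median(passengers[!is_zero])

# Fill the zero weeks with the non-zero median.
filled <- passengers
filled[is_zero] <- median_nonzero

print(c(median_all = median_all, median_nonzero = median_nonzero))
print(filled)
```

Keeping the is_zero flag makes it possible to mark the imputed weeks later instead of treating them as observed values.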

southern_cross <- subset(pedestrian,Sensor == "Southern Cross Station")

autoplot(southern_cross,Count)

lambda <- southern_cross %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(southern_cross$Count,lambda)

autoplot(southern_cross,transform)

southern_cross_2016 <- subset(southern_cross, Date >= as.Date("2016-01-01"))

lambda <- southern_cross_2016 %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

transform <- box_cox(southern_cross_2016$Count,lambda)

autoplot(southern_cross_2016,transform)

lambda

## [1] -0.2564844

The hourly data are too granular for autoplot to show clearly. The “optimal” lambda from the Guerrero method for the 2016 subset is actually negative, about -0.26.
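One possible way to make hourly counts easier to read, offered only as a sketch, is to aggregate them to daily totals before plotting. The example uses base R on made-up hourly values rather than the pedestrian data.

```r
# Hypothetical hourly counts covering two full days (not real sensor data).
time <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = 48)
count <- 1:48

# Collapse the hourly observations to one total per calendar day.
day <- as.Date(time, tz = "UTC")
daily <- tapply(count, day, sum)

print(daily)
```

Daily totals hide the within-day pattern, so this trades detail for readability.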

Question 3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

c. Do the results support the graphical interpretation from part a?

d. Compute and plot the seasonally adjusted data.

e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas <- tail(aus_production, 5*4) |> select(Gas)

autoplot(gas)

## Plot variable not specified, automatically selected `.vars = Gas`

Q1 tends to be the quarter with the lowest gas production, while Q3 has the highest. This pattern holds in every year on the plot. Year over year, gas production in each quarter increased relative to the same quarter of the prior year.

B

gas %>%
  model(classical_decomposition(Gas,type="multiplicative")) %>%
  components() %>%
  autoplot()

## Warning: Removed 2 row(s) containing missing values (geom_path).

The decomposition makes it clear that there is in fact a seasonal pattern at play.
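The steps behind a classical multiplicative decomposition can be sketched by hand in base R. The series below is made up (a linear trend multiplied by a fixed quarterly pattern) and is not the Gas data. The steps mirror what base R's stats::decompose does; that classical_decomposition follows the same recipe is an assumption here.

```r
# Hypothetical quarterly series: linear trend times a fixed seasonal pattern.
x <- ts((100 + 2 * (1:20)) * rep(c(0.8, 1.0, 1.3, 0.9), 5),
        start = c(2006, 1), frequency = 4)

# Step 1: trend-cycle from a centred 2x4 moving average (NA at both ends).
trend <- stats::filter(x, c(0.5, 1, 1, 1, 0.5) / 4)

# Step 2: detrend by division, as the model is multiplicative.
ratio <- as.numeric(x / trend)

# Step 3: average the ratios by quarter, then rescale so the indices average 1.
raw_idx <- tapply(ratio, as.integer(cycle(x)), mean, na.rm = TRUE)
seasonal_idx <- as.numeric(raw_idx / mean(raw_idx))

# Step 4: seasonally adjusted series = data divided by the seasonal indices.
season_adjust <- x / rep(seasonal_idx, length.out = length(x))

print(round(seasonal_idx, 3))
```

Because the trend estimate is missing for the first and last two quarters, those observations do not contribute to the seasonal indices.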

C

The results support the graphical interpretation from part A. The random component panel is helpful for showing just how substantial the seasonal pattern actually is.

D

gas %>%
  model(classical_decomposition(Gas,type="multiplicative")) %>%
  components() %>%
  select(season_adjust) %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = season_adjust`

When adjusted for seasonality, we see a relatively steady increase in gas production.

E

gas$Gas[10] <- gas$Gas[10] + 300

gas %>%
  model(classical_decomposition(Gas,type="multiplicative")) %>%
  components() %>%
  select(season_adjust) %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = season_adjust`

When an outlier is introduced into the time series, the seasonally adjusted data become heavily distorted. This exercise makes it clear that outliers make seasonal adjustment difficult.

F

gas$Gas[10] <- gas$Gas[10] - 300
gas$Gas[19] <- gas$Gas[19] + 300

gas %>%
  model(classical_decomposition(Gas,type="multiplicative")) %>%
  components() %>%
  select(season_adjust) %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = season_adjust`

When the outlier is moved towards the end of the series, the slow growth in the observations before the outlier is preserved, but it is difficult to see in the plot because the outlier completely changes the scale.
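Because the outlier dominates the plot scale, a numeric comparison can help. The sketch below uses base R's stats::decompose on a made-up quarterly series (not the Gas data) and measures how far the seasonal indices move when 300 is added in the middle versus near the end. The helper names are hypothetical.

```r
# Hypothetical quarterly series: linear trend times a fixed seasonal pattern.
x <- ts((100 + 2 * (1:20)) * rep(c(0.8, 1.0, 1.3, 0.9), 5),
        start = c(2006, 1), frequency = 4)

# Seasonal indices from a classical multiplicative decomposition.
seasonal_indices <- function(y) decompose(y, type = "multiplicative")$figure

# Return a copy of the series with `size` added at position `pos`.
with_outlier <- function(y, pos, size = 300) {
  y[pos] <- y[pos] + size
  y
}

base_idx <- seasonal_indices(x)

# Largest shift in any seasonal index caused by each outlier position.
change_mid <- max(abs(seasonal_indices(with_outlier(x, 10)) - base_idx))
change_end <- max(abs(seasonal_indices(with_outlier(x, 19)) - base_idx))

print(c(middle = change_mid, end = change_end))
```

In this toy setup the outlier near the end moves the seasonal indices less. One reason is that the centred moving average has no trend estimate for the last two quarters, so the outlier's own ratio never enters the seasonal average, although it still affects the neighbouring trend values.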

Question 3.8

Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

From Chapter 2 Question 8:

set.seed(34)

aus_retail_series <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

autoplot(aus_retail_series)

## Plot variable not specified, automatically selected `.vars = Turnover`

gg_season(aus_retail_series)

## Plot variable not specified, automatically selected `y = Turnover`

gg_subseries(aus_retail_series)

## Plot variable not specified, automatically selected `y = Turnover`

gg_lag(aus_retail_series)

## Plot variable not specified, automatically selected `y = Turnover`

aus_retail_series %>%
  ACF(Turnover) %>%
  autoplot()

library(seasonal)

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

x11_retail <- aus_retail_series %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()
autoplot(x11_retail)

The X-11 decomposition shows the seasonality much more prominently than the original plots do. There is more variability in the middle of the “irregular” panel, and there is another spike just after 2010. Turnover shows a consistent increase with a seasonal pattern on top.

Question 3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?

Response

The decomposition reveals a marked increase in the labor force in December followed by a decrease in January. This is consistent across all of the years in the plot. It also reveals a large decline in the labor force in August: the labor force drops from July to August and rebounds consistently in September. The monthly decomposition is especially useful for visualizing the data because it highlights the labor force in each month for each year, which makes it easier to identify patterns within each month.

The recession of 1991/1992 is perhaps evidenced by the uncharacteristic decline in the labor force in the March panel specifically, where a steep increase was followed by a steep decline. The remainder panel of the STL decomposition makes the recession obvious. The recession is not obvious in the value, trend, or season_year panels.

DATA 624 Assignment 1

Shane Hylton

2023-02-14