library(tidyverse)
library(fpp3)
library(seasonal)
library(plotly)
head(global_economy,5)
glimpse(global_economy)
## Rows: 15,150
## Columns: 9
## Key: Country [263]
## $ Country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",…
## $ Code <fct> AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG, AFG,…
## $ Year <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969,…
## $ GDP <dbl> 537777811, 548888896, 546666678, 751111191, 800000044, 1006…
## $ Growth <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CPI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Imports <dbl> 7.024793, 8.097166, 9.349593, 16.863910, 18.055555, 21.4128…
## $ Exports <dbl> 4.132233, 4.453443, 4.878051, 9.171601, 8.888893, 11.258279…
## $ Population <dbl> 8996351, 9166764, 9345868, 9533954, 9731361, 9938414, 10152…
# Find the top 20 countries with the highest GDP
country_codes_df <- read_csv("https://raw.githubusercontent.com/D-hartog/DATA624/refs/heads/main/HW2/wikipedia-iso-country-codes.csv")
## Rows: 246 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): English short name lower case, Alpha-2 code, Alpha-3 code, Numeric ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
codes <- unique(country_codes_df$`Alpha-3 code`)
global_economy_filtered <- global_economy %>% filter(Code %in% codes)
global_economy_filtered <- global_economy_filtered %>% mutate(GDP_per_capita = GDP/Population)
top20 <- global_economy_filtered %>% filter(Year == 2017) %>% arrange(desc(GDP_per_capita)) %>% head(20)
unique(top20$Country)[1:20]
## [1] Luxembourg Macao SAR, China Switzerland
## [4] Norway Iceland Ireland
## [7] Qatar United States Singapore
## [10] Denmark Australia Sweden
## [13] San Marino Netherlands Austria
## [16] Hong Kong SAR, China Finland Canada
## [19] Germany Belgium
## 263 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
# Plot time series for the top 20 countries
global_economy_filtered %>% filter(Country %in% unique(top20$Country)[1:20]) %>% autoplot(GDP_per_capita)
Since there are over 200 countries, I thought that this would be
difficult to visualize. My thought was to plot the top 20 countries with
the highest GDP for the last year data was recorded (2017). The country
with the highest recorded GDP per capita in 2017 was Luxembourg.
Comparing this to the beginning of the recorded data in 1960, the US had
the highest GDP per capita. Over time the GDP per capita of non European
countries like Qatar, Special Administrative Regions of China (Hong Kong
and Macao), and Singapore have also surpassed many countries.
top20
global_economy_filtered %>% filter(Year == 1960) %>% arrange(desc(GDP_per_capita)) %>% head(20)
# Filter global economy data for the the United States
us_economy <- global_economy %>% filter(Country == "United States")
# Add a column with GDP per capita calculation
us_economy <- us_economy %>% mutate(GDP_per_capita = GDP/Population)
#Plot of the time series of US GDP
us_economy %>% autoplot(GDP) + labs(title = "US GDP", x = "Year")
The transformation that would be appropriate here is to visualize the
time series of the GDP per capita to adjust for the population increase
over time. This does not have a large effect on the shape of the trend
line but we can see that the trend does become slight more linear than
just plotting GDP.
# Plot of the time series of US GDP per capita
us_economy %>% autoplot(GDP_per_capita) + labs(title = "US GDP Per Capita", x = "Year")
aus_livestock %>% filter((Animal == "Bulls, bullocks and steers") & (State == "Victoria"))
# plot time series of Slaughter of Victorian “Bulls, bullocks and steers”
aus_livestock %>% filter((Animal == "Bulls, bullocks and steers") & (State == "Victoria")) %>% autoplot(Count) + labs(title = "Number of Slaughtered Bulls, Bullocks, Steers in Victoria, AU", x = "Month")
head(vic_elec, 5)
vic_elec %>% autoplot(Demand) + labs(title = "Vicotria Electricity Demand", y = "Demand (MWh)", x = "Time (30 minute intervals)")
When plotting the original data we can see that there is a consistent increase in variation as time goe on. In order to make the variation in the series more homogenous, an appropriate strategy might involve a log transformation.
aus_production %>% autoplot(Gas) + labs(title = "Australia Gas Production", y = "Gas (petajoules)")
aus_production %>% autoplot(log(Gas)) + labs(title = "Australia Gas Production: Log Transformed", y = "Gas (petajoules)")
Applying the log transformation, we can now see more consistent
variation between the first 15 years of the time series and the last 30
years.
head(canadian_gas)
# Canadian Gas Data
canadian_gas %>% autoplot() + labs(title = "Monthly Canadian Gas Production", y = "Billions (cubic meters)")
## Plot variable not specified, automatically selected `.vars = Volume`
A Box-Cox transformation would not be helpful for the Canadian Gas data as the variation in the data does not consistently only increase or only decrease throughout the series. We see that the variation increases until about 1993 and then begins to decrease.
lambda <- canadian_gas |>
features(Volume, features = guerrero) |>
pull(lambda_guerrero)
canadian_gas |>
autoplot(box_cox(Volume, lambda)) +
labs(y = "",
title = paste0(
"Transformed Monthly Canadian Gas Production with lambda = ", round(lambda, 4)))
Implementing the code to find the optimal lambda value and applying a
Box-Cox transformation to the Canadian Gas data we don’t see much change
in the heterogeneity of the variation across the series.
Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):
set.seed(123)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>% autoplot(Turnover) + labs(title = "Victoria Household and Goods Retail Turnover")
The Box-Cox transformaiton that I would choose for this series would be
a log transformation as the variation in the data consistently increases
as time goes on. In the graph below we can see how applying a log
transformation evens out the seasonal variation that me might see from
year to year.
myseries %>% autoplot(log(Turnover)) + labs(title = "Log Transformed:Victoria Household and Goods Retail Turnover")
Since Box-Cox transformations rely on the value of \(\lambda\) we can use the features function, setting features to guerrero to find the best lambda. A good value of \(\lambda\) is one which makes the size of the seasonal variation about the same across the whole series, in turn making forcasting easier as well.
tobacco <- aus_production %>% select(Tobacco)
tobacco %>% autoplot(Tobacco) + labs(title = "Australian Tobacco and Cigarette Production", x = "Tonnes")
lambda <- tobacco %>%
features(Tobacco, features = guerrero) |>
pull(lambda_guerrero)
tobacco |>
autoplot(box_cox(Tobacco, lambda)) +
labs(y = "",
title = paste0(
"Transformed Tobacco Production with lambda = ", round(lambda, 4)))
In this scenario the labda is close to 1 so we don’t see much change in
the shape of the time series but the values are shifted down.
economy_mel_syd <- ansett %>% filter((Class == "Economy") & (Airports == "MEL-SYD"))
economy_mel_syd %>% autoplot(Passengers) + labs(title = "Number of Passengers in Economy Class Between Melbourne and Sydney")
lambda <- economy_mel_syd %>%
features(Passengers, features = guerrero) |>
pull(lambda_guerrero)
economy_mel_syd %>%
autoplot(box_cox(Passengers, lambda)) +
labs(y = "",
title = paste0(
"Transformed Passengers with lambda = ", round(lambda, 4)))
The value of lambda beng close to 2 shifts the values up and we can see an attempt at creating more consistent variation across the series.
southern_cross <- pedestrian %>% filter(Sensor == "Southern Cross Station")
southern_cross %>% autoplot(Count) + labs(title = "Pedestrian Counts in Melbourne")
lambda <- southern_cross %>%
features(Count, features = guerrero) |>
pull(lambda_guerrero)
southern_cross %>%
autoplot(box_cox(Count, lambda)) +
labs(y = "",
title = paste0(
"Transformed Southern Cross Pedestrian Counts with lambda = ", round(lambda, 4)))
gas <- tail(aus_production, 5*4) %>% select(Gas)
gas
head(gas)
gas %>% autoplot(Gas) + labs(title = "Australian Gas Production",
y = "Gas (petrajoules)",
x = "Quarter")
Visualizing the time series, there is an upward trend in the amount that
is being produced as well as seasonal fluctuations. It appears that more
gas is produced during Q2 increasing to Q3, dropping back down in Q4.
This seasonal fluctuations are consistent across the time series.
classical_decomp <- gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components()
classical_decomp %>% autoplot()
### c. Do the results support the graphical interpretation from part
a?
Yes, the results of the multiplicative decomposition supports my interpretation above. We can see an increasing linear trend, and clear and consistent seasonality.
classical_decomp %>% autoplot(season_adjust)
set.seed(543)
row <- sample(x = nrow(gas), size = 1)
# Random row before adjustment
gas[row, ]
gas[row, "Gas"] <- gas[row, "Gas"] + 300
# Random row after adjustment
gas[row, ]
classical_decomp2 <- gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components()
classical_decomp2 %>% autoplot(season_adjust)
The first noticable change in the seasonal adjustment is that we see a
big spike at the specific year and quarter where 300 was added to that
value. The upward trend in the seasonal adjustment plot is also
eliminated from the graph. The outlier also effects the seasonal
pattern. Before the outlier Q1 saw relatively low production of gas
which was reversed when the outlier was added.
Add outlier to the end of the series
gas <- tail(aus_production, 5*4) %>% select(Gas)
# Add an outlier to the beginning of the series
gas[20, ]
gas[20, "Gas"] <- gas[20, "Gas"] + 300
# Random row after adjustment
gas[20, ]
# Plot the seasonsal adjusted data.
classical_decomp3 <- gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components()
classical_decomp3 %>% autoplot(season_adjust)
Adding the outlier to the end of the series does make a difference. We
do see a slight upward trendin the seasonally adjusted values as was the
case without the outlier. It is hard to see since the limits of the y -
axis have been increased to fit the value of the outlier. Another
difference between adding the outlier to the end of the series is that
the seasonal pattern is more or less preserved.
# View the first 6 rows of the tsibble that was created from exercsie 3.4 that also referenced exercise 7 from 2.10
head(myseries)
myseries %>% autoplot(Turnover) + labs(title = "Victoria Household Goods Turnover", x = "Millions (AUD)")
myseries%>% gg_season(Turnover) + labs(title = "Victoria Household Goods Turnover by Year", x = "Millions (AUD)")
myseries %>% gg_subseries(Turnover) + labs(title = "Victoria Household Goods Turnover: Monthly Subseries")
myseries %>% gg_lag(Turnover, lags = 1:12, geom = "point") + labs(title = "Lag plot")
myseries %>% ACF(Turnover, lag_max = 24) %>% autoplot() + labs(title = "Victoria Household Goods Turnover: Autocorrelation Plot")
head(myseries)
# Decompose series using x11
x11_dcmp <- myseries %>%
model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
components()
autoplot(x11_dcmp) +
labs(title =
"Decomposition of Household Goods Retail Turnover using X-11.")
Decomposing this series using x11 does reveal some interesting findings that are not easily seen using autoplot, gg_season, gg_subseries, lag plots or the autocorrelation plot. There seems to be an unusually low peak in the early 2000’s seen in the time series plot. This point also correlates with a spike in the remaining data. What is interesting is that mush of a diffence in the seasonal pattern around this time.
In the decomposition that is shown in figure 3.19 we see that the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995 is decomposed to visualize the trend, seasonal patterns and the remainder. There is a clear upward trend where the scale of the trend and time series values are similar as the trend line is a reflection of the global pattern over time. In the seasonal component we see regular peaks and valleys across years indicating strong seasonality. The scale of the seasonal component goes from approx -100 to 100 and using the sub series plot we can draw some conclusions. The months of February to May, September, December, and July (only in the second half of the series) we see that there are people being added to the civilian labor force. Since all other months are below 0, we can conclude that in these months the number of people is decreasing. Unlike the previous these values would need to be subtracted from the trend line to get the values in the actual data. Lastly, in the remainder component we see values that are between 100 and -100, indicating that a fair amount of the variability in the labor data can be explained by random noise.
The recession of 1991/1992 is visible in the estimated remainder components. Since there are large negative values in the remainder during these years, versus any irregularities in the trend or seasonal component we can interpret this noise as an event other than seasonal fluctuations or the over all trend. In this case the event would be a recession.