3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9
Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
This data set requires some pre-processing before we can do anything meaningful with it. First we are going to cast the Country names as characters instead of factors. We will also remove any aggregated countries/regions.
#country_names <- (as.character(unique(global_economy$Country)))
global_economy_hw <- global_economy %>% mutate(Country = as.character(Country))
global_economy_hw <- global_economy_hw |> filter(!Country %in% c("Europe & Central Asia (excluding high income)","Europe & Central Asia (IDA & IBRD countries)", "Europe & Central Asia", "Euro area", "East Asia & Pacific (excluding high income)", "East Asia & Pacific (IDA & IBRD countries)", "Early-demographic dividend", "Central Europe and the Baltics","Caribbean small states", "Arab World", "Upper middle income", "World", "Fragile and conflict affected situations", "Heavily indebted poor countries (HIPC)", "Late-demographic dividend", "Latin America & Caribbean", "Latin America & Caribbean (excluding high income)", "Latin America & the Caribbean (IDA & IBRD countries)", "Least developed countries: UN classification", "Low & middle income", "Low income", "Lower middle income", "Middle East & North Africa", "Middle East & North Africa (excluding high income)", "Middle East & North Africa (IDA & IBRD countries)", "Middle income", "North America", "OECD members", "Other small states", "Pacific island small states", "Post-demographic dividend", "Pre-demographic dividend", "Small states", "South Asia", "South Asia (IDA & IBRD)", "Sub-Saharan Africa", "Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)", "IDA & IBRD total", "IDA total", "High income", "IBRD only","IDA blend","IDA only")) |> select(Year, Country, GDP, Population)
Next we are going to shorten some country names to give us more printable versions. And then we can cast our data frame to a tsibble. Latly we are going to create an array of country names so that we can plot subsets of the countries.
global_economy_hw$Country[global_economy_hw$Country == 'St. Vincent and the Grenadines'] <- "St. Vincent"
global_economy_hw$Country[global_economy_hw$Country == 'Central African Republic'] <- "CAR"
global_economy_hw$Country[global_economy_hw$Country == "Bosnia and Herzegovina"] <- "Bosnia"
global_economy_hw$Country[global_economy_hw$Country == "British Virgin Islands"] <- "British VI"
global_economy_hw$Country[global_economy_hw$Country == "Hong Kong SAR, China"] <- "Hong Kong"
global_economy_hw$Country[global_economy_hw$Country == "Sint Maarten (Dutch part)"] <- "Sint Maarten"
global_economy_hw$Country[global_economy_hw$Country == "St. Martin (French part)" ] <- "St. Martind"
global_economy_hw$Country[global_economy_hw$Country == "Turks and Caicos Islands"] <- "Turks and Caicos"
global_economy_hw <- global_economy_hw |> as_tsibble(index = Year, key = Country)
#global_economy_hw$Country[Country=='St. Vincent and the Grenadines'] <- "St. Vincent"
country_names <- (as.character(unique(global_economy_hw$Country)))
#global_economy_hw |>
# autoplot(GDP/Population) +
# labs(title= "GDP per capita", y = "$US")
I had to split the countries up into 50 country blocks to get them to plot. Otherwise my key was too large to generate a plot. Also the plot would be too busy to interpret. I’m color blind and find these plots hard to read. I’m going extract to th 5 top performing countries and plot their GDP per capita at the end.
#country_names <- as.list(unique(global_economy_hw$Country))
global_economy_hw |> filter(Country %in% c(country_names[1:50])) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 595 rows containing missing values or values outside the scale range
## (`geom_line()`).
temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[1:50]))
temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)
## # A tsibble: 1 x 2 [1Y]
## # Key: Country [1]
## Country Year
## <chr> <dbl>
## 1 Bermuda 2008
global_economy_hw |> filter(Country %in% c(country_names[51:100])) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 591 rows containing missing values or values outside the scale range
## (`geom_line()`).
temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[51:100]))
temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)
## # A tsibble: 1 x 2 [1Y]
## # Key: Country [1]
## Country Year
## <chr> <dbl>
## 1 Isle of Man 2014
global_economy_hw |> filter(Country %in% c(country_names[101:150])) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 828 rows containing missing values or values outside the scale range
## (`geom_line()`).
temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[101:150]))
temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)
## # A tsibble: 1 x 2 [1Y]
## # Key: Country [1]
## Country Year
## <chr> <dbl>
## 1 Monaco 2014
global_economy_hw |> filter(Country %in% c(country_names[151:200])) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 662 rows containing missing values or values outside the scale range
## (`geom_line()`).
temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[150:200]))
temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)
## # A tsibble: 1 x 2 [1Y]
## # Key: Country [1]
## Country Year
## <chr> <dbl>
## 1 Norway 2013
global_economy_hw |> filter(Country %in% c(country_names[201:219])) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 270 rows containing missing values or values outside the scale range
## (`geom_line()`).
temp <- global_economy_hw |> filter(!is.na(GDP) & !is.na(Population)) |> filter(Country %in% c(country_names[201:219]))
temp |> filter(GDP/Population == max(GDP/Population, na.rm = TRUE)) |> select(Country)
## # A tsibble: 1 x 2 [1Y]
## # Key: Country [1]
## Country Year
## <chr> <dbl>
## 1 United States 2017
Now we have the top 5 countries. Let’s plot them and then do our analysis. The top GDP per captia country is Monoco. It looks like a lot of countries in the above plots experienced a drop in GDP around 1985 and 2000 and the experienced a recovery. Monaco shows the same trend just at a overall higher level than the other countries of the world. Monaco is a bit of an outlier in some respects, it has a very small population with most wealth concentrated within the royal family.
global_economy_hw |> filter(Country %in% c('Bermuda', 'Isle of Man', 'Norway', 'Monaco', 'United States')) |>
autoplot(GDP/Population) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 51 rows containing missing values or values outside the scale range
## (`geom_line()`).
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
I think that a log transformation might help this plot
global_economy_hw |> filter(Country %in% c('United States')) |>
autoplot(GDP)
The log transformation transforms the exponential function into a linear function. It preserves the relationship in the original data that shows that the GDP doubled between ~1985 and 2000.
global_economy_hw |> filter(Country %in% c('United States')) |> mutate(log_GDP = log(GDP)) |>
autoplot(log_GDP)
aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Count`
There is a lot of differences in the variance in this data, I wanted to use a Box-Cox transformation but really struggled to get the ‘box_cox’ function to work on grouped data (there are 8 states and each has a separate lambda). I ended up applying a log transformation which is a Box-Cox of lambda = 0.
This did reduce the heteroskedasticity in the individual time series plots.
#lambda <- aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |> features(Count, features = guerrero) |>
# pull(lambda_guerrero)
aus_livestock |> filter(Animal == 'Bulls, bullocks and steers') |>
autoplot(log(Count))
vic_elec |> autoplot()
## Plot variable not specified, automatically selected `.vars = Demand`
I used the Box-Cox transformation and ended up with a lambda of 0.1, there still exists a lot of variation in this data.
lambda <- vic_elec |> features(Demand, features = guerrero) |>
pull(lambda_guerrero)
vic_elec |> autoplot(box_cox(Demand, lambda))
I’m going to apply a Box-Cox to see if I can reduce the variance across the series
aus_production |> autoplot(Gas)
The variance is not perfect but better
lambda <- aus_production |> features(Gas, features = guerrero) |>
pull(lambda_guerrero)
aus_production |> autoplot(box_cox(Gas, lambda))
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
canadian_gas |> autoplot()
## Plot variable not specified, automatically selected `.vars = Volume`
Interestingly, the Box-Cox does nothing to improve the variation in the time series.
lambda <- canadian_gas |> features(Volume, features = guerrero) |>
pull(lambda_guerrero)
canadian_gas |> autoplot(box_cox(Volume, lambda))
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
set.seed(555)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |> autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`
I chose a Box-Cox with a lambda of 0.08. I’m starting to think that I should choose 0.10 as my default lambda…
lambda <- myseries |> features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
myseries |> autoplot(box_cox(Turnover, lambda))
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.
aus_production |> autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
The Box-Cox does very little for this series, the lambda is 0.92 and the series looks unchanged.
lambda <- aus_production |> features(Tobacco, features = guerrero) |>
pull(lambda_guerrero)
aus_production |> autoplot(box_cox(Tobacco, lambda))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Passengers`
This is a Box-Cox transformation with a lambda of 2
lambda <- ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> features(Passengers, features = guerrero) |>
pull(lambda_guerrero)
ansett |> filter(Class == 'Economy' & Airports == 'MEL-SYD') |> autoplot(box_cox(Passengers, lambda))
pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(Count)
lambda <- pedestrian |> filter(Sensor == 'Southern Cross Station') |> features(Count, features = guerrero) |>
pull(lambda_guerrero)
pedestrian |> filter(Sensor == 'Southern Cross Station') |> autoplot(box_cox(Count, lambda))
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) |> select(Gas)
First we plot the series data. You can see that there is both a quarterly seasonality with a peak in q3 and positive trend.
gas <- tail(aus_production, 5*4) |> select(Gas)
gas |> autoplot()
## Plot variable not specified, automatically selected `.vars = Gas`
Second we use classical decomposition and results support the analysis from part a.
gas |>
model(
classical_decomposition(Gas, type = "multiplicative")
) |>
components() |>
autoplot() +
labs(title = "Classical additive decomposition of gas production from aus_production")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
Plot of seasonaly adjusted data
dcmp <- gas |>
model(
classical_decomposition(Gas, type = "multiplicative")
)
components(dcmp) |>
as_tsibble() |>
autoplot(Gas, colour="gray") +
geom_line(aes(y=trend), colour = "#D55E00") +
labs(
y = "Gas in petajoules",
title = "AUS gas production"
)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
Make one value and outlier and then recalculate the seasonality adjusted data and plot. We see that classical decomposition is very sensitive to outliers.
bad_gas <- gas
bad_gas$Gas[10] <- bad_gas$Gas[10] + 200
dcmp <- bad_gas |>
model(
classical_decomposition(Gas, type = "multiplicative")
)
components(dcmp) |>
as_tsibble() |>
autoplot(Gas, colour="gray") +
geom_line(aes(y=trend), colour = "#D55E00") +
labs(
y = "Gas in petajoules",
title = "AUS gas production"
)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
If the outlier had been near the end of the series it has less impact on the seasonal adjustment because it is used in fewer adjustments/
bad_gas <- gas
bad_gas$Gas[20] <- bad_gas$Gas[20] + 200
dcmp <- bad_gas |>
model(
classical_decomposition(Gas, type = "multiplicative")
)
components(dcmp) |>
as_tsibble() |>
autoplot(Gas, colour="gray") +
geom_line(aes(y=trend), colour = "#D55E00") +
labs(
y = "Gas in petajoules",
title = "AUS gas production"
)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
set.seed(555)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |> autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`
I’m getting an error trying to use the ‘x13binary’ package. I’ve some debugging can’t currently resolve if for this homework problem.
#x11_dcmp <- myseries |>
# model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
# components()
#autoplot(x11_dcmp) +
# labs(title =
# "Decomposition of retail turnover using X-11.")
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
There is a positive trend cycle across the entire series. There is a seasonal component that is consistent across the series. The seasonal component is small compared to the trend and even the remainder. March, September and December are the highest months for the civilian labour force across the series. December makes sense for holiday seasonal labour and September makes sense (at least in the US) that people are able to return to the workforce when their kids start school. The size of the remainder is surprising.
The recession of 1991/1992 is all in the remainder. Which as an anomoly in the trend is what you would want.