We are required to complete questions 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from chapter 3 of “Forecasting: Principles and Practice” Third Edition by Rob Hyndman and George Athanasopoulos.
3.1
We are tasked with plotting the GDP per capita for each country over time in the global_economy dataset
global_economy |>ggplot(aes(x= Year, y = GDP/Population)) +geom_line()
unique(global_economy$Country)
[1] Afghanistan
[2] Albania
[3] Algeria
[4] American Samoa
[5] Andorra
[6] Angola
[7] Antigua and Barbuda
[8] Arab World
[9] Argentina
[10] Armenia
[11] Aruba
[12] Australia
[13] Austria
[14] Azerbaijan
[15] Bahamas, The
[16] Bahrain
[17] Bangladesh
[18] Barbados
[19] Belarus
[20] Belgium
[21] Belize
[22] Benin
[23] Bermuda
[24] Bhutan
[25] Bolivia
[26] Bosnia and Herzegovina
[27] Botswana
[28] Brazil
[29] British Virgin Islands
[30] Brunei Darussalam
[31] Bulgaria
[32] Burkina Faso
[33] Burundi
[34] Cabo Verde
[35] Cambodia
[36] Cameroon
[37] Canada
[38] Caribbean small states
[39] Cayman Islands
[40] Central African Republic
[41] Central Europe and the Baltics
[42] Chad
[43] Channel Islands
[44] Chile
[45] China
[46] Colombia
[47] Comoros
[48] Congo, Dem. Rep.
[49] Congo, Rep.
[50] Costa Rica
[51] Cote d'Ivoire
[52] Croatia
[53] Cuba
[54] Curacao
[55] Cyprus
[56] Czech Republic
[57] Denmark
[58] Djibouti
[59] Dominica
[60] Dominican Republic
[61] Early-demographic dividend
[62] East Asia & Pacific
[63] East Asia & Pacific (excluding high income)
[64] East Asia & Pacific (IDA & IBRD countries)
[65] Ecuador
[66] Egypt, Arab Rep.
[67] El Salvador
[68] Equatorial Guinea
[69] Eritrea
[70] Estonia
[71] Eswatini
[72] Ethiopia
[73] Euro area
[74] Europe & Central Asia
[75] Europe & Central Asia (excluding high income)
[76] Europe & Central Asia (IDA & IBRD countries)
[77] European Union
[78] Faroe Islands
[79] Fiji
[80] Finland
[81] Fragile and conflict affected situations
[82] France
[83] French Polynesia
[84] Gabon
[85] Gambia, The
[86] Georgia
[87] Germany
[88] Ghana
[89] Gibraltar
[90] Greece
[91] Greenland
[92] Grenada
[93] Guam
[94] Guatemala
[95] Guinea
[96] Guinea-Bissau
[97] Guyana
[98] Haiti
[99] Heavily indebted poor countries (HIPC)
[100] High income
[101] Honduras
[102] Hong Kong SAR, China
[103] Hungary
[104] IBRD only
[105] Iceland
[106] IDA & IBRD total
[107] IDA blend
[108] IDA only
[109] IDA total
[110] India
[111] Indonesia
[112] Iran, Islamic Rep.
[113] Iraq
[114] Ireland
[115] Isle of Man
[116] Israel
[117] Italy
[118] Jamaica
[119] Japan
[120] Jordan
[121] Kazakhstan
[122] Kenya
[123] Kiribati
[124] Korea, Dem. People's Rep.
[125] Korea, Rep.
[126] Kosovo
[127] Kuwait
[128] Kyrgyz Republic
[129] Lao PDR
[130] Late-demographic dividend
[131] Latin America & Caribbean
[132] Latin America & Caribbean (excluding high income)
[133] Latin America & the Caribbean (IDA & IBRD countries)
[134] Latvia
[135] Least developed countries: UN classification
[136] Lebanon
[137] Lesotho
[138] Liberia
[139] Libya
[140] Liechtenstein
[141] Lithuania
[142] Low & middle income
[143] Low income
[144] Lower middle income
[145] Luxembourg
[146] Macao SAR, China
[147] Macedonia, FYR
[148] Madagascar
[149] Malawi
[150] Malaysia
[151] Maldives
[152] Mali
[153] Malta
[154] Marshall Islands
[155] Mauritania
[156] Mauritius
[157] Mexico
[158] Micronesia, Fed. Sts.
[159] Middle East & North Africa
[160] Middle East & North Africa (excluding high income)
[161] Middle East & North Africa (IDA & IBRD countries)
[162] Middle income
[163] Moldova
[164] Monaco
[165] Mongolia
[166] Montenegro
[167] Morocco
[168] Mozambique
[169] Myanmar
[170] Namibia
[171] Nauru
[172] Nepal
[173] Netherlands
[174] New Caledonia
[175] New Zealand
[176] Nicaragua
[177] Niger
[178] Nigeria
[179] North America
[180] Northern Mariana Islands
[181] Norway
[182] OECD members
[183] Oman
[184] Other small states
[185] Pacific island small states
[186] Pakistan
[187] Palau
[188] Panama
[189] Papua New Guinea
[190] Paraguay
[191] Peru
[192] Philippines
[193] Poland
[194] Portugal
[195] Post-demographic dividend
[196] Pre-demographic dividend
[197] Puerto Rico
[198] Qatar
[199] Romania
[200] Russian Federation
[201] Rwanda
[202] Samoa
[203] San Marino
[204] Sao Tome and Principe
[205] Saudi Arabia
[206] Senegal
[207] Serbia
[208] Seychelles
[209] Sierra Leone
[210] Singapore
[211] Sint Maarten (Dutch part)
[212] Slovak Republic
[213] Slovenia
[214] Small states
[215] Solomon Islands
[216] Somalia
[217] South Africa
[218] South Asia
[219] South Asia (IDA & IBRD)
[220] South Sudan
[221] Spain
[222] Sri Lanka
[223] St. Kitts and Nevis
[224] St. Lucia
[225] St. Martin (French part)
[226] St. Vincent and the Grenadines
[227] Sub-Saharan Africa
[228] Sub-Saharan Africa (excluding high income)
[229] Sub-Saharan Africa (IDA & IBRD countries)
[230] Sudan
[231] Suriname
[232] Sweden
[233] Switzerland
[234] Syrian Arab Republic
[235] Tajikistan
[236] Tanzania
[237] Thailand
[238] Timor-Leste
[239] Togo
[240] Tonga
[241] Trinidad and Tobago
[242] Tunisia
[243] Turkey
[244] Turkmenistan
[245] Turks and Caicos Islands
[246] Tuvalu
[247] Uganda
[248] Ukraine
[249] United Arab Emirates
[250] United Kingdom
[251] United States
[252] Upper middle income
[253] Uruguay
[254] Uzbekistan
[255] Vanuatu
[256] Venezuela, RB
[257] Vietnam
[258] Virgin Islands (U.S.)
[259] West Bank and Gaza
[260] World
[261] Yemen, Rep.
[262] Zambia
[263] Zimbabwe
263 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
This graph would be too noisy as there are 263 different time series (a time series for each country in the data set). Above, I plotted the data without color coding by country - color coding by country would not fit on the chart as the legend itself would be too large (263 different lines).
I found an R package called ‘countrycode’ which can provide continent information based on the country name. Below, I split the graphs by continent.
Warning: Some values were not matched unambiguously: Arab World, Caribbean small states, Central Europe and the Baltics, Channel Islands, Early-demographic dividend, East Asia & Pacific, East Asia & Pacific (excluding high income), East Asia & Pacific (IDA & IBRD countries), Euro area, Europe & Central Asia, Europe & Central Asia (excluding high income), Europe & Central Asia (IDA & IBRD countries), European Union, Fragile and conflict affected situations, Heavily indebted poor countries (HIPC), High income, IBRD only, IDA & IBRD total, IDA blend, IDA only, IDA total, Kosovo, Late-demographic dividend, Latin America & Caribbean, Latin America & Caribbean (excluding high income), Latin America & the Caribbean (IDA & IBRD countries), Least developed countries: UN classification, Low & middle income, Low income, Lower middle income, Middle East & North Africa, Middle East & North Africa (excluding high income), Middle East & North Africa (IDA & IBRD countries), Middle income, North America, OECD members, Other small states, Pacific island small states, Post-demographic dividend, Pre-demographic dividend, Small states, South Asia, South Asia (IDA & IBRD), Sub-Saharan Africa, Sub-Saharan Africa (excluding high income), Sub-Saharan Africa (IDA & IBRD countries), Upper middle income, World
It is still a bit noisy so to narrow it down (again I excluded color coding by Country to get the data to fit), I will plot the countries with the top 2 averages for each continent. This isn’t a fool proof method - for example, a country can have a low GDP per capita for most of the time series and then have a massive spike that skews the average. Additionally, the averages could be taken on different time frames (for example, one country can have a much shorter time series of GDP per capita than another). However, I think this is a decent starting point for data exploration.
Next, I filter the gdp_per_capita tsibble for the top countries for the above and plot
filtered_gdp_per_capita <- gdp_per_capita |>filter(Country %in% topCountries)filtered_gdp_per_capita |>ggplot(aes(x = Year, y = GDPPerCapita, color = Country)) +geom_line()
Warning: Removed 179 rows containing missing values or values outside the scale range
(`geom_line()`).
We can see the top 2 countries in each continent and their GDP per capita over time. Liechtenstein and Monaco shave the highest GDP per capita, which has continually increased at a higher rate than other countries.
3.2
We are tasked with plotting various time series and transforming if necessary
While the graph doesn’t change much (besides the change in the Y axis values), this does provide more insight into whether GDP increases because the population increases or whether GDP increases because of increased output - because the per-capita graph is increasing, this suggests the latter.
Victorian livestock
aus_livestock |>filter(State =='Victoria', Animal =='Bulls, bullocks and steers') |>autoplot(Count)
The data is a bit noisy, but it seems like variation might increase/decrease with the level of the series.
Below I try a Box-Cox transformation and use the guerrero feature to choose a lambda value. The lambda value is set to -0.04 (very close to zero, which is close to a log transformation), however the graph does not look much different so it seems like the transformation did not change much. It is likely fine to work with either versions of the data (original or Box-Cox transformed).
lambda <- aus_livestock |>filter(State =='Victoria', Animal =='Bulls, bullocks and steers') |>features(Count, features = guerrero) |>pull(lambda_guerrero)aus_livestock |>filter(State =='Victoria', Animal =='Bulls, bullocks and steers') |>autoplot(box_cox(Count, lambda)) +ggtitle(paste('Box-Cox transformation with lambda',lambda))
Victorian Electricity Demand
vic_elec |>autoplot(Demand)
This data set has daily half-hour data, so the graph is noisy. I will aggregate it into daily data to reduce some of the noise
Interestingly, the guerrero lambda is near -1, which is close to a reciprocal transformation. There are some outlier spikes that still remain, even after a log transformation or a Box-Cox transformation. The transformed versions of the data look similar to the original, so I would likely stick with the original data for ease of interpretability.
This does improve the consistency of the variance, though there are still a few spikes that are larger than others (around the 1970s). Next, I try the Box-Cox transformation
lambdaAusGas<- aus_production|>select(Quarter, Gas) |>features(Gas, features = guerrero) |>pull(lambda_guerrero)aus_production|>select(Quarter, Gas) |>autoplot(box_cox(Gas, lambdaAusGas)) +ggtitle(paste('Box-Cox transformation with lambda',lambdaAusGas))
The Box-Cox transformation seems a bit better than the log transformation for improving the consistency of the variance of the time series, so the Box-Cox transformation is the appropriate transformation to use (and is what was used in the text book).
3.3
We are asked why a Box-Cox transformation is unhelpful for the canadian_gas data
canadian_gas |>autoplot(Volume)
lambdaCadGas<- canadian_gas|>features(Volume, features = guerrero) |>pull(lambda_guerrero)canadian_gas|>autoplot(box_cox(Volume, lambdaCadGas)) +ggtitle(paste('Box-Cox transformation with lambda',lambdaCadGas))
The Box-Cox transformation has a lambda of near 0.5, which would be equivalent to the square root of the values and is a weaker transformation than a cube root or logarithm transformation in terms of the change to the data. Furthermore, the square root is less interpretable. Lastly, the spikes between the 1980s and 1990s are still larger than the spikes towards the start and end of the time series (though the starts and ends do seem to be a bit more consistent with the transformation).
3.4
We are asked which Box-Cox transformation we would use for one of the time series of retail data from aus_retail. I will select ‘Liquor retailing’ for ‘New South Wales’
aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') |>autoplot(Turnover)
First, I will note that since the Turnover data is in $Million AUD, it makes sense to do an inflation adjustment. The R Package for the textbook has annual inflation data readily available, however monthly inflation data would need to be imported from another source. As such, and in order to see the variation and seasonality that comes with keeping the data in monthly form as opposed to aggregating to an annual value, I will bypass an inflation adjustment for now.
Next, I try a Box-Cox transformation and use the guerrero feature to choose a lambda value.
lambdaWalesLiquor <- aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') |>features(Turnover, features = guerrero) |>pull(lambda_guerrero)aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') |>autoplot(box_cox(Turnover, lambdaWalesLiquor)) +ggtitle(paste('Box-Cox transformation with lambda',lambdaWalesLiquor))
The guerrero feature uses a lambda of -0.03, which is near a log transformation.
aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') |>autoplot(log(Turnover))
The difference between the Box-Cox and the log transformation seems minuscule. I would go with the log transformation (Lambda of 0) for ease of interpretability.
3.5
We are tasked with performing Box-Cox transformations for a variety of time series
There is a large outlier in the late 1980s - the book previously discussed some of this data-set and mentioned there was a pilot strike. There is an additional outlier in the 1990s when, according to the book, airlines decreased the number of Economy seats in order to increase the number of Business seats.
The log transformation doesn’t make much of a difference and the outlier is still present.
Next, I try a Box-Cox transformation and use the guerrero feature to choose a lambda value.
flightsLambda <- ansett |>filter(Airports=='MEL-SYD', Class=='Economy') |>features(Passengers, features = guerrero) |>pull(lambda_guerrero)ansett |>filter(Airports=='MEL-SYD', Class=='Economy') |>autoplot(box_cox(Passengers, flightsLambda)) +ggtitle(paste('Box-Cox transformation with lambda',flightsLambda))
This does improve some of the variation with the outliers, however the variation is still unstable. This data-set seems difficult to transform due to exogenous shocks that have created outliers. I would go with the Box-Cox transformation with the guerrero lambda value, however the data set likely needs to be adjusted to handle outliers.
southernCrossLambda <- southernCrossWeekly |>features(totalWeek, features = guerrero) |>pull(lambda_guerrero)southernCrossWeekly |>autoplot(box_cox(totalWeek, southernCrossLambda)) +ggtitle(paste('Box-Cox transformation with lambda',southernCrossLambda))
The Box-Cox transformation does a good job of standardizing the variance in the weeks of the year that are not near the end of the year/beginning of the year. My hypothesis is that station counts drop significantly due to the end of the year/beginning of the year holidays, which creates outliers. I would go with the Box-Cox transformation for this case.
3.7
We are tasked with analyzing the last five years of Gas data from aus_production
gas <-tail(aus_production, 5*4) |>select(Gas)
A
I plot the time series. Based on the below, there is an upward trend-cycle with seasonality (with gas production dipping in Q1 and peaking around Q3.
gas |>autoplot(Gas)
B
Below I use classical decomposition with type multiplicative on the series:
gas |>model(classical_decomposition(Gas, type='multiplicative')) |>components() |>autoplot() +labs(title ='classical multiplicative decomposition of the last five years of gas production')
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
C
The component plots show that there is a positive trend with seasonality, confirming the observation from part A.
D
Below I compute and plot the seasonally adjusted data (the blue line)
gasDecomp <- gas |>model(classical_decomposition(Gas, type='multiplicative')) |>components() |>as_tsibble()gasDecomp |>autoplot(Gas) +geom_line(aes(y= season_adjust), color="#0072B2") +labs(y ="Gas Production)",title ="Quarterly Gas Production")
E
Next, we are tasked with changing one data point to an outlier
The outlier causes the seasonal adjusted data to have a large spike - in the graph above the black line is the seasonally adjusted data from the regular data set while the blue line is the seasonally adjusted data from the data set with the outlier. Furthermore, it causes more fluctuation in the seasonally adjusted data.
F
I will create a third outlier that is towards the end and compare it to the data above.
The outlier at the end causes a spike towards the end of the data. Both outlier data sets have more fluctuation in the seasonally adjusted data than the seasonally adjusted data based on the original data set.
3.8
We are tasked with performing a time series decomposition on the data from question 3.4 using the X-11 method.
The data looked like this:
aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') |>autoplot(Turnover)
I will transform the data with a log transformation, as in 3.4 I saw that the Box-Cox seemed similar to a log transformation
logLiquorSales <- aus_retail |>filter(Industry=='Liquor retailing', State=='New South Wales') logLiquorSales$logTurnover <-log(logLiquorSales$Turnover)
Next, I perform the X-11 decomposition on the log data
x11_dcmp <- logLiquorSales |>model(x11 =X_13ARIMA_SEATS(logTurnover ~x11())) |>components()autoplot(x11_dcmp) +labs(title ="Decomposition of total US retail employment using X-11.")
For comparison, I perform a classical multiplicative decomposition (the variation in the seasonal pattern seems proportional to the level of the series).
logLiquorSales |>model(classical_decomposition(logTurnover, type ="multiplicative") ) |>components() |>autoplot()
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).
The two methods seem similar for the trend and seasonal components; however, the remainder component for the X-11 method is more consistent than for the classical method. Furthermore, the X-11 Method has estimates for the trend cycle for the full data set whereas the classical method does not have estimates for the last few data points.
3.9
We are asked to analyze an STL decomposition of monthly Australian labor force data as well as the seasonal sub-series plot of the seasonal component
A
The time series has a positive trend with some cyclicity (where the upward trend either pauses or dips). This is the main contributor to the time series, evident by the smallest rectangle on the left hand side of the chart. The remainder has a larger contribution to the series than the seasonality, as the seasonality has the largest rectangle of the three components. The seasonal sub-series plot shots that labor force participation is highest in the middle to end of Q1 and Q2, while Q3 and Q4 have some more variance.
B
The recession of 1991/1992 is slightly visible in the trend component, as the upward trend flat-lines for a bit; however, the recession is most visible in the remainder component which has a ma large decrease, spike, and then another large decrease before returning to a somewhat more normal pattern.