library(fpp3)
library(RColorBrewer)
library(seasonal)
Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book. Please include your Rpubs link along with.pdf file of your run code
Monaco has the highest GDP per capita. This has been the case for almost the entirety of the dataset, for as long as Monaco’s data was collected (1970), except for a couple of short periods in the 2010s.
global_economy_per_capita <- global_economy %>% mutate(GDP_perCapita = GDP/Population)
global_economy_per_capita %>% autoplot(GDP_perCapita) + theme(legend.position = "none")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
gepc_tibble <- global_economy_per_capita %>% as_tibble() %>% group_by(Country) %>% summarize(mean_GDP_pc = mean(GDP_perCapita, na.rm = TRUE)) %>% arrange(desc(mean_GDP_pc)) %>% head(10)
gepc_top <- gepc_tibble$Country %>% as.character()
global_economy_per_capita %>% filter(Country %in% gepc_top) %>% autoplot(GDP_perCapita) + geom_line(size = 2) + scale_color_brewer(palette = "Set1")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning: Removed 299 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 299 rows containing missing values or values outside the scale range
## (`geom_line()`).
The following is a plot of United States GDP from global_economy. The grey line represents the original data, while the blue line represents the inflation adjusted GDP.
us_global_economy <- global_economy %>% filter(Country == "United States") %>% mutate(Adjusted_GDP = GDP / CPI * 100)
us_global_economy %>% ggplot(aes(Year,)) + geom_line(aes(y = GDP, x = Year),size = 2, color = "grey") + geom_line(aes(y = Adjusted_GDP, x = Year), size = 2, color = "blue") + labs(x = "Year", y = "GDP in $USD", title = "United States GDP From 1960 to 2017") + scale_y_continuous(labels = scales::comma)
In this dataset, I used a box_cox transformation for all of States except for Queensland. I used the Guerrero Features function to estimate the most appropriate lambda for each territory, accounting for 0 data in certain States by cutting down the analysis window once the data tapers and stays at 0. For all of the box_cox transformations, the seasonal variance is evened out.
aus_livestock_bbs <- aus_livestock %>% filter(Animal == "Bulls, bullocks and steers") %>% na.omit
aus_livestock_bbs %>% autoplot()
aus_livestock_bbs %>% features(Count,features = guerrero)
## # A tibble: 8 × 3
## Animal State lambda_guerrero
## <fct> <fct> <dbl>
## 1 Bulls, bullocks and steers Australian Capital Territory 1.00
## 2 Bulls, bullocks and steers New South Wales -0.599
## 3 Bulls, bullocks and steers Northern Territory 0.126
## 4 Bulls, bullocks and steers Queensland 0.789
## 5 Bulls, bullocks and steers South Australia 0.615
## 6 Bulls, bullocks and steers Tasmania 0.0427
## 7 Bulls, bullocks and steers Victoria -0.0446
## 8 Bulls, bullocks and steers Western Australia 0.0447
For Australian Capital Territory, I limited the data to less than 1993, as afterwards it appears to taper to 0.
aus_livestock_bbs <- aus_livestock %>% filter(Animal == "Bulls, bullocks and steers") %>% na.omit
aus_livestock_bbs %>% autoplot()
aus_livestock_bbs %>% filter(year(Month) < 1993) %>% features(Count,features = guerrero)
## # A tibble: 8 × 3
## Animal State lambda_guerrero
## <fct> <fct> <dbl>
## 1 Bulls, bullocks and steers Australian Capital Territory 1.70
## 2 Bulls, bullocks and steers New South Wales -0.590
## 3 Bulls, bullocks and steers Northern Territory 0.308
## 4 Bulls, bullocks and steers Queensland 1.53
## 5 Bulls, bullocks and steers South Australia 0.352
## 6 Bulls, bullocks and steers Tasmania 0.770
## 7 Bulls, bullocks and steers Victoria 0.297
## 8 Bulls, bullocks and steers Western Australia 0.638
aus_livestock_bbs %>% filter(State == "Australian Capital Territory", year(Month) < 1993) %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Australian Capital Territory, Australia")
aus_livestock_bbs %>% filter(State == "Australian Capital Territory", year(Month) < 1993) %>% autoplot(box_cox(Count,1.7)) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Australian Capital Territory, Australia")
For New South Wales, transforming the data evens out the shape of the dataset as well as the seasonal variance.
aus_livestock_bbs %>% features(Count,features = guerrero)
## # A tibble: 8 × 3
## Animal State lambda_guerrero
## <fct> <fct> <dbl>
## 1 Bulls, bullocks and steers Australian Capital Territory 1.00
## 2 Bulls, bullocks and steers New South Wales -0.599
## 3 Bulls, bullocks and steers Northern Territory 0.126
## 4 Bulls, bullocks and steers Queensland 0.789
## 5 Bulls, bullocks and steers South Australia 0.615
## 6 Bulls, bullocks and steers Tasmania 0.0427
## 7 Bulls, bullocks and steers Victoria -0.0446
## 8 Bulls, bullocks and steers Western Australia 0.0447
aus_livestock_bbs %>% filter(State == "New South Wales") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in New South Wales, Australia")
aus_livestock_bbs %>% filter(State == "New South Wales") %>% autoplot(box_cox(Count,-.599)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in New South Wales, Australia")
In the Northern Territory dataset, the transformation was very pronounce. The original dataset demonstrated large seasonal variation in 1980 to approximately 1993, and then far less seasonal variance afterwards before tapering to 0. To prevent the 0 data from influencing the transformation too strongly, I only transformed and graphed the data up to 2001.
aus_livestock_bbs %>% filter(year(Month) < 2001) %>% features(Count,features = guerrero)
## # A tibble: 8 × 3
## Animal State lambda_guerrero
## <fct> <fct> <dbl>
## 1 Bulls, bullocks and steers Australian Capital Territory 1.00
## 2 Bulls, bullocks and steers New South Wales -0.573
## 3 Bulls, bullocks and steers Northern Territory 0.165
## 4 Bulls, bullocks and steers Queensland 1.08
## 5 Bulls, bullocks and steers South Australia 0.684
## 6 Bulls, bullocks and steers Tasmania 0.637
## 7 Bulls, bullocks and steers Victoria 0.302
## 8 Bulls, bullocks and steers Western Australia 0.173
aus_livestock_bbs %>% filter(State == "Northern Territory",year(Month) < 2001) %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Northern Territory, Australia")
aus_livestock_bbs %>% filter(State == "Northern Territory",year(Month) < 2001) %>% autoplot(box_cox(Count,.165)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in Northern Territory, Australia")
The lambda for Queensland is close to 1, which means the transformed data is shifted downwards without any changes to the shape of the time series. Performing the box_cox transformation with lambda = 0.789 results in little discernible change in the plot. I would not transform this data s a result, to make the results easier to interpret as the benefits of the transformation does not appear to be pronounced.
aus_livestock_bbs %>% filter(State == "Queensland") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Queensland, Australia")
aus_livestock_bbs %>% filter(State == "Queensland") %>% autoplot(box_cox(Count,0.789)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in Queensland, Australia")
Much like the New South Wales dataset, transforming the data evens out the shape as well as the seasonal variance.
aus_livestock_bbs %>% filter(State == "South Australia") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in South Australia, Australia")
aus_livestock_bbs %>% filter(State == "South Australia") %>% autoplot(box_cox(Count,0.615)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in South Australia, Australia")
For Tasmania, Victoria, and Western Australia, a lambda of 0 was used so easier interpretation of the resulting transformation.
aus_livestock_bbs %>% filter(State == "Tasmania") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Tasmania, Australia")
aus_livestock_bbs %>% filter(State == "Tasmania") %>% autoplot(box_cox(Count,0)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in Tasmania, Australia")
aus_livestock_bbs %>% filter(State == "Victoria") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Victoria, Australia")
aus_livestock_bbs %>% filter(State == "Victoria") %>% autoplot(box_cox(Count,0)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in Victoria, Australia")
aus_livestock_bbs %>% filter(State == "Western Australia") %>% autoplot(Count) + labs (y = "Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Murder Count in Western Australia, Australia")
aus_livestock_bbs %>% filter(State == "Western Australia") %>% autoplot(box_cox(Count,0)) + labs (y = "Box_Cox Transformed Slaughter Count", x = "Time", title = "Bulls, Bullocks, and Steer Transformed Murder Count in Western Australia, Australia")
Transforming the dataset appears to have kept the same general cycle, while evening out seasonal variance. Given that data is taken at 30 minute intervals, I summarized the data to group by date and added the demand over a single day. Proportionately to the original graph, the seasonal variance appears to have been stretched/elongated to even out the seasonal variance.
vic_elec_daily <- vic_elec %>% as_tibble() %>% mutate(dateOnly = date(Date)) %>% group_by(dateOnly) %>% summarize(sumDemand = sum(Demand))
vic_elec_daily <- vic_elec_daily %>% as_tsibble(index = dateOnly)
vic_elec_daily %>% features(sumDemand,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 -0.900
vic_elec %>% features(Demand,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.0999
vic_elec %>% autoplot()
vic_elec %>% autoplot(box_cox(Demand,0)) + labs (y = "ln(Demand of Electricity)", x = "Year", title = "Electricity Demand in Victoria, Australia")
vic_elec_daily %>% autoplot()
vic_elec_daily %>% autoplot(box_cox(sumDemand,0.9)) + labs (y = "Box Cox Transfomration of Daily Electricity Demand", x = "Year", title = "Electricity Demand in Victoria, Australia")
The original plot displayed small seasonal variance before 1970, with gradually increasing variance as the data approached 2010. Transforming with Box Cox caused the seasonal variance to even, but with the shape of the plot becoming more logarithmic.
aus_production %>% autoplot(Gas)
aus_production %>% features(Gas,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.110
aus_production %>% autoplot(box_cox(Gas,0.11))+ labs (y = "Box Cox Transformation of Gas Production", x = "Year", title = "Gas Production in Australia")
With further research, it appears Box Cox transformations are not appropriate for datasets with many zero values. In the above scenarios with Australian Capital Territory and Northern Territory, I am unsure if it is acceptable to cut down the timeframe of analysis to use the Box-Cox transformation. However, I believe it could be justified as the zero values all occur after a certain year, which is not the same as having zero values interspersed throughout the data.
The Box-Cox transformation is unhelpful for the canadian_gas as it looks like the transformation did not adequately unify seasonal variance even with the optimal lambda determined via the Guerrero Feature. Practically, it may also be unhelpful because the lambda is not approximately 0, which makes the results difficult to interpret after the transformation.
canadian_gas %>% autoplot()
canadian_gas %>% features(Volume,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.577
canadian_gas %>% autoplot(box_cox(Volume,0.577))
In this case, I would use a Box-Cox transformation with lambda 0. Using the Guerrero Feature, I found the optimal lambda to be 0.15. When plotting with a Box-Cox transformation with lambda 0.15 and comparing to a lambda of 0, both plots visually look to have the approximate same amount of seasonal variance. Therefore, a 0 lambda is chosen to more easily interpret the results as percent change.
set.seed(32)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>% autoplot()
myseries %>% features(Turnover, features = guerrero)
## # A tibble: 1 × 3
## State Industry lambda_guerrero
## <chr> <chr> <dbl>
## 1 South Australia Pharmaceutical, cosmetic and toiletry goods r… 0.150
myseries %>% autoplot(box_cox(Turnover, 0.15))
myseries %>% autoplot(box_cox(Turnover, 0))
myseries_Transformed <- myseries %>% mutate(logTurnover = log(Turnover))
myseries_Transformed %>% autoplot(logTurnover)
dcmp <- myseries_Transformed %>% model(stl= STL(logTurnover))
components(dcmp)
## # A dable: 441 x 9 [1M]
## # Key: State, Industry, .model [1]
## # : logTurnover = trend + season_year + remainder
## State Industry .model Month logTurnover trend season_year remainder
## <chr> <chr> <chr> <mth> <dbl> <dbl> <dbl> <dbl>
## 1 South Austr… Pharmac… stl 1982 Apr 1.97 2.06 -0.0915 0.00226
## 2 South Austr… Pharmac… stl 1982 May 2.01 2.07 -0.0272 -0.0256
## 3 South Austr… Pharmac… stl 1982 Jun 2.01 2.07 -0.0461 -0.0110
## 4 South Austr… Pharmac… stl 1982 Jul 2.07 2.08 0.000460 -0.00989
## 5 South Austr… Pharmac… stl 1982 Aug 2.12 2.08 0.0246 0.0116
## 6 South Austr… Pharmac… stl 1982 Sep 2.05 2.08 -0.0152 -0.0145
## 7 South Austr… Pharmac… stl 1982 Oct 2.13 2.09 0.0410 -0.000329
## 8 South Austr… Pharmac… stl 1982 Nov 2.17 2.09 0.0562 0.0272
## 9 South Austr… Pharmac… stl 1982 Dec 2.41 2.10 0.202 0.110
## 10 South Austr… Pharmac… stl 1983 Jan 2.09 2.10 -0.0117 0.00470
## # ℹ 431 more rows
## # ℹ 1 more variable: season_adjust <dbl>
myseries_Transformed %>% autoplot(logTurnover,color = "grey") + autolayer(components(dcmp), trend, color = "blue", size = 2)
Given the optimal Box-Cox lambda is close to 1, I would say that it is appropriate to not transform the data as variance is already mostly stabilized.
aus_production %>% autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
aus_production %>% features(Tobacco,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.926
aus_production %>% autoplot(box_cox(Tobacco,0.926))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
Using a lambda of 2 for a Box-Cox transformation of the data, I can see that seasonal variance is relatively evened out in comparison to the original plot. However, there appears to be some strong limitations with using a Box-Cox transformation to stabilize variance in this case, as the larger dips (e.g. the steep decline in Passengers to 0, due to pilot industrial disputes).
MelSyd <- ansett %>% filter(Airports == "MEL-SYD", Class == "Economy")
MelSyd %>% autoplot(Passengers)
MelSyd %>% features(Passengers,features = guerrero)
## # A tibble: 1 × 3
## Airports Class lambda_guerrero
## <chr> <chr> <dbl>
## 1 MEL-SYD Economy 2.00
MelSyd %>% autoplot(box_cox(Passengers,2)) + labs (y = "Box Cox Transformation ofPassenger Counts", x = "Time", title = "Economy Passenger Counts for Ansett Airlines from Melbourne to Syndey")
As data was taken at very close timepoints, I have chosen for this exercise to remove variance within days by taking the sum of daily pedestrian counts.
SCS <- pedestrian %>% filter(Sensor == "Southern Cross Station")
SCS %>% autoplot(Count)
SCS %>% features(Count,features = guerrero)
## # A tibble: 1 × 2
## Sensor lambda_guerrero
## <chr> <dbl>
## 1 Southern Cross Station -0.250
SCS %>% autoplot(box_cox(Count,-.25))
SCS_Daily <- SCS %>% as_tibble() %>% group_by(Date) %>% summarize(sumCount = sum(Count)) %>% as_tsibble(index = Date)
SCS_Daily %>% autoplot()
SCS_Daily %>% features(sumCount,features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.273
SCS_Daily %>% autoplot(box_cox(sumCount, 0.273)) + labs (y = "Box Cox Transformation of Daily Pedestrian Counts", x = "Date", title = "Pedestrian Counts at Southern Cross Station")
gas <- tail(aus_production, 5*4) |> select(Gas)
gas %>% autoplot()
Seasonal fluctuation in this dataset is shown as a yearly pattern, where gas production increases in the summer and then decreases towards the Winter. The trend-cycle is upwards as time increases towards 2011.
gas %>% autoplot()
gas %>% autoplot()
gas %>% model(classical_decomposition(Gas,type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
The results from classical multiplicative decomposition results the graphical interpretation from part 1. I can see the seasonal component has a yearly predictable pattern.
I computed and plotted the seasonally adjusted data by finding the components using components(). This breaks the time series into its components, including a seasonally adjusted component that removes variation from the season.
dcmp <- gas %>% model(classical_decomposition(Gas,type = "multiplicative"))
gas %>% autoplot(Gas, color = "gray") + autolayer(components(dcmp), season_adjust, color = "blue", size = 2) +
labs (y = "Gas production in petrajoules", x = "Time", title = "Seasonally adjusted gas production in Australia")
Changing one observation in the outlier, the outlier strongly affects the seasonally adjusted data. The data appears to follow the seasonal component very closely, indicating that it was not properly adjusted for season due to the outlier.
gas_updated <- gas %>% mutate(Gas = if_else(Quarter == yearquarter("2008 Q1"),Gas + 300, Gas))
gas_updated
## # A tsibble: 20 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 233 2006 Q3
## 6 192 2006 Q4
## 7 187 2007 Q1
## 8 234 2007 Q2
## 9 245 2007 Q3
## 10 205 2007 Q4
## 11 494 2008 Q1
## 12 229 2008 Q2
## 13 249 2008 Q3
## 14 203 2008 Q4
## 15 196 2009 Q1
## 16 238 2009 Q2
## 17 252 2009 Q3
## 18 210 2009 Q4
## 19 205 2010 Q1
## 20 236 2010 Q2
dcmp_updated <- gas_updated %>% model(classical_decomposition(Gas,type = "multiplicative"))
gas_updated %>% autoplot(Gas, color = "gray") + autolayer(components(dcmp_updated), season_adjust, color = "blue", size = 2) +
labs (y = "Gas production in petrajoules", x = "Time", title = "Seasonally adjusted gas production in Australia")
If the outlier is near the end instead of the middle of the timeseries, the data is still affected (i.e. we can still see the outlier represented in the dataset) but far less dramatically as the overall shape of the seasonally adjusted data no longer follows the seasonal component of the decomposition.
gas_updated <- gas %>% mutate(Gas = if_else(Quarter == yearquarter("2010 Q2"),Gas + 300, Gas))
gas_updated
## # A tsibble: 20 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 233 2006 Q3
## 6 192 2006 Q4
## 7 187 2007 Q1
## 8 234 2007 Q2
## 9 245 2007 Q3
## 10 205 2007 Q4
## 11 194 2008 Q1
## 12 229 2008 Q2
## 13 249 2008 Q3
## 14 203 2008 Q4
## 15 196 2009 Q1
## 16 238 2009 Q2
## 17 252 2009 Q3
## 18 210 2009 Q4
## 19 205 2010 Q1
## 20 536 2010 Q2
dcmp_updated <- gas_updated %>% model(classical_decomposition(Gas,type = "multiplicative"))
gas_updated %>% autoplot(Gas, color = "gray") + autolayer(components(dcmp_updated), season_adjust, color = "blue", size = 2) +
labs (y = "Gas production in petrajoules", x = "Time", title = "Seasonally adjusted gas production in Australia")
I decomposed the series and the log transformed series with X-11. In both, we can see a couple of spikes in the trend-cycle component, as well as additional spikes shown in the remainder component. These features were not previously viewable in the timeseries before decomposition.
x11_dcmp_transformed <- myseries_Transformed %>% model(x11 = X_13ARIMA_SEATS(logTurnover ~ x11())) %>% components()
autoplot(x11_dcmp_transformed)
x11_dcmp <- myseries %>% model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>% components()
autoplot(x11_dcmp)
myseries_Transformed %>% autoplot(logTurnover)
myseries %>% autoplot(Turnover)
The STL decomposition shown in figure 3.19 shows an abnormal dip in civilian labor force in Australia around 1991. This dip is viewable in the overall timeseries, although it is especially prominent in the decomposition when the remainder component is shown at a larger scale. The decomposition also shows that the civilian labor force in increasing in general, with a clear seasonality over the years. Figure 3.20 shows that, in general, the civilian labor force increases from January to March, and then generally decreases again until a spike in September. This is followed again by a dip in labor force until another spike in December.
The recession of 1991/1992 does appear viewable in two ways. In the remainder component of Figure 3.19, we can see a comparatively sizeable dip at around 1991/1992. Furthermore, looking at the seasonal component in Figure 3.20, we can see a relative dip in the data at the approximate 1990 mark. The recession began at around September 1990 and lasted until September 1991. Figure 3.20 shows that this is reflected in the data, which, at that timeframe, was generally below the median.