Consider the GDP information in global_economy. Plot the GDP per capita for each country over time.
# Identify the top 10 countries based on mean per cap GDP over time
top_10_countries <- as_tibble(global_economy) %>%
mutate(gdp_per_cap = GDP/Population) %>%
group_by(Country) %>%
summarize(mean_per_cap_gdp = mean(gdp_per_cap)) %>%
arrange(desc(mean_per_cap_gdp)) %>%
head(10) %>%
select(Country)
# Plot the top 10 countries per cap GDP over time
as_tibble(global_economy) %>%
filter(Country %in% top_10_countries[[1]]) %>%
mutate(gdp_per_cap = GDP/Population) %>%
group_by(Country) %>%
mutate(mean_per_cap_gdp = mean(gdp_per_cap)) %>%
ungroup() %>%
ggplot(aes(x=Year, y=gdp_per_cap, color=Country)) +
geom_line() +
labs(title = "GDP Per Capita for Top-10 Countries",
x= "Year",
y = "GDP Per Capita")
Which country has the highest GDP per capita? How has this changed over
time?
The Netherlands has the highest GDP per capita over this period. While it wasn’t the highest from the outset, it saw an increase in the 1980s and ultimately overtook the other countries around 1990 and has maintained itself as the highest ever since.
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
For this data, I applied an adjusted to the GDP based on inflation. Doing so doesn’t appear to have made a big difference or provide any significant information gain. However, the rate of the growth and the trend in the data is much more stable over the time series compared to a more steep increase exhibited by the pre-transformed data.
global_economy %>%
filter(Country == 'United States') %>%
mutate(adjusted_gdp = GDP/CPI * 100) %>%
ggplot() +
geom_line(aes(x=Year, y=GDP), color="red") +
geom_line(aes(x=Year, y=adjusted_gdp), color="blue")
b. Slaughter of Victorian “Bulls, bullocks and steers” in
aus_livestock.
For this data series, I did both a calendar transformation, byt changing the data from monthly to quarterly. Additionally, in doing this, I adjusted the data to show the mean count per quarter. Finally, I viewed the graph after applying a power transformation - using an nth root of 3. Ultimately, I didn’t really see any benefit to applying these various transormations.
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>%
ggplot(aes(x=Month, y=Count)) +
geom_line()
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>%
index_by(Quarter = yearquarter(Month)) %>%
group_by(Quarter) %>%
summarize(mean_count = mean(Count)) %>%
ggplot(aes(x=Quarter, y=mean_count)) +
geom_line()
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>%
index_by(Quarter = yearquarter(Month)) %>%
group_by(Quarter) %>%
summarize(mean_count = mean(Count)) %>%
mutate(adj_count = nthroot(mean_count,3)) %>%
autoplot(adj_count)
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers" & State=="Victoria") %>%
index_by(Quarter = yearquarter(Month)) %>%
group_by(Quarter) %>%
summarize(mean_count = mean(Count)) %>%
mutate(adj_count = log(mean_count)) %>%
autoplot(adj_count)
For this series, I tried several transformations. The main goal was to minimize the variance that exists in the series. Across each of the transformations, I decided that a log transformation was the most effective transformation for reducing the variance. Using this transformation the range for the data was between 8.2 and 8.7
weekly_vic_elec <- as_tibble(vic_elec) %>%
mutate(Week = yearweek(Date)) %>%
group_by(Week) %>%
summarize(avg_demand = mean(Demand)) %>%
ungroup() %>%
as_tsibble()
## Using `Week` as index variable.
lambda <- weekly_vic_elec %>%
features(avg_demand, feature = guerrero) %>%
pull(lambda_guerrero)
weekly_vic_elec %>%
mutate(adj_demand = log(avg_demand)) %>%
autoplot(adj_demand)
weekly_vic_elec %>%
autoplot(avg_demand)
weekly_vic_elec %>%
autoplot(box_cox(avg_demand,lambda))
vic_elec %>%
group_by(Date) %>%
summarize(avg_demand = mean(Demand))
## # A tsibble: 52,608 x 3 [30m] <Australia/Melbourne>
## # Key: Date [1,096]
## Date Time avg_demand
## <date> <dttm> <dbl>
## 1 2012-01-01 2012-01-01 00:00:00 4383.
## 2 2012-01-01 2012-01-01 00:30:00 4263.
## 3 2012-01-01 2012-01-01 01:00:00 4049.
## 4 2012-01-01 2012-01-01 01:30:00 3878.
## 5 2012-01-01 2012-01-01 02:00:00 4036.
## 6 2012-01-01 2012-01-01 02:30:00 3866.
## 7 2012-01-01 2012-01-01 03:00:00 3694.
## 8 2012-01-01 2012-01-01 03:30:00 3562.
## 9 2012-01-01 2012-01-01 04:00:00 3433.
## 10 2012-01-01 2012-01-01 04:30:00 3359.
## # ℹ 52,598 more rows
vic_elec %>%
autoplot(Demand)
Finally, for this series, the box_cox transformation proved to be highly effective in reducing the variance exhibited in the data.
lambda <- aus_production |>
features(Gas, features = guerrero) |>
pull(lambda_guerrero)
aus_production %>%
autoplot(Gas)
aus_production %>%
autoplot(box_cox(Gas,lambda))
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
The Box-Cox transformation is unhelpful with this data because the variance in the data is not changing in a consistent way. The variance that is exhibited in the middle of the series is much ore extreme then the variance that is exhibited at the beginning and end of the series.
canadian_gas %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Volume`
lambda <- canadian_gas |>
features(Volume, features = guerrero) |>
pull(lambda_guerrero)
canadian_gas %>%
autoplot(box_cox(Volume, lambda))
What Box-Cox transformation would you select for your retail data
I chose a box-cox transformation with a lambda of.23
set.seed(9716)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`
lambda <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
myseries %>%
autoplot(box_cox(Turnover, lambda))
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
Tobacco from aus_production,
aus_production %>%
autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values (`geom_line()`).
lambda <- aus_production %>%
features(Tobacco, features = guerrero) %>%
pull(lambda_guerrero)
aus_production %>%
autoplot(box_cox(Tobacco, lambda))
## Warning: Removed 24 rows containing missing values (`geom_line()`).
Economy class passengers between Melbourne and Sydney from ansett,
as_tibble(ansett)
## # A tibble: 7,407 × 4
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1989 W28 ADL-PER Business 193
## 2 1989 W29 ADL-PER Business 254
## 3 1989 W30 ADL-PER Business 185
## 4 1989 W31 ADL-PER Business 254
## 5 1989 W32 ADL-PER Business 191
## 6 1989 W33 ADL-PER Business 136
## 7 1989 W34 ADL-PER Business 0
## 8 1989 W35 ADL-PER Business 0
## 9 1989 W36 ADL-PER Business 0
## 10 1989 W37 ADL-PER Business 0
## # ℹ 7,397 more rows
ansett_filtered <- ansett %>%
filter(Airports == "MEL-SYD" & Class == 'Economy')
ansett_filtered %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Passengers`
lambda <- ansett_filtered %>%
features(Passengers, features = guerrero) %>%
pull(lambda_guerrero)
ansett_filtered %>%
autoplot(box_cox(Passengers, lambda))
Pedestrian counts at Southern Cross Station from pedestrian
as_tibble(pedestrian) %>% select(Sensor) %>% distinct()
## # A tibble: 4 × 1
## Sensor
## <chr>
## 1 Birrarung Marr
## 2 Bourke Street Mall (North)
## 3 QV Market-Elizabeth St (West)
## 4 Southern Cross Station
pedestrian
## # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
## # Key: Sensor [4]
## Sensor Date_Time Date Time Count
## <chr> <dttm> <date> <int> <int>
## 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
## 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826
## 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567
## 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264
## 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139
## 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77
## 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44
## 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56
## 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113
## 10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166
## # ℹ 66,027 more rows
pedestrian_filtered <- pedestrian %>%
filter(Sensor == 'Southern Cross Station')
pedestrian_filtered %>%
autoplot(Count)
pedestrian_adj <- as_tibble(pedestrian_filtered) %>%
group_by(Date) %>%
summarize(total_count = sum(Count)) %>%
tsibble()
## Using `Date` as index variable.
pedestrian_weekly <- as_tibble(pedestrian_filtered) %>%
mutate(Week = yearweek(Date)) %>%
group_by(Date) %>%
summarize(daily_count = sum(Count)) %>%
mutate(Week = yearweek(Date)) %>%
group_by(Week) %>%
summarize(weekly_avg = mean(daily_count)) %>%
as_tsibble()
## Using `Week` as index variable.
lambda <- pedestrian_weekly %>%
features(weekly_avg, features = guerrero) %>%
pull(lambda_guerrero)
lambda2 <- pedestrian_adj %>%
features(total_count, features = guerrero) %>%
pull(lambda_guerrero)
pedestrian_weekly %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = weekly_avg`
pedestrian_weekly %>%
autoplot(box_cox(weekly_avg, lambda))
gas <- tail(aus_production, 5*4) |> select(Gas)
The overall trend appears to be a positive trend, while there is a seasonal fluctuation where the data peaks in Q3 and is at its lowest point in Q1. This seasonality appears to persist on an annual basis.
gas %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Gas`
gas %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
Yes, the results from the decomposition support the interpretation from part a. In the base graphic, we find that the seasonality is such that Q1 is the trough of the seasonal data, while the apex occurs in Q4. Additionally, we can see the existence of an overall positive trend that we see exhibited in the trend portion of the graph.
gas %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas)
gas %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, color="gray") +
geom_line(aes(y=season_adjust), color="black")
The outlier seems to throw the entire series off kilter. It makes it so the seasonal adjustments are no longer valid.
gas_outlier <- gas
set.seed(9162023)
sample_index = sample(1:length(gas[[1]]),1)
gas_outlier[[sample_index,1]] = gas_outlier[[sample_index,1]] + 300
gas_outlier
## # A tsibble: 20 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 233 2006 Q3
## 6 192 2006 Q4
## 7 187 2007 Q1
## 8 534 2007 Q2
## 9 245 2007 Q3
## 10 205 2007 Q4
## 11 194 2008 Q1
## 12 229 2008 Q2
## 13 249 2008 Q3
## 14 203 2008 Q4
## 15 196 2009 Q1
## 16 238 2009 Q2
## 17 252 2009 Q3
## 18 210 2009 Q4
## 19 205 2010 Q1
## 20 236 2010 Q2
gas_outlier %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, color="gray") +
geom_line(aes(y=season_adjust), color="black")
No, the placement of the outlier does not seem to have a change in the decomposition of the data series.
gas_outlier_begin <- gas
gas_outlier_end <- gas
gas_outlier_begin[[5,1]] = gas[[5,1]] + 300
gas_outlier_begin
## # A tsibble: 20 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 533 2006 Q3
## 6 192 2006 Q4
## 7 187 2007 Q1
## 8 234 2007 Q2
## 9 245 2007 Q3
## 10 205 2007 Q4
## 11 194 2008 Q1
## 12 229 2008 Q2
## 13 249 2008 Q3
## 14 203 2008 Q4
## 15 196 2009 Q1
## 16 238 2009 Q2
## 17 252 2009 Q3
## 18 210 2009 Q4
## 19 205 2010 Q1
## 20 236 2010 Q2
gas_outlier_end[[15,1]] = gas[[15,1]] + 300
gas_outlier_end
## # A tsibble: 20 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 233 2006 Q3
## 6 192 2006 Q4
## 7 187 2007 Q1
## 8 234 2007 Q2
## 9 245 2007 Q3
## 10 205 2007 Q4
## 11 194 2008 Q1
## 12 229 2008 Q2
## 13 249 2008 Q3
## 14 203 2008 Q4
## 15 496 2009 Q1
## 16 238 2009 Q2
## 17 252 2009 Q3
## 18 210 2009 Q4
## 19 205 2010 Q1
## 20 236 2010 Q2
gas_outlier_begin %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, color="gray") +
geom_line(aes(y=season_adjust), color="#0072B2")
gas_outlier_end %>%
model(classical_decomposition(Gas, type="multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, color="gray") +
geom_line(aes(y=season_adjust), color="#0072B2")
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
The one new observation that I notice, is that the seasonality in the data appears to have a downward trend that I did not notice previously. Additionally, there is a significant outlier in the data that occurs in the early part of 2000.
myseries
## # A tsibble: 441 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Apr 65.8
## 2 New South Wales Furniture, floor coverings, ho… A3349468W 1982 May 65.8
## 3 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Jun 62.3
## 4 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Jul 68.2
## 5 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Aug 66
## 6 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Sep 62.3
## 7 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Oct 66.2
## 8 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Nov 68.9
## 9 New South Wales Furniture, floor coverings, ho… A3349468W 1982 Dec 90.8
## 10 New South Wales Furniture, floor coverings, ho… A3349468W 1983 Jan 58
## # ℹ 431 more rows
myseries %>%
model(x11=X_13ARIMA_SEATS(Turnover ~ x11())) %>%
components() %>%
autoplot()
There is a positive trend in the data that has been pretty much consistent through the entirety of the time series. Additionally, the seasonal fluctuations in the data appear to be consistent and stable throughout the series, until about 1990, when the amplitude of the variations in the seasonal data appears to be increasing. There is an outlier in the data that occurs around 1991, in which there was a significant dip in the data, followed by a significant spike. These fluctuations seemed to be attributed to black-swan events - as they were not present at any other time in the series.
Yes, the recessions of 1991/1992 is visible in the components. As mentioned above, it appears to be captured in the remainder portion of the data.