library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9
global_economy
. Plot
the GDP per capita for each country over time. Which country has the
highest GDP per capita? How has this changed over time?global_economy_per_capita <- global_economy |>
mutate(gdp_per_capita = GDP/Population)
global_economy_per_capita |>
autoplot(gdp_per_capita, show.legend=FALSE) + # legend is way too large to include
geom_point(show.legend=FALSE) +
labs(title= "GDP per capita", y = "$US")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 3325 rows containing missing values or values outside the scale range
## (`geom_point()`).
The legend is too large to fit on the graph, but in 2000, we can easily see there’s a country with a strong lead, so let’s filter by that year and sort in descending order:
# In 2000:
global_economy_us_recent <- global_economy_per_capita |>
filter(Year == "2000") |>
arrange(desc(gdp_per_capita))
global_economy_us_recent
So the country with the highest GDP per capita is Monaco. Monaco’s GDP per capita has overall increased over time, with some variance. Monaco has 2 larger troughs at 1984 and 2000, with the highest peak in 2014. There has been a few years where another country has taken the lead, but for the most part Monaco has always had the highest. In 1976, United Arab Emirates had the highest but 1978 lost the lead.
# In 1976:
global_economy_us_recent <- global_economy_per_capita |>
filter(Year == "1976") |>
arrange(desc(gdp_per_capita))
global_economy_us_recent
In more recent years, Liechtenstein has gotten closer to Monaco, actually surpassing in 2013 and 2015.
global_economy
.Original data:
global_economy_us <- global_economy |>
filter(Country == "United States")
global_economy_us |>
autoplot(GDP) +
geom_point() +
labs(title= "GDP", y = "$US")
Let’s make a population adjustment to make it per-capita:
global_economy_us_per_capita <- global_economy |>
filter(Country == "United States") |>
mutate(gdp_per_capita = GDP/Population) |>
select(gdp_per_capita)
global_economy_us_per_capita |>
autoplot(gdp_per_capita) +
geom_point() +
labs(title= "GDP per capita", y = "$US")
aus_livestock
.Original data:
# original data
aus_livestock_victorian_bulls_bullocks_steers <- aus_livestock |>
filter(State == "Victoria", Animal == "Bulls, bullocks and steers") |>
select(Count)
aus_livestock_victorian_bulls_bullocks_steers |>
autoplot(Count)
There isn’t a lot of variance as time increases, but let’s see what the box cox transformation does:
# guerrero box cox
lambda_aus_livestock <- aus_livestock_victorian_bulls_bullocks_steers |>
features(Count, features = guerrero) |>
pull(lambda_guerrero)
lambda_aus_livestock
## [1] -0.04461887
aus_livestock_victorian_bulls_bullocks_steers |>
autoplot(box_cox(Count, lambda_aus_livestock))
This did not do much. The only other possible adjustment I would try
doing is converting monthly data to daily average data which would
probably smooth the graph a bit more. Some months have more days than
others, so those counts will be higher. I tried to use
monthdays
to compute the daily average, but wasn’t able to
get it to work for the tsibble.
vic_elec
.Original data:
vic_elec |>
autoplot(Demand)
It is very hard to read this graph due to all the data, but for the most part, the variance stays pretty consistent with a few larger peaks during the end/start of each year.
Here is a box-cox transformation in case it makes the graph any more even:
lambda_elec <- vic_elec |>
features(Demand, features = guerrero) |>
pull(lambda_guerrero)
lambda_elec
## [1] 0.09993089
vic_elec |>
autoplot(box_cox(Demand, lambda_elec))
Box-cox transformation expectedly didn’t do much.
Let’s try updating the time interval of the data to make the graph a bit easier to read. Below the time interval has been changed to daily:
vic_elec_daily <- vic_elec |>
index_by(Day = as_date(Time)) |>
summarise(
Demand = sum(Demand)
)
vic_elec_daily |>
autoplot(Demand)
For the most part the variance stays pretty consistent in this graph as well.
Let’s see if the box-cox transformation will help at all:
lambda_elec_daily <- vic_elec_daily |>
features(Demand, features = guerrero) |>
pull(lambda_guerrero)
lambda_elec_daily
## [1] -0.8999268
vic_elec_daily |>
autoplot(box_cox(Demand, lambda_elec_daily))
The box-cox transformation didn’t do much for this either, so no transformation is needed.
aus_production
.Original data:
aus_production |>
autoplot(Gas)
There is a lot of variation in this graph - after 1970, the size of the peaks and troughs grow increasingly larger.
Let’s try a box-cox transformation to even out the graph:
lambda_gas <- aus_production |>
features(Gas, features = guerrero) |>
pull(lambda_guerrero)
lambda_gas
## [1] 0.1095171
aus_production |>
autoplot(box_cox(Gas, lambda_gas))
The variation is a lot more even now, making the graph easier to read. Using this lambda makes the size of the seasonal variation the same across the whole series.
canadian_gas
data?# original data:
canadian_gas |>
autoplot(Volume)
# box cox transformation
canadian_gas_lambda_guerrero <- canadian_gas |>
features(Volume, features = guerrero) |>
pull(lambda_guerrero)
canadian_gas_lambda_guerrero
## [1] 0.5767648
canadian_gas |>
autoplot(box_cox(Volume, canadian_gas_lambda_guerrero))
The box-cox produces pretty much an identical graph to the original making it not very useful. This probably means that the data is already normally distributed.
Original data:
set.seed(100)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |>
autoplot(Turnover)
To make it more even, I would do the following:
lambda <- myseries |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
lambda
## [1] 0.1571524
myseries |>
autoplot(box_cox(Turnover, lambda))
aus_production
, Economy class passengers between Melbourne
and Sydney from ansett
, and Pedestrian counts at Southern
Cross Station from pedestrian
.Tobacco from aus_production
:
Original data:
aus_production |>
autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
To improve:
lambda_tobacco <- aus_production |>
features(Tobacco, features = guerrero) |>
pull(lambda_guerrero)
lambda_tobacco
## [1] 0.9264636
aus_production |>
autoplot(box_cox(Tobacco, lambda_tobacco))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
Lambda is close to one so no transformation.
Economy class passengers between Melbourne and Sydney from
ansett
:
Original data:
ansett_econ_btw_melb_syd <- ansett |>
select(Passengers) |>
filter(Class == "Economy", Airports == "MEL-SYD")
ansett_econ_btw_melb_syd |>
autoplot(Passengers)
To improve:
lambda_passengers <- ansett_econ_btw_melb_syd |>
features(Passengers, features = guerrero) |>
pull(lambda_guerrero)
lambda_passengers
## [1] 1.999927
ansett_econ_btw_melb_syd |>
autoplot(box_cox(Passengers, lambda_passengers))
This made the troughs around 1987 and 1992 look larger, but the overall the variance looks slightly more stable now.
Pedestrian counts at Southern Cross Station from
pedestrian
:
Original data:
pedestrian_southern_cross_station <- pedestrian |>
filter(Sensor == "Southern Cross Station")
pedestrian_southern_cross_station |>
autoplot(Count)
To improve:
lambda_pedestrians <- pedestrian_southern_cross_station |>
features(Count, features = guerrero) |>
pull(lambda_guerrero)
lambda_pedestrians
## [1] -0.2501616
pedestrian_southern_cross_station |>
autoplot(box_cox(Count, lambda_pedestrians))
This graph does look more uniform. Out of curiosity, let’s try daily time interval:
pedestrian_southern_cross_station_daily <- pedestrian_southern_cross_station |>
group_by_key() |>
index_by(Date) |> # group by Date and use it as new index
summarise(Count = sum(Count))
pedestrian_southern_cross_station_daily
pedestrian_southern_cross_station_daily |>
autoplot(Count)
lambda_pedestrians_daily <- pedestrian_southern_cross_station_daily |>
features(Count, features = guerrero) |>
pull(lambda_guerrero)
lambda_pedestrians_daily
## [1] 0.2726316
pedestrian_southern_cross_station_daily |>
autoplot(box_cox(Count, lambda_pedestrians_daily))
This didn’t do all that much, but the variance does look more stable than the original graph.
aus_production
.gas <- tail(aus_production, 5*4) |> select(Gas)
gas |>
autoplot(Gas) +
geom_point()
There is a seasonal pattern here where there is a trough at Q1, it then increases for Q2, reaches a peak at Q3 and drops at Q4 with a value always higher than Q1 but lower than Q2 and Q3. The overall trend is increasing.
This makes sense since Q3 is winter time, and more people will need to travel by car due to weather.
gas |>
model(
classical_decomposition(Gas, type = "multiplicative")
) |>
components() |>
autoplot()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
Yes, the seasonal indices show that there is trough for Q1 and a peak during the winter for Q3.
dcmp <- gas |>
model(stl = STL(Gas))
components(dcmp)
gas |>
autoplot(Gas, color = "gray") +
autolayer(components(dcmp), season_adjust, color = "#0072B2")
gas_tibble <- gas |>
as_tibble()
gas_tibble[1,1] = gas_tibble[1,1] + 300
gas_updated <- gas_tibble |>
as_tsibble(index = Quarter)
dcmp_updated <- gas_updated |>
model(stl = STL(Gas))
components(dcmp_updated)
gas_updated |>
autoplot(Gas, color = "gray") +
autolayer(components(dcmp_updated), season_adjust, color = "#0072B2")
This graph is way less smooth than the original one. It shows the major drop from the first observation to the second, and shows more variation. It doesn’t seem like this graph gives us a lot of insight compared to the original one.
Middle outlier:
# Choose middle outlier this time
gas_tibble_middle_outlier <- gas |>
as_tibble()
gas_tibble_middle_outlier[10,1] = gas_tibble_middle_outlier[10,1] + 300
gas_updated_middle_outlier <- gas_tibble_middle_outlier |>
as_tsibble(index = Quarter)
dcmp_updated_middle_outlier <- gas_updated_middle_outlier |>
model(stl = STL(Gas))
components(dcmp_updated_middle_outlier)
gas_updated_middle_outlier |>
autoplot(Gas, color = "gray") +
autolayer(components(dcmp_updated_middle_outlier), season_adjust, color = "#0072B2")
End outlier:
# Choose end outlier this time
gas_tibble_end_outlier <- gas |>
as_tibble()
gas_tibble_end_outlier[20,1] = gas_tibble_end_outlier[10,1] + 300
gas_updated_end_outlier <- gas_tibble_end_outlier |>
as_tsibble(index = Quarter)
dcmp_updated_end_outlier <- gas_updated_end_outlier |>
model(stl = STL(Gas))
components(dcmp_updated_end_outlier)
gas_updated_end_outlier |>
autoplot(Gas, color = "gray") +
autolayer(components(dcmp_updated_end_outlier), season_adjust, color = "#0072B2")
The point in time where the outlier lives does not seem to make a big difference here. These graphs show a lot of variation so it would make sense to deal with the outliers before doing seasonal adjustment to get more accurate data.
Original data:
set.seed(100)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |>
autoplot(Turnover)
X-11 decomposition:
library(seasonal)
##
## Attaching package: 'seasonal'
## The following object is masked from 'package:tibble':
##
## view
x11_dcmp <- myseries |>
model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
components()
autoplot(x11_dcmp)
The irregular graph shows a peak happening around 1986, a lesser one around 1990 and then again in 2009.
The general trend is increasing. The seasonal data shows strong seasonality. The seasonal graph has the largest grey bar, showing that the variation in the seasonal data is the smallest compared to the data. The remainder has a lot more variance, which makes sense. It also shows the big variation in 1991. This graph has the second largest grey bar, meaning that the remainder variance is larger than the seasonal data variation.
Yes you can see in the remainder graph that something happened in 1991 causing a major drop in the labor force. In figure 3.20, you can see this occurred specifically in March as the trend drops in 1991.