Load libraries
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.4 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## Warning: package 'lubridate' was built under R version 4.3.3
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
Inspect the global_economy dataset
global_economy
## # A tsibble: 15,150 x 9 [1Y]
## # Key: Country [263]
## Country Code Year GDP Growth CPI Imports Exports Population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan AFG 1960 537777811. NA NA 7.02 4.13 8996351
## 2 Afghanistan AFG 1961 548888896. NA NA 8.10 4.45 9166764
## 3 Afghanistan AFG 1962 546666678. NA NA 9.35 4.88 9345868
## 4 Afghanistan AFG 1963 751111191. NA NA 16.9 9.17 9533954
## 5 Afghanistan AFG 1964 800000044. NA NA 18.1 8.89 9731361
## 6 Afghanistan AFG 1965 1006666638. NA NA 21.4 11.3 9938414
## 7 Afghanistan AFG 1966 1399999967. NA NA 18.6 8.57 10152331
## 8 Afghanistan AFG 1967 1673333418. NA NA 14.2 6.77 10372630
## 9 Afghanistan AFG 1968 1373333367. NA NA 15.2 8.90 10604346
## 10 Afghanistan AFG 1969 1408888922. NA NA 15.0 10.1 10854428
## # ℹ 15,140 more rows
Create the GDP Per capita col
global_economy <- global_economy %>%
mutate(gdp_per_capita = GDP / Population)
plot the gdp per capita
global_economy %>%
autoplot(gdp_per_capita, show.legend= FALSE) +
labs(title = "GDP Per Capita by Country") +
xlab("Year")+
ylab("GDP Per Capita")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
Find the country with the highest GDP Per Capita
highest_avg_gdp_country <- global_economy %>%
group_by(Country) %>%
summarise(avg_gdp_per_capita = mean(gdp_per_capita)) %>%
slice_max(avg_gdp_per_capita, n = 1)
highest_avg_gdp_country
## # A tsibble: 1 x 3 [1Y]
## # Key: Country [1]
## Country Year avg_gdp_per_capita
## <fct> <dbl> <dbl>
## 1 Monaco 2014 185153.
The country with the higest average GDP per capita is Monaco.
So now lets plot Monacos GDP over time.
global_economy %>%
filter(Country == "Monaco") %>%
autoplot(gdp_per_capita) +
labs(title = "Monaco's GDP Per Capita") +
xlab("Year") +
ylab("GDP Per Capita")
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_line()`).
From the plot above we can see that Monaco’s GDP has risen consistently
over the years since the data began in 1970. We can see a steady rise
until the 1980s where there was a slight dip but then in 1985 there was
a strong recovery. Then we see another steady rise until about the mid
90s where there was a small dip until the early 2000s where we then see
a huge recovery and spike until the 2008 recession. Then finally we see
one last recovery. So overall we see Monacos GDP rising over time.
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
United States GDP from global_economy
US_gdp <- global_economy %>%
filter(Country == "United States")
US_gdp %>%
autoplot(gdp_per_capita) +
labs(title = "US GDP Per Capita Over Time") +
xlab("Year")+
ylab("GDP Per Capita")
The GDP is rising fast over the years with what looks like an
exponential curve so I will try a log and a box-cox transformation on it
to try and flatten the curve a bit. US GDP per capita with log
transformation
US_gdp %>%
autoplot(log(gdp_per_capita)) +
labs(title = "US GDP Per Capita Over Time (Log transformation)") +
xlab("Year")+
ylab("GDP Per Capita")
Try a box cox
US_gdp %>%
features(gdp_per_capita, features=guerrero)
## # A tibble: 1 × 2
## Country lambda_guerrero
## <fct> <dbl>
## 1 United States 0.393
US_gdp %>%
autoplot(box_cox(gdp_per_capita, 0.3929072)) +
labs(title = "US GDP Per Capita Over Time (Box-Cox transformation)") +
xlab("Year")+
ylab("GDP Per Capita")
We can see that both the log and box-cox transformations help to flatten
out the plot to change the plot from the original exponential growth to
now a much more linear growth. This shows that the US GDP per capita has
been growing pretty steadily over the years and this should help to
stabilize any decomposition or forecast model produced from this
data.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock
vic_data <- aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers", State == "Victoria")
vic_data %>%
autoplot(Count) +
labs("Slaughter of Victorian Bulls, bullocks and steers") +
xlab("Number of Slaughters") +
ylab("Year Month")
Since the variance seems to be all over the place with a lot of
flucuations lets try a box-cox transformation
vic_data %>%
features(Count, features = guerrero)
## # A tibble: 1 × 3
## Animal State lambda_guerrero
## <fct> <fct> <dbl>
## 1 Bulls, bullocks and steers Victoria -0.0446
vic_data %>%
autoplot(box_cox(Count, -0.04461887 )) +
labs("Box-Cox Transformed Slaughter of Victorian Bulls, bullocks and steers") +
xlab("Number of Slaughters") +
ylab("Year Month")
After the box-cox transformation we can see while there are still a lot
of flucuations the variance is much more stable than in the original
plot.
Victorian Electricity Demand from vic_elec
vic_elec %>%
autoplot(Demand) +
labs(title ="Half-hourly electricity demand for Victoria, Australia" , x="Time (30 Minute Intervals)", y="Total Electricity Demand in MWh.")
Apply a box-cox transformation
vic_elec %>%
features(Demand, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.0999
vic_elec %>%
autoplot(box_cox(Demand, 0.09993089)) +
labs(title ="Box-Cox Transformed Half-hourly electricity demand for Victoria, Australia" , x="Time (30 Minute Intervals)", y="Total Electricity Demand in MWh.")
The box cox transformation does not to seemed to have much of an effect
so a transformation does not seem appropriate.
Gas production from aus_production
aus_production %>%
autoplot(Gas) +
labs(title = "Quartly Austrialian Gas Production", x = "Time Quarter", y="Gas Production in Petajoules")
Because we can see that the variance is not stable and we have more
variance at the end than at the beginning I will apply a box-cox
transformation to try and stablize the variance.
aus_production %>%
features(Gas, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.110
aus_production %>%
autoplot(box_cox(Gas, 0.1095171)) +
labs(title = "Box-Cox Transformed Quartly Austrialian Gas Production", x = "Time Quarter", y="Gas Production in Petajoules")
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
canadian_gas %>%
autoplot(Volume) +
labs(title="Monthly Canadian Gas Production", x = "Month Year", y="Gas Production (Billions of cubic metres)")
Box-cox transformation
canadian_gas %>%
features(Volume, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.577
canadian_gas %>%
autoplot(box_cox(Volume, 0.5767648 )) +
labs(title="Box-Cox Transformed Monthly Canadian Gas Production", x = "Month Year", y="Gas Production (Billions of cubic metres)")
A Box-Cox transformation is unhelpful for this data because as we can
see from the first plot the variance is already stable across the plot.
When I did the Box-Cox transformation just to see the results we can see
that the plot is very similar to the original as it does not remove the
trend so it is really just unhelpful in this case.
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
set.seed(777)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
features(Turnover, features = guerrero)
## # A tibble: 1 × 3
## State Industry lambda_guerrero
## <chr> <chr> <dbl>
## 1 South Australia Cafes, restaurants and takeaway food services 0.371
Optimal lambda value is 0.3713155
myseries %>%
autoplot(box_cox(Turnover, 0.3713155)) +
labs(title = "Box-Cox Transformed Australian Retail Turnover" , x="Month Year", y="Turnover")
I used the features function with parameters guerrero to find the optimal lambda value for the box-cox transformation and then I used that lambda in the box-cox transformation to produce the plot above. We can see it stabilizes the variance across the time series.
###Question 3.5 For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian
Tobacco from aus_production
aus_production %>%
autoplot(Tobacco) +
labs(title = "Quarterly Australian Tobacco Production", x="Year Quarter", y="Tobacco and Cigarette Production in Tonnes")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
Find the optimal lambda for box-cox transformation
aus_production %>%
features(Tobacco, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.926
Optimal lambda value is 0.9264636 so now perform the box-cox to see if it stabilizes the variance.
aus_production %>%
autoplot(box_cox(Tobacco, 0.9264636)) +
labs(title = "Box-Cox Transformed Quarterly Australian Tobacco Production" , x="Quarter Year", y="Tobacco and Cigarette Production in Tonnes")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
Since the lambda value is close to 1 the transformation does not do too much because the variance is already pretty stable.
Economy class passengers between Melbourne and Sydney from ansett
econ_data <- ansett %>%
filter(Class == "Economy",Airports == "MEL-SYD")
econ_data
## # A tsibble: 282 x 4 [1W]
## # Key: Airports, Class [1]
## Week Airports Class Passengers
## <week> <chr> <chr> <dbl>
## 1 1987 W26 MEL-SYD Economy 20167
## 2 1987 W27 MEL-SYD Economy 20161
## 3 1987 W28 MEL-SYD Economy 19993
## 4 1987 W29 MEL-SYD Economy 20986
## 5 1987 W30 MEL-SYD Economy 20497
## 6 1987 W31 MEL-SYD Economy 20770
## 7 1987 W32 MEL-SYD Economy 21111
## 8 1987 W33 MEL-SYD Economy 20675
## 9 1987 W34 MEL-SYD Economy 22092
## 10 1987 W35 MEL-SYD Economy 20772
## # ℹ 272 more rows
econ_data %>%
autoplot(Passengers) +
labs(title="Economy Passenger numbers on Ansett airline flights", x="Year Week", y="Number of Economy Passengers")
econ_data %>%
features(Passengers, features = guerrero)
## # A tibble: 1 × 3
## Airports Class lambda_guerrero
## <chr> <chr> <dbl>
## 1 MEL-SYD Economy 2.00
econ_data %>%
autoplot(box_cox(Passengers, 1.999927)) +
labs(title="Box-Cox Transformed Economy Passenger numbers on Ansett airline flights", x="Year Week", y="Number of Economy Passengers")
With a lambda of 1.97 it is performing a transformation raising it to
1.97 and as you can see above it is able to somewhat stabilize the
variance.
Pedestrian counts at Southern Cross Station from pedestrian
scs_data <- pedestrian %>%
filter(Sensor == "Southern Cross Station")
scs_data
## # A tsibble: 17,539 x 5 [1h] <Australia/Melbourne>
## # Key: Sensor [1]
## Sensor Date_Time Date Time Count
## <chr> <dttm> <date> <int> <int>
## 1 Southern Cross Station 2015-01-01 00:00:00 2015-01-01 0 746
## 2 Southern Cross Station 2015-01-01 01:00:00 2015-01-01 1 312
## 3 Southern Cross Station 2015-01-01 02:00:00 2015-01-01 2 180
## 4 Southern Cross Station 2015-01-01 03:00:00 2015-01-01 3 133
## 5 Southern Cross Station 2015-01-01 04:00:00 2015-01-01 4 44
## 6 Southern Cross Station 2015-01-01 05:00:00 2015-01-01 5 16
## 7 Southern Cross Station 2015-01-01 06:00:00 2015-01-01 6 13
## 8 Southern Cross Station 2015-01-01 07:00:00 2015-01-01 7 21
## 9 Southern Cross Station 2015-01-01 08:00:00 2015-01-01 8 39
## 10 Southern Cross Station 2015-01-01 09:00:00 2015-01-01 9 36
## # ℹ 17,529 more rows
scs_data %>%
autoplot(Count) +
labs(title="Hourly Pedestrian Counts in Southern Cross Station", x="Date Time", y="Pedestrian Counts")
The variance seems to be pretty stable already with just slightly higher
towards the end of the time series.
scs_data %>%
features(Count, features = guerrero)
## # A tibble: 1 × 2
## Sensor lambda_guerrero
## <chr> <dbl>
## 1 Southern Cross Station -0.250
scs_data %>%
autoplot(box_cox(Count, -0.2501616 )) +
labs(title="Box-Cox Transformed Hourly Pedestrian Counts in Southern Cross Station", x="Date Time", y="Pedestrian Counts")
With a lambda value of -0.25 it does seem to have had an effect on the
data and it does seem to have stabilized the variance across the time
series.
Consider the last five years of the Gas data from aus_production
####Part A Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
gas <- tail(aus_production, 5*4) |> select(Gas)
gas %>%
autoplot(Gas) +
labs(title = "Quarterly Australian Gas Production", x="Year Quarter", y="Gas production in Petajoules")
The plot above shows that Australian gas production does follow a
seasonal pattern with consistent dips in Q1 then rises until the end of
Q2 where it then falls until the start of the next Q1 for the next year.
There also seems to be a upward trend in production over the last 5
years.
Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
autoplot() +
labs(title = "Classical multiplicative decomposition of Australian Gas Production")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
#### Part C Do the results support the graphical interpretation from
part a? Yes the results do support the graphical interpretation from
part a. In the first plot we saw a consistent seasonal pattern that the
seasonal component in the decomposition above confirms. We also saw a
slight upward trend that is also supported by the trend component
above.
Compute and plot the seasonally adjusted data.
gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() |>
autoplot(Gas, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(title = "Seasonally Adjusted Australian Gas Production", x="Year Quarter", y="Gas production in Petajoules")
Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Add in the outlier at 2006 Q2 by adding 300 to it
gas_out <- gas %>%
mutate(Gas = replace(Gas, row_number() == 4, Gas + 300))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Gas = replace(Gas, row_number() == 4, Gas + 300)`.
## Caused by warning in `x[list] <- values`:
## ! number of items to replace is not a multiple of replacement length
gas_out %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() |>
autoplot(Gas, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(title = "Seasonally Adjusted Australian Gas Production with outlier added", x="Year Quarter", y="Gas production in Petajoules")
The outlier has a huge effect on the seasonal decomposition. From the plot above we can see it completely alters the decomposition adding in a huge spike where the outlier is and also changing the rest of the graph as well with much more peaks and valleys that were not present in the outlier free plot.
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
gas_out_end <- gas %>%
mutate(Gas = replace(Gas, row_number() == 16, Gas + 300))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Gas = replace(Gas, row_number() == 16, Gas + 300)`.
## Caused by warning in `x[list] <- values`:
## ! number of items to replace is not a multiple of replacement length
gas_out_middle <- gas %>%
mutate(Gas = replace(Gas, row_number() == 8, Gas + 300))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Gas = replace(Gas, row_number() == 8, Gas + 300)`.
## Caused by warning in `x[list] <- values`:
## ! number of items to replace is not a multiple of replacement length
gas_out_end %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() |>
autoplot(Gas, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(title = "Seasonally Adjusted Australian Gas Production with outlier added", x="Year Quarter", y="Gas production in Petajoules")
gas_out_middle %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() |>
autoplot(Gas, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(title = "Seasonally Adjusted Australian Gas Production with outlier added", x="Year Quarter", y="Gas production in Petajoules")
From comparing the two graphs the first has the outlier at the end and the second has the outlier in the middle of the time series. We can see that it does matter where the outlier is because that is where the spike in the trend will be however it will not alter the other parts of the plot much but it will have a impact on where the spike is.
###Question 3.8 Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
myseries %>%
autoplot(Turnover)
x11_dcmp <- myseries %>%
model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
components()
x11_dcmp %>%
autoplot() +
labs(title ="Decomposition of Australian Retail Trade Turnover")
Looking at the original and then the decomposed plots we can see very
clearly that there is a strong seasonal pattern that the time series
follows. While a seasonal trend was noticed originally I did not realize
how strong it was until seeing the decomposition. We can also easily see
cyclic effects like the .com bubble in the early 2000s causing a spike
in the trend and the 2008 recession and 2019 covid pandemic causing dips
in the trend.
###Question 3.9 Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995.
####Part A Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
Looking at the trend from the STL decomposition we can see that there is a clear upward trend in the Australian labor workforce, there are some drop offs but the drop offs are following by rallies keeping the overall upward trend. Then the seasonal component shows a strong annual seasonal pattern. Then looking at the monthly seasonal subplot we can see that it shows dips in the workforce in January and August and spikes in December and March this could be due to seasonal hires beginning and ending.