624assignment2.knit

Assignment 2

Daniel DeBonis

Question 1

We first need to transform GDP to GDP per capita and group the data by country.

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.4     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

global <- global_economy
global_percap <- global |>
                   mutate(GDP_per_cap = GDP/Population)|>
                select(GDP_per_cap, Country, Year) |>
                 group_by(Country)

Now we can try to plot the data. With the number of countries plotted, the legend will cover up the graph, so we need to remove it.

autoplot(global_percap, GDP_per_cap) +
  theme(legend.position="none")

## `mutate_if()` ignored the following grouping variables:
## • Column `Country`

## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

Without the legend, it is very difficult to determine the country with the largest GDP per capita. Instead we can rearrange the data in the table to see the maximum.

global_percap |>
  arrange(desc(GDP_per_cap))

## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Country`, `Year` first.

## # A tsibble: 15,150 x 3 [1Y]
## # Key:       Country [263]
## # Groups:    Country [263]
##    GDP_per_cap Country        Year
##          <dbl> <fct>         <dbl>
##  1     185153. Monaco         2014
##  2     180640. Monaco         2008
##  3     179308. Liechtenstein  2014
##  4     173528. Liechtenstein  2013
##  5     172589. Monaco         2013
##  6     168011. Monaco         2016
##  7     167591. Liechtenstein  2015
##  8     167125. Monaco         2007
##  9     164993. Liechtenstein  2016
## 10     163369. Monaco         2015
## # ℹ 15,140 more rows

Depending on the year, the country with the highest GDP was often either Monaco or Liechtenstein. These countries are very small, with small populations. Looking at the line graph, and identifying the top two lines as Monaco and Liechtenstein, this appears to have been the case since at least the 1990s.

Question 2

The first data we are looking at is the GDP of the United States. This has already been imported as part of the previous question.

usgdp = global |>
  filter(Country == "United States")|>
  select(Country, Year, GDP, Population)
autoplot(usgdp)

## Plot variable not specified, automatically selected `.vars = GDP`

It would be appropriate to transform GDP into GDP per capita to control for population size.

usgdppercap <- usgdp |>
  mutate(gdp_per_cap = GDP/Population)
ggplot(data=usgdppercap, aes(x=Year, y=gdp_per_cap))+
  geom_line()

Our second set of data comes from Australian livestock data. We’re asked to plot the slaughter of “bulls, bullocks, and steers” in Victoria.

vicbulls <- aus_livestock |>
  filter(Animal=="Bulls, bullocks and steers" & State=="Victoria")
autoplot(vicbulls)

## Plot variable not specified, automatically selected `.vars = Count`

There is a lot of variation that could be addressed by a Box-Cox transformation. We can first find the best lambda with the guerrero function in R.

lambda <- vicbulls |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)
vicbulls |>
  autoplot(box_cox(Count, lambda))

The value for lambda given by the function is -.0446. We can see a reduction in noise in the transformed plot.

The third set of data comes from the electricity demand data. We are asked to plot the demand in the state of Victoria. There is no need to filter the data yet since it only concerns Victoria.

autoplot(vic_elec, Demand)

An appropriate transformation for this data would be to aggregate the data collected every 30 minutes to a higher order, such as month.

vic_elec |>
  index_by(month = yearmonth(Time)) |>
  summarize(Demand = sum(Demand)) |>
  autoplot(Demand)

Now greater trends can come through the large amount of noise in the original graph.

The fourth data set is gas production in Australia.

ggplot(data = aus_production, aes(x=Quarter, y=Gas))+
     geom_line()

With the amount of variance that we see, a Box-Cox transformation would be worth applying. Again, we can use the Guerrero function to find the optimal lambda.

lambda2 <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Gas, lambda2))

With standard variance, we can get a better sense of the trend and seasonality. A lambda of .1095 was calculated for the Box-Cox transformation.

Question 3

A Box-Cox transformation would not be useful for the Canadian Gas data because the variance does not appear to grow with the trend in this data set. If we glance at the autoplot of the data,

autoplot(canadian_gas)

## Plot variable not specified, automatically selected `.vars = Volume`

it is clear that the variance is greatest from the late 70s through the 80s, but then the variance decreases as time continues, despite an overall increasing trend.

Question 4

First, let me randomly select a seriesID to specify what subset of the data I will be exploring.

set.seed(464319)
 myseries <- aus_retail |>
     filter(`Series ID` == sample(aus_retail$`Series ID`,1))
View(myseries)

I was randomly assigned footwear data from South Australia. Let’s now look at this data with autoplot

autoplot(myseries)

## Plot variable not specified, automatically selected `.vars = Turnover`

A Box-Cox transformation would be helpful to implement with this data since the variance is increasing as the turnover increases.

lambda3 <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)
myseries |>
  autoplot(box_cox(Turnover, lambda3))

Now the variance is much more regular, allowing the seasonality of the data to be more visible. The lambda calculated this time was -.1878.

Question 5

We can use the Guerrero function to find the optimal lambda for each Box-Cox transformation.

ggplot(data = aus_production, aes(x=Quarter, y=Tobacco))+
  geom_line()

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

The changes in variance make a Box-Cox transformation appropriate for this data set.

lambda4 <- aus_production |>
    features(Tobacco, features = guerrero) |>
    pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Tobacco, lambda4))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

For this transformation, lambda was calculated to be .9265.

The second data set is all economy passengers from Melbourne to Sydney, so we first need to filter the full airplane data to the relevant rows.

melsyd <- ansett |>
     filter(Airports == "MEL-SYD" & Class == "Economy")
autoplot(melsyd)

## Plot variable not specified, automatically selected `.vars = Passengers`

The same process of applying a Box-Cox transformation applies to this data set.

lambda5 <- melsyd |>
  features(Passengers, features = guerrero) |>
    pull(lambda_guerrero)
melsyd |>
    autoplot(box_cox(Passengers, lambda5))

The lambda used in this transformation is 1.9999.

Our third set of data involves pedestrians at Southern Cross station. Once again, our data should be filtered before analyzing since we are only interested in one crossing.

scs <- pedestrian |>
     filter(Sensor == "Southern Cross Station")
ggplot(data=scs, aes(x=Date_Time, y=Count))+
     geom_line()

Now to apply the Box-Cox transformation and find the optimal lambda,

lambda6 <- scs |>
  features(Count, features=guerrero) |>
  pull(lambda_guerrero)
scs |>
  autoplot(box_cox(Count, lambda6))

For this transformation, the lambda was -.2502.

Question 7

Let’s import the data with the code given in the exercise itself

gas <- tail(aus_production, 5*4) |> 
  select(Gas)

Plot the time series

autoplot(gas)

## Plot variable not specified, automatically selected `.vars = Gas`

We see clear seasonal fluctuations, with gas use peaking in Q3 and hitting lows in Q1 every year.

Multiplicative Classical Decomposition

gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) |>
components() |>
  autoplot()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Support Though the method is imperfect, the multiplicative decomposition allows us to see the regularity of the seasonal cycle, the general rising trend, as well as some interesting points of particularly high or low gas usage (Q2 2006 and Q2 2008, respectively).This supports the findings of seasonal regularity found in the first graph.
Compute and plot seasonally adjusted data

gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) |>
components() |>
  ggplot(aes(x=Quarter, y=season_adjust)) +
    geom_line()

With the data adjusted for seasonality, the greater upwards trend is more apparent, though moments that go against the trend are still visible.

Create an outlier With 20 observations, I used a Random Number Generator to choose an observation (I drew 5).

gas$Gas[5] <- gas$Gas[5] +300
print(gas$Gas[5])

## [1] 533

Now the recomputed seasonally adjusted data

 gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
   ) |>
 components() |>
   ggplot(aes(x=Quarter, y=season_adjust)) +
     geom_line()

We have a huge spike in our outlier, but also less of the seasonality has been accounted for in this model since the outlier happened earlier in the cycle, throwing off the predictions generated.

The fact that the outlier occurred earlier in the cycle does not have a greater effect on the rest of the model. If it occurred later in the cycle, the same presence of unaccounted for seasonality would be present earlier in the cycle.

Question 8

myseries is already loaded into R from a previous question. It involved footwear from South Australia. Now let’s see the X11 decomposition

library(seasonal)

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

myseries |>
    model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components() |>
  autoplot()

We do see the seasonality that was illuminated after the Box-Cox transformation. The rising trend is much clearer here. There is a surprising outlier in the early 2000s, it was not visible at all in the earlier plots.

Question 9

The STL Decomposition makes three things clear. The first thing is that overall there is a positive trend in the size of the labor force. The rate was not always consistent; it is particularly lower in the early 1990s. This is further illustrated by the second thing clear, which is the relative dip in that time period as evidenced by the irregular portion of the decomposition. The third thing is that even with that anomaly being considered, the seasonality remained persistent throughout the time period observed. The monthwise decomposition is helpful in reinforcing the seasonality effect, with most months showing small amounts of variance across years, though March and July have seen greater variance. As far as the early 90’s recession goes, it is best reflected in the remainder section of the decomposition, since the recession would reasonably result in values lower than those explained by seasonality or the greater trend, resulting in the crater we see in that time period. What is a notable, but small change in the overall value graph has that clear explanation. Perhaps this would also be more visible on the monthwise graphs if we were able to expand their x-axes.