DATA 624: Predictive Analytics

Evan McLaughlin

Homework Chapter 3

3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.3.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## 
## Attaching package: 'tsibble'

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(fable)

## Warning: package 'fable' was built under R version 4.3.3

## Loading required package: fabletools

## Warning: package 'fabletools' was built under R version 4.3.3

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(lattice)
library(scales)

## Warning: package 'scales' was built under R version 4.3.3

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.2

library(USgas)

## Warning: package 'USgas' was built under R version 4.3.3

library(seasonal)

## Warning: package 'seasonal' was built under R version 4.3.3

library(zoo)

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'

## The following object is masked from 'package:tsibble':
## 
##     index

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

if (!require('fpp3')) (install.packages('fpp3'))

## Loading required package: fpp3

## Warning: package 'fpp3' was built under R version 4.3.3

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──

## ✔ tibble      3.2.1     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3

## Warning: package 'tsibbledata' was built under R version 4.3.3

## Warning: package 'feasts' was built under R version 4.3.3

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()     masks base::date()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ zoo::index()          masks tsibble::index()
## ✖ tsibble::intersect()  masks base::intersect()
## ✖ lubridate::interval() masks tsibble::interval()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ tsibble::setdiff()    masks base::setdiff()
## ✖ tsibble::union()      masks base::union()
## ✖ tibble::view()        masks seasonal::view()

if (!require('latex2exp')) (install.packages('latex2exp'))

## Loading required package: latex2exp

## Warning: package 'latex2exp' was built under R version 4.3.3

if (!require('dplyr')) (install.packages('dplyr'))
if (!require('forecast')) (install.packages('forecast'))

## Loading required package: forecast

## Warning: package 'forecast' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

if (!require('seasonal')) (install.packages('seasonal'))

#3.1 Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

The plot below is pretty messy, even after we filter out those countries that fall below the mean. I’ve added labels. The highest per capita GDP values sit with Monaco and Liechtenstein, and that holds true over the timeline contained in the data.

mean_gdp_per_capita <- mean(global_economy$GDP_per_capita, na.rm = TRUE)
filtered_economy <- global_economy %>%
  filter(GDP_per_capita >= mean_gdp_per_capita)
filtered_economy <- filtered_economy %>%
  as_tsibble(index = Year, key = Country)
gdp_plot <- ggplot(filtered_economy, aes(x = Year, y = GDP_per_capita, color = Country)) +
  geom_line() +
  labs(title = "Filtered GDP Per Capita", 
       x = "Year", 
       y = "GDP Per Capita ($US)") +
  theme_minimal() +
  theme(legend.position = "none")

# the above plot doesn't work well because it lacks necessary labels. I've added below.

gdp_plot <- ggplot(filtered_economy, aes(x = Year, y = GDP_per_capita, color = Country)) +
  geom_line() +
  geom_text(aes(label = Country), check_overlap = TRUE, hjust = -0.1, vjust = -0.5, size = 3) +
  labs(title = "Filtered GDP Per Capita", 
       x = "Year", 
       y = "GDP Per Capita ($US)") +
  theme_minimal() +
  theme(legend.position = "none") 

print(gdp_plot)

highest_gdp_country <- global_economy %>%
  group_by(Country) %>%
  summarise(max_GDP_per_capita = max(GDP_per_capita, na.rm = TRUE)) %>%
  arrange(desc(max_GDP_per_capita)) %>%
  slice(1)

## Warning: There were 3325 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `max_GDP_per_capita = max(GDP_per_capita, na.rm = TRUE)`.
## ℹ In group 23: `Country = "Afghanistan"`, `Year = 1982`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3324 remaining warnings.

## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Country`, `Year` first.

highest_country_gdp_over_time <- global_economy %>%
  filter(Country == highest_gdp_country$Country) %>%
  ggplot(aes(x = Year, y = GDP_per_capita)) +
  geom_line(color = "orange") +
  labs(title = paste("GDP Per Capita Over Time for", highest_gdp_country$Country), 
       x = "Year", 
       y = "GDP Per Capita ($US)") +
  theme_minimal() +
  theme(legend.position = "none")

print(highest_country_gdp_over_time)

## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_line()`).

#3.2 For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

-United States GDP from global_economy. ## -Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock. -Victorian Electricity Demand from vic_elec. -Gas production from aus_production.

Removing facet_grid() from the code places all the plots on the same axes, which is somewhat problematic given the varying ranges of each of the variables.

#United States GDP from global_economy.

global_economy %>%
  filter(Country == "United States") %>%
  autoplot(GDP) +
  labs(title= "United States GDP over time", y = "$US")

The data shows some variation, so we should consider transformation here. We have transformed the quadratic trend to linear by employing a square root transformation.

#transformation of square root
global_economy %>%
  filter(Country == "United States") %>%
  autoplot(sqrt(GDP)) +
  labs(title= "United States GDP per Year", y = "$US")

# Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock

library(fpp3)
library(tsibble)
aus_livestock %>%
    filter(Animal == "Bulls, bullocks and steers") %>%
    group_by(Animal) %>%
    index_by(Month) %>%
    summarise(Count = sum(Count)) -> bbs

bbs %>% autoplot(Count) +
    labs(title= "Slaughter of Australian Bulls, bullocks and steers per Month", y = "Count")

By filtering for “Bulls, bullocks, and steers,” grouping by animal, and calculating the total number of animals slaughtered each month, the plot above was generated. While no clear upward or downward trend is evident, there does appear to be a cyclical pattern in the timing of when these animals are taken for slaughter. Since the data did not exhibit significant variation in the series levels, I determined that applying a transformation wouldn’t add value.

Victorian Electricity Demand:

vic_elec %>%
    index_by(Date) %>%
    summarise(Demand = sum(Demand)) -> ed

ed %>% autoplot(Demand) +
    labs(title= "Daily Electricity Demand -Victoria", y = "MW")

The data doesn’t show a noticeable trend, but there is evident seasonality. I aggregated the demand for each date and created the plot based on those sums. Since the series levels remained consistent, I decided that applying a transformation wouldn’t add much value. However, using decomposition could offer helpful insights into the seasonal patterns in this dataset.

Gas production from aus_production.

# I'm employing a box-cox transformation here
library(lattice)
aus_production %>%
    autoplot(Gas) +
    labs(title= "qtrly gas production - Australia", y = "petajoules")

lambda <- aus_production %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

aus_production %>%
  autoplot(box_cox(Gas, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "qtrly gas production (transformed) with $\\lambda$ = ",
         round(lambda,2))))

Since the data showed variation in the series levels, I determined that applying a transformation would be beneficial. The text included an example of a Box-Cox transformation, which I chose to apply here as well.

Using the Guerrero method, we selected the optimal lambda, which helps ensure that the seasonal variation remains consistent across the entire series. After applying the Box-Cox transformation to our Gas data using the optimal lambda, we can see that the seasonal variation is now stabilized.

#3.3 Why is a Box-Cox transformation unhelpful for the canadian_gas data?

canadian_gas %>%
    autoplot()

## Plot variable not specified, automatically selected `.vars = Volume`

summary(canadian_gas$Volume)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.966   6.453   8.831   9.777  14.429  19.528

Applying a Box-Cox transformation to the canadian_gas dataset produces results that are nearly identical to those generated by a simple autoplot().

This suggests that the Box-Cox transformation isn’t necessary for this data, as the ratio between the maximum and minimum values is relatively small, and the variation doesn’t increase with the series level. Not all datasets require transformation, and in some cases, it’s perfectly fine to retain the original scale.

#3.4 What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(432)

myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries %>%
    autoplot(Turnover) +
    labs(title= "Australian Retail Turnover", y = "%")

Given the small max(y) / min(y) ratio and the fact that the variation seems more random rather than following the levels, I don’t think a Box-Cox transformation would be effective for this dataset.

#3.5 For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

#tobacco from aus_production

lambda_ap <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

aus_production %>%
  autoplot(box_cox(Tobacco, lambda_ap)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Aus tobacco production: $\\lambda$ = ",
         round(lambda_ap,2))))

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

A lambda of .93 is semistable.

#Economy class passengers between Melbourne and Sydney from ansett:

ansett %>%
    filter(Class == 'Economy') %>%
    filter(Airports == 'MEL-SYD') -> economy_melsyd

lam_econ <- economy_melsyd %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

economy_melsyd %>%
  autoplot(box_cox(Passengers, lam_econ)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Mel-Syd Economy passengers: $\\lambda$ = ",
         round(lam_econ,2))))

Another semistable lambda

#Pedestrian counts at Southern Cross Station from pedestrian:
# let's start by filtering for station
pedestrian %>%
    filter(Sensor == 'Southern Cross Station') -> southcross

lambda_sc <- southcross %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

southcross %>%
  autoplot(box_cox(Count, lambda_sc)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "southern cross pedestrians: $\\lambda$ = ",
         round(lambda_sc,2))))

variance is stable, per lambda, but the frequency makes it almost impossible to read.

#3.7 Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas) Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle? Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices. Do the results support the graphical interpretation from part a? Compute and plot the seasonally adjusted data. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier? Does it make any difference if the outlier is near the end rather than in the middle of the time series?

#get only five years
gas <- tail(aus_production, 5*4) %>% select(Gas)
gas %>%
    autoplot(Gas)

We can see noticeable seasonal fluctations combined with gradual upward movement.

# let's next apply classical decomposition
gas %>% model(classical_decomposition(Gas, type = "mult")) %>%
  components() %>%
  autoplot() +
  labs(title = "decomposition Aus gas production")

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

These images support the previous observations of upward trending and seasonality.

#Compute and plot the seasonally adjusted data.

dec_gas <- gas %>%
  model(stl = STL(Gas))

components(dec_gas) %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "red") +
  labs(y = "Gas production",
       title = "Gas Production: AUS")

Adjusting for seasonality gives a good snapshot of the data’s trend by mitigating the effect of the seasonal fluctuations.

#outlier / decomp
gas_out <- gas
gas_out$Gas[3] <- gas_out$Gas[3] + 300
gas_decomp <- gas_out %>%
  model(stl = STL(Gas))

#recompute and display
components(gas_decomp) %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "red") +
  labs(y = "Gas production",
       title = "Gas Production: AUS")

The outlier has a disproportionate effect on the seasonally adjusted data, and essentially eliminates the trend.

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

#change one observation to be an outlier
gas_out2 <- gas
gas_out2$Gas[19] <- gas_out2$Gas[19] + 300

#recompute the seasonally adjusted data
gas_decomp2 <- gas_out2 %>%
  model(stl = STL(Gas))
components(gas_decomp2) %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "grey") +
  labs(y = "Gas production",
       title = "Gas production:AUS")

Observing the two, we see the outlier in the seasonally adjusted data in both.

#3.8

#Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(432)

series1 <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

x11 <- series1 %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()
autoplot(x11) +
  labs(title =
    "Decomposition:AUS Retail Turnover (X-11)")

There is a large irregular fluctuation between 1980 and 1985 but appears to be an outlier.

#3.9 Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. Is the recession of 1991/1992 visible in the estimated components?

We observe a general upward trend with some variation that we really can’t derive much from, based on the plot. We see that the figures range from less than 7000 to less than 9000 from before January of 1980 through January of 1995. We also observe a clear upward movement in “trend.” In the season-year component, there are lots of ups and downs, but a noticeable pattern appears and repeats seasonally. The remainder component shows a noticeable drop in employment numbers, which would seem to indicate a recession, which answers the last question.