HW2 Data 624 Predictive Analytics 2024 Spring Term

Question 3.1
Question 3.2
Question 3.3
Question 3.4
Question 3.5
Question 3.7
Question 3.8
Question 3.9

Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book (https://otexts.com/fpp3/).

library(fpp3)
library(latex2exp)
library(seasonal)

Question 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Compute GDP per capita

global_economy_add_GDP_per_capita <- global_economy %>%
  mutate(GDP_per_capita = GDP/Population)

Plot the GDP per capita for each country over time.

autoplot(global_economy_add_GDP_per_capita, GDP_per_capita) + 
  theme(legend.position = "none")

Country with the highest GDP per capita

head(global_economy_add_GDP_per_capita %>%
       select(Country, Year, GDP_per_capita)%>%
       arrange(desc(GDP_per_capita)),10)

## # A tsibble: 10 x 3 [1Y]
## # Key:       Country [2]
##    Country        Year GDP_per_capita
##    <fct>         <dbl>          <dbl>
##  1 Monaco         2014        185153.
##  2 Monaco         2008        180640.
##  3 Liechtenstein  2014        179308.
##  4 Liechtenstein  2013        173528.
##  5 Monaco         2013        172589.
##  6 Monaco         2016        168011.
##  7 Liechtenstein  2015        167591.
##  8 Monaco         2007        167125.
##  9 Liechtenstein  2016        164993.
## 10 Monaco         2015        163369.

Monaco recorded the highest GDP per capita in the data set, about 185,153 in 2014. In recent years the lead has alternated between Monaco and Liechtenstein: Liechtenstein was ahead in 2013 and 2015, while Monaco led in 2014 and again in 2016.
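
The year-by-year leader can be read off the top-10 output with a few lines of base R. This is only a sketch: the data frame `top10` is typed in by hand from the rounded values printed above, so it covers only the years that appear in that output. Any year that appears in the overall top ten must have its own maximum there too, so the leader found for each of those years is the true leader.

```r
# Values copied by hand from the top-10 output above (rounded to whole units).
top10 <- data.frame(
  Country = c("Monaco", "Monaco", "Liechtenstein", "Liechtenstein", "Monaco",
              "Monaco", "Liechtenstein", "Monaco", "Liechtenstein", "Monaco"),
  Year = c(2014, 2008, 2014, 2013, 2013, 2016, 2015, 2007, 2016, 2015),
  GDP_per_capita = c(185153, 180640, 179308, 173528, 172589,
                     168011, 167591, 167125, 164993, 163369)
)

# Sort by year, then by GDP per capita (largest first), and keep the first
# row of each year, which is that year's leader.
ordered <- top10[order(top10$Year, -top10$GDP_per_capita), ]
leaders <- ordered[!duplicated(ordered$Year), ]
print(leaders, row.names = FALSE)
```

On the full data, the same idea could be applied to global_economy_add_GDP_per_capita instead of the hand-typed table.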

Question 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

a. United States GDP from global_economy.

global_economy %>%
  filter(Country == "United States") %>%
  autoplot(GDP) +
  labs(title= "GDP")

GDP is driven partly by population growth, so it is adjusted to a per capita basis.

global_economy %>%
  filter(Country == "United States") %>%
  autoplot(GDP/Population) +
  labs(title= "GDP per capita")

b. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

aus_livestock %>%
  filter(State == "Victoria" &
           Animal == "Bulls, bullocks and steers") %>%
  autoplot(Count)

No transformation appears to be needed for this series.

c. Victorian Electricity Demand from vic_elec.

autoplot(vic_elec, Demand)

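The half-hourly data are too dense to read at this scale, so a calendar adjustment is applied: total demand is computed for each day and then averaged within each month, which also removes the effect of months having different numbers of days.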

average_demand_per_month <- vic_elec %>%
  group_by(Date)%>%
  mutate(Daily_demand = sum(Demand))%>%
  distinct(Date, Daily_demand)%>%
  mutate(Month = yearmonth(Date))%>%
  group_by(Month)%>%
  mutate(Average_Monthly_Demand = mean(Daily_demand, na.rm = TRUE))%>%
  distinct(Month, Average_Monthly_Demand)%>% 
  as_tsibble(index = Month)

autoplot(average_demand_per_month, Average_Monthly_Demand)

d. Gas production from aus_production.

autoplot(aus_production, Gas)

The size of the seasonal variation grows with the level of the series. A Box-Cox transformation is used to make the seasonal variation roughly constant across the series.

lambda <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Gas, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed gas production with $\\lambda$ = ",
         round(lambda,2))))
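
For reference, the transformation applied by box_cox() can be written out as follows. This is the modified Box-Cox form as given in the textbook, where $y_t$ is the original observation, $w_t$ the transformed value and $\lambda$ the parameter chosen above by the guerrero feature.

```latex
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0, \\[4pt]
\dfrac{\operatorname{sign}(y_t)\,|y_t|^{\lambda} - 1}{\lambda} & \text{otherwise.}
\end{cases}
```

With $\lambda = 1$ the series is only shifted down by one, so its shape is unchanged, while $\lambda = 0$ gives the natural logarithm.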

Question 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

autoplot(canadian_gas, Volume)

lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)
canadian_gas |>
  autoplot(box_cox(Volume, lambda))

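The Box-Cox transformation is not helpful for the canadian_gas data. A good value of lambda is one that makes the size of the seasonal variation about the same across the whole series, which only works when the variation changes steadily with the level of the series. Here the seasonal variation does not change in step with the level, so neither the Guerrero lambda nor any other lambda achieves that.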
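
The point can be illustrated with a small base R sketch. The series below is synthetic and is not canadian_gas: it is built under the assumption that the level rises every year while the seasonal swing is large only in the middle years. `box_cox_manual` and `swing_by_year` are hypothetical helpers written for this sketch.

```r
# Box-Cox transformation for positive data (hypothetical helper, base R only).
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Synthetic monthly series, NOT canadian_gas: ten years, the level rises each
# year, and the seasonal amplitude is large only in years 4 to 7.
years <- 10
level <- rep(seq(20, 100, length.out = years), each = 12)
amp   <- rep(c(2, 2, 2, 10, 10, 10, 10, 2, 2, 2), each = 12)
month <- rep(1:12, times = years)
y     <- level + amp * sin(2 * pi * month / 12)

# Size of the seasonal swing (max minus min) within each year.
swing_by_year <- function(x) {
  year_id <- rep(seq_len(years), each = 12)
  as.numeric(tapply(x, year_id, function(v) diff(range(v))))
}

# Ratio of the swing in year 5 to the swing in year 10 for several lambdas.
# A ratio of 1 would mean the transformation had equalised the two swings.
lambdas <- c(0, 0.25, 0.5, 1)
ratio_mid_to_end <- sapply(lambdas, function(l) {
  s <- swing_by_year(box_cox_manual(y, l))
  s[5] / s[10]
})
print(round(ratio_mid_to_end, 2))
```

For this synthetic series, none of the lambdas tried brings the ratio close to 1, because the swing shrinks while the level keeps rising.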

Question 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(100)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

autoplot(myseries,Turnover)

lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)
myseries |>
  autoplot(box_cox(Turnover, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed retail turnover with $\\lambda$ = ",
         round(lambda,2))))

Question 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

1. Tobacco from aus_production

autoplot(aus_production, Tobacco)

lambda <- aus_production |>
  features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Tobacco, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed tobacco production with $\\lambda$ = ",
         round(lambda,2))))

2. Economy class passengers between Melbourne and Sydney from ansett

ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy") %>%
  autoplot(Passengers)

lambda <- ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy")%>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy")%>%
  autoplot(box_cox(Passengers, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed passenger numbers with $\\lambda$ = ",
         round(lambda,2))))

3. Pedestrian counts at Southern Cross Station from pedestrian

pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  autoplot(Count)

lambda <- pedestrian %>%
  filter(Sensor == "Southern Cross Station")%>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

pedestrian %>%
  filter(Sensor == "Southern Cross Station")%>%
  autoplot(box_cox(Count, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed pedestrian counts with $\\lambda$ = ",
         round(lambda,2))))

Question 3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) %>%
  select(Gas)

a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

autoplot(gas, Gas)

The series shows an upward trend over the five years and a strong seasonal pattern that repeats every year (every four quarters).

b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical multiplicative decomposition of Gas")

c. Do the results support the graphical interpretation from part a?

The results support the graphical interpretation: the trend-cycle component increases over time, and the seasonal component repeats the same pattern every year.
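
The steps behind a classical multiplicative decomposition of quarterly data can be sketched in base R. The series here is synthetic, not the Gas data, and the cross-check uses stats::decompose, which implements the same classical method; that classical_decomposition follows the same steps is an assumption of this sketch.

```r
# Synthetic quarterly series (NOT the Gas data): linear trend times a fixed
# seasonal pattern, five years long, starting in quarter 1.
t_idx    <- 1:20
seas_pat <- rep(c(0.85, 1.05, 1.20, 0.90), times = 5)
y <- ts((200 + 2 * t_idx) * seas_pat, start = c(1, 1), frequency = 4)
quarter <- as.integer(cycle(y))

# Step 1: trend-cycle from a centred 2x4 moving average
# (no estimate is available for the first two and last two quarters).
trend <- stats::filter(y, c(0.5, 1, 1, 1, 0.5) / 4, sides = 2)

# Step 2: detrend by division (multiplicative model).
detrended <- as.numeric(y / trend)

# Step 3: average the detrended values quarter by quarter, then rescale the
# four indices so that they average to 1.
indices <- as.numeric(tapply(detrended, quarter, mean, na.rm = TRUE))
indices <- indices / mean(indices)

# Step 4: seasonally adjusted series = data divided by the seasonal index.
season_adjust <- y / indices[quarter]

# Cross-check against base R's implementation of the same method.
print(all.equal(indices, decompose(y, type = "multiplicative")$figure))
```

Because each seasonal index is an average over only a handful of years, a single unusual observation can move it noticeably.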

d. Compute and plot the seasonally adjusted data.

dcmp <- gas %>%
  model(classical_decomposition(Gas, type = "multiplicative"))

components(dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Gas production (petajoules)",
       title = "Seasonally adjusted gas production")

e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas_random <- tail(aus_production, 5*4) %>%
  select(Gas)

set.seed(200)  # Set the seed before sampling so the chosen row is reproducible
random_row_index <- sample(nrow(gas_random), 1)  # Randomly select one row index
gas_random[random_row_index, "Gas"] <- 
  gas_random[random_row_index, "Gas"] + 300

dcmp <- gas_random %>%
  model(classical_decomposition(Gas, type = "multiplicative"))

components(dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Gas production (petajoules)",
       title = "Seasonally adjusted gas production with one outlier")

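The outlier creates a large spike in the seasonally adjusted series. It can also inflate the seasonal index for the affected quarter and pull up the trend-cycle estimate around it, which makes the seasonality and trend-cycle harder to determine.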

f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas_last <- tail(aus_production, 5*4) %>%
  select(Gas)

last_index <- nrow(gas_last)  # Get the index of the last row
gas_last[last_index, "Gas"] <- gas_last[last_index, "Gas"] + 300

dcmp <- gas_last %>%
  model(classical_decomposition(Gas, type = "multiplicative"))

components(dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Gas production (petajoules)",
       title = "Seasonally adjusted gas production with an outlier at the end")

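In both cases the outlier shows up as a spike in the seasonally adjusted series, so the plot is hard to interpret either way. The position can still matter for the estimated components. The centred moving average gives no trend-cycle estimate for the last two quarters, so an outlier in the final observation is left out of the seasonal index average and should distort the seasonal indices less than an outlier in the middle of the series.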
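
One way to check how much the position matters is to compare how far the seasonal indices move in each case. The sketch below uses a synthetic quarterly series rather than the Gas data, and stats::decompose as a stand-in for classical_decomposition (assumed here to behave the same way). `indices_with_outlier` is a hypothetical helper.

```r
# Synthetic quarterly series (NOT the Gas data), five years long.
t_idx <- 1:20
base_series <- ts((200 + 2 * t_idx) * rep(c(0.85, 1.05, 1.20, 0.90), times = 5),
                  start = c(1, 1), frequency = 4)

# Hypothetical helper: add 300 at position `pos` and return the four
# seasonal indices from a classical multiplicative decomposition.
indices_with_outlier <- function(series, pos) {
  series[pos] <- series[pos] + 300
  decompose(series, type = "multiplicative")$figure
}

clean_idx  <- decompose(base_series, type = "multiplicative")$figure
middle_idx <- indices_with_outlier(base_series, 10)  # outlier in the middle
last_idx   <- indices_with_outlier(base_series, 20)  # outlier in the last quarter

# Largest change in any seasonal index caused by each outlier.
shift <- c(middle = max(abs(middle_idx - clean_idx)),
           last   = max(abs(last_idx - clean_idx)))
print(round(shift, 3))
```

On this synthetic series the indices move less when the outlier is in the last quarter, although the spike itself is visible in the seasonally adjusted data in both cases.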

Question 3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(300)
x11_dcmp <- myseries |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components()
autoplot(x11_dcmp) +
  labs(title =
    "Decomposition using X-11.")

There appear to be several large spikes in the irregular component, which point to outliers in the data.

Question 3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The trend component of the civilian labor force in Australia rises steadily over the period from February 1978 to August 1995. The seasonal component shows the labor force dropping markedly at the beginning of each year and fluctuating through the rest of the year. The remainder component contains noticeable spikes, which suggest outliers in the data. These outliers could affect the accuracy of any analysis based on the series.

b. Is the recession of 1991/1992 visible in the estimated components?

Yes. It appears as the noticeable downward spike in the remainder component around 1991/1992.