DATA 624 Predictive Analytics - SILMA KHAN

Homework #2 - Chapter 3 Time Series Decomposition

Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book

Question 3.1: Consider the GDP information in “global_economy”. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

?global_economy
## No documentation for 'global_economy' in specified packages and libraries:
## you could try '??global_economy'
# load the libraries needed

library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.4     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
global_economy %>%
  autoplot(GDP / Population, show.legend = FALSE) +
  labs(title= "GDP per capita", y = "GDP per Capita")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

  • Since there are too many countries for a readable legend, the legend is suppressed; to identify the country behind the clearly dominant line, I filter the data for the maximum GDP per capita
global_economy %>%
  mutate(gdp_capita = GDP / Population) %>%
  filter(gdp_capita == max(gdp_capita, na.rm = TRUE)) %>%
  select(Country, gdp_capita)
## # A tsibble: 1 x 3 [1Y]
## # Key:       Country [1]
##   Country gdp_capita  Year
##   <fct>        <dbl> <dbl>
## 1 Monaco     185153.  2014
  • The country with the highest GDP per capita is Monaco. Its rapid growth appears to have started around the 1970s, and it has increased at a higher rate than most other countries, as shown in the visual (and in the sketch below)
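
To see more clearly how Monaco's GDP per capita has changed over time, a minimal sketch (assuming the same fpp3 setup loaded above) plots that country on its own:

# plot only Monaco's GDP per capita over time
global_economy %>%
  filter(Country == "Monaco") %>%
  autoplot(GDP / Population) +
  labs(title = "Monaco GDP per Capita", y = "GDP per Capita")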

Question 3.2: For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

  • United States GDP from “global_economy”
global_economy %>% 
  filter(Country == "United States") %>% 
  autoplot(GDP) +
  labs(title = "US GDP")

- In this plot, the y-axis values for GDP are shown in scientific notation since the values are so large. Rescaling GDP to trillions makes the graph easier to read

global_economy %>%
  filter(Country == "United States") %>%
  autoplot(GDP / 10 ^ 12) +
  labs(title= "US GDP", y = "GDP (in trillions)") 

  • Slaughter of Victorian “Bulls, bullocks and steers” in “aus_livestock”
aus_livestock %>% 
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria") %>% 
  autoplot(Count) +
  labs(title = "Slaughter Count of Victorian Bulls, Bullocks and Steers")

  • Victorian Electricity Demand from “vic_elec”
vic_elec %>% 
  autoplot(Demand) +
  labs(title = "Victorian Electricity Demand")

- Taking a look at this graph, the half-hourly observations are so dense that the data is hard to read, so some transformation is needed to see a clearer trend. I decided to aggregate to a monthly basis to see the trend in electricity demand more clearly

elec_demand <- vic_elec %>%
  group_by(Date) %>%
  mutate(Demand = sum(Demand)) %>%
  distinct(Date, Demand)

elec_demand %>%
  mutate(Date = yearmonth(Date)) %>%
  group_by(Date) %>%
  summarise(Demand = sum(Demand)) %>%
  as_tsibble(index = Date) %>%
  autoplot(Demand) +
  labs(title= "Monthly Victorian Electricity Demand")

- First, the data is grouped by the Date variable and the Demand values are summed for each date; distinct() then keeps one row per date with its total demand. To transform further, the Date variable is converted to a year-month format, the data is grouped by month, and the daily demand values are summed to get the monthly totals. The result is then converted into a tsibble with the month as the time index. From the monthly plot, the summer months tend to have the highest demand for electricity
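
The same aggregation can be written more concisely with tsibble's index_by(); a sketch producing the same monthly plot:

# index_by() aggregates the half-hourly index directly to months
vic_elec %>%
  index_by(Month = yearmonth(Time)) %>%
  summarise(Demand = sum(Demand)) %>%
  autoplot(Demand) +
  labs(title = "Monthly Victorian Electricity Demand")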

  • Gas production from “aus_production”
aus_production %>% 
  autoplot(Gas) +
  labs(title = "Gas Production")

Question 3.3: Why is a Box-Cox transformation unhelpful for the “canadian_gas” data?

?canadian_gas
## starting httpd help server ... done
canadian_gas
## # A tsibble: 542 x 2 [1M]
##       Month Volume
##       <mth>  <dbl>
##  1 1960 Jan  1.43 
##  2 1960 Feb  1.31 
##  3 1960 Mar  1.40 
##  4 1960 Apr  1.17 
##  5 1960 May  1.12 
##  6 1960 Jun  1.01 
##  7 1960 Jul  0.966
##  8 1960 Aug  0.977
##  9 1960 Sep  1.03 
## 10 1960 Oct  1.25 
## # ℹ 532 more rows
canadian_gas %>%
  autoplot(Volume) +
  labs(title = "Gas Production - Raw Data")

- For the Box-Cox transformation, I used Guerrero's method to find the optimal value of lambda, i.e. the one that minimizes the coefficient of variation of the transformed data

lambda_val <- canadian_gas %>% 
  features(Volume, features = guerrero) %>% 
  pull(lambda_guerrero)

lambda_val
## [1] 0.5767648
  • Guerrero's method suggests lambda ≈ 0.58, but a Box-Cox transformation is still unhelpful here: the seasonal variation in this series is small in the early years, largest through the middle of the series, and smaller again toward the end. Because the variation does not increase steadily with the level of the series, no single value of lambda can make it uniform
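
A quick check (a sketch, reusing lambda_val from above) is to plot the transformed series, which still shows the uneven seasonal variation:

# the transformed series still has variation that grows then shrinks
canadian_gas %>%
  autoplot(box_cox(Volume, lambda_val)) +
  labs(title = "Box-Cox Transformed Canadian Gas Production",
       y = "Transformed Volume")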

Question 3.4: What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

lambda_retail <- aus_retail %>% 
  features(Turnover, features = guerrero) %>% 
  pull(lambda_guerrero)
lambda_retail
##   [1]  0.505437358  0.455334875 -0.001548504 -0.142211494  0.239724139
##   [6] -0.026472613  0.302765950 -0.347859628 -0.079800145  0.488891084
##  [11]  0.170177523  0.008546483  0.258373687 -0.278913616  0.085422753
##  [16]  0.216735747 -0.186680060  0.121172855  0.349057788  0.358373811
##  [21]  0.194644303  0.097852391  0.014794694 -0.029183970  0.219320743
##  [26]  0.015030705  0.087866199 -0.088668882  0.233295271  0.794736203
##  [31]  0.409200236 -0.034414504 -0.233638227 -0.402606263  0.048725122
##  [36]  0.170462269 -0.271677387  0.088105269  0.157152361  0.002144737
##  [41]  0.265927881  0.224713740  0.284977808  0.083036312  0.127694817
##  [46] -0.062802562 -0.162564723  0.346569384  0.434893851  0.338398250
##  [51]  0.260980929 -0.026468501  0.066365207 -0.071665181 -0.022160487
##  [56]  0.137501308  0.142266795 -0.009186879 -0.078532522  0.121434869
##  [61]  0.022846394  0.076369281 -0.216330466  0.221481127  0.480753465
##  [66]  0.231063552  0.132569588  0.043833072 -0.202826332  0.111909872
##  [71]  0.205315650  0.285948322  0.113040418  0.209030359  0.197041795
##  [76]  0.441087014  0.371315460 -0.102361315 -0.154344946  0.035435040
##  [81] -0.022985108  0.084733215 -0.187787216  0.142978026  0.466262913
##  [86]  0.166374070  0.016104851 -0.360022768 -0.639897411  0.083802504
##  [91]  0.097715830 -0.342658577  0.149896211  0.155555534  0.189590792
##  [96]  0.325699726  0.319622078  0.001893373 -0.093951032  0.155626888
## [101]  0.090231492 -0.376863610  0.159984903  0.444917891  0.287535970
## [106]  0.704385146  0.128944957 -0.532982237 -0.899926773 -0.037340959
## [111]  0.222986858  0.330815717  0.173409612  0.176110328  0.030210246
## [116]  0.004465470  0.172748557 -0.033916722  0.050680165 -0.078768077
## [121]  0.083552351  0.554245382  0.215164121 -0.014119612 -0.353784988
## [126] -0.265806646  0.034427160  0.116131283 -0.354627576  0.134290108
## [131]  0.153561415  0.088721960  0.312135155  0.271087477  0.078757767
## [136]  0.002190358  0.081555041 -0.033484940  0.117911536 -0.069445135
## [141]  0.031003706  0.464098968  0.203250692  0.037661525 -0.053178552
## [146] -0.210144461  0.143890246  0.223362268 -0.371484018  0.140920582
## [151]  0.205193769  0.129945930
  • Most of these lambda values are fairly close to 0, but just to be sure:
summary(lambda_retail)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.89993 -0.02715  0.09778  0.08354  0.21738  0.79474
  • The summary shows that the median and mean are close to zero, which supports the idea that lambda = 0 (a log transformation) is appropriate for the retail data
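
Note that Exercise 7 in Section 2.10 used a single randomly sampled series rather than all of aus_retail; as a sketch (reusing the seed from Question 3.8 below), the lambda for that one series could be computed directly:

# sample one retail series and compute its Guerrero lambda
set.seed(8675309)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)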

Question 3.5: For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

?aus_production
  • Tobacco from aus_production
autoplot(aus_production, Tobacco)+
  labs(title = "Tobacco Production")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

lambda <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 0.9264636
aus_production %>%
  autoplot(box_cox(Tobacco, lambda)) +
  labs(title = paste("Tobacco Production with lambda =", round(lambda, 2)))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

  • Economy class passengers between Melbourne and Sydney from ansett
?ansett
ansett
## # A tsibble: 7,407 x 4 [1W]
## # Key:       Airports, Class [30]
##        Week Airports Class    Passengers
##      <week> <chr>    <chr>         <dbl>
##  1 1989 W28 ADL-PER  Business        193
##  2 1989 W29 ADL-PER  Business        254
##  3 1989 W30 ADL-PER  Business        185
##  4 1989 W31 ADL-PER  Business        254
##  5 1989 W32 ADL-PER  Business        191
##  6 1989 W33 ADL-PER  Business        136
##  7 1989 W34 ADL-PER  Business          0
##  8 1989 W35 ADL-PER  Business          0
##  9 1989 W36 ADL-PER  Business          0
## 10 1989 W37 ADL-PER  Business          0
## # ℹ 7,397 more rows
filtered_airports <- ansett %>%
  filter(Class == "Economy",
         Airports == "MEL-SYD")

filtered_airports
## # A tsibble: 282 x 4 [1W]
## # Key:       Airports, Class [1]
##        Week Airports Class   Passengers
##      <week> <chr>    <chr>        <dbl>
##  1 1987 W26 MEL-SYD  Economy      20167
##  2 1987 W27 MEL-SYD  Economy      20161
##  3 1987 W28 MEL-SYD  Economy      19993
##  4 1987 W29 MEL-SYD  Economy      20986
##  5 1987 W30 MEL-SYD  Economy      20497
##  6 1987 W31 MEL-SYD  Economy      20770
##  7 1987 W32 MEL-SYD  Economy      21111
##  8 1987 W33 MEL-SYD  Economy      20675
##  9 1987 W34 MEL-SYD  Economy      22092
## 10 1987 W35 MEL-SYD  Economy      20772
## # ℹ 272 more rows
autoplot(filtered_airports, Passengers)+
  labs(title = "Economy class Passengers Between Melbourne and Sydney")

lambda <- filtered_airports %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] 1.999927
filtered_airports %>%
  autoplot(box_cox(Passengers, lambda)) +
  labs(title = "Transformed # of Passengers Btwn Mel-Syd with Lambda = 2")

  • Pedestrian counts at Southern Cross Station from pedestrian
?pedestrian
pedestrian
## # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
## # Key:       Sensor [4]
##    Sensor         Date_Time           Date        Time Count
##    <chr>          <dttm>              <date>     <int> <int>
##  1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
##  2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826
##  3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567
##  4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264
##  5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139
##  6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77
##  7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44
##  8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56
##  9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113
## 10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166
## # ℹ 66,027 more rows
southern_cross_ped <- pedestrian %>%
  filter(Sensor == "Southern Cross Station") 

autoplot(southern_cross_ped, Count)+
  labs(title = "Pedestrian Counts at Southern Cross Station")

lambda <- southern_cross_ped %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] -0.2501616
southern_cross_ped %>%
  autoplot(box_cox(Count, lambda)) +
  labs(title = "Pedestrian Counts with Lambda = -0.23")

- Since these plots follow Date_Time on an hourly scale, the observations are too compact to get any real feel for what is going on. I would aggregate the data to a daily or weekly scale, as sketched below
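
A minimal sketch of the daily aggregation (assuming the southern_cross_ped object defined above):

# sum the hourly counts within each calendar date
southern_cross_ped %>%
  index_by(Date) %>%
  summarise(Count = sum(Count)) %>%
  autoplot(Count) +
  labs(title = "Daily Pedestrian Counts at Southern Cross Station")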

Question 3.7: Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)
gas
## # A tsibble: 20 x 2 [1Q]
##      Gas Quarter
##    <dbl>   <qtr>
##  1   221 2005 Q3
##  2   180 2005 Q4
##  3   171 2006 Q1
##  4   224 2006 Q2
##  5   233 2006 Q3
##  6   192 2006 Q4
##  7   187 2007 Q1
##  8   234 2007 Q2
##  9   245 2007 Q3
## 10   205 2007 Q4
## 11   194 2008 Q1
## 12   229 2008 Q2
## 13   249 2008 Q3
## 14   203 2008 Q4
## 15   196 2009 Q1
## 16   238 2009 Q2
## 17   252 2009 Q3
## 18   210 2009 Q4
## 19   205 2010 Q1
## 20   236 2010 Q2
  a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
autoplot(gas, Gas)

- There seems to be a slight increasing trend over the five years, along with clear quarterly seasonality: gas production peaks in the quarter during the middle of the year, then drops, and peaks again the following year

  b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
gas_decomp <- gas %>%
  model(classical_decomposition(Gas, type = "multiplicative")) 
components(gas_decomp) %>%
  autoplot() +
  labs(title = "Classical Decomposition of Gas Production - Multiplicative")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

  c. Do the results support the graphical interpretation from part a?
  • The results do support the graphical interpretation from part a: the trend component rises steadily over the period, and the seasonal component repeats the mid-year peak and subsequent decline each year
  d. Compute and plot the seasonally adjusted data.
components(gas_decomp) %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "green") +
  labs(title = "Adjusted Gas Production - Seasonally")

  e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
gas %>%
  mutate(Gas = ifelse(Gas == 224, Gas + 300, Gas)) %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "green") +
  labs(title = "Adjusted Gas Production with an Outlier")

- In 2006 Quarter 2, I added 300 to the original gas value of 224, raising it to 524. The outlier produces a large spike in both the raw series and the seasonally adjusted series: the seasonal adjustment removes only the average seasonal pattern, so the spike passes through to the adjusted data

  f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
gas %>%
  mutate(Gas = ifelse(Gas == 236, Gas + 300, Gas)) %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components() %>%
  as_tsibble() %>%
  autoplot(Gas, color = "blue") +
  geom_line(aes(y=season_adjust), color = "green") +
  labs(title = "Adjusted Gas Production with an Outlier Towards the End")

- Adding the outlier near the end of the series rather than in the middle does not seem to make any noticeable difference: in both cases the spike passes through to the seasonally adjusted data largely unchanged

Question 3.8: Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

library(seasonal)
## 
## Attaching package: 'seasonal'
## The following object is masked from 'package:tibble':
## 
##     view
set.seed(8675309)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

x11_dcmp <- myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()
autoplot(x11_dcmp) +
  labs(title = "Decomposition of Retail Turnover - X-11.")

- Compared to the other decomposition methods, the X-11 trend-cycle is less smooth and follows the data more closely. The remainder component also highlights noise and outliers that were not apparent in the original series

Question 3.9: Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

  a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
  • The decomposition shows an increasing trend, which matches the overall movement of the original data. The seasonal component is on a much smaller scale than the rest of the series, which suggests that seasonality is not a major driver of the data. The graphs also show a dip in the early 1990s, which is likely the result of the recession occurring at that time.
  b. Is the recession of 1991/1992 visible in the estimated components?
  • Yes, the recession is visible in the estimated components, most clearly in the remainder component, which shows large negative values around 1991/1992