DATA 624 - Predictive Analystics - (Assignment 1)

INSTRUCTIONS

Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Please submit both your Rpubs link as well as attach the .pdf file with your code.

Due on Sep 8, 2024 11:59 PM

LOAD PACKAGES

The following code below loops through the list of necessary packages and checks to determine if each is installed. If the package is not found it is installed and loaded.

pkges <- c("fpp3", "kableExtra", "readr", "USgas", "readxl")

# Loop through the packages
for (p in pkges) {
  # Check if package is installed
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p) #If the package is not installed, install the package
    
    library(p, character.only = TRUE) #Load the package
  } else {
    library(p, character.only = TRUE) #If the package is already installed, load the package
  }
}
## Warning: package 'fpp3' was built under R version 4.4.1
## Warning: package 'tibble' was built under R version 4.4.1
## Warning: package 'dplyr' was built under R version 4.4.1
## Warning: package 'tidyr' was built under R version 4.4.1
## Warning: package 'lubridate' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'tsibble' was built under R version 4.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.1
## Warning: package 'feasts' was built under R version 4.4.1
## Warning: package 'fabletools' was built under R version 4.4.1
## Warning: package 'fable' was built under R version 4.4.1
## Warning: package 'kableExtra' was built under R version 4.4.1
## Warning: package 'readr' was built under R version 4.4.1
## Warning: package 'USgas' was built under R version 4.4.1
## Warning: package 'readxl' was built under R version 4.4.1

EXercise 2.10.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

  • Use ? (or help()) to find out about the data in each series.
  • What is the time interval of each series?
  • Use autoplot() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.
data("aus_production")
data("pelt")
data("gafa_stock")
data("vic_elec")

Use ? (or help()) to find out about the data in each series.

help("aus_production")

help("pelt")

help("gafa_stock")

help("vic_elec")

What is the time interval of each series?

Time interval of each series: * aus_production:- This data has a quarterly time interval. * pelt:- This data is annual. * gafa_stock:- This data is daily. * vic_elec:- This data has half-hourly intervals.

# Bricks from aus_production
aus_production %>% filter(!is.na(Bricks)) -> bricks_data
head(aus_production, 10)%>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Quarter Beer Tobacco Bricks Cement Electricity Gas
1956 Q1 284 5225 189 465 3923 5
1956 Q2 213 5178 204 532 4436 6
1956 Q3 227 5297 208 561 4806 7
1956 Q4 308 5681 197 570 4418 6
1957 Q1 262 5577 187 529 4339 5
1957 Q2 228 5651 214 604 4811 7
1957 Q3 236 5317 227 603 5259 7
1957 Q4 320 6152 222 582 4735 6
1958 Q1 272 5758 199 554 4608 5
1958 Q2 233 5641 229 620 5196 7
# Lynx from pelt
head(pelt, 10)%>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Year Hare Lynx
1845 19580 30090
1846 19600 45150
1847 19610 49150
1848 11990 39520
1849 28040 21230
1850 58000 8420
1851 74600 5560
1852 75090 5080
1853 88480 10170
1854 61280 19600
# Close from gafa_stock
head(gafa_stock, 10)%>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Symbol Date Open High Low Close Adj_Close Volume
AAPL 2014-01-02 79.38286 79.57571 78.86000 79.01857 66.96433 58671200
AAPL 2014-01-03 78.98000 79.10000 77.20428 77.28286 65.49342 98116900
AAPL 2014-01-06 76.77857 78.11429 76.22857 77.70428 65.85053 103152700
AAPL 2014-01-07 77.76000 77.99429 76.84571 77.14857 65.37959 79302300
AAPL 2014-01-08 76.97285 77.93714 76.95571 77.63715 65.79363 64632400
AAPL 2014-01-09 78.11429 78.12286 76.47857 76.64571 64.95345 69787200
AAPL 2014-01-10 77.11857 77.25714 75.87286 76.13429 64.52003 76244000
AAPL 2014-01-13 75.70143 77.50000 75.69714 76.53286 64.85782 94623200
AAPL 2014-01-14 76.88857 78.10429 76.80857 78.05572 66.14838 83140400
AAPL 2014-01-15 79.07429 80.02857 78.80857 79.62286 67.47642 97909700
# Demand from vic_elec
head(vic_elec, 10) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Time Demand Temperature Date Holiday
2012-01-01 00:00:00 4382.825 21.40 2012-01-01 TRUE
2012-01-01 00:30:00 4263.366 21.05 2012-01-01 TRUE
2012-01-01 01:00:00 4048.966 20.70 2012-01-01 TRUE
2012-01-01 01:30:00 3877.563 20.55 2012-01-01 TRUE
2012-01-01 02:00:00 4036.230 20.40 2012-01-01 TRUE
2012-01-01 02:30:00 3865.597 20.25 2012-01-01 TRUE
2012-01-01 03:00:00 3694.098 20.10 2012-01-01 TRUE
2012-01-01 03:30:00 3561.624 19.60 2012-01-01 TRUE
2012-01-01 04:00:00 3433.035 19.10 2012-01-01 TRUE
2012-01-01 04:30:00 3359.468 18.95 2012-01-01 TRUE

Use autoplot() to produce a time plot of each series.

# Bricks data
autoplot(bricks_data, Bricks) +
  ggtitle("Bricks Production Over Time") +
  xlab("Year") +
  ylab("Bricks (Millions)")

# Lynx data
autoplot(pelt, Lynx) +
  ggtitle("Lynx Trapping Over Time") +
  xlab("Year") +
  ylab("Lynx pelts")

# Close data
autoplot(gafa_stock, Close) +
  ggtitle("GAFA Stock Close Price Over Time") +
  xlab("Date") +
  ylab("Close Price")

# Demand data
autoplot(vic_elec, Demand) +
  ggtitle("Electricity Demand") +
  xlab("Date") +
  ylab("Electricity Demand (MW)") +
  labs(title = "Electricity Demand Over Time",
       x = "Time",
       y = "Demand (Megawatts)")

For the last plot, modify the axis labels and title.

EXercise 2.10.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

# Inspect the gafa_stock dataset
head(gafa_stock, 10) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Symbol Date Open High Low Close Adj_Close Volume
AAPL 2014-01-02 79.38286 79.57571 78.86000 79.01857 66.96433 58671200
AAPL 2014-01-03 78.98000 79.10000 77.20428 77.28286 65.49342 98116900
AAPL 2014-01-06 76.77857 78.11429 76.22857 77.70428 65.85053 103152700
AAPL 2014-01-07 77.76000 77.99429 76.84571 77.14857 65.37959 79302300
AAPL 2014-01-08 76.97285 77.93714 76.95571 77.63715 65.79363 64632400
AAPL 2014-01-09 78.11429 78.12286 76.47857 76.64571 64.95345 69787200
AAPL 2014-01-10 77.11857 77.25714 75.87286 76.13429 64.52003 76244000
AAPL 2014-01-13 75.70143 77.50000 75.69714 76.53286 64.85782 94623200
AAPL 2014-01-14 76.88857 78.10429 76.80857 78.05572 66.14838 83140400
AAPL 2014-01-15 79.07429 80.02857 78.80857 79.62286 67.47642 97909700
gafa_stock %>%
  group_by(Symbol) %>% 
  filter(Close == max(Close)) %>% # Use filter() to find the days that correspond to the peak Close price
  arrange(Symbol)
## # A tsibble: 4 x 8 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800
## 2 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100
## 3 FB     2018-07-25  216.  219.  214.  218.      218. 58954200
## 4 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600

EXercise 2.10.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

  1. You can read the data into R with the following script:
#Load the Data 
tute1 <- readr::read_csv("tute1.csv")
View(tute1)
  1. Convert the data to time series
mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)
  1. Construct time series plots of each of the three series
mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

When Facet_grid() is removed, the three time series (Sales, AdBudget, and GDP) are plotted on the same graph. Depending on the scale of each series, this makes the plot crowded and harder to interpret. Facet_grid() helps to separate the plots, making it easier to analyze each series independently. In the previous code all three time series graphs are are plotted on individual graphs.

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() 

EXercise 2.10.4

The USgas package contains data on the demand for natural gas in the US.

Create a tsibble from us_total with year as the index and state as the key.

# Create a tsibble from us_total with year as the index and state as the key
usgas_tsibble <- us_total |>
  as_tsibble(index = year, key = state)

head(usgas_tsibble, 10) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
year state y
1997 Alabama 324158
1998 Alabama 329134
1999 Alabama 337270
2000 Alabama 353614
2001 Alabama 332693
2002 Alabama 379343
2003 Alabama 350345
2004 Alabama 382367
2005 Alabama 353156
2006 Alabama 391093

Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

# Filter New England states
new_england_states <- c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")

new_england_data <- usgas_tsibble |>
  filter(state %in% new_england_states)

# Plot the annual natural gas consumption by state
new_england_data |>
  ggplot(aes(x = year, y = y, colour = state)) +
  geom_line() +
  ggtitle("Annual Natural Gas Consumption by State (New England)") +
  xlab("Year") +
  ylab("Natural Gas Consumption (Million Cubic Feet)") +
  theme_minimal()

EXercise 2.10.5

Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

tourism_data <- readxl::read_excel("tourism.xlsx")

head(tourism, 10) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Quarter Region State Purpose Trips
1998 Q1 Adelaide South Australia Business 135.0777
1998 Q2 Adelaide South Australia Business 109.9873
1998 Q3 Adelaide South Australia Business 166.0347
1998 Q4 Adelaide South Australia Business 127.1605
1999 Q1 Adelaide South Australia Business 137.4485
1999 Q2 Adelaide South Australia Business 199.9126
1999 Q3 Adelaide South Australia Business 169.3551
1999 Q4 Adelaide South Australia Business 134.3579
2000 Q1 Adelaide South Australia Business 154.0344
2000 Q2 Adelaide South Australia Business 168.7764

Create a tsibble which is identical to the tourism tsibble from the tsibble package.

tourism_tsibble <- tourism_data |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter, key = c(Region, State, Purpose))

# View the tsibble
head(tourism_tsibble, 10) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Quarter Region State Purpose Trips
1998 Q1 Adelaide South Australia Business 135.0777
1998 Q2 Adelaide South Australia Business 109.9873
1998 Q3 Adelaide South Australia Business 166.0347
1998 Q4 Adelaide South Australia Business 127.1605
1999 Q1 Adelaide South Australia Business 137.4485
1999 Q2 Adelaide South Australia Business 199.9126
1999 Q3 Adelaide South Australia Business 169.3551
1999 Q4 Adelaide South Australia Business 134.3579
2000 Q1 Adelaide South Australia Business 154.0344
2000 Q2 Adelaide South Australia Business 168.7764

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

max_avg_trips <- tourism_tsibble |>
  group_by(Region, Purpose) |>
  summarize(avg_trips = mean(Trips, na.rm = TRUE)) |>
  ungroup() |>
  slice_max(avg_trips)

head(max_avg_trips, 10)%>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
Region Purpose Quarter avg_trips
Melbourne Visiting 2017 Q4 985.2784

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

state_total_trips <- tourism_tsibble |>
  index_by(Quarter) |>
  group_by(State) |>
  summarize(total_trips = sum(Trips, na.rm = TRUE)) |>
  ungroup() |>
  as_tsibble(index = Quarter, key = State)

# View the new tsibble
head(state_total_trips, 10)%>% kable() %>% 
  kable_styling(bootstrap_options = "striped", font_size = 12,) %>% 
  scroll_box(height = "100%", width = "100%", fixed_thead = T)
State Quarter total_trips
ACT 1998 Q1 551.0019
ACT 1998 Q2 416.0256
ACT 1998 Q3 436.0290
ACT 1998 Q4 449.7984
ACT 1999 Q1 378.5728
ACT 1999 Q2 558.1781
ACT 1999 Q3 448.9012
ACT 1999 Q4 594.8254
ACT 2000 Q1 599.6685
ACT 2000 Q2 557.1351

EXercise 2.10.8

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

us_employment |>
  filter(Title == 'Total Private') |>
  autoplot(Employed) +
  ggtitle("autoplot")

us_employment |>
  filter(Title == 'Total Private') |>
  gg_season(Employed) +
  ggtitle("gg_season Plot")

 us_employment |>
  filter(Title == 'Total Private') |>
  gg_subseries(Employed) +
  ggtitle("gg_subseries Plot") 

 us_employment |>
  filter(Title == 'Total Private') |>
  gg_lag(Employed) +
  ggtitle("gg_lag Plot") 

 us_employment |>
  filter(Title == 'Total Private') |>
  ACF(Employed) |>
  autoplot() +
  ggtitle("ACF Plot")

 # Bricks from aus_production
aus_production |>
  autoplot(Bricks) +
  ggtitle("Bricks Production in Australia")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production |>
  gg_season(Bricks) +
  ggtitle("gg_season - Bricks Production Plot")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production |>
  gg_subseries(Bricks) +
  ggtitle("gg_subseries - Bricks Production Plot")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production |>
  gg_lag(Bricks, geom='point') +
  ggtitle("gg_lag - Bricks Production Plot")
## Warning: Removed 20 rows containing missing values (gg_lag).

aus_production |>
  ACF(Bricks) |>
  autoplot() +
  ggtitle("ACF - Bricks Production Plot")

# Hare from pelt
autoplot(pelt, Hare) +
  ggtitle("Hare Pelts")

gg_subseries(pelt, Hare) +
  ggtitle("gg_subseries - Hare Pelts Plot")

gg_lag(pelt, Hare, geom='point') +
  ggtitle("gg_lag - Hare Pelts Plot")

pelt %>% ACF(Hare) 
## # A tsibble: 19 x 2 [1Y]
##         lag     acf
##    <cf_lag>   <dbl>
##  1       1Y  0.658 
##  2       2Y  0.214 
##  3       3Y -0.155 
##  4       4Y -0.401 
##  5       5Y -0.493 
##  6       6Y -0.401 
##  7       7Y -0.168 
##  8       8Y  0.113 
##  9       9Y  0.307 
## 10      10Y  0.340 
## 11      11Y  0.296 
## 12      12Y  0.206 
## 13      13Y  0.0372
## 14      14Y -0.153 
## 15      15Y -0.285 
## 16      16Y -0.295 
## 17      17Y -0.202 
## 18      18Y -0.0676
## 19      19Y  0.0956
PBS |>
  filter(ATC2 == 'H02') |>
  autoplot(Cost) +
  ggtitle("H02 Drug Cost from PBS")

PBS |>
  filter(ATC2 == 'H02') |>
  gg_season(Cost) +
  ggtitle("gg_season - H02 Drug Cost from PBS")

PBS |>
  filter(ATC2 == 'H02') |>
  gg_subseries(Cost) +
  ggtitle("gg_subseries - H02 Drug Cost from PBS")

PBS |>
  ACF(Cost) |>
  ggtitle("ACF - H02 Drug Cost from PBS")
## $title
## # A tsibble: 7,690 x 6 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [336]
##    Concession   Type        ATC1  ATC2       lag     acf
##    <chr>        <chr>       <chr> <chr> <cf_lag>   <dbl>
##  1 Concessional Co-payments A     A01         1M  0.737 
##  2 Concessional Co-payments A     A01         2M  0.466 
##  3 Concessional Co-payments A     A01         3M  0.239 
##  4 Concessional Co-payments A     A01         4M  0.0575
##  5 Concessional Co-payments A     A01         5M  0.0120
##  6 Concessional Co-payments A     A01         6M -0.0325
##  7 Concessional Co-payments A     A01         7M -0.0219
##  8 Concessional Co-payments A     A01         8M  0.0389
##  9 Concessional Co-payments A     A01         9M  0.180 
## 10 Concessional Co-payments A     A01        10M  0.373 
## # ℹ 7,680 more rows
## 
## $subtitle
## [1] "ACF - H02 Drug Cost from PBS"
## 
## attr(,"class")
## [1] "labels"
us_gasoline |>
  autoplot(Barrels) +
  ggtitle("US Gasoline Barrels Production")

us_gasoline |>
  gg_season(Barrels) +
  ggtitle("gg_season - US Gasoline Barrels Production")

us_gasoline |>
  gg_subseries(Barrels) +
  ggtitle("gg_subseries - US Gasoline Barrels Production")

us_gasoline |>
  gg_lag(Barrels, geom='point') +
  ggtitle("gg_lag - US Gasoline Barrels Production")

us_gasoline |>
  ACF(Barrels) |>
  ggtitle("ACF - US Gasoline Barrels Production")
## $title
## # A tsibble: 31 x 2 [1W]
##         lag   acf
##    <cf_lag> <dbl>
##  1       1W 0.893
##  2       2W 0.882
##  3       3W 0.873
##  4       4W 0.866
##  5       5W 0.847
##  6       6W 0.844
##  7       7W 0.832
##  8       8W 0.831
##  9       9W 0.822
## 10      10W 0.808
## # ℹ 21 more rows
## 
## $subtitle
## [1] "ACF - US Gasoline Barrels Production"
## 
## attr(,"class")
## [1] "labels"
* Can you spot any seasonality, cyclicity and trend?

The Private Data Set shows increased employment in the summer months whist the pelt (hares) dataset shows cyclic patterns.

* What do you learn about the series?

The series exhibits long-term growth in employment in the summer months.

* What can you say about the seasonal patterns?

The time series (employment, bricks, drug costs, gasoline) exhibits clear seasonal patterns.

* Can you identify any unusual years?

There was decrease in employment in the years from 2008 to 2010. This was primarily because of the Global Financial Crisis (GFC), also known as the Great Recession, which was one of the most severe economic downturns since the Great Depression.