DATA 624 - Predictive Analystics - (Assignment 1)
INSTRUCTIONS
Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Please submit both your Rpubs link as well as attach the .pdf file with your code.
Due on Sep 8, 2024 11:59 PM
LOAD PACKAGES
The following code below loops through the list of necessary packages and checks to determine if each is installed. If the package is not found it is installed and loaded.
pkges <- c("fpp3", "kableExtra", "readr", "USgas", "readxl")
# Loop through the packages
for (p in pkges) {
# Check if package is installed
if (!requireNamespace(p, quietly = TRUE)) {
install.packages(p) #If the package is not installed, install the package
library(p, character.only = TRUE) #Load the package
} else {
library(p, character.only = TRUE) #If the package is already installed, load the package
}
}## Warning: package 'fpp3' was built under R version 4.4.1
## Warning: package 'tibble' was built under R version 4.4.1
## Warning: package 'dplyr' was built under R version 4.4.1
## Warning: package 'tidyr' was built under R version 4.4.1
## Warning: package 'lubridate' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'tsibble' was built under R version 4.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.1
## Warning: package 'feasts' was built under R version 4.4.1
## Warning: package 'fabletools' was built under R version 4.4.1
## Warning: package 'fable' was built under R version 4.4.1
## Warning: package 'kableExtra' was built under R version 4.4.1
## Warning: package 'readr' was built under R version 4.4.1
## Warning: package 'USgas' was built under R version 4.4.1
## Warning: package 'readxl' was built under R version 4.4.1
EXercise 2.10.1
Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.
- Use ? (or help()) to find out about the data in each series.
- What is the time interval of each series?
- Use autoplot() to produce a time plot of each series.
- For the last plot, modify the axis labels and title.
Use ? (or help()) to find out about the data in each series.
What is the time interval of each series?
Time interval of each series: * aus_production:- This data has a quarterly time interval. * pelt:- This data is annual. * gafa_stock:- This data is daily. * vic_elec:- This data has half-hourly intervals.
# Bricks from aus_production
aus_production %>% filter(!is.na(Bricks)) -> bricks_data
head(aus_production, 10)%>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Quarter | Beer | Tobacco | Bricks | Cement | Electricity | Gas |
|---|---|---|---|---|---|---|
| 1956 Q1 | 284 | 5225 | 189 | 465 | 3923 | 5 |
| 1956 Q2 | 213 | 5178 | 204 | 532 | 4436 | 6 |
| 1956 Q3 | 227 | 5297 | 208 | 561 | 4806 | 7 |
| 1956 Q4 | 308 | 5681 | 197 | 570 | 4418 | 6 |
| 1957 Q1 | 262 | 5577 | 187 | 529 | 4339 | 5 |
| 1957 Q2 | 228 | 5651 | 214 | 604 | 4811 | 7 |
| 1957 Q3 | 236 | 5317 | 227 | 603 | 5259 | 7 |
| 1957 Q4 | 320 | 6152 | 222 | 582 | 4735 | 6 |
| 1958 Q1 | 272 | 5758 | 199 | 554 | 4608 | 5 |
| 1958 Q2 | 233 | 5641 | 229 | 620 | 5196 | 7 |
# Lynx from pelt
head(pelt, 10)%>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Year | Hare | Lynx |
|---|---|---|
| 1845 | 19580 | 30090 |
| 1846 | 19600 | 45150 |
| 1847 | 19610 | 49150 |
| 1848 | 11990 | 39520 |
| 1849 | 28040 | 21230 |
| 1850 | 58000 | 8420 |
| 1851 | 74600 | 5560 |
| 1852 | 75090 | 5080 |
| 1853 | 88480 | 10170 |
| 1854 | 61280 | 19600 |
# Close from gafa_stock
head(gafa_stock, 10)%>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Symbol | Date | Open | High | Low | Close | Adj_Close | Volume |
|---|---|---|---|---|---|---|---|
| AAPL | 2014-01-02 | 79.38286 | 79.57571 | 78.86000 | 79.01857 | 66.96433 | 58671200 |
| AAPL | 2014-01-03 | 78.98000 | 79.10000 | 77.20428 | 77.28286 | 65.49342 | 98116900 |
| AAPL | 2014-01-06 | 76.77857 | 78.11429 | 76.22857 | 77.70428 | 65.85053 | 103152700 |
| AAPL | 2014-01-07 | 77.76000 | 77.99429 | 76.84571 | 77.14857 | 65.37959 | 79302300 |
| AAPL | 2014-01-08 | 76.97285 | 77.93714 | 76.95571 | 77.63715 | 65.79363 | 64632400 |
| AAPL | 2014-01-09 | 78.11429 | 78.12286 | 76.47857 | 76.64571 | 64.95345 | 69787200 |
| AAPL | 2014-01-10 | 77.11857 | 77.25714 | 75.87286 | 76.13429 | 64.52003 | 76244000 |
| AAPL | 2014-01-13 | 75.70143 | 77.50000 | 75.69714 | 76.53286 | 64.85782 | 94623200 |
| AAPL | 2014-01-14 | 76.88857 | 78.10429 | 76.80857 | 78.05572 | 66.14838 | 83140400 |
| AAPL | 2014-01-15 | 79.07429 | 80.02857 | 78.80857 | 79.62286 | 67.47642 | 97909700 |
# Demand from vic_elec
head(vic_elec, 10) %>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Time | Demand | Temperature | Date | Holiday |
|---|---|---|---|---|
| 2012-01-01 00:00:00 | 4382.825 | 21.40 | 2012-01-01 | TRUE |
| 2012-01-01 00:30:00 | 4263.366 | 21.05 | 2012-01-01 | TRUE |
| 2012-01-01 01:00:00 | 4048.966 | 20.70 | 2012-01-01 | TRUE |
| 2012-01-01 01:30:00 | 3877.563 | 20.55 | 2012-01-01 | TRUE |
| 2012-01-01 02:00:00 | 4036.230 | 20.40 | 2012-01-01 | TRUE |
| 2012-01-01 02:30:00 | 3865.597 | 20.25 | 2012-01-01 | TRUE |
| 2012-01-01 03:00:00 | 3694.098 | 20.10 | 2012-01-01 | TRUE |
| 2012-01-01 03:30:00 | 3561.624 | 19.60 | 2012-01-01 | TRUE |
| 2012-01-01 04:00:00 | 3433.035 | 19.10 | 2012-01-01 | TRUE |
| 2012-01-01 04:30:00 | 3359.468 | 18.95 | 2012-01-01 | TRUE |
Use autoplot() to produce a time plot of each series.
# Bricks data
autoplot(bricks_data, Bricks) +
ggtitle("Bricks Production Over Time") +
xlab("Year") +
ylab("Bricks (Millions)")# Lynx data
autoplot(pelt, Lynx) +
ggtitle("Lynx Trapping Over Time") +
xlab("Year") +
ylab("Lynx pelts")# Close data
autoplot(gafa_stock, Close) +
ggtitle("GAFA Stock Close Price Over Time") +
xlab("Date") +
ylab("Close Price")# Demand data
autoplot(vic_elec, Demand) +
ggtitle("Electricity Demand") +
xlab("Date") +
ylab("Electricity Demand (MW)") +
labs(title = "Electricity Demand Over Time",
x = "Time",
y = "Demand (Megawatts)")For the last plot, modify the axis labels and title.
EXercise 2.10.2
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
# Inspect the gafa_stock dataset
head(gafa_stock, 10) %>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Symbol | Date | Open | High | Low | Close | Adj_Close | Volume |
|---|---|---|---|---|---|---|---|
| AAPL | 2014-01-02 | 79.38286 | 79.57571 | 78.86000 | 79.01857 | 66.96433 | 58671200 |
| AAPL | 2014-01-03 | 78.98000 | 79.10000 | 77.20428 | 77.28286 | 65.49342 | 98116900 |
| AAPL | 2014-01-06 | 76.77857 | 78.11429 | 76.22857 | 77.70428 | 65.85053 | 103152700 |
| AAPL | 2014-01-07 | 77.76000 | 77.99429 | 76.84571 | 77.14857 | 65.37959 | 79302300 |
| AAPL | 2014-01-08 | 76.97285 | 77.93714 | 76.95571 | 77.63715 | 65.79363 | 64632400 |
| AAPL | 2014-01-09 | 78.11429 | 78.12286 | 76.47857 | 76.64571 | 64.95345 | 69787200 |
| AAPL | 2014-01-10 | 77.11857 | 77.25714 | 75.87286 | 76.13429 | 64.52003 | 76244000 |
| AAPL | 2014-01-13 | 75.70143 | 77.50000 | 75.69714 | 76.53286 | 64.85782 | 94623200 |
| AAPL | 2014-01-14 | 76.88857 | 78.10429 | 76.80857 | 78.05572 | 66.14838 | 83140400 |
| AAPL | 2014-01-15 | 79.07429 | 80.02857 | 78.80857 | 79.62286 | 67.47642 | 97909700 |
gafa_stock %>%
group_by(Symbol) %>%
filter(Close == max(Close)) %>% # Use filter() to find the days that correspond to the peak Close price
arrange(Symbol)## # A tsibble: 4 x 8 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Open High Low Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2018-10-03 230. 233. 230. 232. 230. 28654800
## 2 AMZN 2018-09-04 2026. 2050. 2013 2040. 2040. 5721100
## 3 FB 2018-07-25 216. 219. 214. 218. 218. 58954200
## 4 GOOG 2018-07-26 1251 1270. 1249. 1268. 1268. 2405600
EXercise 2.10.3
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
- You can read the data into R with the following script:
- Convert the data to time series
- Construct time series plots of each of the three series
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
Check what happens when you don’t include facet_grid().
When Facet_grid() is removed, the three time series (Sales, AdBudget, and GDP) are plotted on the same graph. Depending on the scale of each series, this makes the plot crowded and harder to interpret. Facet_grid() helps to separate the plots, making it easier to analyze each series independently. In the previous code all three time series graphs are are plotted on individual graphs.
mytimeseries |>
pivot_longer(-Quarter) |>
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() EXercise 2.10.4
The USgas package contains data on the demand for natural gas in the US.
Create a tsibble from us_total with year as the index and state as the key.
# Create a tsibble from us_total with year as the index and state as the key
usgas_tsibble <- us_total |>
as_tsibble(index = year, key = state)
head(usgas_tsibble, 10) %>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| year | state | y |
|---|---|---|
| 1997 | Alabama | 324158 |
| 1998 | Alabama | 329134 |
| 1999 | Alabama | 337270 |
| 2000 | Alabama | 353614 |
| 2001 | Alabama | 332693 |
| 2002 | Alabama | 379343 |
| 2003 | Alabama | 350345 |
| 2004 | Alabama | 382367 |
| 2005 | Alabama | 353156 |
| 2006 | Alabama | 391093 |
Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
# Filter New England states
new_england_states <- c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")
new_england_data <- usgas_tsibble |>
filter(state %in% new_england_states)
# Plot the annual natural gas consumption by state
new_england_data |>
ggplot(aes(x = year, y = y, colour = state)) +
geom_line() +
ggtitle("Annual Natural Gas Consumption by State (New England)") +
xlab("Year") +
ylab("Natural Gas Consumption (Million Cubic Feet)") +
theme_minimal()EXercise 2.10.5
Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
tourism_data <- readxl::read_excel("tourism.xlsx")
head(tourism, 10) %>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Quarter | Region | State | Purpose | Trips |
|---|---|---|---|---|
| 1998 Q1 | Adelaide | South Australia | Business | 135.0777 |
| 1998 Q2 | Adelaide | South Australia | Business | 109.9873 |
| 1998 Q3 | Adelaide | South Australia | Business | 166.0347 |
| 1998 Q4 | Adelaide | South Australia | Business | 127.1605 |
| 1999 Q1 | Adelaide | South Australia | Business | 137.4485 |
| 1999 Q2 | Adelaide | South Australia | Business | 199.9126 |
| 1999 Q3 | Adelaide | South Australia | Business | 169.3551 |
| 1999 Q4 | Adelaide | South Australia | Business | 134.3579 |
| 2000 Q1 | Adelaide | South Australia | Business | 154.0344 |
| 2000 Q2 | Adelaide | South Australia | Business | 168.7764 |
Create a tsibble which is identical to the tourism tsibble from the tsibble package.
tourism_tsibble <- tourism_data |>
mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(index = Quarter, key = c(Region, State, Purpose))
# View the tsibble
head(tourism_tsibble, 10) %>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Quarter | Region | State | Purpose | Trips |
|---|---|---|---|---|
| 1998 Q1 | Adelaide | South Australia | Business | 135.0777 |
| 1998 Q2 | Adelaide | South Australia | Business | 109.9873 |
| 1998 Q3 | Adelaide | South Australia | Business | 166.0347 |
| 1998 Q4 | Adelaide | South Australia | Business | 127.1605 |
| 1999 Q1 | Adelaide | South Australia | Business | 137.4485 |
| 1999 Q2 | Adelaide | South Australia | Business | 199.9126 |
| 1999 Q3 | Adelaide | South Australia | Business | 169.3551 |
| 1999 Q4 | Adelaide | South Australia | Business | 134.3579 |
| 2000 Q1 | Adelaide | South Australia | Business | 154.0344 |
| 2000 Q2 | Adelaide | South Australia | Business | 168.7764 |
Find what combination of Region and Purpose had the maximum number of overnight trips on average.
max_avg_trips <- tourism_tsibble |>
group_by(Region, Purpose) |>
summarize(avg_trips = mean(Trips, na.rm = TRUE)) |>
ungroup() |>
slice_max(avg_trips)
head(max_avg_trips, 10)%>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| Region | Purpose | Quarter | avg_trips |
|---|---|---|---|
| Melbourne | Visiting | 2017 Q4 | 985.2784 |
Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
state_total_trips <- tourism_tsibble |>
index_by(Quarter) |>
group_by(State) |>
summarize(total_trips = sum(Trips, na.rm = TRUE)) |>
ungroup() |>
as_tsibble(index = Quarter, key = State)
# View the new tsibble
head(state_total_trips, 10)%>% kable() %>%
kable_styling(bootstrap_options = "striped", font_size = 12,) %>%
scroll_box(height = "100%", width = "100%", fixed_thead = T)| State | Quarter | total_trips |
|---|---|---|
| ACT | 1998 Q1 | 551.0019 |
| ACT | 1998 Q2 | 416.0256 |
| ACT | 1998 Q3 | 436.0290 |
| ACT | 1998 Q4 | 449.7984 |
| ACT | 1999 Q1 | 378.5728 |
| ACT | 1999 Q2 | 558.1781 |
| ACT | 1999 Q3 | 448.9012 |
| ACT | 1999 Q4 | 594.8254 |
| ACT | 2000 Q1 | 599.6685 |
| ACT | 2000 Q2 | 557.1351 |
EXercise 2.10.8
Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.
us_employment |>
filter(Title == 'Total Private') |>
gg_season(Employed) +
ggtitle("gg_season Plot") us_employment |>
filter(Title == 'Total Private') |>
gg_subseries(Employed) +
ggtitle("gg_subseries Plot") us_employment |>
filter(Title == 'Total Private') |>
ACF(Employed) |>
autoplot() +
ggtitle("ACF Plot") # Bricks from aus_production
aus_production |>
autoplot(Bricks) +
ggtitle("Bricks Production in Australia")## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 20 rows containing missing values (gg_lag).
## # A tsibble: 19 x 2 [1Y]
## lag acf
## <cf_lag> <dbl>
## 1 1Y 0.658
## 2 2Y 0.214
## 3 3Y -0.155
## 4 4Y -0.401
## 5 5Y -0.493
## 6 6Y -0.401
## 7 7Y -0.168
## 8 8Y 0.113
## 9 9Y 0.307
## 10 10Y 0.340
## 11 11Y 0.296
## 12 12Y 0.206
## 13 13Y 0.0372
## 14 14Y -0.153
## 15 15Y -0.285
## 16 16Y -0.295
## 17 17Y -0.202
## 18 18Y -0.0676
## 19 19Y 0.0956
PBS |>
filter(ATC2 == 'H02') |>
gg_subseries(Cost) +
ggtitle("gg_subseries - H02 Drug Cost from PBS")## $title
## # A tsibble: 7,690 x 6 [1M]
## # Key: Concession, Type, ATC1, ATC2 [336]
## Concession Type ATC1 ATC2 lag acf
## <chr> <chr> <chr> <chr> <cf_lag> <dbl>
## 1 Concessional Co-payments A A01 1M 0.737
## 2 Concessional Co-payments A A01 2M 0.466
## 3 Concessional Co-payments A A01 3M 0.239
## 4 Concessional Co-payments A A01 4M 0.0575
## 5 Concessional Co-payments A A01 5M 0.0120
## 6 Concessional Co-payments A A01 6M -0.0325
## 7 Concessional Co-payments A A01 7M -0.0219
## 8 Concessional Co-payments A A01 8M 0.0389
## 9 Concessional Co-payments A A01 9M 0.180
## 10 Concessional Co-payments A A01 10M 0.373
## # ℹ 7,680 more rows
##
## $subtitle
## [1] "ACF - H02 Drug Cost from PBS"
##
## attr(,"class")
## [1] "labels"
## $title
## # A tsibble: 31 x 2 [1W]
## lag acf
## <cf_lag> <dbl>
## 1 1W 0.893
## 2 2W 0.882
## 3 3W 0.873
## 4 4W 0.866
## 5 5W 0.847
## 6 6W 0.844
## 7 7W 0.832
## 8 8W 0.831
## 9 9W 0.822
## 10 10W 0.808
## # ℹ 21 more rows
##
## $subtitle
## [1] "ACF - US Gasoline Barrels Production"
##
## attr(,"class")
## [1] "labels"
* Can you spot any seasonality, cyclicity and trend?
The Private Data Set shows increased employment in the summer months whist the pelt (hares) dataset shows cyclic patterns.
* What do you learn about the series?
The series exhibits long-term growth in employment in the summer months.
* What can you say about the seasonal patterns?
The time series (employment, bricks, drug costs, gasoline) exhibits clear seasonal patterns.
* Can you identify any unusual years?
There was decrease in employment in the years from 2008 to 2010. This was primarily because of the Global Financial Crisis (GFC), also known as the Great Recession, which was one of the most severe economic downturns since the Great Depression.