Homework 1

Data 624 Homework 1

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.4.2

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:lubridate':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.1
## ✔ readr   2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.4.2

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibbledata 0.4.1     ✔ fable       0.4.1
## ✔ feasts      0.4.1

## Warning: package 'tsibbledata' was built under R version 4.4.2

## Warning: package 'feasts' was built under R version 4.4.2

## Warning: package 'fabletools' was built under R version 4.4.2

## Warning: package 'fable' was built under R version 4.4.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(forecast)

## Warning: package 'forecast' was built under R version 4.4.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Problem 2.1: Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

Use ? (or help()) to find out about the data in each series. What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.

data("aus_production")
?aus_production

## starting httpd help server ... done

data("pelt")
?pelt

data("gafa_stock")
?gafa_stock

data("vic_elec")
?vic_elec

aus_production is quarterly and runs from 1956 to 2010. pelt is yearly and runs from 1845 to 1935. gafa_stock is recorded on each trading day (business days when the market is open), so its spacing is irregular; it runs from the start of 2014 to the end of 2018. vic_elec is recorded every 30 minutes and runs from 2012 to 2014.
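The intervals described above can also be checked in code rather than read from the help pages. This is a minimal sketch, assuming the tsibble and tsibbledata packages are installed (fpp3 attaches both); interval() and is_regular() are tsibble functions.

```r
library(tsibble)
library(tsibbledata)

# interval() reports the spacing tsibble inferred from the index column.
interval(aus_production)  # quarterly
interval(pelt)            # yearly
interval(gafa_stock)      # irregular, printed as "!"
interval(vic_elec)        # 30 minutes

# gafa_stock is irregular because trading days skip weekends and holidays.
is_regular(gafa_stock)
is_regular(vic_elec)
```

The same information appears in square brackets in the header when a tsibble is printed, for example `[1Y]` or `[!]`.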

aus_production %>% 
  autoplot(Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

pelt %>% 
  autoplot(Lynx)

gafa_stock %>% 
  autoplot(Close)

# Modifying Axis and Title
vic_elec %>% 
  autoplot(Demand) +
  labs(x = "Date", y = "Demand") +
  ggtitle("Electricity Demand Over Time")

Problem 2.2: Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

gafa_stock %>% group_by(Symbol) %>%
  filter(Close==max(Close)) %>%
  select(Symbol,
         Date,
         Close)

## # A tsibble: 4 x 3 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date       Close
##   <chr>  <date>     <dbl>
## 1 AAPL   2018-10-03  232.
## 2 AMZN   2018-09-04 2040.
## 3 FB     2018-07-25  218.
## 4 GOOG   2018-07-26 1268.

AAPL peaked at 232.07 on 2018-10-03, AMZN at 2039.51 on 2018-09-04, FB at 217.50 on 2018-07-25, and GOOG at 1268.33 on 2018-07-26.
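The printed tibble rounds Close to three significant digits, so the exact peak values are not visible in the output above. One way to display them is sketched below, assuming dplyr and tsibbledata are installed; pillar.sigfig is the tibble printing option for significant digits, and the name peaks is only illustrative.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

# Show more significant digits when tibbles are printed.
options(pillar.sigfig = 6)

# slice_max() keeps the row with the largest Close within each Symbol.
peaks <- gafa_stock %>%
  as_tibble() %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%
  ungroup() %>%
  select(Symbol, Date, Close)

peaks
```

slice_max() keeps ties by default, so a stock with two days at the same peak price would return both rows, just as filter(Close == max(Close)) does.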

Problem 2.3 A:

tute1 <- read.csv('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/tute1.csv')
head(tute1)

##      Quarter  Sales AdBudget   GDP
## 1 1981-03-01 1020.2    659.2 251.8
## 2 1981-06-01  889.2    589.0 290.9
## 3 1981-09-01  795.0    512.5 290.8
## 4 1981-12-01 1003.9    614.1 292.4
## 5 1982-03-01 1057.7    647.2 279.1
## 6 1982-06-01  944.4    602.0 254.0

mytimeseries <- tute1 %>% 
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter)

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()+ 
  facet_grid(name ~ ., scales = "free_y")

Without facet_grid(), all three series are drawn on a single panel with a shared y-axis. Adding facet_grid(name ~ ., scales = "free_y") splits the plot into three panels, one per series (AdBudget, GDP, and Sales), each with its own y-axis scale.
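The effect of facet_grid() can be seen by building the plot both ways. The sketch below uses a small made-up data frame (toy, with invented values on different scales) in place of tute1, so it runs without downloading the CSV; the names toy, long, p_single, and p_facet are illustrative only.

```r
library(ggplot2)
library(tidyr)

# Made-up values standing in for tute1; not the real data.
toy <- data.frame(
  Quarter  = 1:8,
  Sales    = c(1020, 889, 795, 1004, 1058, 944, 778, 932),
  AdBudget = c(659, 589, 512, 614, 647, 602, 530, 608),
  GDP      = c(252, 291, 291, 292, 279, 254, 295, 271)
)

# One row per (Quarter, series) pair; the series label goes in `name`.
long <- pivot_longer(toy, -Quarter)

# Single panel: every series shares one y-axis.
p_single <- ggplot(long, aes(x = Quarter, y = value, colour = name)) +
  geom_line()

# One panel per series, each with its own y-axis scale.
p_facet <- p_single + facet_grid(name ~ ., scales = "free_y")

p_single
p_facet
```

With a shared axis, a series with a small range such as GDP looks nearly flat next to Sales; the free y-scales make its variation visible.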

Problem 2.4: The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.

Create a tsibble from us_total with year as the index and state as the key. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island). A:

#install.packages('USgas')
library(USgas)

## Warning: package 'USgas' was built under R version 4.4.3

data("us_total")
str(us_total)

## 'data.frame':    1266 obs. of  3 variables:
##  $ year : int  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
##  $ state: chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ y    : int  324158 329134 337270 353614 332693 379343 350345 382367 353156 391093 ...

us_total <- us_total %>%
  rename(natural_gas_consumption_mcf = y)
us_total_tsibble <- us_total %>%
  filter(state %in% c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")) %>%
  as_tsibble(key = state, index = year)
us_total_tsibble

## # A tsibble: 138 x 3 [1Y]
## # Key:       state [6]
##     year state       natural_gas_consumption_mcf
##    <int> <chr>                             <int>
##  1  1997 Connecticut                      144708
##  2  1998 Connecticut                      131497
##  3  1999 Connecticut                      152237
##  4  2000 Connecticut                      159712
##  5  2001 Connecticut                      146278
##  6  2002 Connecticut                      177587
##  7  2003 Connecticut                      154075
##  8  2004 Connecticut                      162642
##  9  2005 Connecticut                      168067
## 10  2006 Connecticut                      172682
## # ℹ 128 more rows

# Plot the annual natural gas consumption
us_total_tsibble %>% autoplot(natural_gas_consumption_mcf)

Problem 2.5:

df5 <-  read.csv('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/tourism.xlsx%20-%20Sheet1.csv')
str(df5)

## 'data.frame':    24320 obs. of  5 variables:
##  $ Quarter: chr  "1998-01-01" "1998-04-01" "1998-07-01" "1998-10-01" ...
##  $ Region : chr  "Adelaide" "Adelaide" "Adelaide" "Adelaide" ...
##  $ State  : chr  "South Australia" "South Australia" "South Australia" "South Australia" ...
##  $ Purpose: chr  "Business" "Business" "Business" "Business" ...
##  $ Trips  : num  135 110 166 127 137 ...

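# Convert Quarter to a yearquarter index (as in tourism) and make sure Trips is numeric.
# A plain Date index would not be recognised as quarterly by as_tsibble().
df5 <- df5 %>%
  mutate(Quarter = yearquarter(as.Date(Quarter)),
         Trips = as.numeric(Trips))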

# Creating a tsibble identical to the tourism one
tsib_df5 <- as_tsibble(df5, key = c(Region, State, Purpose), index = Quarter)


# Combination of Region and Purpose with the maximum number of overnight trips on average
max_avg_trips <- df5 %>%
  group_by(Region, Purpose) %>%
  summarise(avg_trips = mean(Trips)) %>%
  arrange(desc(avg_trips))

## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

head(max_avg_trips)

## # A tibble: 6 × 3
## # Groups:   Region [4]
##   Region          Purpose  avg_trips
##   <chr>           <chr>        <dbl>
## 1 Sydney          Visiting      747.
## 2 Melbourne       Visiting      619.
## 3 Sydney          Business      602.
## 4 North Coast NSW Holiday       588.
## 5 Sydney          Holiday       550.
## 6 Gold Coast      Holiday       528.

# Total trips by State, summed over all quarters, Regions and Purposes (a tibble, not a tsibble)
total_trips_by_state <- df5 %>%
  group_by(State) %>%
  summarise(total_trips = sum(Trips)) %>%
  arrange(desc(total_trips))

head(total_trips_by_state)

## # A tibble: 6 × 2
##   State             total_trips
##   <chr>                   <dbl>
## 1 New South Wales       557367.
## 2 Victoria              390463.
## 3 Queensland            386643.
## 4 Western Australia     147820.
## 5 South Australia       118151.
## 6 Tasmania               54137.

New South Wales, Victoria, and Queensland lead the other states by a wide margin. New South Wales has nearly four times the total trips of Western Australia (557,367 vs. 147,820), while Victoria and Queensland each have roughly 2.6 times as many.
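The table above collapses the time dimension. If a tsibble of total trips by State is wanted instead, one possible approach is to summarise a tsibble directly, since summarise() on a tsibble keeps the index. The sketch below assumes the built-in tourism tsibble from tsibbledata has the same content as the CSV; state_trips is an illustrative name.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

# Grouping a tsibble by State and summarising sums over Region and Purpose
# but keeps Quarter, so the result has one row per State per Quarter.
state_trips <- tourism %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

state_trips
```

The result is keyed by State and indexed by Quarter, so it can be passed straight to autoplot(Trips).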

Problem 2.8

data("PBS")
data("us_employment")
data("us_gasoline")

Employed:

us_employment %>% 
  filter(Title == "Total Private") %>% 
  autoplot(Employed) + 
  ggtitle("Autoplot")

Bricks:

aus_production %>% 
  autoplot(Bricks) +
  ggtitle("Autoplot")

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

Pelt:

pelt %>% 
  autoplot(Hare) +
  ggtitle("Autoplot")

PBS:

PBS %>% 
  filter(ATC2 == "H02")  %>% 
  autoplot(Cost) + 
  ggtitle("Autoplot")

Gasoline:

us_gasoline %>% 
  autoplot() + 
  ggtitle("Autoplot")

## Plot variable not specified, automatically selected `.vars = Barrels`

us_gasoline %>% 
  gg_season() +
  ggtitle("Seasonal Plot")

## Plot variable not specified, automatically selected `y = Barrels`

us_gasoline %>% 
  gg_subseries()+
  ggtitle("Subseries Plot")

## Plot variable not specified, automatically selected `y = Barrels`

us_gasoline %>% 
  gg_lag() +
  ggtitle("Lag Plot")

## Plot variable not specified, automatically selected `y = Barrels`

us_gasoline %>% 
  ACF() %>% 
  autoplot() + 
  ggtitle("Autocorrelation Function")

## Response variable not specified, automatically selected `var = Barrels`

The gasoline Barrels series shows a generally positive trend over the period, along with some seasonality. The series is noisy, but recurring peaks and declines are visible at particular times of the year. The lag plot indicates positive correlation between lagged values, although heavy overplotting makes it hard to read. I cannot pick out any unusual years in the data, though overplotting may be hiding them.
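Because us_gasoline is weekly, the default ACF plot covers only a short span of lags. A possible follow-up, sketched here under the assumption that feasts and tsibbledata are installed, is to extend lag_max to two years of weeks so that any annual pattern near lag 52 falls inside the plotted range; acf_tbl is an illustrative name.

```r
library(feasts)
library(tsibbledata)
library(ggplot2)

# Autocorrelations of weekly Barrels out to 104 weeks (about two years).
acf_tbl <- us_gasoline %>%
  ACF(Barrels, lag_max = 104)

# Plot the correlogram; a seasonal series tends to show bumps near
# multiples of the seasonal period (about 52 weeks here).
autoplot(acf_tbl) +
  ggtitle("Autocorrelation Function, up to 104 weeks")
```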

Ngawang Dakpa

2025-03-22