Data 624 Homework 1
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.4.2
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.2
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibbledata 0.4.1 ✔ fable 0.4.1
## ✔ feasts 0.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.2
## Warning: package 'feasts' was built under R version 4.4.2
## Warning: package 'fabletools' was built under R version 4.4.2
## Warning: package 'fable' was built under R version 4.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(forecast)
## Warning: package 'forecast' was built under R version 4.4.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Problem 2.1: Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.
Use ? (or help()) to find out about the data in each series. What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.
data("aus_production")
?aus_production
## starting httpd help server ... done
data("pelt")
?pelt
data("gafa_stock")
?gafa_stock
data("vic_elec")
?vic_elec
aus_production is quarterly and runs from 1956 to 2010. pelt is yearly and runs from 1845 to 1935. gafa_stock is recorded on each business day that the market is open, so it is irregularly spaced, and runs from the start of 2014 to the end of 2018. vic_elec is recorded every 30 minutes and runs from 2012 to 2014.
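The intervals read from the help pages can also be checked programmatically. The following is a minimal sketch, assuming the tsibble and tsibbledata packages are installed; the helper list `series` and the result `intervals` are names introduced here for illustration only.

```r
library(tsibble)
library(tsibbledata)

# Collect the four data sets explored in this problem
series <- list(
  aus_production = aus_production,
  pelt           = pelt,
  gafa_stock     = gafa_stock,
  vic_elec       = vic_elec
)

# tsibble::interval() returns the interval stored with each tsibble;
# format() turns it into the short label shown in the tsibble header
intervals <- vapply(
  series,
  function(x) format(tsibble::interval(x)),
  character(1)
)

intervals
```

The namespace prefix `tsibble::` is used because lubridate also exports a function called `interval()`.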
aus_production %>%
autoplot(Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
pelt %>%
autoplot(Lynx)
gafa_stock %>%
autoplot(Close)
# Modifying Axis and Title
vic_elec %>%
autoplot(Demand) +
labs(x = "Date", y = "Demand") +
ggtitle("Electricity Demand Over Time")
Problem 2.2: Use filter() to find what days corresponded to the peak
closing price for each of the four stocks in gafa_stock.
gafa_stock %>% group_by(Symbol) %>%
filter(Close==max(Close)) %>%
select(Symbol,
Date,
Close)
## # A tsibble: 4 x 3 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Close
## <chr> <date> <dbl>
## 1 AAPL 2018-10-03 232.
## 2 AMZN 2018-09-04 2040.
## 3 FB 2018-07-25 218.
## 4 GOOG 2018-07-26 1268.
AAPL peaked at 232.07 on 2018-10-03, AMZN at 2039.51 on 2018-09-04, FB at 217.50 on 2018-07-25, and GOOG at 1268.33 on 2018-07-26.
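The problem asks for filter(), but the same group-wise maximum can also be found with dplyr's slice_max(). The sketch below uses a small, made-up price table (the symbols, dates and prices are hypothetical, not gafa_stock values) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy prices, for illustration only
toy_prices <- tibble(
  Symbol = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  Date   = as.Date("2020-01-01") + c(0, 1, 2, 0, 1, 2),
  Close  = c(10, 12, 11, 20, 19, 21)
)

# Keep, for each symbol, the row with the highest closing price
peak_days <- toy_prices %>%
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%
  ungroup()

peak_days
```

By default slice_max() keeps ties, so a stock that reaches its peak on two different days would return both rows, the same as filter(Close == max(Close)).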
Problem 2.3 A:
tute1 <- read.csv('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/tute1.csv')
head(tute1)
## Quarter Sales AdBudget GDP
## 1 1981-03-01 1020.2 659.2 251.8
## 2 1981-06-01 889.2 589.0 290.9
## 3 1981-09-01 795.0 512.5 290.8
## 4 1981-12-01 1003.9 614.1 292.4
## 5 1982-03-01 1057.7 647.2 279.1
## 6 1982-06-01 944.4 602.0 254.0
B:
mytimeseries <- tute1 %>%
mutate(Quarter = yearquarter(Quarter)) %>%
as_tsibble(index = Quarter)
C:
mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()+
facet_grid(name ~ ., scales = "free_y")
Without facet_grid(), all three series are drawn together on a single panel
with a shared y-axis. With facet_grid(name ~ ., scales = "free_y"), each series
(AdBudget, GDP and Sales) gets its own panel with its own y-axis scale.
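The faceting works because pivot_longer() first stacks the three measurement columns into a single `name`/`value` pair. A minimal sketch on a hypothetical two-row table (the numbers are made up, not taken from tute1) shows the reshaped layout:

```r
library(tidyr)
library(tibble)

# Hypothetical wide table with the same column names as tute1
toy_wide <- tibble(
  Quarter  = c("2020 Q1", "2020 Q2"),
  Sales    = c(100, 110),
  AdBudget = c(50, 55),
  GDP      = c(200, 201)
)

# Every column except Quarter is stacked: the old column name goes to `name`
# and the old cell value goes to `value`
toy_long <- pivot_longer(toy_wide, -Quarter)

toy_long
```

Each original row becomes three rows, one per series, which is what lets `colour = name` and `facet_grid(name ~ .)` separate the series.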
Problem 2.4: The USgas package contains data on the demand for natural gas in the US.
Install the USgas package.
Create a tsibble from us_total with year as the index and state as the key. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
A:
#install.packages('USgas')
library(USgas)
## Warning: package 'USgas' was built under R version 4.4.3
data("us_total")
str(us_total)
## 'data.frame': 1266 obs. of 3 variables:
## $ year : int 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
## $ state: chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ y : int 324158 329134 337270 353614 332693 379343 350345 382367 353156 391093 ...
B:
us_total <- us_total %>%
rename(natural_gas_consumption_mcf = y)
us_total_tsibble <- us_total %>%
filter(state %in% c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")) %>%
as_tsibble(key = state, index = year)
us_total_tsibble
## # A tsibble: 138 x 3 [1Y]
## # Key: state [6]
## year state natural_gas_consumption_mcf
## <int> <chr> <int>
## 1 1997 Connecticut 144708
## 2 1998 Connecticut 131497
## 3 1999 Connecticut 152237
## 4 2000 Connecticut 159712
## 5 2001 Connecticut 146278
## 6 2002 Connecticut 177587
## 7 2003 Connecticut 154075
## 8 2004 Connecticut 162642
## 9 2005 Connecticut 168067
## 10 2006 Connecticut 172682
## # ℹ 128 more rows
C:
# Plot the annual natural gas consumption
us_total_tsibble %>% autoplot(natural_gas_consumption_mcf)
Problem 2.5:
df5 <- read.csv('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/tourism.xlsx%20-%20Sheet1.csv')
str(df5)
## 'data.frame': 24320 obs. of 5 variables:
## $ Quarter: chr "1998-01-01" "1998-04-01" "1998-07-01" "1998-10-01" ...
## $ Region : chr "Adelaide" "Adelaide" "Adelaide" "Adelaide" ...
## $ State : chr "South Australia" "South Australia" "South Australia" "South Australia" ...
## $ Purpose: chr "Business" "Business" "Business" "Business" ...
## $ Trips : num 135 110 166 127 137 ...
# Convert "Quarter" to a quarterly index (as.Date() would give a daily index) and "Trips" to numeric
df5 <- df5 %>%
mutate(Quarter = yearquarter(Quarter),
Trips = as.numeric(Trips))
# Creating a tsibble identical to the tourism one
tsib_df5 <- as_tsibble(df5, key = c(Region, State, Purpose), index = Quarter)
# Combination of Region and Purpose with the maximum number of overnight trips on average
max_avg_trips <- df5 %>%
group_by(Region, Purpose) %>%
summarise(avg_trips = mean(Trips)) %>%
arrange(desc(avg_trips))
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
head(max_avg_trips)
## # A tibble: 6 × 3
## # Groups: Region [4]
## Region Purpose avg_trips
## <chr> <chr> <dbl>
## 1 Sydney Visiting 747.
## 2 Melbourne Visiting 619.
## 3 Sydney Business 602.
## 4 North Coast NSW Holiday 588.
## 5 Sydney Holiday 550.
## 6 Gold Coast Holiday 528.
# Total trips by State, summed over all regions, purposes and quarters
total_trips_by_state <- df5 %>%
group_by(State) %>%
summarise(total_trips = sum(Trips)) %>%
arrange(desc(total_trips))
head(total_trips_by_state)
## # A tibble: 6 × 2
## State total_trips
## <chr> <dbl>
## 1 New South Wales 557367.
## 2 Victoria 390463.
## 3 Queensland 386643.
## 4 Western Australia 147820.
## 5 South Australia 118151.
## 6 Tasmania 54137.
New South Wales, Victoria and Queensland lead the other states by a wide margin: New South Wales has more than three times the total trips of Western Australia, while Victoria and Queensland each have more than double.
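The comparison with Western Australia can be made explicit by dividing the totals. This short sketch re-enters the rounded totals printed in the table above; the vector names `total_trips` and `ratio_to_wa` are introduced here for illustration.

```r
# Totals copied from the printed summary above (rounded as displayed)
total_trips <- c(
  "New South Wales"   = 557367,
  "Victoria"          = 390463,
  "Queensland"        = 386643,
  "Western Australia" = 147820
)

# Express each state's total as a multiple of Western Australia's total
ratio_to_wa <- total_trips / total_trips["Western Australia"]

round(ratio_to_wa, 2)
```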
Problem 2.8
data("PBS")
data("us_employment")
data("us_gasoline")
Employed:
us_employment %>%
filter(Title == "Total Private") %>%
autoplot(Employed) +
ggtitle("Autoplot")
Bricks:
aus_production %>%
autoplot(Bricks) +
ggtitle("Autoplot")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
Pelt:
pelt %>%
autoplot(Hare) +
ggtitle("Autoplot")
PBS:
PBS %>%
filter(ATC2 == "H02") %>%
autoplot(Cost) +
ggtitle("Autoplot")
Gasoline:
us_gasoline %>%
autoplot() +
ggtitle("Autoplot")
## Plot variable not specified, automatically selected `.vars = Barrels`
us_gasoline %>%
gg_season() +
ggtitle("Seasonal Plot")
## Plot variable not specified, automatically selected `y = Barrels`
us_gasoline %>%
gg_subseries()+
ggtitle("Subseries Plot")
## Plot variable not specified, automatically selected `y = Barrels`
us_gasoline %>%
gg_lag() +
ggtitle("Lag Plot")
## Plot variable not specified, automatically selected `y = Barrels`
us_gasoline %>%
ACF() %>%
autoplot() +
ggtitle("Autocorrelation Function")
## Response variable not specified, automatically selected `var = Barrels`
The gasoline Barrels series shows a generally positive trend over the
period, along with some seasonality. The series is noisy, but peaks and
dips recur at particular times of the year. The lag plot indicates
positive correlation, with some overplotting. I cannot pick out any
unusual years in the data, although the overplotting makes this hard to
judge.
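To make the link between the lag plot and the ACF concrete, the lag-k autocorrelation can be computed by hand as r_k = sum((y_t - mean) * (y_{t-k} - mean)) / sum((y_t - mean)^2). The sketch below uses a short, made-up series (not the us_gasoline data) and a hypothetical helper `manual_acf()`, and compares the result with base R's acf().

```r
# Hypothetical toy series, for illustration only
y <- c(8, 9, 7, 10, 9, 11, 10, 12, 11, 13)

# Lag-k sample autocorrelation computed directly from the definition
manual_acf <- function(y, k) {
  n <- length(y)
  y_bar <- mean(y)
  # Cross-products between the series and itself shifted by k steps
  num <- sum((y[(k + 1):n] - y_bar) * (y[1:(n - k)] - y_bar))
  # Total squared deviation from the mean
  den <- sum((y - y_bar)^2)
  num / den
}

r1_manual  <- manual_acf(y, 1)
# acf() stores lag 0 first, so lag 1 is the second element
r1_builtin <- acf(y, plot = FALSE)$acf[2]

c(manual = r1_manual, builtin = r1_builtin)
```

A positive value at lag 1 corresponds to the points in the lag-1 panel of a lag plot clustering around the upward diagonal.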