This document contains the homework problems for the Data 624 course. Link: https://otexts.com/fpp3/graphics-exercises.html

Week 1 HW Problems

HA 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8

2.1

Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.

  • Use autoplot() to plot each of these in separate plots

  • What is the time interval of each series?

if (!require('fpp3')) install.packages('fpp3')
if (!require('magrittr')) install.packages('magrittr')
if (!require('dplyr')) install.packages('dplyr')
if (!require('readxl')) install.packages('readxl')
if (!require('RCurl')) install.packages('RCurl')
library(fpp3)
library(ggplot2)

#help("gafa_stock")
#help("PBS")
#help("vic_elec")
#help("pelt")

gafa_stock

The gafa_stock data contains historical stock prices from 2014 to 2018 for Google, Amazon, Facebook and Apple. All prices are in $USD.

summary(gafa_stock)
##     Symbol               Date                 Open              High        
##  Length:5032        Min.   :2014-01-02   Min.   :  54.02   Min.   :  54.94  
##  Class :character   1st Qu.:2015-04-02   1st Qu.: 118.33   1st Qu.: 119.25  
##  Mode  :character   Median :2016-06-30   Median : 257.59   Median : 261.94  
##                     Mean   :2016-07-01   Mean   : 465.75   Mean   : 469.95  
##                     3rd Qu.:2017-09-29   3rd Qu.: 746.53   3rd Qu.: 750.96  
##                     Max.   :2018-12-31   Max.   :2038.11   Max.   :2050.50  
##       Low              Close           Adj_Close           Volume         
##  Min.   :  51.85   Min.   :  53.53   Min.   :  53.53   Min.   :     7900  
##  1st Qu.: 117.35   1st Qu.: 118.54   1st Qu.: 115.48   1st Qu.:  2519975  
##  Median : 256.89   Median : 259.51   Median : 258.61   Median : 10804400  
##  Mean   : 460.92   Mean   : 465.56   Mean   : 464.24   Mean   : 19493800  
##  3rd Qu.: 738.01   3rd Qu.: 744.79   3rd Qu.: 744.79   3rd Qu.: 29399250  
##  Max.   :2013.00   Max.   :2039.51   Max.   :2039.51   Max.   :266380800
head(gafa_stock)
## # A tsibble: 6 x 8 [!]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close    Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
## 2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
## 3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
## 4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
## 5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
## 6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
autoplot(gafa_stock, Adj_Close) +
  ggtitle("Historical stock prices from 2014-2018 (Google, Amazon, Facebook and Apple)") +
  xlab("Date") +
  ylab("Adj_Close")

What is the time interval of these stocks?

interval(gafa_stock)
## <interval[1]>
## [1] !

The interval cannot be determined; the [!] marks an irregular tsibble. Stocks trade only on business days, so the gap between consecutive dates ranges from one day within a trading week to several days over weekends and holidays for each symbol.
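
A quick way to see the irregular spacing is to tabulate the day gaps between consecutive observations per symbol (a minimal sketch; the gap column name is my own):

gafa_stock %>%
  group_by_key() %>%                  # one group per stock symbol
  mutate(gap = Date - lag(Date)) %>%  # days since the previous trading day
  as_tibble() %>%
  count(Symbol, gap)                  # gaps of 1 day and of 3+ days both appear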

PBS

The PBS data contains monthly Medicare Australia prescription data.

head(PBS)
## # A tsibble: 6 x 9 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [1]
##      Month Concession   Type        ATC1  ATC1_desc  ATC2  ATC2_…¹ Scripts  Cost
##      <mth> <chr>        <chr>       <chr> <chr>      <chr> <chr>     <dbl> <dbl>
## 1 1991 Jul Concessional Co-payments A     Alimentar… A01   STOMAT…   18228 67877
## 2 1991 Aug Concessional Co-payments A     Alimentar… A01   STOMAT…   15327 57011
## 3 1991 Sep Concessional Co-payments A     Alimentar… A01   STOMAT…   14775 55020
## 4 1991 Oct Concessional Co-payments A     Alimentar… A01   STOMAT…   15380 57222
## 5 1991 Nov Concessional Co-payments A     Alimentar… A01   STOMAT…   14371 52120
## 6 1991 Dec Concessional Co-payments A     Alimentar… A01   STOMAT…   15028 54299
## # … with abbreviated variable name ¹​ATC2_desc
PBS %>%
  filter(ATC2 == "A01") %>%
  autoplot(Cost) +
  ggtitle("Cost of scripts in $AUD for ATC2 code A01 (Anatomical Therapeutic Chemical index, level 2)")

What is the time interval of these costs?

interval(PBS)
## <interval[1]>
## [1] 1M

The time interval of the dataset is monthly (1M).

vic_elec

The vic_elec data represents the half-hourly electricity demand for Victoria, Australia. This data is for operational demand, which is the demand met by local scheduled generating units, semi-scheduled generating units, and non-scheduled intermittent generating units of aggregate capacity larger than 30 MW, and by generation imports to the region. The operational demand excludes the demand met by non-scheduled non-intermittent generating units, non-scheduled intermittent generating units of aggregate capacity smaller than 30 MW, exempt generation (e.g. rooftop solar, gas tri-generation, very small wind farms, etc.), and demand of local scheduled loads. It also excludes some very large industrial users (such as mines or smelters).

summary(vic_elec)
##       Time                         Demand      Temperature   
##  Min.   :2012-01-01 00:00:00   Min.   :2858   Min.   : 1.50  
##  1st Qu.:2012-09-30 22:52:30   1st Qu.:3969   1st Qu.:12.30  
##  Median :2013-07-01 22:45:00   Median :4635   Median :15.40  
##  Mean   :2013-07-01 22:45:00   Mean   :4665   Mean   :16.27  
##  3rd Qu.:2014-04-01 23:37:30   3rd Qu.:5244   3rd Qu.:19.40  
##  Max.   :2014-12-31 23:30:00   Max.   :9345   Max.   :43.20  
##       Date             Holiday       
##  Min.   :2012-01-01   Mode :logical  
##  1st Qu.:2012-09-30   FALSE:51120    
##  Median :2013-07-01   TRUE :1488     
##  Mean   :2013-07-01                  
##  3rd Qu.:2014-04-01                  
##  Max.   :2014-12-31
head(vic_elec)
## # A tsibble: 6 x 5 [30m] <Australia/Melbourne>
##   Time                Demand Temperature Date       Holiday
##   <dttm>               <dbl>       <dbl> <date>     <lgl>  
## 1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
## 2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
## 3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
## 4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
## 5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
## 6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE
autoplot(vic_elec, Demand) +
  ggtitle("Half-hourly electricity demand for Victoria, Australia") +
  xlab("Time") +
  ylab("Demand (MWh)")

What is the time interval of this demand?

interval(vic_elec)
## <interval[1]>
## [1] 30m

The time interval is 30 minutes.

pelt

The pelt data contains Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935, covering all areas of the company.

summary(pelt)
##       Year           Hare             Lynx      
##  Min.   :1845   Min.   :  1800   Min.   : 3190  
##  1st Qu.:1868   1st Qu.: 12730   1st Qu.:11840  
##  Median :1890   Median : 40970   Median :29590  
##  Mean   :1890   Mean   : 45406   Mean   :28337  
##  3rd Qu.:1912   3rd Qu.: 71405   3rd Qu.:42520  
##  Max.   :1935   Max.   :152650   Max.   :79350
head(pelt)
## # A tsibble: 6 x 3 [1Y]
##    Year  Hare  Lynx
##   <dbl> <dbl> <dbl>
## 1  1845 19580 30090
## 2  1846 19600 45150
## 3  1847 19610 49150
## 4  1848 11990 39520
## 5  1849 28040 21230
## 6  1850 58000  8420
pelt %>%
  pivot_longer(c(Hare, Lynx), names_to = "Animal", values_to = "Pelts") %>%
  autoplot(Pelts) +
  ggtitle("The number of Snowshoe Hare pelts vs Canadian Lynx pelts traded") +
  xlab("Year") +
  ylab("Pelts traded")

What is the time interval of these trading records?

interval(pelt)
## <interval[1]>
## [1] 1Y

The time interval of the dataset is 1 year.

2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

Group the data by Symbol and keep, for each stock, the row where Close equals its maximum:

filter_gafa <- gafa_stock %>% 
             group_by(Symbol) %>%
             filter(Close == max(Close)) %>%
             arrange(desc(Close))
filter_gafa
## # A tsibble: 4 x 8 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100
## 2 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600
## 3 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800
## 4 FB     2018-07-25  216.  219.  214.  218.      218. 58954200

The peak closing price, and the day it occurred, for each stock:

  • AMZN: $2,039.51 on 2018-09-04

  • GOOG: $1,268.33 on 2018-07-26

  • AAPL: $232.07 on 2018-10-03

  • FB: $217.50 on 2018-07-25

Interesting to see how much has changed from then to now!
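
An equivalent idiom (a sketch; it converts to a plain tibble first so that standard dplyr grouping applies) uses slice_max():

gafa_stock %>%
  as_tibble() %>%               # drop the tsibble structure
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%   # row with the highest Close per symbol
  ungroup() %>%
  arrange(desc(Close))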

2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

    1. You can read the data into R with the following script:
tute1 <- read.csv("https://raw.githubusercontent.com/BharaniNittala/DATA624/main/HW1/tute1.csv") 
head(tute1)
##      Quarter  Sales AdBudget   GDP
## 1 1981-03-01 1020.2    659.2 251.8
## 2 1981-06-01  889.2    589.0 290.9
## 3 1981-09-01  795.0    512.5 290.8
## 4 1981-12-01 1003.9    614.1 292.4
## 5 1982-03-01 1057.7    647.2 279.1
## 6 1982-06-01  944.4    602.0 254.0
    2. Convert the data to time series
mytimeseries <- tute1 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter)

head(mytimeseries)
## # A tsibble: 6 x 4 [1Q]
##   Quarter Sales AdBudget   GDP
##     <qtr> <dbl>    <dbl> <dbl>
## 1 1981 Q1 1020.     659.  252.
## 2 1981 Q2  889.     589   291.
## 3 1981 Q3  795      512.  291.
## 4 1981 Q4 1004.     614.  292.
## 5 1982 Q1 1058.     647.  279.
## 6 1982 Q2  944.     602   254
    3. Construct time series plots of each of the three series. Check what happens when you don’t include facet_grid().
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() 

Each of the three time series has a different range of values. When facet_grid() is not included, everything is plotted on a single scale; this makes comparison across series easy, but we may miss patterns specific to an individual series.

2.4

The USgas package contains data on the demand for natural gas in the US.

    1. Install the USgas package.
if (!require('USgas')) (install.packages('USgas'))
library(USgas)
    2. Create a tsibble from us_total with year as the index and state as the key.
us_total_new <- us_total %>%
  as_tsibble(index = year, key = state)

head(us_total_new)
## # A tsibble: 6 x 3 [1Y]
## # Key:       state [1]
##    year state        y
##   <int> <chr>    <int>
## 1  1997 Alabama 324158
## 2  1998 Alabama 329134
## 3  1999 Alabama 337270
## 4  2000 Alabama 353614
## 5  2001 Alabama 332693
## 6  2002 Alabama 379343
    3. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
newengland_gas <- us_total_new %>%
  filter(state %in% c('Maine', 'Vermont', 'New Hampshire',
                      'Massachusetts', 'Connecticut', 'Rhode Island')) %>%
  mutate(y = y / 1000)  # express consumption in thousands

head(newengland_gas)
## # A tsibble: 6 x 3 [1Y]
## # Key:       state [1]
##    year state           y
##   <int> <chr>       <dbl>
## 1  1997 Connecticut  145.
## 2  1998 Connecticut  131.
## 3  1999 Connecticut  152.
## 4  2000 Connecticut  160.
## 5  2001 Connecticut  146.
## 6  2002 Connecticut  178.
autoplot(newengland_gas, y) +
  labs(title = "The annual natural gas consumption by state",
       subtitle = "New England Zone",
       y = "Consumption in thousands")

2.5

    1. Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
if (!require('readxl')) (install.packages('readxl'))
library("readxl") # library for read_excel() method
tourism_xlsx <- readxl::read_excel("C:/Users/nittalab/Documents/Personal/CUNY/DATA 624/HW1/tourism.xlsx")
head(tourism_xlsx)
## # A tibble: 6 × 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.
    2. Create a tsibble which is identical to the tourism tsibble from the tsibble package.
tourism_xlsx_tb <- tourism_xlsx %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter, key = c(Region, State, Purpose))
head(tourism_xlsx_tb)
## # A tsibble: 6 x 5 [1Q]
## # Key:       Region, State, Purpose [1]
##   Quarter Region   State           Purpose  Trips
##     <qtr> <chr>    <chr>           <chr>    <dbl>
## 1 1998 Q1 Adelaide South Australia Business  135.
## 2 1998 Q2 Adelaide South Australia Business  110.
## 3 1998 Q3 Adelaide South Australia Business  166.
## 4 1998 Q4 Adelaide South Australia Business  127.
## 5 1999 Q1 Adelaide South Australia Business  137.
## 6 1999 Q2 Adelaide South Australia Business  200.
    3. Find what combination of Region and Purpose had the maximum number of overnight trips on average.
# average overnight trips by Region and Purpose, then keep the combination with the highest average
tourism_xlsx_tb %>%
  group_by(Region, Purpose) %>%
  summarize(avg_trip_length = mean(Trips), .groups = "keep") %>%
  ungroup() %>%
  filter(avg_trip_length == max(avg_trip_length))
## # A tsibble: 1 x 4 [1Q]
## # Key:       Region, Purpose [1]
##   Region    Purpose  Quarter avg_trip_length
##   <chr>     <chr>      <qtr>           <dbl>
## 1 Melbourne Visiting 2017 Q4            985.
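
Note that summarise() on a tsibble keeps the Quarter index, so the result above is the single quarter with the highest Region/Purpose average. To average over all quarters instead, the index can be dropped first; a minimal sketch (output not shown, and the winning combination may differ from the quarterly result above):

tourism_xlsx_tb %>%
  as_tibble() %>%                                     # drop the Quarter index
  group_by(Region, Purpose) %>%
  summarise(avg_trips = mean(Trips), .groups = "drop") %>%
  slice_max(avg_trips, n = 1)                         # highest average overall
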
    4. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
# total trips by State; summarise() on a tsibble keeps the Quarter index
t_by_state <- tourism_xlsx_tb %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

head(t_by_state)
## # A tsibble: 6 x 3 [1Q]
## # Key:       State [1]
##   State Quarter Trips
##   <chr>   <qtr> <dbl>
## 1 ACT   1998 Q1  551.
## 2 ACT   1998 Q2  416.
## 3 ACT   1998 Q3  436.
## 4 ACT   1998 Q4  450.
## 5 ACT   1999 Q1  379.
## 6 ACT   1999 Q2  558.

2.8

Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):

set.seed(12345678)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Explore your chosen retail time series using the following functions:

autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()

autoplot(myseries, Turnover) +
  ggtitle("Series: A3349767W") +
  xlab("Time") +
  ylab("Turnover")

myseries %>%
  gg_season(Turnover) +
  labs(title = "Turnover for Clothing, footwear and personal accessory retailing",
       subtitle = "Series: A3349767W",
       y = "Turnover")

There appears to be a mostly increasing trend, with the exception of a slight dip after 2010. The autoplot shows evidence of seasonality, visible as regular fluctuations within each year, and the seasonal plot lets us drill down further. It reveals a spike in consumer spending from November to December, and the height of that spike grows almost every year, which could reflect an increasingly consumption-driven culture. The charts confirm December as a major retail month.
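
The remaining diagnostic plots from the exercise can be produced in the same way (a sketch; output omitted here):

myseries %>% gg_subseries(Turnover)            # one mini-plot per month
myseries %>% gg_lag(Turnover, geom = "point")  # lagged scatterplots
myseries %>% ACF(Turnover) %>% autoplot()      # autocorrelation function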