Daniel DeBonis - Assignment 1

Question 2.1

library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.4     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
options(repos = list(CRAN="http://cran.rstudio.com/"))

First, we pare the data sets down to the specified time series.

bricks <- aus_production |>
  dplyr::select(Bricks)

This data set contains 218 observations recorded at quarterly intervals.
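These figures can be checked directly; note that interval() below is tsibble's version, which masks lubridate's (as the startup message above indicates):

nrow(bricks)      # 218 observations
interval(bricks)  # quarterly interval (1Q)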

autoplot(bricks)
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

We can then use autoplot, which recognizes that this is quarterly data and generates the appropriate line graph.

lynxdf <- pelt |> select(-Hare)

This data set includes 91 observations recorded at yearly intervals.

autoplot(lynxdf)
## Plot variable not specified, automatically selected `.vars = Lynx`

Once again, the plot is generated with the yearly time interval recognized automatically.

closedf <- gafa_stock |>
  select(Date, Close)
autoplot(closedf)
## Plot variable not specified, automatically selected `.vars = Close`

For the third data set, we have 5,032 observations collected at a daily interval (trading days only). The plot compares the closing prices of the four stocks.
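Because GOOG and AMZN trade at far higher prices than AAPL and FB, a faceted version with free y scales can make each series easier to read. A sketch, relying on the Symbol key column that select() retains on a tsibble:

closedf |>
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  facet_grid(Symbol ~ ., scales = "free_y")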

demanddf <- vic_elec |>
  select(Demand)
autoplot(demanddf)
## Plot variable not specified, automatically selected `.vars = Demand`

For the final data set, we have 52,608 observations collected every half hour. We can improve the title and axes of the generated graph.

autoplot(demanddf) +
  labs(x = "Time", y = "Electricity Demand (MWh)") +
  ggtitle("Electricity Demand per Half-Hour")
## Plot variable not specified, automatically selected `.vars = Demand`

Question 2.2

This question is asking us to find the peak closing price for each stock in GAFA, so we should filter the data by stock.

applehigh <- gafa_stock |>
  filter(Symbol == "AAPL") |>
  select(Date, Close)
amznhigh <- gafa_stock |>
  filter(Symbol == "AMZN") |>
  select(Date, Close)
fbhigh <- gafa_stock |>
  filter(Symbol == "FB") |>
  select(Date, Close)
googhigh <- gafa_stock |>
  filter(Symbol == "GOOG") |>
  select(Date, Close)

Now we can identify the peak closing price for each stock and when it occurred.

arrange(applehigh, desc(Close))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by ``, `Date` first.
## # A tsibble: 1,258 x 2 [!]
##    Date       Close
##    <date>     <dbl>
##  1 2018-10-03  232.
##  2 2018-10-02  229.
##  3 2018-09-04  228.
##  4 2018-10-04  228.
##  5 2018-08-31  228.
##  6 2018-10-01  227.
##  7 2018-09-05  227.
##  8 2018-10-09  227.
##  9 2018-09-13  226.
## 10 2018-09-28  226.
## # ℹ 1,248 more rows

For Apple, the highest closing price was 232 USD, on 10/3/2018.

arrange(amznhigh, desc(Close))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by ``, `Date` first.
## # A tsibble: 1,258 x 2 [!]
##    Date       Close
##    <date>     <dbl>
##  1 2018-09-04 2040.
##  2 2018-09-27 2013.
##  3 2018-08-31 2013.
##  4 2018-10-01 2004.
##  5 2018-09-28 2003 
##  6 2018-08-30 2002.
##  7 2018-08-29 1998.
##  8 2018-09-05 1995.
##  9 2018-09-12 1990 
## 10 2018-09-13 1990.
## # ℹ 1,248 more rows

For Amazon, the highest closing price was 2040 USD, on 9/4/2018.

arrange(fbhigh, desc(Close))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by ``, `Date` first.
## # A tsibble: 1,258 x 2 [!]
##    Date       Close
##    <date>     <dbl>
##  1 2018-07-25  218.
##  2 2018-07-24  215.
##  3 2018-07-23  211.
##  4 2018-07-17  210.
##  5 2018-07-20  210.
##  6 2018-07-18  209.
##  7 2018-07-19  208.
##  8 2018-07-13  207.
##  9 2018-07-16  207.
## 10 2018-07-12  207.
## # ℹ 1,248 more rows

For Facebook, the highest closing price was 218 USD, on 7/25/2018.

arrange(googhigh, desc(Close))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by ``, `Date` first.
## # A tsibble: 1,258 x 2 [!]
##    Date       Close
##    <date>     <dbl>
##  1 2018-07-26 1268.
##  2 2018-07-25 1264.
##  3 2018-08-29 1249.
##  4 2018-08-09 1249.
##  5 2018-07-24 1248.
##  6 2018-08-08 1246.
##  7 2018-08-07 1242.
##  8 2018-08-14 1242.
##  9 2018-08-27 1242.
## 10 2018-08-30 1239.
## # ℹ 1,248 more rows

For Google, the highest closing price was 1268 USD, on 7/26/2018.
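As an alternative to filtering each stock separately, all four maxima can be found in one pipeline; a sketch (as_tibble() drops the temporal ordering, so no warning is raised):

gafa_stock |>
  as_tibble() |>
  group_by(Symbol) |>
  slice_max(Close, n = 1) |>
  select(Symbol, Date, Close)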

Question 2.3

First, we must import the CSV file.

tute1 <- read.csv('https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/tute1.csv')

Then we convert to a time series using the script provided in the textbook.

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

We can then compare the three variables in separate stacked panels using the rest of the provided code.

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

If we remove the facet_grid line from the code, the three series are combined into a single plot that shares one “value” y axis, with the groups distinguished by the colour legend.

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

Question 2.4

The first step is to install and load the USgas package, which contains the data.

install.packages("USgas")
library(USgas)

Since the data is not in the form of a tsibble, we need to transform it so that it can be processed as a time series.

ustsibble <- us_total |>
  as_tsibble(index = year, key = state)

The question asks for a comparison of gas usage across New England, so we need to filter the data to include only the relevant states.

newengland <- ustsibble |>
  filter(state %in% c("Maine", "Vermont", "New Hampshire",
                      "Connecticut", "Massachusetts", "Rhode Island"))

Now let’s see what the data looks like with the autoplot function.

autoplot(newengland)
## Plot variable not specified, automatically selected `.vars = y`
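As with the electricity demand plot, we can improve the labels; the units below (million cubic feet) are an assumption based on the USgas package documentation:

autoplot(newengland) +
  labs(x = "Year", y = "Consumption (million cubic feet)") +
  ggtitle("Annual Natural Gas Consumption in New England")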

Question 2.5

The data for this question is stored in an Excel file, so we need to use the appropriate function to import the data to R.

library(openxlsx)
tourismurl <- "https://www.github.com/ddebonis47/classwork/raw/refs/heads/main/tourism.xlsx"
df <- read.xlsx(tourismurl)
head(df)
##      Quarter   Region           State  Purpose    Trips
## 1 1998-01-01 Adelaide South Australia Business 135.0777
## 2 1998-04-01 Adelaide South Australia Business 109.9873
## 3 1998-07-01 Adelaide South Australia Business 166.0347
## 4 1998-10-01 Adelaide South Australia Business 127.1605
## 5 1999-01-01 Adelaide South Australia Business 137.4485
## 6 1999-04-01 Adelaide South Australia Business 199.9126

Now we need to convert this from a data frame to a time series so that it matches the tourism tsibble from the tsibble package.

touritsble <- df |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter, key = c(Region, State, Purpose))

To find the combination of Region and Purpose with the highest average number of trips, we need to group by those two variables. It also helps to sort the output in descending order so the maximum result is easy to see.

topdf <- df |>
  group_by(Region, Purpose)
topdf2 <- topdf |>
  summarize(Total = mean(Trips))
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
topdf3 <- topdf2 |>
  arrange(desc(Total))
head(topdf3)
## # A tibble: 6 × 3
## # Groups:   Region [4]
##   Region          Purpose  Total
##   <chr>           <chr>    <dbl>
## 1 Sydney          Visiting  747.
## 2 Melbourne       Visiting  619.
## 3 Sydney          Business  602.
## 4 North Coast NSW Holiday   588.
## 5 Sydney          Holiday   550.
## 6 Gold Coast      Holiday   528.

Now we can clearly see that the highest average was for trips visiting Sydney, followed by visits to Melbourne and business trips to Sydney.
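For reference, dplyr's slice_max() collapses these steps into a single pipeline; this is a sketch of the same computation, not a different answer:

df |>
  group_by(Region, Purpose) |>
  summarize(Total = mean(Trips), .groups = "drop") |>
  slice_max(Total, n = 1)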

Let us create a new tsibble that aggregates over Region and Purpose, leaving total trips by State per quarter.

df4 <- touritsble |>
  group_by(State)
df5 <- df4 |>
  summarize(Trips = sum(Trips))
head(df5)
## # A tsibble: 6 x 3 [1Q]
## # Key:       State [1]
##   State Quarter Trips
##   <chr>   <qtr> <dbl>
## 1 ACT   1998 Q1  551.
## 2 ACT   1998 Q2  416.
## 3 ACT   1998 Q3  436.
## 4 ACT   1998 Q4  450.
## 5 ACT   1999 Q1  379.
## 6 ACT   1999 Q2  558.
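A quick plot of this new tsibble (a sketch) draws one line per state and confirms the aggregation behaves as expected:

autoplot(df5, Trips)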

Question 2.8

Let’s go through these time series one by one.

private <- us_employment |>     
  filter(Title == "Total Private")
autoplot(private)
## Plot variable not specified, automatically selected `.vars = Employed`

gg_season(private)
## Plot variable not specified, automatically selected `y = Employed`

gg_subseries(private)
## Plot variable not specified, automatically selected `y = Employed`

gg_lag(private)
## Plot variable not specified, automatically selected `y = Employed`

private |>
  ACF(Employed) |>
  autoplot()

Overall, there has been an increase in the number of privately employed people over time. A seasonal cycle repeats yearly, with employment peaking in summer and falling by mid-winter, and the seasonal plot shows striking consistency. One notable pattern is a downward shift every few years, often corresponding to recessions or market crashes; 2008 and 1984 are two years that exemplify this cycle. These cyclic downturns run against the overall year-to-year upward trend in employment.

autoplot(bricks)
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_season(bricks)
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_subseries(bricks)
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_lag(bricks)
## Plot variable not specified, automatically selected `y = Bricks`
## Warning: Removed 20 rows containing missing values (gg_lag).

bricks |>
  ACF(Bricks) |>
  autoplot()

This data set lacks the clear positive trend we saw in the last one. Brick production peaked around 1980, stayed high for about ten years, and has fallen since. Generally speaking, production was higher in Q2 and Q3, with a notable dip in Q1, so seasonality is present. Several years (1977, 1982, 1990, 1996, and 2001) saw precipitous drops in production, suggesting a cycle in which production falls every several years and then returns to relative normalcy. The ACF plot shows the general decline over time: the cyclical effects appear in the relative rise and fall of the spikes, but the downward trend is clear.

haredf <- pelt |> select(-Lynx)
autoplot(haredf)
## Plot variable not specified, automatically selected `.vars = Hare`

gg_lag(haredf)
## Plot variable not specified, automatically selected `y = Hare`

haredf |>
  ACF(Hare) |>
  autoplot()

Since the data is provided yearly, we do not have the detail to compare trends within a year. What the data does show is a general cyclical pattern of hare fur production rising and falling in roughly ten-year increments. The early 1860s saw peaks in hare fur production, but it fell back to prior levels within a few years, only to begin rising again soon after. The increments are not uniform, but the cycle is clear. Once again, the autocorrelation function illustrates this cycle well, tracing a sinusoidal shape. There is not much of an overall trend, just the rise and fall of the cycle.

h02 <- PBS |>
  filter(ATC2 == "H02")
autoplot(h02, .vars = Cost)

gg_season(h02, y = Cost)

gg_subseries(h02, y = Cost)

h02 |>
  ACF(Cost) |>
  autoplot()

PBS refers to the Pharmaceutical Benefits Scheme of Australia, the government-subsidised prescription programme. The H02 category covers corticosteroids prescribed for systemic use. The data follows clear seasonal patterns. This seasonality shows most strongly in the cost of scripts classified as “Safety Net”, with lows in spring that rise through the rest of the year before falling precipitously the next February. The subseries plot helps identify the general trends, which are stronger for the concessional groups: these show a positive trend in cost over time, whereas costs in the general categories fluctuate without a strong trend. Unlike the other data sets, we also do not see individual years that stand out from the general pattern.
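Because h02 contains four key series (Concession crossed with Type), it can also be viewed as one aggregate series by summarising the cost over the keys, which tsibble does while keeping the monthly index; a sketch:

h02 |>
  summarise(TotalCost = sum(Cost)) |>
  autoplot(TotalCost)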

autoplot(us_gasoline)
## Plot variable not specified, automatically selected `.vars = Barrels`

gg_season(us_gasoline)
## Plot variable not specified, automatically selected `y = Barrels`

gysts <- us_gasoline |>
  index_by(year = lubridate::year(Week)) |>
  summarise(total = sum(Barrels))
## note: the final year does not include every week, hence the line plummeting at the end
gg_subseries(gysts)
## Plot variable not specified, automatically selected `y = total`

gg_lag(gysts)
## Plot variable not specified, automatically selected `y = total`

us_gasoline |>
  ACF(Barrels) |>
  autoplot()

This data is given weekly, but a week is not the most useful period for every type of graph here, so we also transform the data to a yearly basis. The positive trend in this data is best seen in the autoplot. The trend changes around 2007, with the number of barrels of gasoline supplied dipping slightly until the mid-2010s, but the positive trend seems to have picked up again by the end of the data set. The raw plot suggests some seasonality, but this hypothesis does not hold up in the seasonal plot. The absence of a visible seasonal pattern there was surprising; the more visible feature is the overall positive trend, since generally the older the data, the lower the number of barrels.
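One way to address the plummeting final point noted in the code comment above is to drop the incomplete last year before plotting; a sketch:

gysts |>
  filter(year < max(year)) |>
  autoplot(total)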