Timeseries Data 624

Jack Wright

2.1

Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent. a.Use autoplot() to plot some of the series in these data sets. b.What is the time interval of each series?

Lets do some EDA on the datasets provided

gafa_stock

#help(gafa_stock)

gafa_stock are historical stock prices for Google, Amazon, Facebook and Apple from 2014 to 2018. Including Open, High, Low, Close, adjusted Close and Volume.

Note how there does not seem to be any discernable pattern in the data, this tracks because if stock data was predictable in this way, I wouldn’t be in class right now.

This is a faceted version of the stocks of interest. Note the different scales in the y axis allow an easier comparison of relative growth.

PBS

#help(PBS)

PBS is a monthly table with two values, total number of Scripts, and Cost of the scripts (the index is a year and month)

When using the autoplot function, the period is given on the x axis; in this case ‘month.’

autoplot(PBS%>%filter(ATC2=='A10'))

vic_elec

#help(vic_elec)

vic_elec describes three values, total electricity demand, temperature, and boolean for holiday in Melbourne Australia in 30 minute intervals

autoplot(vic_elec)
## Plot variable not specified, automatically selected `.vars = Demand`

As confirmed by help, the time interval is 30 minutes

Pelt

#help(pelt)

pelt are Hudson Bay Trading company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. The interval is yearly and the values are total Hare and total Lynx pelts sold respectively

autoplot(pelt)
## Plot variable not specified, automatically selected `.vars = Hare`

2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

Symbol Date Close
AAPL 2018-10-03 232.07
AMZN 2018-09-04 2039.51
FB 2018-07-25 217.50
GOOG 2018-07-26 1268.33

2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

tute1<-read.csv('https://bit.ly/fpptute1')

Convert into time series

mytimeseries<-tute1%>%
  mutate(Quarter = yearmonth(Quarter))%>%
  as_tibble(index = Quarter)

Construct time series plots of each of the three slices

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid() When facet_grid() is not used, all of the variables are plotted on the same grid.

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() 

2.4

The USgas package contains data on the demand for natural gas in the US.

install USgas package

library(USgas)
## Warning: package 'USgas' was built under R version 4.1.2
  1. Create a tsibble from us_total with years the index and state as the key
dat<-tsibble(as_tibble(us_total),key = state,  index = year)
kbl(head(dat),booktabs=T)
year state y
1997 Alabama 324158
1998 Alabama 329134
1999 Alabama 337270
2000 Alabama 353614
2001 Alabama 332693
2002 Alabama 379343
  1. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

2.5

Download tourism.xlsx from the book website and read it into R using readxl::read_excel().

library(rio)
tourism_1<-rio::import('https://bit.ly/fpptourism')%>%mutate(Quarter = yearquarter(Quarter) )

The R function all.equal() compares if all the elements in two objects are equal to eachother

Create a tsibble which is identical to the tourism tsibble from the tsibble package.

tourism_1<-tourism_1%>%as_tsibble( key = c('Region', 'State','Purpose'),index = Quarter)

all.equal(tourism_1,tsibble::tourism)
## [1] TRUE

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

Quarter Region State Purpose Trips
2017 Q4 Melbourne Victoria Visiting 985.2784

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

a<-tourism_1%>%unite(Purpose_Region, c(Purpose,Region))%>%group_by(State)%>%summarise(Trips = sum(Trips))
kbl(a)
State Trips
ACT 41006.59
New South Wales 557367.43
Northern Territory 28613.68
Queensland 386642.91
South Australia 118151.35
Tasmania 54137.09
Victoria 390462.91
Western Australia 147819.65

2.8

Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):

set.seed(321)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Explore your chosen retail time series using the following functions:

autoplot() from the fpp3 package plots this data as a time series from beginning to end. Intuition tells me there is some pattern in this data, but it isn’t clear at this scope.

autoplot

autoplot(myseries)

gg_season

gg_season coerces the data into year long increments and plots them over eachother. There does seem to be a seasonal trend, with increasing turnover at the end of the year.

gg_season(myseries, labels = 'both')

gg_subseries

gg_subseries facets the data into months and presents the mean value as a blue line. We are now given more concrete evidence that there is a yearly cycle in Turnover.

gg_subseries(myseries)

gg_lag()

Note the fanning nature of the lag plots. This is a common shape in plotting when both variables have an exponential shape. This could be due to the exponential growth year to year of Turnover.

gg_lag(myseries)

myseries%>%
  ACF(Turnover)%>%
  autoplot()+labs(title = 'Turnover correlogram')

Recall the ACF of a trended time series has positive values that slowly decrease as lag increases.

When data are seasonal, autocorrelations will be larger for seasonal lags, at multiples of the season. In this case, the ‘scallops’ are at year markers, confirming the yearly trend.

Further Exploration into aus_retail

Boxplots and ANOVA

My goal of creating this plot was to see if the difference between the months was statistically significant, maybe with an ANOVA test. The more I think about it, the variance per month is probably due mostly to the mean Turnover per year as opposed to variance around the mean in a given month. I thought It was worthy of including to illuminate my thought process.

myseries['month_index']=as.factor(month(myseries$Month))
ggplot(myseries,aes(x=month_index,y=Turnover, color = month_index))+geom_boxplot()+ggtitle('Boxplots by month for Turnover')

After viewing the boxplot, I wonder if this trend is statistically significant. I then look back at the season plot and see that every single year increases between the 11th and 12th month. I then want to answer the question “What is the probability that all 37 years increase if there is no trend in the data?”

Binomial Experiments with Trend

If there was no trend, each year would have a 50% probability of increasing (success). What is the probability of having 37 successes in 37 trials if the probability is 50%?

## probability of 25 successes in 25 trials: 0
Let’s expand this concept out to all of the possible month combinations.

months where there is more than 50% probability that there is a pattern. see appendix for code

month p
Jan-Feb 1.00
March-April 0.99
May-June 0.99
Aug-Sept 1.00
Dec-Jan 1.00

This does not tell us if there is an increase or decrease, but there is some signal we could further inspect.

Fourier transform on timeseries

When I look at a time series, it looks like a superposition of waves, kind of like an FM signal. Decomposing this wave into a fourier series might give us some more insight into what combination of trends we might be looking at.

To inspect what frequencies dominate this timeseries, I want to do a fourier analysis

looking at the monthly fourier transform

The fourier analysis implies that the signal Turnover can approximated into approximately six main wave functions.

meas<-myseries%>%mutate(date.month=month(Month))%>%select(date.month,Turnover)
mspect<-spectrum(meas$Turnover,log='no',spans=c(2,2),plot=FALSE)

specx<-mspect$freq
specy<-2*mspect$spec
plot(specx,specy, xlab='Period(years)',ylab = 'spectral density',type='l',main='Spectral density by Month of Turnover')

This is just my intuition about how one might find the trends in timeseries data. I don’t fully know how to convert frequency back into months, since the intervals in our data are discrete and the frequencies above clearly arent. Since frequency is cycles per second, and our scale is cycles per year. I assume you could think of .1 cycles per year as a period of 10 years? Which seems to match up pretty will with a macro cycle in the data.

I am also not totally sure if the unit in this spectral analysis is months or years. This was just my exploratory attempt. I will continue to research this, and hope to learn more about time-series decomposition next week.

Appendix:

code for generating binomial probabilities of trend:

build_compare_frame<-function(yearmonth){
  #builds the dataframe comparing one month to the next
  n <- myseries%>%filter(month(Month)==month(yearmonth))%>%select(Turnover)
  n1<-myseries%>%filter(month(Month)==month(yearmonth+1))%>%select(Turnover)
  #print (c(nrow(n),nrow(n1)))
  if (nrow(n)!=nrow(n1)){
    lowval = min(nrow(n),nrow(n1))
    return(data.frame(n[1:lowval,],n1[1:lowval,]))
  }
  return(data.frame(n,n1))
  
}

get_list_of_first_months<-function(series=myseries){
  #gets first occurence of every month
  unique_months<-sort(unique(month(series$Month)))
  output_list<-list()
  for(i in unique_months){
    first<-series%>%filter(month(Month)==i)%>%select(Month)%>%filter(Month==min(Month))
    
    output_list<-append(output_list,first)
  }
  return(output_list)
}
get_p<-function(a){
  #returns cumulative density function given number of successes vs probability of there being no trend. 
row<-a%>%
  mutate(
    increase = case_when(
      Turnover<Turnover.1 ~ TRUE,
      TRUE ~ FALSE
    )
  )%>%
  summarize(month = unique(month(Month)),
                           wins = sum(increase),
                           total = length(increase),
                           p = round(1-pbinom(wins,size=total, prob = .5),2))
  return(row)
}


build_probabilities<-function(myseries = myseries){
  #generate output table
  first_months<-get_list_of_first_months()
  cnames<-c('wins','total','p')
  output_df<-data.frame(matrix(ncol = 3, nrow = 0))
  colnames(output_df)<-cnames
  
  for (i in first_months){
    compare_frame<-build_compare_frame(i[1])
    output_df<-rbind(output_df,get_p(compare_frame))
  }
  return(output_df)
  
  
}
df<-build_probabilities()
df<-df%>%mutate(month=c("Jan-Feb","Feb-March", "March-April",'April-May','May-June','June-July','July-Aug','Aug-Sept','Sept-Oct','Oct-Nov','Nov-Dec','Dec-Jan'))