Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent. a.Use autoplot() to plot some of the series in these data sets. b.What is the time interval of each series?
Lets do some EDA on the datasets provided
#help(gafa_stock)
gafa_stock are historical stock prices for Google, Amazon, Facebook and Apple from 2014 to 2018. Including Open, High, Low, Close, adjusted Close and Volume.
Note how there does not seem to be any discernable pattern in the data, this tracks because if stock data was predictable in this way, I wouldn’t be in class right now.
This is a faceted version of the stocks of interest. Note the different scales in the y axis allow an easier comparison of relative growth.
#help(PBS)
PBS is a monthly table with two values, total number of Scripts, and Cost of the scripts (the index is a year and month)
When using the autoplot function, the period is given on the x axis; in this case ‘month.’
autoplot(PBS%>%filter(ATC2=='A10'))
#help(vic_elec)
vic_elec describes three values, total electricity demand, temperature, and boolean for holiday in Melbourne Australia in 30 minute intervals
autoplot(vic_elec)
## Plot variable not specified, automatically selected `.vars = Demand`
As confirmed by help, the time interval is 30 minutes
#help(pelt)
pelt are Hudson Bay Trading company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. The interval is yearly and the values are total Hare and total Lynx pelts sold respectively
autoplot(pelt)
## Plot variable not specified, automatically selected `.vars = Hare`
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
| Symbol | Date | Close |
|---|---|---|
| AAPL | 2018-10-03 | 232.07 |
| AMZN | 2018-09-04 | 2039.51 |
| FB | 2018-07-25 | 217.50 |
| GOOG | 2018-07-26 | 1268.33 |
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
tute1<-read.csv('https://bit.ly/fpptute1')
Convert into time series
mytimeseries<-tute1%>%
mutate(Quarter = yearmonth(Quarter))%>%
as_tibble(index = Quarter)
Construct time series plots of each of the three slices
mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
Check what happens when you don’t include facet_grid() When facet_grid() is not used, all of the variables are plotted on the same grid.
mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()
The USgas package contains data on the demand for natural gas in the US.
install USgas package
library(USgas)
## Warning: package 'USgas' was built under R version 4.1.2
us_total with years the index and state as the keydat<-tsibble(as_tibble(us_total),key = state, index = year)
kbl(head(dat),booktabs=T)
| year | state | y |
|---|---|---|
| 1997 | Alabama | 324158 |
| 1998 | Alabama | 329134 |
| 1999 | Alabama | 337270 |
| 2000 | Alabama | 353614 |
| 2001 | Alabama | 332693 |
| 2002 | Alabama | 379343 |
Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
library(rio)
tourism_1<-rio::import('https://bit.ly/fpptourism')%>%mutate(Quarter = yearquarter(Quarter) )
The R function all.equal() compares if all the elements in two objects are equal to eachother
Create a tsibble which is identical to the tourism tsibble from the tsibble package.
tourism_1<-tourism_1%>%as_tsibble( key = c('Region', 'State','Purpose'),index = Quarter)
all.equal(tourism_1,tsibble::tourism)
## [1] TRUE
Find what combination of Region and Purpose had the maximum number of overnight trips on average.
| Quarter | Region | State | Purpose | Trips |
|---|---|---|---|---|
| 2017 Q4 | Melbourne | Victoria | Visiting | 985.2784 |
Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
a<-tourism_1%>%unite(Purpose_Region, c(Purpose,Region))%>%group_by(State)%>%summarise(Trips = sum(Trips))
kbl(a)
| State | Trips |
|---|---|
| ACT | 41006.59 |
| New South Wales | 557367.43 |
| Northern Territory | 28613.68 |
| Queensland | 386642.91 |
| South Australia | 118151.35 |
| Tasmania | 54137.09 |
| Victoria | 390462.91 |
| Western Australia | 147819.65 |
Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):
set.seed(321)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
Explore your chosen retail time series using the following functions:
autoplot() from the fpp3 package plots this data as a time series from beginning to end. Intuition tells me there is some pattern in this data, but it isn’t clear at this scope.
autoplot(myseries)
gg_season coerces the data into year long increments and plots them over eachother. There does seem to be a seasonal trend, with increasing turnover at the end of the year.
gg_season(myseries, labels = 'both')
gg_subseries facets the data into months and presents the mean value as a blue line. We are now given more concrete evidence that there is a yearly cycle in Turnover.
gg_subseries(myseries)
Note the fanning nature of the lag plots. This is a common shape in plotting when both variables have an exponential shape. This could be due to the exponential growth year to year of Turnover.
gg_lag(myseries)
myseries%>%
ACF(Turnover)%>%
autoplot()+labs(title = 'Turnover correlogram')
Recall the ACF of a trended time series has positive values that slowly decrease as lag increases.
When data are seasonal, autocorrelations will be larger for seasonal lags, at multiples of the season. In this case, the ‘scallops’ are at year markers, confirming the yearly trend.
aus_retailMy goal of creating this plot was to see if the difference between the months was statistically significant, maybe with an ANOVA test. The more I think about it, the variance per month is probably due mostly to the mean Turnover per year as opposed to variance around the mean in a given month. I thought It was worthy of including to illuminate my thought process.
myseries['month_index']=as.factor(month(myseries$Month))
ggplot(myseries,aes(x=month_index,y=Turnover, color = month_index))+geom_boxplot()+ggtitle('Boxplots by month for Turnover')
After viewing the boxplot, I wonder if this trend is statistically significant. I then look back at the season plot and see that every single year increases between the 11th and 12th month. I then want to answer the question “What is the probability that all 37 years increase if there is no trend in the data?”
If there was no trend, each year would have a 50% probability of increasing (success). What is the probability of having 37 successes in 37 trials if the probability is 50%?
## probability of 25 successes in 25 trials: 0
Let’s expand this concept out to all of the possible month combinations.
months where there is more than 50% probability that there is a pattern. see appendix for code
| month | p |
|---|---|
| Jan-Feb | 1.00 |
| March-April | 0.99 |
| May-June | 0.99 |
| Aug-Sept | 1.00 |
| Dec-Jan | 1.00 |
This does not tell us if there is an increase or decrease, but there is some signal we could further inspect.
When I look at a time series, it looks like a superposition of waves, kind of like an FM signal. Decomposing this wave into a fourier series might give us some more insight into what combination of trends we might be looking at.
To inspect what frequencies dominate this timeseries, I want to do a fourier analysis
looking at the monthly fourier transform
The fourier analysis implies that the signal Turnover can approximated into approximately six main wave functions.
meas<-myseries%>%mutate(date.month=month(Month))%>%select(date.month,Turnover)
mspect<-spectrum(meas$Turnover,log='no',spans=c(2,2),plot=FALSE)
specx<-mspect$freq
specy<-2*mspect$spec
plot(specx,specy, xlab='Period(years)',ylab = 'spectral density',type='l',main='Spectral density by Month of Turnover')
This is just my intuition about how one might find the trends in timeseries data. I don’t fully know how to convert frequency back into months, since the intervals in our data are discrete and the frequencies above clearly arent. Since frequency is cycles per second, and our scale is cycles per year. I assume you could think of .1 cycles per year as a period of 10 years? Which seems to match up pretty will with a macro cycle in the data.
I am also not totally sure if the unit in this spectral analysis is months or years. This was just my exploratory attempt. I will continue to research this, and hope to learn more about time-series decomposition next week.
code for generating binomial probabilities of trend:
build_compare_frame<-function(yearmonth){
#builds the dataframe comparing one month to the next
n <- myseries%>%filter(month(Month)==month(yearmonth))%>%select(Turnover)
n1<-myseries%>%filter(month(Month)==month(yearmonth+1))%>%select(Turnover)
#print (c(nrow(n),nrow(n1)))
if (nrow(n)!=nrow(n1)){
lowval = min(nrow(n),nrow(n1))
return(data.frame(n[1:lowval,],n1[1:lowval,]))
}
return(data.frame(n,n1))
}
get_list_of_first_months<-function(series=myseries){
#gets first occurence of every month
unique_months<-sort(unique(month(series$Month)))
output_list<-list()
for(i in unique_months){
first<-series%>%filter(month(Month)==i)%>%select(Month)%>%filter(Month==min(Month))
output_list<-append(output_list,first)
}
return(output_list)
}
get_p<-function(a){
#returns cumulative density function given number of successes vs probability of there being no trend.
row<-a%>%
mutate(
increase = case_when(
Turnover<Turnover.1 ~ TRUE,
TRUE ~ FALSE
)
)%>%
summarize(month = unique(month(Month)),
wins = sum(increase),
total = length(increase),
p = round(1-pbinom(wins,size=total, prob = .5),2))
return(row)
}
build_probabilities<-function(myseries = myseries){
#generate output table
first_months<-get_list_of_first_months()
cnames<-c('wins','total','p')
output_df<-data.frame(matrix(ncol = 3, nrow = 0))
colnames(output_df)<-cnames
for (i in first_months){
compare_frame<-build_compare_frame(i[1])
output_df<-rbind(output_df,get_p(compare_frame))
}
return(output_df)
}
df<-build_probabilities()
df<-df%>%mutate(month=c("Jan-Feb","Feb-March", "March-April",'April-May','May-June','June-July','July-Aug','Aug-Sept','Sept-Oct','Oct-Nov','Nov-Dec','Dec-Jan'))