Learning Objectives

In this lesson students will learn the basics of time series:

  • What is a time series?
  • Why and when should we use time series?
  • How do you decompose a time series?

Motivation

Last class we saw how polynomial regression was used to model COVID-19 data in 2020; however, that model is inherently flawed because it assumes independence. When data are taken close together in time, they tend to be correlated, which violates this assumption; time series methods model that dependence directly.

Example 1: COVID

covid<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/complete.csv")

str(covid)
## 'data.frame':    4692 obs. of  10 variables:
##  $ Date                     : chr  "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
##  $ Name.of.State...UT       : chr  "Kerala" "Kerala" "Kerala" "Kerala" ...
##  $ Latitude                 : num  10.9 10.9 10.9 10.9 10.9 ...
##  $ Longitude                : num  76.3 76.3 76.3 76.3 76.3 ...
##  $ Total.Confirmed.cases    : num  1 1 2 3 3 3 3 3 3 3 ...
##  $ Death                    : chr  "0" "0" "0" "0" ...
##  $ Cured.Discharged.Migrated: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ New.cases                : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ New.deaths               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ New.recovered            : int  0 0 0 0 0 0 0 0 0 0 ...

WRANGLE!

library(tidyverse)
library(magrittr)

## NEEDS TO BE A DATE OBJECT 
covid$Date<-as.Date(covid$Date)

## UNIQUE STATES
covid<-covid%>%
  filter(Name.of.State...UT!="Telangana***")

states<-unique(covid$Name.of.State...UT)
#states

### CLEAN STATE DATA
### the first time a state appears in the data, 
### its cases are recorded under total cases 
### rather than new cases, so copy them over

for(i in 1:length(states)){
  thisState<-covid%>%
    filter(Name.of.State...UT==states[i])%>%
    arrange(Date)
  
  thisState$New.cases[1]<-thisState$Total.Confirmed.cases[1]
  
  if(i==1){
    covidW<-thisState
  }
  if(i>1){
    covidW<-covidW%>%
      rbind(thisState)
  }
}

## CREATE A VECTOR OF UNIQUE DATES
dateRange<-unique(covid$Date)

## INITIALIZE EMPTY VECTOR FOR CASES
cases<-c()

for(i in 1:length(dateRange)){
  
  ## CALCULATE NUMBER OF NEW CASES ACROSS ALL STATES
  ## (USING THE CLEANED DATA, covidW)
  covidThisDate<-covidW%>%
    filter(Date==dateRange[i])%$%
    sum(New.cases)
  
  cases<-c(cases, covidThisDate)
}

## Create a new data frame
newCaseDate<-data.frame(Date=dateRange, 
                        Daily_Cases=cases,
                        ## CUMULATIVE SUM
                        Total_Cases=cumsum(cases) 
)%>%
  mutate(Start=as.Date("2020-01-30"))%>%
  ## NUMBER OF DAYS SINCE START 1/30/2020
  mutate(Days=as.numeric(Date-Start))
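
As an aside, the same daily totals can be computed without an explicit loop; here is a minimal dplyr sketch (newCaseDate2 is a hypothetical name) that should agree with the loop above:

## GROUP BY DATE AND SUM NEW CASES ACROSS STATES
newCaseDate2<-covidW%>%
  group_by(Date)%>%
  summarise(Daily_Cases=sum(New.cases))%>%
  arrange(Date)%>%
  mutate(Total_Cases=cumsum(Daily_Cases),
         Days=as.numeric(Date-as.Date("2020-01-30")))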

Visualize

## Plot for New Cases in a day
ggplot(newCaseDate, aes(x=Days, y=Daily_Cases))+
  geom_line()+
  ggtitle("New Cases Daily")

## Plot for Total Cases
ggplot(newCaseDate, aes(x=Days, y=Total_Cases))+
  geom_line()+
  geom_point()+
  ggtitle("Total Case Over Time")

Time Series

TS Object

First we will create a time series object and plot it.

#Converting it to Time Series Object 
covidTS <- ts(newCaseDate$Total_Cases) 
ts.plot(covidTS)
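
Note that ts() can also carry explicit time metadata; as a sketch (covidTSdated is a hypothetical name), daily data starting on day 30 of 2020 could be encoded as:

## SAME SERIES, WITH A DAILY TIME INDEX
covidTSdated <- ts(newCaseDate$Total_Cases, start = c(2020, 30), frequency = 365)
ts.plot(covidTSdated)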

TS from Scratch

Time series are often modelled using ARIMA, which stands for auto-regressive (AR) integrated (I) moving average (MA). An ARIMA model can be thought of as having three parts:

  • AR (Autoregressive): A pattern of growth/decline in the data
  • I (Integrated): Differencing applied to the series to remove trend (modeling changes rather than raw values)
  • MA (Moving Average): Noise between consecutive time points

An ARIMA model is typically expressed with three parameters \(ARIMA(p,d,q)\).

The value of \(Y\) at time \(t\) is modeled as a linear function of its values at preceding time points (\(t-1, \ldots, t-p\)) and of the errors at previous time points.

The model takes the form: \[y_t=c+\phi_1 y_{t-1}+\cdots+\phi_p y_{t-p}+\theta_1 e_{t-1}+\cdots+\theta_q e_{t-q}+e_t\] where \(y\) denotes the series after \(d\) differences have been taken, \(e\) is an error term, and \(c\) is a constant.

The parameters of \(ARIMA(p,d,q)\) are described as:

  • \(p\): The number of preceding (“lagged”) \(Y\) values in the model
  • \(d\): The number of times the data is “differenced” to produce stationarity
  • \(q\): The number of preceding (“lagged”) error terms that are in our model
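
To make these parts concrete, here is a minimal simulation sketch using stats::arima.sim() (the coefficient values are arbitrary, chosen only for illustration):

## SIMULATE AN ARIMA(1,1,1) WITH AR COEF 0.6 AND MA COEF 0.4
set.seed(252)
simTS<-arima.sim(model=list(order=c(1,1,1), ar=0.6, ma=0.4), n=200)
ts.plot(simTS)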

For more information please read:

https://ademos.people.uic.edu/Chapter23.html#2_our_example_data:_generating_sine_waves_to_play_with

Stationarity

A common assumption in time series is that the data are stationary. The stationary property implies that the mean, variance, and autocorrelation structure do not change over time.

In order to assess whether the process is stationary, we can perform the Augmented Dickey-Fuller test, whose null hypothesis is that the series is non-stationary.

#install.packages("tseries")
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(covidTS)
## Warning in adf.test(covidTS): p-value greater than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  covidTS
## Dickey-Fuller = 4.1043, Lag order = 5, p-value = 0.99
## alternative hypothesis: stationary

In the output above, the p-value is large (0.99), so we fail to reject the null hypothesis: there is no evidence that these data are stationary.

A common strategy for making data stationary is differencing, which can stabilize a time series by removing trend.

Let’s try a second order difference:

## TAKE TWO DIFFERENCES
covidTS2 <- diff(covidTS , differences = 2) 

adf.test(covidTS2)
## Warning in adf.test(covidTS2): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  covidTS2
## Dickey-Fuller = -9.8155, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
plot(covidTS2)

Now the p-value is small (0.01), so there is evidence to suggest that the differenced data are stationary.

ACF and PACF

The following graphics are used to diagnose the proper parameter for the \(AR(p)\) and \(MA(q)\) portions.

  • Auto-Correlation Function (ACF) gives us the auto correlation of any series with its lagged values.

  • Partial Auto-Correlation Function (PACF) gives us the correlation between the series and a given lag after removing the correlation explained by the shorter, intervening lags.

For more information please read:

https://towardsdatascience.com/interpreting-acf-and-pacf-plots-for-time-series-forecasting-af0d6db4061c

#Plotting the ACF plot and PACF 
acf(covidTS2)

pacf(covidTS2)

Notice that the ACF plot shows significant spikes up to lag 2, while the PACF decays gradually; this is suggestive of an \(MA(2)\).
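
Based on this diagnosis, we could fit the candidate model directly; here is a minimal sketch using the stats::arima() function (fitMA2 is a hypothetical name). Passing order = c(0, 2, 2) applies the two differences for us, so we supply the original series:

## FIT THE CANDIDATE MODEL: ARIMA(0,2,2)
fitMA2 <- arima(covidTS, order = c(0, 2, 2))
fitMA2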

Auto ARIMA

It can be very difficult to tell what \(p\) and \(q\) to use by looking at the ACF and PACF. Thankfully, there is a function called auto.arima() that finds the best-fitting ARIMA model for you.

Total Cases
#install.packages("forecast")
library(forecast)

#Fitting auto.arima ()
fit1 <- auto.arima(newCaseDate$Total_Cases, seasonal = FALSE) 
fit1
## Series: newCaseDate$Total_Cases 
## ARIMA(1,2,2) 
## 
## Coefficients:
##          ar1      ma1     ma2
##       0.5747  -1.5243  0.7087
## s.e.  0.1138   0.0904  0.0764
## 
## sigma^2 = 30239245:  log likelihood = -1844.91
## AIC=3697.81   AICc=3698.03   BIC=3710.67
#Forecasting
forecast1 <- forecast(fit1, h = 10) 
forecast1 
##     Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
## 187        2017932 2010884 2024979 2007154 2028710
## 188        2071571 2061351 2081791 2055940 2087202
## 189        2125199 2111644 2138755 2104468 2145931
## 190        2178821 2161321 2196321 2152057 2205585
## 191        2232439 2210270 2254609 2198534 2266345
## 192        2286055 2258497 2313614 2243908 2328202
## 193        2339670 2306051 2373289 2288255 2391086
## 194        2393285 2352990 2433579 2331660 2454909
## 195        2446898 2399367 2494430 2374206 2519591
## 196        2500512 2445228 2555796 2415962 2585062
plot(forecast1)

Daily Cases
#Fitting auto.arima ()
fit2 <- auto.arima(newCaseDate$Daily_Cases, seasonal = FALSE) 
fit2
## Series: newCaseDate$Daily_Cases 
## ARIMA(1,1,2) with drift 
## 
## Coefficients:
##          ar1      ma1     ma2     drift
##       0.5161  -1.4843  0.6673  289.7799
## s.e.  0.1281   0.1080  0.0959  149.7659
## 
## sigma^2 = 29728245:  log likelihood = -1852.83
## AIC=3715.66   AICc=3715.99   BIC=3731.76
#Forecasting
forecast2 <- forecast(fit2, h = 10) 
forecast2 
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 187       54494.96 47507.47 61482.44 43808.52 65181.39
## 188       54532.95 47541.94 61523.96 43841.12 65224.78
## 189       54692.78 47564.35 61821.21 43790.79 65594.77
## 190       54915.50 47512.58 62318.42 43593.71 66237.29
## 191       55170.67 47416.00 62925.34 43310.92 67030.41
## 192       55442.58 47304.05 63581.12 42995.77 67889.40
## 193       55723.14 47193.56 64252.73 42678.26 68768.02
## 194       56008.16 47092.30 64924.03 42372.52 69643.81
## 195       56295.49 47003.19 65587.79 42084.14 70506.84
## 196       56584.00 46926.83 66241.17 41814.63 71353.37
plot(forecast2)

Question:

  • What do you observe happens to the errors as the forecasted time point is further from the data?

Example 2: Seasonal Flu

The first example was pretty simple… let’s try something harder!

These data come from the CDC Weekly U.S. Influenza Surveillance Report (FluView).

fluAll<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/flu.csv")

str(fluAll)
## 'data.frame':    9959 obs. of  8 variables:
##  $ YEAR           : chr  "2009-10" "2009-10" "2009-10" "2009-10" ...
##  $ MMWR.YEAR      : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ MMWR.WEEK      : int  35 35 35 35 35 35 35 35 36 36 ...
##  $ AGE.CATEGORY   : chr  "Overall" "Overall" "Overall" "Overall" ...
##  $ SEX.CATEGORY   : chr  "Overall" "Overall" "Overall" "Overall" ...
##  $ RACE.CATEGORY  : chr  "White" "Black" "Hispanic/Latino" "Asian/Pacific Islander" ...
##  $ CUMULATIVE.RATE: chr  "0.1" "0.9" "0.4" "0.3" ...
##  $ WEEKLY.RATE    : chr  "0.1" "0.9" "0.4" "0.3" ...

Wrangle

fluAll$WEEKLY.RATE<-as.numeric(fluAll$WEEKLY.RATE)
## Warning: NAs introduced by coercion
flu<-fluAll%>%
  filter(AGE.CATEGORY=="Overall", 
         SEX.CATEGORY=="Overall", 
         RACE.CATEGORY=="Overall")

#install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## Make a date column 
fluDat<-flu%>%
  mutate(date=lubridate::ymd(paste(MMWR.YEAR,"-01-01", sep="" )) + lubridate::weeks( MMWR.WEEK - 1 ))

Visualize

Make a time series plot.

### MAKE A TIME SERIES PLOT
ggplot(data=fluDat, aes(x=date, y=WEEKLY.RATE))+
  geom_line()
## Warning: Removed 12 rows containing missing values (`geom_line()`).

Decompose

We can decompose a time series into a seasonal component, trend, and noise.

### OH NO MISSING DATA DURING COVID
### and the data after looks ... suspect
useDat<-fluDat%>%
  filter(date<"2020-01-01")

### Decompose timeseries
flu_ts <- ts(useDat$WEEKLY.RATE, freq = 365.25/7)

plot(stl(flu_ts, s.window = 365.25/7))
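
To work with the estimated pieces directly, we can save the stl() fit and inspect its time.series component, which holds one column each for the seasonal, trend, and remainder terms; a minimal sketch (fluStl is a hypothetical name):

## EXTRACT THE ESTIMATED COMPONENTS
fluStl <- stl(flu_ts, s.window = 365.25/7)
head(fluStl$time.series)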

Seasonality

How could we model seasonal patterns with a method we already learned?

### Seasonal flu
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE, color=as.factor(MMWR.YEAR)))+
  geom_line()
## Warning: Removed 43 rows containing missing values (`geom_line()`).

### Let's add seasons
### week of year
lubridate::week(as.Date("2020-03-20")) # 12 (SPRING)
## [1] 12
lubridate::week(as.Date("2020-06-21")) #25 (SUMMER)
## [1] 25
lubridate::week(as.Date("2020-09-23")) #39 (FALL)
## [1] 39
lubridate::week(as.Date("2020-12-21")) #51 (WINTER)
## [1] 51
###
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE, color=as.factor(MMWR.YEAR)))+
  geom_line()+
  geom_vline(xintercept = 12, lty=2)+
  geom_vline(xintercept = 25, lty=2)+
  geom_vline(xintercept = 39, lty=2)+
  geom_vline(xintercept = 51, lty=2)
## Warning: Removed 43 rows containing missing values (`geom_line()`).

Model seasonality with LOESS!

## LOESS
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE))+
  geom_line(aes(group=MMWR.YEAR))+
  geom_smooth(method="loess" , se=FALSE, span=.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 43 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 43 rows containing missing values (`geom_line()`).

### DECOMPOSE WEEK
fluLoess<-loess(WEEKLY.RATE~MMWR.WEEK, span=0.2,
                data=useDat)

### PREDICT SEASONALITY EFFECT (by week)
weekPred<-data.frame(MMWR.WEEK=1:52)%>%
  mutate(weekSnl=predict(fluLoess, MMWR.WEEK))

We can do a pretty good job just continuing this seasonal pattern forward.

### new data frame for 2020 predictions
predDf<-data.frame(MMWR.YEAR=rep(2020, 52), 
                   MMWR.WEEK=1:52, 
                   YT=2.753444)%>%
  left_join(weekPred)%>%
  mutate(date=lubridate::ymd(paste(MMWR.YEAR,"-01-01", sep="" )) + lubridate::weeks( MMWR.WEEK - 1 ))
## Joining, by = "MMWR.WEEK"
#### Plot together
ggplot(useDat, aes(x=date, y=WEEKLY.RATE))+
  geom_line()+
  geom_line(data=predDf, aes(x=date, y=weekSnl), color="red")

SARIMA

This is a seasonal ARIMA (SARIMA), which extends ARIMA with seasonal AR, differencing, and MA terms, written \(ARIMA(p,d,q)(P,D,Q)[m]\), where \(m\) is the seasonal period (here \(m=12\) months). First we aggregate the weekly rates to monthly:

## CHANGE TO MONTHLY
fluMo<-useDat%>%
  mutate(date=lubridate::ymd(paste(MMWR.YEAR,"-01-01", sep="" )) + lubridate::weeks( MMWR.WEEK - 1 ))%>%
  mutate(MONTH=month(date))%>%
  group_by(MMWR.YEAR, MONTH)%>%
  summarise(MO.RATE=sum(WEEKLY.RATE))%>%
  mutate(date=lubridate::ymd(paste(MMWR.YEAR,"-",MONTH,"-01", sep="" )))
## `summarise()` has grouped output by 'MMWR.YEAR'. You can override using the
## `.groups` argument.
#View(fluMo) 

## SARIMA
fluSarima<-auto.arima(ts(fluMo$MO.RATE, freq =  12))
fluSarima
## Series: ts(fluMo$MO.RATE, freq = 12) 
## ARIMA(0,1,0)(0,0,1)[12] 
## 
## Coefficients:
##          sma1
##       -0.2616
## s.e.   0.1155
## 
## sigma^2 = 56.42:  log likelihood = -261.01
## AIC=526.02   AICc=526.18   BIC=530.68
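
As with the earlier fits, we can forecast from this model; a minimal sketch forecasting the next 12 months (fluForecast is a hypothetical name):

## FORECAST 12 MONTHS AHEAD FROM THE SARIMA FIT
fluForecast <- forecast(fluSarima, h = 12)
plot(fluForecast)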