In this lesson students will learn the basics of time series analysis.
Last class we saw how polynomial regression was used to model COVID-19 data in 2020; however, this model is inherently flawed because it assumes independence. When data are taken close together in time, neighboring observations tend to be correlated, so the independence assumption is violated.
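As a quick illustration (a toy sketch on simulated data, not the course data), a trend-only fit to an autocorrelated series leaves autocorrelated residuals:
## TOY EXAMPLE: a trend fit to autocorrelated data
set.seed(252)
x <- 1:100
y <- cumsum(rnorm(100))      ## a random walk: each value builds on the last
fit <- lm(y ~ poly(x, 2))    ## polynomial trend fit, as in last class
acf(resid(fit))              ## residuals are clearly autocorrelated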
covid<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/complete.csv")
str(covid)
## 'data.frame': 4692 obs. of 10 variables:
## $ Date : chr "2020-01-30" "2020-01-31" "2020-02-01" "2020-02-02" ...
## $ Name.of.State...UT : chr "Kerala" "Kerala" "Kerala" "Kerala" ...
## $ Latitude : num 10.9 10.9 10.9 10.9 10.9 ...
## $ Longitude : num 76.3 76.3 76.3 76.3 76.3 ...
## $ Total.Confirmed.cases : num 1 1 2 3 3 3 3 3 3 3 ...
## $ Death : chr "0" "0" "0" "0" ...
## $ Cured.Discharged.Migrated: num 0 0 0 0 0 0 0 0 0 0 ...
## $ New.cases : int 0 0 1 1 0 0 0 0 0 0 ...
## $ New.deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ New.recovered : int 0 0 0 0 0 0 0 0 0 0 ...
library(tidyverse)
library(magrittr)
## NEEDS TO BE A DATE OBJECT
covid$Date<-as.Date(covid$Date)
## UNIQUE STATES
covid<-covid%>%
filter(Name.of.State...UT!="Telangana***")
states<-unique(covid$Name.of.State...UT)
#states
### CLEAN STATE DATA
### the first time a state appears in the data,
### its count is reported as total cases
### rather than new cases
for(i in 1:length(states)){
  thisState <- covid %>%
    filter(Name.of.State...UT == states[i]) %>%
    arrange(Date)

  thisState$New.cases[1] <- thisState$Total.Confirmed.cases[1]

  if(i == 1){
    covidW <- thisState
  }
  if(i > 1){
    covidW <- covidW %>%
      rbind(thisState)
  }
}
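For reference, the same cleaning can be sketched without a loop using grouped dplyr verbs (covidW2 is just an illustrative name):
## LOOP-FREE ALTERNATIVE (sketch)
covidW2 <- covid %>%
  group_by(Name.of.State...UT) %>%
  arrange(Date, .by_group = TRUE) %>%
  ## first row per state: replace New.cases with Total.Confirmed.cases
  mutate(New.cases = ifelse(row_number() == 1, Total.Confirmed.cases, New.cases)) %>%
  ungroup()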
## CREATE A VECTOR OF UNIQUE DATES
dateRange<-unique(covid$Date)
## INITIALIZE EMPTY VECTOR FOR CASES
cases<-c()
for(i in 1:length(dateRange)){
  ## CALCULATE NUMBER OF NEW CASES ACROSS ALL STATES
  ## (use the cleaned data, covidW, so first-appearance cases are counted)
  covidThisDate <- covidW %>%
    filter(Date == dateRange[i]) %$%
    sum(New.cases)

  cases <- c(cases, covidThisDate)
}
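Again, a grouped summary would avoid the loop entirely (sketch):
## EQUIVALENT GROUPED SUMMARY (sketch)
covidW %>%
  group_by(Date) %>%
  summarise(Daily_Cases = sum(New.cases))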
## Create a new data frame
newCaseDate <- data.frame(Date = dateRange,
                          Daily_Cases = cases,
                          ## CUMULATIVE SUM
                          Total_Cases = cumsum(cases)) %>%
  mutate(Start = as.Date("2020-01-30")) %>%
  ## NUMBER OF DAYS SINCE START 1/30/2020
  mutate(Days = as.numeric(Date - Start))
## Plot for New Cases in a day
ggplot(newCaseDate, aes(x=Days, y=Daily_Cases))+
geom_line()+
ggtitle("New Cases Daily")
## Plot for Total Cases
ggplot(newCaseDate, aes(x=Days, y=Total_Cases))+
geom_line()+
geom_point()+
ggtitle("Total Case Over Time")
First we will create a time series object and plot it.
#Converting it to Time Series Object
covidTS <- ts(newCaseDate$Total_Cases)
ts.plot(covidTS)
Time series are often modelled using ARIMA, which stands for auto-regressive integrated moving average. An ARIMA model can be thought of as three parts: an auto-regressive (AR) part, an integrated (I, differencing) part, and a moving average (MA) part. It is typically expressed with three parameters, \(ARIMA(p,d,q)\).
The value of \(Y\) at time \(t\) is modelled as a linear function of preceding time points (\(t-1\), \(t-2\), etc.) and of the errors from previous time points.
The model takes the form: \[Y_t=c+\phi_1 y'_{t-1}+\cdots+\phi_p y'_{t-p}+\theta_1 e_{t-1}+\cdots+\theta_q e_{t-q}+e_t\] where \(y'\) is the differenced series, \(e\) is an error term, and \(c\) is a constant.
The parameters of \(ARIMA(p,d,q)\) are described as:
\(p\): the order of the auto-regressive part, i.e., the number of lagged values of the series in the model
\(d\): the degree of differencing, i.e., the number of times the series is differenced to achieve stationarity
\(q\): the order of the moving average part, i.e., the number of lagged forecast errors in the model
For more information please read:
https://ademos.people.uic.edu/Chapter23.html#2_our_example_data:_generating_sine_waves_to_play_with
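To build intuition, we can simulate a series with known AR and MA parts using stats::arima.sim() (a sketch; the coefficient values are arbitrary):
set.seed(252)
## ARMA(1,1): one AR coefficient (0.6) and one MA coefficient (0.3)
sim <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)
ts.plot(sim)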
A common assumption in time series is that the data are stationary. The stationary property implies that the mean, variance, and autocorrelation structure do not change over time.
In order to assess whether the process is stationary, we can perform the Augmented Dickey-Fuller Test.
#install.packages("tseries")
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(covidTS)
## Warning in adf.test(covidTS): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: covidTS
## Dickey-Fuller = 4.1043, Lag order = 5, p-value = 0.99
## alternative hypothesis: stationary
In the output above, the p-value is large (0.99), so we fail to reject the null hypothesis: there is no evidence that these data are stationary.
A common strategy to make data stationary is to perform differencing. Differencing can stabilize a time series by removing trend.
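To see what differencing does, here is a toy sketch (not the COVID series): a first difference removes a linear trend, and a second difference removes a quadratic one.
## TOY EXAMPLE: differencing a quadratic trend
x <- (1:5)^2                 ## 1 4 9 16 25
diff(x)                      ## 3 5 7 9  -- still trending
diff(x, differences = 2)     ## 2 2 2    -- trend removed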
Let’s try a second order difference:
## TAKE TWO DIFFERENCES
covidTS2 <- diff(covidTS , differences = 2)
adf.test(covidTS2)
## Warning in adf.test(covidTS2): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: covidTS2
## Dickey-Fuller = -9.8155, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
plot(covidTS2)
Now, looking at the p-value, there is evidence to suggest that these data are stationary.
The following graphics are used to diagnose the proper parameters for the \(AR(p)\) and \(MA(q)\) portions.
The auto-correlation function (ACF) gives the correlation of a series with its lagged values.
The partial auto-correlation function (PACF) gives the correlation of a series with its lagged values after removing the effects of the intermediate lags.
#Plotting the ACF plot and PACF
acf(covidTS2)
pacf(covidTS2)
Notice that the ACF plot shows significant spikes up to lag 2, while the PACF shows decay. This is suggestive of an \(MA(2)\) process.
It can be very difficult to tell which \(p\) and \(q\) to use by looking at the ACF and PACF. Thankfully, there is a function called auto.arima() that finds the best fitting time series model for you.
#install.packages("forecast")
library(forecast)
#Fitting auto.arima ()
fit1 <- auto.arima(newCaseDate$Total_Cases, seasonal = FALSE)
fit1
## Series: newCaseDate$Total_Cases
## ARIMA(1,2,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.5747 -1.5243 0.7087
## s.e. 0.1138 0.0904 0.0764
##
## sigma^2 = 30239245: log likelihood = -1844.91
## AIC=3697.81 AICc=3698.03 BIC=3710.67
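Before forecasting, we might check that the residuals resemble white noise; one option (a sketch) is checkresiduals() from the forecast package:
## residual diagnostics: time plot, ACF, histogram, and a Ljung-Box test
checkresiduals(fit1)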
#Forecasting
forecast1 <- forecast(fit1, h = 10)
forecast1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 187 2017932 2010884 2024979 2007154 2028710
## 188 2071571 2061351 2081791 2055940 2087202
## 189 2125199 2111644 2138755 2104468 2145931
## 190 2178821 2161321 2196321 2152057 2205585
## 191 2232439 2210270 2254609 2198534 2266345
## 192 2286055 2258497 2313614 2243908 2328202
## 193 2339670 2306051 2373289 2288255 2391086
## 194 2393285 2352990 2433579 2331660 2454909
## 195 2446898 2399367 2494430 2374206 2519591
## 196 2500512 2445228 2555796 2415962 2585062
plot(forecast1)
Now let's model the daily new cases instead of the cumulative totals.
#Fitting auto.arima()
fit2 <- auto.arima(newCaseDate$Daily_Cases, seasonal = FALSE)
fit2
## Series: newCaseDate$Daily_Cases
## ARIMA(1,1,2) with drift
##
## Coefficients:
## ar1 ma1 ma2 drift
## 0.5161 -1.4843 0.6673 289.7799
## s.e. 0.1281 0.1080 0.0959 149.7659
##
## sigma^2 = 29728245: log likelihood = -1852.83
## AIC=3715.66 AICc=3715.99 BIC=3731.76
#Forecasting
forecast2 <- forecast(fit2, h = 10)
forecast2
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 187 54494.96 47507.47 61482.44 43808.52 65181.39
## 188 54532.95 47541.94 61523.96 43841.12 65224.78
## 189 54692.78 47564.35 61821.21 43790.79 65594.77
## 190 54915.50 47512.58 62318.42 43593.71 66237.29
## 191 55170.67 47416.00 62925.34 43310.92 67030.41
## 192 55442.58 47304.05 63581.12 42995.77 67889.40
## 193 55723.14 47193.56 64252.73 42678.26 68768.02
## 194 56008.16 47092.30 64924.03 42372.52 69643.81
## 195 56295.49 47003.19 65587.79 42084.14 70506.84
## 196 56584.00 46926.83 66241.17 41814.63 71353.37
plot(forecast2)
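As a rough check, we can accumulate the daily-case forecasts onto the last observed total and compare them with the total-case forecasts; the two models won't agree exactly (a sketch):
## ROUGH CHECK: accumulate daily forecasts onto the last observed total
tail(newCaseDate$Total_Cases, 1) + cumsum(forecast2$mean)
## compare with the direct total-case forecasts
forecast1$mean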
Question:
The first example was pretty simple… let's try something harder!
These data come from the CDC Weekly U.S. Influenza Surveillance Report (FluView).
Source:
fluAll<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/flu.csv")
str(fluAll)
## 'data.frame': 9959 obs. of 8 variables:
## $ YEAR : chr "2009-10" "2009-10" "2009-10" "2009-10" ...
## $ MMWR.YEAR : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ MMWR.WEEK : int 35 35 35 35 35 35 35 35 36 36 ...
## $ AGE.CATEGORY : chr "Overall" "Overall" "Overall" "Overall" ...
## $ SEX.CATEGORY : chr "Overall" "Overall" "Overall" "Overall" ...
## $ RACE.CATEGORY : chr "White" "Black" "Hispanic/Latino" "Asian/Pacific Islander" ...
## $ CUMULATIVE.RATE: chr "0.1" "0.9" "0.4" "0.3" ...
## $ WEEKLY.RATE : chr "0.1" "0.9" "0.4" "0.3" ...
fluAll$WEEKLY.RATE<-as.numeric(fluAll$WEEKLY.RATE)
## Warning: NAs introduced by coercion
flu<-fluAll%>%
filter(AGE.CATEGORY=="Overall",
SEX.CATEGORY=="Overall",
RACE.CATEGORY=="Overall")
#install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Make a date column
fluDat<-flu%>%
mutate(date=lubridate::ymd(paste(MMWR.YEAR,"-01-01", sep="" )) + lubridate::weeks( MMWR.WEEK - 1 ))
Make a time series plot.
### MAKE A TIME SERIES PLOT
ggplot(data=fluDat, aes(x=date, y=WEEKLY.RATE))+
geom_line()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
We can decompose a time series into a seasonal component, trend, and noise.
### OH NO MISSING DATA DURING COVID
### and the data after looks ... suspect
useDat<-fluDat%>%
filter(date<"2020-01-01")
### DECOMPOSE THE TIME SERIES (weekly data, so frequency = 365.25/7)
flu_ts <- ts(useDat$WEEKLY.RATE, freq = 365.25/7)
plot(stl(flu_ts, s.window = 365.25/7))
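The stl fit stores the three components, which we can inspect directly (a sketch; fluStl is just an illustrative name):
fluStl <- stl(flu_ts, s.window = 365.25/7)
head(fluStl$time.series)   ## columns: seasonal, trend, remainder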
How could we model seasonal patterns with a method we already learned?
### Seasonal flu
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE, color=as.factor(MMWR.YEAR)))+
geom_line()
## Warning: Removed 43 rows containing missing values (`geom_line()`).
### Let's add seasons
### week of year
lubridate::week(as.Date("2020-03-20")) # 12 (SPRING)
## [1] 12
lubridate::week(as.Date("2020-06-21")) #25 (SUMMER)
## [1] 25
lubridate::week(as.Date("2020-09-23")) #39 (FALL)
## [1] 39
lubridate::week(as.Date("2020-12-21")) #51 (WINTER)
## [1] 51
###
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE, color=as.factor(MMWR.YEAR)))+
geom_line()+
geom_vline(xintercept = 12, lty=2)+
geom_vline(xintercept = 25, lty=2)+
geom_vline(xintercept = 39, lty=2)+
geom_vline(xintercept = 51, lty=2)
## Warning: Removed 43 rows containing missing values (`geom_line()`).
Model seasonality with LOESS!
## LOESS
ggplot(flu, aes(x=MMWR.WEEK, y=WEEKLY.RATE))+
geom_line(aes(group=MMWR.YEAR))+
geom_smooth(method="loess" , se=FALSE, span=.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 43 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 43 rows containing missing values (`geom_line()`).
### SEASONAL PATTERN BY WEEK (LOESS fit)
fluLoess <- loess(WEEKLY.RATE ~ MMWR.WEEK, span = 0.2,
                  data = useDat)
### PREDICT SEASONALITY EFFECT (by week)
weekPred<-data.frame(MMWR.WEEK=1:52)%>%
mutate(weekSnl=predict(fluLoess, MMWR.WEEK))
We can do a pretty good job just by continuing this seasonality forward.
### NEW DATA FRAME FOR 2020
### (YT is a constant level term; the plot below uses only the seasonal component)
predDf <- data.frame(MMWR.YEAR = rep(2020, 52),
                     MMWR.WEEK = 1:52,
                     YT = 2.753444) %>%
  left_join(weekPred) %>%
  mutate(date = lubridate::ymd(paste(MMWR.YEAR, "-01-01", sep = "")) + lubridate::weeks(MMWR.WEEK - 1))
## Joining, by = "MMWR.WEEK"
#### Plot together
ggplot(useDat, aes(x=date, y=WEEKLY.RATE))+
geom_line()+
geom_line(data=predDf, aes(x=date, y=weekSnl), color="red")
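We could also subtract the fitted seasonal curve to inspect what is left over (a sketch):
## REMAINDER AFTER REMOVING THE LOESS SEASONAL COMPONENT (sketch)
useDat %>%
  mutate(weekSnl = predict(fluLoess, MMWR.WEEK),
         remainder = WEEKLY.RATE - weekSnl) %>%
  ggplot(aes(x = date, y = remainder)) +
  geom_line()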
Next, let's fit a seasonal ARIMA (SARIMA), which adds seasonal AR, differencing, and MA terms at the seasonal period (here, months within a year):
## CHANGE TO MONTHLY
fluMo <- useDat %>%
  mutate(date = lubridate::ymd(paste(MMWR.YEAR, "-01-01", sep = "")) + lubridate::weeks(MMWR.WEEK - 1)) %>%
  mutate(MONTH = month(date)) %>%
  group_by(MMWR.YEAR, MONTH) %>%
  summarise(MO.RATE = sum(WEEKLY.RATE)) %>%
  mutate(date = lubridate::ymd(paste(MMWR.YEAR, "-", MONTH, "-01", sep = "")))
## `summarise()` has grouped output by 'MMWR.YEAR'. You can override using the
## `.groups` argument.
#View(fluMo)
## SARIMA
fluSarima<-auto.arima(ts(fluMo$MO.RATE, freq = 12))
fluSarima
## Series: ts(fluMo$MO.RATE, freq = 12)
## ARIMA(0,1,0)(0,0,1)[12]
##
## Coefficients:
## sma1
## -0.2616
## s.e. 0.1155
##
## sigma^2 = 56.42: log likelihood = -261.01
## AIC=526.02 AICc=526.18 BIC=530.68
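Mirroring the earlier examples, we could forecast the next flu season from this fit (a sketch; fluForecast is just an illustrative name):
## FORECAST THE NEXT 12 MONTHS (sketch)
fluForecast <- forecast(fluSarima, h = 12)
plot(fluForecast)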