When to use a time series vs. a linear regression?
If you have continuous target variable, then it is a regression problem. For instance, in flight trials we have the flight distance to predict, which is continuous. Hence this becomes a regression problem. About time series, when the datapoints are time dependent, then it becomes a time series problem. Each data point has an order and is, typically, related to the data points before and after by some underlying process.
In turn, a times series regression is a statistical method for predicting a future response based on the response history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors. Time series regression is commonly used for modeling and forecasting of economic, financial, and biological systems.
There are three concepts then to keep in mind:
Stationary: if any trends or patterns in the time series are independent with time. (A linear regression equivalent to a time series at stationarity). There is strong and weak stationary.
Non-Stationary: trends dependent in time in the mean or statistical deviation of the data points.
Autocorrelation: “memory”; the degree to which time series values in period(t) are related to time series values in periods (t+1, t+2, t+3…). For this, we can test to see how long certain values last in a system.
There are also three words and their definitions to remember:
Questions about time series often asked include,
So how can you answer these questions through a time series regression and code? And where do you begin to test stationarity or visualize autocorrelation? And finally, how do you handle non-stationarity?
source_path = "~/Desktop/git_repositories/SBB-dispersal/avbernat_working_on/Rsrc/"
script_names = c("compare_models.R",
"regression_output.R",
"clean_morph_data.R", # two functions: read_morph_data and remove_torn_wings
"AICprobabilities.R")
for (script in script_names) {
path = paste0(source_path, script)
source(path)
}
source("~/Desktop/git_repositories/SBB-dispersal/avbernat_working_on/RTsrc/ts_auxiliary_funs.R")
Extract date information in order to prep it for converting it into a datetime object read by the xts() function.
Merge close dates together in order to have as equally spaced datapoints as possible. The acf() assumes that the data are regular spaced. The acf computes estimates of the autocovariance or autocorrelation function. We need this because it will tell us, if there are patterns, what the mathematical function of those patterns are.
data_list <- read_morph_data("data/allmorphology9.21.20-clean.csv")
## number of missing dates: 0
raw_data = data_list[[1]]
all_bugs = nrow(raw_data)
data_long = data_list[[2]]
# Remove individuals with torn wings first
raw_data = remove_torn_wings(raw_data) # **don't need to remove torn wings for wing morph analysis
data_long = remove_torn_wings(data_long) # **don't need to remove torn wings for wing morph analysis
clean_bugs = nrow(raw_data)
cat("number of bugs with torn wings:", all_bugs - clean_bugs, "\n\n")
## number of bugs with torn wings: 141
Looks like we have a measurement at least once per year, which is good. However, we only have 10 time points.
na.omit(unique(raw_data$datetime))
## [1] Apr 2013 Apr 2014 Apr 2015 Dec 2013 Aug 2017 Dec 2016 Sep 2018 May 2019
## [9] Oct 2019 Feb 2020
## 10 Levels: Apr 2013 Dec 2013 Apr 2014 Apr 2015 Dec 2016 Aug 2017 ... Feb 2020
# equation: +-2/sqrt(T) where T is the number of obs
n=length(na.omit(unique(raw_data$datetime)))
2/sqrt(n)
## [1] 0.6324555
This would make our time series a borderline-short time series rather than a long time series. It may make more sense to do a GLMM that accounts for temporal dependencies rather than try to have an autoregressive model conform to this data. However, doing a GLMM assumes that temporal dependencies are the same between any two time observations, whereas an autoregressive model will try to account for and define any decay.
I continued applying autoregressive models on the data to see how it would span out. Long story short, we need more time points.
Remove missing wing and datetime values.
Compute the wing length averages and generate datetime objects using as.Date(). In order to be processed by the as.Date() function, all time objects need a date, month, and year.
Use the xts & zoo R Libraries to read cleanly index the data by a formal time object (collection_time).
Optional: include major events that occurred in the 2010’s. This could be events that you find biologically significant to your questions. E.g. how did major hurricanes in Florida impact average soapberry bug wing length, wing2body ratios, and/or wing-morph frequencies across Florida?
Hurricane Source: https://en.wikipedia.org/wiki/List_of_Florida_hurricanes#2000–present
# remove NA dates
d = raw_data %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
# events
FL_major_hurr = c("Sep 2017 10", "Oct 2018 10") # Irma (s), Michael (n)
hurr_dates = as.Date(FL_major_hurr, "%b %Y %d")
events <- xts(c("Irma (S. FL)", "Michael (N. Fl)"),
hurr_dates)
plot_consecutive_yrs(wing_avg, "wing length (mm)") # not strong; the points are very dispersed
## t t1
## t 1.0000000 -0.4201376
## t1 -0.4201376 1.0000000
Create xts-zoo object and plot the time series data. Add any events to the plot with addEventLines().
Run the Augmented Dickey-Fuller Test to test for stationarity where the null hypothesis is non-stationarity and the alternative hypothesis is stationarity.
Finally, use an ACF (autocorrelation function) and PCF (partial-autocorrelation function) plot to identify temporal dependence in the data. Autocorrelation measures the linear relationship between lagged values of a time series.
\(R_s = Corr(x_t, x_{t+s})\) for lag \(s\).
There are two horizontal, blue, dashed lines as well. Those represent the significance threshold, where only the spikes that exceed this dashed line are considered significant. For a white noise series, we expect 95% of the spikes in the ACF to lie within +- 2/sqrt(T) where T is the length of the time series. In most acf plots below the thresholds will be equal to +- 2/sqrt(10) = -+ 0.63. If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise.
In other words, these plots describe how well the present value of the series is related with its past values. Since, none do significantly, then there is no AR (a present value of the time series cannot be obtained using previous values of the same time series).
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.8019, Lag order = 2, p-value = 0.6479
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8666
## s.e. 0.1281
##
## sigma^2 estimated as 0.1823: log likelihood=-5.15
## AIC=14.31 AICc=16.02 BIC=14.91
Interpretation:
Stationarity can be defined in precise mathematical terms, but for our purpose we mean a flat looking series, without trend, constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations (seasonality). So the high p value means we do have non-stationarity. Similarly, the augmented Dickey–Fuller (ADF) statistic, used in the test, is a negative number. The more negative it is, the stronger the rejection of the null hypothesis. This is not negative enough.
Finally, the lag length (lag order) is how many terms back down the AR process you want to test for serial correlation. This is a non-stationary with AR(2).
Differencing is when a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step.
detrend(wing_mm) # this is stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -13.094, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8662
## s.e. 0.1423
##
## sigma^2 estimated as 0.2051: log likelihood=-5.11
## AIC=14.22 AICc=16.22 BIC=14.62
addEventLines(events, pos=2, srt=90, col="red")
Interpretation: This time series is stationary! And as a reminder, a data that is just noise will be stationary.
dedrift(wing_mm) # this is not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -2.0922, Lag order = 2, p-value = 0.5373
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8662
## s.e. 0.1423
##
## sigma^2 estimated as 0.2051: log likelihood=-5.11
## AIC=14.22 AICc=16.22 BIC=14.62
addEventLines(events, pos=2, srt=90, col="red")
dedriftrend(wing_mm) # this is stationary
## Warning in adf.test(dx$logdiff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -17.391, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8662
## s.e. 0.1423
##
## sigma^2 estimated as 0.2051: log likelihood=-5.11
## AIC=14.22 AICc=16.22 BIC=14.62
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
plot_consecutive_yrs(ratio_avg, "wing2body") # weak
## t t1
## t 1.0000000 -0.2066168
## t1 -0.2066168 1.0000000
ratio = xts(ratio_avg, date)
colnames(ratio) <- "wing2body"
check_stationarity(ratio) # this is not stationary
## Warning in adf.test(dx[, 1]): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = 1.3655, Lag order = 2, p-value = 0.99
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7300
## s.e. 0.0013
##
## sigma^2 estimated as 1.621e-05: log likelihood=37.39
## AIC=-70.79 AICc=-68.79 BIC=-70.39
addEventLines(events, pos=2, srt=90, col="red")
#dedrift(ratio) # this is not stationary
detrend(ratio) # this is stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -14.892, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7299
## s.e. 0.0015
##
## sigma^2 estimated as 1.85e-05: log likelihood=32.77
## AIC=-61.55 AICc=-59.15 BIC=-61.39
addEventLines(events, pos=2, srt=90, col="red")
dedriftrend(ratio) # this is stationary
## Warning in adf.test(dx$logdiff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -14.859, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7299
## s.e. 0.0015
##
## sigma^2 estimated as 1.85e-05: log likelihood=32.77
## AIC=-61.55 AICc=-59.15 BIC=-61.39
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_binom), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_binom", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
plot_consecutive_yrs(freq_avg, "morph freq") # weak
## t t1
## t 1.0000000 -0.2018924
## t1 -0.2018924 1.0000000
morph = xts(freq_avg, date)
colnames(morph) <- "wing_morph_freq"
check_stationarity(morph) # this is non-stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.9253, Lag order = 2, p-value = 0.6008
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7715
## s.e. 0.0280
##
## sigma^2 estimated as 0.008711: log likelihood=10.05
## AIC=-16.11 AICc=-14.39 BIC=-15.5
addEventLines(events, pos=2, srt=90, col="red")
#detrend(morph) # this is not stationary
#dedrift(morph) # this is not stationary
dedriftrend(morph) # this is stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -3.615, Lag order = 2, p-value = 0.04893
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7676
## s.e. 0.0308
##
## sigma^2 estimated as 0.009626: log likelihood=8.65
## AIC=-13.31 AICc=-11.31 BIC=-12.91
addEventLines(events, pos=2, srt=90, col="red")
females = raw_data[raw_data$sex=="F",]
males = raw_data[raw_data$sex=="M",]
# remove NA dates
d = females %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -3.5315, Lag order = 2, p-value = 0.05952
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## -0.6305 8.3045
## s.e. 0.2161 0.0731
##
## sigma^2 estimated as 0.1637: log likelihood=-4.28
## AIC=14.56 AICc=18.56 BIC=15.47
detrend(wing_mmf) # this is stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -5.1676, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## -0.6707 8.3233
## s.e. 0.2241 0.0777
##
## sigma^2 estimated as 0.1759: log likelihood=-4.12
## AIC=14.24 AICc=19.04 BIC=14.83
addEventLines(events, pos=2, srt=90, col="red")
dedrift(wing_mmf) # this is stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -7.9803, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## -0.6707 8.3233
## s.e. 0.2241 0.0777
##
## sigma^2 estimated as 0.1759: log likelihood=-4.12
## AIC=14.24 AICc=19.04 BIC=14.83
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
d = d[d$sex=="F",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
ratiof = xts(ratio_avg, date)
colnames(ratiof) <- "wing2body"
check_stationarity(ratiof) # this is not stationary
## Warning in adf.test(dx[, 1]): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = 0.97018, Lag order = 2, p-value = 0.99
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7280
## s.e. 0.0017
##
## sigma^2 estimated as 2.726e-05: log likelihood=35.05
## AIC=-66.11 AICc=-64.11 BIC=-65.72
addEventLines(events, pos=2, srt=90, col="red")
#dedrift(ratiof) # not stationary
detrend(ratiof) # **barely** not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -3.5933, Lag order = 1, p-value = 0.05094
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7275
## s.e. 0.0018
##
## sigma^2 estimated as 2.844e-05: log likelihood=31.05
## AIC=-58.11 AICc=-55.71 BIC=-57.95
addEventLines(events, pos=2, srt=90, col="red")
dedriftrend(ratiof) # **barely** not not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -3.5825, Lag order = 1, p-value = 0.05243
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7275
## s.e. 0.0018
##
## sigma^2 estimated as 2.844e-05: log likelihood=31.05
## AIC=-58.11 AICc=-55.71 BIC=-57.95
addEventLines(events, pos=2, srt=90, col="red")
Will need to use another function - many like a polynomial to detrend would be better.
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_binom), !is.na(datetime))
d = d[d$sex=="F",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_binom", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
morphf = xts(freq_avg, date)
colnames(morphf) <- "wing_morph_freq"
check_stationarity(morphf) # not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.7014, Lag order = 2, p-value = 0.6861
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7522
## s.e. 0.0284
##
## sigma^2 estimated as 0.008985: log likelihood=9.9
## AIC=-15.8 AICc=-14.08 BIC=-15.19
addEventLines(events, pos=2, srt=90, col="red")
#detrend(morphf) # not stationary
dedriftrend(morphf) # still not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -1.5666, Lag order = 2, p-value = 0.7375
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7500
## s.e. 0.0315
##
## sigma^2 estimated as 0.01005: log likelihood=8.46
## AIC=-12.92 AICc=-10.92 BIC=-12.52
dedrift(morphf) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -26.872, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7500
## s.e. 0.0315
##
## sigma^2 estimated as 0.01005: log likelihood=8.46
## AIC=-12.92 AICc=-10.92 BIC=-12.52
addEventLines(events, pos=2, srt=90, col="red")
Will need to use another function here too.
# remove NA dates
d = males %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -0.4763, Lag order = 2, p-value = 0.976
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.4762
## s.e. 0.1261
##
## sigma^2 estimated as 0.1766: log likelihood=-4.99
## AIC=13.99 AICc=15.7 BIC=14.59
#dedrift(wing_mmm) # not stationary
detrend(wing_mmm) # stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -7.1199, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.4705
## s.e. 0.1400
##
## sigma^2 estimated as 0.1983: log likelihood=-4.96
## AIC=13.92 AICc=15.92 BIC=14.31
addEventLines(events, pos=2, srt=90, col="red")
dedriftrend(wing_mm) # stationary
## Warning in adf.test(dx$logdiff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -17.391, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8662
## s.e. 0.1423
##
## sigma^2 estimated as 0.2051: log likelihood=-5.11
## AIC=14.22 AICc=16.22 BIC=14.62
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
d = d[d$sex=="M",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
ratiom = xts(ratio_avg, date)
colnames(ratiom) <- "wing2body"
check_stationarity(ratiom) # not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.0597, Lag order = 2, p-value = 0.9118
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7319
## s.e. 0.0015
##
## sigma^2 estimated as 2.168e-05: log likelihood=36.09
## AIC=-68.17 AICc=-66.17 BIC=-67.78
addEventLines(events, pos=2, srt=90, col="red")
#dedrift(ratiom) #not stationary
#dedriftrend(ratiom) # this is stationary
#addEventLines(events, pos=2, srt=90, col="red") # the same as detrend
detrend(ratiom) # this is stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -8.8981, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7321
## s.e. 0.0017
##
## sigma^2 estimated as 2.406e-05: log likelihood=31.72
## AIC=-59.45 AICc=-57.05 BIC=-59.29
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_binom), !is.na(datetime))
d = d[d$sex=="M",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_binom", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
morphm = xts(freq_avg, date)
colnames(morphm) <- "wing_morph_freq"
check_stationarity(morphm) # not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.1172, Lag order = 2, p-value = 0.9034
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.794
## s.e. 0.032
##
## sigma^2 estimated as 0.01136: log likelihood=8.72
## AIC=-13.45 AICc=-11.73 BIC=-12.84
addEventLines(events, pos=2, srt=90, col="red")
#dedrift(morphm) # not stationary
detrend(morphm) # stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -3.8803, Lag order = 2, p-value = 0.02998
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7897
## s.e. 0.0352
##
## sigma^2 estimated as 0.01258: log likelihood=7.45
## AIC=-10.9 AICc=-8.9 BIC=-10.51
addEventLines(events, pos=2, srt=90, col="red")
dedriftrend(morphm) # stationary
## Warning in adf.test(dx$logdiff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logdiff
## Dickey-Fuller = -7.1766, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7897
## s.e. 0.0352
##
## sigma^2 estimated as 0.01258: log likelihood=7.45
## AIC=-10.9 AICc=-8.9 BIC=-10.51
addEventLines(events, pos=2, srt=90, col="red")
BV = raw_data[raw_data$pophost == "C.corindum",]
GRT = raw_data[raw_data$pophost == "K.elegans",]
# remove NA dates
d = BV %>%
filter(!is.na(wing), !is.na(datetime))
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.3557, Lag order = 2, p-value = 0.8178
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.1985
## s.e. 0.1876
##
## sigma^2 estimated as 0.3912: log likelihood=-8.97
## AIC=21.94 AICc=23.65 BIC=22.55
#detrend(wing_BV) # not stationary
#dedriftrend(wing_BV) # not stationary
dedrift(wing_BV) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -14.635, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.1557
## s.e. 0.2036
##
## sigma^2 estimated as 0.4195: log likelihood=-8.33
## AIC=20.66 AICc=22.66 BIC=21.06
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
d = d[d$pophost=="C.corindum",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
ratioBV = xts(ratio_avg, date)
colnames(ratioBV) <- "wing2body"
check_stationarity(ratioBV) # not stationary
## Warning in adf.test(dx[, 1]): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = 2.0769, Lag order = 2, p-value = 0.99
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7252
## s.e. 0.0018
##
## sigma^2 estimated as 3.153e-05: log likelihood=34.4
## AIC=-64.8 AICc=-62.8 BIC=-64.41
addEventLines(events, pos=2, srt=90, col="red")
#dedrift(ratioBV) # not stationary
detrend(ratioBV) # stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -16.057, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7246
## s.e. 0.0019
##
## sigma^2 estimated as 3.264e-05: log likelihood=30.5
## AIC=-57.01 AICc=-54.61 BIC=-56.85
addEventLines(events, pos=2, srt=90, col="red")
#dedriftrend(ratioBV) # stationary
#addEventLines(events, pos=2, srt=90, col="red") # same as detrend
Over time, the wing2body ratio of the balloon vine bugs is decreasing. However, ACF and PACF not significant.
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_binom), !is.na(datetime))
d = d[d$pophost=="C.corindum",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_binom", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
morphBV = xts(freq_avg, date)
colnames(morphBV) <- "wing_morph_freq"
check_stationarity(morphBV) # not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.3756, Lag order = 2, p-value = 0.8102
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.6050
## s.e. 0.0463
##
## sigma^2 estimated as 0.02384: log likelihood=5.02
## AIC=-6.04 AICc=-4.32 BIC=-5.43
addEventLines(events, pos=2, srt=90, col="red")
#detrend(morphBV) # not stationary
dedrift(morphBV) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -7.5547, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.5957
## s.e. 0.0505
##
## sigma^2 estimated as 0.02584: log likelihood=4.21
## AIC=-4.42 AICc=-2.42 BIC=-4.03
addEventLines(events, pos=2, srt=90, col="red")
#dedriftrend(morphBV) # not stationary
Will need to apply another function.
# remove NA dates
d = GRT %>%
filter(!is.na(wing), !is.na(datetime))
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.5958, Lag order = 2, p-value = 0.7263
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 8.4840
## s.e. 0.0889
##
## sigma^2 estimated as 0.08778: log likelihood=-1.5
## AIC=7 AICc=8.71 BIC=7.6
detrend(wing_GRT) # stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -3.9613, Lag order = 2, p-value = 0.02461
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 8.5289
## s.e. 0.0867
##
## sigma^2 estimated as 0.07606: log likelihood=-0.65
## AIC=5.3 AICc=7.3 BIC=5.69
addEventLines(events, pos=2, srt=90, col="red")
dedrift(wing_GRT) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -9.5131, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 8.5289
## s.e. 0.0867
##
## sigma^2 estimated as 0.07606: log likelihood=-0.65
## AIC=5.3 AICc=7.3 BIC=5.69
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
d = d[d$pophost=="K.elegans",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
ratioGRT = xts(ratio_avg, date)
colnames(ratioGRT) <- "wing2body"
check_stationarity(ratioGRT) # not stationary
## Warning in adf.test(dx[, 1]): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = 0.49004, Lag order = 2, p-value = 0.99
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7326
## s.e. 0.0014
##
## sigma^2 estimated as 1.819e-05: log likelihood=36.88
## AIC=-69.75 AICc=-67.75 BIC=-69.36
addEventLines(events, pos=2, srt=90, col="red")
detrend(ratioGRT) # stationary
## Warning in adf.test(dx$diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -55.197, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7329
## s.e. 0.0015
##
## sigma^2 estimated as 2.016e-05: log likelihood=32.43
## AIC=-60.86 AICc=-58.46 BIC=-60.7
addEventLines(events, pos=2, srt=90, col="red")
dedrift(ratioGRT) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -4.491, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.7329
## s.e. 0.0015
##
## sigma^2 estimated as 2.016e-05: log likelihood=32.43
## AIC=-60.86 AICc=-58.46 BIC=-60.7
addEventLines(events, pos=2, srt=90, col="red")
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_binom), !is.na(datetime))
d = d[d$pophost=="K.elegans",]
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_binom", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
morphGRT = xts(freq_avg, date)
colnames(morphGRT) <- "wing_morph_freq"
check_stationarity(morphGRT) # not stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -1.994, Lag order = 2, p-value = 0.5747
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.9214
## s.e. 0.0151
##
## sigma^2 estimated as 0.002534: log likelihood=16.23
## AIC=-28.45 AICc=-26.74 BIC=-27.85
addEventLines(events, pos=2, srt=90, col="red")
detrend(morphGRT) # stationary
##
## Augmented Dickey-Fuller Test
##
## data: dx$diff
## Dickey-Fuller = -3.6078, Lag order = 2, p-value = 0.04945
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.9246
## s.e. 0.0164
##
## sigma^2 estimated as 0.002735: log likelihood=14.32
## AIC=-24.63 AICc=-22.63 BIC=-24.24
addEventLines(events, pos=2, srt=90, col="red")
dedrift(morphGRT) # stationary
## Warning in adf.test(dx$logv): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dx$logv
## Dickey-Fuller = -32.943, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.9246
## s.e. 0.0164
##
## sigma^2 estimated as 0.002735: log likelihood=14.32
## AIC=-24.63 AICc=-22.63 BIC=-24.24
addEventLines(events, pos=2, srt=90, col="red")
Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. https://otexts.com/fpp3/.
https://rstudio-pubs-static.s3.amazonaws.com/363796_2c2581467b02403bbdcc77c6b5660ec7.html
We have very few points (10). Ideal would be 15-25 observations. Because of the few points, none of the spikes in the ACF or PACF are significant even though the Augmented Dickey-Fuller Test keeps testing for non-stationarity. In turn, this is a short time series that can be best evaluated for temporal dependencies in another modeling framework (e.g. GLMM, GAMM, or LOESS).
Otherwise, in AR, these time series seem to fall under white noise with drift/intercept. In probability theory, stochastic drift is the change of the average value of a stochastic (random) process. A related concept is the drift rate, which is the rate at which the average changes. Here is a simulation to visualize this. Often times, white noise with drift always tests for nonstationarity but then may or may not have significant acf spikes
a = 2 # drift / intercept - shifted
m = 100 # length of time series
x.wn.drift = a + rnorm(m, 0, 1)
plot(x.wn.drift, type='l', col='steelblue2', lwd=2)
abline(h=a, lty=2)
adf.test(x.wn.drift)
##
## Augmented Dickey-Fuller Test
##
## data: x.wn.drift
## Dickey-Fuller = -3.4136, Lag order = 4, p-value = 0.05623
## alternative hypothesis: stationary
Acf(x.wn.drift, main='')
auto.arima(x.wn.drift)
## Series: x.wn.drift
## ARIMA(3,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 mean
## -0.1361 -0.6793 -0.2682 0.1267 0.6144 1.9418
## s.e. 0.2325 0.1469 0.1129 0.2370 0.1484 0.0750
##
## sigma^2 estimated as 0.8511: log likelihood=-131.01
## AIC=276.02 AICc=277.24 BIC=294.26
In our limited datasets, there isn’t any strong evidence of non-randomness because our autocorrelations were NOT statistically significant. However, there was still nonstationarity - most likely due to drift (moving/changing average). There were instances where autocorrelations were close to being statistically significant, but it would be ideal to have at least 15 observations in order to test that. Why? Because at 15 observations the significance threshold would lower from 0.63 to 0.52, which could lead to more significant spikes.
In turn, our datasets have white noise and a weak drift or even trend at times. However, white noise are variations in your data that cannot be explained by any regression model.
https://towardsdatascience.com/the-white-noise-model-1388dbd0a7d
So how many data points would we need?
# remove NA dates
d = raw_data %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
# determine your dates
newdates = c("Dec 2020 01", "April 2021 01", "June 2021 01", "Oct 2021 01", "Jan 2022 01")
dates = as.Date(newdates, "%b %Y %d")
sim_dates = c(date, dates)
# simulate randomly your new winglengths
set.seed(100) # e.g. set.seed = 100,
wl_sim = runif(length(newdates), min(wing_avg), max(wing_avg))
wings = c(wing_avg, wl_sim)
# create xts time series object
wing_sim = xts(wings, sim_dates)
colnames(wing_sim) <- "wing"
wing_sim
## wing
## 2013-04-01 7.870708
## 2013-12-01 7.687829
## 2014-04-01 8.481303
## 2015-04-01 7.584976
## 2016-12-01 8.014340
## 2017-08-01 8.203489
## 2018-09-01 7.887514
## 2019-05-01 7.017570
## 2019-10-01 8.332079
## 2020-02-01 7.586518
## 2020-12-01 7.468058
## 2021-04-01 7.394734
## 2021-06-01 7.826023
## 2021-10-01 7.100100
## 2022-01-01 7.703401
check_stationarity(wing_sim)
##
## Augmented Dickey-Fuller Test
##
## data: dx[, 1]
## Dickey-Fuller = -4.2158, Lag order = 2, p-value = 0.01573
## alternative hypothesis: stationary
##
## Series: dx[, 1]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.7439
## s.e. 0.1035
##
## sigma^2 estimated as 0.1723: log likelihood=-7.58
## AIC=19.16 AICc=20.16 BIC=20.57
addEventLines(events, pos=2, srt=90, col="red")
Hard to tell what works on a small scale like this. Would be interesting to make +1000 simulations testing out time series scenarios. Clearly, it’s not just a number of observations issue. How soapberry bug morphology changes at particular time points will matter a lot in defining stationarity first and then its AR model. Also, this was a poor simulation and just a quick test for myself. There are much more intricate ones that would help test different scenarios.