Loading libraries:
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(TTR)
library(forecast)
## Loading required package: timeDate
## This is forecast 7.3
NARMS_retailmeat_public <- read.csv("narms_public.csv")
d <- NARMS_retailmeat_public
Subsetting the data to contain only the genera Salmonella and E. coli, and naming the subset psec.
psec <- subset(d,(d$GENUS=="S"|d$GENUS=="EC"))
Creating a year/month variable in psec:
psec$ym <- as.yearmon(paste(psec$Year, psec$Month, sep = "-"))
Generating a table of the proportion of isolates that were susceptible, resistant, or intermediate to ceftriaxone (a cephalosporin drug, abbreviated AXO) by month. This creates the dataset psec.AXO (shorthand for preliminary Salmonella/E. coli ceftriaxone), where ym is a yearmon variable, AXO.SIR is a categorical variable with levels I, S, and R, and freq gives the proportion of that month's isolates that were intermediate, susceptible, or resistant to ceftriaxone.
psec.AXO <- psec %>% group_by(ym, AXO.SIR) %>% summarise(n = n()) %>% mutate(freq = n / sum(n))
head(psec.AXO)
## Source: local data frame [6 x 4]
## Groups: ym [3]
##
## ym AXO.SIR n freq
## <S3: yearmon> <fctr> <int> <dbl>
## 1 Jan 2002 R 1 0.01063830
## 2 Jan 2002 S 93 0.98936170
## 3 Feb 2002 R 6 0.06122449
## 4 Feb 2002 S 92 0.93877551
## 5 Mar 2002 R 1 0.01123596
## 6 Mar 2002 S 88 0.98876404
Subsetting psec.AXO to keep only the monthly proportion of isolates that were resistant to ceftriaxone (which implicitly lumps Intermediate and Susceptible together as non-resistant):
sec.AXO <- subset(psec.AXO, AXO.SIR=="R")
head(sec.AXO)
## Source: local data frame [6 x 4]
## Groups: ym [6]
##
## ym AXO.SIR n freq
## <S3: yearmon> <fctr> <int> <dbl>
## 1 Jan 2002 R 1 0.01063830
## 2 Feb 2002 R 6 0.06122449
## 3 Mar 2002 R 1 0.01123596
## 4 Apr 2002 R 1 0.01086957
## 5 May 2002 R 9 0.08571429
## 6 Jun 2002 R 2 0.02061856
Plotting the prevalence of resistance over time:
plot(sec.AXO$ym, sec.AXO$freq, main = "Prevalence of ceftriaxone resistance in isolates of Salmonella and E. coli from NARMS retail meat", xlab = "Time (months)", ylab = "Proportion of isolates resistant to ceftriaxone")
Note: if the prevalence of resistance over time is made into a time series object and smoothed, it doesn’t look like the change is at 2012 - it seems to plateau a few years earlier. Not sure what to do with this information.
#creating time series object:
sec.AXO.ts <- ts(sec.AXO$freq, frequency = 12, start = c(2002, 1))
plot(sec.AXO.ts)
#smoothing with a 12-month window, so each point is roughly a yearly moving average:
sec.AXO.ts.SMA12 <- SMA(sec.AXO.ts,n=12)
plot.ts(sec.AXO.ts.SMA12, main = "Prevalence of AXO resistance in retail meat - smoothed", ylab = "Proportion of isolates resistant")
Most recent attempt at a generalized linear model, evaluating only the effect of the intervention with no covariates, and assuming that the trendlines before and after the intervention must meet at the breakpoint:
Creating a date field in which each timepoint is treated as the 15th of the month (because isolates were received throughout the month), a dummy variable for before/after the law restricting cephalosporins went into effect (assuming it took effect on May 1, 2012), and a dateLaw object holding the date of the law:
sec.AXO$by_date <- strptime(paste("15", sec.AXO$ym), "%d %b %Y")
sec.AXO$date <- as.Date(sec.AXO$by_date)
sec.AXO$dummy <- ifelse(sec.AXO$by_date<"2012-05-01",0,1)
dateLaw <- as.Date("2012-01-01")
Creating a GLM object that (…I think?) should predict the change in slope for a linear model after the date the law went into effect, with no covariates:
glm.sec.AXO <- glm(sec.AXO$freq ~ sec.AXO$date + pmax(sec.AXO$date - dateLaw, 0))
summary(glm.sec.AXO)
##
## Call:
## glm(formula = sec.AXO$freq ~ sec.AXO$date + pmax(sec.AXO$date -
## dateLaw, 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.061953 -0.022232 -0.007517 0.016976 0.117512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.178e-01 3.300e-02 -6.601 5.95e-10 ***
## sec.AXO$date 2.130e-05 2.408e-06 8.846 1.78e-15 ***
## pmax(sec.AXO$date - dateLaw, 0) -7.108e-05 1.067e-05 -6.661 4.33e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0009542413)
##
## Null deviance: 0.22500 on 159 degrees of freedom
## Residual deviance: 0.14982 on 157 degrees of freedom
## AIC: -653.7
##
## Number of Fisher Scoring iterations: 2
This is where I'm not sure I'm doing it right: I have trouble interpreting that summary without fully understanding the pmax() function. I had previously tried to build the GLM trendlines with a dummy variable, but when I used that model to predict the frequency from the original data (below), it almost always produced trendlines with identical slopes that didn't intersect.
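For what it's worth, a minimal sketch of what pmax() is doing in that model (using a few made-up dates rather than the NARMS data): pmax() takes the elementwise maximum, so pmax(date - dateLaw, 0) is a "hinge" term that is zero for dates before dateLaw and counts days after it. The coefficient on date is then the pre-law slope (per day), and the coefficient on the hinge is the change in slope after dateLaw, with the two fitted segments forced to meet at dateLaw rather than jumping in level.
#toy illustration of the hinge term (hypothetical dates, not the NARMS data):
toy_dates <- as.Date(c("2011-12-01", "2012-06-01", "2013-01-01"))
pmax(toy_dates - dateLaw, 0)
#returns 0 days for dates before dateLaw and the number of days since dateLaw otherwise,
#so the slope can change at dateLaw without a break in the fitted line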
Displaying the predictions of the GLM model against the actual prevalence data:
newdata <- data.frame(sec.AXO$date)
newdata$freq <- predict(glm.sec.AXO, newdata = newdata)
plot(sec.AXO$freq ~ sec.AXO$date, type = "l", main = "Frequency of Cephalosporin Resistance in E. coli and Salmonella in Retail Meat Before and After 2012 Restriction", xlab = "Time", ylab = "Proportion of Isolates Resistant")
lines(newdata$freq ~ newdata$sec.AXO.date)
Another method, based on https://rpubs.com/kaz_yos/its1 , which also allows for a change in intercept (a level shift) when the law went into effect, which is not reasonable here. In this parameterization the intercept is the pre-law trend's level at the law date, dateLawctr is the pre-law slope (per day), postLawTRUE is the level change at the law date, and dateLawctr:postLawTRUE is the change in slope afterwards:
#centering the data on the change in law, using a numeric version of the date:
sec.AXO$dateNum <- as.numeric(sec.AXO$date)
sec.AXO$dateLawctr <- sec.AXO$dateNum - as.numeric(dateLaw)
#determining which points are after the law, which seems redundant:
sec.AXO$postLaw <- (sec.AXO$date >= dateLaw)
glm.sec.AXO2 <- glm(freq ~ dateLawctr + postLaw + dateLawctr:postLaw, data = sec.AXO)
summary(glm.sec.AXO2)
##
## Call:
## glm(formula = freq ~ dateLawctr + postLaw + dateLawctr:postLaw,
## data = sec.AXO)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.065367 -0.020747 -0.007161 0.015465 0.115090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.132e-01 5.623e-03 20.126 < 2e-16 ***
## dateLawctr 2.302e-05 2.665e-06 8.638 6.35e-15 ***
## postLawTRUE -1.655e-02 1.120e-02 -1.478 0.141
## dateLawctr:postLawTRUE -5.759e-05 1.401e-05 -4.110 6.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0009470929)
##
## Null deviance: 0.22500 on 159 degrees of freedom
## Residual deviance: 0.14775 on 156 degrees of freedom
## AIC: -653.93
##
## Number of Fisher Scoring iterations: 2
Generating predictions to overlay on the original data:
#attempt to create the new predicted dataset using the methods from that walkthrough:
newdata2 <- data.frame(dateLawctr = seq(min(sec.AXO$dateLawctr), max(sec.AXO$dateLawctr), length.out = 160))
newdata2$postLaw <- (newdata2$dateLawctr >= 0)
newdata2$freq <- predict(glm.sec.AXO2, newdata = newdata2)
#splitting the data into "pre" and "post" sections:
newdata2pre <- subset(newdata2, dateLawctr < 0)
newdata2post <- subset(newdata2, dateLawctr >= 0)
Plotting the predictions of the second GLM model over the original data:
plot(sec.AXO$freq ~ sec.AXO$dateLawctr, type = "l", main = "Frequency of Cephalosporin Resistance in E. coli and Salmonella in Retail Meat Before and After 2012 Restriction", xlab = "Time (days relative to the law)", ylab = "Proportion of Isolates Resistant")
lines(newdata2post$freq ~ newdata2post$dateLawctr)
lines(newdata2pre$freq ~ newdata2pre$dateLawctr)
Forecasting using Holt-Winters exponential smoothing (with trend and an additive seasonal component):
sec.AXO.forecasts.HW <- HoltWinters(sec.AXO.ts)
sec.AXO.forecasts.HW
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sec.AXO.ts)
##
## Smoothing parameters:
## alpha: 0.1187211
## beta : 0
## gamma: 0.2439181
##
## Coefficients:
## [,1]
## a 0.0701384657
## b 0.0002725909
## s1 -0.0022264722
## s2 0.0007660117
## s3 -0.0082485110
## s4 -0.0086688479
## s5 -0.0030162577
## s6 0.0018989357
## s7 -0.0077848377
## s8 -0.0133825180
## s9 -0.0017713765
## s10 0.0165654732
## s11 0.0163671653
## s12 -0.0114692998
plot(sec.AXO.forecasts.HW)
sec.AXO.forecasts.HW.2 <- forecast.HoltWinters(sec.AXO.forecasts.HW, h=12)
plot.forecast(sec.AXO.forecasts.HW.2)
Why are there no residuals calculated for forecasts for the first year (months 1-12)? Is it that there wasn’t enough previous data to make predictions up to that point? I think this is preventing me from plotting the distribution of errors.
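My understanding is that HoltWinters() uses the first complete seasonal cycle (12 months here) to initialize the level, trend, and seasonal components, so fitted values, and therefore residuals, only begin at observation 13. A quick check of where the fitted series starts (it should be January 2003, which would explain the 12 leading NAs below):
#the fitted series from HoltWinters() should begin one full seasonal cycle into the data:
start(sec.AXO.forecasts.HW$fitted)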
plot.ts(sec.AXO.forecasts.HW.2$residuals)
#from http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
plotForecastErrors <- function(forecasterrors)
{
  # make a histogram of the forecast errors:
  mybinsize <- IQR(forecasterrors)/4
  mysd <- sd(forecasterrors)
  mymin <- min(forecasterrors) - mysd*5
  mymax <- max(forecasterrors) + mysd*3
  # generate normally distributed data with mean 0 and standard deviation mysd
  mynorm <- rnorm(10000, mean=0, sd=mysd)
  mymin2 <- min(mynorm)
  mymax2 <- max(mynorm)
  if (mymin2 < mymin) { mymin <- mymin2 }
  if (mymax2 > mymax) { mymax <- mymax2 }
  # make a red histogram of the forecast errors, with the normally distributed data overlaid:
  mybins <- seq(mymin, mymax, mybinsize)
  hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
  # freq=FALSE ensures the area under the histogram = 1
  # bin the normally distributed data on the same breaks, without plotting:
  myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
  # plot the normal curve as a blue line on top of the histogram of forecast errors:
  points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
#this won't work: plotForecastErrors(sec.AXO.forecasts.HW.2$residuals)
is.na(sec.AXO.forecasts.HW.2$residuals)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE
I can’t figure out how to get the function to plot forecast errors to accept na.rm = TRUE.
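One workaround (a sketch; I haven't confirmed it against this exact fit): rather than trying to pass na.rm = TRUE into the helper, drop the leading NAs from the residuals before calling it:
#strip the NA residuals from the first year before passing them to the helper:
hw.errors <- na.omit(as.numeric(sec.AXO.forecasts.HW.2$residuals))
plotForecastErrors(hw.errors)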
Trying out an ARIMA model. I don't know whether this is helpful for evaluating whether there was a change after the 2012 intervention, but I am using it to learn more about building an appropriate model for the data.
plot.ts(sec.AXO.ts)
sec.AXO.ts.diff.1 <- diff(sec.AXO.ts, differences=1)
plot.ts(sec.AXO.ts.diff.1)
acf(sec.AXO.ts.diff.1, lag.max=20)
I don't know what's going on with the x-axis (shouldn't it run from 1 to 20?), but if I am interpreting this right there is a significant autocorrelation at lag 1, which I would think makes sense.
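I think the x-axis is behaving as documented: for a ts with frequency = 12, acf() labels lags in units of whole seasonal periods (years), so a 1-month lag appears at 1/12. Dropping the ts attributes (a quick workaround, not a change to the analysis) labels the lags in months instead:
#acf() on a plain numeric vector uses frequency 1, so the lag axis runs 0-20 in months:
acf(as.numeric(sec.AXO.ts.diff.1), lag.max=20)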
pacf(sec.AXO.ts.diff.1, lag.max=20)
auto.arima(sec.AXO.ts.diff.1)
## Series: sec.AXO.ts.diff.1
## ARIMA(0,0,1)(1,0,0)[12] with zero mean
##
## Coefficients:
## ma1 sar1
## -0.8352 -0.1341
## s.e. 0.0382 0.0832
##
## sigma^2 estimated as 0.000969: log likelihood=326.34
## AIC=-646.69 AICc=-646.53 BIC=-637.48
It looks like the partial autocorrelation (I do not fully understand how it is determined or how it differs from the ordinary autocorrelation; as I understand it, it is the correlation at lag k after the effect of the shorter lags has been removed) is significant up to lag 4, although it declines steadily from lag 1. Does that mean I can use an ARMA(0,1)? The auto.arima function says that the best fit is ARIMA(0,0,1)(1,0,0)[12].
sec.AXO.arima.diff.1 <- arima(sec.AXO.ts.diff.1, order=c(0,0,1))
sec.AXO.arima.forecasts.diff.1 <- forecast.Arima(sec.AXO.arima.diff.1, order=c(0,0,1))
## Warning in forecast.Arima(sec.AXO.arima.diff.1, order = c(0, 0, 1)): The
## non-existent order arguments will be ignored.
acf(sec.AXO.arima.forecasts.diff.1$residuals, lag.max=20)
plotForecastErrors(sec.AXO.arima.forecasts.diff.1$residuals) #strange this one works even with NA values, although it pulls up error messages.
sec.AXO.arima <- arima(sec.AXO.ts, order=c(0,0,1))
sec.AXO.arima.forecasts <- forecast.Arima(sec.AXO.arima, order=c(0,0,1))
## Warning in forecast.Arima(sec.AXO.arima, order = c(0, 0, 1)): The non-
## existent order arguments will be ignored.
acf(sec.AXO.arima.forecasts$residuals, lag.max=20)
I don't understand how to introduce the 2012 law into a model that uses a time series object. I'd like to somehow reproduce this example from McLeod and Yu's paper:
“A brief example of a medical intervention analysis carried out using arima() will now be discussed. In a medical time series of monthly average creatinine clearances, a step intervention analysis model with a multiplicative seasonal ARIMA(0, 1, 1) (1, 0, 0)12 error term was fit. The intervention effect was found to be significant at 1%. To illustrate this finding, Figure 8 compares the forecasts before and after the intervention date. The forecasts are from a model fit to the pre-intervention series. The plot visually confirms the decrease in creatinine clearances after the intervention.” at http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf
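One possible way to set up that kind of step-intervention analysis here (a sketch only; I am assuming the law date and the error structure, and I have not checked this against the McLeod and Yu example): pass a 0/1 step variable to arima() through its xreg argument, so the coefficient on the step estimates the change in level associated with the law while the ARIMA terms handle the error structure. The 0/1 dummy created earlier (0 before May 2012, 1 afterwards) lines up with sec.AXO.ts, and ARIMA(0,1,1)(1,0,0)[12] on the original series should correspond to the ARIMA(0,0,1)(1,0,0)[12] that auto.arima() chose for the differenced series.
#step-intervention sketch: 0 before the law, 1 afterwards, supplied to arima() via xreg
law.step <- sec.AXO$dummy
sec.AXO.arima.xreg <- arima(sec.AXO.ts, order = c(0, 1, 1),
                            seasonal = list(order = c(1, 0, 0), period = 12),
                            xreg = law.step)
sec.AXO.arima.xreg
#the coefficient on the step regressor is the estimated level change after the restriction,
#analogous to the step intervention in the McLeod and Yu example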