Loading libraries:
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(TTR)
library(forecast)
## Loading required package: timeDate
## This is forecast 7.3
NARMS_retailmeat_public <- read.csv("narms_public.csv")
d <- NARMS_retailmeat_public
Subsetting the data to contain only the genera Salmonella and E. coli, and naming the subset psec.
psec <- subset(d,(d$GENUS=="S"|d$GENUS=="EC"))
Creating a year/month variable in psec:
psec$ym <- as.yearmon(paste(psec$Year, psec$Month, sep = "-"))
Generating a table of the proportion of isolates that were susceptible, resistant, or intermediate to ceftriaxone (a cephalosporin drug, abbreviated AXO) by month. This creates the dataset psec.AXO (shorthand for preliminary Salmonella/E. coli ceftriaxone), where ym is a yearmon variable, AXO.SIR is a categorical variable with levels I, S, and R, and freq gives the proportion of that month's isolates that were intermediate, susceptible, or resistant to ceftriaxone.
psec.AXO <- psec %>% group_by(ym, AXO.SIR) %>% summarise(n = n()) %>% mutate(freq = n / sum(n))
head(psec.AXO)
## Source: local data frame [6 x 4]
## Groups: ym [3]
##
## ym AXO.SIR n freq
## <S3: yearmon> <fctr> <int> <dbl>
## 1 Jan 2002 R 1 0.01063830
## 2 Jan 2002 S 93 0.98936170
## 3 Feb 2002 R 6 0.06122449
## 4 Feb 2002 S 92 0.93877551
## 5 Mar 2002 R 1 0.01123596
## 6 Mar 2002 S 88 0.98876404
Subsetting psec.AXO to keep only the monthly proportion of isolates that were resistant to ceftriaxone (which implicitly lumps Intermediate and Susceptible together as non-resistant):
sec.AXO <- subset(psec.AXO, AXO.SIR=="R")
head(sec.AXO)
## Source: local data frame [6 x 4]
## Groups: ym [6]
##
## ym AXO.SIR n freq
## <S3: yearmon> <fctr> <int> <dbl>
## 1 Jan 2002 R 1 0.01063830
## 2 Feb 2002 R 6 0.06122449
## 3 Mar 2002 R 1 0.01123596
## 4 Apr 2002 R 1 0.01086957
## 5 May 2002 R 9 0.08571429
## 6 Jun 2002 R 2 0.02061856
Plotting the prevalence of resistance over time:
plot(sec.AXO$ym, sec.AXO$freq, main = "Prevalence of ceftriaxone resistance in isolates of Salmonella and E. coli from NARMS retail meat", xlab = "Time (months)", ylab = "Proportion of isolates resistant to ceftriaxone")
Note: if the prevalence of resistance over time is made into a time series object and smoothed, it doesn’t look like the change is at 2012 - it seems to plateau a few years earlier. Not sure what to do with this information.
#creating time series object:
sec.AXO.ts <- ts(sec.AXO$freq, frequency = 12, start = c(2002, 1))
plot(sec.AXO.ts)
#smoothing with a 12-month window, so each point is roughly a yearly moving average:
sec.AXO.ts.SMA12 <- SMA(sec.AXO.ts,n=12)
plot.ts(sec.AXO.ts.SMA12, main = "Prevalence of AXO resistance in retail meat - smoothed", ylab = "Proportion of isolates resistant")
Most recent attempt at a generalized linear model, evaluating only the effect of the intervention with no covariates, and assuming that the trendlines before and after the intervention must meet at the breakpoint:
Creating a date field in which each timepoint is treated as the 15th of the month (because isolates were received throughout the month), a dummy variable for before/after the law restricting cephalosporins went into effect (assuming it took effect on May 1, 2012), and a dateLaw object holding the date of the law:
sec.AXO$by_date <- strptime(paste("15", sec.AXO$ym), "%d %b %Y")
sec.AXO$date <- as.Date(sec.AXO$by_date)
sec.AXO$dummy <- ifelse(sec.AXO$by_date<"2012-05-01",0,1)
dateLaw <- as.Date("2012-01-01")
Creating a GLM object that (…I think?) should predict the change in slope for a linear model after the date the law went into effect, with no covariates:
glm.sec.AXO <- glm(sec.AXO$freq ~ sec.AXO$date + pmax(sec.AXO$date - dateLaw, 0))
summary(glm.sec.AXO)
##
## Call:
## glm(formula = sec.AXO$freq ~ sec.AXO$date + pmax(sec.AXO$date -
## dateLaw, 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.061953 -0.022232 -0.007517 0.016976 0.117512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.178e-01 3.300e-02 -6.601 5.95e-10 ***
## sec.AXO$date 2.130e-05 2.408e-06 8.846 1.78e-15 ***
## pmax(sec.AXO$date - dateLaw, 0) -7.108e-05 1.067e-05 -6.661 4.33e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0009542413)
##
## Null deviance: 0.22500 on 159 degrees of freedom
## Residual deviance: 0.14982 on 157 degrees of freedom
## AIC: -653.7
##
## Number of Fisher Scoring iterations: 2
This is where I'm not sure I'm doing it right: I have trouble interpreting that summary without fully understanding the pmax() function. I had previously tried to build the GLM trendlines with a dummy variable, but when I used that model to predict the frequency from the original data (below), it almost always produced trendlines with identical slopes that didn't intersect.
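For what it's worth, a minimal sketch of what pmax() is doing in that model (using a few made-up dates rather than the NARMS data): pmax() takes the elementwise maximum, so pmax(date - dateLaw, 0) is a "hinge" term that is zero for dates before dateLaw and counts days after it. The coefficient on date is then the pre-law slope (per day), and the coefficient on the hinge is the change in slope after dateLaw, with the two fitted segments forced to meet at dateLaw rather than jumping in level.
#toy illustration of the hinge term (hypothetical dates, not the NARMS data):
toy_dates <- as.Date(c("2011-12-01", "2012-06-01", "2013-01-01"))
pmax(toy_dates - dateLaw, 0)
#returns 0 days for dates before dateLaw and the number of days since dateLaw otherwise,
#so the slope can change at dateLaw without a break in the fitted line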
Displaying the predictions of the GLM model against the actual prevalence data:
newdata <- data.frame(sec.AXO$date)
newdata$freq <- predict(glm.sec.AXO, newdata = newdata)
plot(sec.AXO$freq ~ sec.AXO$date, type = "l", main = "Frequency of Cephalosporin Resistance in E. coli and Salmonella in Retail Meat Before and After 2012 Restriction", xlab = "Time", ylab = "Proportion of Isolates Resistant")
lines(newdata$freq ~ newdata$sec.AXO.date)
Another method, based on https://rpubs.com/kaz_yos/its1 , which also allows for a change in intercept (a level shift) when the law went into effect, which is not reasonable here. In this parameterization the intercept is the pre-law trend's level at the law date, dateLawctr is the pre-law slope (per day), postLawTRUE is the level change at the law date, and dateLawctr:postLawTRUE is the change in slope afterwards:
#centering the data on the change in law, using a numeric version of the date:
sec.AXO$dateNum <- as.numeric(sec.AXO$date)
sec.AXO$dateLawctr <- sec.AXO$dateNum - as.numeric(dateLaw)
#determining which points are after the law, which seems redundant:
sec.AXO$postLaw <- (sec.AXO$date >= dateLaw)
glm.sec.AXO2 <- glm(freq ~ dateLawctr + postLaw + dateLawctr:postLaw, data = sec.AXO)
summary(glm.sec.AXO2)
##
## Call:
## glm(formula = freq ~ dateLawctr + postLaw + dateLawctr:postLaw,
## data = sec.AXO)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.065367 -0.020747 -0.007161 0.015465 0.115090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.132e-01 5.623e-03 20.126 < 2e-16 ***
## dateLawctr 2.302e-05 2.665e-06 8.638 6.35e-15 ***
## postLawTRUE -1.655e-02 1.120e-02 -1.478 0.141
## dateLawctr:postLawTRUE -5.759e-05 1.401e-05 -4.110 6.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0009470929)
##
## Null deviance: 0.22500 on 159 degrees of freedom
## Residual deviance: 0.14775 on 156 degrees of freedom
## AIC: -653.93
##
## Number of Fisher Scoring iterations: 2
Generating predictions to overlay on the original data:
#attempt to create the new predicted dataset using the methods from that walkthrough:
newdata2 <- data.frame(dateLawctr = seq(min(sec.AXO$dateLawctr), max(sec.AXO$dateLawctr), length.out = 160))
newdata2$postLaw <- (newdata2$dateLawctr >= 0)
newdata2$freq <- predict(glm.sec.AXO2, newdata = newdata2)
#splitting the data into "pre" and "post" sections:
newdata2pre <- subset(newdata2, dateLawctr < 0)
newdata2post <- subset(newdata2, dateLawctr >= 0)
Plotting the predictions of the second GLM model over the original data:
plot(sec.AXO$freq ~ sec.AXO$dateLawctr, type = "l", main = "Frequency of Cephalosporin Resistance in E. coli and Salmonella in Retail Meat Before and After 2012 Restriction", xlab = "Time (days relative to the law)", ylab = "Proportion of Isolates Resistant")
lines(newdata2post$freq ~ newdata2post$dateLawctr)
lines(newdata2pre$freq ~ newdata2pre$dateLawctr)
Forecasting using Holt-Winters exponential smoothing (with trend and an additive seasonal component):
sec.AXO.forecasts.HW <- HoltWinters(sec.AXO.ts)
sec.AXO.forecasts.HW
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sec.AXO.ts)
##
## Smoothing parameters:
## alpha: 0.1187211
## beta : 0
## gamma: 0.2439181
##
## Coefficients:
## [,1]
## a 0.0701384657
## b 0.0002725909
## s1 -0.0022264722
## s2 0.0007660117
## s3 -0.0082485110
## s4 -0.0086688479
## s5 -0.0030162577
## s6 0.0018989357
## s7 -0.0077848377
## s8 -0.0133825180
## s9 -0.0017713765
## s10 0.0165654732
## s11 0.0163671653
## s12 -0.0114692998
plot(sec.AXO.forecasts.HW)
sec.AXO.forecasts.HW.2 <- forecast.HoltWinters(sec.AXO.forecasts.HW, h=12)
plot.forecast(sec.AXO.forecasts.HW.2)
Why are there no residuals calculated for forecasts for the first year (months 1-12)? Is it that there wasn’t enough previous data to make predictions up to that point? I think this is preventing me from plotting the distribution of errors.
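My understanding is that HoltWinters() uses the first complete seasonal cycle (12 months here) to initialize the level, trend, and seasonal components, so fitted values, and therefore residuals, only begin at observation 13. A quick check of where the fitted series starts (it should be January 2003, which would explain the 12 leading NAs below):
#the fitted series from HoltWinters() should begin one full seasonal cycle into the data:
start(sec.AXO.forecasts.HW$fitted)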
plot.ts(sec.AXO.forecasts.HW.2$residuals)
#from http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
plotForecastErrors <- function(forecasterrors)
{
  # make a histogram of the forecast errors:
  mybinsize <- IQR(forecasterrors)/4
  mysd <- sd(forecasterrors)
  mymin <- min(forecasterrors) - mysd*5
  mymax <- max(forecasterrors) + mysd*3
  # generate normally distributed data with mean 0 and standard deviation mysd
  mynorm <- rnorm(10000, mean=0, sd=mysd)
  mymin2 <- min(mynorm)
  mymax2 <- max(mynorm)
  if (mymin2 < mymin) { mymin <- mymin2 }
  if (mymax2 > mymax) { mymax <- mymax2 }
  # make a red histogram of the forecast errors, with the normally distributed data overlaid:
  mybins <- seq(mymin, mymax, mybinsize)
  hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
  # freq=FALSE ensures the area under the histogram = 1
  # bin the normally distributed data on the same breaks, without plotting:
  myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
  # plot the normal curve as a blue line on top of the histogram of forecast errors:
  points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
#this won't work: plotForecastErrors(sec.AXO.forecasts.HW.2$residuals)
is.na(sec.AXO.forecasts.HW.2$residuals)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE
I can’t figure out how to get the function to plot forecast errors to accept na.rm = TRUE.
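One workaround (a sketch; I haven't confirmed it against this exact fit): rather than trying to pass na.rm = TRUE into the helper, drop the leading NAs from the residuals before calling it:
#strip the NA residuals from the first year before passing them to the helper:
hw.errors <- na.omit(as.numeric(sec.AXO.forecasts.HW.2$residuals))
plotForecastErrors(hw.errors)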
Trying out an ARIMA model. I don't know whether this is helpful for evaluating whether there was a change after the 2012 intervention, but I am using it to learn more about building an appropriate model for the data.
plot.ts(sec.AXO.ts)
sec.AXO.ts.diff.1 <- diff(sec.AXO.ts, differences=1)
plot.ts(sec.AXO.ts.diff.1)
acf(sec.AXO.ts.diff.1, lag.max=20)
I don't know what's going on with the x-axis (shouldn't it run from 1 to 20?), but if I am interpreting this right there is a significant autocorrelation at lag 1, which I would think makes sense.
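I think the x-axis is behaving as documented: for a ts with frequency = 12, acf() labels lags in units of whole seasonal periods (years), so a 1-month lag appears at 1/12. Dropping the ts attributes (a quick workaround, not a change to the analysis) labels the lags in months instead:
#acf() on a plain numeric vector uses frequency 1, so the lag axis runs 0-20 in months:
acf(as.numeric(sec.AXO.ts.diff.1), lag.max=20)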
pacf(sec.AXO.ts.diff.1, lag.max=20)
auto.arima(sec.AXO.ts.diff.1)
## Series: sec.AXO.ts.diff.1
## ARIMA(0,0,1)(1,0,0)[12] with zero mean
##
## Coefficients:
## ma1 sar1
## -0.8352 -0.1341
## s.e. 0.0382 0.0832
##
## sigma^2 estimated as 0.000969: log likelihood=326.34
## AIC=-646.69 AICc=-646.53 BIC=-637.48
It looks like the partial autocorrelation (I do not fully understand how it is determined or how it differs from the ordinary autocorrelation; as I understand it, it is the correlation at lag k after the effect of the shorter lags has been removed) is significant up to lag 4, although it declines steadily from lag 1. Does that mean I can use an ARMA(0,1)? The auto.arima function says that the best fit is ARIMA(0,0,1)(1,0,0)[12].
sec.AXO.arima.diff.1 <- arima(sec.AXO.ts.diff.1, order=c(0,0,1))
sec.AXO.arima.forecasts.diff.1 <- forecast.Arima(sec.AXO.arima.diff.1, order=c(0,0,1))
## Warning in forecast.Arima(sec.AXO.arima.diff.1, order = c(0, 0, 1)): The
## non-existent order arguments will be ignored.
acf(sec.AXO.arima.forecasts.diff.1$residuals, lag.max=20)
plotForecastErrors(sec.AXO.arima.forecasts.diff.1$residuals) #strange this one works even with NA values, although it pulls up error messages.
sec.AXO.arima <- arima(sec.AXO.ts, order=c(0,0,1))
sec.AXO.arima.forecasts <- forecast.Arima(sec.AXO.arima, order=c(0,0,1))
## Warning in forecast.Arima(sec.AXO.arima, order = c(0, 0, 1)): The non-
## existent order arguments will be ignored.
acf(sec.AXO.arima.forecasts$residuals, lag.max=20)
I don't understand how to introduce the 2012 law into a model that uses a time series object. I'd like to somehow reproduce this example from McLeod and Yu's paper:
“A brief example of a medical intervention analysis carried out using arima() will now be discussed. In a medical time series of monthly average creatinine clearances, a step intervention analysis model with a multiplicative seasonal ARIMA(0, 1, 1) (1, 0, 0)12 error term was fit. The intervention effect was found to be significant at 1%. To illustrate this finding, Figure 8 compares the forecasts before and after the intervention date. The forecasts are from a model fit to the pre-intervention series. The plot visually confirms the decrease in creatinine clearances after the intervention.” at http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf
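One possible way to set up that kind of step-intervention analysis here (a sketch only; I am assuming the law date and the error structure, and I have not checked this against the McLeod and Yu example): pass a 0/1 step variable to arima() through its xreg argument, so the coefficient on the step estimates the change in level associated with the law while the ARIMA terms handle the error structure. The 0/1 dummy created earlier (0 before May 2012, 1 afterwards) lines up with sec.AXO.ts, and ARIMA(0,1,1)(1,0,0)[12] on the original series should correspond to the ARIMA(0,0,1)(1,0,0)[12] that auto.arima() chose for the differenced series.
#step-intervention sketch: 0 before the law, 1 afterwards, supplied to arima() via xreg
law.step <- sec.AXO$dummy
sec.AXO.arima.xreg <- arima(sec.AXO.ts, order = c(0, 1, 1),
                            seasonal = list(order = c(1, 0, 0), period = 12),
                            xreg = law.step)
sec.AXO.arima.xreg
#the coefficient on the step regressor is the estimated level change after the restriction,
#analogous to the step intervention in the McLeod and Yu example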