This report is on using Spatio-Temporal Conditionally Autoregressive (CAR) models for areal data. I look at two data sets provided in R packages that implement the CAR techniques. I then attempted to use the methods on the Google Trends flu data, but had little success with the flu data.

What’s in this report?

  1. Introduction
  2. Spatial CAR Models for Areal Data
  3. Spatial-Temporal CAR Models for Areal Data
  4. Google Flu
  5. Conclusions

1. Introduction

Researchers often encounter data that exist in non-overlapping spatial areal units. Areal units are finite in number, usually exist on a type of grid, and are typically sums or averages, e.g. census tracts, disease prevalence in counties, etc. Areal data typically display spatial autocorrelation; that is, near things are more similar than distant things. To be a little more rigorous, spatial autocorrelation measures how similar observations are to their nearby observations compared with observations farther away. As statisticians, we are interested in modeling and understanding this spatial autocorrelation.

The go-to tool in a statistician's toolbox is the regression model. One could turn to this modeling technique to account for some of the spatial autocorrelation in the data, but it is not a panacea. Even after the covariate effects have been accounted for, the model can still display spatial autocorrelation in the residuals, which violates a key regression modeling assumption. Other issues that are difficult to account for in the typical regression model are neighbor effects and grouping effects. A neighbor effect is simply the influence one "neighbor" might have on another, and a grouping effect is the tendency of similar people to live near people similar to them.

Enter the Conditionally Autoregressive (CAR) model (Besag et al., 1991). The CAR model supplements the linear predictor with a set of spatially autocorrelated random effects, as part of a Bayesian hierarchical model. These random effects are typically represented with a conditional autoregressive prior, which induces spatial autocorrelation through the adjacency structure of the areal units. However, even with the spatial effects handled by the CAR model, most data today also exist in the domain of time. So now our data of K areal units exist in T consecutive time periods, producing a rectangular \(K \times T\) array of spatio-temporal observations. It would be ideal to model not only the spatial effects in the data but also the temporal effects that exist in our areal units.

2. CAR Models for Areal Data

Before we dive into space and time, I will first explore the CARBayes package and its framework, a package created by Duncan Lee of the University of Glasgow. The CARBayes package implements a class of univariate and multivariate spatial generalized linear mixed models for areal unit data, with inference in a Bayesian setting using Markov chain Monte Carlo (MCMC) simulation (Lee, 1). Dr. Lee provides several worked examples using this package on data involving Scottish lip cancer and property prices in Greater Glasgow. In the vignette, the Scottish lip cancer data are only used to showcase how to combine a data frame and a shapefile, and none of the CARBayes model-fitting functions are applied to them. I will therefore perform an analysis of the Scottish lip cancer data in the style of the vignette's other worked examples.

2.1 Framework for CAR Model

  • A set of \(k = 1, \ldots, K\) non-overlapping areal units \(S = \{S_1, \ldots, S_K\}\)
  • Linked to a corresponding set of responses \(Y = (Y_1, \ldots, Y_K)\) and a vector of known offsets \(O = (O_1, \ldots, O_K)\).
  • The spatial variation in the response is modeled by a matrix of covariates \(X = (x_1, \ldots, x_K)\) and a spatial structure component \(\psi = (\psi_1, \ldots, \psi_K)\), the latter of which is included to model any spatial autocorrelation that remains in the data after the covariate effects have been accounted for.
  • The vector of covariates for areal unit \(S_k\) is denoted by \(x_k = (1, x_{k1}, \ldots, x_{kp})\)

  • Use the hierarchical Bayesian setting: \[Y_{k}|\mu_{k} \sim f(y_{k}|\mu_{k}, \nu^2), \quad k = 1, \ldots, K\] \[g(\mu_{k}) = x^T_{k} \beta + O_{k} + \psi_{k}\] \[\beta \sim N(\mu_\beta, \Sigma_\beta)\]
  • Within \(\psi\), spatial autocorrelation is controlled by a weight or adjacency matrix \(W\), made explicit in the prior shown below.
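
To make the last bullet concrete, the Leroux et al. (2000) CAR prior that CARBayes uses for \(\psi\) (my paraphrase of the form described in the vignette) specifies each random effect conditional on all the others as

\[\psi_{k}|\psi_{-k} \sim N\left(\frac{\rho \sum_{j=1}^{K} w_{kj}\psi_{j}}{\rho \sum_{j=1}^{K} w_{kj} + 1 - \rho}, \; \frac{\tau^2}{\rho \sum_{j=1}^{K} w_{kj} + 1 - \rho}\right),\]

where \(w_{kj} = 1\) if areal units \(S_k\) and \(S_j\) share a border and 0 otherwise, \(\tau^2\) is a variance parameter, and \(\rho \in [0, 1]\) controls the strength of spatial autocorrelation (\(\rho = 0\) gives independent random effects, \(\rho = 1\) the intrinsic CAR prior of Besag et al. (1991)).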

2.2 Scottish Lip Cancer Data Summary Statistics and Plots

The Scottish lip cancer data are purely spatial data summarizing lip cancer cases and possible covariate risk factors for the period 1975 to 1986, for 56 districts in Scotland. Observed is the response variable to be modeled: the total number of lip cancer cases in each areal unit. Expected, pcaff, latitude and longitude are the covariates. Pcaff is the percentage of the workforce in each district employed in agriculture, fishing and forestry; the others are self-explanatory. Here I attempt to identify whether the covariates in the data set affect lip cancer and to quantify their effects.

Summary Statistics
Variable     Min.     1st Qu.   Median    Mean     3rd Qu.   Max.
observed      0.000    4.750     8.000     9.571   11.000    39.000
expected      1.100    4.050     6.300     9.575   10.125    88.700
pcaff         0.000    1.000     7.000     8.661   11.500    24.000
latitude     54.94    55.78     56.04     56.40    57.02     60.24
longitude     1.430    3.288     4.090     4.012    4.730     6.800

Below is a summary and plot of the W matrix, which is the backbone of the CAR model; a sketch of its construction follows this paragraph. The matrix controls the spatial autocorrelation structure of the random effects. I have reported the summary statistics for the areal data of Scotland as well as provided a map of Scotland detailing the graph structure of the W matrix. Also reported below are maps of the counts and the covariates provided in the data. I notice that the observed counts in the northeast portion of Scotland differ substantially from the expected counts. Perhaps this is due to the pcaff variable; that is, more people on this part of the coast are employed in the fishing industry.
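
A condensed sketch of the W construction, mirroring the appendix code (spdep's poly2nb treats districts that share a border as neighbours):

# build the binary adjacency matrix W for the 56 Scottish districts
library(spdep); library(CARBayes); library(CARBayesdata)
data(lipdata); data(lipdbf); data(lipshp)
lipdbf$dbf <- lipdbf$dbf[ , c(2, 1)]   # reorder the dbf columns so polygon IDs match
data.combined <- combine.data.shapefile(data = lipdata, shp = lipshp, dbf = lipdbf)
W.nb <- poly2nb(data.combined, row.names = rownames(lipdata))  # neighbours share a border
W <- nb2mat(W.nb, style = "B")         # binary: w_kj = 1 if districts k and j are adjacent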

As part of my exploratory data analysis I computed Moran's I, which is a way to establish that there is some spatial dependence in the data before proceeding. Moran's I is a test statistic for investigating the presence of spatial correlation in data; its null hypothesis is that there is no spatial correlation. Using the moran.mc function from the spdep package on the residuals of a Poisson GLM (see the sketch below), I found a Moran's I of 0.19331 with a corresponding p-value of 0.00699, which suggests that the residuals contain substantial positive spatial autocorrelation. However, Moran's I gives no indication of where this spatial correlation is located or the magnitude of its effects on the response variable in the Scottish lip cancer data set. I turn to other techniques to find this correlation and the effects.
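
A minimal sketch of the test, mirroring the appendix code, where the statistic is computed on the residuals of a covariate-only Poisson GLM:

# Moran's I permutation test on Poisson GLM residuals (from the appendix code)
glmmodel <- glm(observed ~ ., family = "poisson", data = data.combined@data)
resid.glmmodel <- residuals(glmmodel)
W.list <- nb2listw(W.nb, style = "B")   # moran.mc needs the neighbourhood in listw form
moran.mc(x = resid.glmmodel, listw = W.list, nsim = 1000)  # permutation-based p-value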

2.3 Spatial CAR model implementation

I will now use the CARBayes package for modelling areal unit data with conditionally autoregressive (CAR) priors. I chose three models to try to understand the covariate effects on the response. First is the S.CARleroux model, which uses a CAR prior that can capture varying strengths of spatial autocorrelation using a single set of random effects. Second is the same model, but with a Poisson likelihood rather than a Gaussian one. Third is the S.CARlocalised model, which allows for localized spatial autocorrelation by modelling the elements of W; it augments the set of spatially smooth random effects with a piecewise-constant intercept (cluster) component, allowing large jumps in the mean surface between adjacent areal units in different clusters. Each model uses the same \(W\) matrix, and inference is the same for each model: 10,000 retained samples, obtained after a burn-in of 20,000 samples and thinning by 10 to reduce their autocorrelation. The calls are sketched below.
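
A condensed sketch of the three fits, mirroring the appendix code; note that (n.sample - burnin)/thin = (120000 - 20000)/10 gives the 10,000 retained samples:

# fit the three spatial models (as in the appendix code)
formula <- observed ~ expected + pcaff + latitude + longitude
model.spatial1 <- S.CARleroux(formula = formula, data = data.combined@data, family = "gaussian",
                              W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)
model.spatial2 <- S.CARleroux(formula = formula, data = data.combined@data, family = "poisson",
                              W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)
model.spatial3 <- S.CARlocalised(formula = formula, G = 5, data = data.combined@data, family = "poisson",
                                 W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)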

I use the DIC as a measure of model fit; it is a hierarchical-modeling generalization of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), and is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. The third model has the lowest DIC, so it is the preferred model. The 95% credible intervals for expected, latitude and longitude do not contain 0, which I take to mean they have an impact on the observed amount of cancer. Pcaff, the percentage of people employed in agriculture, fishing and forestry, is not significant, as its interval contains 0. The model seems to be saying that latitude and longitude have more of an impact on getting lip cancer than employment in those three industries. The DIC and credible-interval extraction are sketched below.
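
A sketch of the comparison, mirroring the appendix code, where modelfit[1] holds the DIC in a fitted CARBayes object:

# compare DIC across models and summarise 95% credible intervals
c(Model1 = model.spatial1$modelfit[1], Model2 = model.spatial2$modelfit[1],
  Model3 = model.spatial3$modelfit[1])   # lower DIC indicates a better fit
betas3 <- summarise.samples(model.spatial3$samples$beta, quantiles = c(0.5, 0.025, 0.975))
betas3$quantiles   # posterior median and 95% interval; a covariate matters if 0 lies outside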

3. Spatial-Temporal CAR Models

3.1 Spatial-Temporal CAR Framework

Moving forward, I will now introduce time into the CAR model…well, not me literally. The data now consist of \(K\) areal units and \(N\) consecutive time periods, giving a \(K \times N\) array of spatio-temporal observations. The same hierarchical Bayesian setting is used as in the CAR model section, except that observations are now also indexed by time.

\[Y_{kt}|\mu_{kt} \sim f(y_{kt}|\mu_{kt}, \nu^2), \quad k = 1, \ldots, K; \; t = 1, \ldots, N\] \[g(\mu_{kt}) = x^T_{kt} \beta + O_{kt} + \psi_{kt}\] \[\beta \sim N(\mu_\beta, \Sigma_\beta)\]

Again, the spatial autocorrelation is controlled within \(\psi\) through a weight or adjacency matrix \(W\).

3.2 Glasgow air pollution Data Summary Statistics and Plots

This data set contains spatio-temporal data on respiratory hospitalizations, air pollution concentrations and socio-economic deprivation covariates for the 271 Intermediate Geographies (IG) that make up the Greater Glasgow and Clyde health board in Scotland. The goal is to understand the relationship between air pollution concentrations and the risk of respiratory disease. It is a worked example contained in the CARBayesST package.

The variables are:

  • IG, an identifier (similar to a zip code/census tract) used by the health board
  • year, 2007 to 2011
  • observed and expected, the observed and expected numbers of respiratory hospitalizations
  • pm10, the average particulate matter (less than 10 microns) concentration in each IG and year
  • jsa, the percentage of working-age people in receipt of Job Seekers Allowance, a benefit paid to unemployed people looking for work, in each IG and year
  • price, the average property price (in hundreds of thousands of pounds) in each IG and year

SMR is the standardized morbidity ratio, which is just the ratio of observed to expected hospitalizations from respiratory disease. The response variable is the observed number of hospitalizations. The author uses a pairs plot to look at the relationships between the covariates; from the positive relationship between SMR and jsa and the negative one between SMR and price in the pairs plots (not reported), he concludes that an increasing level of poverty leads to an increased risk of respiratory hospitalization. The SMR computation and the pairs-plot call are sketched below.
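
A condensed sketch of that computation, mirroring the appendix code:

# compute the SMR and draw the pairs plot (from the appendix code)
library(CARBayesST)
data("pollutionhealthdata")
pollutionhealthdata$SMR <- pollutionhealthdata$observed / pollutionhealthdata$expected
pollutionhealthdata$logSMR <- log(pollutionhealthdata$SMR)   # log scale for plotting
pairs(pollutionhealthdata[ , c(9, 5:7)], pch = 19, cex = 0.5, lower.panel = NULL,
      panel = panel.smooth, labels = c("ln(SMR)", "PM10", "JSA", "Price (*100,000)"))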

Summary Statistics
Variable    Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
year        2007      2008      2009      2009      2010      2011
observed    10.0      54.0      74.0      79.2      99.0      213.0
expected    44.47     72.71     89.00     92.35     109.64    180.54
pm10        7.838     10.953    12.117    12.326    13.466    19.612
jsa         0.300     2.050     3.700     4.191     5.925     13.750
price       0.228     0.880     1.150     1.277     1.541     4.300
SMR         0.2091    0.6020    0.8458    0.8650    1.0820    2.1871

Below are two plots, of SMR and price, for Glasgow. The first plot is a visualization of the SMR over all five years for Glasgow's IGs. The green areas are where the SMR risk is low, and the orange and silver areas indicate elevated risk. The second plot is a visualization of the average property price in each IG over the same time period. To my untrained eye, the bulk of the elevated SMR is located in the city center, where property values are low. The low-risk SMR areas ring the city, where there are many more IGs with high property values.

As part of my exploratory data analysis I again computed Moran's I to establish that there is some spatial dependence in the data before proceeding. I computed Moran's I for 2007, 2009 and 2010; for some reason I couldn't get 2008 and 2011 to work. Computing the statistic on the GLM residuals for each year (see the sketch below), I found values of 0.10358, 0.013545 and 0.10403 respectively, with corresponding p-values of 0.0034, 0.31227 and 0.0028. This suggests that the residuals contain substantial positive spatial autocorrelation in 2007 and 2010, but interestingly not in 2009. However, Moran's I gives no indication of where this spatial correlation is located in the data set. I turn to other techniques to find this correlation.
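
A condensed sketch of the per-year tests, mirroring the appendix code; each year's 271 IGs occupy a contiguous block of rows, so the residuals are subset by position:

# per-year Moran's I on quasi-Poisson GLM residuals (from the appendix code)
data("GGHB.IG")
W.nb <- poly2nb(GGHB.IG)                       # Glasgow IG neighbourhood structure
W.list <- nb2listw(W.nb, style = "B")
formula <- observed ~ offset(log(expected)) + jsa + price + pm10
resid.glm <- residuals(glm(formula, family = "quasipoisson", data = pollutionhealthdata))
moran.mc(x = resid.glm[1:271],    listw = W.list, nsim = 10000)  # 2007
moran.mc(x = resid.glm[544:814],  listw = W.list, nsim = 10000)  # 2009
moran.mc(x = resid.glm[815:1085], listw = W.list, nsim = 10000)  # 2010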

3.3 Spatio-Temporal CAR model implementation

I look at two models, ST.CARanova and ST.CARar, to try to understand the data set; their calls are sketched after the two model descriptions below. Each model uses the same \(W\) matrix, and inference is the same for each: 20,000 retained samples, obtained after a burn-in of 20,000 samples and thinning by 10 to reduce their autocorrelation. Below are trace plots and posterior distributions for the covariates, as well as 95% credible intervals. The DIC is again used to establish which model is a better fit.

ST.CARanova

This model was discussed in the Knorr-Held paper that I attempted to present in class. It decomposes the spatio-temporal variation in the data into an overall spatial effect, an overall temporal trend effect and a space-time interaction. Below are trace plots and posterior distributions for the model.

ST.CARar

This model was proposed by Rushworth and is the one used on this data set in the vignette. It represents the spatio-temporal structure with a multivariate first-order autoregressive process with a spatially correlated precision matrix. Below are trace plots and posterior distributions for the model.
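
A condensed sketch of the two fits, mirroring the appendix code:

# fit the two spatio-temporal models (as in the appendix code)
W <- nb2mat(W.nb, style = "B")   # binary adjacency matrix for the Glasgow IGs
model1 <- ST.CARanova(formula = formula, family = "poisson", data = pollutionhealthdata,
                      W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
model2 <- ST.CARar(formula = formula, family = "poisson", data = pollutionhealthdata,
                   W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)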

Interestingly, the DIC points to the Knorr-Held (ST.CARanova) model as the superior choice, whereas the vignette uses the ST.CARar model to investigate this data set. However, I am unsure whether a DIC difference of about 10 is an indicator of a superior model. The trace plot of Model 2 for JSA looks like it did not mix that well; perhaps this influences the DIC. For the ST.CARanova model, the estimated relative risk for a one-unit increase in PM10 is 1.024 (95% credible interval 1.007, 1.039), suggesting that such an increase corresponds to 2.4% additional hospital admissions. The corresponding relative risk for a one-percent increase in JSA is 1.058 (1.045, 1.070), while for a one-hundred-thousand-pound increase in property price (the units of the property price data are hundreds of thousands of pounds) the risk is 0.828 (0.789, 0.868). I believe these credible intervals back up the intuition from the EDA that respiratory illness increases as property prices decrease, and vice versa. The computation of these relative risks is sketched below.
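
A sketch of how the relative risks above are obtained, mirroring the appendix code: the regression samples are exponentiated before summarising.

# posterior relative risks as exponentiated regression samples
colnames(model1$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
parameter.summary1 <- summarise.samples(exp(model1$samples$beta[ , -1]),
                                        quantiles = c(0.5, 0.025, 0.975))
round(parameter.summary1$quantiles, 3)  # median relative risk with 95% credible interval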

Conclusion

CAR models for exploring areal data over time are a powerful tool for understanding spatial dependence, especially when coupled with MCMC techniques. I would have liked to run these methods on the Google data to gain more understanding of flu trends. However, having only a basic understanding of CAR models, I think the selection of the Google data set was a poor choice. Looking back on it now, I should have just acquired the CDC flu data and run an analysis, with a look at the Google data as a bonus if time had permitted.

Appendix

Resources

  1. CARBayesST vignette
  2. CARBayes vignette
  3. 597 Lecture notes
  4. Review of spatio-temporal models for disease mapping
  5. Knorr-Held paper

Code

# loading packages
library(sp); library(plyr); library(dplyr); library(tidyr); library(reshape2)
library(surveillance); library(spdep); library(maps); library(maptools); library(usmap)
library(CARBayes); library(CARBayesdata); library(CARBayesST); library(shapefiles)
library(knitr); library(RColorBrewer); library(coda); library(spam); library(cdcfluview)

# setting global chunk options
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, comment = "", cache = TRUE)

data(lipdata);data(lipdbf);data(lipshp)
kable(summary(lipdata), caption="Summary Statistics")
lipdbf$dbf <- lipdbf$dbf[ ,c(2,1)]
data.combined <- combine.data.shapefile(data=lipdata, shp=lipshp, dbf=lipdbf)
W.nb <-poly2nb(data.combined, row.names = rownames(lipdata))
W.mat <- nb2mat(W.nb, style="B")
#str(data.combined)

nb.bound <- poly2nb(data.combined) # shared boundaries
#summary(nb.bound)
coords <- coordinates(data.combined)
#plot(data.combined, border = "gray", main="Scottland")
#plot(nb.bound, coords, pch = 19, cex = 0.6, add = TRUE)
breakpoints <- seq(min(lipdata$observed)-1, max(lipdata$observed)+1, length.out=8)
my.palette <- brewer.pal(n = 7, name = "OrRd")

spplot(data.combined, c("observed", "expected"), main="Scottish Lip Cancer",at=breakpoints,col.regions=my.palette, col="grey")

spplot(data.combined, c("observed", "pcaff"), main="Scottish Lip Cancer", at=breakpoints,col.regions=my.palette, col="black")

glmmodel <- glm(observed~., family="poisson", data=data.combined@data)
#summary(glmmodel)
resid.glmmodel <- residuals(glmmodel)

W.nb <- poly2nb(data.combined, row.names = rownames(data.combined@data))
W.list <- nb2listw(W.nb, style="B")
testglm <- moran.mc(x=resid.glmmodel, listw=W.list, nsim=1000)
W <- nb2mat(W.nb, style="B")
formula <- observed ~ expected+pcaff+latitude+longitude
model.spatial1 <- S.CARleroux(formula=formula, data=data.combined@data,family="gaussian", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas1 <- summarise.samples(model.spatial1$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS1 <- betas1$quantiles
rownames(resultsMS1) <- c("Intercept", "Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS1, caption="95% Credible Intervals Model 1")
                            
model.spatial2 <- S.CARleroux(formula=formula, data=data.combined@data,family="poisson", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas2 <- summarise.samples(model.spatial2$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS2 <- betas2$quantiles
rownames(resultsMS2) <- c("Intercept", "Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS2, caption="95% Credible Intervals Model 2")

model.spatial3 <- S.CARlocalised(formula=formula,G=5, data=data.combined@data,family="poisson", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas3 <- summarise.samples(model.spatial3$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS3 <- betas3$quantiles
rownames(resultsMS3) <- c("Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS3, caption="95% Credible Model 3")

getfancy <- matrix(c(model.spatial1$modelfit[1],
model.spatial2$modelfit[1],
model.spatial3$modelfit[1]), nrow = 3, ncol = 1, byrow = TRUE,
               dimnames = list(c("Model 1", "Model 2", "Model 3"),
                               c("DIC")))
#kable(getfancy, caption="Summary of Deviance")
data("GGHB.IG")
data("pollutionhealthdata")
pollutionhealthdata$SMR <- pollutionhealthdata$observed/ pollutionhealthdata$expected
pollutionhealthdata$logSMR <- log(pollutionhealthdata$SMR)
#par(pty="s", cex.axis=1.5, cex.lab=1.5)
#pairs(pollutionhealthdata[ , c(9, 5:7)], pch = 19, cex = 0.5, lower.panel=NULL, panel=panel.smooth,labels = c("ln(SMR)", "PM10", "JSA", "Price (*100,000)"))
SMR.av <- summarise(group_by(pollutionhealthdata,IG), SMR.mean =mean(SMR))
GGHB.IG@data$SMR <- SMR.av$SMR.mean
price.av <- summarise(group_by(pollutionhealthdata,IG), price.mean =mean(price))
GGHB.IG@data$price <- price.av$price.mean
W.nb <- poly2nb(GGHB.IG, row.names = SMR.av$IG)
W.list <- nb2listw(W.nb, style = "B")
W <- nb2mat(W.nb, style = "B")
nb.bound <- poly2nb(GGHB.IG) # shared boundaries
#summary(nb.bound)
kable(summary(pollutionhealthdata[,c(2:8)]), caption="Summary Statistics")
coords <- coordinates(GGHB.IG)
#plot(GGHB.IG, border = "gray", main="Glasgow")
#plot(nb.bound, coords, pch = 19, cex = 0.6, add = TRUE)
par(mfrow=c(1,2)) 
breakpoints <- seq(min(SMR.av$SMR.mean)-0.1, max(SMR.av$SMR.mean)+0.1,length.out = 11)
spplot(GGHB.IG, "SMR", main="SMR", xlab = "", ylab = "", scales = list(draw = FALSE),  at = breakpoints, col.regions = terrain.colors(n = length(breakpoints)-1),par.settings=list(fontsize=list(text=20)))

breakpoints <- seq(min(price.av$price.mean)-0.1, max(price.av$price.mean)+0.1,length.out = 11)
spplot(GGHB.IG, "price", main="Price", xlab = "", ylab = "", scales = list(draw = FALSE),  at = breakpoints, col.regions = terrain.colors(n = length(breakpoints)-1),par.settings=list(fontsize=list(text=20)))

formula <- observed ~ offset(log(expected)) + jsa + price + pm10
model1 <- glm(formula = formula, family = "quasipoisson", data = pollutionhealthdata)
resid.glm <- residuals(model1)
#summary(model1)$coefficients
#summary(model1)$dispersion
year1 <- moran.mc(x = resid.glm[1:271], listw = W.list, nsim = 10000)
#year2 <- moran.mc(x = resid.glm[272:543], listw = W.list, nsim = 10000)
year3 <- moran.mc(x = resid.glm[544:814], listw = W.list, nsim = 10000)
year4 <- moran.mc(x = resid.glm[815:1085], listw = W.list, nsim = 10000)
#year5 <- moran.mc(x = resid.glm[1086:1355], listw = W.list, nsim = 10000)
model1 <- ST.CARanova(formula = formula, family = "poisson", data = pollutionhealthdata, W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
colnames(model1$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
plot(exp(model1$samples$beta[ , -1]))
model2 <- ST.CARar(formula = formula, family = "poisson", data = pollutionhealthdata, W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
colnames(model2$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
plot(exp(model2$samples$beta[ , -1]))
parameter.summary1 <- summarise.samples(exp(model1$samples$beta[ , -1]), quantiles = c(0.5, 0.025, 0.975))
m1 <- round(parameter.summary1$quantiles, 3)
rownames(m1) <- c("JSA", "Price", "PM10")
#kable(m1, caption="95% CI for Model 1")

parameter.summary2 <- summarise.samples(exp(model2$samples$beta[ , -1]), quantiles = c(0.5, 0.025, 0.975))
m2 <- round(parameter.summary2$quantiles, 3)
rownames(m2) <- c("JSA", "Price", "PM10")
#kable(m2, caption="95% CI for Model 2")

getfancy <- rbind(model1$modelfit[1],model2$modelfit[1])
rownames(getfancy) <- c("Model 1", "Model 2")
#kable(getfancy, caption="Summary of Deviances")
setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]
#View(flu.states)
#dim(flu.states)#620x52
flu.states$dayMeans<-rowMeans((flu.states[2:52]),na.rm = TRUE)

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)

#  Get months
flu.states$Month <- months(flu.states$Date)

#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")

boxplot(flu.states$dayMeans ~ flu.states$Month+flu.states$Year, range = 0, xlab = "Year", ylab = "Flu Symptom Searches",col = "darkseagreen", border = "navy", main="Flu searches over 10 years")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)
#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]
#View(flu.states)
#dim(flu.states)#620x52
flu.states$dayMeans<-rowMeans((flu.states[2:52]),na.rm = TRUE)

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)

#  Get months
flu.states$Month <- months(flu.states$Date)

#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
tflu.states <- t(flu.states)
tflu.states <- tflu.states[-1,]
mtflu.states <- melt(tflu.states)
colnames(mtflu.states)<-c("state", "order", "counts")
states <- states[-1]
states <- states[-1]
sumstates <- arrange(mtflu.states, state)

sumstates$counts <- as.numeric(sumstates$counts)
meanstates <- aggregate(. ~ state, sumstates[-2], mean)

state.map <- map("state", plot = FALSE, fill = TRUE)
IDs <- sapply(strsplit(state.map$names, ":"), function(x) x[1])
state.sp <- map2SpatialPolygons(state.map, IDs = IDs, proj4string = CRS("+proj=longlat +ellps=WGS84"))
sat <- meanstates
sat[[1]] <- tolower(sat[[1]])
sat <- sat[-c(52:54),]
sat <- sat[-12,]
sat <- sat[-2,]
#View(sat)
sat$state[sat$state == "district.of.columbia"] <- "district of columbia"
sat$state[sat$state == "new.hampshire"] <- "new hampshire"
sat$state[sat$state == "new.jersey"] <- "new jersey"
sat$state[sat$state == "new.mexico"] <- "new mexico"
sat$state[sat$state == "new.york"] <- "new york"
sat$state[sat$state == "north.carolina"] <- "north carolina"
sat$state[sat$state == "north.dakota"] <- "north dakota"
sat$state[sat$state == "rhode.island"] <- "rhode island"
sat$state[sat$state == "south.carolina"] <- "south carolina"
sat$state[sat$state == "south.dakota"] <- "south dakota"
sat$state[sat$state == "west.virginia"] <- "west virginia"

rownames(sat) <- sat[,1]
id <- match(row.names(sat), row.names(state.sp))
#row.names(sat)[is.na(id)]

sat1 <- sat[!is.na(id), ]
state.spdf <- SpatialPolygonsDataFrame(state.sp, sat1)
breakpoints <- seq(min(sat1$counts)-0.1, max(sat1$counts)+0.1,length.out = 8)
spplot(state.spdf[2], main="Flu Means for 10 years", at=breakpoints,col.regions=my.palette, col="black")
setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)
#  Get months
flu.states$Month <- months(flu.states$Date)
#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
flu.states <- flu.states[,-1]
fsym <- melt(flu.states, id=c("Month", "Year"), na.rm=TRUE)
fsy <- acast(fsym, Year ~ variable, mean)
tfsy <- t(fsy)
states <- as.data.frame(row.names(tfsy))
tfsy <- cbind(states, tfsy)
colnames(tfsy)[1] <- "state"   # name only the first column; the rest are renamed below
rownames(tfsy) <- 1:nrow(tfsy)
tfsy[[1]] <- tolower(tfsy[[1]])
colnames(tfsy) <- c("state", "03","04","05","06","07","08","09","10","11","12","13","14","15" )

tfsy$state[tfsy$state == "district.of.columbia"] <- "district of columbia"
tfsy$state[tfsy$state == "new.hampshire"] <- "new hampshire"
tfsy$state[tfsy$state == "new.jersey"] <- "new jersey"
tfsy$state[tfsy$state == "new.mexico"] <- "new mexico"
tfsy$state[tfsy$state == "new.york"] <- "new york"
tfsy$state[tfsy$state == "north.carolina"] <- "north carolina"
tfsy$state[tfsy$state == "north.dakota"] <- "north dakota"
tfsy$state[tfsy$state == "rhode.island"] <- "rhode island"
tfsy$state[tfsy$state == "south.carolina"] <- "south carolina"
tfsy$state[tfsy$state == "south.dakota"] <- "south dakota"
tfsy$state[tfsy$state == "west.virginia"] <- "west virginia"
tfsy <- tfsy[-12,]
tfsy <- tfsy[-2,]
rownames(tfsy) <- tfsy[,1]
id <- match(row.names(tfsy), row.names(state.sp))
#row.names(tfsy)[is.na(id)]
tfsy1 <- tfsy[!is.na(id), ]
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy)

tfsy03 <- tfsy1[,1:2]
names(tfsy03)[names(tfsy03)=="03"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy03)
spplot(state.spdfY[2], main="Flu Means for 03",at=breakpoints,col.regions=my.palette, col="black")

tfsy08 <- tfsy1[c(1,7)]
names(tfsy08)[names(tfsy08)=="08"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy08)
spplot(state.spdfY[2], main="Flu Means for 08",at=breakpoints,col.regions=my.palette, col="black")

tfsy15 <- tfsy1[c(1,14)]
names(tfsy15)[names(tfsy15)=="15"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy15)
str(slot(state.spdfY, "data"))
spplot(state.spdfY[2], main="Flu Means for 15")



setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)
#  Get months
flu.states$Month <- months(flu.states$Date)
#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
flu.states <- flu.states[,-1]
fsym <- melt(flu.states, id=c("Month", "Year"), na.rm=TRUE)
names(fsym)[names(fsym)=="variable"] <- "state"
names(fsym)[names(fsym)=="value"] <- "counts"
fsym$Month <- match(fsym$Month, month.name)   # month names to 1-12 (as.integer on names gives NA)
fsym$Year <- as.integer(fsym$Year)

par(mfrow=c(1,2))
#create binary and row normalized adjacency matrices
usa.nb <- poly2nb(state.spdf)
summary(usa.nb)
usa.adj.matB = nb2mat(usa.nb, style="B") #binary adjacency matrix
usa.adj.matW = nb2mat(usa.nb, style="W") #row normalized adjacency matrix
usa.listw.B<- nb2listw(usa.nb, style="B")
usa.listw.W<- nb2listw(usa.nb, style="W")