This report is on using Spatio-Temporal Conditionally Autoregressive (CAR) models for areal data. I look at two data sets provided in R packages that implement the CAR techniques. I then attempted to use the methods on the Google Trends flu data, but had little success with the flu data.

What’s in this report?

  1. Introduction
  2. Spatial CAR Models for Areal Data
  3. Spatial-Temporal CAR Models for Areal Data
  4. Google Flu
  5. Conclusions

1. Introduction

Researchers often encounter data that exist in non-overlapping spatial areal units. Areal units are finite in number, usually exist on a type of grid, and are typically sums or averages, e.g. census tracts, disease prevalence in counties, etc. Areal data typically display spatial autocorrelation; that is, near things are more similar than distant things. To be a little more rigorous, spatial autocorrelation measures how similar observations are to their nearby observations compared with observations farther away. As statisticians, we are interested in modeling and understanding this spatial autocorrelation.

The go-to tool in a statistician's toolbox is the regression model. One could turn to this modeling technique to account for some of the spatial autocorrelation in the data, but it is not a panacea. Even after the covariate effects have been accounted for, the model can still display spatial autocorrelation in the residuals, which violates a key regression modeling assumption. Other issues that are difficult to account for in the typical regression model are neighbor effects and grouping effects. A neighbor effect is simply the influence one "neighbor" might have on another, and a grouping effect is the tendency of similar people to live near people similar to them.

Enter the Conditionally Autoregressive (CAR) model (Besag et al., 1991). The CAR model supplements the linear predictor with a set of spatially autocorrelated random effects, as part of a Bayesian hierarchical model. These random effects are typically represented with a conditional autoregressive prior, which induces spatial autocorrelation through the adjacency structure of the areal units. However, even with the spatial effects handled by the CAR model, most data today also exist in the domain of time. So now our data of K areal units exist in T consecutive time periods, producing a rectangular \(K \times T\) array of spatio-temporal observations. It would be ideal to model not only the spatial effects in the data but also the temporal effects that exist in our areal units.

2. CAR Models for Areal Data

Before we dive into space and time, I will first explore the CARBayes package and its framework, a package created by Duncan Lee of the University of Glasgow. The CARBayes package implements a class of univariate and multivariate spatial generalized linear mixed models for areal unit data, with inference in a Bayesian setting using Markov chain Monte Carlo (MCMC) simulation (Lee, 1). Dr. Lee provides several worked examples using this package on data involving Scottish lip cancer and property prices in Greater Glasgow. In the vignette, the Scottish lip cancer data are only used to showcase how to combine a data frame and a shapefile, and none of the CARBayes model-fitting functions are applied to them. I will therefore perform an analysis of the Scottish lip cancer data in the style of the vignette's other worked examples.

2.1 Framework for CAR Model

  • A set of \(k = 1, \ldots, K\) non-overlapping areal units \(S = \{S_1, \ldots, S_K\}\)
  • Linked to a corresponding set of responses \(Y = (Y_1, \ldots, Y_K)\) and a vector of known offsets \(O = (O_1, \ldots, O_K)\).
  • The spatial variation in the response is modeled by a matrix of covariates \(X = (x_1, \ldots, x_K)\) and a spatial structure component \(\psi = (\psi_1, \ldots, \psi_K)\), the latter of which is included to model any spatial autocorrelation that remains in the data after the covariate effects have been accounted for.
  • The vector of covariates for areal unit \(S_k\) is denoted by \(x_k = (1, x_{k1}, \ldots, x_{kp})\)

  • Use the hierarchical Bayesian setting: \[Y_{k}|\mu_{k} \sim f(y_{k}|\mu_{k}, \nu^2), \quad k = 1, \ldots, K\] \[g(\mu_{k}) = x^T_{k} \beta + O_{k} + \psi_{k}\] \[\beta \sim N(\mu_\beta, \Sigma_\beta)\]
  • Within \(\psi\), spatial autocorrelation is controlled by a weight or adjacency matrix \(W\), made explicit in the prior shown below.
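
To make the last bullet concrete, the Leroux et al. (2000) CAR prior that CARBayes uses for \(\psi\) (my paraphrase of the form described in the vignette) specifies each random effect conditional on all the others as

\[\psi_{k}|\psi_{-k} \sim N\left(\frac{\rho \sum_{j=1}^{K} w_{kj}\psi_{j}}{\rho \sum_{j=1}^{K} w_{kj} + 1 - \rho}, \; \frac{\tau^2}{\rho \sum_{j=1}^{K} w_{kj} + 1 - \rho}\right),\]

where \(w_{kj} = 1\) if areal units \(S_k\) and \(S_j\) share a border and 0 otherwise, \(\tau^2\) is a variance parameter, and \(\rho \in [0, 1]\) controls the strength of spatial autocorrelation (\(\rho = 0\) gives independent random effects, \(\rho = 1\) the intrinsic CAR prior of Besag et al. (1991)).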

2.2 Scottish Lip Cancer Data Summary Statistics and Plots

The Scottish lip cancer data are purely spatial data summarizing lip cancer cases and possible covariate risk factors for the period 1975 to 1986, for 56 districts in Scotland. Observed is the response variable to be modeled: the total number of lip cancer cases in each areal unit. Expected, pcaff, latitude and longitude are the covariates. Pcaff is the percentage of the workforce in each district employed in agriculture, fishing and forestry; the others are self-explanatory. Here I attempt to identify whether the covariates in the data set affect lip cancer and to quantify their effects.

Summary Statistics
Variable     Min.     1st Qu.   Median    Mean     3rd Qu.   Max.
observed      0.000    4.750     8.000     9.571   11.000    39.000
expected      1.100    4.050     6.300     9.575   10.125    88.700
pcaff         0.000    1.000     7.000     8.661   11.500    24.000
latitude     54.94    55.78     56.04     56.40    57.02     60.24
longitude     1.430    3.288     4.090     4.012    4.730     6.800

Below is a summary and plot of the W matrix, which is the backbone of the CAR model; a sketch of its construction follows this paragraph. The matrix controls the spatial autocorrelation structure of the random effects. I have reported the summary statistics for the areal data of Scotland as well as provided a map of Scotland detailing the graph structure of the W matrix. Also reported below are maps of the counts and the covariates provided in the data. I notice that the observed counts in the northeast portion of Scotland differ substantially from the expected counts. Perhaps this is due to the pcaff variable; that is, more people on this part of the coast are employed in the fishing industry.
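
A condensed sketch of the W construction, mirroring the appendix code (spdep's poly2nb treats districts that share a border as neighbours):

# build the binary adjacency matrix W for the 56 Scottish districts
library(spdep); library(CARBayes); library(CARBayesdata)
data(lipdata); data(lipdbf); data(lipshp)
lipdbf$dbf <- lipdbf$dbf[ , c(2, 1)]   # reorder the dbf columns so polygon IDs match
data.combined <- combine.data.shapefile(data = lipdata, shp = lipshp, dbf = lipdbf)
W.nb <- poly2nb(data.combined, row.names = rownames(lipdata))  # neighbours share a border
W <- nb2mat(W.nb, style = "B")         # binary: w_kj = 1 if districts k and j are adjacent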

As part of my exploratory data analysis I computed Moran's I, which is a way to establish that there is some spatial dependence in the data before proceeding. Moran's I is a test statistic for investigating the presence of spatial correlation in data; its null hypothesis is that there is no spatial correlation. Using the moran.mc function from the spdep package on the residuals of a Poisson GLM (see the sketch below), I found a Moran's I of 0.19331 with a corresponding p-value of 0.00699, which suggests that the residuals contain substantial positive spatial autocorrelation. However, Moran's I gives no indication of where this spatial correlation is located or the magnitude of its effects on the response variable in the Scottish lip cancer data set. I turn to other techniques to find this correlation and the effects.
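
A minimal sketch of the test, mirroring the appendix code, where the statistic is computed on the residuals of a covariate-only Poisson GLM:

# Moran's I permutation test on Poisson GLM residuals (from the appendix code)
glmmodel <- glm(observed ~ ., family = "poisson", data = data.combined@data)
resid.glmmodel <- residuals(glmmodel)
W.list <- nb2listw(W.nb, style = "B")   # moran.mc needs the neighbourhood in listw form
moran.mc(x = resid.glmmodel, listw = W.list, nsim = 1000)  # permutation-based p-value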

2.3 Spatial CAR model implementation

I will now use the CARBayes package for modelling areal unit data with conditionally autoregressive (CAR) priors. I chose three models to try to understand the covariate effects on the response. First is the S.CARleroux model, which uses a CAR prior that can capture varying strengths of spatial autocorrelation using a single set of random effects. Second is the same model, but with a Poisson likelihood rather than a Gaussian one. Third is the S.CARlocalised model, which allows for localized spatial autocorrelation by modelling the elements of W; it augments the set of spatially smooth random effects with a piecewise-constant intercept (cluster) component, allowing large jumps in the mean surface between adjacent areal units in different clusters. Each model uses the same \(W\) matrix, and inference is the same for each model: 10,000 retained samples, obtained after a burn-in of 20,000 samples and thinning by 10 to reduce their autocorrelation. The calls are sketched below.
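
A condensed sketch of the three fits, mirroring the appendix code; note that (n.sample - burnin)/thin = (120000 - 20000)/10 gives the 10,000 retained samples:

# fit the three spatial models (as in the appendix code)
formula <- observed ~ expected + pcaff + latitude + longitude
model.spatial1 <- S.CARleroux(formula = formula, data = data.combined@data, family = "gaussian",
                              W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)
model.spatial2 <- S.CARleroux(formula = formula, data = data.combined@data, family = "poisson",
                              W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)
model.spatial3 <- S.CARlocalised(formula = formula, G = 5, data = data.combined@data, family = "poisson",
                                 W = W, burnin = 20000, n.sample = 120000, thin = 10, verbose = FALSE)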

I use the DIC as a measure of model fit; it is a hierarchical-modeling generalization of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), and is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. The third model has the lowest DIC, so it is the preferred model. The 95% credible intervals for expected, latitude and longitude do not contain 0, which I take to mean they have an impact on the observed amount of cancer. Pcaff, the percentage of people employed in agriculture, fishing and forestry, is not significant, as its interval contains 0. The model seems to be saying that latitude and longitude have more of an impact on getting lip cancer than employment in those three industries. The DIC and credible-interval extraction are sketched below.
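
A sketch of the comparison, mirroring the appendix code, where modelfit[1] holds the DIC in a fitted CARBayes object:

# compare DIC across models and summarise 95% credible intervals
c(Model1 = model.spatial1$modelfit[1], Model2 = model.spatial2$modelfit[1],
  Model3 = model.spatial3$modelfit[1])   # lower DIC indicates a better fit
betas3 <- summarise.samples(model.spatial3$samples$beta, quantiles = c(0.5, 0.025, 0.975))
betas3$quantiles   # posterior median and 95% interval; a covariate matters if 0 lies outside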

3. Spatial-Temporal CAR Models

3.1 Spatial-Temporal CAR Framework

Moving forward, I will now introduce time into the CAR model…well, not me literally. The data now consist of \(K\) areal units and \(N\) consecutive time periods, giving a \(K \times N\) array of spatio-temporal observations. The same hierarchical Bayesian setting is used as in the CAR model section, except that observations are now also indexed by time.

\[Y_{kt}|\mu_{kt} \sim f(y_{kt}|\mu_{kt}, \nu^2), \quad k = 1, \ldots, K; \; t = 1, \ldots, N\] \[g(\mu_{kt}) = x^T_{kt} \beta + O_{kt} + \psi_{kt}\] \[\beta \sim N(\mu_\beta, \Sigma_\beta)\]

Again, the spatial autocorrelation is controlled within \(\psi\) through a weight or adjacency matrix \(W\).

3.2 Glasgow air pollution Data Summary Statistics and Plots

This data set contains spatio-temporal data on respiratory hospitalizations, air pollution concentrations and socio-economic deprivation covariates for the 271 Intermediate Geographies (IG) that make up the Greater Glasgow and Clyde health board in Scotland. The goal is to understand the relationship between air pollution concentrations and the risk of respiratory disease. It is a worked example contained in the CARBayesST package.

The variables are:

  • IG, an identifier (similar to a zip code/census tract) used by the health board
  • year, 2007 to 2011
  • observed and expected, the observed and expected numbers of respiratory hospitalizations
  • pm10, the average particulate matter (less than 10 microns) concentration in each IG and year
  • jsa, the percentage of working-age people in receipt of Job Seekers Allowance, a benefit paid to unemployed people looking for work, in each IG and year
  • price, the average property price (in hundreds of thousands of pounds) in each IG and year

SMR is the standardized morbidity ratio, which is just the ratio of observed to expected hospitalizations from respiratory disease. The response variable is the observed number of hospitalizations. The author uses a pairs plot to look at the relationships between the covariates; from the positive relationship between SMR and jsa and the negative one between SMR and price in the pairs plots (not reported), he concludes that an increasing level of poverty leads to an increased risk of respiratory hospitalization. The SMR computation and the pairs-plot call are sketched below.
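
A condensed sketch of that computation, mirroring the appendix code:

# compute the SMR and draw the pairs plot (from the appendix code)
library(CARBayesST)
data("pollutionhealthdata")
pollutionhealthdata$SMR <- pollutionhealthdata$observed / pollutionhealthdata$expected
pollutionhealthdata$logSMR <- log(pollutionhealthdata$SMR)   # log scale for plotting
pairs(pollutionhealthdata[ , c(9, 5:7)], pch = 19, cex = 0.5, lower.panel = NULL,
      panel = panel.smooth, labels = c("ln(SMR)", "PM10", "JSA", "Price (*100,000)"))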

Summary Statistics
Variable    Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
year        2007      2008      2009      2009      2010      2011
observed    10.0      54.0      74.0      79.2      99.0      213.0
expected    44.47     72.71     89.00     92.35     109.64    180.54
pm10        7.838     10.953    12.117    12.326    13.466    19.612
jsa         0.300     2.050     3.700     4.191     5.925     13.750
price       0.228     0.880     1.150     1.277     1.541     4.300
SMR         0.2091    0.6020    0.8458    0.8650    1.0820    2.1871

Below are two plots, of SMR and price, for Glasgow. The first plot is a visualization of the SMR over all five years for Glasgow's IGs. The green areas are where the SMR risk is low, and the orange and silver areas indicate elevated risk. The second plot is a visualization of the average property price in each IG over the same time period. To my untrained eye, the bulk of the elevated SMR is located in the city center, where property values are low. The low-risk SMR areas ring the city, where there are many more IGs with high property values.

As part of my exploratory data analysis I again computed Moran's I to establish that there is some spatial dependence in the data before proceeding. I computed Moran's I for 2007, 2009 and 2010; for some reason I couldn't get 2008 and 2011 to work. Computing the statistic on the GLM residuals for each year (see the sketch below), I found values of 0.10358, 0.013545 and 0.10403 respectively, with corresponding p-values of 0.0034, 0.31227 and 0.0028. This suggests that the residuals contain substantial positive spatial autocorrelation in 2007 and 2010, but interestingly not in 2009. However, Moran's I gives no indication of where this spatial correlation is located in the data set. I turn to other techniques to find this correlation.
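
A condensed sketch of the per-year tests, mirroring the appendix code; each year's 271 IGs occupy a contiguous block of rows, so the residuals are subset by position:

# per-year Moran's I on quasi-Poisson GLM residuals (from the appendix code)
data("GGHB.IG")
W.nb <- poly2nb(GGHB.IG)                       # Glasgow IG neighbourhood structure
W.list <- nb2listw(W.nb, style = "B")
formula <- observed ~ offset(log(expected)) + jsa + price + pm10
resid.glm <- residuals(glm(formula, family = "quasipoisson", data = pollutionhealthdata))
moran.mc(x = resid.glm[1:271],    listw = W.list, nsim = 10000)  # 2007
moran.mc(x = resid.glm[544:814],  listw = W.list, nsim = 10000)  # 2009
moran.mc(x = resid.glm[815:1085], listw = W.list, nsim = 10000)  # 2010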

3.3 Spatio-Temporal CAR model implementation

I look at two models, ST.CARanova and ST.CARar, to try to understand the data set; their calls are sketched after the two model descriptions below. Each model uses the same \(W\) matrix, and inference is the same for each: 20,000 retained samples, obtained after a burn-in of 20,000 samples and thinning by 10 to reduce their autocorrelation. Below are trace plots and posterior distributions for the covariates, as well as 95% credible intervals. The DIC is again used to establish which model is a better fit.

ST.CARanova

This model was discussed in the Knorr-Held paper that I attempted to present in class. It decomposes the spatio-temporal variation in the data into an overall spatial effect, an overall temporal trend effect and a space-time interaction. Below are trace plots and posterior distributions for the model.

ST.CARar

This model was proposed by Rushworth and is the one used on this data set in the vignette. It represents the spatio-temporal structure with a multivariate first-order autoregressive process with a spatially correlated precision matrix. Below are trace plots and posterior distributions for the model.
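
A condensed sketch of the two fits, mirroring the appendix code:

# fit the two spatio-temporal models (as in the appendix code)
W <- nb2mat(W.nb, style = "B")   # binary adjacency matrix for the Glasgow IGs
model1 <- ST.CARanova(formula = formula, family = "poisson", data = pollutionhealthdata,
                      W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
model2 <- ST.CARar(formula = formula, family = "poisson", data = pollutionhealthdata,
                   W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)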

Interestingly, the DIC points to the Knorr-Held (ST.CARanova) model as the superior choice, whereas the vignette uses the ST.CARar model to investigate this data set. However, I am unsure whether a DIC difference of about 10 is an indicator of a superior model. The trace plot of Model 2 for JSA looks like it did not mix that well; perhaps this influences the DIC. For the ST.CARanova model, the estimated relative risk for a one-unit increase in PM10 is 1.024 (95% credible interval 1.007, 1.039), suggesting that such an increase corresponds to 2.4% additional hospital admissions. The corresponding relative risk for a one-percent increase in JSA is 1.058 (1.045, 1.070), while for a one-hundred-thousand-pound increase in property price (the units of the property price data are hundreds of thousands of pounds) the risk is 0.828 (0.789, 0.868). I believe these credible intervals back up the intuition from the EDA that respiratory illness increases as property prices decrease, and vice versa. The computation of these relative risks is sketched below.
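
A sketch of how the relative risks above are obtained, mirroring the appendix code: the regression samples are exponentiated before summarising.

# posterior relative risks as exponentiated regression samples
colnames(model1$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
parameter.summary1 <- summarise.samples(exp(model1$samples$beta[ , -1]),
                                        quantiles = c(0.5, 0.025, 0.975))
round(parameter.summary1$quantiles, 3)  # median relative risk with 95% credible interval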

Conclusion

CAR models for exploring areal data over time are a powerful tool for understanding spatial dependence, especially when coupled with MCMC techniques. I would have liked to run these methods on the Google data to gain more understanding of flu trends. However, having only a basic understanding of CAR models, I think the selection of the Google data set was a poor choice. Looking back on it now, I should have just acquired the CDC flu data and run an analysis, with a look at the Google data as a bonus if time had permitted.

Appendix

Resources

  1. CARBayesST vignette
  2. CARBayes vignette
  3. 597 Lecture notes
  4. Review of spatio-temporal models for disease mapping
  5. Knorr-Held paper

Code

# loading packages
library(sp); library(plyr); library(dplyr); library(tidyr); library(reshape2)
library(surveillance); library(spdep); library(maps); library(maptools); library(usmap)
library(CARBayes); library(CARBayesdata); library(CARBayesST); library(shapefiles)
library(knitr); library(RColorBrewer); library(coda); library(spam); library(cdcfluview)

# setting global chunk options
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, comment = "", cache = TRUE)

data(lipdata);data(lipdbf);data(lipshp)
kable(summary(lipdata), caption="Summary Statistics")
lipdbf$dbf <- lipdbf$dbf[ ,c(2,1)]
data.combined <- combine.data.shapefile(data=lipdata, shp=lipshp, dbf=lipdbf)
W.nb <-poly2nb(data.combined, row.names = rownames(lipdata))
W.mat <- nb2mat(W.nb, style="B")
#str(data.combined)

nb.bound <- poly2nb(data.combined) # shared boundaries
#summary(nb.bound)
coords <- coordinates(data.combined)
#plot(data.combined, border = "gray", main="Scottland")
#plot(nb.bound, coords, pch = 19, cex = 0.6, add = TRUE)
breakpoints <- seq(min(lipdata$observed)-1, max(lipdata$observed)+1, length.out=8)
my.palette <- brewer.pal(n = 7, name = "OrRd")

spplot(data.combined, c("observed", "expected"), main="Scottish Lip Cancer",at=breakpoints,col.regions=my.palette, col="grey")

spplot(data.combined, c("observed", "pcaff"), main="Scottish Lip Cancer", at=breakpoints,col.regions=my.palette, col="black")

glmmodel <- glm(observed~., family="poisson", data=data.combined@data)
#summary(glmmodel)
resid.glmmodel <- residuals(glmmodel)

W.nb <- poly2nb(data.combined, row.names = rownames(data.combined@data))
W.list <- nb2listw(W.nb, style="B")
testglm <- moran.mc(x=resid.glmmodel, listw=W.list, nsim=1000)
W <- nb2mat(W.nb, style="B")
formula <- observed ~ expected+pcaff+latitude+longitude
model.spatial1 <- S.CARleroux(formula=formula, data=data.combined@data,family="gaussian", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas1 <- summarise.samples(model.spatial1$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS1 <- betas1$quantiles
rownames(resultsMS1) <- c("Intercept", "Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS1, caption="95% Credible Intervals Model 1")
                            
model.spatial2 <- S.CARleroux(formula=formula, data=data.combined@data,family="poisson", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas2 <- summarise.samples(model.spatial2$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS2 <- betas2$quantiles
rownames(resultsMS2) <- c("Intercept", "Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS2, caption="95% Credible Intervals Model 2")

model.spatial3 <- S.CARlocalised(formula=formula,G=5, data=data.combined@data,family="poisson", W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
betas3 <- summarise.samples(model.spatial3$samples$beta, quantiles=c(0.5, 0.025, 0.975))
resultsMS3 <- betas3$quantiles
rownames(resultsMS3) <- c("Expected", "Pcaff", "Latitude", "Longitude")
#kable(resultsMS3, caption="95% Credible Model 3")

getfancy <- matrix(c(model.spatial1$modelfit[1],
model.spatial2$modelfit[1],
model.spatial3$modelfit[1]), nrow = 3, ncol = 1, byrow = TRUE,
               dimnames = list(c("Model 1", "Model 2", "Model 3"),
                               c("DIC")))
#kable(getfancy, caption="Summary of Deviance")
data("GGHB.IG")
data("pollutionhealthdata")
pollutionhealthdata$SMR <- pollutionhealthdata$observed/ pollutionhealthdata$expected
pollutionhealthdata$logSMR <- log(pollutionhealthdata$SMR)
#par(pty="s", cex.axis=1.5, cex.lab=1.5)
#pairs(pollutionhealthdata[ , c(9, 5:7)], pch = 19, cex = 0.5, lower.panel=NULL, panel=panel.smooth,labels = c("ln(SMR)", "PM10", "JSA", "Price (*100,000)"))
SMR.av <- summarise(group_by(pollutionhealthdata,IG), SMR.mean =mean(SMR))
GGHB.IG@data$SMR <- SMR.av$SMR.mean
price.av <- summarise(group_by(pollutionhealthdata,IG), price.mean =mean(price))
GGHB.IG@data$price <- price.av$price.mean
W.nb <- poly2nb(GGHB.IG, row.names = SMR.av$IG)
W.list <- nb2listw(W.nb, style = "B")
W <- nb2mat(W.nb, style = "B")
nb.bound <- poly2nb(GGHB.IG) # shared boundaries
#summary(nb.bound)
kable(summary(pollutionhealthdata[,c(2:8)]), caption="Summary Statistics")
coords <- coordinates(GGHB.IG)
#plot(GGHB.IG, border = "gray", main="Glasgow")
#plot(nb.bound, coords, pch = 19, cex = 0.6, add = TRUE)
par(mfrow=c(1,2)) 
breakpoints <- seq(min(SMR.av$SMR.mean)-0.1, max(SMR.av$SMR.mean)+0.1,length.out = 11)
spplot(GGHB.IG, "SMR", main="SMR", xlab = "", ylab = "", scales = list(draw = FALSE),  at = breakpoints, col.regions = terrain.colors(n = length(breakpoints)-1),par.settings=list(fontsize=list(text=20)))

breakpoints <- seq(min(price.av$price.mean)-0.1, max(price.av$price.mean)+0.1,length.out = 11)
spplot(GGHB.IG, "price", main="Price", xlab = "", ylab = "", scales = list(draw = FALSE),  at = breakpoints, col.regions = terrain.colors(n = length(breakpoints)-1),par.settings=list(fontsize=list(text=20)))

formula <- observed ~ offset(log(expected)) + jsa + price + pm10
model1 <- glm(formula = formula, family = "quasipoisson", data = pollutionhealthdata)
resid.glm <- residuals(model1)
#summary(model1)$coefficients
#summary(model1)$dispersion
year1 <- moran.mc(x = resid.glm[1:271], listw = W.list, nsim = 10000)
#year2 <- moran.mc(x = resid.glm[272:543], listw = W.list, nsim = 10000)
year3 <- moran.mc(x = resid.glm[544:814], listw = W.list, nsim = 10000)
year4 <- moran.mc(x = resid.glm[815:1085], listw = W.list, nsim = 10000)
#year5 <- moran.mc(x = resid.glm[1086:1355], listw = W.list, nsim = 10000)
model1 <- ST.CARanova(formula = formula, family = "poisson", data = pollutionhealthdata, W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
colnames(model1$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
plot(exp(model1$samples$beta[ , -1]))
model2 <- ST.CARar(formula = formula, family = "poisson", data = pollutionhealthdata, W = W, burnin = 20000, n.sample = 220000, thin = 10, verbose = FALSE)
colnames(model2$samples$beta) <- c("Intercept", "JSA", "Price", "PM10")
plot(exp(model2$samples$beta[ , -1]))
parameter.summary1 <- summarise.samples(exp(model1$samples$beta[ , -1]), quantiles = c(0.5, 0.025, 0.975))
m1 <- round(parameter.summary1$quantiles, 3)
rownames(m1) <- c("JSA", "Price", "PM10")
#kable(m1, caption="95% CI for Model 1")

parameter.summary2 <- summarise.samples(exp(model2$samples$beta[ , -1]), quantiles = c(0.5, 0.025, 0.975))
m2 <- round(parameter.summary2$quantiles, 3)
rownames(m2) <- c("JSA", "Price", "PM10")
#kable(m2, caption="95% CI for Model 2")

getfancy <- rbind(model1$modelfit[1],model2$modelfit[1])
rownames(getfancy) <- c("Model 1", "Model 2")
#kable(getfancy, caption="Summary of Deviances")
setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]
#View(flu.states)
#dim(flu.states)#620x52
flu.states$dayMeans<-rowMeans((flu.states[2:52]),na.rm = TRUE)

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)

#  Get months
flu.states$Month <- months(flu.states$Date)

#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")

boxplot(flu.states$dayMeans ~ flu.states$Month+flu.states$Year, range = 0, xlab = "Year", ylab = "Flu Symptom Searches",col = "darkseagreen", border = "navy", main="Flu searches over 10 years")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)
#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]
#View(flu.states)
#dim(flu.states)#620x52
flu.states$dayMeans<-rowMeans((flu.states[2:52]),na.rm = TRUE)

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)

#  Get months
flu.states$Month <- months(flu.states$Date)

#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
tflu.states <- t(flu.states)
tflu.states <- tflu.states[-1,]
mtflu.states <- melt(tflu.states)
colnames(mtflu.states)<-c("state", "order", "counts")
states <- states[-1]
states <- states[-1]
sumstates <- arrange(mtflu.states, state)

sumstates$counts <- as.numeric(sumstates$counts)
meanstates <- aggregate(. ~ state, sumstates[-2], mean)

state.map <- map("state", plot = FALSE, fill = TRUE)
IDs <- sapply(strsplit(state.map$names, ":"), function(x) x[1])
state.sp <- map2SpatialPolygons(state.map, IDs = IDs, proj4string = CRS("+proj=longlat +ellps=WGS84"))
sat <- meanstates
sat[[1]] <- tolower(sat[[1]])
sat <- sat[-c(52:54),]
sat <- sat[-12,]
sat <- sat[-2,]
#View(sat)
sat$state[sat$state == "district.of.columbia"] <- "district of columbia"
sat$state[sat$state == "new.hampshire"] <- "new hampshire"
sat$state[sat$state == "new.jersey"] <- "new jersey"
sat$state[sat$state == "new.mexico"] <- "new mexico"
sat$state[sat$state == "new.york"] <- "new york"
sat$state[sat$state == "north.carolina"] <- "north carolina"
sat$state[sat$state == "north.dakota"] <- "north dakota"
sat$state[sat$state == "rhode.island"] <- "rhode island"
sat$state[sat$state == "south.carolina"] <- "south carolina"
sat$state[sat$state == "south.dakota"] <- "south dakota"
sat$state[sat$state == "west.virginia"] <- "west virginia"

rownames(sat) <- sat[,1]
id <- match(row.names(sat), row.names(state.sp))
#row.names(sat)[is.na(id)]

sat1 <- sat[!is.na(id), ]
state.spdf <- SpatialPolygonsDataFrame(state.sp, sat1)
breakpoints <- seq(min(sat1$counts)-0.1, max(sat1$counts)+0.1,length.out = 8)
spplot(state.spdf[2], main="Flu Means for 10 years", at=breakpoints,col.regions=my.palette, col="black")
setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)
#  Get months
flu.states$Month <- months(flu.states$Date)
#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
flu.states <- flu.states[,-1]
fsym <- melt(flu.states, id=c("Month", "Year"), na.rm=TRUE)
fsy <- acast(fsym, Year ~ variable, mean)
tfsy <- t(fsy)
states <- as.data.frame(row.names(tfsy))
tfsy <- cbind(states, tfsy)
colnames(tfsy)[1] <- "state"   # name only the first column; the rest are renamed below
rownames(tfsy) <- 1:nrow(tfsy)
tfsy[[1]] <- tolower(tfsy[[1]])
colnames(tfsy) <- c("state", "03","04","05","06","07","08","09","10","11","12","13","14","15" )

tfsy$state[tfsy$state == "district.of.columbia"] <- "district of columbia"
tfsy$state[tfsy$state == "new.hampshire"] <- "new hampshire"
tfsy$state[tfsy$state == "new.jersey"] <- "new jersey"
tfsy$state[tfsy$state == "new.mexico"] <- "new mexico"
tfsy$state[tfsy$state == "new.york"] <- "new york"
tfsy$state[tfsy$state == "north.carolina"] <- "north carolina"
tfsy$state[tfsy$state == "north.dakota"] <- "north dakota"
tfsy$state[tfsy$state == "rhode.island"] <- "rhode island"
tfsy$state[tfsy$state == "south.carolina"] <- "south carolina"
tfsy$state[tfsy$state == "south.dakota"] <- "south dakota"
tfsy$state[tfsy$state == "west.virginia"] <- "west virginia"
tfsy <- tfsy[-12,]
tfsy <- tfsy[-2,]
rownames(tfsy) <- tfsy[,1]
id <- match(row.names(tfsy), row.names(state.sp))
#row.names(tfsy)[is.na(id)]
tfsy1 <- tfsy[!is.na(id), ]
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy)

tfsy03 <- tfsy1[,1:2]
names(tfsy03)[names(tfsy03)=="03"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy03)
spplot(state.spdfY[2], main="Flu Means for 03",at=breakpoints,col.regions=my.palette, col="black")

tfsy08 <- tfsy1[c(1,7)]
names(tfsy08)[names(tfsy08)=="08"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy08)
spplot(state.spdfY[2], main="Flu Means for 08",at=breakpoints,col.regions=my.palette, col="black")

tfsy15 <- tfsy1[c(1,14)]
names(tfsy15)[names(tfsy15)=="15"] <- "counts"
state.spdfY <- SpatialPolygonsDataFrame(state.sp, tfsy15)
str(slot(state.spdfY, "data"))
spplot(state.spdfY[2], main="Flu Means for 15")



setwd("/Users/benStraub/Desktop/597_spatial_statistics/Project")

flu <- read.csv(url("https://www.google.org/flutrends/about/data/flu/us/data.txt"), skip=11)

#flu$Date <- as.Date(flu$Date)
states <- c(names(flu)[1:53])
flu.states <- flu[states]
flu.states <- flu.states[,-2]

#  Convert to date if not already
flu.states$Date <- as.Date(flu.states$Date)
#  Get months
flu.states$Month <- months(flu.states$Date)
#  Get years
flu.states$Year <- format(flu.states$Date,format="%y")
flu.states <- flu.states[,-1]
fsym <- melt(flu.states, id=c("Month", "Year"), na.rm=TRUE)
names(fsym)[names(fsym)=="variable"] <- "state"
names(fsym)[names(fsym)=="value"] <- "counts"
fsym$Month <- match(fsym$Month, month.name)   # month names to 1-12 (as.integer on names gives NA)
fsym$Year <- as.integer(fsym$Year)

par(mfrow=c(1,2))
#create binary and row normalized adjacency matrices
usa.nb <- poly2nb(state.spdf)
summary(usa.nb)
usa.adj.matB = nb2mat(usa.nb, style="B") #binary adjacency matrix
usa.adj.matW = nb2mat(usa.nb, style="W") #row normalized adjacency matrix
usa.listw.B<- nb2listw(usa.nb, style="B")
usa.listw.W<- nb2listw(usa.nb, style="W")