Homework3 - STAT 406

library(knitr)
library(rmdformats)
library(ggplot2)
library(dplyr)
library(spdep)
library(sp)
library(maptools)
library(spdep)
library(sae)

1) woodlice.csv contains a classic data set on the spatial distribution of one species of woodlice (crustaceans that live in damp litter and related to the US pill bug) in a woodland near Oxford, England.

a) Plot the data. Include the plot in your answer. What is your visual impression about spatial clustering in these data? Identify at least one that suggests there is positive spatial clustering and at least one that suggests there is negative spatial clustering.

wood<-read.csv("woodlice.csv")
wood%>%ggplot(aes(x=x,y=y))+
  geom_point(aes(size=count))+theme_bw()+
  xlim(0,6)+
  ylim(0,6)+
  geom_point(x=5.63,y=2.59,color="red",size=6)+
  geom_text(x=5.8,y=2.0,label="Positive",color="red")+
  geom_text(x=1.3,y=0.65,label="Negative",color="Blue")

Of course we can see some patchiness in the study area for example a patch of points with a small values and positive corrolation.

b) Define neighbors as two quadrats separated by less than 1 unit. Do all quadrat locations have the same number of neighbors? If not, what are the smallest and largest number of neighbors for any individual quadrat?

wood.sp<-wood
coordinates(wood.sp) <- c('x','y')
gridded(wood.sp) <- T
sd.nb1 <- dnearneigh(wood.sp, 0, 1)
table(sapply(sd.nb1,length))

## 
##  3  4  6 
##  6 12 19

Here, 6 locations have 3 neighbors, 12 have 4 and 19 have 6.

c) Define spatial weights as 1 if two locations are neighbors (defined as in 1b). Using this weight matrix, estimate Moran’s I and Geary’s c and test the null hypothesis of no spatial association between quadrats. Report your estimates and p-value(s).

sd.w1 <- nb2listw(sd.nb1, style='B')
mor<-moran.mc(wood.sp$count, sd.w1, 9999)
ger<-geary.mc(wood.sp$count, sd.w1, 9999)

	Statistic	P.value
Moran’s I:	0.1074857	0.0914
Geary’s c :	0.0747	0.8353286

No spatial pattern for alpha level of 0.05.

d) In part c, you could have constructed a test using Monte-Carlo simulation or using a normal approximation. Which did you use and why? Since the sample size of 37 is not that big I prefered to use the Monte-Carlo simulation.

e) Calculate row-standardized weights (style = ‘W’) using the neighbor definition in 1b. Using this weight matrix, estimate Moran’s I and Geary’s c and test the null hypothesis of no spatial association between quadrats. Report your estimates and p-value(s).

sd.w1 <- nb2listw(sd.nb1, style='W')
mor<-moran.mc(wood.sp$count, sd.w1, 9999)
ger<-geary.mc(wood.sp$count, sd.w1, 9999)

	Statistic	P.value
Moran’s I:	0.1325634	0.0702
Geary’s c :	0.05715	0.8314899

f) Although there is no explicit distributional assumption behind Moran’s I or Geary’s c, both implicitly assume that a pair of values of 0, 1 (for location and its neighbor) are just as different as the pair of values of 100, 101. This makes a lot of sense when all values have the same variance. It makes less sense when the variance is much larger for values ca 100 than it is for values ca 1. Please ask if that statement doesn’t make sense. One way to control unequal variances is to transform the response. For count data like the woodlice, the standard transformation is the square root. \ Calculate the square root of the count at each location, use 0/1 weights and the transformed counts, and estimate Moran’s I and Geary’s c and test the hypothesis of no spatial association. Report your estimates and p-value(s). Summarize in a sentence (or perhaps two) what these results tell you about spatial clustering of woodlice.\

## tranfprming the count data
wood.sp$src<-sqrt(wood.sp$count)
##moran test
mor<-moran.mc(wood.sp$src, sd.w1,9999)
ger<-geary.mc(wood.sp$src, sd.w1, 9999)

	Statistic	P.value
Moran’s I:	0.1592349	0.0437
Geary’s c :	0.8097279	0.0392

Taking square root transformation helped to control the unequal variances and it results the rejection of null hypothesis and claiming a positive spatial corrolation. g) The biologists studying woodlice decide they are really interested whether woodlice were found (count > 0) or not (count = 0) in each quadrat. Transform the counts to presence / absence data and use join count analysis to test the hypotheses of no spatial association. Summarize in a sentence or two what these results tell you about spatial clustering of woodlice.\

wood.sp$status <- ifelse(wood.sp$count>0,"Presence","Absence")
wood.sp$status <- factor(wood.sp$status)

## join count monte carlo
Jc<-joincount.mc(wood.sp$status, sd.w1, 9999)

	statistic	p.value
Absence:	14.2	0.02165
Presence:	31.2	0.1874

Simply I judge just based on p-value, which is not significant for presence and indicative of CSR while, for absence we can reject the null hypothesis and claim the existance of spatial clustering.\ h) Continuing with the presence/absence data, if you didn’t realize that the values were 0 or 1, you might use Moran’s I or Geary’s c to evaluate spatial clustering / correlation of the presence/absence data. Is there any relationship between Geary’s c for presence/absence data and the join count statistics? Briefly explain why or why not.

2. This problem explores spatial smoothing and demonstrates techniques for manipulating data from a GIS system. The files in London.zip contain data on the number of suicides in 32 London boroughs summed over four years and related information. a) Read the shapefile into a SpatialPolygons object. Plot the London boroughs. Combine the information about each borough with the polygons. Plot the logSMR in the boroughs. The two plots are your answer.

London.poly <- readShapePoly("london/london.shp")
London.data<-read.csv("london/london.csv")
plot(London.poly)

## combining the information and polygon
both <- SpatialPolygonsDataFrame(London.poly, data.frame(response=London.data$logSMR, row.names=row.names(London.poly)))
##plot both
spplot(both, zcol="response",col=NA)

b) Define the neighbors of a polygon as those polygons that share a boundary. What is the minimum number of neighbors? The maximum number of neighbors?

plot(London.poly)
nbs<-poly2nb(London.poly)
plot(nbs , coordinates(London.poly),add=TRUE)

num.nbs<-unlist(lapply(nbs,function(x){length(x)}))

Minimum is 3 and maximum number of neighbors is 7.

c) Estimate the spatially smoothed log SMR for each borough. Plot a map showing the smoothed values. Your answer is the plot.

SMRsmooth <- eblupFH(logSMR~1, varSMR, data=London.data)
both$eblup <- SMRsmooth$eblup
spplot(both, zcol="eblup",col=NA)

d) Plot the smoothed values for each borough against their original logSMR. Which boroughs have smoothed values that are larger than the original value? Which have smaller? Describing some characteristic of each group is sufficient. You don’t need to give me lists of observation numbers. Briefly describe why those that increase and those that decrease “makes sense”.
Note: It may help to draw the line of equality on the plot, i.e. by abline(0, 1), which adds a line with intercept 0 and slope 1 to a plot.

both$dif<- both$response-both$eblup
spplot(both, zcol="dif",col=NA)

both@data%>%ggplot()+
  geom_point(aes(x=eblup,y=response),size=3)+
  xlim(-0.5,0.7)+
  ylim(-0.5,0.7)+
  geom_abline(intercept = 0,slope = 1)

It basically, increases the larger values and decreases the smaller values