library(knitr)
library(rmdformats)
library(ggplot2)
library(dplyr)
library(sp)
library(maptools)
library(spatstat)
1) The locations in tornado.csv are the touchdown locations for all recorded strong tornadoes in Iowa between 1950 and 2014. There are 160 tornadoes classified F3, F4, or F5 (the three strongest levels on the Fujita scale). The locations are UTM zone 15 coordinates, expressed in kilometers. These data were extracted from TornadoHistoryProject.com. The points in ia.csv define the border of Iowa, again in UTM zone 15 coordinates in kilometers. See the R notes at the end for how to define the study window using an irregular polygon. Our interest is in the spatial pattern of these locations. Problem 2 will explore the intensity of touchdown locations.
a) Plot the locations. Your plot is your answer.
# tornado touchdown locations and the Iowa border (UTM zone 15, km)
torn <- read.csv("tornado.csv")
ia.border <- read.csv("ia.csv")
# point pattern with the state border polygon as the observation window
tor.ppp <- ppp(torn$x, torn$y, poly = list(x = ia.border$x, y = ia.border$y))
plot(tor.ppp, pch = 19, main = '')
b) You want to know whether touch-down locations are clustered, randomly distributed, or spatially segregated. Use K(x) or L(x) to understand the spatial pattern. Describe in a sentence your interpretation of the spatial pattern (no hypothesis test needed or expected). Include a plot supporting your interpretation.
Lenv <- Lest(tor.ppp)
plot(Lenv)
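A centred version of the same summary, \(L(r) - r\) (approximately 0 under CSR and positive at distances where points cluster), is often easier to read; a quick sketch using the Lenv object above:
# centred L function: values above 0 suggest clustering at that distance
plot(Lenv, . - r ~ r, main = '')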
c) Use point-wise \(\alpha=0.05\) tests to determine whether the pattern shows clustering or spatial segregation. What do you think? Support your conclusion with an appropriate plot or test statistic(s).
Lenv <- envelope(tor.ppp, Lest, nsim=51, nrank=25)
## Generating 51 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
## 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
## 46, 47, 48, 49, 50, 51.
##
## Done.
plot(Lenv)
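For reference, the two-sided pointwise significance level of a simulation envelope is \(2\,\mathrm{nrank}/(\mathrm{nsim}+1)\), so an envelope targeting exactly \(\alpha = 0.05\) could be generated with, for example (a sketch, not run here):
# two-sided pointwise alpha = 2*nrank/(nsim + 1) = 2*5/200 = 0.05
Lenv05 <- envelope(tor.ppp, Lest, nsim = 199, nrank = 5)
plot(Lenv05)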
d) Use a summary test over the interval from 1km to 50km to evaluate whether the data show some non-random spatial pattern. Report which test you performed, the p-value, and your interpretation of the results of this test.
mt<-dclf.test(tor.ppp, fun=Lest, rinterval=c(1,50) )
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
## 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
## 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
## 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
## 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
## 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
mt$p.value
## [1] 0.01
The DCLF test over 1–50 km gives a small p-value (0.01), so there is consistent evidence that the touchdown locations are not completely spatially random.
e) You want to estimate the spatial scale of the pattern, i.e. if a tornado touches down at a location, how likely is it that another one touches down 5 km away, or 10 km away, etc.? Which summary function, F(x), G(x), J(x), K(x), L(x), or g(x) is most appropriate to answer this? Briefly explain your choice.
f) Analyze the data using the method chosen in e). What do you conclude about the spatial scale of the pattern? Include an appropriate plot to support your conclusion.
genv <- envelope(tor.ppp, pcf, nsim=51, nrank=25)
## Generating 51 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
## 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
## 46, 47, 48, 49, 50, 51.
##
## Done.
plot(genv, log(cbind(obs, theo, lo, hi) ) ~ r,xlim=c(0, 80))
g) Fit a Matern cluster process to the tornado data. I recommend that you restrict the fit to distances between 5km and 50km. What are the estimated parameters, including µ?
model1 <- matclust.estK(tor.ppp, startpar = c(kappa = 1.5, R = 2),rmin = 5,rmax = 50)
model1
## Minimum contrast fit (object of class "minconfit")
## Model: Matern Cluster process
## Fitted by matching theoretical K function to Kest(tor.ppp)
##
## Internal parameters fitted by minimum contrast ($par):
## kappa R
## 0.002352311 18.029645413
##
## Fitted cluster parameters:
## kappa scale
## 0.002352311 18.029645413
##
## Converged successfully after 161 function evaluations
##
## Starting values of parameters:
## kappa R
## 1.5 2.0
## Domain of integration: [ 5 , 50 ]
## Exponents: p= 2, q= 0.25
plot(model1, legendargs=list(cex=0.8))
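The printed fit reports \(\kappa\) and the cluster scale R but not \(\mu\); for a Matern cluster process the mean number of points per cluster can be recovered from the observed intensity as \(\hat{\mu} = \hat{\lambda}/\hat{\kappa}\). A minimal sketch (lambda.hat and mu.hat are names introduced here for illustration):
# mean points per cluster: mu-hat = lambda-hat / kappa-hat
lambda.hat <- intensity(tor.ppp)
mu.hat <- lambda.hat / model1$par[["kappa"]]
mu.hat
# model1$modelpar should also list kappa, R and mu together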
h) Overlay the L2(x) function, i.e. \(\sqrt{K(r)/\pi} - r\), estimated from the data and the L2(x) function for the fitted Matern cluster process. Visually, does the Matern cluster process appear to fit the data? Include an appropriate plot with your answer.
# centred L functions, L(r) - r = sqrt(K(r)/pi) - r, for the data (iso) and the fitted model (fit)
plot(model1, sqrt(iso/pi) - r ~ r, legendargs = list(cex = 1))
plot(model1, sqrt(fit/pi) - r ~ r, legendargs = list(cex = 1), add = TRUE)
i) Would it be appropriate to test whether the Matern cluster process fits these data by comparing the L2(x) function from the data to the envelope simulated using the fitted Matern cluster process? Briefly explain why or why not.
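(For reference, such an envelope would be built by simulating from the fitted parameters, e.g. with rMatClust; the sketch below assumes \(\hat{\mu} = \hat{\lambda}/\hat{\kappa}\) as above and is illustrative only.)
# sketch: envelope of L(r) under the fitted Matern cluster process
kap <- model1$par[["kappa"]]
R.hat <- model1$par[["R"]]
mu.hat <- intensity(tor.ppp) / kap
Lenv.fit <- envelope(tor.ppp, Lest, nsim = 39,
                     simulate = expression(rMatClust(kap, R.hat, mu.hat, win = Window(tor.ppp))))
plot(Lenv.fit)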
2) We now turn to the intensity function (in the point-pattern sense: the average number of tornadoes per unit area, not the meteorological intensity of a tornado).
a) What is the optimal bandwidth when using the Diggle-Berman MSE criterion? What about when using the point process likelihood criterion?
bw.diggle(tor.ppp)
## sigma
## 18.41262
bw.diggle(tor.ppp,hmax=50)
## sigma
## 22.70059
bw.ppl(tor.ppp)
## sigma
## 50.06409
b) Plot the MSE criterion and the point process likelihood criterion for a range of bandwidths (I suggest from 5 km to 50 km). What do these plots tell you about the precision of the optimal bandwidth? In other words, for either method, is there a clearly defined best bandwidth (precise) or a range of reasonable bandwidths (not precise)? Include your plots with your answer.
temp <- bw.diggle(tor.ppp, srange=c(5, 50))
plot(temp)
temp <- bw.ppl(tor.ppp, srange=c(5, 50))
plot(temp)
c) Pick what you believe is a reasonable bandwidth and estimate the intensity of tornadoes over the state of Iowa. Your answer is your choice of bandwidth, a brief explanation of why you chose it, and a map of predicted intensity.
# kernel intensity estimate with bandwidth sigma = 18 km (close to the bw.diggle value above)
plot(density(tor.ppp, sigma = 18))
d) A friend claims tornadoes are most frequent in southern Iowa and become increasingly less common as you move north. Evaluate this claim by fitting an inhomogeneous Poisson process model with a log intensity that is a linear function of northing (the y coordinate), that is, fit the model \(\log \lambda(s) = \beta_0 + \beta_1 y(s)\), where y(s) is the northing at location s. Report the estimated parameters and test whether the data support your friend's claim.
tor.ipp <- ppm(tor.ppp, ~x+y, correction='iso')
summary(tor.ipp)
## Point process model
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.ppp(Q = tor.ppp, trend = ~x + y, correction = "iso")
## Edge correction: "isotropic"
## ---------------------------------------------------------------------------
## Quadrature scheme = data + dummy + weights
##
## Data pattern:
## Planar point pattern: 160 points
## Average intensity 0.0011 points per square unit
## Window: polygonal boundary
## single connected closed polygon with 127 vertices
## enclosing rectangle: [204.6172, 737.0234] x [4470.733, 4822.466] units
## Window area = 145040 square units
##
## Dummy quadrature points:
## (32 x 32 grid, plus 4 corner points)
## Planar point pattern: 839 points
## Average intensity 0.00578 points per square unit
## Window: polygonal boundary
## single connected closed polygon with 127 vertices
## enclosing rectangle: [204.6172, 737.0234] x [4470.733, 4822.466] units
## Window area = 145040 square units
## Quadrature weights:
## (counting weights based on 32 x 32 array of rectangular tiles)
## All weights:
## range: [7.68, 183] total: 145000
## Weights on data points:
## range: [37.7, 91.4] total: 13000
## Weights on dummy points:
## range: [7.68, 183] total: 132000
## ---------------------------------------------------------------------------
## FITTED MODEL:
##
## Nonstationary Poisson process
##
## ---- Intensity: ----
##
## Log intensity: ~x + y
##
## Fitted trend coefficients:
## (Intercept) x y
## -2.8444809498 -0.0008182010 -0.0007718252
##
## Estimate S.E. CI95.lo CI95.hi Ztest
## (Intercept) -2.8444809498 4.0992621482 -10.878887124 5.1899252239
## x -0.0008182010 0.0006056726 -0.002005297 0.0003688955
## y -0.0007718252 0.0008725915 -0.002482073 0.0009384226
## Zval
## (Intercept) -0.6939007
## x -1.3508965
## y -0.8845207
##
## ----------- gory details -----
##
## Fitted regular parameters (theta):
## (Intercept) x y
## -2.8444809498 -0.0008182010 -0.0007718252
##
## Fitted exp(theta):
## (Intercept) x y
## 0.05816445 0.99918213 0.99922847
####### Test whether the model with x + y fits better than the constant-intensity (homogeneous Poisson) model
tor.hpp <- ppm(tor.ppp, ~1, correction='iso')
anova(tor.hpp, tor.ipp, test='Chi')
## Analysis of Deviance Table
##
## Model 1: ~1 Poisson
## Model 2: ~x + y Poisson
## Npar Df Deviance Pr(>Chi)
## 1 1
## 2 3 2 2.3937 0.3021
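For the northing-only model exactly as written in part d), \(\log \lambda(s) = \beta_0 + \beta_1 y(s)\), a minimal sketch (tor.y.pp is a name introduced here; output not shown) is:
# inhomogeneous Poisson model with log-intensity linear in northing only
tor.y.pp <- ppm(tor.ppp, ~y, correction = 'iso')
coef(tor.y.pp)
anova(tor.hpp, tor.y.pp, test = 'Chi')   # LR test of beta1 = 0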
e) Tornadoes, especially severe ones, can be on the ground for kilometers. Some “Iowa” tornadoes started in other states. The recorded location (in this data set) for a tornado that started in another state is the location on/near the state border where the tornado crossed into Iowa. You would like to know whether recorded tornado locations are more frequent along the southern border of Iowa. One way to evaluate this is to define a variable, border(s), that is 1 close to the border and 0 elsewhere, then fit the model \(\log \lambda(s) = \beta_0 + \beta_1 \, \text{border}(s)\). You decide that the “border” is every location with a northing less than 4510 km. (See the notes on the next page for how to create this variable.) Fit this model and report your estimates of \(\beta_0\) and \(\beta_1\).
# indicator covariate: 1 where northing < 4510 km (the southern border strip), 0 elsewhere
# (note: this reuses the name ia.border, which is no longer needed as a data frame)
ia.border <- as.im(function(x, y) { (y < 4510) + 0 }, W = tor.ppp)
tor.bor.pp <- ppm(tor.ppp, ~ia.border, correction = 'iso')
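The estimates of \(\beta_0\) and \(\beta_1\) requested above can be read off the fitted model, for example with:
coef(tor.bor.pp)   # (Intercept) = beta0, ia.border = beta1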
f) Test the hypothesis that \(\beta_1=0\). Report the test you used, the test statistic, p-value, and a short conclusion.
anova(tor.hpp, tor.bor.pp, test='Chi')
## Analysis of Deviance Table
##
## Model 1: ~1 Poisson
## Model 2: ~ia.border Poisson
## Npar Df Deviance Pr(>Chi)
## 1 1
## 2 2 1 5.627 0.01769 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
g) Describe (using some sort of estimate) the estimated increase in intensity of tornadoes in the border strip.
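One such estimate is the fitted intensity ratio \(e^{\hat{\beta}_1}\), the multiplicative increase in intensity inside the border strip (the coefficient name ia.border below matches the model formula):
# multiplicative increase in intensity inside the border strip
exp(coef(tor.bor.pp)[["ia.border"]])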
h) Which model (from part d, with y as the covariate, or part e, with border as the covariate) better fits these data? Support your conclusion with appropriate numbers or plots.
AIC(tor.ipp)
## [1] 2502.587
AIC(tor.bor.pp)
## [1] 2497.354