Tornado Climatology

Thomas H. Jagger, Florida State University
July 1, 2014, UCLA

Space Time Model of Kansas County Tornadoes
Integrating R-INLA with \( \require{color} \color{green} \mathbf{R} \) Spatial Packages

with support from James B. Elsner

Generated using R-Studio on Mon Jan 12 22:17:33 2015

Why model county activity?

We have county wide data such as
- monthly precipitation
- population
- surface roughness,
- any other aggregated measurements.
Reduce the complex space time stochastic model to a simpler space time lattice model.
Quickly Test model covariate importance, using a statistically defensible mechanism.

How do we model the activity by county?

Overlay tornado tracks onto counties.
- count intersects by year for each county.
- This reduces complex model to space time lattice model.
- This allows model estimation to be done in minutes.
- Single storm may be counted in multiple counties.
Aggregate all data to the county level
Model with Bayesian framework
- Estimate parameter distributions using R-INLA

Warning: Aggregation of any kind may result in spurious relationships

Why use Bayesian Modelling?

Bayesian models

assume parameters are not fixed! but random with distributions.
- Posterior \( \propto \) prior \( \cdot \) likelihood.
works well with hierarchical models
- obs~\( f_1 \)(param[1]) param[i]~\( f_i \)(param[i+1])
- we need it for constructing space-time models.
converge, as posterior samples exist with proper initialization.
incorporate prior information e.g. from previous research.
Allow model comparison: P(M|D) \( \propto \) P(D|M)\( \cdot \) P(M)

Why R-INLA

R-INLA performs Integrated Nested Laplacian Analysis

Performs Hierarchical Bayesian Analysis
- Free choice of likelihood, priors for latent parameters, hyper-priors
- latent parameters follow a GMRF, Gaussian Markov Random Field
- GMRF has hyper-priors! Obs ~ f(\( \mu \)), \( \mu \)~GMRF(\( \tau \))
- returns predictive, posterior and hyper prior distributions
Results e.g. posteriors can be easily plotted
- Use \( \require{color} \color{green} \mathbf{R} \) Spatial and Graphing packages
- ggplot2, sp
Provides diagnostics e.g. DIC, PIT, PO
Provides posterior samplers

How can I do it?

Hurricane Count Example:

\[ \begin{eqnarray} Y_{i} &\sim& \mbox{Pois}(\lambda_{i}) \\ \log(\lambda_{i}) &\sim& N(\alpha+ \mbox{sst}_i \cdot \beta,\tau^{-1}) \\ \alpha, \beta &\sim& N(0,1000) \\ \tau &\sim& \Gamma(1,.00001) \\ \end{eqnarray} \]

require(INLA)
a=.5;b=.5;n=100;sst=rnorm(100);ID=1:100    
y=rpois(n,exp(a+b*sst+rnorm(n)/sqrt(10)))   
mydata=data.frame(y=y,ID=1:100,sst=sst)


form = y ~ sst + f(ID, model="iid")   

inla.out = inla(form, family="poisson",data=mydata,
  control.compute = list(config = TRUE,mlik=T,cpo=T,dic=T,po=T))

Hurricane Example Output


Call:
c("inla(formula = form, family = \"poisson\", data = mydata, control.compute = list(config = TRUE, ",  "    mlik = T, cpo = T, dic = T, po = T))")

Time used:
 Pre-processing    Running inla Post-processing           Total 
         0.3310          0.0745          0.0440          0.4495 

Fixed effects:
              mean     sd 0.025quant 0.5quant 0.975quant   mode kld
(Intercept) 0.5777 0.0776     0.4218   0.5790     0.7267 0.5815   0
sst         0.4380 0.0663     0.3071   0.4382     0.5676 0.4387   0

Random effects:
Name      Model
 ID   IID model 

Model hyperparameters:
                     mean       sd 0.025quant 0.5quant 0.975quant    mode
Precision for ID 18571.51 18339.99    1261.14 13169.95   66863.71 3422.02

Expected number of effective parameters(std dev): 2.019(0.0145)
Number of equivalent replicates : 49.54 

Deviance Information Criterion: 340.41
Effective number of parameters: 2.019

Marginal Likelihood:  -176.22 
CPO and PIT are computed

Posterior marginals for linear predictor and fitted values computed

Tornado and Covariate Data Sets

We used a modified tornado data set keeping tornado paths in Kansas from 1954 through 2013 of at least EF1 (F1) strength. The data sets were put into \( \require{color} \color{green} \mathbf{R} \) SpatialLinesDataFrame objects or arrays. We use the \( \require{color} \color{green} \mathbf{R} \) ggplot2 package for plotting data sets.

Tornado data set is a spatial line data set with attributes from the SPC
Elevation data from Viewfinder Panoramas
Population data from The American Fact Finder
Kansas county precipitation data
Spatial derivatives of mean county elevation generated using our modified Savitsky-Golay algorithm
Source code with links: CountyStatisticsMerge.Rmd

Kansas Tornado Counts by County 1954-2013

Kansas Elevation in Meters =

Derived County Elevation Data

plot of chunk derivedcounty

County Population Data by County (Thousands)

plot of chunk population

Mean 1981-2010 Rainfall Data by County (Inches)

plot of chunk rainfall

Bayesian Model Strategy

We compare two models
- Spatial model using total tornado counts
- Yearly model using yearly county tornado counts
We initially develop a spatial model using the following variables:
- nT: Total county tornado count from 1954-2013
- Pop/Area: Population density
- ElevS:Elevation standard deviation (surface roughness)
- Area: As an offset and covariate
  - fixes fitted values as counts/area
- Elev.20: \( \frac{\partial^2\rm{Elev}}{\partial x^2} \)
In the time space model we use all the above except Elev.20 with
- nT: Yearly county tornado count
- Mar+Apr-mam: March + April county rain anomaly

Model Likelihoods and Priors

Likelihood: \( \mbox{Pr}(X=k|r,P) = \frac{\Gamma(k+r)}{\Gamma(r) k!}P^r(1-P)^k \)

The mean is modeled using a linear combination of covariates using logarithmic link function
The latent effect is modeled using a nearest neighbor Besag Model
- Using queen contiguity, generated by spdep poly2nb()
- Besag Model: \( \mu_i \sim N \left(\sum_{j \sim i} \mu_j, \frac{1}{\tau N_i}\right) \)
The count parameter is modeled with a \( \Gamma(0,.00001) \) hyper-prior.

spatial.formula = nT ~ I(log(area)) + 
  I(log(pop/area)) + I(elevS) + I(elev.20*100) + 
  f(ID, model = "besag", graph = tornb.inla)
  c.r = list(return.marginals.random = TRUE)
  c.c = list(config = TRUE,mlik=T,cpo=T,dic=T,po=T)       
  c.p = list(compute=TRUE)
spatial.inla= inla(formula = spatial.formula, family = "nbinomial", E = area,
  data = spdf.1@data, control.predictor = c.p, 
  control.compute=c.c,  control.results=c.r)

Spatial Model Output

Fixed effects:
                   mean    sd 0.025quant 0.5quant 0.975quant   mode kld
(Intercept)      -2.984 1.449     -5.844   -2.982     -0.142 -2.976   0
I(log(area))     -0.186 0.195     -0.569   -0.186      0.199 -0.187   0
I(log(pop/area))  0.100 0.043      0.017    0.100      0.185  0.100   0
I(elevS)         -0.014 0.005     -0.024   -0.014     -0.004 -0.014   0
I(elev.20 * 100) -0.149 0.163     -0.471   -0.149      0.172 -0.149   0

Random effects:
Name      Model
 ID   Besags ICAR model 

Model hyperparameters:
                                                      mean    sd 0.025quant
size for the nbinomial observations (overdispersion) 10.32  2.35       6.37
Precision for ID                                     18.77 17.33       4.65
                                                     0.5quant 0.975quant mode
size for the nbinomial observations (overdispersion)    10.11      15.54 9.72
Precision for ID                                        13.53      64.06 8.34

Expected number of effective parameters(std dev): 20.20(7.03)
Number of equivalent replicates : 5.20 

Deviance Information Criterion: 700.05
Effective number of parameters: 21.97

Spatial Model Output Discussion

We can use the output to quickly ascertain

What coefficients are significant statistically?
What is the coefficient posterior mean, median and how can one interpret these?
Are the effect sizes meaningful and of the expected sign?
What are the fixed effects, random effects and hyper priors?
Given that the dispersion parameter is modeled by the parameter \( r \), how dispersed is the data?
- \( \phi=(1+\mu/r) \)
What is the DIC, how do we use it? Why is it meaningless here?
- What is the effective number of parameters, what does this mean?

Fixed and Random Effects Posterior Means

plot of chunk maprandom

Posterior Densities for Spatial Model

plot of chunk SpatialcoefficientDensities

Adding a Time Component

We created a new data set, reduced to the 14 Years (2000 – 2013) with precipitation data.
Negative binomial distributions are closed under independent sums provided \( P \) the probability of of success is fixed.
- \( \mu = r\frac{1-P}{P} = r\mbox{ or} \)
- \( r \approx 60 \), but, we have only 14 years of time series data for precipitation.
- Response: Nt has 1470 data points

spaceTime.formula = nT ~ I(log(area)) + 
    I(log(pop/area)) + I(elevS) +
    I(Apr + May - mam) + I(mam) + f(ID, model = "besag", graph = tornb.inla)  
spaceTime.inla= inla(formula = spaceTime.formula, family = "nbinomial", E = area,
  data = inladata, control.predictor = c.p, control.compute=c.c,  control.results=c.r)

Space Time INLA Summary

Fixed effects:
                     mean    sd 0.025quant 0.5quant 0.975quant   mode kld
(Intercept)        -5.202 2.804    -10.761   -5.188      0.269 -5.158   0
I(log(area))       -0.404 0.380     -1.147   -0.406      0.346 -0.409   0
I(log(pop/area))    0.015 0.084     -0.148    0.015      0.182  0.014   0
I(elevS)           -0.019 0.010     -0.038   -0.019      0.000 -0.019   0
I(Apr + May - mam)  0.107 0.026      0.057    0.107      0.158  0.107   0
I(mam)             -0.006 0.096     -0.191   -0.008      0.188 -0.011   0

Random effects:
Name      Model
 ID   Besags ICAR model 

Model hyperparameters:
                                                      mean    sd 0.025quant
size for the nbinomial observations (overdispersion) 0.355 0.047      0.272
Precision for ID                                     2.086 0.934      0.882
                                                     0.5quant 0.975quant  mode
size for the nbinomial observations (overdispersion)    0.352      0.457 0.345
Precision for ID                                        1.883      4.452 1.551

Expected number of effective parameters(std dev): 29.65(6.18)
Number of equivalent replicates : 49.58 

Deviance Information Criterion: 2079.14
Effective number of parameters: 30.94

DIC is 15.1213 smaller than model without spring precipitation anomaly.

Posterior Densities for Space Time

plot of chunk coefficientDensities

Summary

R-INLA provides a fast way to perform Bayesian analysis on climate data sets
- Spatial Model fits well, data contains no zeros.
Space Time models are an easy extension of spatial models.
- Useful for including time varying covariates.
- The space time model PIT and CPO checks may look poor for positive counts, due to the number of zero counts in the fit.
- Other likelihoods: zero inflated negative binomial
  - hyper parameters distribution does not have covariates.
- Temporal latent fields, e.g. Random Walk, AR1 models possible with R-INLA
R-INLA allows space time simulation, just include data with NA and known predictors.

The End!

Thank you for your time.

Analysis and Talk on http://rpubs.com/thjagger/

Thomas Jagger tjagger@fsu.edu