Synopsis

Based upon U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (1950-2011), we conducted an exploration data analysis with respect to the incidence of individual severe weather events (SWE), their geographical distribution and impact in the terms of human injuries, fatalities, property and crop damage. Data processing and exploration were made using R programming language and documented in Markdown, report further transferred to PDF and RPub. Due to the quality of data, further statistical analysis in deep may be performed only given a qualified input to the historical evolution of the data monitoring and registering system, as well as to the possibility of aggregation over the scattered event types and individual observations. Hail was on the top of the SWE by incidence, followed by different kinds of thunderstorm, tornado and flood. In U.S. states comparison Texas had the at most SWE, followed by Kansas and Oklahoma. As the figures for the states are surface dependent, at county level twenty leading counties belong all to Arkansas (ranking #8 at state level). At the basis of latitude and longitude analysis (1°x1° squares) we identified as most threatened contiguous region with latitude 32°N to 40°N and longitude 90°W to 101°W. Dynamics of registered SWE over years is suspected to be strongly influenced by the evolution of monitoring and registration system. The main threats wit respect to personal and economical damage has beeen defined, as well as U.S. states were ranked respective human and economical impact of SWE.

Data Sourcing and Processing

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as data coming in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

There is also some documentation of the database available

Set-up and downloading the data

startdir <- "~/R"
setwd(startdir)
datadir <- "./data"
datafile = paste(datadir,"StormData.csv.bz2",sep = "/")
durl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if(!dir.exists("./data")) dir.create("./data")

if(!file.exists(datafile)) {
        download.file(durl, destfile = datafile, mode = "wb")
}
#For the sake of performance and efficiency
library(data.table) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading in the data

Here is a short insight in the data file from console.

bzcat StormData.csv.bz2|head -5

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
1.00,4/18/1950 0:00:00,"0130","CST",97.00,"MOBILE","AL","TORNADO",0.00,,,,,0.00,,0.00,,,14.00,100.00,"3",0.00,0.00,15.00,25.00,"K",0.00,,,,,3040.00,8812.00,3051.00,8806.00,,1.00
1.00,4/18/1950 0:00:00,"0145","CST",3.00,"BALDWIN","AL","TORNADO",0.00,,,,,0.00,,0.00,,,2.00,150.00,"2",0.00,0.00,0.00,2.50,"K",0.00,,,,,3042.00,8755.00,0.00,0.00,,2.00
1.00,2/20/1951 0:00:00,"1600","CST",57.00,"FAYETTE","AL","TORNADO",0.00,,,,,0.00,,0.00,,,0.10,123.00,"2",0.00,0.00,2.00,25.00,"K",0.00,,,,,3340.00,8742.00,0.00,0.00,,3.00
1.00,6/8/1951 0:00:00,"0900","CST",89.00,"MADISON","AL","TORNADO",0.00,,,,,0.00,,0.00,,,0.00,100.00,"2",0.00,0.00,2.00,2.50,"K",0.00,,,,,3458.00,8626.00,0.00,0.00,,4.00

So, data looks regular at the start. But it is known, that the file lines are CRLF delimited (DOS formatted) and there are longer text comments down, with line breaks in. We must control the integrity and uniformity of the data after reading them in.

stormdata <- data.table(read.csv(datafile, na.strings = c("NA","")))

First evaluation of data structure, completeness and integrity

How big is the resulting data table?

class(stormdata)
## [1] "data.table" "data.frame"
dim(stormdata)
## [1] 902297     37

What kinds of data are in?

colnames(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Removing the trailing underscores from column names (make.names alone doesn’t care)

cn <- make.names(sub("_*$","", x = colnames(stormdata)))
cn[1] <- "STATE_NR"
colnames(stormdata) <- cn

Looking at the structure of the data.

str(stormdata)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE_NR  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE : num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Five-point summary

summary(stormdata)
##     STATE_NR                 BGN_DATE             BGN_TIME     
##  Min.   : 1.0   5/25/2011 0:00:00:  1202   12:00:00 AM: 10163  
##  1st Qu.:19.0   4/27/2011 0:00:00:  1193   06:00:00 PM:  7350  
##  Median :30.0   6/9/2011 0:00:00 :  1030   04:00:00 PM:  7261  
##  Mean   :31.2   5/30/2004 0:00:00:  1016   05:00:00 PM:  6891  
##  3rd Qu.:45.0   4/4/2011 0:00:00 :  1009   12:00:00 PM:  6703  
##  Max.   :95.0   4/2/2006 0:00:00 :   981   03:00:00 PM:  6700  
##                 (Other)          :895866   (Other)    :857229  
##    TIME_ZONE          COUNTY           COUNTYNAME         STATE       
##  CST    :547493   Min.   :  0.0   JEFFERSON :  7840   TX     : 83728  
##  EST    :245558   1st Qu.: 31.0   WASHINGTON:  7603   KS     : 53440  
##  MST    : 68390   Median : 75.0   JACKSON   :  6660   OK     : 46802  
##  PST    : 28302   Mean   :100.6   FRANKLIN  :  6256   MO     : 35648  
##  AST    :  6360   3rd Qu.:131.0   LINCOLN   :  5937   IA     : 31069  
##  HST    :  2563   Max.   :873.0   (Other)   :866412   NE     : 30271  
##  (Other):  3631                   NA's      :  1589   (Other):621339  
##                EVTYPE         BGN_RANGE           BGN_AZI      
##  HAIL             :288661   Min.   :   0.000   N      : 86752  
##  TSTM WIND        :219940   1st Qu.:   0.000   W      : 38446  
##  THUNDERSTORM WIND: 82563   Median :   0.000   S      : 37558  
##  TORNADO          : 60652   Mean   :   1.484   E      : 33178  
##  FLASH FLOOD      : 54277   3rd Qu.:   1.000   NW     : 24041  
##  FLOOD            : 25326   Max.   :3749.000   (Other):134990  
##  (Other)          :170878                      NA's   :547332  
##          BGN_LOCATI                  END_DATE             END_TIME     
##  COUNTYWIDE   : 19680   4/27/2011 0:00:00:  1214   06:00:00 PM:  9802  
##  Countywide   :   993   5/25/2011 0:00:00:  1196   05:00:00 PM:  8314  
##  SPRINGFIELD  :   843   6/9/2011 0:00:00 :  1021   04:00:00 PM:  8104  
##  SOUTH PORTION:   810   4/4/2011 0:00:00 :  1007   12:00:00 PM:  7483  
##  NORTH PORTION:   784   5/30/2004 0:00:00:   998   11:59:00 PM:  7184  
##  (Other)      :591444   (Other)          :653450   (Other)    :622432  
##  NA's         :287743   NA's             :243411   NA's       :238978  
##    COUNTY_END COUNTYENDN       END_RANGE           END_AZI      
##  Min.   :0    Mode:logical   Min.   :  0.0000   N      : 28082  
##  1st Qu.:0    NA's:902297    1st Qu.:  0.0000   S      : 22510  
##  Median :0                   Median :  0.0000   W      : 20119  
##  Mean   :0                   Mean   :  0.9862   E      : 20047  
##  3rd Qu.:0                   3rd Qu.:  0.0000   NE     : 14606  
##  Max.   :0                   Max.   :925.0000   (Other): 72096  
##                                                 NA's   :724837  
##            END_LOCATI         LENGTH              WIDTH         
##  COUNTYWIDE     : 19731   Min.   :   0.0000   Min.   :   0.000  
##  SOUTH PORTION  :   833   1st Qu.:   0.0000   1st Qu.:   0.000  
##  NORTH PORTION  :   780   Median :   0.0000   Median :   0.000  
##  CENTRAL PORTION:   617   Mean   :   0.2301   Mean   :   7.503  
##  SPRINGFIELD    :   575   3rd Qu.:   0.0000   3rd Qu.:   0.000  
##  (Other)        :380536   Max.   :2315.0000   Max.   :4400.000  
##  NA's           :499225                                         
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00   K      :424665   Min.   :  0.000   K      :281832  
##  1st Qu.:   0.00   M      : 11330   1st Qu.:  0.000   M      :  1994  
##  Median :   0.00   0      :   216   Median :  0.000   k      :    21  
##  Mean   :  12.06   B      :    40   Mean   :  1.527   0      :    19  
##  3rd Qu.:   0.50   5      :    28   3rd Qu.:  0.000   B      :     9  
##  Max.   :5000.00   (Other):    84   Max.   :990.000   (Other):     9  
##                    NA's   :465934                     NA's   :618413  
##       WFO                                       STATEOFFIC    
##  OUN    : 17393   TEXAS, North                       : 12193  
##  JAN    : 13889   ARKANSAS, Central and North Central: 11738  
##  LWX    : 13174   IOWA, Central                      : 11345  
##  PHI    : 12551   KANSAS, Southwest                  : 11212  
##  TSA    : 12483   GEORGIA, North and Central         : 11120  
##  (Other):690738   (Other)                            :595920  
##  NA's   :142069   NA's                               :248769  
##                                                                                                                                                                                                     ZONENAMES     
##                                                                                                                                                                                                          :205988  
##  GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M                                                                                                                                         :   639  
##  GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA                                                                                                                                                       :   592  
##  JEFFERSON - JEFFERSON                                                                                                                                                                                   :   303  
##  MADISON - MADISON                                                                                                                                                                                       :   302  
##  (Other)                                                                                                                                                                                                 :100444  
##  NA's                                                                                                                                                                                                    :594029  
##     LATITUDE      LONGITUDE        LATITUDE_E     LONGITUDE     
##  Min.   :   0   Min.   :-14451   Min.   :   0   Min.   :-14455  
##  1st Qu.:2802   1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0  
##  Median :3540   Median :  8707   Median :   0   Median :     0  
##  Mean   :2875   Mean   :  6940   Mean   :1452   Mean   :  3509  
##  3rd Qu.:4019   3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735  
##  Max.   :9706   Max.   : 17124   Max.   :9706   Max.   :106220  
##  NA's   :47                      NA's   :40                     
##                                            REMARKS           REFNUM      
##                                                : 24013   Min.   :     1  
##  Trees down.\n                                 :  1110   1st Qu.:225575  
##  Several trees were blown down.\n              :   568   Median :451149  
##  Trees were downed.\n                          :   446   Mean   :451149  
##  Large trees and power lines were blown down.\n:   432   3rd Qu.:676723  
##  (Other)                                       :588295   Max.   :902297  
##  NA's                                          :287433

Looking for missing values.

sum(is.na(stormdata))
## [1] 6645709

It looks like a big number of data are not available (in absolute figures). Nothing to wonder, taking in account the time span of observations (1950-2011). How much is it in relation to the whole bunch?

mean(is.na(stormdata))
## [1] 0.1990628

So, about 20% of data entries are missing.

What kinds of data are missing (in per cent of all observations)?

mv <- sort(subset(x <- sapply(stormdata, function(x) mean(is.na(x))), x > 0)*100,
        decreasing = TRUE)
for(i in seq(along = mv)) {print(mv[i])}
## COUNTYENDN 
##        100 
##        F 
## 93.49061 
##  END_AZI 
## 80.33242 
## CROPDMGEXP 
##   68.53763 
## ZONENAMES 
##   65.8352 
##  BGN_AZI 
## 60.65985 
## END_LOCATI 
##   55.32823 
## PROPDMGEXP 
##   51.63865 
## BGN_LOCATI 
##   31.89005 
## REMARKS 
## 31.8557 
## STATEOFFIC 
##   27.57063 
## END_DATE 
## 26.97682 
## END_TIME 
## 26.48551 
##      WFO 
## 15.74526 
## COUNTYNAME 
##  0.1761061 
##    LATITUDE 
## 0.005208928 
## LATITUDE_E 
## 0.00443313

The data labeled as “COUNTYENDN” are totally missing, those labeled with “F” are missing for about 94% of observations. Data labeled “END_AZI” are missing in 80% of observations and so on. Curiously, that the data for “END_DATE” and “END_TIME” are missing in about 26-27% as well.

It seems to exist no some rational common imputing strategy for those kinds of missing data.

Reasonable data table transformations

Looking at the showed data structure, some transformations of the are reasonable for the following generic data analysis:

Following the common logic and assuming the length restrictions for the column names in original document, the column #34 becomes name “LATITUDE_END” and #35 the name “LONGITUDE_END”.

Converting of BGN_TIME and END_TIME to POSIXct/lt format is operation of general usability, but really hard to implement (only a few of used TZ abbreviations are directly accepted by R, changing time zone belongings in the course of time). As we make no use of precise times in this analysis, the otherwise quite reasonable transformation will be omitted here.

stormdata <- transform(stormdata, STATE_NR = factor(STATE_NR),
        COUNTY = factor(COUNTY),
        BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
        END_DATE = as.Date(END_DATE, format = "%m/%d/%Y %H:%M:%S"),
        REMARKS = as.character(REMARKS),
        REFNUM = as.character(REFNUM)
)       
colnames(stormdata)[34] <- "LATITUDE_END"
colnames(stormdata)[35] <- "LONGITUDE_END"

Classification of variables and definition of observation

The grouping into fixed and measured variables is some tricky in this case, as the data came not from clear experimental design, but the “set-up” was made by the nature. The next challenge is, that not all of the variable names are intuitive and we could find an authentic interpretation neither in the given sources, nor with a help of Google. So, the choice of the analyzed variables is to some extent arbitrary.

Variables have been grouped them as follows.

Fixed variables (what? where? when? has happened; type, spatial and time identifiers)

REFNUM
STATE
STATE_NR
COUNTY
COUNTYNAME
COUNTY_END
EVTYPE
TIME_ZONE
BGN_DATE
BGN_TIME
END_DATE
END_TIME
BGN_LOCATI
END_LOCATI
LATITUDE
LONGITUDE
LATITUDE_END
LONGITUDE_END

Measured variables (strength? extents? impact?)

LENGTH
WIDTH
BGN_RANGE
END_RANGE
BGN_AZI
END_AZI
FATALITIES
INJURIES
PROPDMG
CROPDMG

An observation may be defined as unique combination of location (state + county), date (e.g. BGN_DATE) and event type. Let us check the work definition.

library(data.table)
tmpt <- data.table(with(stormdata, paste(STATE, COUNTYNAME, BGN_DATE, BGN_TIME, EVTYPE)))
length(tmpt$V1)
## [1] 902297
length(unique(tmpt$V1))
## [1] 874197
#Are there really double or multiple identical entries?
count(tmpt, V1, sort = T)
## Source: local data table [874,197 x 2]
## 
##                                                       V1     n
##                                                    (chr) (int)
## 1  ME AROOSTOOK 2010-05-25 02:25:00 PM THUNDERSTORM WIND    15
## 2             VA SHENANDOAH 2010-03-28 08:00:00 PM FLOOD    14
## 3                  ME WALDO 2007-03-17 10:00:00 AM FLOOD    13
## 4    AL CULLMAN 2009-12-08 10:10:00 PM THUNDERSTORM WIND    13
## 5                  ME WALDO 2006-10-28 19:00:00 PM FLOOD    12
## 6                    KS STAFFORD 1955-06-04 1830 TORNADO    11
## 7             VA SHENANDOAH 2011-04-14 08:00:00 AM FLOOD    11
## 8                   ME KNOX 2007-03-17 09:00:00 AM FLOOD    10
## 9      FL DIXIE 2011-04-05 05:05:00 AM THUNDERSTORM WIND     9
## 10       MD BALTIMORE 2011-07-07 07:45:00 PM FLASH FLOOD     9
## ..                                                   ...   ...

Hm, the uniqueness of the proposed definition is not proven, in about 3% of all cases duplicate or multiple entries for location, beginning time and event type are there. Is location data at county level not precise enough? Let us append the beginning location.

tmpt <- data.table(with(stormdata, paste(STATE, COUNTYNAME, BGN_DATE, BGN_TIME,
                                         LONGITUDE, LATITUDE, EVTYPE)))
length(tmpt$V1)
## [1] 902297
length(unique(tmpt$V1))
## [1] 893700
#Are there still double or multiple identical entries?
count(tmpt, V1, sort = T)
## Source: local data table [893,700 x 2]
## 
##                                                              V1     n
##                                                           (chr) (int)
## 1                 KS STAFFORD 1955-06-04 1830 9836 3804 TORNADO    11
## 2             NV NVZ006 1996-11-16 04:30:00 AM 0 0 WINTER STORM     8
## 3             VT WINDHAM 1995-07-15 0725 0 0 THUNDERSTORM WINDS     7
## 4         ND GRAND FORKS 2000-06-12 10:30:00 PM 0 0 FLASH FLOOD     7
## 5               MT MTZ028 2002-03-08 09:00:00 AM 0 0 HEAVY SNOW     7
## 6                 KS RUSH 2006-08-18 07:00:00 AM 0 0 HEAVY RAIN     7
## 7  IN MORGAN 2008-01-29 19:05:00 PM 8625 3925 THUNDERSTORM WIND     7
## 8                    KS FORD 1955-06-04 1700 10002 3729 TORNADO     6
## 9                    SD BROWN 1980-05-29 1442 9806 4520 TORNADO     6
## 10                  TX FANNIN 1966-05-23 1600 9610 3336 TORNADO     6
## ..                                                          ...   ...

There are either inconsistencies or technological redundancy in the data (e.g. through multiple automatic monitoring stations in recent years). We would need some more professional counseling at this point. It is probably not right time and place for perfection, but we take notice of the fact.

Results

Which types of severe weather events were registered most often?

Top-20 of the SWE (over the whole observation period)

sorted <- sort(table(stormdata$EVTYPE), decreasing = TRUE, n = 20)[1:20]
print(as.data.frame(sorted))
##                          sorted
## HAIL                     288661
## TSTM WIND                219940
## THUNDERSTORM WIND         82563
## TORNADO                   60652
## FLASH FLOOD               54277
## FLOOD                     25326
## THUNDERSTORM WINDS        20843
## HIGH WIND                 20212
## LIGHTNING                 15754
## HEAVY SNOW                15708
## HEAVY RAIN                11723
## WINTER STORM              11433
## WINTER WEATHER             7026
## FUNNEL CLOUD               6839
## MARINE TSTM WIND           6175
## MARINE THUNDERSTORM WIND   5812
## WATERSPOUT                 3796
## STRONG WIND                3566
## URBAN/SML STREAM FLD       3392
## WILDFIRE                   2761

We can see, that hail is the leading kind of SWE considering the number of registered events. The thunderstorm events as a whole could compete with it. There are over 120 different event types wit a pattern “TSTM” or “THUND” in the table; spreading due to historical reasons (changing in classification) assumable.

What US states are most threatened by SWE?

Top-10 states by number of SWE (1950-2011)

hd <- stormdata%>%count(STATE)%>%arrange(desc(n))
hd[1:10]
## Source: local data table [10 x 2]
## 
##     STATE     n
##    (fctr) (int)
## 1      TX 83728
## 2      KS 53440
## 3      OK 46802
## 4      MO 35648
## 5      IA 31069
## 6      NE 30271
## 7      IL 28488
## 8      AR 27102
## 9      NC 25351
## 10     GA 25259

The clear “leader” is by far Texas, followed by Kansas, Oklahoma and Montana.

Comparison of the distribution of SWE by state: whole observation period vs. 2000-2011

hd2000 <- stormdata%>%filter(BGN_DATE >= as.Date("2000-01-01","%Y-%m-%d"))%>%
        count(STATE)
hd <- merge(hd, hd2000,by = "STATE")
hd <- hd%>%arrange(desc(n.x))
barplot(hd$n.x, names.arg = hd$STATE, col="green", ylim = c(0,90000), main = "
        Incidence of SWE by U.S. state\n1950-2011 vs. 2000-2011", xlab = "U.S. state",
        ylab = "Number of SWE by state")
barplot(hd$n.y, names.arg = hd$STATE, col="blue", ylim = c(0,90000), add=T)
legend("topright",legend =c("1950-2011","2000-2011"), col = c("green","blue"), pch=c(19,19))
box()

The leading states in the past ten years of monitoring are the same as for the whole reporting period.

It looks alarming, like about the half of all registered SWE happened within the last ten years of the whole reported period of sixty one years. It could indicate adverse climatic trends, but only after we have cleared possible impact of changing monitoring and reporting density and coverage.

Number of registered SWE as a timeline

Driven by the results of the analysis of structure of event types, we want to overview the timely dynamics of the total number of registered SWE.

ty <- table(format(stormdata$BGN_DATE,"%Y"))
plot(x=names(ty),y = as.integer(ty), pch = 20, xlab = "Year", ylab = "Number of severe events", ylim = c(0, max(ty)), main = "Number of registered SWE over years")
box()

At this moment we don’t need any kind of regression function to see the dramatic gain of the number of registered severe weather events in time. The rocket like trajectory is unrealistic to explain only due to the natural (r)evolution. One more time, advancements in coverage of monitoring supposed to make a huge contribution to the figures.

Spatial imaging the incidence of SWE by latitude and longitude

The next consideration with respect to the shown distribution of SWE is, that calculated incidence of the SWE by U.S. state depends amongst others upon it’s surface area respectively the number of counties in the state. As the second ranked U.S. state, Texas has some “natural handicap” in the competition.

This consideration is a fortiori weighty, as severe events are registered at in fact on the county basis, i.e. the territorially large scale events may be registered county-wise many times in the given table. This approach may be well grounded for administrative reasons, but it definitely distorts the human perception.

The trueness of this assumption may be illustrated with the following evaluation.

tstm <- subset(stormdata, EVTYPE %like% "THUNDERS|TSTM", select = c("STATE","BGN_DATE"))
tstm
##         STATE   BGN_DATE
##      1:    AL 1955-03-21
##      2:    AL 1955-04-06
##      3:    AL 1955-04-06
##      4:    AL 1955-04-06
##      5:    AL 1955-04-13
##     ---                 
## 336805:    AN 2011-11-16
## 336806:    AM 2011-11-16
## 336807:    AM 2011-11-16
## 336808:    LE 2011-11-14
## 336809:    LE 2011-11-14
unique(tstm)
##        STATE   BGN_DATE
##     1:    AL 1955-03-21
##     2:    AL 1955-04-06
##     3:    AL 1955-04-13
##     4:    AL 1955-04-21
##     5:    AL 1955-05-24
##    ---                 
## 66555:    AM 2011-11-29
## 66556:    GA 2011-11-22
## 66557:    TX 2011-11-21
## 66558:    AN 2011-11-16
## 66559:    LE 2011-11-14

The counts show, the total number of the registered thunderstorms (as individual observations, i.e. per county) is about five times higher, than the number calculated on the state and same-day basis.

That is the ground, why exploration data analysis with unified spatial units may be of additional interest, especially in respect to human perception. Based on the given data for such unit may be taken quadrate with the side of 1° in geographic coordinate system. It allows a simple allocation of the weather events over the available coordinates like LATITUDE and LONGITUDE.

Incidence of severe events by latitude and longitude (pseudo-map)

mtrx <- xtabs(formula = ~ as.integer(LATITUDE/100) + as.integer(LONGITUDE/100),
              data = stormdata, drop.unused.levels = TRUE)
#Neutralizing the fictitious maximum at longitude 0 and latitude 0 
#(incompleteness due to historical reasons?)
mtrx[which.max(mtrx)] <- 0
#Swapping the columns - reverting the longitude 
mtrx1 <- matrix(nrow = dim(mtrx)[1], ncol = dim(mtrx)[2])
for (i in 1:dim(mtrx)[2]) mtrx1[,dim(mtrx)[2] -i + 1] <- mtrx[,i]
#Cropping the matrix
mtrx2 <- mtrx1[10:40,40:99]
#And here is the pseudo-map 
image(x = 1:dim(mtrx2)[2],y = 1:dim(mtrx2)[1], z = t(as.matrix(mtrx2)), 
      col = topo.colors(50), main = "Can You recognize the land?", xaxt = "n", yaxt = "n",xlab = "", ylab = "")
box()

Here is an attempt to recycle the calculated spatial data for tabular estimation, validating the former imaging one more time.

tmp <- sort(mtrx, decreasing = TRUE)[1:20] # selecting the top-20 1°x1° squares
#Demo - matrix indices (but not geo. coordinates, preserved as row-/col-names!) 
#of the top-20 squares
sapply(tmp, function(x) which(mtrx==x, arr.ind=TRUE))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]   22   24   26   19   23   25   21   21   20    24    28    22    24
## [2,]   38   38   35   38   36   31   27   33   25    41    34    42    34
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]    23    19    22    21    26    24    25
## [2,]    38    34    35    23    36    39    18
#Saving multiple matrix indices for  top-20
tmp2 <- sapply(tmp, function(x) which(mtrx==x, arr.ind=TRUE))
#Demo tmp2 - indices for latitude in first row, for longitude in second
#tmp2
#      [,1] [,2] [,3] [,4]  ...
#[1,]   22   24   26   ... 
#[2,]   38   38   35 
#
#Compare tmp2[1,1] to the indices of the top-1 in the mtrx
#which(mtrx==4347, arr.ind=TRUE)
#   as.integer(LATITUDE/100) as.integer(LONGITUDE/100)
#35                       22                        38
#
#Like tmp2 but with the geo coordinates
geo_top20 <- rbind(as.integer(rownames(mtrx)[tmp2[1,]]),as.integer(colnames(mtrx)[tmp2[2,]]))
#Latitudes of the top-20
sort(geo_top20[1,])
##  [1] 32 32 33 34 34 34 35 35 35 36 36 37 37 37 37 38 38 39 39 41
#Longitudes of the top-20
sort(geo_top20[2,])
##  [1]  77  82  84  86  90  92  93  93  93  94  94  95  95  97  97  97  97
## [18]  98 100 101

We can see, that the geo coordinates of the top-20 squares build the contiguous region with latitude 32°N to 40°N and longitude 90°W to 101°W. This region is the most threatened through severe weather events.

The Top-20 counties are found (no more surprisingly) all in Arkansas (33°00′ N to 36°30′ N, 89°39′ W to 94°37′ W)

(stormdata%>%count(STATE, COUNTYNAME)%>%arrange(desc(n)))[1:20]
## Source: local data table [20 x 3]
## Groups: 
## 
##     STATE COUNTYNAME     n
##    (fctr)     (fctr) (int)
## 1      AK     AKZ024   120
## 2      AK     AKZ213   110
## 3      AK     AKZ019   102
## 4      AK     AKZ001    96
## 5      AK     AKZ020    95
## 6      AK     AKZ017    94
## 7      AK     AKZ022    90
## 8      AK     AKZ125    88
## 9      AK     AKZ101    86
## 10     AK     AKZ131    86
## 11     AK     AKZ155    82
## 12     AK     AKZ007    79
## 13     AK     AKZ181    75
## 14     AK     AKZ195    71
## 15     AK     AKZ191    71
## 16     AK     AKZ025    66
## 17     AK     AKZ018    64
## 18     AK     AKZ201    59
## 19     AK     AKZ185    57
## 20     AK     AKZ214    51

Which types of events are most harmful to population health?

As indicators for impact of severe weather events on population health we can take such quantitive measures as FATALITIES and INJURIES. It is one more time the problem of scattered event types, but it deserves a try.

The Top-20 SWE by the total number of caused injuries (1950-2011)

injdata<- stormdata%>%group_by(EVTYPE)%>%summarize(Injured = sum(INJURIES))%>%arrange(desc(Injured)) 
#Total number of event types causing person injuries.
dim(injdata[injdata$Injured > 0])[1]
## [1] 158
print(as.data.frame(injdata[injdata$Injured > 0][1:20]))
##                EVTYPE Injured
## 1             TORNADO   91346
## 2           TSTM WIND    6957
## 3               FLOOD    6789
## 4      EXCESSIVE HEAT    6525
## 5           LIGHTNING    5230
## 6                HEAT    2100
## 7           ICE STORM    1975
## 8         FLASH FLOOD    1777
## 9   THUNDERSTORM WIND    1488
## 10               HAIL    1361
## 11       WINTER STORM    1321
## 12  HURRICANE/TYPHOON    1275
## 13          HIGH WIND    1137
## 14         HEAVY SNOW    1021
## 15           WILDFIRE     911
## 16 THUNDERSTORM WINDS     908
## 17           BLIZZARD     805
## 18                FOG     734
## 19   WILD/FOREST FIRE     545
## 20         DUST STORM     440

We see tornado at pole position as cause for person injuries. Taking into account the already addressed extreme scattering of the event types around thunderstorm, here is an attempt to accumulate the figures using the known pattern.

stormdata%>%filter(EVTYPE %like% "TSTM|THUND")%>%summarize(sum(INJURIES))
##    sum(INJURIES)
## 1:          9545

Even after cumulative evaluation thunderstorms of all kinds would stay at the second place, by far away from tornados.

The Top-20 SWE by the total number of fatalities

injdata<- stormdata%>%group_by(EVTYPE)%>%summarize(fatalities = sum(FATALITIES))%>%arrange(desc(fatalities)) 
#Total number of event types causing person injuries.
dim(injdata[injdata$fatalities > 0])[1]
## [1] 168
print(as.data.frame(injdata[injdata$fatalities > 0][1:20]))
##                     EVTYPE fatalities
## 1                  TORNADO       5633
## 2           EXCESSIVE HEAT       1903
## 3              FLASH FLOOD        978
## 4                     HEAT        937
## 5                LIGHTNING        816
## 6                TSTM WIND        504
## 7                    FLOOD        470
## 8              RIP CURRENT        368
## 9                HIGH WIND        248
## 10               AVALANCHE        224
## 11            WINTER STORM        206
## 12            RIP CURRENTS        204
## 13               HEAT WAVE        172
## 14            EXTREME COLD        160
## 15       THUNDERSTORM WIND        133
## 16              HEAVY SNOW        127
## 17 EXTREME COLD/WIND CHILL        125
## 18             STRONG WIND        103
## 19                BLIZZARD        101
## 20               HIGH SURF        101

Tornados stay at the top as a cause for fatalities just like it was for injuries, but excessive heat at the second place is some kind of surprise. It is even more surprising, as the relation of heat to death is mostly much less explicit than those of mechanical damages through tornado or drowning in a flood. And so, such “unspectacular” cause as heat might be still strong undervalued. as a fatal factor.

It is moreover the question, as heat is scattered into at least sixteen different event types.

unique(grep("heat",stormdata$EVTYPE, ignore.case = T, value =T))
##  [1] "HEAT"                   "EXTREME HEAT"          
##  [3] "EXCESSIVE HEAT"         "RECORD HEAT"           
##  [5] "HEAT WAVE"              "DROUGHT/EXCESSIVE HEAT"
##  [7] "RECORD HEAT WAVE"       "RECORD/EXCESSIVE HEAT" 
##  [9] "HEAT WAVES"             "HEAT WAVE DROUGHT"     
## [11] "HEAT/DROUGHT"           "HEAT DROUGHT"          
## [13] "Heatburst"              "Record Heat"           
## [15] "Heat Wave"              "EXCESSIVE HEAT/DROUGHT"

…and tornado comes in fifteen different avatars.

unique(grep("tornado",stormdata$EVTYPE, ignore.case = TRUE, value = T))
##  [1] "TORNADO"                    "TORNADO F0"                
##  [3] "TORNADOS"                   "WATERSPOUT/TORNADO"        
##  [5] "WATERSPOUT TORNADO"         "WATERSPOUT-TORNADO"        
##  [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"          
##  [9] "WATERSPOUT/ TORNADO"        "TORNADO F3"                
## [11] "TORNADO F1"                 "TORNADO/WATERSPOUT"        
## [13] "TORNADO F2"                 "TORNADOES"                 
## [15] "TORNADO DEBRIS"

Cumulative assessment tornado vs. heat as a cause of mortality

Plots, even the boxplots with log-functions, are not the best way to visually compare mortality through heat vs. tornado. It is due to very low mean/median values and extreme outliers in both cases.

The five-number summary fits better.

fltheat <- grep("heat",stormdata$EVTYPE, ignore.case = T)
m1 <- cbind("HEAT",stormdata$FATALITIES[fltheat])
fltorn <- grep("tornado",stormdata$EVTYPE, ignore.case = T)
m2 <- cbind("TORNADO",stormdata$FATALITIES[fltorn])
dtab <- as.data.table(rbind(m1,m2))
#Heat
summary(as.integer(m1[,2]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.185   1.000 583.000
#Tornado
summary(as.integer(m2[,2]))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.00000   0.00000   0.09326   0.00000 158.00000

The ultimate evaluation of heat vs. tornado as cause of fatalities shows, that in both cases the mean number of fatalities is about zero with extreme outliers, corresponding to catastrophic events.

Here is the recalculated total number of fatalities summarized for main kinds of SWE.

dtab%>%group_by(V1)%>%summarize(FATALITIES = sum(as.numeric(V2)))%>%arrange()
## Source: local data table [0 x 0]

As we can see, tornado keeps the leadership, but summarizing the fatalities through heat shows essential gain of mortality through heat.

Which types of events have the greatest economic consequences?

The only interpretable variables to estimate the economic consequences of severe weather events in the available data are PROPDMG (for property damage) and CROPDMG (for crop damage).

We are using at this point the same techniques as for estimation of the population health impact.

Top-20 SWE by property damage

dmgdata<- stormdata%>%group_by(EVTYPE)%>%summarize(PropDamage = sum(PROPDMG))%>%arrange(desc(PropDamage)) 
#Total number of event types causing person injuries.
dim(dmgdata[dmgdata$PropDamage > 0])[1]
## [1] 406
print(as.data.frame(dmgdata[dmgdata$PropDamage > 0][1:20]))
##                  EVTYPE PropDamage
## 1               TORNADO 3212258.16
## 2           FLASH FLOOD 1420124.59
## 3             TSTM WIND 1335965.61
## 4                 FLOOD  899938.48
## 5     THUNDERSTORM WIND  876844.17
## 6                  HAIL  688693.38
## 7             LIGHTNING  603351.78
## 8    THUNDERSTORM WINDS  446293.18
## 9             HIGH WIND  324731.56
## 10         WINTER STORM  132720.59
## 11           HEAVY SNOW  122251.99
## 12             WILDFIRE   84459.34
## 13            ICE STORM   66000.67
## 14          STRONG WIND   62993.81
## 15           HIGH WINDS   55625.00
## 16           HEAVY RAIN   50842.14
## 17       TROPICAL STORM   48423.68
## 18     WILD/FOREST FIRE   39344.95
## 19       FLASH FLOODING   28497.15
## 20 URBAN/SML STREAM FLD   26051.94

With respect to property damage the usual suspects are here, like tornado and thunderstorm, kinds of windy weather, hail, lightning and flood. As the further grouping is difficult for known reasons, we’ll try to to apply this, probably not perfect classification to evaluate the cumulative structure of property damage through the top placed SWE.

fltornado <- grepl("tornado",stormdata$EVTYPE,  ignore.case = TRUE)
mtornado <- cbind("TORNADO",stormdata$PROPDMG[fltornado])

flthund <- grepl("tstm|thund",stormdata$EVTYPE,  ignore.case = TRUE)
mthund <- cbind("THUNDERSTORM",stormdata$PROPDMG[flthund])

flwind <- grepl("wind",stormdata$EVTYPE,  ignore.case = TRUE)
mwind <- cbind("WIND",stormdata$PROPDMG[flwind])

flhail <- grepl("hail",stormdata$EVTYPE,  ignore.case = T)
mhail <- cbind("HAIL",stormdata$PROPDMG[flhail])

fltlight <- grepl("lightning",stormdata$EVTYPE,  ignore.case = TRUE)
mlihtning <- cbind("LIGHTNING",stormdata$PROPDMG[fltlight])

fltflood <- grepl("flood",stormdata$EVTYPE,  ignore.case = TRUE)
mflood <- cbind("FLOOD",stormdata$PROPDMG[fltflood])

dtab <- as.data.table(rbind(mtornado, mthund, mwind, mhail, mlihtning, mflood))
dim(dtab)
## [1] 1151331       2
#Redundancy through partially overlapping event type names,%
dim(dtab)[1]/sum(grepl("tornado|tstm|thund|wind|hail|lightning|flood", 
                       stormdata$EVTYPE, fixed = FALSE, ignore.case = TRUE))
## [1] 1.415363

The result of selection is redundant through partially overlapping event type names like “TORNADOES, TSTM WIND, HAIL” or “TSTM WIND”. The data redundancy in the selection due to it is about 40%. The number of rows in the selection is 20% higher, then the number of original observations.

An attempt to reduce the impact of historical dimension (evolution of definitions) brings no breakthrough - the redundancy of selection for the period 2000-2011 stays over 39%.

stormdata2k <- stormdata%>%filter(BGN_DATE >= as.Date("2000-01-01","%Y-%m-%d"))

fltornado <- grepl("tornado",stormdata2k$EVTYPE,  ignore.case = TRUE)
mtornado <- cbind("TORNADO",stormdata2k$PROPDMG[fltornado])

flthund <- grepl("tstm|thund",stormdata2k$EVTYPE, fixed = FALSE, ignore.case = TRUE)
mthund <- cbind("THUNDERSTORM",stormdata2k$PROPDMG[flthund])

flwind <- grepl("wind",stormdata2k$EVTYPE,  ignore.case = TRUE)
mwind <- cbind("WIND",stormdata2k$PROPDMG[flwind])

flhail <- grepl("hail",stormdata2k$EVTYPE,  ignore.case = T)
mhail <- cbind("HAIL",stormdata2k$PROPDMG[flhail])

fltlight <- grepl("lightning",stormdata2k$EVTYPE,  ignore.case = TRUE)
mlihtning <- cbind("LIGHTNING",stormdata2k$PROPDMG[fltlight])

fltflood <- grepl("flood",stormdata2k$EVTYPE,  ignore.case = TRUE)
mflood <- cbind("FLOOD",stormdata2k$PROPDMG[fltflood])

dtab2k <- as.data.table(rbind(mtornado, mthund, mwind, mhail, mlihtning, mflood))
dim(dtab2k)
## [1] 634896      2
#Redundancy through partially overlapping event type names,% (2000-2011)
dim(dtab2k)[1]/sum(grepl("tornado|tstm|thund|wind|hail|lightning|flood", 
                       stormdata2k$EVTYPE, ignore.case = TRUE))
## [1] 1.393434

At this point we see no possibility for further data aggregation without the professional input concerning the way to do it.

Top-20 SWE with respect to the crop damage

dmgdata<- stormdata%>%group_by(EVTYPE)%>%summarize(CropDamage = sum(CROPDMG))%>%arrange(desc(CropDamage)) 
#Total number of event types causing person injuries.
dim(dmgdata[dmgdata$CropDamage > 0])[1]
## [1] 136
print(as.data.frame(dmgdata[dmgdata$CropDamage > 0][1:20]))
##                EVTYPE CropDamage
## 1                HAIL  579596.28
## 2         FLASH FLOOD  179200.46
## 3               FLOOD  168037.88
## 4           TSTM WIND  109202.60
## 5             TORNADO  100018.52
## 6   THUNDERSTORM WIND   66791.45
## 7             DROUGHT   33898.62
## 8  THUNDERSTORM WINDS   18684.93
## 9           HIGH WIND   17283.21
## 10         HEAVY RAIN   11122.80
## 11       FROST/FREEZE    7034.14
## 12       EXTREME COLD    6121.14
## 13     TROPICAL STORM    5899.12
## 14          HURRICANE    5339.31
## 15     FLASH FLOODING    5126.05
## 16  HURRICANE/TYPHOON    4798.48
## 17           WILDFIRE    4364.20
## 18     TSTM WIND/HAIL    4356.65
## 19   WILD/FOREST FIRE    4189.54
## 20          LIGHTNING    3580.61

For crop damage it is a hail leading by far, with flash flood and flood at following positions.

For the reasons addressed before we resign at this moment the further aggregation of data.

Raniking of sates with respect to personal and property damage (2000-2011)

Following states are especially threatened through the SWE with respect to the damage (data for the period 2000-2011).

dmgstat <- stormdata2k%>%group_by(STATE)%>%summarize(Injuries = sum(INJURIES),Fatalities = sum(FATALITIES),PropDamage = sum(PROPDMG),CropDamage = sum(CROPDMG))
injdmg <- arrange(dmgstat, desc(Injuries))
fataldmg <- arrange(dmgstat, desc(Fatalities))
propdmg <- arrange(dmgstat, desc(PropDamage))
cropdmg <- arrange(dmgstat, desc(CropDamage))

Top-10 respective injuries

States sorted descendent by the total number of injuries.

select(injdmg, STATE, Injuries)[1:10]
## Source: local data table [10 x 2]
## 
##     STATE Injuries
##    (fctr)    (dbl)
## 1      MO     5057
## 2      AL     3206
## 3      TX     2192
## 4      CA     2050
## 5      TN     1863
## 6      FL     1845
## 7      OK     1543
## 8      GA     1226
## 9      MS     1032
## 10     NC      887

Top-10 states respective fatalities

States sorted descendent by the total number of fatalities.

select(fataldmg, STATE, Fatalities)[1:10]
## Source: local data table [10 x 2]
## 
##     STATE Fatalities
##    (fctr)      (dbl)
## 1      TX        501
## 2      MO        411
## 3      FL        396
## 4      AL        369
## 5      CA        355
## 6      IL        353
## 7      PA        285
## 8      TN        282
## 9      NY        171
## 10     NC        164

Summarized ranking for injuries and fatalities

Here is the summarized ranking for personal damage by state, as a sum of both rankings. The placement on top means more risk of personal damage.

injstates <- as.vector(injdmg$STATE)
fatalstates <- as.vector(fataldmg$STATE)
rank <- sapply(injstates, function(x) which(injstates == x, arr.ind = TRUE) + which(fatalstates == x, arr.ind = TRUE))

rnames <- names(rank)
rank <- data.table(cbind(rnames, rank))
print(rank%>%arrange(as.numeric(rank)))
##     rnames rank
##  1:     MO    3
##  2:     TX    4
##  3:     AL    6
##  4:     CA    9
##  5:     FL    9
##  6:     TN   13
##  7:     IL   19
##  8:     NC   20
##  9:     MS   21
## 10:     OK   22
## 11:     GA   22
## 12:     AR   22
## 13:     PA   22
## 14:     IN   30
## 15:     NY   31
## 16:     MD   36
## 17:     KS   38
## 18:     MI   39
## 19:     VA   40
## 20:     KY   43
## 21:     AZ   43
## 22:     CO   46
## 23:     OH   46
## 24:     LA   46
## 25:     NJ   51
## 26:     UT   51
## 27:     IA   53
## 28:     SC   54
## 29:     GU   59
## 30:     WI   59
## 31:     MN   62
## 32:     MA   64
## 33:     WA   65
## 34:     NV   68
## 35:     WY   72
## 36:     AS   73
## 37:     ID   78
## 38:     NE   79
## 39:     DE   81
## 40:     MT   81
## 41:     PR   81
## 42:     CT   82
## 43:     AK   83
## 44:     NM   84
## 45:     WV   84
## 46:     OR   88
## 47:     ND   90
## 48:     ME   93
## 49:     SD   95
## 50:     HI   95
## 51:     NH   97
## 52:     DC  107
## 53:     AM  107
## 54:     VT  107
## 55:     AN  108
## 56:     RI  112
## 57:     VI  114
## 58:     PZ  115
## 59:     LM  118
## 60:     PH  121
## 61:     LC  123
## 62:     GM  123
## 63:     LH  127
## 64:     LS  127
## 65:     LE  129
## 66:     SL  132
## 67:     LO  134
## 68:     PM  136
## 69:     PK  138
## 70:     XX  140
##     rnames rank

This ranking should sensitize the local authorities preparing the emergency medical services for natural disasters.

Here is the analogue estimation for property and crop damages.

The Top-10 states respective property damage (2000-2011)

select(propdmg, STATE, PropDamage)[1:10]
## Source: local data table [10 x 2]
## 
##     STATE PropDamage
##    (fctr)      (dbl)
## 1      TX   490910.6
## 2      IA   369875.3
## 3      MS   318315.5
## 4      OH   313869.5
## 5      GA   312846.9
## 6      AR   218050.4
## 7      IL   217247.2
## 8      AL   201949.7
## 9      PA   189069.5
## 10     NY   187893.7

The Top-10 states respective crop damage (2000-2011)

select(cropdmg, STATE, CropDamage)[1:10]
## Source: local data table [10 x 2]
## 
##     STATE CropDamage
##    (fctr)      (dbl)
## 1      NE  196201.10
## 2      TX  104436.51
## 3      IA   95231.05
## 4      KS   83595.02
## 5      WI   55866.14
## 6      MS   55622.79
## 7      ND   37129.00
## 8      MN   35195.20
## 9      AR   25592.01
## 10     LA   22632.95

Summarized ranking by economical damage by state

The places in property and crop damage rankings are added, the place on top means more economical damage.

pdmgstates <- as.vector(propdmg$STATE)
cdmgstates <- as.vector(cropdmg$STATE)
rank <- sapply(injstates, function(x) which(cdmgstates == x, arr.ind = TRUE) + which(pdmgstates == x, arr.ind = TRUE))

rnames <- names(rank)
rank <- data.table(cbind(rnames, rank))
print(rank%>%arrange(as.numeric(rank)))
##     rnames rank
##  1:     TX    3
##  2:     IA    5
##  3:     MS    9
##  4:     AR   15
##  5:     KS   15
##  6:     NE   15
##  7:     OH   16
##  8:     WI   21
##  9:     IL   24
## 10:     NY   24
## 11:     LA   25
## 12:     GA   28
## 13:     MO   29
## 14:     TN   30
## 15:     CA   32
## 16:     MN   32
## 17:     ND   32
## 18:     NC   33
## 19:     MI   36
## 20:     AL   38
## 21:     OK   40
## 22:     IN   40
## 23:     PA   41
## 24:     FL   44
## 25:     KY   44
## 26:     SD   49
## 27:     VA   50
## 28:     WA   57
## 29:     MT   65
## 30:     VT   66
## 31:     CO   68
## 32:     NM   68
## 33:     SC   70
## 34:     MD   71
## 35:     MA   73
## 36:     NJ   73
## 37:     WV   74
## 38:     AZ   75
## 39:     WY   76
## 40:     PR   77
## 41:     GU   78
## 42:     UT   79
## 43:     ID   84
## 44:     NH   87
## 45:     ME   90
## 46:     AK   90
## 47:     OR   91
## 48:     DE   95
## 49:     HI   95
## 50:     CT   98
## 51:     NV   98
## 52:     AS  100
## 53:     AM  104
## 54:     VI  105
## 55:     RI  107
## 56:     DC  109
## 57:     GM  117
## 58:     LM  119
## 59:     AN  121
## 60:     PZ  121
## 61:     LC  123
## 62:     LS  124
## 63:     PH  125
## 64:     LE  128
## 65:     LO  129
## 66:     LH  130
## 67:     SL  131
## 68:     PK  132
## 69:     PM  137
## 70:     XX  140
##     rnames rank