Based upon U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (1950-2011), we conducted an exploration data analysis with respect to the incidence of individual severe weather events (SWE), their geographical distribution and impact in the terms of human injuries, fatalities, property and crop damage. Data processing and exploration were made using R programming language and documented in Markdown, report further transferred to PDF and RPub. Due to the quality of data, further statistical analysis in deep may be performed only given a qualified input to the historical evolution of the data monitoring and registering system, as well as to the possibility of aggregation over the scattered event types and individual observations. Hail was on the top of the SWE by incidence, followed by different kinds of thunderstorm, tornado and flood. In U.S. states comparison Texas had the at most SWE, followed by Kansas and Oklahoma. As the figures for the states are surface dependent, at county level twenty leading counties belong all to Arkansas (ranking #8 at state level). At the basis of latitude and longitude analysis (1°x1° squares) we identified as most threatened contiguous region with latitude 32°N to 40°N and longitude 90°W to 101°W. Dynamics of registered SWE over years is suspected to be strongly influenced by the evolution of monitoring and registration system. The main threats wit respect to personal and economical damage has beeen defined, as well as U.S. states were ranked respective human and economical impact of SWE.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as data coming in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
There is also some documentation of the database available
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
Set-up and downloading the data
startdir <- "~/R"
setwd(startdir)
datadir <- "./data"
datafile = paste(datadir,"StormData.csv.bz2",sep = "/")
durl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!dir.exists("./data")) dir.create("./data")
if(!file.exists(datafile)) {
download.file(durl, destfile = datafile, mode = "wb")
}
#For the sake of performance and efficiency
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Reading in the data
Here is a short insight in the data file from console.
bzcat StormData.csv.bz2|head -5
"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
1.00,4/18/1950 0:00:00,"0130","CST",97.00,"MOBILE","AL","TORNADO",0.00,,,,,0.00,,0.00,,,14.00,100.00,"3",0.00,0.00,15.00,25.00,"K",0.00,,,,,3040.00,8812.00,3051.00,8806.00,,1.00
1.00,4/18/1950 0:00:00,"0145","CST",3.00,"BALDWIN","AL","TORNADO",0.00,,,,,0.00,,0.00,,,2.00,150.00,"2",0.00,0.00,0.00,2.50,"K",0.00,,,,,3042.00,8755.00,0.00,0.00,,2.00
1.00,2/20/1951 0:00:00,"1600","CST",57.00,"FAYETTE","AL","TORNADO",0.00,,,,,0.00,,0.00,,,0.10,123.00,"2",0.00,0.00,2.00,25.00,"K",0.00,,,,,3340.00,8742.00,0.00,0.00,,3.00
1.00,6/8/1951 0:00:00,"0900","CST",89.00,"MADISON","AL","TORNADO",0.00,,,,,0.00,,0.00,,,0.00,100.00,"2",0.00,0.00,2.00,2.50,"K",0.00,,,,,3458.00,8626.00,0.00,0.00,,4.00
So, data looks regular at the start. But it is known, that the file lines are CRLF delimited (DOS formatted) and there are longer text comments down, with line breaks in. We must control the integrity and uniformity of the data after reading them in.
stormdata <- data.table(read.csv(datafile, na.strings = c("NA","")))
First evaluation of data structure, completeness and integrity
How big is the resulting data table?
class(stormdata)
## [1] "data.table" "data.frame"
dim(stormdata)
## [1] 902297 37
What kinds of data are in?
colnames(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Removing the trailing underscores from column names (make.names alone doesn’t care)
cn <- make.names(sub("_*$","", x = colnames(stormdata)))
cn[1] <- "STATE_NR"
colnames(stormdata) <- cn
Looking at the structure of the data.
str(stormdata)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE_NR : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE : num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
Five-point summary
summary(stormdata)
## STATE_NR BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 (Other) :866412 NE : 30271
## (Other): 3631 NA's : 1589 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 N : 86752
## TSTM WIND :219940 1st Qu.: 0.000 W : 38446
## THUNDERSTORM WIND: 82563 Median : 0.000 S : 37558
## TORNADO : 60652 Mean : 1.484 E : 33178
## FLASH FLOOD : 54277 3rd Qu.: 1.000 NW : 24041
## FLOOD : 25326 Max. :3749.000 (Other):134990
## (Other) :170878 NA's :547332
## BGN_LOCATI END_DATE END_TIME
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## NA's :287743 NA's :243411 NA's :238978
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 N : 28082
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 S : 22510
## Median :0 Median : 0.0000 W : 20119
## Mean :0 Mean : 0.9862 E : 20047
## 3rd Qu.:0 3rd Qu.: 0.0000 NE : 14606
## Max. :0 Max. :925.0000 (Other): 72096
## NA's :724837
## END_LOCATI LENGTH WIDTH
## COUNTYWIDE : 19731 Min. : 0.0000 Min. : 0.000
## SOUTH PORTION : 833 1st Qu.: 0.0000 1st Qu.: 0.000
## NORTH PORTION : 780 Median : 0.0000 Median : 0.000
## CENTRAL PORTION: 617 Mean : 0.2301 Mean : 7.503
## SPRINGFIELD : 575 3rd Qu.: 0.0000 3rd Qu.: 0.000
## (Other) :380536 Max. :2315.0000 Max. :4400.000
## NA's :499225
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 K :424665 Min. : 0.000 K :281832
## 1st Qu.: 0.00 M : 11330 1st Qu.: 0.000 M : 1994
## Median : 0.00 0 : 216 Median : 0.000 k : 21
## Mean : 12.06 B : 40 Mean : 1.527 0 : 19
## 3rd Qu.: 0.50 5 : 28 3rd Qu.: 0.000 B : 9
## Max. :5000.00 (Other): 84 Max. :990.000 (Other): 9
## NA's :465934 NA's :618413
## WFO STATEOFFIC
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## NA's :142069 NA's :248769
## ZONENAMES
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## NA's :594029
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## : 24013 Min. : 1
## Trees down.\n : 1110 1st Qu.:225575
## Several trees were blown down.\n : 568 Median :451149
## Trees were downed.\n : 446 Mean :451149
## Large trees and power lines were blown down.\n: 432 3rd Qu.:676723
## (Other) :588295 Max. :902297
## NA's :287433
Looking for missing values.
sum(is.na(stormdata))
## [1] 6645709
It looks like a big number of data are not available (in absolute figures). Nothing to wonder, taking in account the time span of observations (1950-2011). How much is it in relation to the whole bunch?
mean(is.na(stormdata))
## [1] 0.1990628
So, about 20% of data entries are missing.
What kinds of data are missing (in per cent of all observations)?
mv <- sort(subset(x <- sapply(stormdata, function(x) mean(is.na(x))), x > 0)*100,
decreasing = TRUE)
for(i in seq(along = mv)) {print(mv[i])}
## COUNTYENDN
## 100
## F
## 93.49061
## END_AZI
## 80.33242
## CROPDMGEXP
## 68.53763
## ZONENAMES
## 65.8352
## BGN_AZI
## 60.65985
## END_LOCATI
## 55.32823
## PROPDMGEXP
## 51.63865
## BGN_LOCATI
## 31.89005
## REMARKS
## 31.8557
## STATEOFFIC
## 27.57063
## END_DATE
## 26.97682
## END_TIME
## 26.48551
## WFO
## 15.74526
## COUNTYNAME
## 0.1761061
## LATITUDE
## 0.005208928
## LATITUDE_E
## 0.00443313
The data labeled as “COUNTYENDN” are totally missing, those labeled with “F” are missing for about 94% of observations. Data labeled “END_AZI” are missing in 80% of observations and so on. Curiously, that the data for “END_DATE” and “END_TIME” are missing in about 26-27% as well.
It seems to exist no some rational common imputing strategy for those kinds of missing data.
Reasonable data table transformations
Looking at the showed data structure, some transformations of the are reasonable for the following generic data analysis:
Following the common logic and assuming the length restrictions for the column names in original document, the column #34 becomes name “LATITUDE_END” and #35 the name “LONGITUDE_END”.
Converting of BGN_TIME and END_TIME to POSIXct/lt format is operation of general usability, but really hard to implement (only a few of used TZ abbreviations are directly accepted by R, changing time zone belongings in the course of time). As we make no use of precise times in this analysis, the otherwise quite reasonable transformation will be omitted here.
stormdata <- transform(stormdata, STATE_NR = factor(STATE_NR),
COUNTY = factor(COUNTY),
BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
END_DATE = as.Date(END_DATE, format = "%m/%d/%Y %H:%M:%S"),
REMARKS = as.character(REMARKS),
REFNUM = as.character(REFNUM)
)
colnames(stormdata)[34] <- "LATITUDE_END"
colnames(stormdata)[35] <- "LONGITUDE_END"
The grouping into fixed and measured variables is some tricky in this case, as the data came not from clear experimental design, but the “set-up” was made by the nature. The next challenge is, that not all of the variable names are intuitive and we could find an authentic interpretation neither in the given sources, nor with a help of Google. So, the choice of the analyzed variables is to some extent arbitrary.
Variables have been grouped them as follows.
Fixed variables (what? where? when? has happened; type, spatial and time identifiers)
REFNUM
STATE
STATE_NR
COUNTY
COUNTYNAME
COUNTY_END
EVTYPE
TIME_ZONE
BGN_DATE
BGN_TIME
END_DATE
END_TIME
BGN_LOCATI
END_LOCATI
LATITUDE
LONGITUDE
LATITUDE_END
LONGITUDE_END
Measured variables (strength? extents? impact?)
LENGTH
WIDTH
BGN_RANGE
END_RANGE
BGN_AZI
END_AZI
FATALITIES
INJURIES
PROPDMG
CROPDMG
An observation may be defined as unique combination of location (state + county), date (e.g. BGN_DATE) and event type. Let us check the work definition.
library(data.table)
tmpt <- data.table(with(stormdata, paste(STATE, COUNTYNAME, BGN_DATE, BGN_TIME, EVTYPE)))
length(tmpt$V1)
## [1] 902297
length(unique(tmpt$V1))
## [1] 874197
#Are there really double or multiple identical entries?
count(tmpt, V1, sort = T)
## Source: local data table [874,197 x 2]
##
## V1 n
## (chr) (int)
## 1 ME AROOSTOOK 2010-05-25 02:25:00 PM THUNDERSTORM WIND 15
## 2 VA SHENANDOAH 2010-03-28 08:00:00 PM FLOOD 14
## 3 ME WALDO 2007-03-17 10:00:00 AM FLOOD 13
## 4 AL CULLMAN 2009-12-08 10:10:00 PM THUNDERSTORM WIND 13
## 5 ME WALDO 2006-10-28 19:00:00 PM FLOOD 12
## 6 KS STAFFORD 1955-06-04 1830 TORNADO 11
## 7 VA SHENANDOAH 2011-04-14 08:00:00 AM FLOOD 11
## 8 ME KNOX 2007-03-17 09:00:00 AM FLOOD 10
## 9 FL DIXIE 2011-04-05 05:05:00 AM THUNDERSTORM WIND 9
## 10 MD BALTIMORE 2011-07-07 07:45:00 PM FLASH FLOOD 9
## .. ... ...
Hm, the uniqueness of the proposed definition is not proven, in about 3% of all cases duplicate or multiple entries for location, beginning time and event type are there. Is location data at county level not precise enough? Let us append the beginning location.
tmpt <- data.table(with(stormdata, paste(STATE, COUNTYNAME, BGN_DATE, BGN_TIME,
LONGITUDE, LATITUDE, EVTYPE)))
length(tmpt$V1)
## [1] 902297
length(unique(tmpt$V1))
## [1] 893700
#Are there still double or multiple identical entries?
count(tmpt, V1, sort = T)
## Source: local data table [893,700 x 2]
##
## V1 n
## (chr) (int)
## 1 KS STAFFORD 1955-06-04 1830 9836 3804 TORNADO 11
## 2 NV NVZ006 1996-11-16 04:30:00 AM 0 0 WINTER STORM 8
## 3 VT WINDHAM 1995-07-15 0725 0 0 THUNDERSTORM WINDS 7
## 4 ND GRAND FORKS 2000-06-12 10:30:00 PM 0 0 FLASH FLOOD 7
## 5 MT MTZ028 2002-03-08 09:00:00 AM 0 0 HEAVY SNOW 7
## 6 KS RUSH 2006-08-18 07:00:00 AM 0 0 HEAVY RAIN 7
## 7 IN MORGAN 2008-01-29 19:05:00 PM 8625 3925 THUNDERSTORM WIND 7
## 8 KS FORD 1955-06-04 1700 10002 3729 TORNADO 6
## 9 SD BROWN 1980-05-29 1442 9806 4520 TORNADO 6
## 10 TX FANNIN 1966-05-23 1600 9610 3336 TORNADO 6
## .. ... ...
There are either inconsistencies or technological redundancy in the data (e.g. through multiple automatic monitoring stations in recent years). We would need some more professional counseling at this point. It is probably not right time and place for perfection, but we take notice of the fact.
Top-20 of the SWE (over the whole observation period)
sorted <- sort(table(stormdata$EVTYPE), decreasing = TRUE, n = 20)[1:20]
print(as.data.frame(sorted))
## sorted
## HAIL 288661
## TSTM WIND 219940
## THUNDERSTORM WIND 82563
## TORNADO 60652
## FLASH FLOOD 54277
## FLOOD 25326
## THUNDERSTORM WINDS 20843
## HIGH WIND 20212
## LIGHTNING 15754
## HEAVY SNOW 15708
## HEAVY RAIN 11723
## WINTER STORM 11433
## WINTER WEATHER 7026
## FUNNEL CLOUD 6839
## MARINE TSTM WIND 6175
## MARINE THUNDERSTORM WIND 5812
## WATERSPOUT 3796
## STRONG WIND 3566
## URBAN/SML STREAM FLD 3392
## WILDFIRE 2761
We can see, that hail is the leading kind of SWE considering the number of registered events. The thunderstorm events as a whole could compete with it. There are over 120 different event types wit a pattern “TSTM” or “THUND” in the table; spreading due to historical reasons (changing in classification) assumable.
Top-10 states by number of SWE (1950-2011)
hd <- stormdata%>%count(STATE)%>%arrange(desc(n))
hd[1:10]
## Source: local data table [10 x 2]
##
## STATE n
## (fctr) (int)
## 1 TX 83728
## 2 KS 53440
## 3 OK 46802
## 4 MO 35648
## 5 IA 31069
## 6 NE 30271
## 7 IL 28488
## 8 AR 27102
## 9 NC 25351
## 10 GA 25259
The clear “leader” is by far Texas, followed by Kansas, Oklahoma and Montana.
Comparison of the distribution of SWE by state: whole observation period vs. 2000-2011
hd2000 <- stormdata%>%filter(BGN_DATE >= as.Date("2000-01-01","%Y-%m-%d"))%>%
count(STATE)
hd <- merge(hd, hd2000,by = "STATE")
hd <- hd%>%arrange(desc(n.x))
barplot(hd$n.x, names.arg = hd$STATE, col="green", ylim = c(0,90000), main = "
Incidence of SWE by U.S. state\n1950-2011 vs. 2000-2011", xlab = "U.S. state",
ylab = "Number of SWE by state")
barplot(hd$n.y, names.arg = hd$STATE, col="blue", ylim = c(0,90000), add=T)
legend("topright",legend =c("1950-2011","2000-2011"), col = c("green","blue"), pch=c(19,19))
box()
The leading states in the past ten years of monitoring are the same as for the whole reporting period.
It looks alarming, like about the half of all registered SWE happened within the last ten years of the whole reported period of sixty one years. It could indicate adverse climatic trends, but only after we have cleared possible impact of changing monitoring and reporting density and coverage.
Number of registered SWE as a timeline
Driven by the results of the analysis of structure of event types, we want to overview the timely dynamics of the total number of registered SWE.
ty <- table(format(stormdata$BGN_DATE,"%Y"))
plot(x=names(ty),y = as.integer(ty), pch = 20, xlab = "Year", ylab = "Number of severe events", ylim = c(0, max(ty)), main = "Number of registered SWE over years")
box()
At this moment we don’t need any kind of regression function to see the dramatic gain of the number of registered severe weather events in time. The rocket like trajectory is unrealistic to explain only due to the natural (r)evolution. One more time, advancements in coverage of monitoring supposed to make a huge contribution to the figures.
Spatial imaging the incidence of SWE by latitude and longitude
The next consideration with respect to the shown distribution of SWE is, that calculated incidence of the SWE by U.S. state depends amongst others upon it’s surface area respectively the number of counties in the state. As the second ranked U.S. state, Texas has some “natural handicap” in the competition.
This consideration is a fortiori weighty, as severe events are registered at in fact on the county basis, i.e. the territorially large scale events may be registered county-wise many times in the given table. This approach may be well grounded for administrative reasons, but it definitely distorts the human perception.
The trueness of this assumption may be illustrated with the following evaluation.
tstm <- subset(stormdata, EVTYPE %like% "THUNDERS|TSTM", select = c("STATE","BGN_DATE"))
tstm
## STATE BGN_DATE
## 1: AL 1955-03-21
## 2: AL 1955-04-06
## 3: AL 1955-04-06
## 4: AL 1955-04-06
## 5: AL 1955-04-13
## ---
## 336805: AN 2011-11-16
## 336806: AM 2011-11-16
## 336807: AM 2011-11-16
## 336808: LE 2011-11-14
## 336809: LE 2011-11-14
unique(tstm)
## STATE BGN_DATE
## 1: AL 1955-03-21
## 2: AL 1955-04-06
## 3: AL 1955-04-13
## 4: AL 1955-04-21
## 5: AL 1955-05-24
## ---
## 66555: AM 2011-11-29
## 66556: GA 2011-11-22
## 66557: TX 2011-11-21
## 66558: AN 2011-11-16
## 66559: LE 2011-11-14
The counts show, the total number of the registered thunderstorms (as individual observations, i.e. per county) is about five times higher, than the number calculated on the state and same-day basis.
That is the ground, why exploration data analysis with unified spatial units may be of additional interest, especially in respect to human perception. Based on the given data for such unit may be taken quadrate with the side of 1° in geographic coordinate system. It allows a simple allocation of the weather events over the available coordinates like LATITUDE and LONGITUDE.
Incidence of severe events by latitude and longitude (pseudo-map)
mtrx <- xtabs(formula = ~ as.integer(LATITUDE/100) + as.integer(LONGITUDE/100),
data = stormdata, drop.unused.levels = TRUE)
#Neutralizing the fictitious maximum at longitude 0 and latitude 0
#(incompleteness due to historical reasons?)
mtrx[which.max(mtrx)] <- 0
#Swapping the columns - reverting the longitude
mtrx1 <- matrix(nrow = dim(mtrx)[1], ncol = dim(mtrx)[2])
for (i in 1:dim(mtrx)[2]) mtrx1[,dim(mtrx)[2] -i + 1] <- mtrx[,i]
#Cropping the matrix
mtrx2 <- mtrx1[10:40,40:99]
#And here is the pseudo-map
image(x = 1:dim(mtrx2)[2],y = 1:dim(mtrx2)[1], z = t(as.matrix(mtrx2)),
col = topo.colors(50), main = "Can You recognize the land?", xaxt = "n", yaxt = "n",xlab = "", ylab = "")
box()
Here is an attempt to recycle the calculated spatial data for tabular estimation, validating the former imaging one more time.
tmp <- sort(mtrx, decreasing = TRUE)[1:20] # selecting the top-20 1°x1° squares
#Demo - matrix indices (but not geo. coordinates, preserved as row-/col-names!)
#of the top-20 squares
sapply(tmp, function(x) which(mtrx==x, arr.ind=TRUE))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 22 24 26 19 23 25 21 21 20 24 28 22 24
## [2,] 38 38 35 38 36 31 27 33 25 41 34 42 34
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 23 19 22 21 26 24 25
## [2,] 38 34 35 23 36 39 18
#Saving multiple matrix indices for top-20
tmp2 <- sapply(tmp, function(x) which(mtrx==x, arr.ind=TRUE))
#Demo tmp2 - indices for latitude in first row, for longitude in second
#tmp2
# [,1] [,2] [,3] [,4] ...
#[1,] 22 24 26 ...
#[2,] 38 38 35
#
#Compare tmp2[1,1] to the indices of the top-1 in the mtrx
#which(mtrx==4347, arr.ind=TRUE)
# as.integer(LATITUDE/100) as.integer(LONGITUDE/100)
#35 22 38
#
#Like tmp2 but with the geo coordinates
geo_top20 <- rbind(as.integer(rownames(mtrx)[tmp2[1,]]),as.integer(colnames(mtrx)[tmp2[2,]]))
#Latitudes of the top-20
sort(geo_top20[1,])
## [1] 32 32 33 34 34 34 35 35 35 36 36 37 37 37 37 38 38 39 39 41
#Longitudes of the top-20
sort(geo_top20[2,])
## [1] 77 82 84 86 90 92 93 93 93 94 94 95 95 97 97 97 97
## [18] 98 100 101
We can see, that the geo coordinates of the top-20 squares build the contiguous region with latitude 32°N to 40°N and longitude 90°W to 101°W. This region is the most threatened through severe weather events.
The Top-20 counties are found (no more surprisingly) all in Arkansas (33°00′ N to 36°30′ N, 89°39′ W to 94°37′ W)
(stormdata%>%count(STATE, COUNTYNAME)%>%arrange(desc(n)))[1:20]
## Source: local data table [20 x 3]
## Groups:
##
## STATE COUNTYNAME n
## (fctr) (fctr) (int)
## 1 AK AKZ024 120
## 2 AK AKZ213 110
## 3 AK AKZ019 102
## 4 AK AKZ001 96
## 5 AK AKZ020 95
## 6 AK AKZ017 94
## 7 AK AKZ022 90
## 8 AK AKZ125 88
## 9 AK AKZ101 86
## 10 AK AKZ131 86
## 11 AK AKZ155 82
## 12 AK AKZ007 79
## 13 AK AKZ181 75
## 14 AK AKZ195 71
## 15 AK AKZ191 71
## 16 AK AKZ025 66
## 17 AK AKZ018 64
## 18 AK AKZ201 59
## 19 AK AKZ185 57
## 20 AK AKZ214 51
As indicators for impact of severe weather events on population health we can take such quantitive measures as FATALITIES and INJURIES. It is one more time the problem of scattered event types, but it deserves a try.
The Top-20 SWE by the total number of caused injuries (1950-2011)
injdata<- stormdata%>%group_by(EVTYPE)%>%summarize(Injured = sum(INJURIES))%>%arrange(desc(Injured))
#Total number of event types causing person injuries.
dim(injdata[injdata$Injured > 0])[1]
## [1] 158
print(as.data.frame(injdata[injdata$Injured > 0][1:20]))
## EVTYPE Injured
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
## 16 THUNDERSTORM WINDS 908
## 17 BLIZZARD 805
## 18 FOG 734
## 19 WILD/FOREST FIRE 545
## 20 DUST STORM 440
We see tornado at pole position as cause for person injuries. Taking into account the already addressed extreme scattering of the event types around thunderstorm, here is an attempt to accumulate the figures using the known pattern.
stormdata%>%filter(EVTYPE %like% "TSTM|THUND")%>%summarize(sum(INJURIES))
## sum(INJURIES)
## 1: 9545
Even after cumulative evaluation thunderstorms of all kinds would stay at the second place, by far away from tornados.
The Top-20 SWE by the total number of fatalities
injdata<- stormdata%>%group_by(EVTYPE)%>%summarize(fatalities = sum(FATALITIES))%>%arrange(desc(fatalities))
#Total number of event types causing person injuries.
dim(injdata[injdata$fatalities > 0])[1]
## [1] 168
print(as.data.frame(injdata[injdata$fatalities > 0][1:20]))
## EVTYPE fatalities
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
## 16 HEAVY SNOW 127
## 17 EXTREME COLD/WIND CHILL 125
## 18 STRONG WIND 103
## 19 BLIZZARD 101
## 20 HIGH SURF 101
Tornados stay at the top as a cause for fatalities just like it was for injuries, but excessive heat at the second place is some kind of surprise. It is even more surprising, as the relation of heat to death is mostly much less explicit than those of mechanical damages through tornado or drowning in a flood. And so, such “unspectacular” cause as heat might be still strong undervalued. as a fatal factor.
It is moreover the question, as heat is scattered into at least sixteen different event types.
unique(grep("heat",stormdata$EVTYPE, ignore.case = T, value =T))
## [1] "HEAT" "EXTREME HEAT"
## [3] "EXCESSIVE HEAT" "RECORD HEAT"
## [5] "HEAT WAVE" "DROUGHT/EXCESSIVE HEAT"
## [7] "RECORD HEAT WAVE" "RECORD/EXCESSIVE HEAT"
## [9] "HEAT WAVES" "HEAT WAVE DROUGHT"
## [11] "HEAT/DROUGHT" "HEAT DROUGHT"
## [13] "Heatburst" "Record Heat"
## [15] "Heat Wave" "EXCESSIVE HEAT/DROUGHT"
…and tornado comes in fifteen different avatars.
unique(grep("tornado",stormdata$EVTYPE, ignore.case = TRUE, value = T))
## [1] "TORNADO" "TORNADO F0"
## [3] "TORNADOS" "WATERSPOUT/TORNADO"
## [5] "WATERSPOUT TORNADO" "WATERSPOUT-TORNADO"
## [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"
## [9] "WATERSPOUT/ TORNADO" "TORNADO F3"
## [11] "TORNADO F1" "TORNADO/WATERSPOUT"
## [13] "TORNADO F2" "TORNADOES"
## [15] "TORNADO DEBRIS"
Cumulative assessment tornado vs. heat as a cause of mortality
Plots, even the boxplots with log-functions, are not the best way to visually compare mortality through heat vs. tornado. It is due to very low mean/median values and extreme outliers in both cases.
The five-number summary fits better.
fltheat <- grep("heat",stormdata$EVTYPE, ignore.case = T)
m1 <- cbind("HEAT",stormdata$FATALITIES[fltheat])
fltorn <- grep("tornado",stormdata$EVTYPE, ignore.case = T)
m2 <- cbind("TORNADO",stormdata$FATALITIES[fltorn])
dtab <- as.data.table(rbind(m1,m2))
#Heat
summary(as.integer(m1[,2]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.185 1.000 583.000
#Tornado
summary(as.integer(m2[,2]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09326 0.00000 158.00000
The ultimate evaluation of heat vs. tornado as cause of fatalities shows, that in both cases the mean number of fatalities is about zero with extreme outliers, corresponding to catastrophic events.
Here is the recalculated total number of fatalities summarized for main kinds of SWE.
dtab%>%group_by(V1)%>%summarize(FATALITIES = sum(as.numeric(V2)))%>%arrange()
## Source: local data table [0 x 0]
As we can see, tornado keeps the leadership, but summarizing the fatalities through heat shows essential gain of mortality through heat.
The only interpretable variables to estimate the economic consequences of severe weather events in the available data are PROPDMG (for property damage) and CROPDMG (for crop damage).
We are using at this point the same techniques as for estimation of the population health impact.
Top-20 SWE by property damage
dmgdata<- stormdata%>%group_by(EVTYPE)%>%summarize(PropDamage = sum(PROPDMG))%>%arrange(desc(PropDamage))
#Total number of event types causing person injuries.
dim(dmgdata[dmgdata$PropDamage > 0])[1]
## [1] 406
print(as.data.frame(dmgdata[dmgdata$PropDamage > 0][1:20]))
## EVTYPE PropDamage
## 1 TORNADO 3212258.16
## 2 FLASH FLOOD 1420124.59
## 3 TSTM WIND 1335965.61
## 4 FLOOD 899938.48
## 5 THUNDERSTORM WIND 876844.17
## 6 HAIL 688693.38
## 7 LIGHTNING 603351.78
## 8 THUNDERSTORM WINDS 446293.18
## 9 HIGH WIND 324731.56
## 10 WINTER STORM 132720.59
## 11 HEAVY SNOW 122251.99
## 12 WILDFIRE 84459.34
## 13 ICE STORM 66000.67
## 14 STRONG WIND 62993.81
## 15 HIGH WINDS 55625.00
## 16 HEAVY RAIN 50842.14
## 17 TROPICAL STORM 48423.68
## 18 WILD/FOREST FIRE 39344.95
## 19 FLASH FLOODING 28497.15
## 20 URBAN/SML STREAM FLD 26051.94
With respect to property damage the usual suspects are here, like tornado and thunderstorm, kinds of windy weather, hail, lightning and flood. As the further grouping is difficult for known reasons, we’ll try to to apply this, probably not perfect classification to evaluate the cumulative structure of property damage through the top placed SWE.
fltornado <- grepl("tornado",stormdata$EVTYPE, ignore.case = TRUE)
mtornado <- cbind("TORNADO",stormdata$PROPDMG[fltornado])
flthund <- grepl("tstm|thund",stormdata$EVTYPE, ignore.case = TRUE)
mthund <- cbind("THUNDERSTORM",stormdata$PROPDMG[flthund])
flwind <- grepl("wind",stormdata$EVTYPE, ignore.case = TRUE)
mwind <- cbind("WIND",stormdata$PROPDMG[flwind])
flhail <- grepl("hail",stormdata$EVTYPE, ignore.case = T)
mhail <- cbind("HAIL",stormdata$PROPDMG[flhail])
fltlight <- grepl("lightning",stormdata$EVTYPE, ignore.case = TRUE)
mlihtning <- cbind("LIGHTNING",stormdata$PROPDMG[fltlight])
fltflood <- grepl("flood",stormdata$EVTYPE, ignore.case = TRUE)
mflood <- cbind("FLOOD",stormdata$PROPDMG[fltflood])
dtab <- as.data.table(rbind(mtornado, mthund, mwind, mhail, mlihtning, mflood))
dim(dtab)
## [1] 1151331 2
#Redundancy through partially overlapping event type names,%
dim(dtab)[1]/sum(grepl("tornado|tstm|thund|wind|hail|lightning|flood",
stormdata$EVTYPE, fixed = FALSE, ignore.case = TRUE))
## [1] 1.415363
The result of selection is redundant through partially overlapping event type names like “TORNADOES, TSTM WIND, HAIL” or “TSTM WIND”. The data redundancy in the selection due to it is about 40%. The number of rows in the selection is 20% higher, then the number of original observations.
An attempt to reduce the impact of historical dimension (evolution of definitions) brings no breakthrough - the redundancy of selection for the period 2000-2011 stays over 39%.
stormdata2k <- stormdata%>%filter(BGN_DATE >= as.Date("2000-01-01","%Y-%m-%d"))
fltornado <- grepl("tornado",stormdata2k$EVTYPE, ignore.case = TRUE)
mtornado <- cbind("TORNADO",stormdata2k$PROPDMG[fltornado])
flthund <- grepl("tstm|thund",stormdata2k$EVTYPE, fixed = FALSE, ignore.case = TRUE)
mthund <- cbind("THUNDERSTORM",stormdata2k$PROPDMG[flthund])
flwind <- grepl("wind",stormdata2k$EVTYPE, ignore.case = TRUE)
mwind <- cbind("WIND",stormdata2k$PROPDMG[flwind])
flhail <- grepl("hail",stormdata2k$EVTYPE, ignore.case = T)
mhail <- cbind("HAIL",stormdata2k$PROPDMG[flhail])
fltlight <- grepl("lightning",stormdata2k$EVTYPE, ignore.case = TRUE)
mlihtning <- cbind("LIGHTNING",stormdata2k$PROPDMG[fltlight])
fltflood <- grepl("flood",stormdata2k$EVTYPE, ignore.case = TRUE)
mflood <- cbind("FLOOD",stormdata2k$PROPDMG[fltflood])
dtab2k <- as.data.table(rbind(mtornado, mthund, mwind, mhail, mlihtning, mflood))
dim(dtab2k)
## [1] 634896 2
#Redundancy through partially overlapping event type names,% (2000-2011)
dim(dtab2k)[1]/sum(grepl("tornado|tstm|thund|wind|hail|lightning|flood",
stormdata2k$EVTYPE, ignore.case = TRUE))
## [1] 1.393434
At this point we see no possibility for further data aggregation without the professional input concerning the way to do it.
Top-20 SWE with respect to the crop damage
dmgdata<- stormdata%>%group_by(EVTYPE)%>%summarize(CropDamage = sum(CROPDMG))%>%arrange(desc(CropDamage))
#Total number of event types causing person injuries.
dim(dmgdata[dmgdata$CropDamage > 0])[1]
## [1] 136
print(as.data.frame(dmgdata[dmgdata$CropDamage > 0][1:20]))
## EVTYPE CropDamage
## 1 HAIL 579596.28
## 2 FLASH FLOOD 179200.46
## 3 FLOOD 168037.88
## 4 TSTM WIND 109202.60
## 5 TORNADO 100018.52
## 6 THUNDERSTORM WIND 66791.45
## 7 DROUGHT 33898.62
## 8 THUNDERSTORM WINDS 18684.93
## 9 HIGH WIND 17283.21
## 10 HEAVY RAIN 11122.80
## 11 FROST/FREEZE 7034.14
## 12 EXTREME COLD 6121.14
## 13 TROPICAL STORM 5899.12
## 14 HURRICANE 5339.31
## 15 FLASH FLOODING 5126.05
## 16 HURRICANE/TYPHOON 4798.48
## 17 WILDFIRE 4364.20
## 18 TSTM WIND/HAIL 4356.65
## 19 WILD/FOREST FIRE 4189.54
## 20 LIGHTNING 3580.61
For crop damage it is a hail leading by far, with flash flood and flood at following positions.
For the reasons addressed before we resign at this moment the further aggregation of data.
Following states are especially threatened through the SWE with respect to the damage (data for the period 2000-2011).
dmgstat <- stormdata2k%>%group_by(STATE)%>%summarize(Injuries = sum(INJURIES),Fatalities = sum(FATALITIES),PropDamage = sum(PROPDMG),CropDamage = sum(CROPDMG))
injdmg <- arrange(dmgstat, desc(Injuries))
fataldmg <- arrange(dmgstat, desc(Fatalities))
propdmg <- arrange(dmgstat, desc(PropDamage))
cropdmg <- arrange(dmgstat, desc(CropDamage))
Top-10 respective injuries
States sorted descendent by the total number of injuries.
select(injdmg, STATE, Injuries)[1:10]
## Source: local data table [10 x 2]
##
## STATE Injuries
## (fctr) (dbl)
## 1 MO 5057
## 2 AL 3206
## 3 TX 2192
## 4 CA 2050
## 5 TN 1863
## 6 FL 1845
## 7 OK 1543
## 8 GA 1226
## 9 MS 1032
## 10 NC 887
Top-10 states respective fatalities
States sorted descendent by the total number of fatalities.
select(fataldmg, STATE, Fatalities)[1:10]
## Source: local data table [10 x 2]
##
## STATE Fatalities
## (fctr) (dbl)
## 1 TX 501
## 2 MO 411
## 3 FL 396
## 4 AL 369
## 5 CA 355
## 6 IL 353
## 7 PA 285
## 8 TN 282
## 9 NY 171
## 10 NC 164
Summarized ranking for injuries and fatalities
Here is the summarized ranking for personal damage by state, as a sum of both rankings. The placement on top means more risk of personal damage.
injstates <- as.vector(injdmg$STATE)
fatalstates <- as.vector(fataldmg$STATE)
rank <- sapply(injstates, function(x) which(injstates == x, arr.ind = TRUE) + which(fatalstates == x, arr.ind = TRUE))
rnames <- names(rank)
rank <- data.table(cbind(rnames, rank))
print(rank%>%arrange(as.numeric(rank)))
## rnames rank
## 1: MO 3
## 2: TX 4
## 3: AL 6
## 4: CA 9
## 5: FL 9
## 6: TN 13
## 7: IL 19
## 8: NC 20
## 9: MS 21
## 10: OK 22
## 11: GA 22
## 12: AR 22
## 13: PA 22
## 14: IN 30
## 15: NY 31
## 16: MD 36
## 17: KS 38
## 18: MI 39
## 19: VA 40
## 20: KY 43
## 21: AZ 43
## 22: CO 46
## 23: OH 46
## 24: LA 46
## 25: NJ 51
## 26: UT 51
## 27: IA 53
## 28: SC 54
## 29: GU 59
## 30: WI 59
## 31: MN 62
## 32: MA 64
## 33: WA 65
## 34: NV 68
## 35: WY 72
## 36: AS 73
## 37: ID 78
## 38: NE 79
## 39: DE 81
## 40: MT 81
## 41: PR 81
## 42: CT 82
## 43: AK 83
## 44: NM 84
## 45: WV 84
## 46: OR 88
## 47: ND 90
## 48: ME 93
## 49: SD 95
## 50: HI 95
## 51: NH 97
## 52: DC 107
## 53: AM 107
## 54: VT 107
## 55: AN 108
## 56: RI 112
## 57: VI 114
## 58: PZ 115
## 59: LM 118
## 60: PH 121
## 61: LC 123
## 62: GM 123
## 63: LH 127
## 64: LS 127
## 65: LE 129
## 66: SL 132
## 67: LO 134
## 68: PM 136
## 69: PK 138
## 70: XX 140
## rnames rank
This ranking should sensitize the local authorities preparing the emergency medical services for natural disasters.
Here is the analogue estimation for property and crop damages.
The Top-10 states respective property damage (2000-2011)
select(propdmg, STATE, PropDamage)[1:10]
## Source: local data table [10 x 2]
##
## STATE PropDamage
## (fctr) (dbl)
## 1 TX 490910.6
## 2 IA 369875.3
## 3 MS 318315.5
## 4 OH 313869.5
## 5 GA 312846.9
## 6 AR 218050.4
## 7 IL 217247.2
## 8 AL 201949.7
## 9 PA 189069.5
## 10 NY 187893.7
The Top-10 states respective crop damage (2000-2011)
select(cropdmg, STATE, CropDamage)[1:10]
## Source: local data table [10 x 2]
##
## STATE CropDamage
## (fctr) (dbl)
## 1 NE 196201.10
## 2 TX 104436.51
## 3 IA 95231.05
## 4 KS 83595.02
## 5 WI 55866.14
## 6 MS 55622.79
## 7 ND 37129.00
## 8 MN 35195.20
## 9 AR 25592.01
## 10 LA 22632.95
Summarized ranking by economical damage by state
The places in property and crop damage rankings are added, the place on top means more economical damage.
pdmgstates <- as.vector(propdmg$STATE)
cdmgstates <- as.vector(cropdmg$STATE)
rank <- sapply(injstates, function(x) which(cdmgstates == x, arr.ind = TRUE) + which(pdmgstates == x, arr.ind = TRUE))
rnames <- names(rank)
rank <- data.table(cbind(rnames, rank))
print(rank%>%arrange(as.numeric(rank)))
## rnames rank
## 1: TX 3
## 2: IA 5
## 3: MS 9
## 4: AR 15
## 5: KS 15
## 6: NE 15
## 7: OH 16
## 8: WI 21
## 9: IL 24
## 10: NY 24
## 11: LA 25
## 12: GA 28
## 13: MO 29
## 14: TN 30
## 15: CA 32
## 16: MN 32
## 17: ND 32
## 18: NC 33
## 19: MI 36
## 20: AL 38
## 21: OK 40
## 22: IN 40
## 23: PA 41
## 24: FL 44
## 25: KY 44
## 26: SD 49
## 27: VA 50
## 28: WA 57
## 29: MT 65
## 30: VT 66
## 31: CO 68
## 32: NM 68
## 33: SC 70
## 34: MD 71
## 35: MA 73
## 36: NJ 73
## 37: WV 74
## 38: AZ 75
## 39: WY 76
## 40: PR 77
## 41: GU 78
## 42: UT 79
## 43: ID 84
## 44: NH 87
## 45: ME 90
## 46: AK 90
## 47: OR 91
## 48: DE 95
## 49: HI 95
## 50: CT 98
## 51: NV 98
## 52: AS 100
## 53: AM 104
## 54: VI 105
## 55: RI 107
## 56: DC 109
## 57: GM 117
## 58: LM 119
## 59: AN 121
## 60: PZ 121
## 61: LC 123
## 62: LS 124
## 63: PH 125
## 64: LE 128
## 65: LO 129
## 66: LH 130
## 67: SL 131
## 68: PK 132
## 69: PM 137
## 70: XX 140
## rnames rank