Most dangerouse catastrophe events: severity assessment

Synopsis

In this study, author uses the exploratory analysis of the data on Natural disasters to assess the most dangerous event types.

The definition of “most dangerous”, naturally, could not be strict and allows for multiple interpretations, depending on the measure suitable for particular application. For instance, mosquitos are the cause of extreme number of deaths, but one particular insect can be harmless, while King Cobra is comparably rare killer, despite having 100% letal bite. Therefore, we study 4 measures: mortality/injury numbers caused and mortality/injury rate, comparing to the number of events registered.

Analysis in the first chapter itself is really simple: we aggregate the data by event type (with some linguistic transformations), calculate total numbers & injuries/fatalities rates and avarage loss per event for property damage level assessments.

In the second chapter, we derive the most risky events for each of the states, given their historical data.

To count for correlation with ther parameters, we apply classification based on “Rpart” R package.

Resulting conclusion is that in any reasonable sence, TORNADO is extremely dangerous event.

To give some stability, trends are studies as well.

Data Processing

Firstly, we get the data on Natural disasters and load it to the environment:

require(knitr)

## Loading required package: knitr

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","_temp.csv.bz2")
dtstorm_raw<-read.csv("_temp.csv.bz2")
#saveRDS(dtstorm_raw,file = "stats.rds")
#dtstorm_raw<-readRDS("stats.rds")

Firstly, to use properly the date attributes of the records, we need to convert the relevant variables to standard POSIX time format.

dtstorm<-dtstorm_raw
dtstorm$BGN_DATE<-as.POSIXct(strptime(dtstorm$BGN_DATE,
                                      format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
dtstorm$END_DATE<-as.POSIXct(strptime(dtstorm$END_DATE,
                                      format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
dtstorm$PROPDMG_mult<-sapply(as.character(dtstorm$PROPDMGEXP),function(i){
  ifelse(i=='K',1000,ifelse(i=='m',1000000,ifelse(i=='M',1000000,ifelse(i=='B',1000000000,1))))
})
dtstorm$PROPDMG<-dtstorm$PROPDMG_mult*dtstorm$PROPDMG/1000000

As one note, the EVTYPE is clearly unstructured data variable: too many levels for the factors and too few data records for the very vast majority of it. In order to proceed with statistical inference, we firstly need to adjust the data by “reasonable linguistic processing”, for 10 top event terms mantioned in [EVTYPE], we search for the closest ones (in terms of certain distance measure between string varibles, based on adist() standard function) and replace them with the single term, e.g. “THUNDERSTORM WIND” and “THUNDERSTORM WINDS”, that will somewhat automate the grouping. Besides, there are two levels, which apparently have the same meaning, but are distant in terms of strings similarity: “TSTM WIND” and “THUNDERSTORM WIND”, you may refer to almost any acronyms dictionary, and it’s clearly not “too small to measure”, when occured along with an assessment of fatalities numbers.

Considering the above, we construct another variable “event” which will allow us to treat “differences in spelling” and “acronyms” properly.

freq_type<-aggregate(INJURIES~EVTYPE,data=dtstorm,FUN=sum)
freq_type<-freq_type[order(-freq_type$INJURIES),]#most common terms

dtstorm$event<-as.character(dtstorm$EVTYPE)
dtstorm$event[which(dtstorm$event=="TSTM WIND")]<-"THUNDERSTORM WIND"
for(i in 1:10){
  term2use<-as.character(freq_type[i,1])
  #pick certain frequent term, like "FLOOD"
  if(term2use%in%levels(as.factor(dtstorm$event))){
    #check whether this term wasn't replaced in from the data previously
    dists<<-sapply(as.character(freq_type$EVTYPE),function(term)
      adist(term2use,term,partial = T)
      #calculate the distances in terms of string similarity
      /min(nchar(term2use),nchar(as.character(term)))
    )
    terms2replace=as.character(freq_type$EVTYPE)[which(dists<0.1)]
    #check the replace candidates
    print(paste(term2use,":",paste(terms2replace[1:3],collapse="; "),"..."),sep="")
    ids<-which(as.character(dtstorm$EVTYPE)%in%terms2replace)
    dtstorm$event[ids]<-term2use
    #replace the occurances of event term variations with a single one
  }
}

## [1] "TORNADO : TORNADO; WATERSPOUT/TORNADO; TORNADO F2 ..."
## [1] "FLOOD : FLOOD; FLASH FLOOD; FLOOD/FLASH FLOOD ..."
## [1] "EXCESSIVE HEAT : EXCESSIVE HEAT; DROUGHT/EXCESSIVE HEAT; EXCESSIVE HEAT/DROUGHT ..."
## [1] "LIGHTNING : LIGHTNING; LIGHTNING AND THUNDERSTORM WIN; LIGHTNING INJURY ..."
## [1] "HEAT : EXCESSIVE HEAT; HEAT; HEAT WAVE ..."
## [1] "ICE STORM : ICE STORM; GLAZE/ICE STORM; ICE STORM/FLASH FLOOD ..."
## [1] "THUNDERSTORM WIND : THUNDERSTORM WIND; THUNDERSTORM WINDS; MARINE THUNDERSTORM WIND ..."
## [1] "HAIL : HAIL; TSTM WIND/HAIL; SMALL HAIL ..."

Chapter 1: Selecting the measures: frequencies vs totals, US combined data.

Let’s start from direct, “plain” assessments:

data_analysis_all<-dtstorm
freq_inj<-aggregate(INJURIES~event,data=data_analysis_all,FUN=sum)
freq_fat<-aggregate(FATALITIES~event,data=data_analysis_all,FUN=sum)
freq_prop<-aggregate(PROPDMG~event,data=data_analysis_all,FUN=sum)
freq_cnt<-aggregate(INJURIES~event,data=data_analysis_all,FUN=length)
freq_all<-merge(merge(merge(freq_inj,
                            freq_fat,by='event'),
                            freq_prop,by='event'),freq_cnt,by='event')
colnames(freq_all)<-c("event","INJURIES","FATALITIES","PROP","COUNT")

Eventually, we can describe the severity of different event types in terms of “TOPxx”’s. As we will see, the result significantly depends upon selected “measure”. Data for 1950-2011yy.

top5_inj_tot<-freq_all[order(-freq_all$INJURIES)[1:5],c('event','INJURIES')]
freq_all$inj_freq<-round(freq_all$INJURIES/freq_all$COUNT)
top5_inj_freq<-freq_all[order(-freq_all$inj_freq)[1:5],c('event','inj_freq')]

top5_fat_tot<-freq_all[order(-freq_all$FATALITIES)[1:5],c('event','FATALITIES')]
freq_all$fat_freq<-round(freq_all$FATALITIES/freq_all$COUNT)
top5_fat_freq<-freq_all[order(-freq_all$fat_freq)[1:5],c('event','fat_freq')]

top5_prop_tot<-freq_all[order(-freq_all$PROP)[1:5],c('event','PROP')]
freq_all$prop_norm<-round(freq_all$PROP/freq_all$COUNT)
top5_prop_norm<-freq_all[order(-freq_all$prop_norm)[1:5],c('event','prop_norm')]

colnames(top5_inj_tot)<-c("event","injuries")
top5_inj_tot

##                 event injuries
## 613           TORNADO    91407
## 605 THUNDERSTORM WIND     9396
## 194              HEAT     9154
## 136             FLOOD     8601
## 357         LIGHTNING     5231

colnames(top5_inj_freq)<-c("event","injuries per 1 event")
top5_inj_freq

##                     event injuries per 1 event
## 195             Heat Wave                   70
## 621 TROPICAL STORM GORDON                   43
## 704            WILD FIRES                   38
## 607         THUNDERSTORMW                   27
## 269    HIGH WIND AND SEAS                   20

colnames(top5_fat_tot)<-c("event","fatalities")
top5_fat_tot

##                 event fatalities
## 613           TORNADO       5636
## 194              HEAT       3138
## 136             FLOOD       1523
## 357         LIGHTNING        817
## 605 THUNDERSTORM WIND        714

colnames(top5_fat_freq)<-c("event","fatalities per 1 event")
top5_fat_freq

##                     event fatalities per 1 event
## 59          COLD AND SNOW                     14
## 621 TROPICAL STORM GORDON                      8
## 276        HIGH WIND/SEAS                      4
## 365         MARINE MISHAP                      4
## 244   Heavy surf and wind                      3

colnames(top5_prop_tot)<-c("event","property damage, mln.USD")
top5_prop_tot

##                 event property damage, mln.USD
## 136             FLOOD                167379.12
## 311 HURRICANE/TYPHOON                 69305.84
## 613           TORNADO                 56993.10
## 524       STORM SURGE                 43323.54
## 190              HAIL                 17619.99

colnames(top5_prop_norm)<-c("event","property damage per 1 event, mln.USD")
top5_prop_norm

##                         event property damage per 1 event, mln.USD
## 208 HEAVY RAIN/SEVERE WEATHER                                 1250
## 311         HURRICANE/TYPHOON                                  788
## 308            HURRICANE OPAL                                  353
## 524               STORM SURGE                                  166
## 704                WILD FIRES                                  156

As one could note, “frequencies” appear to have poor performance in task of risk level assessment: very few event, bearing for “unique” event definition bring the highest risk-to-count ratio. Therefore, proper metric using “frequencies” approach could be developed only using Value-At-Risk stochastic approach, or proper pre-grouping of the events, which would require additional data (like expert grouping, average temperature/pressure/weather conditions etc.).

In any case, we will state that the main approach for determining the most dangerous events would be just the plain total numbers.

colnames(top5_inj_tot)<-c("event","estimate")
top5_inj_tot$label<-'MHE: total injuries'
colnames(top5_fat_tot)<-c("event","estimate")
top5_fat_tot$label<-'MHE: total fatalities'
colnames(top5_prop_tot)<-c("event","estimate")
top5_prop_tot$label<-'EC: property damage'
top5_events<-rbind(top5_inj_tot,top5_fat_tot,top5_prop_tot)
require(ggplot2)

## Loading required package: ggplot2

top5_events$ord<-order(top5_events$label,-top5_events$estimate)
top5_events$ord<-rep(1:15)
ggplot(top5_events,aes(x=ord, y=estimate))+
  geom_bar(stat="identity")+ 
  ggtitle("Most harmful events (MHE) and economic consequences (EC): \n1950+ data by event types")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+xlab("event type (EVTYPE grouped)")+
  scale_x_continuous(breaks=1:15,labels=c(top5_events$event))+
  facet_wrap(~label, scales="free",ncol = 3)

Eventually, this explains why we chose the very simple approach: we examine only aggregated sums with distribution respect to each of the state.

Chapter 2: Across-the-states analysis

As seen before, most simple measure like “total injuries / total property damage” to be used as a proxy for answers:

will help in answering two basic questions of this study:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

the following code provides for derivation for each state TOP5 event types with most injuries and most property damage caused.

inj_tot_states<-aggregate(INJURIES~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
inj_tot_states<-subset(inj_tot_states,INJURIES>0)
inj_tot_states<-inj_tot_states[order(inj_tot_states$STATE,-inj_tot_states$INJURIES),]
inj_tot_states$ord<-1:nrow(inj_tot_states)
inj_tot_states$min_ord<-sapply(as.character(inj_tot_states$STATE),function(st){min(
  subset(inj_tot_states,STATE==st)$ord)})
inj_tot_states$rank<-inj_tot_states$ord-inj_tot_states$min_ord+1
top5_inj_states<-inj_tot_states[which(inj_tot_states$rank<=5),
                                c("STATE","event","rank","INJURIES")]

prop_tot_states<-aggregate(PROPDMG~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
prop_tot_states<-subset(prop_tot_states,PROPDMG>0)
prop_tot_states<-prop_tot_states[order(prop_tot_states$STATE,-prop_tot_states$PROPDMG),]
prop_tot_states$ord<-1:nrow(prop_tot_states)
prop_tot_states$min_ord<-sapply(as.character(prop_tot_states$STATE),function(st){min(
  subset(prop_tot_states,STATE==st)$ord)})
prop_tot_states$rank<-prop_tot_states$ord-prop_tot_states$min_ord+1
top5_prop_states<-prop_tot_states[which(prop_tot_states$rank<=5),
                                  c("STATE","event","rank","PROPDMG")]

The resulting lists could be found in the Results section. Let’s plot the map using maps package.

top1_prop_states<-prop_tot_states[which(prop_tot_states$rank==1),
                                  c("STATE","event","rank","PROPDMG")]
top1_prop_states$event<-factor(top1_prop_states$event)
top1_prop_states$statelabel<-(sapply(top1_prop_states$STATE,function(st)
  {tolower(state.name[which(state.abb==st)])}))
top1_prop_states$len<-sapply(top1_prop_states$statelabel,length)
top1_prop_states<-subset(top1_prop_states,len>0)#do not plot regions which do not
colours<-c("springgreen4","darkslategray","skyblue","skyblue3","aliceblue",
           "midnightblue","deepskyblue4","mediumpurple4","red4")
top1_prop_states$event<-factor(top1_prop_states$event)
leg.txt<-levels(top1_prop_states$event)
#appear in the map
require(maps)
top1_prop_states$clr<-colours[as.numeric(top1_prop_states$event)]

namevec <- map(database = "state", col = top1_prop_states$clr,fill=T, namesonly=TRUE,plot=F)
namevec<-sapply(namevec,function(nm){
  if(grepl(pattern = ":",nm)>0){
    strsplit(nm,":")[[1]][1]
  }else{nm}
})

clr<-sapply(namevec,function(st){
  if(st %in% levels(factor(as.character(top1_prop_states$statelabel)))){
    top1_prop_states$clr[which(as.character(top1_prop_states$statelabel)==st)]}else{
    "white"
  }})
map(database = "state", col = clr,fill=T)
title("Events with greatest economic consequences, 1950+")
legend("bottomleft", leg.txt, horiz = F,fill=colours,cex = 0.7)

Chapter 3: Dynamics

Now, to assess the present level of risk, for instance, for injuries, let us study the dynamics of TOP 5 event types with most numbers of injured.

require(lubridate)
require(ggplot2)
dyn_data<-aggregate(INJURIES~year(BGN_DATE)+event,data=
                      subset(data_analysis_all,year(BGN_DATE)>1970 & 
                               event%in%top5_inj_tot$event),FUN=sum,na.rm=T)
colnames(dyn_data)<-c("year","event","injuries")
ggplot(dyn_data,aes(x=year,y=injuries,group=event,col=event))+
  geom_point()+geom_line()

Therefore, we can make the following conclusions:

in further studies timeframe should be more acurately defined
injuries from flood are really rare except for a single catastrophe, so it’s better removed from data used for forecasting/predictions as a potential outlier
general declining dynamics of injuries from TORNADO’s, but the latest disaster seems to cross these improvements
there are some patterns (e.g. cycles) which could be revealed and leading to more preceise detecation of risks for a particular time period and state

Results

As seen from the map above, most dangeroes events are connected with winds: in central US, most injuries and property damage is caused by TORNADOs, the variations like “Thunderstorm wind” or “HIGH WIND” are frequent to central-western or north-eastern states. FLOODs cause economic damages (but less risky for injuries/fatalities) mostly at north-eastern states. There is a single state (Delaware) where lightning is the most dangerous event type.

The Most dangerous for peoples’ health and property events for each state (HH=‘health harmful’,EC=‘economic consequences’):

require(knitr)
ds1<-subset(top5_inj_states,rank==1,select=c("STATE",'event','INJURIES'))
colnames(ds1)<-c("state","most HH","#injuries")
ds2<-subset(top5_prop_states,rank==1,select=c("STATE",'event','PROPDMG'))
colnames(ds2)<-c("state","greatest EC","property damage, '000$")
kable(merge(ds1,ds2,by='state'),
      caption="Most dangeroues events by state, 1950+ data",
      row.names = F,escape = "\\")

Most dangeroues events by state, 1950+ data
state	most HH	#injuries	greatest EC	property damage, ’000$
AK	ICE STORM	34	FLOOD	193.0250
AL	TORNADO	7929	TORNADO	6321.2976
AM	THUNDERSTORM WIND	22	WATERSPOUT	5.1020
AN	MARINE STRONG WIND	18	THUNDERSTORM WIND	0.1690
AR	TORNADO	5116	TORNADO	2590.0073
AS	TSUNAMI	129	TSUNAMI	81.0000
AZ	THUNDERSTORM WIND	186	HAIL	2829.2712
CA	WILDFIRE	623	FLOOD	117117.4110
CO	TORNADO	261	HAIL	1424.0448
CT	TORNADO	703	TORNADO	596.2366
DC	HEAT	316	TROPICAL STORM	127.6000
DE	TORNADO	73	FLOOD	71.4095
FL	TORNADO	3344	HURRICANE/TYPHOON	27596.8650
GA	TORNADO	3926	TORNADO	3261.0267
GU	HURRICANE/TYPHOON	333	TYPHOON	600.2300
HI	STRONG WIND	20	FLOOD	162.8431
IA	TORNADO	2208	TORNADO	2286.5763
ID	THUNDERSTORM WIND	152	FLOOD	129.9740
IL	TORNADO	4145	FLOOD	6047.1340
IN	TORNADO	4224	TORNADO	2594.7939
KS	TORNADO	2721	TORNADO	2669.8907
KY	TORNADO	2806	TORNADO	888.7687
LA	TORNADO	2676	STORM SURGE	31742.7350
LM	MARINE STRONG WIND	1	MARINE TSTM WIND	1.2050
MA	TORNADO	1758	TORNADO	756.0391
MD	HEAT	545	TROPICAL STORM	538.5050
ME	LIGHTNING	70	ICE STORM	318.2300
MH	HIGH SURF	1	HIGH SURF	5.0000
MI	TORNADO	3362	TORNADO	1071.7656
MN	TORNADO	1976	TORNADO	1903.7011
MO	TORNADO	4330	TORNADO	4800.7017
MS	TORNADO	6246	HURRICANE/TYPHOON	13492.7350
MT	WILD/FOREST FIRE	33	HAIL	94.9239
NC	TORNADO	2548	HURRICANE	4979.8610
ND	TORNADO	326	FLOOD	4071.4875
NE	TORNADO	1158	TORNADO	1718.1647
NH	LIGHTNING	85	FLOOD	105.9158
NJ	HEAT	304	FLOOD	2890.0900
NM	TORNADO	155	WILD/FOREST FIRE	1509.7000
NV	FLOOD	63	FLOOD	739.2661
NY	THUNDERSTORM WIND	344	FLOOD	3207.7735
OH	TORNADO	4442	TORNADO	2283.1578
OK	TORNADO	4829	TORNADO	3268.7082
OR	HIGH WIND	50	FLOOD	741.8625
PA	TORNADO	1241	FLOOD	2522.9480
PR	HEAVY RAIN	10	HURRICANE	1824.4310
PZ	MARINE STRONG WIND	3	MARINE STRONG WIND	0.0760
RI	TORNADO	23	FLOOD	93.5410
SC	TORNADO	1314	TORNADO	531.7452
SD	TORNADO	452	TORNADO	231.2138
TN	TORNADO	4748	FLOOD	4764.7492
TX	TORNADO	8207	TROPICAL STORM	5491.1930
UT	WINTER STORM	415	FLOOD	371.1125
VA	TORNADO	914	HURRICANE/TYPHOON	512.0000
VI	LIGHTNING	1	HURRICANE	28.2200
VT	THUNDERSTORM WIND	34	FLOOD	1428.5735
WA	TORNADO	303	FLOOD	234.0150
WI	TORNADO	1601	FLOOD	986.1774
WV	THUNDERSTORM WIND	151	FLOOD	769.8981
WY	WINTER STORM	119	HAIL	111.2217

Appendix 1: Some math+stats samples

Now, let us add a covariate (please note, that more comprehensive analysis would not fit the frame of current study, so this is just a pointer towards more detailed reasearch direction) which could interfere with the consequences of the catastrophe events under investigation: let’s construct “time of the year”, “time of the day” and add “latitude”/“longitude”" to track the souther/northern and eastern/western environment features.

To ease the computations, choose the subsample of 50k records.

set.seed(145)#in order to speed-up the calculations
data_analysis<-data_analysis_all[sample(1:nrow(data_analysis_all),50000),]
data_analysis$duration<-data_analysis$END_DATE-data_analysis$BGN_DATE

timeofyear<-function(dt){
  nmonth<-as.numeric(format(dt,"%m"))
  if(nmonth%in%c(12,1,2))return('win')
  if(nmonth%in%c(3,4,5))return('spr')
  if(nmonth%in%c(6,7,8))return('sum')
  if(nmonth%in%c(9,10,11))return('aut')
}

hh<-function(tm){
  hh1<-as.numeric(substr(as.character(tm),1,2))
  ifelse((grepl("AM",tm)>0) & (hh1==12), 0,
         #12:00 AM->return 0
         ifelse((grepl("PM",tm)>0) & (hh1==12), 12,
                #12:00 PM return 12
                (grepl("PM",tm)>0)*12+hh1))
  #there occur all the possible cases: 00:00 AM, 12:00 AM, 12:00 PM,...
}

data_analysis$timeofyear<-sapply(data_analysis$BGN_DATE,timeofyear)
data_analysis$year<-sapply(data_analysis$BGN_DATE,year)
data_analysis$hour<-sapply(data_analysis$BGN_DATE,hh)


data_analysis<-data_analysis[,c("event","EVTYPE",
                                #primary variables: new & old event type definition
                                 "year",
                                #time variable
                              "timeofyear","hour","duration",'MAG','F','LENGTH','WIDTH',
                              #covariates & attribs
                              'PROPDMG','INJURIES','FATALITIES'
                              #responce = event consequences
                              )]

As for inclusion of variables in the analysis, the motivation is as follows:

STATE/STATE__,COUNTY/COUNTYNAME/BGN_LOCATI/COUNTY_END,COUNTYENDN: could not be accurately studied at this stage, proper clustering analysis would be needed, too many levels/too few data in cells on average
BGN_DATE - included in the “time of the year” and year for dynamics studies
BGN_TIME: too hard to read (“01:30:13 PM” vs “0130”), only “hour” will be used
END_TIME: excluded at this stage
END_DATE: included at “duration” variable
BGN_RANGE/BGN_AZI/END_RANGE/END_AZI/END_LOCATI: too few data, omitted
LENGTH/WIDTH/MAG: included
INJURIES/FATALITIES/PROPDMG - included as a responce variables
CROPDMG: excluded at this stage,
PROPDMG/CROPDMG - not included since data needs significant upgrade: too many levels

Please note that we didn’t count for timezones in order not to complicate: time would be meaningful as recorded and date/time of the year/year shifts would be negligable.

The above mentioned notes are implemented as follows:

require(rpart)
fit<-rpart(data=data_analysis,INJURIES~event+duration+timeofyear+hour+duration+MAG+LENGTH+WIDTH)

Therefore, the tree structure of identification of the most risky events is as follows:

fit

## n= 50000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 50000 352576.600  0.12086000  
##     2) LENGTH< 9.35 49663 158440.900  0.07361617 *
##     3) LENGTH>=9.35 337 177689.700  7.08308600  
##       6) WIDTH< 366.5 242  30811.040  2.76859500 *
##       7) WIDTH>=366.5 95 130898.500 18.07368000  
##        14) timeofyear=aut,sum 30   3124.167  6.16666700 *
##        15) timeofyear=spr,win 65 121557.900 23.56923000  
##          30) hour>=19.5 26  17908.460 11.53846000 *
##          31) hour< 19.5 39  97377.440 31.58974000  
##            62) WIDTH< 470 19  26141.680 19.26316000 *
##            63) WIDTH>=470 20  65606.200 43.30000000  
##             126) WIDTH>=750 12  30596.920 23.58333000 *
##             127) WIDTH< 750 8  23346.880 72.87500000 *