Synopsis

In this study, author uses the exploratory analysis of the data on Natural disasters to assess the most dangerous event types.

The definition of “most dangerous”, naturally, could not be strict and allows for multiple interpretations, depending on the measure suitable for particular application. For instance, mosquitos are the cause of extreme number of deaths, but one particular insect can be harmless, while King Cobra is comparably rare killer, despite having 100% letal bite. Therefore, we study 4 measures: mortality/injury numbers caused and mortality/injury rate, comparing to the number of events registered.

Analysis in the first chapter itself is really simple: we aggregate the data by event type (with some linguistic transformations), calculate total numbers & injuries/fatalities rates and avarage loss per event for property damage level assessments.

In the second chapter, we derive the most risky events for each of the states, given their historical data.

To count for correlation with ther parameters, we apply classification based on “Rpart” R package.

Resulting conclusion is that in any reasonable sence, TORNADO is extremely dangerous event.

To give some stability, trends are studies as well.

Data Processing

Firstly, we get the data on Natural disasters and load it to the environment:

require(knitr)
## Loading required package: knitr
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","_temp.csv.bz2")
dtstorm_raw<-read.csv("_temp.csv.bz2")
#saveRDS(dtstorm_raw,file = "stats.rds")
#dtstorm_raw<-readRDS("stats.rds")

Firstly, to use properly the date attributes of the records, we need to convert the relevant variables to standard POSIX time format.

dtstorm<-dtstorm_raw
dtstorm$BGN_DATE<-as.POSIXct(strptime(dtstorm$BGN_DATE,
                                      format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
dtstorm$END_DATE<-as.POSIXct(strptime(dtstorm$END_DATE,
                                      format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
dtstorm$PROPDMG_mult<-sapply(as.character(dtstorm$PROPDMGEXP),function(i){
  ifelse(i=='K',1000,ifelse(i=='m',1000000,ifelse(i=='M',1000000,ifelse(i=='B',1000000000,1))))
})
dtstorm$PROPDMG<-dtstorm$PROPDMG_mult*dtstorm$PROPDMG/1000000

As one note, the EVTYPE is clearly unstructured data variable: too many levels for the factors and too few data records for the very vast majority of it. In order to proceed with statistical inference, we firstly need to adjust the data by “reasonable linguistic processing”, for 10 top event terms mantioned in [EVTYPE], we search for the closest ones (in terms of certain distance measure between string varibles, based on adist() standard function) and replace them with the single term, e.g. “THUNDERSTORM WIND” and “THUNDERSTORM WINDS”, that will somewhat automate the grouping. Besides, there are two levels, which apparently have the same meaning, but are distant in terms of strings similarity: “TSTM WIND” and “THUNDERSTORM WIND”, you may refer to almost any acronyms dictionary, and it’s clearly not “too small to measure”, when occured along with an assessment of fatalities numbers.

Considering the above, we construct another variable “event” which will allow us to treat “differences in spelling” and “acronyms” properly.

freq_type<-aggregate(INJURIES~EVTYPE,data=dtstorm,FUN=sum)
freq_type<-freq_type[order(-freq_type$INJURIES),]#most common terms

dtstorm$event<-as.character(dtstorm$EVTYPE)
dtstorm$event[which(dtstorm$event=="TSTM WIND")]<-"THUNDERSTORM WIND"
for(i in 1:10){
  term2use<-as.character(freq_type[i,1])
  #pick certain frequent term, like "FLOOD"
  if(term2use%in%levels(as.factor(dtstorm$event))){
    #check whether this term wasn't replaced in from the data previously
    dists<<-sapply(as.character(freq_type$EVTYPE),function(term)
      adist(term2use,term,partial = T)
      #calculate the distances in terms of string similarity
      /min(nchar(term2use),nchar(as.character(term)))
    )
    terms2replace=as.character(freq_type$EVTYPE)[which(dists<0.1)]
    #check the replace candidates
    print(paste(term2use,":",paste(terms2replace[1:3],collapse="; "),"..."),sep="")
    ids<-which(as.character(dtstorm$EVTYPE)%in%terms2replace)
    dtstorm$event[ids]<-term2use
    #replace the occurances of event term variations with a single one
  }
}
## [1] "TORNADO : TORNADO; WATERSPOUT/TORNADO; TORNADO F2 ..."
## [1] "FLOOD : FLOOD; FLASH FLOOD; FLOOD/FLASH FLOOD ..."
## [1] "EXCESSIVE HEAT : EXCESSIVE HEAT; DROUGHT/EXCESSIVE HEAT; EXCESSIVE HEAT/DROUGHT ..."
## [1] "LIGHTNING : LIGHTNING; LIGHTNING AND THUNDERSTORM WIN; LIGHTNING INJURY ..."
## [1] "HEAT : EXCESSIVE HEAT; HEAT; HEAT WAVE ..."
## [1] "ICE STORM : ICE STORM; GLAZE/ICE STORM; ICE STORM/FLASH FLOOD ..."
## [1] "THUNDERSTORM WIND : THUNDERSTORM WIND; THUNDERSTORM WINDS; MARINE THUNDERSTORM WIND ..."
## [1] "HAIL : HAIL; TSTM WIND/HAIL; SMALL HAIL ..."

Chapter 1: Selecting the measures: frequencies vs totals, US combined data.

Let’s start from direct, “plain” assessments:

data_analysis_all<-dtstorm
freq_inj<-aggregate(INJURIES~event,data=data_analysis_all,FUN=sum)
freq_fat<-aggregate(FATALITIES~event,data=data_analysis_all,FUN=sum)
freq_prop<-aggregate(PROPDMG~event,data=data_analysis_all,FUN=sum)
freq_cnt<-aggregate(INJURIES~event,data=data_analysis_all,FUN=length)
freq_all<-merge(merge(merge(freq_inj,
                            freq_fat,by='event'),
                            freq_prop,by='event'),freq_cnt,by='event')
colnames(freq_all)<-c("event","INJURIES","FATALITIES","PROP","COUNT")

Eventually, we can describe the severity of different event types in terms of “TOPxx”’s. As we will see, the result significantly depends upon selected “measure”. Data for 1950-2011yy.

top5_inj_tot<-freq_all[order(-freq_all$INJURIES)[1:5],c('event','INJURIES')]
freq_all$inj_freq<-round(freq_all$INJURIES/freq_all$COUNT)
top5_inj_freq<-freq_all[order(-freq_all$inj_freq)[1:5],c('event','inj_freq')]

top5_fat_tot<-freq_all[order(-freq_all$FATALITIES)[1:5],c('event','FATALITIES')]
freq_all$fat_freq<-round(freq_all$FATALITIES/freq_all$COUNT)
top5_fat_freq<-freq_all[order(-freq_all$fat_freq)[1:5],c('event','fat_freq')]

top5_prop_tot<-freq_all[order(-freq_all$PROP)[1:5],c('event','PROP')]
freq_all$prop_norm<-round(freq_all$PROP/freq_all$COUNT)
top5_prop_norm<-freq_all[order(-freq_all$prop_norm)[1:5],c('event','prop_norm')]

colnames(top5_inj_tot)<-c("event","injuries")
top5_inj_tot
##                 event injuries
## 613           TORNADO    91407
## 605 THUNDERSTORM WIND     9396
## 194              HEAT     9154
## 136             FLOOD     8601
## 357         LIGHTNING     5231
colnames(top5_inj_freq)<-c("event","injuries per 1 event")
top5_inj_freq
##                     event injuries per 1 event
## 195             Heat Wave                   70
## 621 TROPICAL STORM GORDON                   43
## 704            WILD FIRES                   38
## 607         THUNDERSTORMW                   27
## 269    HIGH WIND AND SEAS                   20
colnames(top5_fat_tot)<-c("event","fatalities")
top5_fat_tot
##                 event fatalities
## 613           TORNADO       5636
## 194              HEAT       3138
## 136             FLOOD       1523
## 357         LIGHTNING        817
## 605 THUNDERSTORM WIND        714
colnames(top5_fat_freq)<-c("event","fatalities per 1 event")
top5_fat_freq
##                     event fatalities per 1 event
## 59          COLD AND SNOW                     14
## 621 TROPICAL STORM GORDON                      8
## 276        HIGH WIND/SEAS                      4
## 365         MARINE MISHAP                      4
## 244   Heavy surf and wind                      3
colnames(top5_prop_tot)<-c("event","property damage, mln.USD")
top5_prop_tot
##                 event property damage, mln.USD
## 136             FLOOD                167379.12
## 311 HURRICANE/TYPHOON                 69305.84
## 613           TORNADO                 56993.10
## 524       STORM SURGE                 43323.54
## 190              HAIL                 17619.99
colnames(top5_prop_norm)<-c("event","property damage per 1 event, mln.USD")
top5_prop_norm
##                         event property damage per 1 event, mln.USD
## 208 HEAVY RAIN/SEVERE WEATHER                                 1250
## 311         HURRICANE/TYPHOON                                  788
## 308            HURRICANE OPAL                                  353
## 524               STORM SURGE                                  166
## 704                WILD FIRES                                  156

As one could note, “frequencies” appear to have poor performance in task of risk level assessment: very few event, bearing for “unique” event definition bring the highest risk-to-count ratio. Therefore, proper metric using “frequencies” approach could be developed only using Value-At-Risk stochastic approach, or proper pre-grouping of the events, which would require additional data (like expert grouping, average temperature/pressure/weather conditions etc.).

In any case, we will state that the main approach for determining the most dangerous events would be just the plain total numbers.

colnames(top5_inj_tot)<-c("event","estimate")
top5_inj_tot$label<-'MHE: total injuries'
colnames(top5_fat_tot)<-c("event","estimate")
top5_fat_tot$label<-'MHE: total fatalities'
colnames(top5_prop_tot)<-c("event","estimate")
top5_prop_tot$label<-'EC: property damage'
top5_events<-rbind(top5_inj_tot,top5_fat_tot,top5_prop_tot)
require(ggplot2)
## Loading required package: ggplot2
top5_events$ord<-order(top5_events$label,-top5_events$estimate)
top5_events$ord<-rep(1:15)
ggplot(top5_events,aes(x=ord, y=estimate))+
  geom_bar(stat="identity")+ 
  ggtitle("Most harmful events (MHE) and economic consequences (EC): \n1950+ data by event types")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+xlab("event type (EVTYPE grouped)")+
  scale_x_continuous(breaks=1:15,labels=c(top5_events$event))+
  facet_wrap(~label, scales="free",ncol = 3)

Eventually, this explains why we chose the very simple approach: we examine only aggregated sums with distribution respect to each of the state.

Chapter 2: Across-the-states analysis

As seen before, most simple measure like “total injuries / total property damage” to be used as a proxy for answers:

will help in answering two basic questions of this study:

the following code provides for derivation for each state TOP5 event types with most injuries and most property damage caused.

inj_tot_states<-aggregate(INJURIES~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
inj_tot_states<-subset(inj_tot_states,INJURIES>0)
inj_tot_states<-inj_tot_states[order(inj_tot_states$STATE,-inj_tot_states$INJURIES),]
inj_tot_states$ord<-1:nrow(inj_tot_states)
inj_tot_states$min_ord<-sapply(as.character(inj_tot_states$STATE),function(st){min(
  subset(inj_tot_states,STATE==st)$ord)})
inj_tot_states$rank<-inj_tot_states$ord-inj_tot_states$min_ord+1
top5_inj_states<-inj_tot_states[which(inj_tot_states$rank<=5),
                                c("STATE","event","rank","INJURIES")]

prop_tot_states<-aggregate(PROPDMG~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
prop_tot_states<-subset(prop_tot_states,PROPDMG>0)
prop_tot_states<-prop_tot_states[order(prop_tot_states$STATE,-prop_tot_states$PROPDMG),]
prop_tot_states$ord<-1:nrow(prop_tot_states)
prop_tot_states$min_ord<-sapply(as.character(prop_tot_states$STATE),function(st){min(
  subset(prop_tot_states,STATE==st)$ord)})
prop_tot_states$rank<-prop_tot_states$ord-prop_tot_states$min_ord+1
top5_prop_states<-prop_tot_states[which(prop_tot_states$rank<=5),
                                  c("STATE","event","rank","PROPDMG")]

The resulting lists could be found in the Results section. Let’s plot the map using maps package.

top1_prop_states<-prop_tot_states[which(prop_tot_states$rank==1),
                                  c("STATE","event","rank","PROPDMG")]
top1_prop_states$event<-factor(top1_prop_states$event)
top1_prop_states$statelabel<-(sapply(top1_prop_states$STATE,function(st)
  {tolower(state.name[which(state.abb==st)])}))
top1_prop_states$len<-sapply(top1_prop_states$statelabel,length)
top1_prop_states<-subset(top1_prop_states,len>0)#do not plot regions which do not
colours<-c("springgreen4","darkslategray","skyblue","skyblue3","aliceblue",
           "midnightblue","deepskyblue4","mediumpurple4","red4")
top1_prop_states$event<-factor(top1_prop_states$event)
leg.txt<-levels(top1_prop_states$event)
#appear in the map
require(maps)
top1_prop_states$clr<-colours[as.numeric(top1_prop_states$event)]

namevec <- map(database = "state", col = top1_prop_states$clr,fill=T, namesonly=TRUE,plot=F)
namevec<-sapply(namevec,function(nm){
  if(grepl(pattern = ":",nm)>0){
    strsplit(nm,":")[[1]][1]
  }else{nm}
})

clr<-sapply(namevec,function(st){
  if(st %in% levels(factor(as.character(top1_prop_states$statelabel)))){
    top1_prop_states$clr[which(as.character(top1_prop_states$statelabel)==st)]}else{
    "white"
  }})
map(database = "state", col = clr,fill=T)
title("Events with greatest economic consequences, 1950+")
legend("bottomleft", leg.txt, horiz = F,fill=colours,cex = 0.7)

Chapter 3: Dynamics

Now, to assess the present level of risk, for instance, for injuries, let us study the dynamics of TOP 5 event types with most numbers of injured.

require(lubridate)
require(ggplot2)
dyn_data<-aggregate(INJURIES~year(BGN_DATE)+event,data=
                      subset(data_analysis_all,year(BGN_DATE)>1970 & 
                               event%in%top5_inj_tot$event),FUN=sum,na.rm=T)
colnames(dyn_data)<-c("year","event","injuries")
ggplot(dyn_data,aes(x=year,y=injuries,group=event,col=event))+
  geom_point()+geom_line()

Therefore, we can make the following conclusions:

Results

As seen from the map above, most dangeroes events are connected with winds: in central US, most injuries and property damage is caused by TORNADOs, the variations like “Thunderstorm wind” or “HIGH WIND” are frequent to central-western or north-eastern states. FLOODs cause economic damages (but less risky for injuries/fatalities) mostly at north-eastern states. There is a single state (Delaware) where lightning is the most dangerous event type.

The Most dangerous for peoples’ health and property events for each state (HH=‘health harmful’,EC=‘economic consequences’):

require(knitr)
ds1<-subset(top5_inj_states,rank==1,select=c("STATE",'event','INJURIES'))
colnames(ds1)<-c("state","most HH","#injuries")
ds2<-subset(top5_prop_states,rank==1,select=c("STATE",'event','PROPDMG'))
colnames(ds2)<-c("state","greatest EC","property damage, '000$")
kable(merge(ds1,ds2,by='state'),
      caption="Most dangeroues events by state, 1950+ data",
      row.names = F,escape = "\\")
Most dangeroues events by state, 1950+ data
state most HH #injuries greatest EC property damage, ’000$
AK ICE STORM 34 FLOOD 193.0250
AL TORNADO 7929 TORNADO 6321.2976
AM THUNDERSTORM WIND 22 WATERSPOUT 5.1020
AN MARINE STRONG WIND 18 THUNDERSTORM WIND 0.1690
AR TORNADO 5116 TORNADO 2590.0073
AS TSUNAMI 129 TSUNAMI 81.0000
AZ THUNDERSTORM WIND 186 HAIL 2829.2712
CA WILDFIRE 623 FLOOD 117117.4110
CO TORNADO 261 HAIL 1424.0448
CT TORNADO 703 TORNADO 596.2366
DC HEAT 316 TROPICAL STORM 127.6000
DE TORNADO 73 FLOOD 71.4095
FL TORNADO 3344 HURRICANE/TYPHOON 27596.8650
GA TORNADO 3926 TORNADO 3261.0267
GU HURRICANE/TYPHOON 333 TYPHOON 600.2300
HI STRONG WIND 20 FLOOD 162.8431
IA TORNADO 2208 TORNADO 2286.5763
ID THUNDERSTORM WIND 152 FLOOD 129.9740
IL TORNADO 4145 FLOOD 6047.1340
IN TORNADO 4224 TORNADO 2594.7939
KS TORNADO 2721 TORNADO 2669.8907
KY TORNADO 2806 TORNADO 888.7687
LA TORNADO 2676 STORM SURGE 31742.7350
LM MARINE STRONG WIND 1 MARINE TSTM WIND 1.2050
MA TORNADO 1758 TORNADO 756.0391
MD HEAT 545 TROPICAL STORM 538.5050
ME LIGHTNING 70 ICE STORM 318.2300
MH HIGH SURF 1 HIGH SURF 5.0000
MI TORNADO 3362 TORNADO 1071.7656
MN TORNADO 1976 TORNADO 1903.7011
MO TORNADO 4330 TORNADO 4800.7017
MS TORNADO 6246 HURRICANE/TYPHOON 13492.7350
MT WILD/FOREST FIRE 33 HAIL 94.9239
NC TORNADO 2548 HURRICANE 4979.8610
ND TORNADO 326 FLOOD 4071.4875
NE TORNADO 1158 TORNADO 1718.1647
NH LIGHTNING 85 FLOOD 105.9158
NJ HEAT 304 FLOOD 2890.0900
NM TORNADO 155 WILD/FOREST FIRE 1509.7000
NV FLOOD 63 FLOOD 739.2661
NY THUNDERSTORM WIND 344 FLOOD 3207.7735
OH TORNADO 4442 TORNADO 2283.1578
OK TORNADO 4829 TORNADO 3268.7082
OR HIGH WIND 50 FLOOD 741.8625
PA TORNADO 1241 FLOOD 2522.9480
PR HEAVY RAIN 10 HURRICANE 1824.4310
PZ MARINE STRONG WIND 3 MARINE STRONG WIND 0.0760
RI TORNADO 23 FLOOD 93.5410
SC TORNADO 1314 TORNADO 531.7452
SD TORNADO 452 TORNADO 231.2138
TN TORNADO 4748 FLOOD 4764.7492
TX TORNADO 8207 TROPICAL STORM 5491.1930
UT WINTER STORM 415 FLOOD 371.1125
VA TORNADO 914 HURRICANE/TYPHOON 512.0000
VI LIGHTNING 1 HURRICANE 28.2200
VT THUNDERSTORM WIND 34 FLOOD 1428.5735
WA TORNADO 303 FLOOD 234.0150
WI TORNADO 1601 FLOOD 986.1774
WV THUNDERSTORM WIND 151 FLOOD 769.8981
WY WINTER STORM 119 HAIL 111.2217

Appendix 1: Some math+stats samples

Now, let us add a covariate (please note, that more comprehensive analysis would not fit the frame of current study, so this is just a pointer towards more detailed reasearch direction) which could interfere with the consequences of the catastrophe events under investigation: let’s construct “time of the year”, “time of the day” and add “latitude”/“longitude”" to track the souther/northern and eastern/western environment features.

To ease the computations, choose the subsample of 50k records.

set.seed(145)#in order to speed-up the calculations
data_analysis<-data_analysis_all[sample(1:nrow(data_analysis_all),50000),]
data_analysis$duration<-data_analysis$END_DATE-data_analysis$BGN_DATE

timeofyear<-function(dt){
  nmonth<-as.numeric(format(dt,"%m"))
  if(nmonth%in%c(12,1,2))return('win')
  if(nmonth%in%c(3,4,5))return('spr')
  if(nmonth%in%c(6,7,8))return('sum')
  if(nmonth%in%c(9,10,11))return('aut')
}

hh<-function(tm){
  hh1<-as.numeric(substr(as.character(tm),1,2))
  ifelse((grepl("AM",tm)>0) & (hh1==12), 0,
         #12:00 AM->return 0
         ifelse((grepl("PM",tm)>0) & (hh1==12), 12,
                #12:00 PM return 12
                (grepl("PM",tm)>0)*12+hh1))
  #there occur all the possible cases: 00:00 AM, 12:00 AM, 12:00 PM,...
}

data_analysis$timeofyear<-sapply(data_analysis$BGN_DATE,timeofyear)
data_analysis$year<-sapply(data_analysis$BGN_DATE,year)
data_analysis$hour<-sapply(data_analysis$BGN_DATE,hh)


data_analysis<-data_analysis[,c("event","EVTYPE",
                                #primary variables: new & old event type definition
                                 "year",
                                #time variable
                              "timeofyear","hour","duration",'MAG','F','LENGTH','WIDTH',
                              #covariates & attribs
                              'PROPDMG','INJURIES','FATALITIES'
                              #responce = event consequences
                              )]

As for inclusion of variables in the analysis, the motivation is as follows:

Please note that we didn’t count for timezones in order not to complicate: time would be meaningful as recorded and date/time of the year/year shifts would be negligable.

The above mentioned notes are implemented as follows:

require(rpart)
fit<-rpart(data=data_analysis,INJURIES~event+duration+timeofyear+hour+duration+MAG+LENGTH+WIDTH)

Therefore, the tree structure of identification of the most risky events is as follows:

fit
## n= 50000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 50000 352576.600  0.12086000  
##     2) LENGTH< 9.35 49663 158440.900  0.07361617 *
##     3) LENGTH>=9.35 337 177689.700  7.08308600  
##       6) WIDTH< 366.5 242  30811.040  2.76859500 *
##       7) WIDTH>=366.5 95 130898.500 18.07368000  
##        14) timeofyear=aut,sum 30   3124.167  6.16666700 *
##        15) timeofyear=spr,win 65 121557.900 23.56923000  
##          30) hour>=19.5 26  17908.460 11.53846000 *
##          31) hour< 19.5 39  97377.440 31.58974000  
##            62) WIDTH< 470 19  26141.680 19.26316000 *
##            63) WIDTH>=470 20  65606.200 43.30000000  
##             126) WIDTH>=750 12  30596.920 23.58333000 *
##             127) WIDTH< 750 8  23346.880 72.87500000 *