In this study, the author uses exploratory analysis of natural disaster data to identify the most dangerous event types.
The definition of "most dangerous" naturally cannot be strict and allows for multiple interpretations, depending on the measure suitable for a particular application. For instance, mosquitoes cause an extreme number of deaths, yet any one insect is nearly harmless, while the king cobra is a comparatively rare killer despite its highly lethal bite. Therefore, we study four measures: total injury/fatality counts, and injury/fatality rates relative to the number of events registered.
The analysis in the first chapter is quite simple: we aggregate the data by event type (after some linguistic normalization), then calculate the total counts, injury/fatality rates, and average loss per event for the property damage assessment.
In the second chapter, we derive the riskiest events for each state, given its historical data.
To account for correlations with other parameters, we fit a tree model using the "rpart" R package.
The resulting conclusion is that, by any reasonable measure, TORNADO is an extremely dangerous event type.
To check the stability of this conclusion, trends over time are studied as well.
First, we download the natural disaster data and load it into the environment:
require(knitr)
## Loading required package: knitr
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","_temp.csv.bz2")
dtstorm_raw<-read.csv("_temp.csv.bz2")
#saveRDS(dtstorm_raw,file = "stats.rds")
#dtstorm_raw<-readRDS("stats.rds")
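To avoid re-downloading the large archive on every knit, the fetch can also be guarded; a minimal sketch, using the same temporary file name as above:

#skip the download if the archive is already present (a sketch)
if(!file.exists("_temp.csv.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"_temp.csv.bz2")
}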
First, to make proper use of the date attributes of the records, we convert the relevant variables to the standard POSIX time format.
dtstorm<-dtstorm_raw
#parse begin/end dates into the standard POSIXct format
dtstorm$BGN_DATE<-as.POSIXct(strptime(dtstorm$BGN_DATE,
format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
dtstorm$END_DATE<-as.POSIXct(strptime(dtstorm$END_DATE,
format="%m/%d/%Y %H:%M:%S",tz = "GMT"))
#decode the damage exponent codes: K=thousand, m/M=million, B=billion;
#any other code is treated as a multiplier of 1
dtstorm$PROPDMG_mult<-sapply(as.character(dtstorm$PROPDMGEXP),function(i){
ifelse(i=='K',1000,ifelse(i=='m',1000000,ifelse(i=='M',1000000,ifelse(i=='B',1000000000,1))))
})
#express property damage in millions of USD
dtstorm$PROPDMG<-dtstorm$PROPDMG_mult*dtstorm$PROPDMG/1000000
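For reference, the same exponent decoding can be written without sapply as a named-vector lookup; this is just a sketch under the same assumption that only K, m/M, and B are meaningful codes:

#vectorized alternative to the nested ifelse above (a sketch)
mult_map<-c(K=1e3,m=1e6,M=1e6,B=1e9)
mult<-mult_map[as.character(dtstorm_raw$PROPDMGEXP)]
mult[is.na(mult)]<-1 #unknown or empty codes default to 1, as above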
As one can note, EVTYPE is clearly an unstructured variable: it has far too many factor levels, and the vast majority of them cover very few records. To proceed with statistical inference, we first adjust the data with some "reasonable linguistic processing": for the 10 most frequent event terms mentioned in EVTYPE, we search for the closest ones (in terms of a string distance measure based on the standard adist() function) and replace them with a single term, e.g. "THUNDERSTORM WIND" and "THUNDERSTORM WINDS"; this somewhat automates the grouping. Besides, there are two levels which apparently have the same meaning but are distant in terms of string similarity: "TSTM WIND" and "THUNDERSTORM WIND" (almost any acronym dictionary confirms this), and the TSTM group is clearly not too small to ignore when assessing fatality numbers.
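To illustrate the distance measure: with partial matching, a term that occurs as a substring of another is at distance zero. A quick check on two of the event levels discussed above:

#partial matching gives distance 0 when one term contains the other
adist("THUNDERSTORM WIND","THUNDERSTORM WINDS",partial = TRUE) #0
adist("TORNADO","WATERSPOUT/TORNADO",partial = TRUE) #0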
Considering the above, we construct another variable, "event", which allows us to treat spelling variations and acronyms properly.
freq_type<-aggregate(INJURIES~EVTYPE,data=dtstorm,FUN=sum)
freq_type<-freq_type[order(-freq_type$INJURIES),]#most common terms first
dtstorm$event<-as.character(dtstorm$EVTYPE)
#"TSTM" is an acronym for "THUNDERSTORM", so merge these levels explicitly
dtstorm$event[which(dtstorm$event=="TSTM WIND")]<-"THUNDERSTORM WIND"
for(i in 1:10){
term2use<-as.character(freq_type[i,1])
#pick a frequent term, like "FLOOD"
if(term2use%in%levels(as.factor(dtstorm$event))){
#proceed only if this term still exists, i.e. was not merged away earlier
dists<-sapply(as.character(freq_type$EVTYPE),function(term)
#partial string distance, normalized by the shorter term's length
adist(term2use,term,partial = TRUE)/
min(nchar(term2use),nchar(as.character(term))))
terms2replace<-as.character(freq_type$EVTYPE)[which(dists<0.1)]
#show the replacement candidates
print(paste(term2use,":",paste(terms2replace[1:3],collapse="; "),"..."))
ids<-which(as.character(dtstorm$EVTYPE)%in%terms2replace)
#replace all variations of the event term with the single canonical one
dtstorm$event[ids]<-term2use
}
}
## [1] "TORNADO : TORNADO; WATERSPOUT/TORNADO; TORNADO F2 ..."
## [1] "FLOOD : FLOOD; FLASH FLOOD; FLOOD/FLASH FLOOD ..."
## [1] "EXCESSIVE HEAT : EXCESSIVE HEAT; DROUGHT/EXCESSIVE HEAT; EXCESSIVE HEAT/DROUGHT ..."
## [1] "LIGHTNING : LIGHTNING; LIGHTNING AND THUNDERSTORM WIN; LIGHTNING INJURY ..."
## [1] "HEAT : EXCESSIVE HEAT; HEAT; HEAT WAVE ..."
## [1] "ICE STORM : ICE STORM; GLAZE/ICE STORM; ICE STORM/FLASH FLOOD ..."
## [1] "THUNDERSTORM WIND : THUNDERSTORM WIND; THUNDERSTORM WINDS; MARINE THUNDERSTORM WIND ..."
## [1] "HAIL : HAIL; TSTM WIND/HAIL; SMALL HAIL ..."
Let’s start with direct, “plain” assessments:
data_analysis_all<-dtstorm
#totals and event counts per (grouped) event type
freq_inj<-aggregate(INJURIES~event,data=data_analysis_all,FUN=sum)
freq_fat<-aggregate(FATALITIES~event,data=data_analysis_all,FUN=sum)
freq_prop<-aggregate(PROPDMG~event,data=data_analysis_all,FUN=sum)
freq_cnt<-aggregate(INJURIES~event,data=data_analysis_all,FUN=length)
freq_all<-merge(merge(merge(freq_inj,
freq_fat,by='event'),
freq_prop,by='event'),freq_cnt,by='event')
colnames(freq_all)<-c("event","INJURIES","FATALITIES","PROP","COUNT")
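The nested merge can equivalently be folded with Reduce over the same four data frames; a sketch, with the result and renaming unchanged:

#fold the four frames with successive merges on 'event' (a sketch)
freq_all<-Reduce(function(a,b)merge(a,b,by="event"),
list(freq_inj,freq_fat,freq_prop,freq_cnt))
colnames(freq_all)<-c("event","INJURIES","FATALITIES","PROP","COUNT")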
Now we can describe the severity of different event types in terms of “top 5” lists. As we will see, the result depends significantly on the selected measure. The data cover the years 1950–2011.
top5_inj_tot<-freq_all[order(-freq_all$INJURIES)[1:5],c('event','INJURIES')]
freq_all$inj_freq<-round(freq_all$INJURIES/freq_all$COUNT)
top5_inj_freq<-freq_all[order(-freq_all$inj_freq)[1:5],c('event','inj_freq')]
top5_fat_tot<-freq_all[order(-freq_all$FATALITIES)[1:5],c('event','FATALITIES')]
freq_all$fat_freq<-round(freq_all$FATALITIES/freq_all$COUNT)
top5_fat_freq<-freq_all[order(-freq_all$fat_freq)[1:5],c('event','fat_freq')]
top5_prop_tot<-freq_all[order(-freq_all$PROP)[1:5],c('event','PROP')]
freq_all$prop_norm<-round(freq_all$PROP/freq_all$COUNT)
top5_prop_norm<-freq_all[order(-freq_all$prop_norm)[1:5],c('event','prop_norm')]
colnames(top5_inj_tot)<-c("event","injuries")
top5_inj_tot
## event injuries
## 613 TORNADO 91407
## 605 THUNDERSTORM WIND 9396
## 194 HEAT 9154
## 136 FLOOD 8601
## 357 LIGHTNING 5231
colnames(top5_inj_freq)<-c("event","injuries per 1 event")
top5_inj_freq
## event injuries per 1 event
## 195 Heat Wave 70
## 621 TROPICAL STORM GORDON 43
## 704 WILD FIRES 38
## 607 THUNDERSTORMW 27
## 269 HIGH WIND AND SEAS 20
colnames(top5_fat_tot)<-c("event","fatalities")
top5_fat_tot
## event fatalities
## 613 TORNADO 5636
## 194 HEAT 3138
## 136 FLOOD 1523
## 357 LIGHTNING 817
## 605 THUNDERSTORM WIND 714
colnames(top5_fat_freq)<-c("event","fatalities per 1 event")
top5_fat_freq
## event fatalities per 1 event
## 59 COLD AND SNOW 14
## 621 TROPICAL STORM GORDON 8
## 276 HIGH WIND/SEAS 4
## 365 MARINE MISHAP 4
## 244 Heavy surf and wind 3
colnames(top5_prop_tot)<-c("event","property damage, mln.USD")
top5_prop_tot
## event property damage, mln.USD
## 136 FLOOD 167379.12
## 311 HURRICANE/TYPHOON 69305.84
## 613 TORNADO 56993.10
## 524 STORM SURGE 43323.54
## 190 HAIL 17619.99
colnames(top5_prop_norm)<-c("event","property damage per 1 event, mln.USD")
top5_prop_norm
## event property damage per 1 event, mln.USD
## 208 HEAVY RAIN/SEVERE WEATHER 1250
## 311 HURRICANE/TYPHOON 788
## 308 HURRICANE OPAL 353
## 524 STORM SURGE 166
## 704 WILD FIRES 156
As one can note, the per-event “frequencies” perform poorly for risk-level assessment: event types with very few records, often bearing near-unique definitions, produce the highest risk-to-count ratios. A proper frequency-based metric could therefore be developed only with a Value-at-Risk stochastic approach, or with a proper pre-grouping of the events, which would require additional data (such as expert grouping, average temperature/pressure/weather conditions, etc.).
In any case, we take the plain totals as the main measure for determining the most dangerous events.
colnames(top5_inj_tot)<-c("event","estimate")
top5_inj_tot$label<-'MHE: total injuries'
colnames(top5_fat_tot)<-c("event","estimate")
top5_fat_tot$label<-'MHE: total fatalities'
colnames(top5_prop_tot)<-c("event","estimate")
top5_prop_tot$label<-'EC: property damage'
top5_events<-rbind(top5_inj_tot,top5_fat_tot,top5_prop_tot)
require(ggplot2)
## Loading required package: ggplot2
#a plain numeric index is used for the x axis because the same event type
#(e.g. TORNADO) appears in several facets; the rows are already grouped by
#label and sorted by decreasing estimate
top5_events$ord<-1:15
ggplot(top5_events,aes(x=ord, y=estimate))+
geom_bar(stat="identity")+
ggtitle("Most harmful events (MHE) and economic consequences (EC): \n1950+ data by event types")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+xlab("event type (EVTYPE grouped)")+
scale_x_continuous(breaks=1:15,labels=c(top5_events$event))+
facet_wrap(~label, scales="free",ncol = 3)
This explains why we chose the very simple approach: we examine only aggregated totals, broken down by state. As seen before, simple measures such as total injuries and total property damage serve as proxies for answering the two basic questions of this study. The following code derives, for each state, the top 5 event types causing the most injuries and the most property damage.
#total injuries per event type within each state, positive entries only
inj_tot_states<-aggregate(INJURIES~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
inj_tot_states<-subset(inj_tot_states,INJURIES>0)
#sort by state, then by decreasing injuries, and derive a within-state rank
inj_tot_states<-inj_tot_states[order(inj_tot_states$STATE,-inj_tot_states$INJURIES),]
inj_tot_states$ord<-1:nrow(inj_tot_states)
inj_tot_states$min_ord<-sapply(as.character(inj_tot_states$STATE),function(st){min(
subset(inj_tot_states,STATE==st)$ord)})
inj_tot_states$rank<-inj_tot_states$ord-inj_tot_states$min_ord+1
top5_inj_states<-inj_tot_states[which(inj_tot_states$rank<=5),
c("STATE","event","rank","INJURIES")]
#the same derivation, for property damage
prop_tot_states<-aggregate(PROPDMG~event+STATE,data=data_analysis_all,FUN=sum,na.rm=T)
prop_tot_states<-subset(prop_tot_states,PROPDMG>0)
prop_tot_states<-prop_tot_states[order(prop_tot_states$STATE,-prop_tot_states$PROPDMG),]
prop_tot_states$ord<-1:nrow(prop_tot_states)
prop_tot_states$min_ord<-sapply(as.character(prop_tot_states$STATE),function(st){min(
subset(prop_tot_states,STATE==st)$ord)})
prop_tot_states$rank<-prop_tot_states$ord-prop_tot_states$min_ord+1
top5_prop_states<-prop_tot_states[which(prop_tot_states$rank<=5),
c("STATE","event","rank","PROPDMG")]
The resulting lists can be found in the Results section. Let’s plot the map using the maps package.
top1_prop_states<-prop_tot_states[which(prop_tot_states$rank==1),
c("STATE","event","rank","PROPDMG")]
top1_prop_states$event<-factor(top1_prop_states$event)
#map state abbreviations to the lowercase full names used by the maps package
top1_prop_states$statelabel<-(sapply(top1_prop_states$STATE,function(st)
{tolower(state.name[which(state.abb==st)])}))
top1_prop_states$len<-sapply(top1_prop_states$statelabel,length)
#drop regions (territories, marine zones) which do not appear on the map
top1_prop_states<-subset(top1_prop_states,len>0)
colours<-c("springgreen4","darkslategray","skyblue","skyblue3","aliceblue",
"midnightblue","deepskyblue4","mediumpurple4","red4")
top1_prop_states$event<-factor(top1_prop_states$event)
leg.txt<-levels(top1_prop_states$event)
require(maps)
top1_prop_states$clr<-colours[as.numeric(top1_prop_states$event)]
#polygon names in plotting order; strip sub-region suffixes like "new york:main"
namevec <- map(database = "state", namesonly=TRUE,plot=F)
namevec<-sapply(namevec,function(nm){
if(grepl(pattern = ":",nm)){
strsplit(nm,":")[[1]][1]
}else{nm}
})
#assign each polygon its state's colour, or white if the state has no data
clr<-sapply(namevec,function(st){
if(st %in% levels(factor(as.character(top1_prop_states$statelabel)))){
top1_prop_states$clr[which(as.character(top1_prop_states$statelabel)==st)]}else{
"white"
}})
map(database = "state", col = clr,fill=T)
title("Events with greatest economic consequences, 1950+")
legend("bottomleft", leg.txt, horiz = F,fill=colours,cex = 0.7)
Now, to assess the present level of risk (for injuries, for instance), let us study the dynamics of the top 5 event types with the largest numbers of injured.
require(lubridate)
require(ggplot2)
dyn_data<-aggregate(INJURIES~year(BGN_DATE)+event,data=
subset(data_analysis_all,year(BGN_DATE)>1970 &
event%in%top5_inj_tot$event),FUN=sum,na.rm=T)
colnames(dyn_data)<-c("year","event","injuries")
ggplot(dyn_data,aes(x=year,y=injuries,group=event,col=event))+
geom_point()+geom_line()
Therefore, we can draw the following conclusions:
As seen from the map above, the most dangerous events are connected with winds: in the central US, most injuries and property damage are caused by TORNADOs, while variants such as “THUNDERSTORM WIND” or “HIGH WIND” dominate in central-western and north-eastern states. FLOODs cause the greatest economic damage (while being less risky for injuries/fatalities), mostly in north-eastern states. In a few states (Maine and New Hampshire, per the table below), LIGHTNING is the most harmful event type for health.
The most dangerous events for people’s health and property in each state (HH = ‘health harmful’, EC = ‘economic consequences’):
require(knitr)
ds1<-subset(top5_inj_states,rank==1,select=c("STATE",'event','INJURIES'))
colnames(ds1)<-c("state","most HH","#injuries")
ds2<-subset(top5_prop_states,rank==1,select=c("STATE",'event','PROPDMG'))
#PROPDMG was converted to millions of USD above
colnames(ds2)<-c("state","greatest EC","property damage, mln.USD")
kable(merge(ds1,ds2,by='state'),
caption="Most dangerous events by state, 1950+ data",
row.names = FALSE)
| state | most HH | #injuries | greatest EC | property damage, mln.USD |
|---|---|---|---|---|
| AK | ICE STORM | 34 | FLOOD | 193.0250 |
| AL | TORNADO | 7929 | TORNADO | 6321.2976 |
| AM | THUNDERSTORM WIND | 22 | WATERSPOUT | 5.1020 |
| AN | MARINE STRONG WIND | 18 | THUNDERSTORM WIND | 0.1690 |
| AR | TORNADO | 5116 | TORNADO | 2590.0073 |
| AS | TSUNAMI | 129 | TSUNAMI | 81.0000 |
| AZ | THUNDERSTORM WIND | 186 | HAIL | 2829.2712 |
| CA | WILDFIRE | 623 | FLOOD | 117117.4110 |
| CO | TORNADO | 261 | HAIL | 1424.0448 |
| CT | TORNADO | 703 | TORNADO | 596.2366 |
| DC | HEAT | 316 | TROPICAL STORM | 127.6000 |
| DE | TORNADO | 73 | FLOOD | 71.4095 |
| FL | TORNADO | 3344 | HURRICANE/TYPHOON | 27596.8650 |
| GA | TORNADO | 3926 | TORNADO | 3261.0267 |
| GU | HURRICANE/TYPHOON | 333 | TYPHOON | 600.2300 |
| HI | STRONG WIND | 20 | FLOOD | 162.8431 |
| IA | TORNADO | 2208 | TORNADO | 2286.5763 |
| ID | THUNDERSTORM WIND | 152 | FLOOD | 129.9740 |
| IL | TORNADO | 4145 | FLOOD | 6047.1340 |
| IN | TORNADO | 4224 | TORNADO | 2594.7939 |
| KS | TORNADO | 2721 | TORNADO | 2669.8907 |
| KY | TORNADO | 2806 | TORNADO | 888.7687 |
| LA | TORNADO | 2676 | STORM SURGE | 31742.7350 |
| LM | MARINE STRONG WIND | 1 | MARINE TSTM WIND | 1.2050 |
| MA | TORNADO | 1758 | TORNADO | 756.0391 |
| MD | HEAT | 545 | TROPICAL STORM | 538.5050 |
| ME | LIGHTNING | 70 | ICE STORM | 318.2300 |
| MH | HIGH SURF | 1 | HIGH SURF | 5.0000 |
| MI | TORNADO | 3362 | TORNADO | 1071.7656 |
| MN | TORNADO | 1976 | TORNADO | 1903.7011 |
| MO | TORNADO | 4330 | TORNADO | 4800.7017 |
| MS | TORNADO | 6246 | HURRICANE/TYPHOON | 13492.7350 |
| MT | WILD/FOREST FIRE | 33 | HAIL | 94.9239 |
| NC | TORNADO | 2548 | HURRICANE | 4979.8610 |
| ND | TORNADO | 326 | FLOOD | 4071.4875 |
| NE | TORNADO | 1158 | TORNADO | 1718.1647 |
| NH | LIGHTNING | 85 | FLOOD | 105.9158 |
| NJ | HEAT | 304 | FLOOD | 2890.0900 |
| NM | TORNADO | 155 | WILD/FOREST FIRE | 1509.7000 |
| NV | FLOOD | 63 | FLOOD | 739.2661 |
| NY | THUNDERSTORM WIND | 344 | FLOOD | 3207.7735 |
| OH | TORNADO | 4442 | TORNADO | 2283.1578 |
| OK | TORNADO | 4829 | TORNADO | 3268.7082 |
| OR | HIGH WIND | 50 | FLOOD | 741.8625 |
| PA | TORNADO | 1241 | FLOOD | 2522.9480 |
| PR | HEAVY RAIN | 10 | HURRICANE | 1824.4310 |
| PZ | MARINE STRONG WIND | 3 | MARINE STRONG WIND | 0.0760 |
| RI | TORNADO | 23 | FLOOD | 93.5410 |
| SC | TORNADO | 1314 | TORNADO | 531.7452 |
| SD | TORNADO | 452 | TORNADO | 231.2138 |
| TN | TORNADO | 4748 | FLOOD | 4764.7492 |
| TX | TORNADO | 8207 | TROPICAL STORM | 5491.1930 |
| UT | WINTER STORM | 415 | FLOOD | 371.1125 |
| VA | TORNADO | 914 | HURRICANE/TYPHOON | 512.0000 |
| VI | LIGHTNING | 1 | HURRICANE | 28.2200 |
| VT | THUNDERSTORM WIND | 34 | FLOOD | 1428.5735 |
| WA | TORNADO | 303 | FLOOD | 234.0150 |
| WI | TORNADO | 1601 | FLOOD | 986.1774 |
| WV | THUNDERSTORM WIND | 151 | FLOOD | 769.8981 |
| WY | WINTER STORM | 119 | HAIL | 111.2217 |
Now let us add covariates which could interact with the consequences of the catastrophic events under investigation (note that a more comprehensive analysis would not fit within the scope of the current study, so this is just a pointer towards a more detailed research direction): we construct “time of the year” and “time of the day”; latitude/longitude could similarly be added to track southern/northern and eastern/western environmental features.
To ease the computations, we choose a subsample of 50,000 records.
set.seed(145)#fix the seed so the subsample is reproducible
#subsample 50k records to speed up the computations
data_analysis<-data_analysis_all[sample(1:nrow(data_analysis_all),50000),]
data_analysis$duration<-data_analysis$END_DATE-data_analysis$BGN_DATE
timeofyear<-function(dt){
nmonth<-as.numeric(format(dt,"%m"))
if(nmonth%in%c(12,1,2))return('win')
if(nmonth%in%c(3,4,5))return('spr')
if(nmonth%in%c(6,7,8))return('sum')
if(nmonth%in%c(9,10,11))return('aut')
}
hh<-function(tm){
hh1<-as.numeric(substr(as.character(tm),1,2))
ifelse((grepl("AM",tm)>0) & (hh1==12), 0,
#12:00 AM->return 0
ifelse((grepl("PM",tm)>0) & (hh1==12), 12,
#12:00 PM return 12
(grepl("PM",tm)>0)*12+hh1))
#this covers all the possible cases: 00:00 AM, 12:00 AM, 12:00 PM, ...
}
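As a quick sanity check of the two helpers on hypothetical inputs (hh() expects clock strings such as those found in the raw BGN_TIME field):

timeofyear(as.POSIXct("2011-07-04",tz="GMT")) #"sum"
hh("12:00:00 AM") #0, midnight
hh("01:30:00 PM") #13
hh("12:00:00 PM") #12, noon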
data_analysis$timeofyear<-sapply(data_analysis$BGN_DATE,timeofyear)
data_analysis$year<-year(data_analysis$BGN_DATE)#lubridate's year() is vectorized
#hh() parses clock strings, so it is applied to the raw BGN_TIME field
data_analysis$hour<-sapply(as.character(data_analysis$BGN_TIME),hh)
data_analysis<-data_analysis[,c("event","EVTYPE",
#primary variables: new & old event type definition
"year",
#time variable
"timeofyear","hour","duration",'MAG','F','LENGTH','WIDTH',
#covariates & attributes
'PROPDMG','INJURIES','FATALITIES'
#response = event consequences
)]
The motivation for including the variables follows the code comments above: the new and old event type definitions are the primary variables; year, time of year, and hour capture temporal effects; duration, MAG, F, LENGTH, and WIDTH serve as covariates and physical attributes; and PROPDMG, INJURIES, and FATALITIES are the responses (event consequences).
Please note that we did not account for time zones, so as not to overcomplicate matters: the time is meaningful as recorded, and the resulting shifts in time of day, time of year, or year are negligible.
The notes above are implemented as follows:
require(rpart)
fit<-rpart(data=data_analysis,INJURIES~event+duration+timeofyear+hour+MAG+LENGTH+WIDTH)
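A quick way to inspect the fitted tree graphically, using rpart’s base plotting methods (a minimal sketch):

#draw the tree skeleton and label splits with node sizes
plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.7)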
Therefore, the tree structure identifying the riskiest events is as follows:
fit
## n= 50000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 50000 352576.600 0.12086000
## 2) LENGTH< 9.35 49663 158440.900 0.07361617 *
## 3) LENGTH>=9.35 337 177689.700 7.08308600
## 6) WIDTH< 366.5 242 30811.040 2.76859500 *
## 7) WIDTH>=366.5 95 130898.500 18.07368000
## 14) timeofyear=aut,sum 30 3124.167 6.16666700 *
## 15) timeofyear=spr,win 65 121557.900 23.56923000
## 30) hour>=19.5 26 17908.460 11.53846000 *
## 31) hour< 19.5 39 97377.440 31.58974000
## 62) WIDTH< 470 19 26141.680 19.26316000 *
## 63) WIDTH>=470 20 65606.200 43.30000000
## 126) WIDTH>=750 12 30596.920 23.58333000 *
## 127) WIDTH< 750 8 23346.880 72.87500000 *