Synopsis

The dataset comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data analysis is preceded by cleaning the data - identification of records for which sufficient documentation is available. In addition, the data analysis consists of resolving various duplications of the event names provided in the database. Using the cleaned version of the dataset, the events responsible for maximal human and economic damage in the last 15 years are identified. Events such as tornadoes and floods are identified as principal causative events for severe human as well as economic damage. Appropriate management of resources and timely warnings can mitigate the damage caused by these events.

Data Processing

Loading the data and necessary packages

suppressWarnings(library(dplyr))
suppressWarnings(library(lubridate))
suppressWarnings(library(stringr))
suppressWarnings(library(reshape2))
suppressWarnings(library(ggplot2))
storm<-read.csv(bzfile("repdata-data-StormData.csv.bz2"))
dim(storm)
## [1] 902297     37

Cleaning the data

One of the records had a typo in it. This corresponds to the record with REFNUM 60493 (60593rd row in the dataset). The PROPDMGEXP variable, which is a multiplier was mentioned as B instead of M. This was corrected. Thanks to Community TA Gregory Horne for the tip.

storm[605953,]$PROPDMGEXP <- as.factor("M") #typo fixed. Thanks to Greg Horne, CTA, REFNUM = 605963

The next step is to weed out the relevant variables in the dataset. These correspond to the Beginning and End years, fatalities, injuries, property damage PROPDMG, crop damage CRPDMG and their corresponding multipliers - PROPDMGEXP and CROPDMGEXP, along with the REFNUM for the record. Following this the events for which atleast one of the population or economic damages are non-zero are filtered out

storm.sel <- select(storm,c(BGN_DATE,END_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,REFNUM))
storm.filt <- storm.sel %>% filter(FATALITIES >0 | INJURIES >0 | PROPDMG >0 | CROPDMG > 0)
dim(storm.filt)
## [1] 254633     10

The year of the event is documented in both the beginning of the event as well as the end of the event. These are converted to datetime object and the year is extracted

storm.filt<-storm.filt %>% mutate(YEAR=year(as.Date(END_DATE,format="%m/%d/%Y")))
storm.filt<-storm.filt %>% mutate(YEAR_B=year(as.Date(BGN_DATE,format="%m/%d/%Y")))
year_grp<-storm.filt %>% group_by(YEAR,YEAR_B) %>% summarise(count=length(FATALITIES))
par(mfrow=c(1,2))
with(year_grp,plot(count~YEAR_B, type="p", col="black", pch = 19, main = 'Number of events',xlab="Year_Beginning"))
abline(v=1996,lty=3)
with(year_grp,plot(count~YEAR, type="p", col="black", pch = 19, xlim = c(1950,2011), main = 'Number of events',xlab='Year_End'))
abline(v=1996,lty=3)

The number of events (y axis) in each year (x axis) is then obtained and plotted - the two panels correspond to the beginning year and the ending year respectively.

We see that only for events occurring since 1996, documentation for both the beginning of the event and the end of the event is available - the year 1996 corresponds to the dotted vertical line in the plot. Thus, only events since 1996 are used for data analysis

years_of_int <- year_grp %>% filter(YEAR > 1995)
storm.filt.yr <- filter(storm.filt, YEAR %in% years_of_int$YEAR)
dim(storm.filt.yr)
## [1] 201318     12

This greatly reduces the number of records.

The next step is to convert the damages to their actual values using their multipliers.

table(storm.filt.yr$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
##   8448      0      0      0      0      0      0      0      0      0 
##      6      7      8      B      h      H      K      m      M 
##      0      0      0     31      0      0 185474      0   7365
table(storm.filt.yr$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 102767      0      0      0      2      0  96787      0   1762

The above filtering leaves behind only three types of multipliers, namely K, M and B which correspond to 10^3, 10^6 and 10^9 respectively. Using this, the actual damages are calculated and stored in PRDM for property damage and CRDM for crop damage.

mltpr <- c(1e3,1e6,1e9)
names(mltpr)<-c('K','M','B')
storm.filt.yr<-storm.filt.yr%>% mutate(PROPDMGEXP=as.character(PROPDMGEXP),CROPDMGEXP=as.character(CROPDMGEXP))
storm.filt.yr<-storm.filt.yr%>% mutate(PRDM = PROPDMG*mltpr[PROPDMGEXP],CRDM = CROPDMG*mltpr[CROPDMGEXP])
head(storm.filt.yr)
##            BGN_DATE          END_DATE       EVTYPE FATALITIES INJURIES
## 1  1/6/1996 0:00:00  1/7/1996 0:00:00 WINTER STORM          0        0
## 2 1/11/1996 0:00:00 1/11/1996 0:00:00      TORNADO          0        0
## 3 1/11/1996 0:00:00 1/11/1996 0:00:00    TSTM WIND          0        0
## 4 1/11/1996 0:00:00 1/11/1996 0:00:00    TSTM WIND          0        0
## 5 1/11/1996 0:00:00 1/11/1996 0:00:00    TSTM WIND          0        0
## 6 1/18/1996 0:00:00 1/19/1996 0:00:00    HIGH WIND          0        0
##   PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM YEAR YEAR_B   PRDM  CRDM
## 1     380          K      38          K 248768 1996   1996 380000 38000
## 2     100          K       0            248769 1996   1996 100000    NA
## 3       3          K       0            248770 1996   1996   3000    NA
## 4       5          K       0            248771 1996   1996   5000    NA
## 5       2          K       0            248772 1996   1996   2000    NA
## 6     400          K       0            248774 1996   1996 400000    NA

The dataset has a lot of duplicate event names - they belong to 48 categories of events as given in the NWSI 10-1605 document page 6. Thus, some of these categories have to be combined/renamed. The first step is to convert all the event names to lower cases, and to trim out excess whitespaces in the EVTYPE variable

length(unique(storm.filt.yr$EVTYPE))
## [1] 222
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=tolower(EVTYPE))
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=gsub("/"," ",EVTYPE))
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=gsub("\\s+", " ", str_trim(EVTYPE)))

Combining event types

Combining the duplicate events into the actual 48 categories is a difficult task considering the ambiguity in the naming scheme of the EVTYPE variable. Instead, considering the overall objective of this project, we could look at the type of events significantly affecting human resources and group only such events.

ev_grp<-storm.filt.yr %>% group_by(EVTYPE) %>% summarise(FAT=sum(FATALITIES,na.rm=T),INJ=sum(INJURIES,na.rm=T),PRDM=sum(PRDM,na.rm=T),CRDM=sum(CRDM,na.rm = T))
counts<-storm.filt.yr %>% group_by(EVTYPE) %>% summarize(count = length(FATALITIES)) %>% select(count)
ev_grp <- cbind(ev_grp,counts)
ev_grp<-arrange(ev_grp,desc(PRDM+CRDM))
head(ev_grp,n=25)
##               EVTYPE  FAT   INJ        PRDM        CRDM count
## 1  hurricane typhoon   64  1275 69305840000  2607872800    72
## 2        storm surge    2    37 43193536000        5000   169
## 3              flood  414  6758 29059833550  4974778400  9513
## 4            tornado 1511 20667 24616945710   283425010 12366
## 5               hail    7   713 14595143420  2476029450 22679
## 6        flash flood  887  1674 15222253910  1334901700 19012
## 7          hurricane   61    46 11812819010  2741410000   126
## 8            drought    0     4  1046101000 13367566000   258
## 9     tropical storm   57   338  7642475550   677711000   410
## 10         high wind  235  1083  5247860360   633561300  5402
## 11          wildfire   75   911  4758667000   295472800   847
## 12         tstm wind  241  3629  4486156440   553915350 61778
## 13  storm surge tide   11     5  4641188000      850000    47
## 14 thunderstorm wind  130  1400  3382654440   398331000 43097
## 15         ice storm   82   318  3642248810    15660000   631
## 16  wild forest fire   12   545  3001782500   106782330   381
## 17      winter storm  191  1292  1532743250    11944000  1460
## 18      extreme cold  115    79    19760400  1308973000   166
## 19        heavy rain   94   230   584864440   728169800  1047
## 20      frost freeze    0     0    10480000  1094186000   117
## 21         lightning  651  4141   743077080     6898440 11152
## 22        heavy snow  107   698   634417540    71122100  1029
## 23           typhoon    0     5   600230000      825000     9
## 24          blizzard   70   385   525658950     7060000   228
## 25    excessive heat 1797  6391     7723700   492402000   685

The major events identified using a preliminary grouping seem to fall in the following categories: - Thunderstorm Wind - Flooding - Hurricane / Typhoon - Storm Surges / Tides - Tornado

Thus, event grouping is only done for cases where ambiguity arises for the above categories ### Thunderstorms The cases where events start with tstm or thunderstorm are identified and renamed to thunderstorm wind

unique(storm.filt.yr[grepl(("^tstm"),storm.filt.yr$EVTYPE),]$EVTYPE)
##  [1] "tstm wind"               "tstm wind hail"         
##  [3] "tstm wind (g45)"         "tstm wind 40"           
##  [5] "tstm wind 45"            "tstm wind (41)"         
##  [7] "tstm wind (g40)"         "tstm wind and lightning"
##  [9] "tstm wind (g35)"         "tstm wind g45"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("^tstm|^thunderstorm",EVTYPE),"thunderstorm wind",EVTYPE))

Flooding

The cases where events start with river are identified and renamed to river flood. Cases containing coastal flood or cstl flood are renamed to coastal flood. And cases containing flash flood are renamed to flash flood

unique(storm.filt.yr[grepl(("flood"),storm.filt.yr$EVTYPE),]$EVTYPE)
##  [1] "flash flood"              "flood"                   
##  [3] "ice jam flood (minor"     "erosion cstl flood"      
##  [5] "river flooding"           "coastal flooding"        
##  [7] "tidal flooding"           "coastal flood"           
##  [9] "river flood"              "coastal flooding erosion"
## [11] "flood flash flood"        "flash flood flood"       
## [13] "lakeshore flood"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("river flood|fld",EVTYPE),"flood",EVTYPE))
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("coastal flood|cstl flood",EVTYPE),"coastal flood",EVTYPE))
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("flash flood",EVTYPE),"flash flood",EVTYPE))

Hurricane/typhoon

The cases where events start with hurricane or typhoon are identified and renamed to hurricane

unique(storm.filt.yr[grepl(("hurricane|typhoon"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "hurricane"         "hurricane edouard" "typhoon"          
## [4] "hurricane typhoon"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("hurricane|typhoon",EVTYPE),"hurricane",EVTYPE))

Storm surge

The cases where events contain storm surge are identified and renamed to storm surge

unique(storm.filt.yr[grepl(("storm surge"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "storm surge"      "storm surge tide"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("storm surge",EVTYPE),"storm surge",EVTYPE))

Results

The dataset is cleaned (partially) to perform downstream analyses. As we are interested in asking questions of the form, which type of events impact the economy or human resources, the data is grouped by events, while summing the damage caused by these events. It is important to note that, a sum is more relevant than the mean damage because, events that are more potent but less frequent may not be as impactful as moderately potent but highly frequent events. Of interest is the total damage caused by any given event.

ev_grp<-storm.filt.yr %>% group_by(EVTYPE) %>% summarise(FAT=sum(FATALITIES,na.rm=T),INJ=sum(INJURIES,na.rm=T),PRDM=sum(PRDM,na.rm=T),CRDM=sum(CRDM,na.rm = T))
counts<-storm.filt.yr %>% group_by(EVTYPE) %>% summarize(count = length(FATALITIES)) %>% select(count)
ev_grp <- cbind(ev_grp,counts)
head(ev_grp)
##                   EVTYPE FAT INJ    PRDM     CRDM count
## 1    agricultural freeze   0   0       0 28820000     3
## 2 astronomical high tide   0   0 9425000        0     8
## 3  astronomical low tide   0   0  320000        0     2
## 4              avalanche 223 156 3711800        0   264
## 5          beach erosion   0   0  100000        0     1
## 6              black ice   1  24       0        0     1

Events impacting population health

The two indicators for population health are fatalities and injuries, corresponding to the variables FAT and INJ respectively, in the grouped by events dataset (ev_grp). Since both fatalities and injuries are relevant a sum of these two variables is more useful than either individually.

ev_grp<-ev_grp %>% mutate(POP_DMG=FAT+INJ)

The next step is to analyze the events to identify the ones that impact population health the most. The dataset is melted by the variables FAT and INJ, and then plotted to see which events are most impactful and also how.

evgrpmelt_pop<-melt(ev_grp,id=c("EVTYPE","POP_DMG"),measure.vars = c("FAT","INJ"))
evgrpmelt_pop<-arrange(evgrpmelt_pop,desc(POP_DMG))
p=ggplot(evgrpmelt_pop[1:10,],aes(y= value,x=reorder(evgrpmelt_pop[1:10,]$EVTYPE,desc(evgrpmelt_pop[1:10,]$POP_DMG)),fill=variable))
p=p+geom_bar(stat="identity")+theme_bw(base_size = 14)
p=p+labs(x="Event type",y="Count",title="Population Damage")
print(p)

The x axis corresponds to the event type, and the y axis corresponds to the number of cases of human damage. The total damage are comprised of both fatalities and injuries, the proportion of each represented by stacked colored bars.

The plot illustrates the top 5 events that cause the most population damage, namely, tornadoes, excessive heat, floods, thunderstorm winds, and lightnings. As one might expect, the number of injuries caused by these events outnumber the number of casualties, which are represented by the different colours on the barplot. The damage due to tornadoes are significantly higher than the other events, and thus, appropriate resource allocation to mitigate tornado events are in due order.

ev_grp<-arrange(ev_grp,desc(POP_DMG))
head(ev_grp)
##              EVTYPE  FAT   INJ        PRDM       CRDM  count POP_DMG
## 1           tornado 1511 20667 24616945710  283425010  12366   22178
## 2    excessive heat 1797  6391     7723700  492402000    685    8188
## 3             flood  444  6838 29244580200 5013161500  10301    7282
## 4 thunderstorm wind  379  5129  7913555880 1016942600 105372    5508
## 5         lightning  651  4141   743077080    6898440  11152    4792
## 6       flash flood  887  1674 15222268910 1334901700  19014    2561

It is also notable that excessive heat though not as frequent as the other events, adversely impacts population far more than floods, thunderstorms etc. It might be helpful to issue heat advisory warnings at appropriate times to counter this.

Events with greatest economic consequences

The two indicators for economic damage are property and crop damages, corresponding to the variables PRDM and CRDM respectively, in the grouped by events dataset (ev_grp). Since both of these are relevant a sum of these two variables is more useful than either individually

ev_grp<-ev_grp %>% mutate(ECO_DMG=PRDM+CRDM)

The next step is to analyze the events to identify the ones that impact the economy the most. The dataset is melted by the variables PRDM and CRDM, and then plotted to see which events are most impactful and also how.

evgrpmelt_eco<-melt(ev_grp,id=c("EVTYPE","ECO_DMG"),measure.vars = c("PRDM","CRDM"))
evgrpmelt_eco<-arrange(evgrpmelt_eco,desc(ECO_DMG))
p=ggplot(evgrpmelt_eco[1:10,],aes(y= value,x=reorder(EVTYPE,desc(ECO_DMG)),fill=variable))
p=p+geom_bar(stat="identity")+theme_bw(base_size = 14)
p=p+labs(x="Event type",y="Value in $",title="Economic Damage")
print(p)

The x axis corresponds to the event type, and the y axis corresponds to the value in $ of losses estimated. The total losses are comprised of both the property losses and crop related losses, the proportion of each represented by stacked colored bars.

The top events that cause the most economic damage are hurricanes, storm surges, floods, tornadoes and hail. The property damage far outweighs the crop damage in all the cases.

ev_grp<-arrange(ev_grp,desc(ECO_DMG))
head(ev_grp)
##        EVTYPE  FAT   INJ        PRDM       CRDM count POP_DMG     ECO_DMG
## 1   hurricane  125  1328 81718889010 5350107800   208    1453 87068996810
## 2 storm surge   13    42 47834724000     855000   216      55 47835579000
## 3       flood  444  6838 29244580200 5013161500 10301    7282 34257741700
## 4     tornado 1511 20667 24616945710  283425010 12366   22178 24900370720
## 5        hail    7   713 14595143420 2476029450 22679     720 17071172870
## 6 flash flood  887  1674 15222268910 1334901700 19014    2561 16557170610

This shows that better resource management for property are implemented keeping in mind not just the more frequent events like floods, tornadoes, but also less frequent but more impactful events like hurricane and storm surges.

It is also notable that events such as tornadoes and floods rank among the top 5 in both these analyses. This data analysis exercise thus offers a clear perspective for coming up with solutions to mitigate disaster events causing maximal human as well as economic damage