The dataset comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data analysis is preceded by cleaning the data - identification of records for which sufficient documentation is available. In addition, the data analysis consists of resolving various duplications of the event names provided in the database. Using the cleaned version of the dataset, the events responsible for maximal human and economic damage in the last 15 years are identified. Events such as tornadoes and floods are identified as principal causative events for severe human as well as economic damage. Appropriate management of resources and timely warnings can mitigate the damage caused by these events.
suppressWarnings(library(dplyr))
suppressWarnings(library(lubridate))
suppressWarnings(library(stringr))
suppressWarnings(library(reshape2))
suppressWarnings(library(ggplot2))
storm<-read.csv(bzfile("repdata-data-StormData.csv.bz2"))
dim(storm)
## [1] 902297 37
One of the records had a typo in it. This corresponds to the record with REFNUM 60493 (60593rd row in the dataset). The PROPDMGEXP variable, which is a multiplier was mentioned as B instead of M. This was corrected. Thanks to Community TA Gregory Horne for the tip.
storm[605953,]$PROPDMGEXP <- as.factor("M") #typo fixed. Thanks to Greg Horne, CTA, REFNUM = 605963
The next step is to weed out the relevant variables in the dataset. These correspond to the Beginning and End years, fatalities, injuries, property damage PROPDMG, crop damage CRPDMG and their corresponding multipliers - PROPDMGEXP and CROPDMGEXP, along with the REFNUM for the record. Following this the events for which atleast one of the population or economic damages are non-zero are filtered out
storm.sel <- select(storm,c(BGN_DATE,END_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,REFNUM))
storm.filt <- storm.sel %>% filter(FATALITIES >0 | INJURIES >0 | PROPDMG >0 | CROPDMG > 0)
dim(storm.filt)
## [1] 254633 10
The year of the event is documented in both the beginning of the event as well as the end of the event. These are converted to datetime object and the year is extracted
storm.filt<-storm.filt %>% mutate(YEAR=year(as.Date(END_DATE,format="%m/%d/%Y")))
storm.filt<-storm.filt %>% mutate(YEAR_B=year(as.Date(BGN_DATE,format="%m/%d/%Y")))
year_grp<-storm.filt %>% group_by(YEAR,YEAR_B) %>% summarise(count=length(FATALITIES))
par(mfrow=c(1,2))
with(year_grp,plot(count~YEAR_B, type="p", col="black", pch = 19, main = 'Number of events',xlab="Year_Beginning"))
abline(v=1996,lty=3)
with(year_grp,plot(count~YEAR, type="p", col="black", pch = 19, xlim = c(1950,2011), main = 'Number of events',xlab='Year_End'))
abline(v=1996,lty=3)
The number of events (y axis) in each year (x axis) is then obtained and plotted - the two panels correspond to the beginning year and the ending year respectively.
We see that only for events occurring since 1996, documentation for both the beginning of the event and the end of the event is available - the year 1996 corresponds to the dotted vertical line in the plot. Thus, only events since 1996 are used for data analysis
years_of_int <- year_grp %>% filter(YEAR > 1995)
storm.filt.yr <- filter(storm.filt, YEAR %in% years_of_int$YEAR)
dim(storm.filt.yr)
## [1] 201318 12
This greatly reduces the number of records.
The next step is to convert the damages to their actual values using their multipliers.
table(storm.filt.yr$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 8448 0 0 0 0 0 0 0 0 0
## 6 7 8 B h H K m M
## 0 0 0 31 0 0 185474 0 7365
table(storm.filt.yr$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 102767 0 0 0 2 0 96787 0 1762
The above filtering leaves behind only three types of multipliers, namely K, M and B which correspond to 10^3, 10^6 and 10^9 respectively. Using this, the actual damages are calculated and stored in PRDM for property damage and CRDM for crop damage.
mltpr <- c(1e3,1e6,1e9)
names(mltpr)<-c('K','M','B')
storm.filt.yr<-storm.filt.yr%>% mutate(PROPDMGEXP=as.character(PROPDMGEXP),CROPDMGEXP=as.character(CROPDMGEXP))
storm.filt.yr<-storm.filt.yr%>% mutate(PRDM = PROPDMG*mltpr[PROPDMGEXP],CRDM = CROPDMG*mltpr[CROPDMGEXP])
head(storm.filt.yr)
## BGN_DATE END_DATE EVTYPE FATALITIES INJURIES
## 1 1/6/1996 0:00:00 1/7/1996 0:00:00 WINTER STORM 0 0
## 2 1/11/1996 0:00:00 1/11/1996 0:00:00 TORNADO 0 0
## 3 1/11/1996 0:00:00 1/11/1996 0:00:00 TSTM WIND 0 0
## 4 1/11/1996 0:00:00 1/11/1996 0:00:00 TSTM WIND 0 0
## 5 1/11/1996 0:00:00 1/11/1996 0:00:00 TSTM WIND 0 0
## 6 1/18/1996 0:00:00 1/19/1996 0:00:00 HIGH WIND 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM YEAR YEAR_B PRDM CRDM
## 1 380 K 38 K 248768 1996 1996 380000 38000
## 2 100 K 0 248769 1996 1996 100000 NA
## 3 3 K 0 248770 1996 1996 3000 NA
## 4 5 K 0 248771 1996 1996 5000 NA
## 5 2 K 0 248772 1996 1996 2000 NA
## 6 400 K 0 248774 1996 1996 400000 NA
The dataset has a lot of duplicate event names - they belong to 48 categories of events as given in the NWSI 10-1605 document page 6. Thus, some of these categories have to be combined/renamed. The first step is to convert all the event names to lower cases, and to trim out excess whitespaces in the EVTYPE variable
length(unique(storm.filt.yr$EVTYPE))
## [1] 222
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=tolower(EVTYPE))
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=gsub("/"," ",EVTYPE))
storm.filt.yr<-storm.filt.yr%>% mutate(EVTYPE=gsub("\\s+", " ", str_trim(EVTYPE)))
Combining the duplicate events into the actual 48 categories is a difficult task considering the ambiguity in the naming scheme of the EVTYPE variable. Instead, considering the overall objective of this project, we could look at the type of events significantly affecting human resources and group only such events.
ev_grp<-storm.filt.yr %>% group_by(EVTYPE) %>% summarise(FAT=sum(FATALITIES,na.rm=T),INJ=sum(INJURIES,na.rm=T),PRDM=sum(PRDM,na.rm=T),CRDM=sum(CRDM,na.rm = T))
counts<-storm.filt.yr %>% group_by(EVTYPE) %>% summarize(count = length(FATALITIES)) %>% select(count)
ev_grp <- cbind(ev_grp,counts)
ev_grp<-arrange(ev_grp,desc(PRDM+CRDM))
head(ev_grp,n=25)
## EVTYPE FAT INJ PRDM CRDM count
## 1 hurricane typhoon 64 1275 69305840000 2607872800 72
## 2 storm surge 2 37 43193536000 5000 169
## 3 flood 414 6758 29059833550 4974778400 9513
## 4 tornado 1511 20667 24616945710 283425010 12366
## 5 hail 7 713 14595143420 2476029450 22679
## 6 flash flood 887 1674 15222253910 1334901700 19012
## 7 hurricane 61 46 11812819010 2741410000 126
## 8 drought 0 4 1046101000 13367566000 258
## 9 tropical storm 57 338 7642475550 677711000 410
## 10 high wind 235 1083 5247860360 633561300 5402
## 11 wildfire 75 911 4758667000 295472800 847
## 12 tstm wind 241 3629 4486156440 553915350 61778
## 13 storm surge tide 11 5 4641188000 850000 47
## 14 thunderstorm wind 130 1400 3382654440 398331000 43097
## 15 ice storm 82 318 3642248810 15660000 631
## 16 wild forest fire 12 545 3001782500 106782330 381
## 17 winter storm 191 1292 1532743250 11944000 1460
## 18 extreme cold 115 79 19760400 1308973000 166
## 19 heavy rain 94 230 584864440 728169800 1047
## 20 frost freeze 0 0 10480000 1094186000 117
## 21 lightning 651 4141 743077080 6898440 11152
## 22 heavy snow 107 698 634417540 71122100 1029
## 23 typhoon 0 5 600230000 825000 9
## 24 blizzard 70 385 525658950 7060000 228
## 25 excessive heat 1797 6391 7723700 492402000 685
The major events identified using a preliminary grouping seem to fall in the following categories: - Thunderstorm Wind - Flooding - Hurricane / Typhoon - Storm Surges / Tides - Tornado
Thus, event grouping is only done for cases where ambiguity arises for the above categories ### Thunderstorms The cases where events start with tstm or thunderstorm are identified and renamed to thunderstorm wind
unique(storm.filt.yr[grepl(("^tstm"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "tstm wind" "tstm wind hail"
## [3] "tstm wind (g45)" "tstm wind 40"
## [5] "tstm wind 45" "tstm wind (41)"
## [7] "tstm wind (g40)" "tstm wind and lightning"
## [9] "tstm wind (g35)" "tstm wind g45"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("^tstm|^thunderstorm",EVTYPE),"thunderstorm wind",EVTYPE))
The cases where events start with river are identified and renamed to river flood. Cases containing coastal flood or cstl flood are renamed to coastal flood. And cases containing flash flood are renamed to flash flood
unique(storm.filt.yr[grepl(("flood"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "flash flood" "flood"
## [3] "ice jam flood (minor" "erosion cstl flood"
## [5] "river flooding" "coastal flooding"
## [7] "tidal flooding" "coastal flood"
## [9] "river flood" "coastal flooding erosion"
## [11] "flood flash flood" "flash flood flood"
## [13] "lakeshore flood"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("river flood|fld",EVTYPE),"flood",EVTYPE))
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("coastal flood|cstl flood",EVTYPE),"coastal flood",EVTYPE))
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("flash flood",EVTYPE),"flash flood",EVTYPE))
The cases where events start with hurricane or typhoon are identified and renamed to hurricane
unique(storm.filt.yr[grepl(("hurricane|typhoon"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "hurricane" "hurricane edouard" "typhoon"
## [4] "hurricane typhoon"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("hurricane|typhoon",EVTYPE),"hurricane",EVTYPE))
The cases where events contain storm surge are identified and renamed to storm surge
unique(storm.filt.yr[grepl(("storm surge"),storm.filt.yr$EVTYPE),]$EVTYPE)
## [1] "storm surge" "storm surge tide"
storm.filt.yr<- storm.filt.yr %>% mutate(EVTYPE = ifelse(grepl("storm surge",EVTYPE),"storm surge",EVTYPE))
The dataset is cleaned (partially) to perform downstream analyses. As we are interested in asking questions of the form, which type of events impact the economy or human resources, the data is grouped by events, while summing the damage caused by these events. It is important to note that, a sum is more relevant than the mean damage because, events that are more potent but less frequent may not be as impactful as moderately potent but highly frequent events. Of interest is the total damage caused by any given event.
ev_grp<-storm.filt.yr %>% group_by(EVTYPE) %>% summarise(FAT=sum(FATALITIES,na.rm=T),INJ=sum(INJURIES,na.rm=T),PRDM=sum(PRDM,na.rm=T),CRDM=sum(CRDM,na.rm = T))
counts<-storm.filt.yr %>% group_by(EVTYPE) %>% summarize(count = length(FATALITIES)) %>% select(count)
ev_grp <- cbind(ev_grp,counts)
head(ev_grp)
## EVTYPE FAT INJ PRDM CRDM count
## 1 agricultural freeze 0 0 0 28820000 3
## 2 astronomical high tide 0 0 9425000 0 8
## 3 astronomical low tide 0 0 320000 0 2
## 4 avalanche 223 156 3711800 0 264
## 5 beach erosion 0 0 100000 0 1
## 6 black ice 1 24 0 0 1
The two indicators for population health are fatalities and injuries, corresponding to the variables FAT and INJ respectively, in the grouped by events dataset (ev_grp). Since both fatalities and injuries are relevant a sum of these two variables is more useful than either individually.
ev_grp<-ev_grp %>% mutate(POP_DMG=FAT+INJ)
The next step is to analyze the events to identify the ones that impact population health the most. The dataset is melted by the variables FAT and INJ, and then plotted to see which events are most impactful and also how.
evgrpmelt_pop<-melt(ev_grp,id=c("EVTYPE","POP_DMG"),measure.vars = c("FAT","INJ"))
evgrpmelt_pop<-arrange(evgrpmelt_pop,desc(POP_DMG))
p=ggplot(evgrpmelt_pop[1:10,],aes(y= value,x=reorder(evgrpmelt_pop[1:10,]$EVTYPE,desc(evgrpmelt_pop[1:10,]$POP_DMG)),fill=variable))
p=p+geom_bar(stat="identity")+theme_bw(base_size = 14)
p=p+labs(x="Event type",y="Count",title="Population Damage")
print(p)
The x axis corresponds to the event type, and the y axis corresponds to the number of cases of human damage. The total damage are comprised of both fatalities and injuries, the proportion of each represented by stacked colored bars.
The plot illustrates the top 5 events that cause the most population damage, namely, tornadoes, excessive heat, floods, thunderstorm winds, and lightnings. As one might expect, the number of injuries caused by these events outnumber the number of casualties, which are represented by the different colours on the barplot. The damage due to tornadoes are significantly higher than the other events, and thus, appropriate resource allocation to mitigate tornado events are in due order.
ev_grp<-arrange(ev_grp,desc(POP_DMG))
head(ev_grp)
## EVTYPE FAT INJ PRDM CRDM count POP_DMG
## 1 tornado 1511 20667 24616945710 283425010 12366 22178
## 2 excessive heat 1797 6391 7723700 492402000 685 8188
## 3 flood 444 6838 29244580200 5013161500 10301 7282
## 4 thunderstorm wind 379 5129 7913555880 1016942600 105372 5508
## 5 lightning 651 4141 743077080 6898440 11152 4792
## 6 flash flood 887 1674 15222268910 1334901700 19014 2561
It is also notable that excessive heat though not as frequent as the other events, adversely impacts population far more than floods, thunderstorms etc. It might be helpful to issue heat advisory warnings at appropriate times to counter this.
The two indicators for economic damage are property and crop damages, corresponding to the variables PRDM and CRDM respectively, in the grouped by events dataset (ev_grp). Since both of these are relevant a sum of these two variables is more useful than either individually
ev_grp<-ev_grp %>% mutate(ECO_DMG=PRDM+CRDM)
The next step is to analyze the events to identify the ones that impact the economy the most. The dataset is melted by the variables PRDM and CRDM, and then plotted to see which events are most impactful and also how.
evgrpmelt_eco<-melt(ev_grp,id=c("EVTYPE","ECO_DMG"),measure.vars = c("PRDM","CRDM"))
evgrpmelt_eco<-arrange(evgrpmelt_eco,desc(ECO_DMG))
p=ggplot(evgrpmelt_eco[1:10,],aes(y= value,x=reorder(EVTYPE,desc(ECO_DMG)),fill=variable))
p=p+geom_bar(stat="identity")+theme_bw(base_size = 14)
p=p+labs(x="Event type",y="Value in $",title="Economic Damage")
print(p)
The x axis corresponds to the event type, and the y axis corresponds to the value in $ of losses estimated. The total losses are comprised of both the property losses and crop related losses, the proportion of each represented by stacked colored bars.
The top events that cause the most economic damage are hurricanes, storm surges, floods, tornadoes and hail. The property damage far outweighs the crop damage in all the cases.
ev_grp<-arrange(ev_grp,desc(ECO_DMG))
head(ev_grp)
## EVTYPE FAT INJ PRDM CRDM count POP_DMG ECO_DMG
## 1 hurricane 125 1328 81718889010 5350107800 208 1453 87068996810
## 2 storm surge 13 42 47834724000 855000 216 55 47835579000
## 3 flood 444 6838 29244580200 5013161500 10301 7282 34257741700
## 4 tornado 1511 20667 24616945710 283425010 12366 22178 24900370720
## 5 hail 7 713 14595143420 2476029450 22679 720 17071172870
## 6 flash flood 887 1674 15222268910 1334901700 19014 2561 16557170610
This shows that better resource management for property are implemented keeping in mind not just the more frequent events like floods, tornadoes, but also less frequent but more impactful events like hurricane and storm surges.
It is also notable that events such as tornadoes and floods rank among the top 5 in both these analyses. This data analysis exercise thus offers a clear perspective for coming up with solutions to mitigate disaster events causing maximal human as well as economic damage