Across the United States, which types of severe weather events have the most economic and public health impact? To answer this question, we explore the U.S. National Oceanic and Atmospheric Adminstration’s (NOAA) storm database. We focus on the years 2000-2011, and find top ten events across each of three measures: total monetary damage, injuries, and fatalities.
The results indicate that tornadoes, excessive heat, floods, and hurricanes are the most impactful events.
Since the database is compiled from user reports going as far back as the 1950’s, there are several important considerations to take into account. The event type (EVTYPE) column is not standardized, and contains many more unique values than the 47 described in the documentation. Additionally, going back in time, there are less events reported, thus giving the appearance of less damage. Finally, damage amounts several decades ago would be increased if inflation were accounted for. For these reasons, and to avoid spending the bulk of project time on data cleanup, we consider only events in the year 2000 and onwards.
The data was downloaded from here. Processing starts from the downloaded .bz2 archive.
Read in data and check the dimensions:
data <- read.csv(bzfile('data/repdata-data-StormData.csv.bz2'))
dim(data)
## [1] 902297 37
Remove columns not used in analysis:
keep <- which(names(data) %in% c("EVTYPE","BGN_DATE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
data <- data[, keep]
Convert BGN_DATE (event beginning date) to Date format, and discard data prior to 2000:
data$BGN_DATE <- as.Date(data$BGN_DATE,format="%m/%d/%Y 0:00:00")
data <- data[data$BGN_DATE >= '2000-01-01',]
Further reduce the dataset by removing any rows where damage, fatalities and injuries are all zero:
data <- data[data$FATALITIES != 0 | data$INJURIES != 0 |
data$PROPDMG != 0 | data$CROPDMG != 0, ]
Convert the PROPDMGEXP and CROPDMGEXP columns to integer multipliers, and combine with PROPDMG and CROPDMG to create columns with dollar damage amounts. Then combine these two columns to get TOTAL_DMG, the total dollar damage.
data$PROPDMG_NEW <- NA
data$PROPDMG_NEW <- sapply(1:nrow(data), function(x) {
if (data$PROPDMGEXP[x]=="K") {data$PROPDMG[x] * 1000}
else if (data$PROPDMGEXP[x]=="M") {data$PROPDMG[x] * 1000000}
else if (data$PROPDMGEXP[x]=="B") {data$PROPDMG[x] * 1000000000}
else {data$PROPDMG[x]}
})
data$CROPDMG_NEW <- NA
data$CROPDMG_NEW <- sapply(1:nrow(data), function(x) {
if (data$CROPDMGEXP[x]=="K") {data$CROPDMG[x] * 1000}
else if (data$CROPDMGEXP[x]=="M") {data$CROPDMG[x] * 1000000}
else if (data$CROPDMGEXP[x]=="B") {data$CROPDMG[x] * 1000000000}
else {data$CROPDMG[x]}
})
data$TOTAL_DMG <- data$PROPDMG_NEW + data$CROPDMG_NEW
Some of the values in EVTYPE have more than one spelling. For example, TSTM WIND and THUNDERSTORM WIND refer to the same event type. Examine the top 50 current `EVTYPE values:
head(sort(table(data$EVTYPE),decreasing=T),50)
##
## THUNDERSTORM WIND TSTM WIND HAIL
## 43097 37975 17732
## FLASH FLOOD TORNADO LIGHTNING
## 14950 9495 8377
## FLOOD HIGH WIND STRONG WIND
## 7857 4229 3344
## WINTER STORM WILDFIRE HEAVY RAIN
## 1098 847 846
## HEAVY SNOW ICE STORM EXCESSIVE HEAT
## 682 468 449
## WINTER WEATHER TROPICAL STORM RIP CURRENT
## 402 381 363
## WILD/FOREST FIRE AVALANCHE LAKE-EFFECT SNOW
## 227 224 194
## LANDSLIDE DROUGHT URBAN/SML STREAM FLD
## 190 187 181
## HEAT TSTM WIND/HAIL BLIZZARD
## 159 156 154
## RIP CURRENTS COASTAL FLOOD WINTER WEATHER/MIX
## 146 144 139
## FROST/FREEZE EXTREME COLD/WIND CHILL HIGH SURF
## 116 111 110
## MARINE TSTM WIND COLD/WIND CHILL STORM SURGE
## 109 90 89
## DUST STORM HURRICANE/TYPHOON DUST DEVIL
## 77 72 70
## LIGHT SNOW DENSE FOG FOG
## 65 58 53
## HEAVY SURF/HIGH SURF STORM SURGE/TIDE MARINE STRONG WIND
## 50 47 46
## HURRICANE EXTREME COLD DRY MICROBURST
## 43 41 38
## WIND TROPICAL DEPRESSION
## 36 35
Fix some frequently-occurring strings:
levels(data$EVTYPE)[levels(data$EVTYPE)=='TSTM WIND'] <- 'THUNDERSTORM WIND'
levels(data$EVTYPE)[levels(data$EVTYPE)=='TSTM WIND/HAIL'] <- 'THUNDERSTORM WIND'
levels(data$EVTYPE)[levels(data$EVTYPE)=='THUNDERSTORM WIND/HAIL'] <- 'THUNDERSTORM WIND'
levels(data$EVTYPE)[levels(data$EVTYPE)=='WILD/FOREST FIRE'] <- 'WILDFIRE'
levels(data$EVTYPE)[levels(data$EVTYPE)=='URBAN/SML STREAM FLD'] <- 'FLOOD'
levels(data$EVTYPE)[levels(data$EVTYPE)=='RIP CURRENTS'] <- 'RIP CURRENT'
levels(data$EVTYPE)[levels(data$EVTYPE)=='WINTER WEATHER/MIX'] <- 'WINTER WEATHER'
levels(data$EVTYPE)[levels(data$EVTYPE)=='STORM SURGE/TIDE'] <- 'STORM SURGE'
Drop unused EVTYPE levels, and check number of levels and dimensions of the processed dataset:
data$EVTYPE <- droplevels(data$EVTYPE)
length(levels(data$EVTYPE))
## [1] 102
dim(data)
## [1] 156334 11
The data is now ready for analysis.
In this section, we aggregate the data by event type to find the most impactful events.
For each of the three measures (total damage, injuries, fatalities), group each variable by event type, and calculate the percentage of total for each event:
damage_by_evtype <- aggregate(TOTAL_DMG ~ EVTYPE, data=data, sum)
damage_by_evtype <- damage_by_evtype[order(-damage_by_evtype$TOTAL_DMG),]
damage_by_evtype$TOTAL_DMG_PERC <- round(100*damage_by_evtype$TOTAL_DMG /
sum(damage_by_evtype$TOTAL_DMG),2)
injuries_by_evtype <- aggregate(INJURIES ~ EVTYPE,data=data,sum)
injuries_by_evtype <- injuries_by_evtype[order(-injuries_by_evtype$INJURIES),]
injuries_by_evtype$INJURIES_PERC <- round(100*injuries_by_evtype$INJURIES /
sum(injuries_by_evtype$INJURIES),2)
fatalities_by_evtype <- aggregate(FATALITIES ~ EVTYPE,data=data,sum)
fatalities_by_evtype <- fatalities_by_evtype[order(-fatalities_by_evtype$FATALITIES),]
fatalities_by_evtype$FATALITIES_PERC <- round(100*fatalities_by_evtype$FATALITIES /
sum(fatalities_by_evtype$FATALITIES),2)
For each measure, get top 10 events, calculate percentage of total contained in the top 10, and list the top 10 events for each group.
damage_top10 <- head(damage_by_evtype,10)
damage_top10 <- transform(damage_top10, EVTYPE = reorder(EVTYPE,TOTAL_DMG))
injuries_top10 <- head(injuries_by_evtype,10)
injuries_top10 <- transform(injuries_top10, EVTYPE = reorder(EVTYPE,INJURIES))
fatalities_top10 <- head(fatalities_by_evtype,10)
fatalities_top10 <- transform(fatalities_top10, EVTYPE = reorder(EVTYPE,FATALITIES))
damage_top10_perc <- sum(damage_top10$TOTAL_DMG_PERC)
injuries_top10_perc <- sum(injuries_top10$INJURIES_PERC)
fatalities_top10_perc <- sum(fatalities_top10$FATALITIES_PERC)
print(damage_top10)
## EVTYPE TOTAL_DMG TOTAL_DMG_PERC
## 29 FLOOD 1.389e+11 39.20
## 49 HURRICANE/TYPHOON 7.191e+10 20.29
## 80 STORM SURGE 4.781e+10 13.49
## 86 TORNADO 1.970e+10 5.56
## 37 HAIL 1.377e+10 3.89
## 28 FLASH FLOOD 1.278e+10 3.61
## 18 DROUGHT 9.982e+09 2.82
## 88 TROPICAL STORM 7.607e+09 2.15
## 98 WILDFIRE 7.419e+09 2.09
## 84 THUNDERSTORM WIND 6.246e+09 1.76
print(injuries_top10)
## EVTYPE INJURIES INJURIES_PERC
## 86 TORNADO 15213 43.30
## 23 EXCESSIVE HEAT 3708 10.55
## 84 THUNDERSTORM WIND 3187 9.07
## 59 LIGHTNING 2993 8.52
## 49 HURRICANE/TYPHOON 1275 3.63
## 39 HEAT 1222 3.48
## 98 WILDFIRE 1197 3.41
## 28 FLASH FLOOD 812 2.31
## 47 HIGH WIND 677 1.93
## 37 HAIL 540 1.54
print(fatalities_top10)
## EVTYPE FATALITIES FATALITIES_PERC
## 86 TORNADO 1193 19.90
## 23 EXCESSIVE HEAT 1013 16.90
## 28 FLASH FLOOD 600 10.01
## 59 LIGHTNING 466 7.77
## 73 RIP CURRENT 462 7.71
## 29 FLOOD 278 4.64
## 84 THUNDERSTORM WIND 248 4.14
## 39 HEAT 231 3.85
## 6 AVALANCHE 179 2.99
## 47 HIGH WIND 131 2.19
Create plots for each of the top 10 measures:
library(ggplot2)
options(scipen = 999)
p_damage <- ggplot(data=damage_top10, aes(x=EVTYPE,y=TOTAL_DMG/1e+09)) +
geom_bar(stat="identity",aes(fill=TOTAL_DMG_PERC)) +
theme(text = element_text(size=10)) +
labs(title="Top 10 Events by Economic Damage (2000-2011)",
x = "Event", y="Damage (Billion USD)") +
coord_flip()
p_injuries <- ggplot(data=injuries_top10, aes(x=EVTYPE,y=INJURIES)) +
geom_bar(stat="identity",aes(fill=INJURIES_PERC)) +
theme(text = element_text(size=10)) +
labs(title="Top 10 Events by Injuries (2000-2011)",
x = "Event", y="Injuries") +
coord_flip()
p_fatalities <- ggplot(data=fatalities_top10, aes(x=EVTYPE,y=FATALITIES)) +
geom_bar(stat="identity",aes(fill=FATALITIES_PERC)) +
theme(text = element_text(size=10)) +
labs(title="Top 10 Events by Fatalities (2000-2011)",
x = "Event", y="Fatalities") +
coord_flip()
The plots at the end of the report show the top ten event types in each of three measures for 2000-2011: economic damage (property and crop) in US dollars, injury counts, and fatality counts. In each plot, the bar fill gradient corresponds to percentage of the total.
The top-10 groups account 94.86%, 87.74%, and 80.1% of the total damage, injuries, and fatalities, respectively.
Tornadoes were most harmful in terms of injuries, causing 15213 injuries (43% of total).
Tornadoes and excessive heat were most harmful in terms of fatalities, causing 1193 and 1013 deaths (20% and 17% of total).
However, the most economic damage was caused by floods and hurricanes/typhoons, with 138.9 billion and 71.9 billion U.S. dollars in damage (39% and 20% of total).
print(p_damage)
print(p_injuries)
print(p_fatalities)
This analysis was written as the course project for the Reproducible Research Course on Coursera.