This study uses the NOAA storm database to assess the damage caused by each type of weather event. The events in the database span the years 1950 to 2011. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis shows that tornadoes are by far the greatest weather hazard to human health, while floods are the most harmful to the economy.
Since the records span well over 50 years, the database contains many inconsistencies. Let’s first take a look at the data before we start cleaning it.
library(ggplot2) # for plot
library(dplyr) # for data manipulation
library(car) # for recode()
df <- read.csv('repdata-data-StormData.csv.bz2')
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
The columns that we are interested in are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. The event type column has 985 distinct entries, and many of them are variants of the same event. Grouping every similar event would be time-consuming, so for this study I group only the most frequent ones.
event <- df %>%
group_by(EVTYPE) %>%
summarise(occurrence = sum(!is.na(EVTYPE))) %>%
arrange(desc(occurrence))
event[1:20,]
## Source: local data frame [20 x 2]
##
## EVTYPE occurrence
## (fctr) (int)
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15754
## 10 HEAVY SNOW 15708
## 11 HEAVY RAIN 11723
## 12 WINTER STORM 11433
## 13 WINTER WEATHER 7026
## 14 FUNNEL CLOUD 6839
## 15 MARINE TSTM WIND 6175
## 16 MARINE THUNDERSTORM WIND 5812
## 17 WATERSPOUT 3796
## 18 STRONG WIND 3566
## 19 URBAN/SML STREAM FLD 3392
## 20 WILDFIRE 2761
So I will merge “TSTM WIND”, “THUNDERSTORM WINDS” and “THUNDERSTORM WIND” into one group, and likewise “MARINE TSTM WIND” and “MARINE THUNDERSTORM WIND”. The top 20 events then become:
df[df$EVTYPE == "TSTM WIND", ]$EVTYPE = "THUNDERSTORM WIND"
df[df$EVTYPE == "THUNDERSTORM WINDS", ]$EVTYPE = "THUNDERSTORM WIND"
df[df$EVTYPE == "MARINE TSTM WIND", ]$EVTYPE = "MARINE THUNDERSTORM WIND"
event <- df %>%
group_by(EVTYPE) %>%
summarise(occurrence = sum(!is.na(EVTYPE))) %>%
arrange(desc(occurrence))
event[1:20,]
## Source: local data frame [20 x 2]
##
## EVTYPE occurrence
## (fctr) (int)
## 1 THUNDERSTORM WIND 323346
## 2 HAIL 288661
## 3 TORNADO 60652
## 4 FLASH FLOOD 54277
## 5 FLOOD 25326
## 6 HIGH WIND 20212
## 7 LIGHTNING 15754
## 8 HEAVY SNOW 15708
## 9 MARINE THUNDERSTORM WIND 11987
## 10 HEAVY RAIN 11723
## 11 WINTER STORM 11433
## 12 WINTER WEATHER 7026
## 13 FUNNEL CLOUD 6839
## 14 WATERSPOUT 3796
## 15 STRONG WIND 3566
## 16 URBAN/SML STREAM FLD 3392
## 17 WILDFIRE 2761
## 18 BLIZZARD 2719
## 19 DROUGHT 2488
## 20 ICE STORM 2006
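The same kind of merging can also be driven by a lookup table, which is easier to extend than one assignment per label. The following is only a minimal sketch on a toy character vector: the labels, the `synonyms` table and the extra case and whitespace normalisation are illustrative assumptions, not part of the analysis above. A factor column such as `df$EVTYPE` would first need `as.character()`.

```r
# Toy event labels (hypothetical, not taken from the storm database)
evtype <- c("TSTM WIND", " Thunderstorm Winds", "HAIL", "hail ",
            "MARINE TSTM WIND", "TORNADO")

# Step 1: normalise case and strip surrounding whitespace
clean <- toupper(trimws(evtype))

# Step 2: map known synonyms onto one canonical label
# (hypothetical lookup table; extend it as more variants are found)
synonyms <- c("TSTM WIND"          = "THUNDERSTORM WIND",
              "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
              "MARINE TSTM WIND"   = "MARINE THUNDERSTORM WIND")
hit <- clean %in% names(synonyms)
clean[hit] <- unname(synonyms[clean[hit]])

# Step 3: count occurrences per merged label, most frequent first
counts <- sort(table(clean), decreasing = TRUE)
print(counts)
```

In this toy example the six raw labels collapse into four groups. Labels that are not in the lookup table are left unchanged, so rare variants stay separate, as they do in the study.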
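Next, we need to set the damage exponents correctly. I first identify the unique entries and then assign a multiplier to the labels that can be interpreted (h/H = 1e2, K/k = 1e3, M/m = 1e6, B = 1e9). All other labels are given a multiplier of 0, so those records contribute nothing to the damage totals.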
unique(df$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
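Given the labels listed above, the mapping from exponent label to multiplier can be sketched in base R with a named lookup vector. This is an illustrative alternative under stated assumptions: `exp_to_multiplier` is a hypothetical helper, the `labels` and `damage` vectors are made-up toy values, and matching is case-insensitive, which treats every recognised label the same way as the recoding described in the text.

```r
# Hypothetical helper: convert exponent labels to numeric multipliers
exp_to_multiplier <- function(label) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Case-insensitive match; labels without a match come back as NA
  mult <- unname(lookup[toupper(as.character(label))])
  # Unrecognised labels ("?", "+", digits, empty) get multiplier 0
  mult[is.na(mult)] <- 0
  mult
}

# Toy values (not taken from the storm database)
labels  <- c("K", "m", "B", "h", "?", "5", "")
damage  <- c(25, 2.5, 1, 3, 10, 10, 10)
dollars <- damage * exp_to_multiplier(labels)
print(dollars)
```

Records with an unrecognised label end up with a damage of 0 dollars, which matches the choice made in this study. The study itself performs this step with `recode()` from the car package: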
df$PROPDMGEXP <- as.numeric(recode(as.character(df$PROPDMGEXP),
"'K'=1e+3;'M'=1e+6;'B'=1e+9;'m'=1e+6;'h'=1e+2;'H'=1e+2;else = 0"))
df$CROPDMGEXP <- as.numeric(recode(as.character(df$CROPDMGEXP),
"'M'=1e+6;'K'=1e+3;'m'=1e+6;'B'=1e+9;'k'=1e+3;else=0"))
Across the United States, which types of events are most harmful with respect to population health?
damage.health <- df %>%
group_by(EVTYPE) %>%
summarise(Total = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE)) %>%
arrange(desc(Total))
ggplot(data = damage.health[1:5,], aes(x = EVTYPE, y = Total, fill = EVTYPE)) +
geom_bar(stat = 'identity') + scale_x_discrete(limits=unique(damage.health[1:5,]$EVTYPE))+
ylab('Total Fatalities and Injuries') + ggtitle('Top 5 Events that Lead to Health Damage') + scale_fill_discrete(name="Event Type")+
theme_bw(base_size = 15) +
theme(legend.position = c(0.6, 0.6), axis.text.x = element_text(angle = 20, hjust = 0.5), axis.title.x = element_blank())
Fig.1 Top 5 Events that Lead to Health Damage
As we can see from the plot above, tornadoes are by far the most harmful hazard to population health. In fact, the total fatalities and injuries caused by tornadoes exceed those of all other events combined: tornadoes account for 62% of the total.
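A share like the one quoted above is the total for one event type divided by the sum over all event types. The sketch below shows the arithmetic on a toy data frame. The event names and numbers are hypothetical and are not the study's results. On the real data the same expression would be applied to `damage.health$Total`.

```r
# Toy totals per event type (hypothetical numbers, not the study's results)
totals <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  Total  = c(600, 250, 100, 50)
)

# Share of the overall total, in percent
totals$share <- 100 * totals$Total / sum(totals$Total)

# Does the top event exceed all other events combined?
top <- totals[which.max(totals$Total), ]
exceeds_rest <- top$Total > sum(totals$Total) - top$Total

print(totals)
print(exceeds_rest)
```

An event exceeds all others combined exactly when its share is above 50%, so the two checks agree.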
Across the United States, which types of events have the greatest economic consequences?
damage.economic <- df %>%
group_by(EVTYPE) %>%
summarise(Total = sum(PROPDMG * PROPDMGEXP, na.rm = TRUE) + sum(CROPDMG * CROPDMGEXP, na.rm = TRUE)) %>%
arrange(desc(Total))
ggplot(data = damage.economic[1:5,], aes(x = EVTYPE, y = Total/1000000, fill = EVTYPE)) +
geom_bar(stat = 'identity') + scale_x_discrete(limits=unique(damage.economic[1:5,]$EVTYPE))+
ylab('Total Economic Damage (M$)') + ggtitle('Top 5 Events that Lead to Economic Damage') + scale_fill_discrete(name="Event Type")+
theme_bw(base_size = 15) +
theme(legend.position = c(0.6, 0.7), axis.text.x = element_text(angle = 20, hjust = 0.5), axis.title.x = element_blank())
Fig.2 Top 5 Events that Lead to Economic Damage
When it comes to economic damage, floods are the most harmful. They account for 32% of the total economic loss. Note that the plot shows economic loss in millions of dollars.
I have used the NOAA storm database to identify the most harmful weather hazards in terms of population health and economic loss. While the data show inconsistencies over time, I have corrected the most frequent entries. The results are only slightly affected by the less frequent events and are therefore considered trustworthy.