Synopsis of Study

This study uses the NOAA storm database to assess the damage caused by different types of weather events. The events in the database start in 1950 and end in 2011. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The analysis shows that tornadoes are by far the greatest weather hazard to human health, while floods are the most harmful to the economy.

Data Processing

Because the records span more than 60 years, the database contains many inconsistencies. Let’s first take a look at the data before we start cleaning it.

library(ggplot2) # for plot
library(dplyr) # for data manipulation
library(car) # for recode()
df <- read.csv('repdata-data-StormData.csv.bz2')
head(df)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

The columns that we are interested in are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. For event type, there are 985 distinct entries, many of which are variants of the same event. Grouping all similar events together would be time consuming, so for this study I group only the most frequent ones.

event <- df %>%
  group_by(EVTYPE) %>%
  summarise(occurrence = sum(!is.na(EVTYPE))) %>%
  arrange(desc(occurrence))
event[1:20,]
## Source: local data frame [20 x 2]
## 
##                      EVTYPE occurrence
##                      (fctr)      (int)
## 1                      HAIL     288661
## 2                 TSTM WIND     219940
## 3         THUNDERSTORM WIND      82563
## 4                   TORNADO      60652
## 5               FLASH FLOOD      54277
## 6                     FLOOD      25326
## 7        THUNDERSTORM WINDS      20843
## 8                 HIGH WIND      20212
## 9                 LIGHTNING      15754
## 10               HEAVY SNOW      15708
## 11               HEAVY RAIN      11723
## 12             WINTER STORM      11433
## 13           WINTER WEATHER       7026
## 14             FUNNEL CLOUD       6839
## 15         MARINE TSTM WIND       6175
## 16 MARINE THUNDERSTORM WIND       5812
## 17               WATERSPOUT       3796
## 18              STRONG WIND       3566
## 19     URBAN/SML STREAM FLD       3392
## 20                 WILDFIRE       2761

I therefore merge “TSTM WIND”, “THUNDERSTORM WINDS” and “THUNDERSTORM WIND” into one category, and “MARINE TSTM WIND” and “MARINE THUNDERSTORM WIND” into another. The top 20 events then become:

# Merge spelling variants of the same event type into a single label
df$EVTYPE[df$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
df$EVTYPE[df$EVTYPE == "THUNDERSTORM WINDS"] <- "THUNDERSTORM WIND"
df$EVTYPE[df$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"

event <- df %>%
  group_by(EVTYPE) %>%
  summarise(occurrence = sum(!is.na(EVTYPE))) %>%
  arrange(desc(occurrence))
event[1:20,]
## Source: local data frame [20 x 2]
## 
##                      EVTYPE occurrence
##                      (fctr)      (int)
## 1         THUNDERSTORM WIND     323346
## 2                      HAIL     288661
## 3                   TORNADO      60652
## 4               FLASH FLOOD      54277
## 5                     FLOOD      25326
## 6                 HIGH WIND      20212
## 7                 LIGHTNING      15754
## 8                HEAVY SNOW      15708
## 9  MARINE THUNDERSTORM WIND      11987
## 10               HEAVY RAIN      11723
## 11             WINTER STORM      11433
## 12           WINTER WEATHER       7026
## 13             FUNNEL CLOUD       6839
## 14               WATERSPOUT       3796
## 15              STRONG WIND       3566
## 16     URBAN/SML STREAM FLD       3392
## 17                 WILDFIRE       2761
## 18                 BLIZZARD       2719
## 19                  DROUGHT       2488
## 20                ICE STORM       2006
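
The manual merges above handle only the most frequent variants. As a rough sketch of how the same idea could be applied more broadly, the label clean-up can be wrapped in a small function. The helper `normalize_evtype` and the toy input below are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): normalise raw event
# labels so that common spelling variants collapse into one category.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))                     # unify case, drop outer spaces
  x <- gsub("\\s+", " ", x, perl = TRUE)                    # collapse repeated whitespace
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)   # expand the abbreviation
  x <- sub("\\bWINDS$", "WIND", x, perl = TRUE)             # singularise a trailing "WINDS"
  x
}

# Toy labels for illustration only; these are not rows from the database.
raw <- c("TSTM WIND", "Thunderstorm Winds", " THUNDERSTORM WIND",
         "MARINE TSTM WIND", "HAIL")
cleaned <- normalize_evtype(raw)
table(cleaned)
```

On the real data this would be applied as `df$EVTYPE <- normalize_evtype(df$EVTYPE)` before counting, although it would still leave many of the rarer variants unmerged.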

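Next, the damage exponents need to be converted to numeric multipliers. I first list the unique labels and then map the interpretable ones (H, K, M and B, in either case) to the corresponding powers of ten. All other labels, including blanks, symbols and digits, are set to 0, so the corresponding records contribute nothing to the damage totals.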

unique(df$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
df$PROPDMGEXP <- as.numeric(recode(as.character(df$PROPDMGEXP), 
    "'K'=1e+3;'M'=1e+6;'B'=1e+9;'m'=1e+6;'h'=1e+2;'H'=1e+2;else = 0"))
df$CROPDMGEXP <- as.numeric(recode(as.character(df$CROPDMGEXP), 
    "'M'=1e+6;'K'=1e+3;'m'=1e+6;'B'=1e+9;'k'=1e+3;else=0"))

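The same mapping can be written without `recode()` as a named lookup vector in base R. This is only a sketch: the names `exp_lookup` and `exp_to_multiplier` are made up for illustration, and the lookup is case-insensitive for all four letters, which gives the same result as the code above on the labels listed in the output.

```r
# Multipliers for the exponent labels the analysis treats as meaningful.
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent labels to numeric multipliers.
# Unrecognised labels (blank, "?", "+", digits, ...) become 0, as in the text.
exp_to_multiplier <- function(x) {
  key  <- toupper(as.character(x))   # treat "k" and "K" alike
  mult <- unname(exp_lookup[key])    # NA for any label not in the lookup
  mult[is.na(mult)] <- 0
  mult
}

# Toy labels for illustration only.
labels <- c("K", "m", "B", "h", "", "?", "5")
multipliers <- exp_to_multiplier(labels)
multipliers
```

A lookup vector keeps the label-to-multiplier table in one place, which makes it easier to revisit the decision to zero out the remaining labels.
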
Result

Q1

Across the United States, which types of events are most harmful with respect to population health?

damage.health <- df %>%
  group_by(EVTYPE) %>%
  summarise(Total = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE)) %>%
  arrange(desc(Total))

# Plot the five event types with the highest combined fatalities and injuries
ggplot(data = damage.health[1:5, ], aes(x = EVTYPE, y = Total, fill = EVTYPE)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(limits = unique(damage.health[1:5, ]$EVTYPE)) +
  scale_fill_discrete(name = "Event Type") +
  ylab('Total Fatalities and Injuries') +
  ggtitle('Top 5 Events that Lead to Health Damage') +
  theme_bw(base_size = 15) +
  theme(legend.position = c(0.6, 0.6),
        axis.text.x = element_text(angle = 20, hjust = 0.5),
        axis.title.x = element_blank())

Fig.1 Top 5 Events that Lead to Health Damage

As the plot above shows, tornadoes are by far the most harmful hazard to population health. The total fatalities and injuries caused by tornadoes exceed those of all other event types combined, accounting for 62% of the total.
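
A share like the one quoted above is an event type's total divided by the sum over all event types. The sketch below shows that arithmetic on made-up numbers. The data frame `toy` is an assumption for illustration and its values are not taken from the database.

```r
# Toy totals for illustration only; they are not values from the database.
toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  Total  = c(60, 30, 10)
)

# Share of each event type in the grand total, in percent.
toy$Share <- round(100 * toy$Total / sum(toy$Total), 1)

# Largest share first.
toy[order(-toy$Share), ]
```

With the real data, `damage.health$Total` would take the place of `toy$Total`.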

Q2

Across the United States, which types of events have the greatest economic consequences?

damage.economic <- df %>%
  group_by(EVTYPE) %>%
  summarise(Total = sum(PROPDMG * PROPDMGEXP, na.rm = TRUE) + sum(CROPDMG * CROPDMGEXP, na.rm = TRUE)) %>%
  arrange(desc(Total))

# Plot the five event types with the highest combined property and crop damage
ggplot(data = damage.economic[1:5, ], aes(x = EVTYPE, y = Total / 1000000, fill = EVTYPE)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(limits = unique(damage.economic[1:5, ]$EVTYPE)) +
  scale_fill_discrete(name = "Event Type") +
  ylab('Total Economic Damage (M$)') +
  ggtitle('Top 5 Events that Lead to Economic Damage') +
  theme_bw(base_size = 15) +
  theme(legend.position = c(0.6, 0.7),
        axis.text.x = element_text(angle = 20, hjust = 0.5),
        axis.title.x = element_blank())

Fig.2 Top 5 Events that Lead to Economic Damage

In terms of economic damage, floods are the most harmful, accounting for 32% of the total economic loss. Note that the plot shows economic loss in millions of dollars.

Conclusion

I have used the NOAA storm database to identify the most harmful weather hazards in terms of population health and economic loss. Although the data are inconsistent over time, I have merged the most frequent duplicate event labels. The results are only slightly affected by the less frequent labels and are therefore considered trustworthy.
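
One way to gauge how much the unmerged, less frequent labels could matter is to look at the cumulative share of records covered by the most frequent labels. The sketch below uses made-up counts. The vector `occurrence` and its values are assumptions for illustration, not figures from the database.

```r
# Toy occurrence counts for illustration only (not from the database).
occurrence <- c(TYPE_A = 500, TYPE_B = 300, TYPE_C = 150, TYPE_D = 40, TYPE_E = 10)

# Sort from most to least frequent and accumulate the share of all records.
sorted   <- sort(occurrence, decreasing = TRUE)
coverage <- cumsum(sorted) / sum(sorted)
round(coverage, 2)
```

With the real data, `event$occurrence` (already sorted in descending order) could be used in place of `sorted`. Note that this measures coverage by record count only, not by damage.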