Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. In this document, we analyze the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to address two questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Data processing

Download and load the data

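# Source of the compressed NOAA storm data
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Create the data directory and download the file only if not already present
if (!file.exists("data")) dir.create("data")
if (!file.exists("./data/storm_data.csv.bz2")) {
        download.file(URL, destfile = "./data/storm_data.csv.bz2", method = "curl")
}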

storm <- read.csv("./data/storm_data.csv.bz2")

Load required libraries

library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('ggplot2')

Subset and tidy the data

To address the two research questions, we need only 7 variables from the storm dataset. The required variables are summarized below:

  • EVTYPE: types of events
  • FATALITIES: number of fatalities
  • INJURIES: number of injuries
  • PROPDMG: property damage amount, before scaling by PROPDMGEXP
  • PROPDMGEXP: exponent (magnitude) code for PROPDMG
  • CROPDMG: crop damage amount, before scaling by CROPDMGEXP
  • CROPDMGEXP: exponent (magnitude) code for CROPDMG

We subset the dataset to the variables of interest and store the result in the variable df.

df <- subset(storm, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

How to interpret the exponent codes in the PROPDMGEXP and CROPDMGEXP columns is unclear, because the NOAA website offers no official documentation for them. We follow the rules below when handling the exponent variables (credit: https://github.com/flyingdisc/RepData_PeerAssessment2/blob/master/how-to-handle-PROPDMGEXP.md):

These are possible values of CROPDMGEXP and PROPDMGEXP:

  • H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character

  • H,h = hundreds = 100
  • K,k = kilos = thousands = 1,000
  • M,m = millions = 1,000,000
  • B,b = billions = 1,000,000,000
  • (+) = 1
  • (-) = 0
  • (?) = 0
  • blank/empty character = 0
  • numeric 0..8 = 10
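
To make the rules above concrete, here is a minimal base-R sketch that applies them to a few toy codes. The helper name exp_to_multiplier is hypothetical and exists only for illustration; it is separate from the recoding actually used in this analysis.

```r
# Hypothetical helper (illustration only): translate exponent codes into
# multipliers following the rules listed above.
exp_to_multiplier <- function(code) {
  # Letter and symbol codes; letters are matched case-insensitively
  lookup <- c(H = 100, K = 1e3, M = 1e6, B = 1e9, "+" = 1, "-" = 0, "?" = 0)
  code <- toupper(as.character(code))
  out <- unname(lookup[code])                # NA for codes not in the lookup
  out[code %in% as.character(0:8)] <- 10     # numeric codes 0..8 map to 10
  out[code == ""] <- 0                       # blank/empty maps to 0
  out
}

# Toy input: thousands, millions, blank, a numeric code, and "+"
exp_to_multiplier(c("K", "m", "", "5", "+"))
```

Any code outside the listed set would come back as NA under this sketch, which makes unexpected values easy to spot.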

We recode PROPDMGEXP and CROPDMGEXP according to the rules defined above. We then create two new variables, PROPMUL and CROPMUL, as the products PROPDMG × PROPDMGEXP and CROPDMG × CROPDMGEXP respectively. We store this tidy data set in the variable df_tidy.

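# Exponent codes and their multipliers, in matching order (see rules above).
# Note: c() coerces 0:8 to the character codes "0".."8".
replace_from <- c("H", "h", "K", "k", "M", "m", "B", "b", "+", "-", "?", "", 0:8)
replace_to <- c(100, 100, 1000, 1000, 1e+06, 1e+06, 1e+09, 1e+09, 1, 0, 0, 0, rep(10, 9))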

df_tidy <- df %>%
        mutate(PROPDMGEXP = plyr::mapvalues(PROPDMGEXP, replace_from, replace_to)) %>%
        mutate(CROPDMGEXP = plyr::mapvalues(CROPDMGEXP, replace_from, replace_to))
## The following `from` values were not present in `x`: k, b
## The following `from` values were not present in `x`: H, h, b, +, -, 1, 3, 4, 5, 6, 7, 8
df_tidy$PROPDMGEXP <- as.numeric(as.character(df_tidy$PROPDMGEXP))
df_tidy$CROPDMGEXP <- as.numeric(as.character(df_tidy$CROPDMGEXP))

df_tidy <- df_tidy %>%
        mutate(PROPMUL = PROPDMG * PROPDMGEXP) %>%
        mutate(CROPMUL = CROPDMG * CROPDMGEXP)

head(df_tidy)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0       1000       0          0
## 2 TORNADO          0        0     2.5       1000       0          0
## 3 TORNADO          0        2    25.0       1000       0          0
## 4 TORNADO          0        2     2.5       1000       0          0
## 5 TORNADO          0        2     2.5       1000       0          0
## 6 TORNADO          0        6     2.5       1000       0          0
##   PROPMUL CROPMUL
## 1   25000       0
## 2    2500       0
## 3   25000       0
## 4    2500       0
## 5    2500       0
## 6    2500       0

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

We sum the injuries and fatalities for each event type.

summary1 <- df_tidy %>%
        group_by(EVTYPE) %>%
        summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES), injuries_and_fatalities = sum(INJURIES+FATALITIES)) %>%
        arrange(desc(injuries_and_fatalities))

head(summary1,5)
## # A tibble: 5 x 4
##   EVTYPE         fatalities injuries injuries_and_fatalities
##   <fct>               <dbl>    <dbl>                   <dbl>
## 1 TORNADO              5633    91346                   96979
## 2 EXCESSIVE HEAT       1903     6525                    8428
## 3 TSTM WIND             504     6957                    7461
## 4 FLOOD                 470     6789                    7259
## 5 LIGHTNING             816     5230                    6046

The plot below makes the comparison clearer.

ggplot(head(summary1, 5), aes(reorder(EVTYPE, -injuries_and_fatalities), injuries_and_fatalities)) +
        geom_bar(stat = "identity") +
        labs(title = "Injuries and Fatalities vs Event Types", x = "", y = "Number of injuries and fatalities")

From the summary and plot above, tornadoes are clearly the most harmful events with respect to population health, followed by excessive heat and thunderstorm wind.
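
The combined total weights a fatality and an injury equally, so it can be useful to look at the two outcomes separately. The sketch below re-ranks only the five rows printed in the summary above; the data frame top5 is a hand-copied illustration, and on the full summary1 other event types could enter either ranking.

```r
# Five rows copied from the head(summary1, 5) output above (illustration only)
top5 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"),
  fatalities = c(5633, 1903, 504, 470, 816),
  injuries   = c(91346, 6525, 6957, 6789, 5230),
  stringsAsFactors = FALSE
)

# Rank the same five events by each outcome separately (largest first)
by_fatalities <- top5$EVTYPE[order(-top5$fatalities)]
by_injuries   <- top5$EVTYPE[order(-top5$injuries)]

by_fatalities
by_injuries
```

Among these five events, tornadoes lead on both measures, while the order of the remaining four differs between fatalities and injuries.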

2. Across the United States, which types of events have the greatest economic consequences?

We sum the property and crop damage for each event type.

summary2 <- df_tidy %>%
        group_by(EVTYPE) %>%
        summarise(prop_dmg = sum(PROPMUL), crop_dmg = sum(CROPMUL), crop_prop_dmg = sum(CROPMUL+PROPMUL)) %>%
        arrange(desc(crop_prop_dmg))

head(summary2,5)
## # A tibble: 5 x 4
##   EVTYPE                prop_dmg   crop_dmg crop_prop_dmg
##   <fct>                    <dbl>      <dbl>         <dbl>
## 1 FLOOD             144657709800 5661968450  150319678250
## 2 HURRICANE/TYPHOON  69305840000 2607872800   71913712800
## 3 TORNADO            56937162897  414954710   57352117607
## 4 STORM SURGE        43323536000       5000   43323541000
## 5 HAIL               15732269877 3025954650   18758224527

We then plot the summary above.

ggplot(head(summary2, 5), aes(reorder(EVTYPE, -crop_prop_dmg), crop_prop_dmg)) +
        geom_bar(stat = "identity") +
        labs(title = "Economic Consequences ($)", x = "", y = "")

From the summary and plot above, floods have the greatest economic consequences, followed by hurricanes/typhoons and tornadoes.
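
Raw dollar totals of this size are hard to read at a glance. As a small sketch, the five totals printed in the summary above can be expressed in billions of dollars; the data frame top5_damage and its billions column are hypothetical names used only for this illustration.

```r
# Five rows copied from the head(summary2, 5) output above (illustration only)
top5_damage <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL"),
  crop_prop_dmg = c(150319678250, 71913712800, 57352117607, 43323541000, 18758224527),
  stringsAsFactors = FALSE
)

# Convert combined damage to billions of dollars, rounded to one decimal
top5_damage$billions <- round(top5_damage$crop_prop_dmg / 1e9, 1)

top5_damage[, c("EVTYPE", "billions")]
```

On this scale, the flood total of roughly 150 billion dollars is about twice the hurricane/typhoon total of roughly 72 billion.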