Synopsis

We briefly investigate the impact of recorded weather events in the US. The analysis is based on historical storm data provided by the National Oceanic and Atmospheric Administration (NOAA). The goal is to determine which event types are most harmful to the US population and which cause the most damage to property.
This work was done as part of the Reproducible Research course in the Coursera Data Science specialization.

Data Processing

Libraries first

library(dplyr)
library(stringr)
library(lattice)

Make sure the storm data file is downloaded to your current working directory under the name repdata-data-StormData.csv.bz2. It is available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
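
If you prefer to fetch the file from R, the following is a minimal sketch and not part of the original analysis. The helper name fetch_if_missing is hypothetical; the URL and file name are the ones used in this report.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` does not exist yet.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed archive intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata-data-StormData.csv.bz2"
# fetch_if_missing(storm_url, storm_file)  # uncomment to actually download
```

Skipping the download when the file is already present keeps repeated runs of the report fast and avoids depending on the network every time.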

raw_data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=FALSE)

Since there are over 900k records in the original data source, let's first get rid of those that have no influence on our analysis. We filter out events that caused no harm to the population and no damage to property, keep only the features we need, and remove case sensitivity by upper-casing the text fields.
To make things easier, we split the data into 2 subsets - one for harm to the population and one for damage to property.

harm2ppl <- filter(raw_data, FATALITIES > 0 | INJURIES > 0) %>%
        mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
               EVTYPE = str_trim(toupper(EVTYPE))) %>%
        select(DATE,
               EVTYPE,
               FATALITIES,
               INJURIES)

dmg2prop <- filter(raw_data, PROPDMG > 0 | CROPDMG > 0) %>%
        mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
               EVTYPE = str_trim(toupper(EVTYPE)),
               PROPDMGEXP = str_trim(toupper(PROPDMGEXP)),
               CROPDMGEXP = str_trim(toupper(CROPDMGEXP))) %>%
        select(DATE,
               EVTYPE,
               PROPDMG,
               PROPDMGEXP,
               CROPDMG,
               CROPDMGEXP)
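
As a quick sanity check of the two transformations used above, here is a small sketch on made-up values. The inputs are illustrative assumptions, not rows copied from the data set, and base R's trimws stands in for str_trim so the snippet needs no packages.

```r
# Toy inputs imitating raw BGN_DATE and EVTYPE entries (illustrative only)
bgn_date <- c("4/18/1950 0:00:00", "12/1/2011 0:00:00")
evtype   <- c(" Tstm Wind", "TSTM WIND ")

# Same date format string as in the pipeline; the time part is parsed and dropped
dates <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")

# Upper-case and trim so that spelling variants collapse into one label
types <- trimws(toupper(evtype))

print(dates)
print(unique(types))
```

After normalization both spellings collapse into a single label, which is what lets later grouping treat them as one event type.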

Next let's calculate the total damage and filter out the remaining dirty data.
We'll create a TOTALDMG feature by converting PROPDMGEXP and CROPDMGEXP to numeric multipliers and summing the resulting property and crop damage amounts.

unique(dmg2prop$PROPDMGEXP)
##  [1] "K" "M" "B" ""  "+" "0" "5" "6" "4" "H" "2" "7" "3" "-"
# fix PROPDMGEXP: any code that makes no sense gets a factor of 0
dmg2prop <- mutate(dmg2prop, PFACTOR = 0, CFACTOR = 0)
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("K", "3")] <- 1e+03
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("M", "6")] <- 1e+06
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("B")] <- 1e+09
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("5")] <- 1e+05
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("4")] <- 1e+04
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("H", "2")] <- 1e+02
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("7")] <- 1e+07
# now CROPDMGEXP
unique(dmg2prop$CROPDMGEXP)
## [1] ""  "M" "K" "B" "?" "0"
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "M"] <- 1e+06
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "K"] <- 1e+03
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "B"] <- 1e+09
# add calculated total damage
dmg2prop <- dmg2prop %>%
        mutate(TOTALDMG = PROPDMG * PFACTOR + CROPDMG * CFACTOR) %>%
        filter(TOTALDMG > 0)
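
The exponent-to-multiplier mapping above can also be written as a single lookup table. The sketch below is an alternative formulation under the same mapping, not the code used for the results; the helper name exp_factor and the toy data frame are hypothetical.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Codes that are not in the table (e.g. "", "+", "-", "?", "0") map to 0.
exp_factor <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9,
              "2" = 1e2, "3" = 1e3, "4" = 1e4,
              "5" = 1e5, "6" = 1e6, "7" = 1e7)
  f <- unname(lookup[match(code, names(lookup))])
  f[is.na(f)] <- 0
  f
}

# Toy records (made up for illustration)
toy <- data.frame(PROPDMG    = c(25, 2.5, 10, 3),
                  PROPDMGEXP = c("K", "M", "+", ""),
                  stringsAsFactors = FALSE)
toy$PFACTOR <- exp_factor(toy$PROPDMGEXP)
toy$DAMAGE  <- toy$PROPDMG * toy$PFACTOR
print(toy)
```

Keeping the mapping in one named vector makes it easy to review which codes are honoured and which are silently treated as zero.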

Next we want to classify all events according to the NOAA classification. To classify events whose EVTYPE is not on the NOAA list, we use the generalized Levenshtein distance between 2 strings - the EVTYPE from the storm data and each NOAA-defined type. Each record is assigned the NOAA type with the smallest distance.

events <- data.frame(TYPE = c(
        "ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
        "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE",
        "DROUGHT", "DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT",
        "EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD", "FROST/FREEZE",
        "FUNNEL CLOUD", "FREEZING FOG", "HAIL", "HEAT",
        "HEAVY RAIN", "HEAVY SNOW", "HIGH SURF", "HIGH WIND",
        "HURRICANE (TYPHOON)", "ICE STORM", "LAKE-EFFECT SNOW", "LAKESHORE FLOOD",
        "LIGHTNING", "MARINE HAIL", "MARINE HIGH WIND", "MARINE STRONG WIND",
        "MARINE THUNDERSTORM WIND", "RIP CURRENT", "SEICHE", "SLEET",
        "STORM SURGE/TIDE", "STRONG WIND", "THUNDERSTORM WIND", "TORNADO",
        "TROPICAL DEPRESSION", "TROPICAL STORM", "TSUNAMI", "VOLCANIC ASH",
        "WATERSPOUT", "WILDFIRE", "WINTER STORM", "WINTER WEATHER"),
        stringsAsFactors=FALSE)
# map an EVTYPE to the closest NOAA event type (partial-match edit distance)
closest_noaa <- function(evtype) {
        events$TYPE[which.min(adist(evtype, events$TYPE, partial=TRUE))]
}
# add feature and fill it with NOAA event
harm2ppl$NOAATYPE <- sapply(harm2ppl$EVTYPE, closest_noaa, USE.NAMES=FALSE)
# same for other data
dmg2prop$NOAATYPE <- sapply(dmg2prop$EVTYPE, closest_noaa, USE.NAMES=FALSE)
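
One property of this matching is worth keeping in mind when reading the results. With partial=TRUE, adist measures the distance to the best-matching substring of each candidate, and which.min returns the first minimum. The sketch below illustrates this on a four-entry subset of the NOAA list in its original order; the variable names are hypothetical, and the comparison with the default partial=FALSE is shown only for contrast, not as the method used in this report.

```r
# Subset of the NOAA types, in the same relative order as in the full list
noaa <- c("COASTAL FLOOD", "FLASH FLOOD", "FLOOD", "LAKESHORE FLOOD")

# Partial matching: distance to the best-matching substring of each candidate
d_partial <- adist("FLOOD", noaa, partial = TRUE)
# Full matching: distance to the complete candidate string
d_full    <- adist("FLOOD", noaa)

partial_match <- noaa[which.min(d_partial)]  # ties are resolved by position
full_match    <- noaa[which.min(d_full)]

print(d_partial)
print(partial_match)
print(full_match)
```

Because "FLOOD" is a substring of every candidate here, all partial distances tie at 0 and the first entry wins, so a plain FLOOD record is labelled COASTAL FLOOD. This may explain part of the COASTAL FLOOD and MARINE THUNDERSTORM WIND totals in the Results section.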

Results

Let’s get to plotting now to show some results.

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

top10fatal <- harm2ppl %>%
        group_by(NOAATYPE) %>%
        summarise(FATALITIES=sum(FATALITIES),
                  INJURIES=sum(INJURIES),
                  TOTAL=sum(FATALITIES + INJURIES)) %>%
        arrange(desc(TOTAL))
top10fatal <- head(top10fatal, 10)
top10fatal
## Source: local data frame [10 x 4]
## 
##                    NOAATYPE FATALITIES INJURIES TOTAL
## 1                   TORNADO       5633    91364 96997
## 2            EXCESSIVE HEAT       3040     9077 12117
## 3  MARINE THUNDERSTORM WIND        758     9467 10225
## 4             COASTAL FLOOD        515     6894  7409
## 5                 LIGHTNING        817     5231  6048
## 6               FLASH FLOOD       1035     1802  2837
## 7                 ICE STORM         97     2128  2225
## 8                 HIGH WIND        299     1482  1781
## 9                  WILDFIRE         90     1606  1696
## 10             WINTER STORM        217     1415  1632
barchart(NOAATYPE~FATALITIES+INJURIES,
         data=top10fatal,
         auto.key=TRUE,
         xlab="Number of people",
         ylab="NOAA Event Type",
         main="Event Types Most Harmful to Population")

Question 2: Across the United States, which types of events have the greatest economic consequences?

top10economy <- dmg2prop %>%
        arrange(DATE) %>%
        group_by(NOAATYPE) %>%
        summarise(DMG=sum(TOTALDMG)/1e+09) %>%
        arrange(desc(DMG))
top10economy <- head(top10economy, 10)
top10economy
## Source: local data frame [10 x 2]
## 
##                    NOAATYPE        DMG
## 1             COASTAL FLOOD 151.166335
## 2       HURRICANE (TYPHOON)  90.762533
## 3                   TORNADO  57.367113
## 4          STORM SURGE/TIDE  47.965579
## 5               FLASH FLOOD  19.120534
## 6                      HAIL  18.761864
## 7                   DROUGHT  15.025670
## 8  MARINE THUNDERSTORM WIND  12.343547
## 9           LAKESHORE FLOOD  10.305824
## 10                ICE STORM   8.981113
barchart(DMG~NOAATYPE,
         data=top10economy,
         scales=list(rot=c(30,0)),
         ylab="Damage",
         main="Most Damaging Event Types (in Billions USD)")

Conclusions

Let's summarize what we have found. Tornadoes have caused by far the most injuries - more than all other event types in the top 10 combined. The gap in fatalities is much smaller, but tornadoes are still the deadliest event type. Together with excessive heat, they account for most of the fatalities in the US.
Regarding economic damage (property plus crops), events classified as coastal flood caused the most, about 151 billion USD. This event type also ranks 4th by number of injuries.