We are going to briefly investigate the impact of registered weather events in the US. Our research is based on historical data provided by the National Oceanic and Atmospheric Administration (NOAA). We’ll try to determine which events are most harmful to the US population and property.
This work is done as part of the Coursera Data Science specialization, Reproducible Research course.
Libraries first
library(dplyr)
library(stringr)
library(lattice)
Make sure you have the storm data downloaded to your current working directory and saved as repdata-data-StormData.csv.bz2. It’s located here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
raw_data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=FALSE)
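If the archive is not in place yet, the download can be scripted as well. Below is a minimal sketch, assuming the file should be saved under the name used in the read.csv call; `fetch_storm_data` is a hypothetical helper and not part of the original analysis.

```r
# Hypothetical helper: download the archive only when it is not already present.
fetch_storm_data <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed archive intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata-data-StormData.csv.bz2"

# Uncomment to fetch the data (the archive is large):
# fetch_storm_data(storm_url, storm_file)
```

Skipping the download when the file already exists keeps repeated runs of the report fast.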
Since there are over 900k records in the original data source, let’s first get rid of those that have no influence on our analysis. We filter out events that caused neither harm to population nor damage to property, keep only the features we need, and remove case sensitivity by upper-casing the text fields.
To make things even better, we’re going to split the data into two subsets: one for harm to population and one for damage to property.
harm2ppl <- filter(raw_data, FATALITIES > 0 | INJURIES > 0) %>%
  mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
         EVTYPE = str_trim(toupper(EVTYPE))) %>%
  select(DATE,
         EVTYPE,
         FATALITIES,
         INJURIES)
dmg2prop <- filter(raw_data, PROPDMG > 0 | CROPDMG > 0) %>%
  mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
         EVTYPE = str_trim(toupper(EVTYPE)),
         PROPDMGEXP = str_trim(toupper(PROPDMGEXP)),
         CROPDMGEXP = str_trim(toupper(CROPDMGEXP))) %>%
  select(DATE,
         EVTYPE,
         PROPDMG,
         PROPDMGEXP,
         CROPDMG,
         CROPDMGEXP)
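To see what the date parsing and text normalization do, here is a small sketch on made-up values shaped like the raw BGN_DATE and EVTYPE columns. It uses base R `trimws` as a stand-in for `str_trim` so that it runs without extra packages; the sample values are assumptions for illustration only.

```r
# Made-up sample values in the same shape as the raw columns
bgn_date <- c("4/18/1950 0:00:00", "12/1/2011 0:00:00")
evtype   <- c("  Tstm Wind", "flash flood ")

# as.Date parses the string with the given format and keeps the calendar date
dates <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")

# Upper-casing and trimming collapses spelling variants of the same label
clean <- trimws(toupper(evtype))

print(dates)
print(clean)
```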
Next, let’s calculate the total damage and filter out the remaining dirty data.
We’ll create a TOTALDMG feature by converting PROPDMGEXP and CROPDMGEXP to multipliers and applying them to the property and crop damage values.
unique(dmg2prop$PROPDMGEXP)
## [1] "K" "M" "B" "" "+" "0" "5" "6" "4" "H" "2" "7" "3" "-"
# fix PROPDMGEXP, everything that makes no sense is 0
dmg2prop <- mutate(dmg2prop, PFACTOR = 0, CFACTOR = 0)
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("K", "3")] <- 1e+03
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("M", "6")] <- 1e+06
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("B")] <- 1e+09
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("5")] <- 1e+05
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("4")] <- 1e+04
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("H", "2")] <- 1e+02
dmg2prop$PFACTOR[dmg2prop$PROPDMGEXP %in% c("7")] <- 1e+07
# now CROPDMGEXP
unique(dmg2prop$CROPDMGEXP)
## [1] "" "M" "K" "B" "?" "0"
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "M"] <- 1e+06
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "K"] <- 1e+03
dmg2prop$CFACTOR[dmg2prop$CROPDMGEXP == "B"] <- 1e+09
# add calculated total damage
dmg2prop <- dmg2prop %>%
  mutate(TOTALDMG = PROPDMG * PFACTOR + CROPDMG * CFACTOR) %>%
  filter(TOTALDMG > 0)
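The exponent-to-multiplier mapping above can also be written as a single lookup table. This is only a sketch of an alternative; `exp_factor` and `to_factor` are hypothetical names, and the values simply repeat the property-damage assignments made above.

```r
# Same multipliers as in the assignments above, as a named lookup vector
exp_factor <- c(H = 1e2, "2" = 1e2, K = 1e3, "3" = 1e3, "4" = 1e4,
                "5" = 1e5, M = 1e6, "6" = 1e6, "7" = 1e7, B = 1e9)

# Hypothetical helper: map exponent codes to multipliers, unknown codes become 0
to_factor <- function(code) {
  f <- unname(exp_factor[code])
  f[is.na(f)] <- 0
  f
}

# Known codes get their multiplier, everything else falls back to 0
to_factor(c("K", "M", "B", "", "+", "0"))
```

Keeping the mapping in one place makes it easier to review which codes are treated as meaningful and which are zeroed out.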
Next we want to classify all events according to the NOAA classification. To classify events that are not in the NOAA list, we’re going to use the generalized Levenshtein distance between two strings: the EVTYPE from the storm data and the NOAA-defined type.
events <- data.frame(TYPE = c("ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD", "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE", "DROUGHT", "DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT", "EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD", "FROST/FREEZE", "FUNNEL CLOUD", "FREEZING FOG", "HAIL", "HEAT", "HEAVY RAIN", "HEAVY SNOW", "HIGH SURF", "HIGH WIND", "HURRICANE (TYPHOON)", "ICE STORM", "LAKE-EFFECT SNOW", "LAKESHORE FLOOD", "LIGHTNING", "MARINE HAIL", "MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND", "RIP CURRENT", "SEICHE", "SLEET", "STORM SURGE/TIDE", "STRONG WIND", "THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM", "TSUNAMI", "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM", "WINTER WEATHER")
, stringsAsFactors=FALSE)
# add feature and fill it with NOAA event
harm2ppl <- mutate(harm2ppl, NOAATYPE="")
# x[2] is the EVTYPE column; pick the NOAA type with the smallest distance
harm2ppl$NOAATYPE <- apply(harm2ppl, 1,
                           function(x) events$TYPE[which.min(adist(x[2], events$TYPE,
                                                                   partial=TRUE))])
# same for other data
dmg2prop <- mutate(dmg2prop, NOAATYPE="")
dmg2prop$NOAATYPE <- apply(dmg2prop, 1,
                           function(x) events$TYPE[which.min(adist(x[2], events$TYPE,
                                                                   partial=TRUE))])
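One property of this matching is worth keeping in mind when reading the results. With partial=TRUE, adist measures the distance to the best-matching substring of each candidate, and which.min returns the first minimum. The sketch below shows this on a shortened, made-up candidate list; `classify` is a hypothetical variant that prefers an exact match, not the method used in this report. Ties of this kind may partly explain why COASTAL FLOOD and MARINE THUNDERSTORM WIND rank so high in the tables below, though that was not checked against the raw data here.

```r
# Shortened candidate list, in the same relative order as the NOAA list above
noaa <- c("COASTAL FLOOD", "FLOOD", "MARINE THUNDERSTORM WIND", "THUNDERSTORM WIND")

# "FLOOD" is a substring of both flood types, so both are at distance 0
d <- utils::adist("FLOOD", noaa, partial = TRUE)
print(d)

# which.min returns the first minimum, i.e. the candidate listed earlier
print(noaa[which.min(d)])

# Hypothetical variant: take an exact match when one exists, otherwise fall back
classify <- function(x, types) {
  if (x %in% types) return(x)
  types[which.min(utils::adist(x, types, partial = TRUE))]
}
print(classify("FLOOD", noaa))
```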
Let’s get to plotting now to show some results.
top10fatal <- harm2ppl %>%
  group_by(NOAATYPE) %>%
  summarise(FATALITIES=sum(FATALITIES),
            INJURIES=sum(INJURIES),
            TOTAL=sum(FATALITIES + INJURIES)) %>%
  arrange(desc(TOTAL))
top10fatal <- head(top10fatal, 10)
top10fatal
## Source: local data frame [10 x 4]
##
##                    NOAATYPE FATALITIES INJURIES TOTAL
##  1                  TORNADO       5633    91364 96997
##  2           EXCESSIVE HEAT       3040     9077 12117
##  3 MARINE THUNDERSTORM WIND        758     9467 10225
##  4            COASTAL FLOOD        515     6894  7409
##  5                LIGHTNING        817     5231  6048
##  6              FLASH FLOOD       1035     1802  2837
##  7                ICE STORM         97     2128  2225
##  8                HIGH WIND        299     1482  1781
##  9                 WILDFIRE         90     1606  1696
## 10             WINTER STORM        217     1415  1632
barchart(NOAATYPE~FATALITIES+INJURIES,
         data=top10fatal,
         auto.key=TRUE,
         xlab="Population",
         ylab="NOAA Event Type",
         main="Events Most Harmful to Population")
top10economy <- dmg2prop %>%
  arrange(DATE) %>%
  group_by(NOAATYPE) %>%
  summarise(DMG=sum(TOTALDMG)/1e+09) %>%
  arrange(desc(DMG))
top10economy <- head(top10economy, 10)
top10economy
## Source: local data frame [10 x 2]
##
##                    NOAATYPE        DMG
##  1            COASTAL FLOOD 151.166335
##  2      HURRICANE (TYPHOON)  90.762533
##  3                  TORNADO  57.367113
##  4         STORM SURGE/TIDE  47.965579
##  5              FLASH FLOOD  19.120534
##  6                     HAIL  18.761864
##  7                  DROUGHT  15.025670
##  8 MARINE THUNDERSTORM WIND  12.343547
##  9          LAKESHORE FLOOD  10.305824
## 10                ICE STORM   8.981113
barchart(DMG~NOAATYPE,
         data=top10economy,
         scales=list(rot=c(30,0)),
         ylab="Damage",
         main="Most Damaging Event Types (in Billions USD)")
Let’s sum up what we have found. Tornadoes have caused by far the most injuries - more than all other events in the top 10 together. Though the gap in fatalities is much smaller, tornadoes are nevertheless also the deadliest event type. Together with excessive heat, they are responsible for most fatalities in the US.
Regarding damage to property, coastal floods cause the most damage; they also rank 4th in injuries caused.