The analysis explore the NOAA storm database in attempt to understand the damage caused by severe weather events, both to human lives as well as the economic impacts. It is presented in such away that will facilitate the reader to understand how to prioritize resources to remediate some of the damages.
TORNADO catastrophes in central and eastern time zones regions of the US were the most harmful events, with approximate casualties of 90,000 people, out of them 5% are dead. Following are the FLOODS in central time zone, with 6,675 casualties, the third event relate to EXCESSIVE HEAT that causes 5,229 casualties, were 20% are dead. FLOODS that occur in the pacific time zone are the most costly events with accumulated economic damage of 119 billions dollars, were the majority of damage is to property. Next are the TORNADOS and HURRICANES happened in central time zone, with accumulated economic damage of 46 and 43 billions dollars respectively, with a majority of damage to property. We can conclude that the TORNADO is the most harmful and damaging event that ever happen in the US following by the FLOOD.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
library(knitr)
# download the data
datafile <- "repdata-data-StormData.csv"
if (!file.exists(paste(datafile, ".bz2", sep = ""))) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = paste(datafile, ".bz2", sep = ""), method = "curl")
}
## Warning: running command 'curl
## "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## -o "repdata-data-StormData.csv.bz2"' had status 127
## Warning in
## download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
## : download had nonzero exit status
# unzip and load the data
if (!file.exists(datafile)) {
unzip(paste(datafile, ".bz2", sep = ""))
}
data <- read.csv(datafile)
The measured data has 902,297 observation of 37 variables.
# subset the relevant features
fatal <- select(data, EVTYPE, TIME_ZONE, FATALITIES, INJURIES)
# get summary of features
summary(fatal)
## EVTYPE TIME_ZONE FATALITIES
## HAIL :288661 CST :547493 Min. : 0.0000
## TSTM WIND :219940 EST :245558 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 MST : 68390 Median : 0.0000
## TORNADO : 60652 PST : 28302 Mean : 0.0168
## FLASH FLOOD : 54277 AST : 6360 3rd Qu.: 0.0000
## FLOOD : 25326 HST : 2563 Max. :583.0000
## (Other) :170878 (Other): 3631
## INJURIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.1557
## 3rd Qu.: 0.0000
## Max. :1700.0000
##
# filter out observations without fatalities or injuries
fatal <- filter(fatal, FATALITIES!=0 | INJURIES!=0)
# summarize the number of fatalities and injuries per event type and time zone
fs <- group_by(fatal, EVTYPE, TIME_ZONE)
fs <- summarise(fs, fatalities = sum(FATALITIES, na.rm = TRUE),
injuries = sum(INJURIES, na.rm = TRUE),
total = fatalities + injuries,
fatalities_pct = paste(round(fatalities/total*100, digits=0),"%", sep=""),
injuries_pct = paste(round(injuries/total*100, digits=0),"%", sep=""))
# order by total casualties and extract the 10 most harmful events
fs <- head(fs[order(desc(fs$total)), ] , 10)
# reshape the data in order to produce a side by side column bar graph
fr <-
select(fs,EVTYPE,TIME_ZONE, fatalities, injuries, total ) %>%
gather(impact, affected_lifes, fatalities:injuries)
fr <- arrange(fr, desc(total))
# ajust the feature names
names(fs) <- c("event","time_zone","fatalities","injuries","total","fatalities_%","injuries_%")
# subset the relevant features
damage <- select(data, EVTYPE, TIME_ZONE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# get summary
summary(damage)
## EVTYPE TIME_ZONE PROPDMG
## HAIL :288661 CST :547493 Min. : 0.00
## TSTM WIND :219940 EST :245558 1st Qu.: 0.00
## THUNDERSTORM WIND: 82563 MST : 68390 Median : 0.00
## TORNADO : 60652 PST : 28302 Mean : 12.06
## FLASH FLOOD : 54277 AST : 6360 3rd Qu.: 0.50
## FLOOD : 25326 HST : 2563 Max. :5000.00
## (Other) :170878 (Other): 3631
## PROPDMGEXP CROPDMG CROPDMGEXP
## :465934 Min. : 0.000 :618413
## K :424665 1st Qu.: 0.000 K :281832
## M : 11330 Median : 0.000 M : 1994
## 0 : 216 Mean : 1.527 k : 21
## B : 40 3rd Qu.: 0.000 0 : 19
## 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
# filter out observations without damage
damage <- filter(damage, PROPDMG!=0 && PROPDMGEXP %in% c("K","M", "B") | CROPDMG!=0 && CROPDMGEXP %in% c("K","M", "B"))
# scalling numbers
scallingFactor <- function(damage, exponent) {
if (exponent %in% c("b", "B")) {
return(damage * 1000000000)
}
if (exponent %in% c("k", "K")) {
return(damage * 1000)
}
if (exponent %in% c("m", "M")) {
return(damage * 1000000)
}
return(0)
}
# append scaled numbers
damage$prop_damage_scalling <- mapply(scallingFactor, damage$PROPDMG, damage$PROPDMGEXP)
damage$crop_damage_scalling <- mapply(scallingFactor, damage$CROPDMG, damage$CROPDMGEXP)
# summarize the damage per event type and time zone
dmg <- group_by(damage, EVTYPE, TIME_ZONE)
dmg <- summarise(dmg, prop_damage = sum(prop_damage_scalling, na.rm = TRUE)/1000000000,
crop_damage = sum(crop_damage_scalling, na.rm = TRUE)/1000000000,
total = round(prop_damage + crop_damage, digits=2))
# order by highest damage and extract the 10 most damaging events
dmg <- head(dmg[order(desc(dmg$total)), ] , 10)
# reshape the data in order to produce a side by side column bar graph
dmr <-
select(dmg,EVTYPE,TIME_ZONE, prop_damage, crop_damage, total ) %>%
gather(impact, damage_value, prop_damage:crop_damage)
dmr <- arrange(dmr, desc(total))
# ajust the feature names
names(dmg) <- c("event","time_zone","property_damage","crop_damage","total")
Table 1 - The most harmful events
kable(fs, digits=0)
| event | time_zone | fatalities | injuries | total | fatalities_% | injuries_% |
|---|---|---|---|---|---|---|
| TORNADO | CST | 5180 | 81524 | 86704 | 6% | 94% |
| TORNADO | EST | 440 | 9253 | 9693 | 5% | 95% |
| FLOOD | CST | 158 | 6517 | 6675 | 2% | 98% |
| EXCESSIVE HEAT | CST | 1040 | 4189 | 5229 | 20% | 80% |
| TSTM WIND | CST | 334 | 4755 | 5089 | 7% | 93% |
| LIGHTNING | EST | 389 | 3138 | 3527 | 11% | 89% |
| EXCESSIVE HEAT | EST | 678 | 2076 | 2754 | 25% | 75% |
| HEAT | CST | 800 | 1686 | 2486 | 32% | 68% |
| ICE STORM | EST | 12 | 1737 | 1749 | 1% | 99% |
| TSTM WIND | EST | 140 | 1604 | 1744 | 8% | 92% |
Chart 1 - Visulaize the most harmful events per time zone
ggplot(data=fr, aes(x=reorder(EVTYPE, order(total)), y=affected_lifes, fill=impact)) +
geom_bar(stat = "identity", position=position_dodge()) +
coord_flip() + xlab("Event") + ylab("Casualties") + facet_wrap(~TIME_ZONE)
Table 2 - The most dameging events (billions $)
kable(dmg, digits=0)
| event | time_zone | property_damage | crop_damage | total |
|---|---|---|---|---|
| FLOOD | PST | 118 | 1 | 119 |
| TORNADO | CST | 48 | 0 | 48 |
| HURRICANE/TYPHOON | CST | 46 | 2 | 48 |
| STORM SURGE | CST | 43 | 0 | 43 |
| HURRICANE/TYPHOON | EST | 23 | 1 | 24 |
| FLOOD | CST | 15 | 3 | 18 |
| DROUGHT | CST | 1 | 12 | 13 |
| FLOOD | EST | 11 | 2 | 12 |
| HAIL | CST | 9 | 2 | 11 |
| RIVER FLOOD | CST | 5 | 5 | 10 |
Chart 2 - Visulaize the most dameging events per time zone
ggplot(data=dmr, aes(x=reorder(EVTYPE, total), y=damage_value, fill=impact)) +
geom_bar(stat="identity", position=position_dodge()) +
coord_flip() + xlab("Event") + ylab("Economic damage - billions $") + facet_wrap(~TIME_ZONE)