This report explores storm data in order to analyze the impacts of different event types in the US on both human population health and economic health. The data are from the NOAA storm database as described here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf Specifically, the report aims to answer the following questions:
The analysis excludes territories of the US, e.g. Puerto Rico.
We found that the type of event with the highest number of total fatalities and injuries is a tornado, but wild fires, while less common, on average result in the greatest number of deaths and injuries. With respect to economic cost we found that floods have had the greatest impact in absolute terms, while hurricanes have the highest average impact on property damage and crop damage.
We first read in the data from the raw csv file from the NOAA.
filename <- "repdata%2Fdata%2FStormData.csv.bz2"
fullpath <- paste0("data/", filename)
if (!file.exists(fullpath)) {
dir.create("data")
fileUrl <- paste0("https://d396qusza40orc.cloudfront.net/", filename)
download.file(fileUrl, destfile = fullpath, method = "curl")
}
storm_data <- read.csv(fullpath, na.strings=c("NA", ""), strip.white = TRUE, stringsAsFactors = FALSE)
num_rows <- nrow(storm_data)
A lot of processing then had to be done on the data. Looking at the EVTYPE variable, for example:
eventTypeFactor <- as.factor(storm_data$EVTYPE)
length(levels(eventTypeFactor))
## [1] 985
This is a huge number of event types compared to the 48 types listed in the National Weather Service documentation for the data. Looking at what these types actually were showed a lot of useless types that told us nothing about the actual type of event, e.g. “Summary of June 16”. An initial clean-up was done to remove these, and also to reduce the number of states being analyzed to just the 50 states plus DC.
cleanData <- function(data) {
filtered_states <- rbind(filter(data, STATE %in% state.abb),
filter(data, STATE == "DC"))
filter(filtered_states, !grepl(".*summary.*", EVTYPE,
ignore.case = TRUE))
}
storm_data <- cleanData(storm_data)
This still left us with a very messy EVTYPE variable. However, the questions this report addresses are about events that have human or economic impact. It was therefore deemed appropriate to filter the data to only those events which had such impacts.
For this we looked at the FATALITIES, INJURIES, PROPDMG and CROPDMG variables. The latter two however needed to be interpreted and some assumptions had to be made here. Each of them had a corresponding “EXP” column whose values could be numeric, or a character like “m”, “b”, “k”, or “h” (or the uppercase equivalent), or something like a question mark. We made the assumption that in the case of a numeric entry it should be taken as an exponent of 10, whereas the single characters referred to “million”, “billion”, etc. We wrote a function to convert them to multipliers for the PROPDMG and CROPDMG variables. Where the character could not be interpreted in this way, e.g. the “?” character, a mulitplier of 1 was used.
getMultiplier <- function(ex) {
# Default multiplier is 1
multiplier <- 1
asnum <- suppressWarnings(as.numeric(ex))
if (!is.na(asnum)) {
multiplier <- 10^asnum
}
else {
if (ex %in% c("h", "H")) {
multiplier = 100
}
else if (ex %in% c("k", "K")) {
multiplier = 1000
}
else if (ex %in% c("m", "M")) {
multiplier = 10^6
}
else if (ex %in% c("b", "B")) {
multiplier = 10^9
}
}
multiplier
}
We could now use this to create proper numeric columns for property damage and crop damage.
mutated <- mutate(storm_data, property_damage = PROPDMG * sapply(PROPDMGEXP, getMultiplier), crop_damage = CROPDMG * sapply(CROPDMGEXP, getMultiplier))
Now we could filter the data to only those rows whether there was either human impact, as judged by number of fatalities or injuries, or economic impact, as judged by property damage or crop damage.
filtered <- filter(mutated, FATALITIES > 0 | INJURIES > 0 | property_damage > 0 | crop_damage > 0)
num_rows_filtered <- nrow(filtered)
With this filtering we have reduced the number of records from 902297 to 253352. This also had a huge impact on the number of different event type values in the data we had to deal with.
eventTypeFactor <- as.factor(filtered$EVTYPE)
length(levels(eventTypeFactor))
## [1] 471
However there was still massive cleanup to do on these event types and we had to create a mapping function to attempt to map the given values to values in the list of 48.
For the sake of space, the mapping function is now shown here. Please see the appendix. We converted the event types to proper case before passing them to the mapper.
evtypesProper <- capwords(filtered$EVTYPE, strict = TRUE)
mapped <- mapEventTypes(evtypesProper)
withEventTypeFactor <- mutate(filtered, eventTypeFactor=as.factor(mapped))
numEventTypes <- length(levels(withEventTypeFactor$eventTypeFactor))
This gets us down to 185 event types. Clearly more work could be done here but for the purposes of this analysis this was deemed a sufficient cleaup.
We could now start doing some analysis of the data.
summarized <- ddply(withEventTypeFactor, .(eventTypeFactor), summarise,
totalFatal = sum(FATALITIES),
totalInjured = sum(INJURIES),
totalAll = sum(FATALITIES) + sum(INJURIES),
meanFatal = mean(FATALITIES),
meanInjured = mean(INJURIES),
meanAll = mean(FATALITIES + INJURIES)) %>%
arrange(desc(totalAll), desc(totalFatal), desc(totalInjured))
top_ten <- summarized[1:10,]
top_ten
## eventTypeFactor totalFatal totalInjured totalAll meanFatal
## 1 Tornado 5636 91407 97043 0.141044571
## 2 Heat 3132 9209 12341 3.232198142
## 3 Thunderstorm Wind 705 9404 10109 0.005921285
## 4 Flood 1475 8593 10068 0.046034768
## 5 Lightning 807 5226 6033 0.060841375
## 6 Ice Storm 89 1990 2079 0.124824684
## 7 High Wind 296 1518 1814 0.047849984
## 8 Winter Storm 216 1338 1554 0.143141153
## 9 Hail 45 1465 1510 0.001688429
## 10 Heavy Snow 129 1034 1163 0.092539455
## meanInjured meanAll
## 1 2.28751971 2.42856428
## 2 9.50361197 12.73581011
## 3 0.07898406 0.08490534
## 4 0.26818763 0.31422240
## 5 0.39399879 0.45484017
## 6 2.79102384 2.91584853
## 7 0.24539282 0.29324281
## 8 0.88667992 1.02982107
## 9 0.05496773 0.05665616
## 10 0.74175036 0.83428981
The following figure plots the top ten event types affecting human population health as measured in thousands of deaths and injuries
barplot(top_ten$totalAll / 1000, names.arg = top_ten$eventTypeFactor, main="Deaths and Injuries (in 1000s) by event type", las=2, ylim=c(0, 100), col=top_ten$eventTypeFactor)
Next we’ll look at the economic impacts of the events using property damage and crop damage.
summarized_economic <- ddply(withEventTypeFactor, .(eventTypeFactor), summarise,
totalPropertyDamage = sum(property_damage)/(10^6),
totalCropDamage = sum(crop_damage)/(10^6),
totalAll = (sum(property_damage) + sum(crop_damage))/(10^6),
meanPropertyDamage = mean(property_damage),
meanCropDamage = mean(crop_damage),
meanAll = mean(property_damage + crop_damage)) %>%
arrange(desc(totalAll), desc(totalPropertyDamage), desc(totalCropDamage))
top_ten_economic <- summarized_economic[1:10,]
nrow(summarized_economic)
## [1] 185
barplot(top_ten_economic$totalAll, names.arg = top_ten_economic$eventTypeFactor, main="Cost of property and crop damage (in millions of dollars) by event type", las=2, col=top_ten_economic$eventTypeFactor)
Looking just at totals, we learned that tornados have had the biggest impact on human popluation health and floods have had the biggest economic impact.
It is also useful to look at the average impact of each type of event and see which type of event has the highest impact on average.
summarized[which.max(summarized$meanAll),]
## eventTypeFactor totalFatal totalInjured totalAll meanFatal meanInjured
## 28 Wild Fires 3 150 153 0.75 37.5
## meanAll
## 28 38.25
Here we see that although wild fires do not account for a large total number of fatalities and injuries, the average wild fire results in a higher number of injuries and deaths than other types of events.
summarized_economic[which.max(summarized_economic$meanAll),]
## eventTypeFactor totalPropertyDamage totalCropDamage totalAll
## 2 Hurricane (Typhoon) 82429.12 4948.941 87378.06
## meanPropertyDamage meanCropDamage meanAll
## 2 436132905 26184872 462317777
Hurricanes on average result in the highest economic cost.
The functions for mapping the event types to known types and for proper casing them.
mapEventTypes <- function(data) {
pattern <- c(
"Astronomical Low Tide.*",
".*Avalanche.*",
".*Blizzard.*",
".*Coastal Flood.*",
"(^Extreme).*Chill.*",
".*landslide.*",
".*Dense Fog.*",
".*Dense Smoke.*",
".*Drought.*",
".*Dust Devil.*",
".*Dust Storm.*",
".*Excessive Heat.*",
"(.*Extreme.*Cold.*|Hyp(o|er)thermia.*)",
".*Flash Flood.*",
".*Flood.*",
".*(Frost|Freeze).*",
".*Funnel Cloud.*",
".*Freezing Fog.*",
".*Hail.*",
".*Heat.*",
".*Heavy Rain.*",
".*Heavy Snow.*",
".*High Surf.*",
".*High Wind.*",
".*(Hurricane|Typhoon).*",
".*Ice Storm.*",
".*Lake-Effect Snow.*",
".*Lakeshore Flood.*",
".*Lightning.*",
".*Marine Hail.*",
".*Marine High Wind.*",
".*Marine Strong Wind.*",
".*Marine Thunderstorm Wind.*",
".*Rip Current.*",
".*Seiche.*",
".*Sleet.*",
".*Storm (Surge|Tide).*",
".*Strong Wind.*",
".*(thunderstorm|tstm).*",
".*Tornado.*",
".*Tropical Depression.*",
".*Tropical Storm.*",
".*Tsunami.*",
".*Volcanic Ash.*",
".*Waterspout.*",
".*Wildfire.*",
".*Winter Storm.*",
".*Winter Weather.*"
)
replace <- c(
"Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Funnel Cloud",
"Freezing Fog",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane (Typhoon)",
"Ice Storm",
"Lake-Effect Snow",
"Lakeshore Flood",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather"
)
qdap::mgsub(pattern, replace, data, fixed = FALSE, ignore.case = TRUE)
}
# Taken from the help page for tolower, this function
# converts strings to proper case.
capwords <- function(s, strict = FALSE) {
cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}