The goal of this data analysis is to examine the damage done by various weather events in the United States and to increase awareness of the extent of that damage. The data comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and covers events from 1950 to November 2011. More information about the data is available in the National Weather Service Storm Data Documentation. This document gives an overview of the danger posed by these weather events by answering two questions: which types of events are most harmful to population health, and which types of events cause the greatest economic damage?
Aggregating all the available data shows that tornadoes are the most harmful in terms of human casualties and that floods cause the most economic damage.
The following code makes sure that all the required packages for the analysis are installed and loaded.
# loading/installing the required packages
load_and_install_package <- function(package_name) {
  # install the package only if it cannot be loaded, then load it
  if (!require(package_name, character.only = TRUE)) {
    install.packages(package_name)
    library(package_name, character.only = TRUE)
  }
}
load_and_install_package("dplyr")
load_and_install_package("knitr")
load_and_install_package("ggplot2")
load_and_install_package("RColorBrewer")
load_and_install_package("R.utils")  # provides bunzip2()
This code fragment assumes that the compressed dataset is in the R working directory. It decompresses the archive if necessary and reads the resulting CSV file.
# unzipping the dataset
if (!file.exists("repdata-data-StormData.csv")) {
  if (!file.exists("repdata-data-StormData.csv.bz2")) {
    stop("The Storm dataset needs to be in the current directory")
  }
  # bunzip2() comes from the R.utils package; remove = FALSE keeps the archive
  R.utils::bunzip2("repdata-data-StormData.csv.bz2", remove = FALSE)
}
# reading the data
data <- read.csv("repdata-data-StormData.csv", stringsAsFactors = FALSE)
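As a side note on the decompression step, base R can usually read a bzip2-compressed CSV directly, because `read.csv()` opens the path through `file()`, which detects the compression. The sketch below shows this on a small made-up data frame written to a temporary file; the names `toy`, `archive` and `toy_read` are illustrative and not part of the analysis.

```r
# Hypothetical toy data standing in for the storm dataset
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1),
                  stringsAsFactors = FALSE)

# Write a bzip2-compressed CSV to a temporary file
archive <- tempfile(fileext = ".csv.bz2")
con <- bzfile(archive, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read the compressed file directly, without decompressing it first
toy_read <- read.csv(archive, stringsAsFactors = FALSE)
print(toy_read)
```

Reading the archive directly avoids keeping a second, uncompressed copy on disk, at the cost of decompressing on every read.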
To improve the accuracy of the aggregate statistics, event types whose names contain specific keywords are reassigned to more general categories. This is necessary because some kinds of event fall into a single bin (flood, for example), while others are spread across multiple bins, such as drought (drought/heat/dry/warm).
To follow the tidy data principle that each column should represent one variable, the property damage and crop damage columns are converted from their raw format (base and exponent) to a single numeric value, and uninterpretable values are discarded. The exponents are interpreted using the following assumptions: K or k means thousands (10^3), M or m means millions (10^6), B or b means billions (10^9), an empty exponent leaves the base value unchanged (10^0), a digit is taken as a power of ten, and any other symbol is treated as missing.
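The keyword grouping can be illustrated on a handful of made-up labels. This is a minimal sketch with shortened patterns, not the full rule set used in the analysis; the vectors `labels` and `grouped` are hypothetical. It shows that the rules are applied in sequence, so the order decides which category a label with several keywords ends up in.

```r
# Made-up event labels, not taken from the dataset
labels <- c("TSTM WIND", "Winter Storm", "FLASH FLOOD", "Dust Devil", "HEAT WAVE")
grouped <- labels

# Each rule overwrites every label that matches its pattern with the category name
grouped[grepl("tornado|whirlwind|dust\\sdevil", grouped, ignore.case = TRUE)] <- "TORNADO"
grouped[grepl("ice|snow|cold|winter", grouped, ignore.case = TRUE)] <- "WINTER"
grouped[grepl("wind|rain|tstm|hail|storm", grouped, ignore.case = TRUE)] <- "STORM"
grouped[grepl("flood|fld", grouped, ignore.case = TRUE)] <- "FLOOD"
grouped[grepl("dry|fire|heat|drought|warm", grouped, ignore.case = TRUE)] <- "DROUGHT"

print(data.frame(original = labels, grouped = grouped))
```

"Winter Storm" ends up in WINTER rather than STORM because the winter rule runs first and the replacement label "WINTER" no longer matches the storm pattern.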
# grouping event types into broader categories (the rules are applied in order)
eventNames <- data$EVTYPE
eventNames[grepl("tornado|whirlwind|dust\\sdevil", eventNames, ignore.case=TRUE)] <- "TORNADO"
eventNames[grepl("ice|snow|cold|winter|frost|hypothermia|freez|sleet",
eventNames, ignore.case=TRUE)] <- "WINTER"
eventNames[grepl("wind|rain|thunderstorm|tstm|hail|storm", eventNames, ignore.case=TRUE)] <- "STORM"
eventNames[grepl("flood|fld", eventNames, ignore.case=TRUE)] <- "FLOOD"
eventNames[grepl("dry|fire|heat|drought|warm|hyperthermia", eventNames, ignore.case=TRUE)] <- "DROUGHT"
eventNames[grepl("lightn?ing", eventNames, ignore.case=TRUE)] <- "LIGHTNING"
eventNames[grepl("hurricane|cyclone|typhoon", eventNames, ignore.case=TRUE)] <- "HURRICANE"
data$EVTYPE <- as.factor(eventNames)
# cleaning up the property damage and crop damage columns
data$PROPDMGEXP[data$PROPDMGEXP == "K" | data$PROPDMGEXP == "k"] <- "3"
data$PROPDMGEXP[data$PROPDMGEXP == "M" | data$PROPDMGEXP == "m"] <- "6"
data$PROPDMGEXP[data$PROPDMGEXP == "B" | data$PROPDMGEXP == "b"] <- "9"
data$PROPDMGEXP[data$PROPDMGEXP == ""] <- "0"
data$PROPDMGEXP <- suppressWarnings(as.numeric(data$PROPDMGEXP))
data$CROPDMGEXP[data$CROPDMGEXP == "K" | data$CROPDMGEXP == "k"] <- "3"
data$CROPDMGEXP[data$CROPDMGEXP == "M" | data$CROPDMGEXP == "m"] <- "6"
data$CROPDMGEXP[data$CROPDMGEXP == "B" | data$CROPDMGEXP == "b"] <- "9"
data$CROPDMGEXP[data$CROPDMGEXP == ""] <- "0"
data$CROPDMGEXP <- suppressWarnings(as.numeric(data$CROPDMGEXP))
# creating new columns that contain the actual values
data$PROPDAMAGE <- data$PROPDMG * (10 ^ data$PROPDMGEXP)
data$CROPDAMAGE <- data$CROPDMG * (10 ^ data$CROPDMGEXP)
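The base-and-exponent conversion above can also be written as a lookup, which keeps the letter-to-power mapping in one place. The following is a minimal sketch under the same assumptions as the code above (K, M and B in either case, empty string as 10^0, digits as powers of ten, everything else missing). The helper `exp_to_power` and the sample values are hypothetical.

```r
# Hypothetical helper: map an exponent code to a power of ten
exp_to_power <- function(code) {
  lookup <- c(K = 3, M = 6, B = 9)
  code <- toupper(code)
  # Letters come from the lookup, an empty code means 10^0
  power <- ifelse(code == "", 0, unname(lookup[code]))
  # Remaining codes: digits are used as they are, anything else becomes NA
  digits <- suppressWarnings(as.numeric(code))
  ifelse(is.na(power), digits, power)
}

# Made-up base values and exponent codes
dmg  <- c(25, 2.5, 1, 10, 7)
code <- c("K", "m", "B", "", "?")
value <- dmg * 10 ^ exp_to_power(code)
print(value)
```

The last value is `NA` because "?" cannot be interpreted as an exponent, which mirrors how the analysis discards such entries.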
The following code will extract from the dataset the necessary data for analyzing the effects of the monitored events on population health. These effects will be quantified by the sum of injuries and deaths and they will be aggregated by the event type. This dataset will be sorted in descending order by the total number of casualties.
populationData <- data %>%
  filter(INJURIES > 0 | FATALITIES > 0) %>%
  mutate(EventType = EVTYPE,
         Casualties = INJURIES + FATALITIES) %>%
  select(EventType, Casualties) %>%
  group_by(EventType) %>%
  summarise(
    MeanCasualties = mean(Casualties, na.rm = TRUE),
    TotalCasualties = sum(Casualties, na.rm = TRUE)) %>%
  arrange(desc(TotalCasualties))
In order to analyze the economic damage caused by weather events, the property damage and crop damage columns will be summed together into a single column, then all the data will be aggregated by the event type and sorted in descending order by the total economic damage.
economicData <- data %>%
  filter(PROPDAMAGE > 0 | CROPDAMAGE > 0) %>%
  mutate(EventType = EVTYPE,
         EconomicDamage = PROPDAMAGE + CROPDAMAGE) %>%
  select(EventType, EconomicDamage) %>%
  group_by(EventType) %>%
  summarise(
    MeanEconomicDamage = mean(EconomicDamage, na.rm = TRUE),
    TotalEconomicDamage = sum(EconomicDamage, na.rm = TRUE)) %>%
  arrange(desc(TotalEconomicDamage))
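One property of this aggregation is worth illustrating: adding two columns yields `NA` whenever either value is `NA`, so a row with an uninterpretable property damage exponent also loses its crop damage once `na.rm = TRUE` drops it. The sketch below uses made-up numbers to show the effect and one possible alternative that treats a missing component as zero. The helper `na_to_zero` is hypothetical, and the analysis itself uses the plain sum.

```r
# Made-up damage values; NA marks an uninterpretable exponent
prop <- c(1000, NA, 500)
crop <- c(200, 300, NA)

# Plain addition propagates NA, so rows 2 and 3 are dropped by na.rm = TRUE
naive_total <- sum(prop + crop, na.rm = TRUE)

# Possible alternative: treat a missing component as zero before adding
na_to_zero <- function(x) ifelse(is.na(x), 0, x)
adjusted_total <- sum(na_to_zero(prop) + na_to_zero(crop))

print(c(naive = naive_total, adjusted = adjusted_total))
```

Whether the adjusted total is preferable depends on how much trust is placed in rows whose exponent could not be read.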
With the population health and economic data aggregated and sorted in the data processing step, all that remains is to visualize the results. Because there are still more than 50 event types, only the top six categories by total are plotted for each question.
topPopulationEvents <- top_n(populationData, 6, TotalCasualties)
ggplot(topPopulationEvents, aes(x = EventType, y = TotalCasualties)) +
  geom_bar(stat = "identity", aes(fill = EventType), color = "black") +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Total Casualties by Event Type") +
  xlab("Event type") +
  ylab("Total casualties")
This plot shows that tornadoes cause the most human casualties by a large margin, with a total close to 100,000 over the period covered. The total casualties from tornadoes are greater than the combined casualties from the next five event categories.
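A comparison like this is easier to read when the bars are ordered by height rather than alphabetically, which is how ggplot2 arranges a discrete x axis unless the factor levels say otherwise. The sketch below reorders the levels with `reorder()` on made-up totals; the data frame `toy` is hypothetical, and the plot is only built if ggplot2 is installed.

```r
# Made-up totals, for illustration only
toy <- data.frame(EventType = c("FLOOD", "STORM", "TORNADO"),
                  TotalCasualties = c(10, 25, 40))

# reorder() sorts the factor levels by a numeric key; negating it gives descending order
toy$EventType <- reorder(toy$EventType, -toy$TotalCasualties)
print(levels(toy$EventType))

# Build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = EventType, y = TotalCasualties)) +
    ggplot2::geom_col()
}
```

The same reordering could be applied to `EventType` in either of the two plots in this report without changing the underlying data.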
topEconomicEvents <- top_n(economicData, 6, TotalEconomicDamage)
ggplot(topEconomicEvents, aes(x = EventType, y = TotalEconomicDamage / 10^9)) +
  geom_bar(stat = "identity", aes(fill = EventType), color = "black") +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Total Economic Damage by Event Type") +
  xlab("Event type") +
  ylab("Total economic damage (billion $)")
The plot above shows that floods caused the most economic damage, with a cumulative total over the 1950-2011 period that exceeds 150 billion dollars. Here the distribution is more even: hurricanes and storms each caused about half as much economic damage as floods.