This report summarizes data from the National Climatic Data Center (NCDC) on the economic and health costs of naturally occurring climatic events in the United States. Specifically, it reports the top six causes of monetary damage, fatalities, and injuries among a set of 47 natural event types. The data suggest that excessive heat was the leading cause of fatalities, tornadoes the leading cause of injuries, and floods the leading cause of monetary loss. Data from the years 1996 to 2011 were cleaned and summarized cumulatively. The following section describes the data cleaning process.
All data processing was performed using the R programming language and the following publicly available R packages.
library(lubridate)
library(ggplot2)
library(plyr)
The NCDC data set was downloaded as a bzip2-compressed text file in CSV format from the following URL.
f <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
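# mode = "wb" keeps the compressed file intact on platforms that distinguish text and binary downloads
download.file(f, "stormData.bz2", mode = "wb")
Next, the compressed file was decompressed and read into an R data frame.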
stormFile <- bzfile("stormData.bz2")
storm <- read.csv(stormFile)
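As a small self-contained illustration of this step, the sketch below writes a toy table to a temporary bzip2-compressed CSV and reads it back through bzfile(), mirroring the calls above. The temporary file and its two rows are made up for the example and are not part of the NCDC data.

```r
# Toy round trip: write a bzip2-compressed CSV, then read it back via bzfile().
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical file name for the example
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens and closes the unopened connection itself.
back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(back)
```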
Several aspects of the data set were cleaned. One such aspect was the unusual representation of monetary values. Values for property damage and crop damage were each stored in the data set as two variables. For example, property damage values were represented by the variables PROPDMG and PROPDMGEXP. The PROPDMGEXP variable was meant to be a power-of-10 multiplier for the value stored in PROPDMG. The values stored in the exponent field should have been one of “B” (billions), “M” (millions), “K” (thousands), “H” (hundreds), or no entry. Data entry errors resulted in several other entries in place of these expected values.
This report uses the multiplier function listed below to translate each symbol into an appropriate multiplier. The symbols “+”, “-”, “?”, and no symbol result in a multiplier of 1, leaving the base field unchanged. All other unintended values in the data were single digits. The presumed purpose of these digits was to supply the value in the ones place. For example, a 42 in the PROPDMG field combined with a 3 in the PROPDMGEXP field is assumed to represent a value of 423. The function assigns a multiplier of 10 to every single-digit value in the exponent field, so the intended 423 is recorded as 42 multiplied by 10 (420). The rounding error introduced by this approach is assumed to be negligible.
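multiplier <- function(v) {
  # Translate an exponent symbol into a numeric multiplier (case-insensitive).
  v <- toupper(v)
  # Blank, "+", "-", and "?" map to 1; any other unexpected entry (single digits) maps to 10.
  v <- ifelse(v == "B", 1000000000,
       ifelse(v == "M", 1000000,
       ifelse(v == "K", 1000,
       ifelse(v == "H", 100,
       ifelse(v %in% c("", "+", "-", "?"), 1, 10)))))
  return(v)
}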
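To make the mapping concrete, the following sketch applies the same rules to a handful of made-up base values and exponent symbols. The function multiplier_lookup is a hypothetical lookup-table variant written for this illustration, not the function used in the report, and it is only meant to agree with it for non-missing inputs.

```r
# Hypothetical lookup-table variant of the multiplier rules described above.
multiplier_lookup <- function(v) {
  v <- toupper(as.character(v))
  known <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  out <- unname(known[v])                   # NA for anything not in the table
  out[v %in% c("", "+", "-", "?")] <- 1     # symbols that leave the base value unchanged
  out[is.na(out)] <- 10                     # remaining entries (single digits) get 10
  out
}

# Made-up example values, not taken from the NCDC data.
exp_codes <- c("B", "m", "K", "h", "", "+", "-", "?", "3")
base_vals <- c(1.5, 2, 25, 4, 7, 7, 7, 7, 42)
dollars   <- base_vals * multiplier_lookup(exp_codes)

print(data.frame(base_vals, exp_codes, dollars))
```

The last row shows the rounding described above: a base of 42 with exponent "3" is stored as 420 rather than the presumed 423.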
The NCDC data set spans the years from 1950 to 2011. However, the full set of 47 event types was not recorded until 1996; earlier years cover only some of them. Therefore this report restricts its analysis to the years from 1996 to 2011.
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
storm$YEAR <- year(storm$BGN_DATE)
storm <- storm[storm$YEAR >= 1996,]
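The year filter can be checked on a few made-up rows. The sketch below uses base R date parsing in place of lubridate so that it runs without extra packages; the date strings are assumed to follow the month/day/year layout that mdy_hms expects, and the toy data frame is hypothetical.

```r
# Made-up rows that straddle the 1996 cutoff.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "8/29/2005 0:00:00"),
  EVTYPE = c("TORNADO", "HAIL", "FLOOD", "HURRICANE/TYPHOON"),
  stringsAsFactors = FALSE
)

# Base-R substitute for mdy_hms() and year().
parsed   <- as.POSIXct(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
toy$YEAR <- as.integer(format(parsed, "%Y"))

# Keep only events that began in 1996 or later.
recent <- toy[toy$YEAR >= 1996, ]
print(recent)
```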
Damage amounts from property damage and crop damage were combined into a single variable using the multiplier function described above.
storm$PROPDMG <- storm$PROPDMG * multiplier(storm$PROPDMGEXP)
storm$CROPDMG <- storm$CROPDMG * multiplier(storm$CROPDMGEXP)
storm$DAMAGE <- storm$PROPDMG + storm$CROPDMG
Variables extraneous to the report were eliminated from the dataset. The variables kept were EVTYPE, FATALITIES, INJURIES, DAMAGE, and YEAR.
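# Select by name rather than by column position (equivalent to columns 8, 23, 24, 39, 38)
storm <- storm[, c("EVTYPE", "FATALITIES", "INJURIES", "DAMAGE", "YEAR")]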
The greatest challenge of the data set was the state of the EVTYPE variable. The data set was intended to have only 47 different event types, but over 900 unique values were present in the EVTYPE field due to typographic and other user input errors. The following normalizations were performed, in order, to lessen the impact of this noise. All event types were converted to lower case. Leading whitespace was removed from the event types. All other runs of consecutive whitespace were replaced by a single space. The abbreviation “tstm” was expanded to “thunderstorm”. Suffixes were removed from all strings that began with one of the 47 standard event types. Once the normalizations were complete, only observations with an event type matching one of the standard event types were retained (97.4% of the original data from 1996 to 2011). It is assumed that the lost data did not significantly impact the analysis.
storm$EVTYPE <- tolower(storm$EVTYPE)
storm$EVTYPE <- sub("^\\s+", "", storm$EVTYPE)
# gsub() collapses every run of whitespace; sub() would only replace the first run
storm$EVTYPE <- gsub("\\s+", " ", storm$EVTYPE)
storm$EVTYPE <- sub("tstm", "thunderstorm", storm$EVTYPE)
evtypes <- c("astronomical low tide", "avalanche", "blizzard", "coastal flood",
"cold/wind chill", "debris flow", "dense fog", "dense smoke",
"drought", "dust devil", "dust storm", "excessive heat",
"extreme cold/wind chill", "flash flood", "flood", "freezing fog",
"frost/freeze", "hail", "heat", "heavy rain", "heavy snow",
"high surf", "high wind", "hurricane/typhoon", "ice storm",
"lakeshore flood", "lake-effect snow", "lightning", "marine hail",
"marine high wind", "marine strong wind",
"marine thunderstorm wind", "rip current", "seiche", "sleet",
"storm tide", "strong wind", "thunderstorm wind", "tornado",
"tropical depression", "tropical storm", "tsunami",
"volcanic ash", "waterspout", "wildfire", "winter storm",
"winter weather")
# Strip any suffix that follows a standard event type at the start of the string
for (ev in evtypes) {
  storm$EVTYPE <- sub(paste("^(", ev, ")", "(.+)", sep=""), "\\1", storm$EVTYPE)
}
storm <- storm[storm$EVTYPE %in% evtypes,]
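The normalization steps can be traced on a few made-up strings. In the sketch below, normalize_evtype is a hypothetical helper that bundles the steps listed above, std_types is a four-item subset of the standard list chosen for illustration, and the raw values are invented rather than taken from the data set.

```r
# Subset of the standard event types, for illustration only.
std_types <- c("flash flood", "flood", "thunderstorm wind", "heat")

# Made-up raw entries showing the kinds of noise described above.
raw <- c("  TSTM WIND", "Flash Flood/ Street", "FLOOD",
         "Thunderstorm  Wind (G40)", "Heat Wave", "Other")

# Hypothetical helper that applies the normalization steps in order.
normalize_evtype <- function(x, types) {
  x <- tolower(x)                                    # lower case
  x <- sub("^\\s+", "", x)                           # drop leading whitespace
  x <- gsub("\\s+", " ", x)                          # collapse whitespace runs
  x <- sub("tstm", "thunderstorm", x, fixed = TRUE)  # expand abbreviation
  for (ev in types) {
    x <- sub(paste0("^(", ev, ").+"), "\\1", x)      # strip trailing suffix
  }
  x
}

clean <- normalize_evtype(raw, std_types)
kept  <- clean %in% std_types

print(data.frame(raw, clean, kept))
print(mean(kept))   # share of rows retained
```

Five of the six toy rows survive the final filter; the report's 97.4% figure is the same ratio computed over the real data.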
The analysis concludes that the top causes of fatalities were excessive heat, tornadoes, flash floods, lightning, rip currents, and floods.
fatalities <- ddply(storm, .(EVTYPE), summarize, total_fatalities=sum(FATALITIES))
fatalities <- fatalities[order(fatalities$total_fatalities, decreasing = TRUE),]
topFatalities <- fatalities[1:6,]
# Plot only the top six rows so no event types are dropped with a warning
g <- ggplot(topFatalities, aes(EVTYPE, total_fatalities))
g <- g + geom_bar(stat="identity")
g <- g + ggtitle("Top Events Causing Fatalities in the U.S. (1996 - 2011)")
g <- g + xlab("Event") + ylab("Fatalities")
g + scale_x_discrete(limits=topFatalities$EVTYPE)
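The summarize, sort, and take-the-top pattern used here can be sketched on a toy table. The example below substitutes base R's aggregate() for plyr's ddply() so that it runs without extra packages; the counts are made up and only two top rows are kept for brevity.

```r
# Made-up event records, not taken from the NCDC data.
toy <- data.frame(
  EVTYPE = c("tornado", "flood", "tornado", "heat", "flood", "lightning"),
  FATALITIES = c(3, 1, 2, 10, 0, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type (base-R substitute for ddply + summarize).
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort from most to fewest fatalities and keep the leading rows.
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
top2   <- head(totals, 2)
print(top2)
```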
The top causes of injuries were tornadoes, floods, excessive heat, thunderstorm wind, lightning, and flash floods.
injuries <- ddply(storm, .(EVTYPE), summarize, total_injuries=sum(INJURIES))
injuries <- injuries[order(injuries$total_injuries, decreasing = TRUE),]
topInjuries <- injuries[1:6,]
# Plot only the top six rows so no event types are dropped with a warning
g <- ggplot(topInjuries, aes(EVTYPE, total_injuries))
g <- g + geom_bar(stat="identity")
g <- g + ggtitle("Top Events Causing Injuries in the U.S. (1996 - 2011)")
g <- g + xlab("Event") + ylab("Injuries")
g + scale_x_discrete(limits=topInjuries$EVTYPE)
The top causes of monetary damage were floods, hurricanes/typhoons, tornadoes, hail, flash floods, and drought.
damage <- ddply(storm, .(EVTYPE), summarize, total_damage=sum(DAMAGE))
damage <- damage[order(damage$total_damage, decreasing = TRUE),]
topDamage <- damage[1:6,]
topDamage$total_damage <- topDamage$total_damage / 1000000000
g <- ggplot(topDamage, aes(EVTYPE, total_damage))
g <- g + geom_bar(stat="identity")
g <- g + ggtitle("Top Events Causing Economic Damage in the U.S. (1996 - 2011)")
g <- g + xlab("Event") + ylab("Damage (Billions USD)")
g + scale_x_discrete(limits=topDamage$EVTYPE[1:6])