Using the NOAA storm database, analyses were performed to identify weather events that caused the largest numbers of fatalities and injuries and the highest dollar value damages to property and crops in the United States from the year 1950 until November 2011. The data for weather events was recorded over the years in inconsistent ways, necessitating substantial cleaning and grouping in order to allow extraction of meaningful results. Overall results show that tornados were the dominant weather event in terms of loss of life and injury, while floods caused the most property and crop damage.
There are two questions to be answered using the storm data: 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
To answer the questions, the database repdata-data-StormData.csv.bz2 was first retrieved from the course website on 17 June 2014 in compressed form, after which it was uncompressed for processing on a local computer, resulting in the file repdata-data-StormData.csv. This csv file was read into a data frame in R.
# Read in the storm data csv file - 214 seconds
dat <- read.csv("repdata-data-StormData.csv")
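As a side note, `read.csv` can read bzip2-compressed files directly, so the manual decompression step is optional. The minimal sketch below shows this on a tiny made-up file written to a temporary path; the data values and the names `toy`, `tmp` and `toyBack` are illustrative assumptions, not part of the analysis.

```r
# Toy stand-in for the storm data (values are illustrative only)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path through a file connection, which detects the
# compression and decompresses on the fly
toyBack <- read.csv(tmp, stringsAsFactors = FALSE)
```

Under the same assumption, the compressed storm file could be read with `read.csv("repdata-data-StormData.csv.bz2")` without first uncompressing it.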
The database includes more variables than are of interest for this analysis, so the columns of interest are extracted to reduce the data frame size.
# Id the variables of interest
theseVars <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
# Subset the data to the variables of interest
dat <- dat[, theseVars]
The data is further reduced by removing all the events that had no fatalities, injuries, property damage or crop damage.
# Subset data to only events that caused fatalities, injuries, property or crop damage.
dat <- dat[dat$FATALITIES!=0 | dat$INJURIES!=0 | dat$PROPDMG!=0 | dat$CROPDMG!=0,]
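The same filter can be written more compactly by counting, for each row, how many of the impact columns are non-zero. The sketch below uses a small made-up data frame; `toy`, `impactVars`, `hasImpact` and `toyImpact` are hypothetical names introduced only for illustration, and the approach assumes the impact columns contain no missing values.

```r
# Toy events: two with some impact, two with none (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "FOG"),
  FATALITIES = c(2, 0, 0, 0),
  INJURIES   = c(10, 0, 0, 0),
  PROPDMG    = c(25, 0, 5, 0),
  CROPDMG    = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Columns that measure impact
impactVars <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows where at least one impact column is non-zero
hasImpact <- rowSums(toy[, impactVars] != 0) > 0

# Keep only the events that caused some harm or damage
toyImpact <- toy[hasImpact, ]
```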
There are 985 event types listed in the data set with many of them being very similar. Extracting meaningful results requires grouping similar events together. The grouping is not straightforward as there are many weather phenomena that are mixed in an EVTYPE. For example, WINDS is included in EVTYPE values that also include THUNDERSTORM, SNOW, HURRICANE, TORNADO, FLOOD and others. So for grouping purposes it becomes necessary to choose a predominant weather phenomenon for a particular event.
The groupings chosen are as follows. HURRICANES: because hurricanes have such a wide-ranging impact and include both winds and flooding, they are given their own group. TORNADOS: although smaller in range of impact, their rapid development and combination of thunderstorms and winds set them apart as a distinct group. FLOODS: all flood events that were not reported in conjunction with hurricanes or tornados are included in the flood group. SNOW: snow events that haven't been grouped with the above phenomena are grouped. HEAT: a weather event distinct from all the preceding and responsible for significant fatalities. WIND and THUNDERSTORM: the code below also defines these two groups for wind and thunderstorm events not captured above. OTHER: all other events not previously grouped (labeled ALL OTHER in the plots).
A new variable is added to the data frame to identify each event as one of the chosen groups.
# Define the names of the event groups
EVGNAMES <- c("HURRICANE", "TORNADO", "FLOOD", "SNOW", "HEAT", "WIND", "THUNDERSTORM")
# Format so the names can be used in a regex
toMatch <- paste(EVGNAMES, collapse="|")
# Add the event group type to each event
library(stringr)
# Case-insensitive match; keeps the first group name found in each EVTYPE
dat$EVGROUP <- toupper(str_extract(dat$EVTYPE, regex(toMatch, ignore_case = TRUE)))
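One subtlety of matching with an alternation pattern is that the match returned is the group name occurring earliest in the EVTYPE string, not the name listed first in the pattern. The sketch below contrasts that behavior with an explicit priority-ordered assignment. It uses base R (`regexpr`, `regmatches`, `grepl`) as a stand-in for `str_extract`, and the EVTYPE strings and the names `evtype`, `groupNames`, `leftmost` and `priority` are made up for illustration. The priority order shown is an assumption based on the order in which the groups are described above.

```r
# Hypothetical EVTYPE values that mention more than one phenomenon
evtype <- c("HIGH WINDS/SNOW", "THUNDERSTORM WINDS",
            "Hurricane Opal/High Winds", "FLOOD/TORNADO", "DENSE FOG")

# Group names in assumed priority order (highest priority first)
groupNames <- c("HURRICANE", "TORNADO", "FLOOD", "SNOW", "HEAT",
                "WIND", "THUNDERSTORM")

# Leftmost match: the alternation returns whichever name occurs first
# in the string
pattern <- paste(groupNames, collapse = "|")
m <- regexpr(pattern, evtype, ignore.case = TRUE)
leftmost <- rep(NA_character_, length(evtype))
leftmost[m > 0] <- toupper(regmatches(evtype, m))

# Priority match: walk the names from lowest to highest priority so that
# higher-priority groups overwrite lower-priority ones
priority <- rep(NA_character_, length(evtype))
for (g in rev(groupNames)) {
  priority[grepl(g, evtype, ignore.case = TRUE)] <- g
}

# Compare the two assignments side by side
print(data.frame(evtype, leftmost, priority))
```

In this toy example the two approaches disagree on the first, second and fourth strings, which suggests that the choice of matching rule can move events between groups. Whether that changes the overall ranking in the real data would need to be checked.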
With event types grouped, fatalities, injuries, and economic damage are totaled for each event group.
First the fatalities and injuries are totaled.
# Sum fatalities and injuries by event group
library(plyr)
groupFatInj <- ddply(dat, "EVGROUP", function(x) sum(x$FATALITIES))
groupFatInj$INJURIES <- ddply(dat, "EVGROUP", function(x) sum(x$INJURIES))[,2]
# Add names
names(groupFatInj) <- c("EVGROUP", "FATALITIES", "INJURIES")
# Order decreasing
groupFatInj <- groupFatInj[order(groupFatInj$FATALITIES, decreasing=TRUE), ]
# Make label names to use later for plots
xnames1 <- groupFatInj$EVGROUP
xnames1[which(is.na(xnames1))] <- "ALL OTHER"
# Plots for fatalities and injuries are generated below in the Results section
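For comparison, the same per-group totals can be computed in base R with `aggregate`. This is a sketch on made-up numbers, not the method used above; `toy` and `toyTotals` are hypothetical names. Because the formula interface of `aggregate` drops rows whose group is NA, the unmatched events are labeled "ALL OTHER" before summing.

```r
# Toy grouped events; NA marks events that matched no group (values made up)
toy <- data.frame(
  EVGROUP    = c("TORNADO", "HEAT", NA, "TORNADO", NA),
  FATALITIES = c(5, 3, 1, 2, 4),
  INJURIES   = c(40, 0, 2, 10, 1),
  stringsAsFactors = FALSE
)

# Label unmatched events explicitly so they are not dropped as NA
toy$EVGROUP[is.na(toy$EVGROUP)] <- "ALL OTHER"

# Sum both columns per group in a single call
toyTotals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVGROUP,
                       data = toy, FUN = sum)

# Order by fatalities, largest first
toyTotals <- toyTotals[order(toyTotals$FATALITIES, decreasing = TRUE), ]
print(toyTotals)
```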
Next the property and crop damages are summed. First, however, the multipliers must be applied. These are given in the *EXP fields (PROPDMGEXP and CROPDMGEXP), with H meaning 100, K meaning 1,000, M meaning 1,000,000 and B meaning 1,000,000,000. Other values in these fields are not defined in the code book and must be dealt with. The choice was made to treat every such entry as a multiplier of one, so that the raw damage number is used unscaled.
# Sum property damage by event group
# First replace the PROPDMGEXP string with the corresponding integer
# There are many entries in the PROPDMGEXP field that do not comply with the codebook
# description - these are ignored and the raw PROPDMG number used without scaling.
# Convert the PROPDMGEXP and CROPDMGEXP variables to a character for further processing
dat$PROPDMGEXP <- as.character(dat$PROPDMGEXP)
dat$CROPDMGEXP <- as.character(dat$CROPDMGEXP)
# Change anything not in c("H","K","M","B") to a 1
validEXP <- c("H","K","M","B")
dat[!(dat$PROPDMGEXP %in% validEXP), "PROPDMGEXP"] <- 1
dat[!(dat$CROPDMGEXP %in% validEXP), "CROPDMGEXP"] <- 1
# Change the H's to 100
dat[dat$PROPDMGEXP=="H", "PROPDMGEXP"] <- 100
dat[dat$CROPDMGEXP=="H", "CROPDMGEXP"] <- 100
# Change the K's to 1000
dat[dat$PROPDMGEXP=="K", "PROPDMGEXP"] <- 1000
dat[dat$CROPDMGEXP=="K", "CROPDMGEXP"] <- 1000
# Change the M's to 1000000
dat[dat$PROPDMGEXP=="M", "PROPDMGEXP"] <- 1000000
dat[dat$CROPDMGEXP=="M", "CROPDMGEXP"] <- 1000000
# Change the B's to 1000000000
dat[dat$PROPDMGEXP=="B", "PROPDMGEXP"] <- 1000000000
dat[dat$CROPDMGEXP=="B", "CROPDMGEXP"] <- 1000000000
# Now that PROPDMGEXP and CROPDMGEXP are all numbers, make numeric
dat$PROPDMGEXP <- as.numeric(dat$PROPDMGEXP)
dat$CROPDMGEXP <- as.numeric(dat$CROPDMGEXP)
# Create PROPCASH and CROPCASH vectors by multiplying .DMG and .EXP
PROPCASH <- dat$PROPDMG * dat$PROPDMGEXP
CROPCASH <- dat$CROPDMG * dat$CROPDMGEXP
# Add the PROPCASH and CROPCASH vectors to the data frame for further calculations
dat$PROPCASH <- PROPCASH
dat$CROPCASH <- CROPCASH
# Compute damages by event type
groupPropCrop <- ddply(dat, "EVGROUP", function(x) sum(x$PROPCASH))
groupPropCrop$CROPCASH <- ddply(dat, "EVGROUP", function(x) sum(x$CROPCASH))[,2]
# Add names
names(groupPropCrop) <- c("EVGROUP", "PROPCASH", "CROPCASH")
# Order decreasing
groupPropCrop <- groupPropCrop[order(groupPropCrop$PROPCASH, decreasing=TRUE), ]
# Make label names to be used later for plots
xnames2 <- groupPropCrop$EVGROUP
xnames2[which(is.na(xnames2))] <- "ALL OTHER"
# Plots of property and crop damages are generated below in the Results section.
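The exponent-to-multiplier conversion above can also be expressed as a single lookup in a named vector. The following is a minimal sketch on made-up values; `multiplier`, `expToMultiplier`, `toyDmg`, `toyExp` and `toyCash` are hypothetical names. It differs from the code above in one deliberate way: lowercase codes are treated like their uppercase equivalents, which is an assumption about how such entries should be read and not part of the analysis above.

```r
# Named lookup vector for the documented exponent codes
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to numeric multipliers.
# Upper and lower case are treated alike (an assumption); any code not
# in the lookup, including the empty string, falls back to 1.
expToMultiplier <- function(code) {
  m <- unname(multiplier[toupper(as.character(code))])
  m[is.na(m)] <- 1
  m
}

# Toy damage figures with a mix of valid, lowercase and undefined codes
toyDmg <- c(25, 2.5, 10, 3, 7)
toyExp <- c("K", "M", "", "b", "+")

# Dollar amounts: damage number times its multiplier
toyCash <- toyDmg * expToMultiplier(toyExp)
```

Applied to the real data under the same assumption, this would read `dat$PROPDMG * expToMultiplier(dat$PROPDMGEXP)`, and likewise for the crop columns.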
Results are presented below, first for population health and then for property and crop damage.
Impacts of events on human health are presented in the following figure. As the two plots show, tornados dominate as the cause of the most fatalities and injuries among all the weather events. Somewhat surprising is that HEAT events were the second highest cause of fatalities.
The plots also indicate that more work is needed to further extract and classify or group the events in the ALL OTHER category since that group has a large number of fatalities and injuries.
# Plot fatalities and injuries by event type
# Plot only the six highest fatality-causing events
highFat <- groupFatInj[1:6,]
xnames <- xnames1[1:6]
par(mfrow=c(2,1), mar=c(2,4,2,1), cex=0.6)
with(highFat, {
    barplot(FATALITIES, main="Fatalities by Event Type",
            ylab="Fatalities", col=rainbow(20), names.arg=xnames)
    barplot(INJURIES, main="Injuries by Event Type",
            ylab="Injuries", col=rainbow(20), names.arg=xnames)
})
FIGURE 1
Impacts of events on property and crops are presented in the following figure. FLOOD events have the largest cost in terms of property damage by a large margin. The ALL OTHER category is second, indicating that further analysis is needed to break out major sub-groups within it, particularly since ALL OTHER is also the largest cause of crop damage.
# Plot property and crop damage by event type
# Plot only the six event groups with the highest property damage
highProp <- groupPropCrop[1:6,]
xnames <- xnames2[1:6]
# Scale by 1,000,000 for better plotting
highProp[,2:3] <- round(highProp[,2:3]/1000000)
par(mfrow=c(2,1), mar=c(2,4,2,1), cex=0.6)
with(highProp, {
    barplot(PROPCASH, main="Property Loss by Event Type",
            ylab="Millions of Dollars", col=rainbow(20), names.arg=xnames)
    barplot(CROPCASH, main="Crop Loss by Event Type",
            ylab="Millions of Dollars", col=rainbow(20), names.arg=xnames)
})
FIGURE 2
It is clear that further work is needed to analyze more of the “ALL OTHER” events and determine if there are major related groups among the events. There are 179,175 events in this group and each contains at least one non-zero entry for fatalities, injuries, property damage or crop damage. Initial analysis did not show any one major event with extremely high impact. Significant manual analysis would likely be required to tease out more detail from this group.