This analysis explores the NOAA Storm Database to answer two basic questions about severe weather events: 1) which events are most harmful to population health, and 2) which events have the greatest economic consequences. For each question, a metric is calculated from the relevant fields in the database, and the top 10 event types are ranked and plotted. The analysis shows that tornadoes are the most harmful to population health, while floods cause the greatest economic damage.
Load the libraries that are used in the analysis.
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
Download the Storm Data file from the web and read it into a data frame.
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "repdata_data_StormData.csv.bz2")
df <- read.csv("repdata_data_StormData.csv.bz2")
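Re-running the analysis repeats the download each time. A minimal sketch of a guard that fetches the file only when it is not already on disk follows; `fetch_if_missing` and its `downloader` argument are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download the file
# only if the destination does not already exist.
fetch_if_missing <- function(url, destfile, downloader = utils::download.file) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed file in binary mode
    downloader(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}
```

Under these assumptions, fetch_if_missing(fileURL, "repdata_data_StormData.csv.bz2") could stand in for the download.file call, with read.csv used as before.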
Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
For this question we analyze the number of injuries and fatalities caused directly or indirectly by the event. Since fatalities are much more serious than injuries, we compute a weighted metric which reflects the impact on population health as:
impact = injuries + 10 x fatalities
The data is aggregated by event type to get total injuries and total fatalities for each type, and the impact is calculated as above. The result is sorted by impact in descending order and only the top 10 rows are retained.
## summarize
sdf <- ddply(df, c("EVTYPE"), summarize,
             inj = sum(INJURIES),
             fat = sum(FATALITIES))
## calculate the impact metric
sdf$impact <- sdf$inj + 10 * sdf$fat
## sort by impact
sdf <- sdf[order(-sdf$impact),]
## take the first 10 rows
sdf_top10 <- sdf[1:10,]
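To see how the factor of 10 affects the ranking, here is a small sketch of the same metric on made-up numbers. The `toy` data frame is purely illustrative and not taken from the Storm Database, and base R's `aggregate` is used in place of `ddply` so that the example runs without extra packages.

```r
# Made-up records for illustration only (not real Storm Data values)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD"),
  INJURIES   = c(5, 2, 10, 0),
  FATALITIES = c(1, 3, 0, 2)
)

# total injuries and fatalities per event type
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)

# impact = injuries + 10 x fatalities
totals$impact <- totals$INJURIES + 10 * totals$FATALITIES

# sort by impact, largest first
totals <- totals[order(-totals$impact), ]
```

In this toy data TORNADO has the most injuries, yet HEAT ranks first because each fatality is weighted as heavily as ten injuries.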
Question 2: Across the United States, which types of events have the greatest economic consequences?
For this question we analyze the property and crop damage caused by the event. A damage metric is calculated as the sum total of the property and crop damage.
The data is subset to just the columns of interest for this question, namely the event type, property damage, and crop damage. Each damage attribute is stored in two parts: a numeric value and a magnitude. Alphabetical characters signify the magnitude: “H” for hundreds, “K” for thousands, “M” for millions, and “B” for billions. To calculate the damage metric, these characters are converted to numeric multipliers as follows: H = 100, K = 1,000, M = 1,000,000, B = 1,000,000,000.
Some records have invalid magnitude codes whose value cannot be interpreted. These rows are removed from the data.
The damage metric is calculated as:
damage = property damage * property magnitude + crop damage * crop magnitude
A helper function performs this calculation and handles any NA values in the data: a missing damage value is treated as 0, and a missing magnitude (such as a blank code, which becomes NA on numeric conversion) is treated as 1. The damage is then summed by event type and sorted, and only the top 10 rows are retained.
## subset to the columns of interest
dmg_df <- df[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
## replace the alphabetical exponent characters with values
## H=100,K=1000,M=1000000,B=1000000000
dmg_df$PROPDMGEXP <- revalue(dmg_df$PROPDMGEXP, c("B"="1000000000", "M"="1000000", "m"="1000000", "K"="1000","h"="100","H"="100"))
dmg_df$CROPDMGEXP <- revalue(dmg_df$CROPDMGEXP, c("B"="1000000000", "M"="1000000", "m"="1000000", "K"="1000","k"="1000"))
# convert factors to character, so the filter function can be used
dmg_df$PROPDMGEXP <- as.character(dmg_df$PROPDMGEXP)
dmg_df$CROPDMGEXP <- as.character(dmg_df$CROPDMGEXP)
## remove the rows which have invalid exponents
valid <- c("", "0", "100", "1000", "1000000", "1000000000")
dmg_df <- filter(dmg_df, PROPDMGEXP %in% valid)
dmg_df <- filter(dmg_df, CROPDMGEXP %in% valid)
# convert to numeric so these can be used in damage_metric calculation
dmg_df$PROPDMGEXP <- as.numeric(dmg_df$PROPDMGEXP)
dmg_df$CROPDMGEXP <- as.numeric(dmg_df$CROPDMGEXP)
## compute the damage metric
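damage_metric <- function(v1, v2, v3, v4) {
  v1[is.na(v1)] <- 0  # missing property damage counts as 0
  v2[is.na(v2)] <- 1  # missing property magnitude counts as 1
  v3[is.na(v3)] <- 0  # missing crop damage counts as 0
  v4[is.na(v4)] <- 1  # missing crop magnitude counts as 1
  return(v1 * v2 + v3 * v4)
}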
dmg_df <- within(dmg_df, damage <- damage_metric(PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
## summarize
sdmg_df <- ddply(dmg_df, c("EVTYPE"), summarize,
                 damage_sum = sum(damage))
## sort by total damage
sdmg_df <- sdmg_df[order(-sdmg_df$damage_sum),]
## take the first 10 rows
sdmg_df_top10 <- sdmg_df[1:10,]
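The magnitude conversion can also be sketched with a named lookup vector, which keeps the mapping in one place. This is an alternative illustration rather than the method used above: `multiplier`, `exp_to_multiplier`, and the `toy` records are hypothetical, and unlike the code above this sketch returns NA for every code outside the table, including digit codes such as "0".

```r
# Lookup table for the magnitude codes described in the text
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map magnitude codes to numeric multipliers
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))  # treat "k" and "K" alike
  out <- unname(multiplier[code])      # codes not in the table become NA
  out[code %in% ""] <- 1               # blank code: value is used unscaled
  out
}

# Made-up records for illustration only (not real Storm Data values)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "m", ""),
  CROPDMG    = c(0, 1, 5),
  CROPDMGEXP = c("", "B", "k"),
  stringsAsFactors = FALSE
)

# damage = property damage * magnitude + crop damage * magnitude
toy$damage <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_to_multiplier(toy$CROPDMGEXP)
```

Rows whose multiplier comes back as NA could then be dropped before summing, which plays the same role as the filter on valid exponents.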
The following chart shows the top 10 events for impact on population health. Tornadoes have by far the greatest impact, followed by Excessive Heat and Lightning.
## plot the population health impact chart
ggplot(sdf_top10, aes(reorder(EVTYPE, -impact), impact)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 10 Events for Impact on Population Health") +
  labs(x = "Event Type", y = "Impact")
The following chart shows the top 10 events for economic impact. Floods are the leading cause, followed by Hurricanes/Typhoons and then Tornadoes.
## plot the economic impact result
ggplot(sdmg_df_top10, aes(reorder(EVTYPE, -damage_sum), damage_sum)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 10 Events for Economic Impact") +
  labs(x = "Event Type", y = "Total Damage (USD)")
The analysis shows that tornadoes are the most harmful to population health, while floods cause the greatest economic damage.