Analysis of the United States NOAA Severe Storm Event Data

Synopsis

Two questions will be answered in this report:

Across the United States, which types of events are most harmful with respect to population health?

Across the United States, which types of events have the greatest economic consequences?

This analysis uses data on severe storm events collected and published by the National Oceanic & Atmospheric Administration (NOAA). The National Weather Service (NWS) did not begin reporting on all 48 of the event types that NOAA classifies as severe storm events until 1996. To keep comparisons between event types consistent, this analysis uses only data recorded from January 1, 1996 onward, after the change in reporting procedure.

Data Processing

The data table was downloaded from this SOURCE on the 16th of September 2014 at 8:16am.

##Load data 
storm_df <- read.csv("stormdata.csv.bz2", stringsAsFactors = FALSE)

The first issue to address is filtering the data frame to remove all observations recorded prior to 1996.

##Load dplyr for data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
##Change data table format (as_tibble() replaces the deprecated tbl_df())
df <- as_tibble(storm_df)

##Format date column for sorting
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")

##Remove all data for events recorded prior to JAN 1st, 1996
##Compare against an explicit Date instead of relying on string coercion
df <- filter(df, BGN_DATE >= as.Date("1996-01-01"))
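As a quick check of the date handling, the sketch below runs the same parse-and-filter logic on a few invented date strings. The values are made up for illustration and only assume the month/day/year layout implied by the format string above; `as.Date()` ignores any characters after the part that matches the format.

```r
# Invented BGN_DATE-style strings (not taken from the real data set)
raw_dates <- c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "8/29/2005 0:00:00")

# Parse the month/day/year prefix; the trailing time is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Keep only dates on or after the cutoff
cutoff <- as.Date("1996-01-01")
kept <- parsed[parsed >= cutoff]
print(kept)
```

Only the last two dates survive the filter, which mirrors what the filter() call does to the full data frame.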

The next issue in the data processing step is that the data contain over 400 unique event type labels, when there should be only 48 according to NOAA documentation. The simplest thorough approach I found was to build a lookup table from the 48 official identifiers and match the miscoded labels to them by hand, using cut and paste. Labels that could not easily be identified were ignored. The lookup table I used is available at this GitHub Repository.

##Convert to lower case and remove spaces, hyphens and slashes to reduce the number of unique labels and make matching easier
df$EVTYPE <- tolower(df$EVTYPE)
df$EVTYPE <- gsub(" ", "", df$EVTYPE)
df$EVTYPE <- gsub("-", "", df$EVTYPE)
df$EVTYPE <- gsub("/", "", df$EVTYPE)
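To illustrate why this normalization reduces the number of distinct labels, here is a small sketch on invented labels. The helper name `normalize_evtype` is hypothetical, and it folds the three substitutions into one character class, which should be equivalent to the separate calls above.

```r
# Invented raw labels showing typical inconsistencies
raw <- c("TSTM WIND", "Tstm Wind", "THUNDERSTORM WIND/HAIL",
         "WINTER-WEATHER MIX", "tstmwind")

# Hypothetical helper: lower-case, then drop spaces, slashes and hyphens
normalize_evtype <- function(x) gsub("[ /-]", "", tolower(x))

normalized <- normalize_evtype(raw)

# Count distinct labels before and after normalization
n_before <- length(unique(raw))
n_after <- length(unique(normalized))
print(c(before = n_before, after = n_after))
```

In this toy case five spellings collapse to three, so fewer expressions are needed in the lookup table.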

##load in data table with uniform identifiers
event <- read.csv("./event.txt", sep=";", stringsAsFactors=FALSE)

##Create a new variable that gives each observation an identifier from the NOAA list of 48 events
##Do this by searching EVTYPE for the expressions recorded in the event data frame
##and marking the indices of the EVTYPE observations that match
##For indices that matched, attach the uniform event identifier in the new Event column
##grep() needs a character vector, so use df$EVTYPE (df[, "EVTYPE"] returns a one-column tibble)
df$Event <- NA_character_
for (i in seq_len(nrow(event))) {
        index <- grep(event[i, "Expressions_Used"], df$EVTYPE)
        if (length(index) > 0) {
                df$Event[index] <- event[i, "Event_Name"]
        }
}
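The matching loop can be tried on a toy example. The lookup rows below are hypothetical stand-ins: only the column names `Expressions_Used` and `Event_Name` come from the code above, and the patterns are assumptions rather than the contents of the actual event.txt file.

```r
# Hypothetical lookup table; the real event.txt patterns may differ
lookup <- data.frame(
  Expressions_Used = c("^(tstm|thunderstorm)wind", "^flashflood", "^flood"),
  Event_Name = c("Thunderstorm Wind", "Flash Flood", "Flood"),
  stringsAsFactors = FALSE
)

# Invented, already-normalized event labels
evtype <- c("tstmwind", "thunderstormwindhail", "flashflood",
            "flood", "floodflashflood", "other")

# Start with NA so unmatched labels stay unidentified
event_name <- rep(NA_character_, length(evtype))
for (i in seq_len(nrow(lookup))) {
  hits <- grep(lookup$Expressions_Used[i], evtype)
  event_name[hits] <- lookup$Event_Name[i]
}
print(data.frame(evtype, event_name))
```

Because later rows overwrite earlier ones, a label that matches several expressions ends up with the last matching name, so the order of rows in the lookup table matters. Labels that match nothing remain NA.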

Next, the lack of uniformity in the damage multiplier columns must be addressed. To simplify the calculation, I replaced each multiplier code with its power-of-ten exponent. (Thus an h becomes 2 for 10^2, a K becomes 3 for 10^3, and so on.)

##Change property damage multipliers to integer value
df$PROPDMGEXP[df$PROPDMGEXP %in% c( "", "-", "+")] <- 0
df$PROPDMGEXP[df$PROPDMGEXP %in% c("h", "H")] <- 2
df$PROPDMGEXP[df$PROPDMGEXP == "K"] <- 3
df$PROPDMGEXP[df$PROPDMGEXP %in% c("m", "M")] <- 6
df$PROPDMGEXP[df$PROPDMGEXP == "B"] <- 9

##Change crop damage multipliers to integer value
df$CROPDMGEXP[df$CROPDMGEXP %in% c( "", "?")] <- 0
df$CROPDMGEXP[df$CROPDMGEXP %in% c("k", "K")] <- 3
df$CROPDMGEXP[df$CROPDMGEXP %in% c("m", "M")] <- 6
df$CROPDMGEXP[df$CROPDMGEXP == "B"] <- 9 
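The same mapping can also be written as a named lookup vector. This is only a sketch of an alternative, not the method used in this report: `exp_lookup` and `to_exponent` are hypothetical names, and unlike the code above it treats every unrecognised code (including digit codes) as exponent 0.

```r
# Hypothetical lookup from multiplier letter to power-of-ten exponent
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Map codes case-insensitively; anything unrecognised becomes 0 (an assumption)
to_exponent <- function(code) {
  out <- unname(exp_lookup[tolower(code)])
  out[is.na(out)] <- 0
  out
}

codes <- c("K", "m", "B", "", "H", "?")
exponents <- to_exponent(codes)
print(exponents)
```

A lookup vector keeps the letter-to-exponent pairs in one place, which makes it easier to see which codes are handled.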

The final processing step for the first question was to calculate the total cost to public health by adding the fatality and injury counts for each observation, and then summing those totals for each event type.

##Select event and casualty data, then calculate the total number of casualties for each observation
##Then find sum total by event
##Arrange in descending order
results1 <-
        df %>%
        select(Event, FATALITIES, INJURIES) %>%
        mutate(TOTAL = FATALITIES + INJURIES) %>%
        group_by(Event) %>%  
        summarize(Fatalities = sum(FATALITIES), 
        Injuries = sum(INJURIES), 
        Total = sum(TOTAL)) %>%
        arrange(desc(Total))
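As a sanity check on the grouping logic, the following base R sketch reproduces the same sums on a small invented table. The numbers are made up, and `aggregate()` is used here only so the example runs without extra packages.

```r
# Invented casualty records
toy <- data.frame(
  Event = c("Tornado", "Tornado", "Flood", "Heat", "Flood"),
  FATALITIES = c(2, 0, 1, 5, 0),
  INJURIES = c(10, 3, 0, 20, 4),
  stringsAsFactors = FALSE
)

# Casualties per observation, then totals per event type
toy$TOTAL <- toy$FATALITIES + toy$INJURIES
totals <- aggregate(cbind(FATALITIES, INJURIES, TOTAL) ~ Event,
                    data = toy, FUN = sum)

# Largest total first
totals <- totals[order(-totals$TOTAL), ]
print(totals)
```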

To answer the second question, I used a similar process: I totaled the property and crop damage estimates for each observation and then summed them for each event type.

##Convert multiplier columns from character to numeric
df$PROPDMGEXP <- as.numeric(df$PROPDMGEXP)
df$CROPDMGEXP <- as.numeric(df$CROPDMGEXP)

##Select crop and property damage data, then calculate total cost for each observation
##Then find sum total by event in billion dollars
##Arrange in descending order
results2 <-
        df %>%
        select(Event, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
        filter(PROPDMG > 0 | CROPDMG > 0) %>%
        mutate(TOTALDMG = PROPDMG * 10^PROPDMGEXP + CROPDMG * 10^CROPDMGEXP) %>%
        group_by(Event) %>%
        summarize(Total_Damage = sum(TOTALDMG)) %>%
        mutate(Total_in_B = Total_Damage / 10^9) %>%
        arrange(desc(Total_in_B))
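The arithmetic for a single observation can be worked through by hand. The values below are invented: a property damage entry of 25 with exponent 3 (thousands) and a crop damage entry of 1.5 with exponent 6 (millions).

```r
# Invented damage figures for one observation
propdmg <- 25
propdmgexp <- 3
cropdmg <- 1.5
cropdmgexp <- 6

# 25 * 10^3 + 1.5 * 10^6 = 25,000 + 1,500,000 dollars
total <- propdmg * 10^propdmgexp + cropdmg * 10^cropdmgexp

# Express the total in billions of dollars
total_in_b <- total / 10^9
print(c(total = total, total_in_b = total_in_b))
```

The observation contributes 1,525,000 dollars, or 0.001525 billion, to its event type's total.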

Results

Across the United States, which types of events are most harmful with respect to population health?

##Subset Data to only graph top 20
results1 <- results1[1:20,]


library(ggplot2)

g <- ggplot(results1, aes(reorder(Event, Total), Total))

g + geom_bar(fill = "mediumorchid4", stat = "identity") +
        coord_flip() +
        ggtitle("The Effects of Severe Storm Events on Population Health") +
        xlab("Event") +
        ylab("Casualties \n Deaths and Injuries")

Figure: The Effects of Severe Storm Events on Population Health (bar chart of casualties by event)

The graph shows the twenty event types with the greatest number of casualties (deaths plus injuries). Tornadoes caused the most casualties of any event type.

Across the United States, which types of events have the greatest economic consequences?

##Subset Data to only graph top 20
results2 <- results2[1:20,]


g <- ggplot(results2, aes(reorder(Event, Total_in_B), Total_in_B))

g + geom_bar(fill = "darkolivegreen4", stat = "identity") +
        coord_flip() +
        ggtitle("The Economic Cost of Severe Storm Events") +
        xlab("Event") +
        ylab("Property and Crop Losses \n In Billions of Dollars")

Figure: The Economic Cost of Severe Storm Events (bar chart of property and crop losses by event)

This graph shows the twenty event types with the greatest combined property and crop damage. Flooding has the greatest economic impact.