Synopsis: We have analyzed data from the NOAA Storm Database used to track storm events in the US from January 1950 to April 2014. The questions we want to answer are (1) which types of events are most harmful with respect to population health and (2) which types of events have the greatest economic consequences. We used the injuries and fatalities data in the NOAA database to determine the most harmful events for human health. We used the property and crop damage estimates to determine the events with the greatest economic consequences. Our findings are that Tornados are most harmful to human health and that Hurricanes/Typhoons are the most devastating from an economic perspective. Note: Without the adjustment shown for the Flood outlier (as documented below), Floods are the most devastating economic event.

Data Processing

Read the zip from from the NOAA website and read the data using read.csv

stormdata <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)

str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the NOAA Storm Events Database, not all events were tracked until 1996. We removed observations prior to 1996 to make the analysis consistent between the various events. See http://www.ncdc.noaa.gov/stormevents/details.jsp for a summary of the data discrepancies pre and post 1996.

timeAdjusted_data <- subset(stormdata, as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y") >= "1996-01-01")  

Subset data to include columns for fatalities, injuries and property damage for each event type. The columns retained are:

EVTYPE - event type (e.g. tornado, flood, etc.)
FATALITIES - number of fatalities
INJURIES - number of injuries
PROPDMG - property damage 
PROPDMGEXP - magnitude of property damage (e.g. thousands, millions)
CROPDMG - crop damage 
CROPDMGEXP - magnitude of crop damage (e.g. thousands, millions)
subData <- timeAdjusted_data[,colnames(timeAdjusted_data) %in% c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG",  "CROPDMGEXP")] 

Note: There are outliers in the data, especially with regard to the damages caused by floods. The below subset of the data shows that in particular the Flood in row 605953 has property damage of $115 billion. However, an internet search found this article http://www.cityofnapa.org/index.php?option=com_content&task=view&id=907&Itemid=30 which shows $115 million in damage. This row of data was adjusted to millions for this analysis.

billionData <- subData[(subData$CROPDMGEXP %in% c("B", "b") | subData$PROPDMGEXP %in% c("B", "b")), ]

subData[(subData$PROPDMG == 115 & subData$PROPDMGEXP == "B"), 5] <- "M"

We performed an initial analysis by finding the total fatalities/injuries and the total property/crop damage for each event type. This initial analysis was to find the top 50 events. Our goal was to collapse the 200+ events into the 48 categories as identified by NOAA. For the top 50 events (done separately for the fatality/injury vs. property/crop damage) the following categories were not collapsed as it is unclear to which category they should be collapsed: LANDSLIDE, STORM SURGE, SNOW SQUALL and DRY MICROBURST. Hoevwer, each of these excluded events have fewer than 100 injuries/fatalities so did not impact the final result. All of the top 50 property/crop damage events fit into one of the 48 categories.

Our analysis showed that these were the event labels that needed to be cleaned up.

subData$EVTYPE <- gsub("^thunderstorm.*|TSTM.*", "Thunderstorm", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("^flash flood.*", "Flash Flood", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub(".*flood.*|.*FLD", "Flood", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("Rip Current.*", "Rip Current", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("Wild.*", "WildFire", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub(".*Fog.*", "Dense Fog", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("glaze|.*ice.*", "Ice Storm", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("extreme.*", "Extreme Cold/Wind Chill", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("winter weather.*|wintry.*", "Winter Weather", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub(".*Surf.*", "High Surf", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub("hurricane.*|typhoon.*", "Hurricane/Typhoon", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub(".*frost|.*freeze", "Frost/Freeze", subData$EVTYPE, ignore.case = TRUE)
subData$EVTYPE <- gsub(".*wind.*", "Strong Wind", subData$EVTYPE, ignore.case = TRUE)

###Ensure all event types are lower case with capital first letter
lowerCap <- function(x) {
        s <- strsplit(x, " ")[[1]]
        paste(toupper(substring(s, 1,1)), tolower(substring(s, 2)),
              sep="", collapse=" ")
}

subData$EVTYPE <- sapply(subData$EVTYPE, lowerCap)

We adjusted the crop and property damage estimates to include exponential factors found in EXP variables (e.g. thousands, millions). After making this adjustment we deleted EXP variables from the data.

levels(subData$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
levels(subData$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
subData$CROPDMGEXP <- gsub("K|k", 1000, subData$CROPDMGEXP)
subData$CROPDMGEXP <- gsub("M|m", 1000000, subData$CROPDMGEXP)
subData$CROPDMGEXP <- gsub("B|b", 1000000000, subData$CROPDMGEXP)

subData$CROPDMG <- subData$CROPDMG * as.numeric(subData$CROPDMGEXP)
subData$CROPDMGEXP <- NULL

subData$PROPDMGEXP <- gsub("H|h", 100, subData$PROPDMGEXP)
subData$PROPDMGEXP <- gsub("K|k", 1000, subData$PROPDMGEXP)
subData$PROPDMGEXP <- gsub("M|m", 1000000, subData$PROPDMGEXP)
subData$PROPDMGEXP <- gsub("B|b", 1000000000, subData$PROPDMGEXP)

subData$PROPDMG <- subData$PROPDMG * as.numeric(subData$PROPDMGEXP)
subData$PROPDMGEXP <- NULL

Results

Sum injuries and fatalities to find those events with the largest population health impact

We summed injuries and fatalities by event to find those with the largest impact to human health.

popHealth_data <- aggregate(FATALITIES + INJURIES ~ EVTYPE, subData, FUN = "sum")

colnames(popHealth_data) <- c("eventType", "incidentNumber")

We removed rows with no injury or fatality incidents. This is for the injury/fatality analysis only. Rows to retain for the damages analysis is done separately

popHealth_data <- popHealth_data[popHealth_data$incidentNumber > 0, ]

popHealth_data_ordered <- popHealth_data[order(-popHealth_data[,2]),]

maxHealthEvent <- popHealth_data_ordered[1,1]

rownames(popHealth_data_ordered) <- NULL

Plot 10 Most Harmful Events with Respect to Population Health

Plot of the first 10 rows of the data sorted (in descending order) of injuries + fatalities

library(ggplot2)
plot_health <- ggplot(popHealth_data_ordered[1:10,], aes(eventType, incidentNumber, fill=eventType))     
plot_health + geom_bar(stat="identity") +
labs(title = "Top 10 Most Harmful Events with Respect to Population Health") +
xlab("Events") +
ylab("Number of injuries and fatalities")  +
theme(axis.text.x=element_text(angle=-90)) 

plot of chunk plotHealth

The event with the largest impact to population health is Tornado.

Sum property and crop damages to find the events have the greatest economic consequences

We summed crop and property damage (after adjusting for their respective exponential factors) by event to find those with the largest economic impact.

propDmg_data <- aggregate(PROPDMG + CROPDMG ~ EVTYPE, subData, FUN = "sum")

colnames(propDmg_data) <- c("eventType", "totalDamages")

We removed rows with no property or crop damage. This is for the property and crop damage analysis only. Rows to retain for injuries and fatalities analysis is done separately.

propDmg_data <- propDmg_data[propDmg_data$totalDamages > 0, ]

propDmg_data_ordered <- propDmg_data[order(-propDmg_data[,2]),]

propDmg_data_ordered$totalDamages <- propDmg_data_ordered$totalDamages / 1000000

maxDamage <- propDmg_data_ordered[1,1]

rownames(propDmg_data_ordered) <- NULL

Plot 10 Events with the Greatest Economic Consequences

Plot of the first 10 rows of the data sorted (in descending order) of total damages

plot_damages <- ggplot(propDmg_data_ordered[1:10,] , aes(eventType, totalDamages, fill=eventType))     
plot_damages + geom_bar(stat="identity") +
        labs(title = "Top 10 Events with Greatest Economic Consequences") +
        xlab("Events") +
        ylab("Total Damage (in millions)")  +
        theme(axis.text.x=element_text(angle=-90)) 

plot of chunk plotDamage

The event causing the most costly damages is Hurricane/typhoon.