Synopsis

This report analyzes the most damaging and most injurious/fatal weather events in the U.S. from 1950 to 2011, based on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database records many characteristics of major weather events throughout the United States; the most important for this analysis are the estimates of injuries/fatalities and of property/crop damage costs.

The results show that, in general, tornadoes are the weather event that causes the greatest damage and the most injuries/fatalities across the U.S. However, the record-setting hurricane activity of 2005 greatly skews the results when those events are included in the analysis. When they are included, floods and hurricanes appear to be the most damage-producing types of events, but, interestingly, tornadoes remain the most injurious type of event. A useful follow-up study would test the hypothesis that tornadoes are the leading cause of human harm because warnings ahead of the event come too late or are not effective enough.

Data Processing

The data for this analysis was provided by the requestor. All the data needed for this analysis is drawn from the NOAA storm database.

The following libraries are required for this analysis.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr))

Reading the raw dataset

The entire database is initially read from the raw text file named repdata-data-StormData.csv.bz2. The file is a bzip2-compressed CSV file of the NOAA StormData database. The first row is read as header data. This file must be located in the same working directory as the processing script.

NOTE: it may take several minutes to read the database file. The results are cached for this report to avoid the delay for casual reading.

rdata <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=FALSE)

The following shows the characteristics of the raw data.

dim(rdata)
## [1] 902297     37
head(rdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Getting the data ready for analysis

This report uses the following variables from the dataset: BGN_DATE, EVTYPE, PROPDMG, FATALITIES, INJURIES, PROPDMGEXP, CROPDMG, and CROPDMGEXP. These are separated out.

pdata <- rdata %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG:CROPDMGEXP) %>%
         filter(PROPDMGEXP %in% c("B", "b", "M", "m", "K", "k", "") & 
                CROPDMGEXP %in% c("B", "b", "M", "m", "K", "k", ""))

Note that some processing of the raw data begins in the code above. One thing discovered about the NOAA database is that some variables, PROPDMGEXP and CROPDMGEXP in particular, adhere only loosely to the published data guidelines. The processing of these variables is described in more detail below; for now, it is enough to say that some raw records must be filtered out because their exponent values cannot be interpreted. Fortunately, less than 0.04% of the data is discarded.
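The discarded fraction mentioned above can be checked with a short tabulation. The following is a minimal sketch on a small made-up vector standing in for an exponent column; the toy values and the names `exp_codes`, `valid_codes`, and `discarded_frac` are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for an exponent column (made-up values; the real
# column has roughly 900,000 entries).
exp_codes <- c("K", "M", "", "k", "B", "?", "5", "K", "m", "")

# Codes this report treats as interpretable.
valid_codes <- c("B", "b", "M", "m", "K", "k", "")

# Tabulate every distinct code to see which values are present.
code_counts <- table(exp_codes, useNA = "ifany")
print(code_counts)

# Fraction of records that a filter on valid_codes would drop.
discarded_frac <- mean(!(exp_codes %in% valid_codes))
print(discarded_frac)
```

Applied to the real data, the same two steps would be run on `rdata$PROPDMGEXP` and `rdata$CROPDMGEXP` before filtering.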

The event date information (BGN_DATE) is originally read as date/time character vectors. These are converted to the Date class. A new column, BGN_DATE_YR, is then added, containing just the year portion of the date (as a numeric value).

pdata$BGN_DATE <- as.Date(pdata$BGN_DATE, format="%m/%d/%Y")
pdata$BGN_DATE_YR <- year(pdata$BGN_DATE)
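As a small illustration of the conversion above, the sketch below parses two sample timestamps in the same layout as BGN_DATE. It uses base R's `format()` in place of `lubridate::year()` so that it runs without extra packages; the variable names are illustrative only.

```r
# Sample timestamps in the same "m/d/Y H:M:S" layout as BGN_DATE.
raw_dates <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# as.Date() parses the leading date; the trailing time text is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Base-R stand-in for lubridate::year(); the result is numeric.
yrs <- as.numeric(format(parsed, "%Y"))

print(parsed)
print(yrs)
```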

The property damage and crop damage figures are stored in the database in an unusual form. Property damage is supposed to be stored in numeric form rounded to three significant digits in the PROPDMG variable. This value is then multiplied by an “exponent” provided by PROPDMGEXP to get the true value. The expected values for PROPDMGEXP are “B” for billions, “M” for millions, “K” for thousands, and nothing (“”) for anything less than 1,000. Lower case letters are also acceptable. For example, if the PROPDMG value is 155 and the PROPDMGEXP value is “K” then the true property damage value is $155 * 1,000 = $155,000. The crop damage variables (CROPDMG and CROPDMGEXP) work the same way.

The code below computes true damage values. To do so, it creates two additional variables, PROPX and CROPX, as numeric multipliers based on the character values in PROPDMGEXP and CROPDMGEXP, respectively. The PROPDMG and CROPDMG values are then multiplied by PROPX and CROPX, respectively, and the results placed back into PROPDMG and CROPDMG.

This code may seem awkward, but of the several alternatives tested, this approach was by far the fastest.

pdata$PROPX <- 1
pdata[pdata$PROPDMGEXP=="B","PROPX"] <- 1.0e9
pdata[pdata$PROPDMGEXP=="b","PROPX"] <- 1.0e9
pdata[pdata$PROPDMGEXP=="M","PROPX"] <- 1.0e6
pdata[pdata$PROPDMGEXP=="m","PROPX"] <- 1.0e6
pdata[pdata$PROPDMGEXP=="K","PROPX"] <- 1000
pdata[pdata$PROPDMGEXP=="k","PROPX"] <- 1000
pdata$PROPDMG <- pdata$PROPDMG * pdata$PROPX

pdata$CROPX <- 1
pdata[pdata$CROPDMGEXP=="B","CROPX"] <- 1.0e9
pdata[pdata$CROPDMGEXP=="b","CROPX"] <- 1.0e9
pdata[pdata$CROPDMGEXP=="M","CROPX"] <- 1.0e6
pdata[pdata$CROPDMGEXP=="m","CROPX"] <- 1.0e6
pdata[pdata$CROPDMGEXP=="K","CROPX"] <- 1000
pdata[pdata$CROPDMGEXP=="k","CROPX"] <- 1000
pdata$CROPDMG <- pdata$CROPDMG * pdata$CROPX
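The multiplier logic can be cross-checked on a handful of values. The sketch below expresses the same mapping as a named lookup vector and applies it to made-up figures; it is an illustrative alternative, not the method used in this report, and no claim is made about its speed on the full dataset. The names `multipliers`, `dmg`, `code`, and `true_dmg` are assumptions for the example.

```r
# Named lookup vector: exponent code -> numeric multiplier.
multipliers <- c(B = 1e9, b = 1e9, M = 1e6, m = 1e6, K = 1e3, k = 1e3)

# Toy damage figures and exponent codes (illustrative values only).
dmg  <- c(155, 2.5, 10, 1.2)
code <- c("K", "M", "", "B")

# Start at 1 so blank codes leave the value unchanged, then overwrite
# the entries whose code appears in the lookup vector.
mult  <- rep(1, length(code))
known <- code %in% names(multipliers)
mult[known] <- multipliers[code[known]]

true_dmg <- dmg * mult
print(true_dmg)
```

The first element reproduces the worked example given earlier: 155 with code "K" becomes 155,000.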

The data is now in a form ready for analysis.

Results

Total damage by year

This report first addresses the types of events that have caused the most damage to property and to crops. The reason why this is looked at first will become apparent in the results below.

To begin the analysis, the processed storm data is grouped by year of event occurrence (BGN_DATE_YR). The property damage and crop damage values are totaled for each year and the results plotted.

# Create the plot data for all years.
gdata <- pdata %>% group_by(BGN_DATE_YR) %>% 
         summarise(PROPDMGTOT=sum(PROPDMG) / 1.0e9, CROPDMGTOT=sum(CROPDMG) / 1.0e9) %>%
         gather("TYPE", "COST", PROPDMGTOT:CROPDMGTOT)

gdata$DCAT <- "All Years"  # For plot facet grouping & labeling

# Create the same plot data, but exclude years 2005 & 2006.
g2 <- gdata[!(gdata$BGN_DATE_YR %in% c(2005, 2006)),]

g2$DCAT <- "Excluding 2005-2006"  # For plot facet grouping & labeling

gdata <- bind_rows(gdata, g2)  # Append both data frames for plotting

a <- ggplot(data=gdata, aes(x=BGN_DATE_YR, y=COST, fill=TYPE, group=1)) + 
     facet_grid(DCAT ~ ., scales = "free") +
     geom_bar(stat="identity") + 
     scale_fill_manual(values=c("red", "blue"), name="Damage Type", 
                       labels = c("Crop", "Property")) +
     theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
           legend.position="bottom") +
     xlab("Year") + 
     ylab("Total Damage (x $1B)") + 
     ggtitle("Annual Property & Crop Damage\n1950-2011")

print(a)

Even a casual glance at the “All Years” chart above reveals a major spike in the results for the years 2005 and 2006. An initial reaction may be to think that this must be bad data. However, it is not. Recall that 2005 was the worst season on record for tropical storm activity.

“The 2005 season was the most active season on record, shattering records on repeated occasions. A record 28 tropical and subtropical storms formed, of which a record fifteen became hurricanes. Of these, seven strengthened into major hurricanes, a record-tying five became Category 4 hurricanes, a record four of which reached Category 5 strength, the highest categorization for North Atlantic tropical cyclones. Among these Category 5 storms was Hurricane Wilma, the most intense hurricane ever recorded in the Atlantic.”

“The most notable storms of the season were the five Category 4 and Category 5 hurricanes: Dennis, Emily, Katrina, Rita, and Wilma, along with the Category 1 Hurricane Stan. These storms made a combined twelve landfalls as major hurricanes (Category 3 strength or higher) throughout Cuba, Mexico, and the Gulf Coast of the United States, causing over $100 billion (2005 USD) in damages and at least 2,048 deaths.”

Source: Wikipedia

The bottom, “Excluding 2005-2006” chart shows what the damage totals look like without years 2005 and 2006. This difference between 2005/2006 and the other years will carry forward in the next analysis of which types of events are most damaging.

A final note regarding this plot is the apparent difference in reported damage starting in 1993. This is shown most dramatically in the lower chart. It is unknown what happened in 1993 to cause this change. Perhaps better reporting mechanisms were implemented. Or, maybe the number of reporting sites was increased. For the remainder of the analysis, the pre-1993 data is included in the results.

Top damage producing events

This report now considers the types of events that produce the most damage. The data is grouped by event type and the property damage and crop damage values combined into a single total damage value for each event type. The totals are then sorted in descending order and the top four separated out for plotting.

In consideration of the 2005/2006 data, the information is, again, presented as two charts: one covering all years and the other with 2005/2006 excluded.

# Create the plot data for all years.
gdata <- pdata %>% group_by(EVTYPE) %>% 
         summarise(DMGTOT=(sum(PROPDMG) + sum(CROPDMG)) / 1.0e9) %>%
         arrange(desc(DMGTOT)) %>%
         slice(1:4)

gdata$DCAT <- "All Years"  # For plot facet grouping & labeling

# Create the same plot data, but exclude years 2005 & 2006.
g2 <- pdata[!(pdata$BGN_DATE_YR %in% c(2005, 2006)),] %>% 
      group_by(EVTYPE) %>% 
      summarise(DMGTOT=(sum(PROPDMG) + sum(CROPDMG)) / 1.0e9) %>%
      arrange(desc(DMGTOT)) %>%
      slice(1:4)

g2$DCAT <- "Excluding 2005-2006"  # For plot facet grouping & labeling

gdata <- bind_rows(gdata, g2)  # Append both data frames for plotting

gdata$EVTYPE <- factor(gdata$EVTYPE, levels=unique(gdata$EVTYPE))  # for X axis sorting; unique() avoids duplicated levels

a <- ggplot(data=gdata, aes(x=EVTYPE, y=DMGTOT, fill=EVTYPE, group=1)) + 
     facet_grid(DCAT ~ .) +
     geom_bar(stat="identity") + 
     theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
           legend.position="bottom", legend.title = element_blank(), 
           axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
     xlab("Event Type") + 
     ylab("Total Damage (x $1B)") + 
     ggtitle("Top Damage Producing Event Types\n1950-2011")

print(a)

It is clear that floods, hurricanes/typhoons, and storm surges easily dominate when 2005 and 2006 are included. The results are much different when those years are excluded: tornadoes then become the dominant damage-causing event. As hail frequently occurs with tornadoes, some portion of the hail damage is likely related to tornado events. The total damage produced across all four top events is also significantly lower when 2005/2006 are not included.
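The ranking step used in this section (total by event type, sort in descending order, keep the top few, with and without the excluded years) can be sketched in base R on toy data. The records below are entirely made up for illustration, and the helper `top_events` is hypothetical; the report itself uses dplyr for this step.

```r
# Toy event records (made-up numbers, in $ billions).
events <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HAIL", "FLOOD", "TORNADO", "HURRICANE"),
  YEAR   = c(2006, 1999, 2001, 1998, 2005, 2005),
  DMG    = c(100, 20, 5, 10, 1, 60),
  stringsAsFactors = FALSE
)

# Sum damage per event type, sort descending, keep the top n rows.
top_events <- function(df, n = 2) {
  totals <- aggregate(DMG ~ EVTYPE, data = df, FUN = sum)
  totals <- totals[order(-totals$DMG), ]
  head(totals, n)
}

all_years <- top_events(events)
excl      <- top_events(events[!(events$YEAR %in% c(2005, 2006)), ])

print(all_years)
print(excl)
```

Because the ranking is recomputed after the rows are dropped, the set of top event types can differ between the two results, which is why the two chart panels do not necessarily show the same events.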

Most injurious/fatal events

The analysis now turns to the events that produce the highest number of injuries and fatalities. The data is grouped by event type and the number of injuries and fatalities totaled for each event type. A combined total of injuries and fatalities is also computed. This combined total is then sorted in descending order and the top four event types separated out for plotting (eight rows in the code below, since each event type has one row for injuries and one for fatalities).

In consideration of the 2005/2006 data, the information is, again, presented as two charts: one covering all years and the other with 2005/2006 excluded.

# Create the plot data for all years.
gdata <- pdata %>% group_by(EVTYPE) %>% 
    summarise(INJTOT=sum(INJURIES) / 1000, FATTOT=sum(FATALITIES) / 1000, 
              IFTOT=INJTOT + FATTOT) %>%
    gather("TYPE", "TOTAL", INJTOT:FATTOT) %>%
    arrange(desc(IFTOT)) %>%
    slice(1:8)

gdata$DCAT <- "All Years"  # For plot facet grouping & labeling

# Create the same plot data, but exclude years 2005 & 2006.
g2 <- pdata[!(pdata$BGN_DATE_YR %in% c(2005, 2006)),] %>% 
    group_by(EVTYPE) %>% 
    summarise(INJTOT=sum(INJURIES) / 1000, FATTOT=sum(FATALITIES) / 1000, 
              IFTOT=INJTOT + FATTOT) %>%
    gather("TYPE", "TOTAL", INJTOT:FATTOT) %>%
    arrange(desc(IFTOT)) %>%
    slice(1:8)

g2$DCAT <- "Excluding 2005-2006"  # For plot facet grouping & labeling

gdata <- bind_rows(gdata, g2)  # Append both data frames for plotting

gdata$EVTYPE <- factor(gdata$EVTYPE, levels=unique(gdata$EVTYPE))  # for X axis sorting; unique() avoids duplicated levels

# Create the plot.
a <- ggplot(data=gdata, aes(x=EVTYPE, y=TOTAL, fill=TYPE, group=1)) + 
    facet_grid(DCAT ~ .) +
    geom_bar(stat="identity") + 
    scale_fill_manual(values=c("red", "gold2"), name="", 
                      labels = c("Fatalities", "Injuries")) +
    theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
          legend.position="bottom") +
    #    theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
    #          legend.position="bottom", legend.title = element_blank(), 
    #          axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
    xlab("Event Type") + 
    ylab("Total Injuries/Fatalities (x 1,000)") + 
    ggtitle("Most Injurious/Fatal Event Types\n1950-2011")

print(a)

It is clear that tornadoes are the cause of the vast majority of human injuries and fatalities. It appears that, whereas floods and hurricanes produce the most economic damage, they are not the direct causes of the greatest storm-related personal harm. This is further evident in that there is very little noticeable difference between the upper and lower charts (i.e., with and without years 2005 and 2006).

It would be interesting to study why hurricanes and floods are not more closely associated with human harm. It is the author’s speculation that perhaps the reason has to do with pre-event warnings. Hurricanes form in the ocean and are typically tracked for a number of days before they make landfall. Weather science is able to predict storm strength, and where/when landfall will occur as well as the potential path of the storm over land. This gives people a long lead time to prepare for the event.

Tornadoes, on the other hand, are not as predictable. Weather forecasts provide warnings of storms with the potential to spawn tornadoes, but cannot yet predict the actual time, location, and magnitude of a real tornado. Thus, people tend to be caught by surprise and unprepared. Would better warning systems and/or better preparation result in fewer injuries/fatalities?