Synopsis

We examine the NOAA database of major storms and weather events to determine which types of events are the most harmful with respect to population health (injuries and fatalities) and which inflict the most economic damage. We select data from January 1996 to November 2011, and for each type of event in the database we calculate the total number of deaths, the total number of injuries, and the total amount of damage (to property and crops, in US dollars). For each measure, we plot the 10 event types with the highest totals. We find that excessive heat causes the most deaths and tornadoes cause the most injuries, whereas floods cause the most economic damage.

Introduction

Because of the damage and potential loss of life that they inflict, storms and other severe weather events create both public health and economic problems for communities and municipalities. A key concern for any municipal administration should therefore be to assess how often severe weather events occur and how much damage they can cause.

Our goal in this project is to answer some basic questions about severe weather events in the United States, namely a) which types of events are most harmful with respect to population health and b) which types of events have the most serious economic consequences. To do that we’ll use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

We start by downloading the NOAA storm data in CSV format, compressed via the bzip2 algorithm. There is no need to unzip the data file, as R’s read.csv() function can handle compressed files.

# Start by downloading the data and storing it in a subdirectory called data/
# which will be created if it doesn't exist
# If you need to reload the data, just delete the data/ subdirectory and run
# the script again
if (!file.exists("data")) {
    dir.create("data")
    
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileUrl, destfile="data/StormData.csv.bz2", method="curl")
    
    # Record the date on which we downloaded the file
    dateDownloaded <- date()        
}

# Load the data into R as a dataframe
df <- read.csv("data/StormData.csv.bz2", stringsAsFactors=FALSE)
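
As a quick check that read.csv() reads bzip2-compressed files directly, the following minimal sketch writes a two-row toy table to a temporary .bz2 file and reads it back. The values are made up (they are not NOAA data), and the names toy_in, tmp and toy are illustrative only.

```r
# Toy data (hypothetical values, not taken from the NOAA file)
toy_in <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                     FATALITIES = c(3, 0),
                     stringsAsFactors = FALSE)

# Write the table through a bzip2 connection to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy_in, con, row.names = FALSE)
close(con)

# read.csv() is given the compressed file and returns the decompressed table
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)
```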

To answer the questions we’re interested in, we’ll use the following columns from the dataset:

Column name | Type | Format | Description
BGN_DATE | character | MM/DD/YYYY HH:MM:SS | Date on which the event starts.
EVTYPE | character | | Type of event (tornado, blizzard, wildfire etc).
FATALITIES | numeric | | Number of fatalities caused by the event.
INJURIES | numeric | | Number of (non-fatal) injuries caused by the event.
PROPDMG | numeric | NNNN.NN | Estimated value of property damage. To obtain the value in USD, multiply by the factor indicated by PROPDMGEXP (see next row).
PROPDMGEXP | character | | Multiplier for the value listed in PROPDMG. If “K”, “M” or “B”, the value in PROPDMG is listed in thousands, millions or billions of USD, respectively; if empty, use PROPDMG unmodified.
CROPDMG | numeric | NNN.NN | Estimated value of crop damage. To obtain the value in USD, multiply by the factor indicated by CROPDMGEXP.
CROPDMGEXP | character | | Multiplier for the value listed in CROPDMG. Uses the same convention as PROPDMGEXP.

We do not select any location information because we are interested in the damages and fatalities/injuries across the entire United States.

The NOAA database we’ll use in this work contains data from January 1950 to November 2011. Because data collection and processing procedures changed over time, the period of record differs by event type. According to the documentation on the NOAA website, before January 1996 only data for tornadoes, thunderstorms and hail were recorded. Starting in January 1996, 48 different types of events are recorded. In order to examine a more complete sample of events, we’ll use the data from January 1996 to November 2011 in this analysis.

library(plyr)
library(dplyr)
library(lubridate)

df$BGN_DATE <- mdy_hms(df$BGN_DATE)

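# Keep only the columns we need and the events that begin after the cutoff.
# The cutoff is built as a POSIXct (tz="UTC") so that it has the same class
# as BGN_DATE.
dfstorm <- filter(
    select(df, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
           CROPDMG, CROPDMGEXP),
    BGN_DATE > ymd("1996-01-01", tz="UTC"))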

# Remove unneeded objects
rm("df")
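
For readers without lubridate, the same date parsing and cutoff can be sketched in base R. This is an illustrative alternative on made-up rows, not the code used to produce the results in this report; toy, cutoff and recent are hypothetical names.

```r
# Made-up rows that follow the MM/DD/YYYY HH:MM:SS layout described above
toy <- data.frame(
    BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00",
                 "1/2/1996 0:00:00", "11/30/2011 0:00:00"),
    EVTYPE = c("TORNADO", "HAIL", "FLOOD", "HIGH WIND"),
    stringsAsFactors = FALSE)

# Parse the character dates into POSIXct values in UTC
toy$BGN_DATE <- as.POSIXct(toy$BGN_DATE,
                           format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
cutoff <- as.POSIXct("1996-01-01 00:00:00", tz = "UTC")

# Strict comparison, as in the filter above: an event starting exactly at
# midnight on 1 January 1996 is excluded
recent <- toy[toy$BGN_DATE > cutoff, ]
print(recent$EVTYPE)
```

Switching the comparison to >= would keep the row that starts exactly at the cutoff.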

Data cleaning

Before starting the analysis, there are a number of issues with the source data that need fixing. In this section we detail the steps we’ll take to clean up the dataset.

Type of event

The first issue is that the dataset appears to contain many more types of events than the 48 quoted in the documentation:

length(unique(dfstorm$EVTYPE))
## [1] 515

One of the problems is that the same event type is entered using different capitalization (e.g. “Freezing rain”, “Freezing Rain” and “FREEZING RAIN”). We’ll use the capwords() function found in the documentation for chartr() to convert everything to a common standard.

capwords <- function(s, strict = FALSE) {
    cap <- function(s) paste(toupper(substring(s, 1, 1)),
                  {s <- substring(s, 2); if(strict) tolower(s) else s},
                             sep = "", collapse = " " )
    sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

dfstorm$EVTYPE <- capwords(dfstorm$EVTYPE, strict=TRUE)
length(unique(dfstorm$EVTYPE))
## [1] 438

Next we’ll remove all entries whose EVTYPE starts with “Summary” (such as the “Summary Of” entries), as it is not clear which type of event they refer to. Other entries have a leading space before the name; we’ll correct that too.

# Drop the "Summary" rows, then strip any leading spaces from the names
dfstorm <- dfstorm[!grepl("^Summary", dfstorm$EVTYPE),]
dfstorm$EVTYPE <- sub("^ +", "", dfstorm$EVTYPE)

At this point, even though we have some categories left that might be equivalent (e.g. “Winter Weather/mix” and “Winter Weather Mix”) we stop processing this column any further because of time constraints.
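
One way to merge such near-duplicates would be to normalize punctuation and whitespace before counting. The following is a minimal sketch on a hand-written vector. normalize_evtype is a hypothetical helper that is not part of this analysis, and the rules it applies are assumptions that would need checking against the full list of event types.

```r
normalize_evtype <- function(x) {
    x <- toupper(x)                 # ignore capitalization
    x <- gsub("[/_-]+", " ", x)     # treat slash, underscore and hyphen as spaces
    x <- gsub(" +", " ", x)         # collapse repeated spaces
    trimws(x)                       # drop leading and trailing whitespace
}

labels <- c("Winter Weather/mix", "Winter Weather Mix",
            " Freezing  rain", "FREEZING RAIN")
cleaned <- normalize_evtype(labels)
print(unique(cleaned))
```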

Property and crop damage amounts

Next we obtain the amounts of property and crop damages in USD. In the original dataset, not all rows in PROPDMGEXP and CROPDMGEXP have valid codes; however, after the previous data cleaning steps, we’re left with valid entries only (with one exception in PROPDMGEXP, which we remove below).

table(dfstorm$PROPDMGEXP)
## 
##             0      B      K      M 
## 276091      1     32 369934   7374
table(dfstorm$CROPDMGEXP)
## 
##             B      K      M 
## 372972      4 278685   1771
dfstorm <- filter(dfstorm, PROPDMGEXP!='0')

GetDamageMultiplier <- function(x) {
    multiplier <- 1
    if (x=="K") {
        multiplier <- 1.0e3
    } else if (x=="M") {
        multiplier <- 1.0e6
    } else if (x=="B") {
        multiplier <- 1.0e9
    }
    multiplier
}

dfstorm$propDamage <- dfstorm$PROPDMG *
    sapply(dfstorm$PROPDMGEXP, GetDamageMultiplier)
dfstorm$cropDamage <- dfstorm$CROPDMG * 
    sapply(dfstorm$CROPDMGEXP, GetDamageMultiplier)
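
The element-by-element sapply() call can be slow on several hundred thousand rows. A vectorized alternative, sketched here on toy values, looks the codes up in a named vector. multipliers and to_usd are hypothetical names, and unknown or empty codes fall back to a multiplier of 1 to mirror GetDamageMultiplier.

```r
# Lookup table from exponent code to multiplier
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

to_usd <- function(amount, code) {
    m <- unname(multipliers[code])   # NA for "" or any unlisted code
    m[is.na(m)] <- 1                 # same default as GetDamageMultiplier
    amount * m
}

# Toy values (hypothetical, not taken from the dataset)
dmg  <- c(2.5, 10, 1.2, 500)
code <- c("K", "M", "B", "")
usd  <- to_usd(dmg, code)
print(usd)
```

For example, a PROPDMG of 2.5 with PROPDMGEXP equal to “K” corresponds to USD 2,500.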

Results

Events most harmful to public health

Using the processed dataset, we’ll first determine which type of event caused the most loss of life and/or injuries in the period January 1996 to November 2011.

by_type <- group_by(dfstorm, EVTYPE)
health <- summarise(by_type,
                    totdeaths=sum(FATALITIES, na.rm=TRUE),
                    totinjuries=sum(INJURIES, na.rm=TRUE))
summary(health)
##     EVTYPE            totdeaths        totinjuries     
##  Length:364         Min.   :   0.00   Min.   :    0.0  
##  Class :character   1st Qu.:   0.00   1st Qu.:    0.0  
##  Mode  :character   Median :   0.00   Median :    0.0  
##                     Mean   :  23.99   Mean   :  159.3  
##                     3rd Qu.:   1.00   3rd Qu.:    1.0  
##                     Max.   :1797.00   Max.   :20667.0

Some types of events didn’t cause any deaths in the time period considered, and others didn’t cause any injuries. We’ll remove all event types that caused neither deaths nor injuries.

health <- filter(health, totdeaths+totinjuries>0)
summary(health)
##     EVTYPE            totdeaths        totinjuries     
##  Length:125         Min.   :   0.00   Min.   :    0.0  
##  Class :character   1st Qu.:   1.00   1st Qu.:    1.0  
##  Mode  :character   Median :   2.00   Median :    7.0  
##                     Mean   :  69.85   Mean   :  463.8  
##                     3rd Qu.:  28.00   3rd Qu.:   84.0  
##                     Max.   :1797.00   Max.   :20667.0
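
The group-and-sum step is easier to follow on a small example. The sketch below uses base R's aggregate() on invented numbers rather than dplyr, then applies the same filter and sorts by deaths. toy and toy_health are illustrative names.

```r
# Invented per-event records (not NOAA data)
toy <- data.frame(
    EVTYPE = c("Tornado", "Tornado", "Hail", "Flood", "Flood"),
    FATALITIES = c(2, 1, 0, 0, 1),
    INJURIES = c(10, 5, 0, 1, 0),
    stringsAsFactors = FALSE)

# Sum fatalities and injuries within each event type
toy_health <- aggregate(toy[, c("FATALITIES", "INJURIES")],
                        by = list(EVTYPE = toy$EVTYPE), FUN = sum)
names(toy_health) <- c("EVTYPE", "totdeaths", "totinjuries")

# Drop event types with no deaths and no injuries, then sort by deaths
toy_health <- toy_health[toy_health$totdeaths + toy_health$totinjuries > 0, ]
toy_health <- toy_health[order(-toy_health$totdeaths), ]
print(toy_health)
```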

Now we plot the top 10 most harmful events, by number of fatalities caused, across the United States between January 1996 and November 2011.

library(ggplot2)
dfplot <- arrange(health, desc(totdeaths))[1:10,]

dfplot <- transform(dfplot,
                    EVTYPE=reorder(EVTYPE, totdeaths))

fig1 <- ggplot(dfplot) +
    geom_bar(aes(x=EVTYPE, y=totdeaths), stat="identity", fill="dodger blue") +
    coord_flip() +
    xlab("Type of Weather Event") +
    ylab(paste("Number of Fatalities Caused")) +
    ggtitle("Top 10 Deadliest Weather Events in the US 1996-2011")
print(fig1)

The figure shows that excessive heat caused the largest number of weather-related fatalities in this time frame (1797 deaths), followed by tornadoes (1511 deaths).
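
The reorder() call in the plotting code matters because ggplot2 draws the bars of a discrete axis in factor-level order, which is alphabetical by default. A minimal sketch with invented totals shows how reorder() changes the levels; toy is a hypothetical data frame.

```r
# Invented totals (not results from this analysis)
toy <- data.frame(EVTYPE = c("Flood", "Tornado", "Hail"),
                  totdeaths = c(40, 90, 5),
                  stringsAsFactors = FALSE)

# Default factor levels are alphabetical
print(levels(factor(toy$EVTYPE)))

# reorder() sorts the levels by the second argument, in ascending order
toy$EVTYPE <- reorder(toy$EVTYPE, toy$totdeaths)
print(levels(toy$EVTYPE))
```

With coord_flip(), the last level is drawn at the top of the chart, so the event type with the largest total appears first.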

The next figure plots the top 10 events by number of injuries caused across the United States, in the same time period.

dfplot <- arrange(health, desc(totinjuries))[1:10,]

dfplot <- transform(dfplot,
                    EVTYPE=reorder(EVTYPE, totinjuries))

fig2 <- ggplot(dfplot) +
    geom_bar(aes(x=EVTYPE, y=totinjuries), stat="identity", fill="dodger blue") +
    coord_flip() +
    xlab("Type of Weather Event") +
    ylab(paste("Number of Injuries Caused")) +
    ggtitle("Top 10 Weather Events by Injuries Caused in the US 1996-2011")
print(fig2)

In terms of injuries, tornadoes are the most harmful event, with 20667 injuries caused across the United States between January 1996 and November 2011, followed by floods with 6758 injuries. Excessive heat, though the cause of the most deaths in the time period considered, drops to third place by number of injuries caused.

Events resulting in the most economic damage

Now we turn to the economic consequences of extreme weather events. We’ll determine which events caused the most total damage (property plus crops) between January 1996 and November 2011 across the United States.

econ <- summarise(by_type,
                  totpropdmg=sum(propDamage, na.rm=TRUE),
                  totcropdmg=sum(cropDamage, na.rm=TRUE))
summary(econ)
##     EVTYPE            totpropdmg          totcropdmg       
##  Length:364         Min.   :0.000e+00   Min.   :0.000e+00  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Mode  :character   Median :0.000e+00   Median :0.000e+00  
##                     Mean   :1.008e+09   Mean   :9.547e+07  
##                     3rd Qu.:1.530e+05   3rd Qu.:0.000e+00  
##                     Max.   :1.439e+11   Max.   :1.337e+10

We’ll remove all event types that caused USD 1,000 or less in total damage (property plus crops):

econ$totdmg <- econ$totpropdmg+econ$totcropdmg
econ <- filter(econ, totdmg>1000)
summary(econ)
##     EVTYPE            totpropdmg          totcropdmg       
##  Length:148         Min.   :0.000e+00   Min.   :0.000e+00  
##  Class :character   1st Qu.:3.650e+04   1st Qu.:0.000e+00  
##  Mode  :character   Median :6.448e+05   Median :0.000e+00  
##                     Mean   :2.478e+09   Mean   :2.348e+08  
##                     3rd Qu.:1.530e+07   3rd Qu.:5.550e+06  
##                     Max.   :1.439e+11   Max.   :1.337e+10  
##      totdmg         
##  Min.   :2.000e+03  
##  1st Qu.:6.525e+04  
##  Median :1.344e+06  
##  Mean   :2.713e+09  
##  3rd Qu.:4.059e+07  
##  Max.   :1.489e+11

Finally, we plot the top 10 most damaging events, by total amount of damage caused, across the United States between January 1996 and November 2011.

dfplot <- arrange(econ, desc(totdmg))[1:10,]

dfplot <- transform(dfplot,
                    EVTYPE=reorder(EVTYPE, totdmg))

fig3 <- ggplot(dfplot) +
    geom_bar(aes(x=EVTYPE, y=totdmg/1.0e9), stat="identity", fill="dodger blue") +
    coord_flip() +
    xlab("Type of Weather Event") +
    ylab(paste("Amount of Damage (Billions of USD)")) +
    ggtitle("Top 10 Most Damaging Weather Events in the US 1996-2011")
print(fig3)

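The figure shows that floods are the type of weather event with the most severe economic consequences, at about USD 149 billion in combined property and crop damage over the period (the maximum of totdmg in the summary above).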