The Most Harmful Weather Events in the USA

Reproducible Research Peer Assignment 2

Synopsis

The purpose of this report is to identify which severe weather events are most harmful in the USA. The data come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and cover a period of over 60 years, from 1950 to the end of November 2011.

The analysis is presented in two parts. We find out the events that have the most impact on:

Health of the population, measured in numbers of injuries and fatalities.
The economy, divided into property and crop damage and measured in dollars.

The main results are that tornadoes cause the most injuries and fatalities, while floods cause the most economic damage.

Data Processing

Getting the data

The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm. It is publicly available and was downloaded from here: Storm Data [47Mb]

You can find how some of the variables are constructed/defined here:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

There are 902297 records in the data set.

Reading in the data

# Load knitr and set default chunk options
library(knitr)
opts_chunk$set(message = FALSE, warning = TRUE, prompt = FALSE) # Hide messages, keep warnings, no prompt in knitr output

# Load the packages used in the analysis
library(R.utils) # For unarchiving .bz2-files
library(ggplot2) # For plotting
library(plyr) # For ddply function
library(gridExtra) # For multipanel plots
library(reshape2) # For melt function

options(scipen = 999) # Turn off scientific notation

# Unarchive the downloaded file
bunzip2('repdata-data-StormData (1).csv.bz2', 'repdata-data-StormData (01).csv', remove = F, overwrite = T)
# Read in the .csv data
df <- read.csv('repdata-data-StormData (01).csv', stringsAsFactors = F)
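As an aside, base R can usually read a bzip2-compressed CSV directly, which would make the separate unarchiving step optional. The sketch below only illustrates this on a tiny made-up file; the temporary file and its three rows are assumptions, not the real storm data.

```r
# Write a tiny illustrative CSV into a bzip2-compressed temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES,INJURIES",
             "TORNADO,1,10",
             "FLOOD,0,2"), con)
close(con)

# read.csv() opens the path with file(), which detects the compression itself
toy <- read.csv(tmp, stringsAsFactors = FALSE)
str(toy)
```

Reading the archive in place avoids keeping an uncompressed copy of a large file on disk, at the cost of decompressing it on every read.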

Preprocessing the data

The data is preprocessed to find the top five event types in each category.

# Convert EVTYPE column to factor
df$EVTYPE  <- as.factor(df$EVTYPE)


## HEALTH  data processing 

# Subset only needed data
df_accident <- subset(df, select = c(EVTYPE,FATALITIES, INJURIES))

# Sum number of fatalities and injuries per different event types
df_accident  <- ddply(df_accident, .(EVTYPE), 
                      summarize, 
                      FATALITIES = sum(FATALITIES), 
                      INJURIES = sum(INJURIES)
)

# Add a combined HEALTH column (fatalities + injuries) per event type
df_health  <- ddply(df_accident, .(EVTYPE),
                    transform,
                    HEALTH = sum(FATALITIES, INJURIES)
)

# Order by combined fatalities and injuries and select top five
health_top  <- df_health[with(df_health, (order(-HEALTH))), ]
health_top  <- health_top[1:5, ]

# Order for top causes of fatalities and select top five
fatalities_top  <- df_accident[with(df_accident, (order(-FATALITIES))), ]
fatalities_top  <- fatalities_top[1:5, ]

# Order for top causes of injury and select top five
injuries_top  <- df_accident[with(df_accident, (order(-INJURIES))), ]
injuries_top  <- injuries_top[1:5, ]
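The same sum-then-rank pattern can be sketched in base R on a toy data frame. The values below are made up for illustration, and aggregate() stands in for ddply() here; it is not the code used for the report.

```r
# Made-up records: several rows per event type
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
    FATALITIES = c(2, 1, 3, 4, 0),
    INJURIES   = c(10, 0, 20, 1, 2),
    stringsAsFactors = FALSE
)

# Sum both columns per event type
toy_sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined measure, then rank from most to least harmful
toy_sum$HEALTH <- toy_sum$FATALITIES + toy_sum$INJURIES
toy_top <- toy_sum[order(-toy_sum$HEALTH), ]

# Keep at most the top five rows
toy_top <- head(toy_top, 5)
print(toy_top)
```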

Removing unclear records from analysis

Dollar values of damages are coded in 2 separate columns. This is what the Storm Data Documentation (p.12) has to say about it:
“Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”

Some records use other magnitude codes. There is no way to know what they mean, so we decided to remove those records. Because the data set is large, this should have little effect on which events rank highest.
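The conversion from amount plus magnitude code to dollars can be sketched with a named lookup vector. This is a minimal illustration on made-up values (the column names DMG and EXP and the code "h" are assumptions for the example); it also counts how many records would be dropped, which is a quick way to check that the removal is small.

```r
# Hypothetical damage amounts and magnitude codes (not taken from the storm data)
toy <- data.frame(DMG = c(1.55, 25, 3, 7),
                  EXP = c("B", "K", "M", "h"),
                  stringsAsFactors = FALSE)

# Multipliers named in the Storm Data Documentation: thousands, millions, billions
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Codes without a documented meaning are not in the lookup and become NA
toy$DOLLARS <- toy$DMG * unname(multiplier[toy$EXP])

# Number of records with an unclear code
n_unclear <- sum(is.na(toy$DOLLARS))

# Keep only the records that could be converted
toy_clean <- toy[!is.na(toy$DOLLARS), ]
print(toy_clean)
```

The first row reproduces the documentation's own example: 1.55 with code "B" stands for $1,550,000,000.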

# ECONOMIC data processing

# PROPERTY

# Subset only needed data
df_prop <- subset(df, select = c(EVTYPE,PROPDMG, PROPDMGEXP))

# Keep only rows whose magnitude code is "K", "M" or "B"
df_prop <- df_prop[df_prop$PROPDMGEXP %in% c('K', 'M', 'B'), ]
# Replace the codes "K", "M" and "B" with their numeric multipliers
df_prop$PROPDMGEXP <- gsub('K', '1000', df_prop$PROPDMGEXP)
df_prop$PROPDMGEXP <- gsub('M', '1000000', df_prop$PROPDMGEXP)
df_prop$PROPDMGEXP <- gsub('B', '1000000000', df_prop$PROPDMGEXP)
# Multiply amounts by the multipliers and store the dollar values in a new numeric column
df_prop$PROPD <- df_prop$PROPDMG * as.numeric(df_prop$PROPDMGEXP) 

# Sum damages per event type
df_prop <- aggregate(df_prop$PROPD, by=list(Category=df_prop$EVTYPE), FUN=sum, na.rm = T)
# Rename columns because aggregate() changes their names
names(df_prop)[1:2] <- c('EVTYPE', 'PROPD')

# Order by property damage and select top five
prop_top <- df_prop[with(df_prop, (order(-PROPD))), ]
prop_top  <- prop_top[1:5, ]

# Divide the top values by 1 billion to make them easier on the eyes
prop_top$PROPD  <- prop_top$PROPD / 1000000000


# CROP

# Subset only needed data
df_crop <- subset(df, select = c(EVTYPE,CROPDMG, CROPDMGEXP))

# Keep only rows whose magnitude code is "K", "M" or "B"
df_crop <- df_crop[df_crop$CROPDMGEXP %in% c('K', 'M', 'B'), ]
# Replace the codes "K", "M" and "B" with their numeric multipliers
df_crop$CROPDMGEXP <- gsub('K', '1000', df_crop$CROPDMGEXP)
df_crop$CROPDMGEXP <- gsub('M', '1000000', df_crop$CROPDMGEXP)
df_crop$CROPDMGEXP <- gsub('B', '1000000000', df_crop$CROPDMGEXP)
# Multiply amounts by the multipliers and store the dollar values in a new numeric column
df_crop$CROPD <- df_crop$CROPDMG * as.numeric(df_crop$CROPDMGEXP)

# Sum damages per event type
df_crop <- aggregate(df_crop$CROPD, by=list(Category=df_crop$EVTYPE), FUN=sum, na.rm = T)
# Rename columns because aggregate() changes their names
names(df_crop)[1:2] <- c('EVTYPE', 'CROPD')

# Order by crop damage and select top five
crop_top  <- df_crop[with(df_crop, (order(-CROPD))), ]
crop_top  <- crop_top[1:5, ]

# Divide the top values by 1 billion to make them easier on the eyes
crop_top$CROPD  <- crop_top$CROPD / 1000000000


# TOTAL Economic consequences

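# Merge property and crop damage data frames, keeping event types that appear in only one of them
eco_top <- merge(df_prop, df_crop, all = TRUE)
# Sum property and crop damages, treating a missing value as zero damage
eco_top  <- ddply(eco_top, .(EVTYPE),
                  transform,
                  ECOCON = sum(PROPD, CROPD, na.rm = TRUE)
)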

# Order by total economic damage and select top five
eco_top  <- eco_top[with(eco_top, (order(-ECOCON))), ]
eco_top  <- eco_top[1:5, ]

# Divide the top values by 1 billion to make them easier on the eyes
eco_top$ECOCON  <- eco_top$ECOCON / 1000000000
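One detail of merge() is worth keeping in mind when combining the two damage tables: by default it keeps only event types present in both data frames, while all = TRUE keeps every event type and fills the gaps with NA. A small sketch with made-up values (none of these numbers come from the storm data):

```r
# Hypothetical per-event totals, in billions of dollars
prop <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "STORM SURGE"),
                   PROPD  = c(10, 5, 4))
crop <- data.frame(EVTYPE = c("FLOOD", "DROUGHT"),
                   CROPD  = c(2, 3))

# Default merge: only event types found in both tables survive
inner <- merge(prop, crop)

# all = TRUE: every event type is kept, missing damage becomes NA
outer <- merge(prop, crop, all = TRUE)

# Total damage, counting a missing value as zero
outer$ECOCON <- rowSums(outer[, c("PROPD", "CROPD")], na.rm = TRUE)

print(inner)
print(outer)
```

In this toy case the default merge keeps a single row, so an event type with large property damage but no recorded crop damage would never reach the ranking.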

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

# PLOT1 Combined Fatalities and Injuries

# Reorder EVTYPE factor to order bars in the plot by combined fatalities and injuries
health_top$EVTYPE <- reorder(health_top$EVTYPE, health_top$HEALTH, order = T)
# Create plot
plot1  <-
    ggplot(health_top, aes(EVTYPE, HEALTH)) +
    geom_bar(stat='identity', fill = 'purple1') +
    labs(title = 'Combined number of Fatalities and Injuries', x = NULL, y = NULL) +
    coord_flip() +
    geom_text(aes(label = HEALTH), size = 4) + # size is a fixed setting, so it goes outside aes()
    theme_bw()

## PLOT2 Top five causes of fatalities

# Reorder EVTYPE factor to order bars in the plot by number of fatalities
fatalities_top$EVTYPE <- reorder(fatalities_top$EVTYPE, fatalities_top$FATALITIES, order = T)
# Create plot
plot2  <-
    ggplot(fatalities_top, aes(EVTYPE, FATALITIES)) +
    geom_bar(stat='identity', fill = 'red4') +
    labs(title = 'Number of Fatalities', x = NULL, y = NULL) +
    coord_flip() +
    geom_text(aes(label = FATALITIES), size = 4) + # size is a fixed setting, so it goes outside aes()
    theme_bw()

## PLOT3 Top five causes of injuries

# Reorder EVTYPE factor to order bars in the plot by number of injuries
injuries_top$EVTYPE <- reorder(injuries_top$EVTYPE, injuries_top$INJURIES, order = T)
# Create plot
plot3  <-
    ggplot(injuries_top, aes(EVTYPE, INJURIES)) +
    geom_bar(stat='identity', fill = 'red2') +
    labs(title = 'Number of Injuries', x = NULL, y = NULL) +
    coord_flip() +
    geom_text(aes(label = INJURIES), size = 4) + # size is a fixed setting, so it goes outside aes()
    theme_bw()

# Combine plots
grid.arrange(plot2, plot3, plot1)

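[Figure: three bar charts showing the top five event types by Number of Fatalities, Number of Injuries, and Combined number of Fatalities and Injuries]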

The chart above shows the top five event types for fatalities, injuries and both combined. Tornadoes are clearly the most dangerous events for human health.
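The bar order in these plots comes from reorder(), which sorts the levels of a factor by a second variable. A minimal sketch on made-up values, assuming one value per level as in the top-five tables:

```r
# Made-up totals for three event types (alphabetical levels: FLOOD, HAIL, TORNADO)
toy <- data.frame(EVTYPE = factor(c("HAIL", "FLOOD", "TORNADO")),
                  HEALTH = c(3, 20, 35))

# Ascending by value: with coord_flip() the largest bar ends up at the top
asc <- reorder(toy$EVTYPE, toy$HEALTH)

# Descending by value: negating the variable puts the largest bar first
desc <- reorder(toy$EVTYPE, -toy$HEALTH)

print(levels(asc))
print(levels(desc))
```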

Across the United States, which types of events have the greatest economic consequences?

# PLOT4 Top five for PROPERTY damage

# Reorder EVTYPE factor to order bars in the plot by property damage, largest first
prop_top$EVTYPE <- reorder(prop_top$EVTYPE, -prop_top$PROPD)

plot4  <-
    ggplot(prop_top, aes(EVTYPE, PROPD)) +
    geom_bar(stat='identity', fill = 'slategrey') +
    labs(title = 'Property Damage', x = NULL, y = 'Billions of Dollars') +
    geom_text(aes(label = round(PROPD, 1)), size = 4) + # round labels to one decimal; size goes outside aes()
    theme_bw()

# PLOT5 Top five for CROP damage

# Reorder EVTYPE factor to order bars in the plot by crop damage, largest first
crop_top$EVTYPE <- reorder(crop_top$EVTYPE, -crop_top$CROPD)

plot5  <-
    ggplot(crop_top, aes(EVTYPE, CROPD)) +
    geom_bar(stat='identity', fill = 'green4') +
    labs(title = 'Crop Damage', x = NULL, y = 'Billions of Dollars') +
    geom_text(aes(label = round(CROPD, 1)), size = 4) + # round labels to one decimal; size goes outside aes()
    theme_bw()

# PLOT6 Top OVERALL economic damage

# Reorder EVTYPE factor to order bars in the plot by overall damage, largest first
eco_top$EVTYPE <- reorder(eco_top$EVTYPE, -eco_top$ECOCON)

plot6  <-
    ggplot(eco_top, aes(EVTYPE, ECOCON)) +
    geom_bar(stat='identity', fill = 'gold4') +
    labs(title = 'Overall Damage', x = NULL, y = 'Billions of Dollars') +
    geom_text(aes(label = round(ECOCON, 1)), size = 4) + # round labels to one decimal; size goes outside aes()
    theme_bw()

grid.arrange(plot4, plot5, plot6)

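[Figure: three bar charts showing the top five event types by Property Damage, Crop Damage, and Overall Damage, in billions of dollars]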

The chart above shows the top five event types for property damage, crop damage and both combined. We can see that floods cause the most economic damage.