Analysis of Health and Economics impacts of storms and severe weather events in the United States

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States from 1950 to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This report presents an analysis of this database to determine the events which have the more impact in respect to health and economic consequences. It will answer the following questions :

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

The following packages are required to perform the analysis:

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)
library(stringr)
library(scales)
library(R.utils)

## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings

Data loading

The dataset will be automatically downloaded from the Coursera website and unziped in the data folder.

if (!file.exists("./data/repdata-data-StormData.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
        "./data/repdata-data-StormData.csv.bz2", method='curl')
}
if (!file.exists("./data/repdata-data-StormData.csv")) {
    bunzip2("./data/repdata-data-StormData.csv.bz2", "./data/repdata-data-StormData.csv", remove = FALSE)
}
d <- read.csv('./data/repdata-data-StormData.csv')

Here is a quick look at the data:

glimpse(d)

## Observations: 902297
## Variables:
## $ STATE__    (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE   (fctr) 4/18/1950 0:00:00, 4/18/1950 0:00:00, 2/20/1951 0:...
## $ BGN_TIME   (fctr) 0130, 0145, 1600, 0900, 1500, 2000, 0100, 0900, 20...
## $ TIME_ZONE  (fctr) CST, CST, CST, CST, CST, CST, CST, CST, CST, CST, ...
## $ COUNTY     (dbl) 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME (fctr) MOBILE, BALDWIN, FAYETTE, MADISON, CULLMAN, LAUDER...
## $ STATE      (fctr) AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL...
## $ EVTYPE     (fctr) TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ BGN_RANGE  (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ BGN_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_DATE   (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_TIME   (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ COUNTY_END (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE  (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LENGTH     (dbl) 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH      (dbl) 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F          (int) 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG        (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES   (dbl) 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG    (dbl) 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP (fctr) K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG    (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ WFO        (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ STATEOFFIC (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ ZONENAMES  (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LATITUDE   (dbl) 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE  (dbl) 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E (dbl) 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ (dbl) 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ REFNUM     (dbl) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...

Preprocessing

The first step is to select the columns that relates the event types (EVTYPE) with their consequences to the population health (FATALITIES and INJURIES) and economy (PROPDMG and CROPDMG and their relative exponents). I also exclude events without consequences.

dd <- d %>%
    select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    filter(FATALITIES != 0 | INJURIES != 0 | PROPDMG != 0 | CROPDMG != 0)

For the economic damages, I cleanup the exponents and include the whole figure directly in the main damage columns.

First the property damages. The exponents look as follow:

levels(dd$PROPDMGEXP)

##  [1] ""  "+" "-" "0" "1" "2" "3" "4" "5" "6" "7" "8" "?" "B" "H" "K" "M"
## [18] "h" "m"

So we need to clean that up. k, m, h and b mean x1000, x1000000, etc.. I assume numeric values to be exponent to 10 (i.e. 2 is equivalent to multiplying by 100). Others values are ignored (PROPDMG multiplied by 1).

dd <- dd %>% mutate(PROPDMG =
                    ifelse(PROPDMGEXP == 'K' | PROPDMGEXP == 'k' | PROPDMGEXP == '3', PROPDMG * 1000,
                    ifelse(PROPDMGEXP == 'M' | PROPDMGEXP == 'm' | PROPDMGEXP == '6', PROPDMG * 1000000,
                    ifelse(PROPDMGEXP == 'h' | PROPDMGEXP == 'H' | PROPDMGEXP == '2', PROPDMG * 100,
                    ifelse(PROPDMGEXP == 'B' | PROPDMGEXP == 'b' | PROPDMGEXP == '9', PROPDMG * 1000000000,
                    ifelse(PROPDMGEXP == '4', PROPDMG * 10000,
                    ifelse(PROPDMGEXP == '5', PROPDMG * 100000,
                    ifelse(PROPDMGEXP == '7', PROPDMG * 10000000,
                    ifelse(PROPDMGEXP == '8', PROPDMG * 100000000,
                    PROPDMG)))))))))

I use the same logic for the crop damages:

dd <- dd %>% mutate(CROPDMG =
                    ifelse(CROPDMGEXP == 'K' | CROPDMGEXP == 'k' | CROPDMGEXP == '3', CROPDMG * 1000,
                    ifelse(CROPDMGEXP == 'M' | CROPDMGEXP == 'm' | CROPDMGEXP == '6', CROPDMG * 1000000,
                    ifelse(CROPDMGEXP == 'h' | CROPDMGEXP == 'H' | CROPDMGEXP == '2', CROPDMG * 100,
                    ifelse(CROPDMGEXP == 'B' | CROPDMGEXP == 'b' | CROPDMGEXP == '9', CROPDMG * 1000000000,
                    ifelse(CROPDMGEXP == '4', CROPDMG * 10000,
                    ifelse(CROPDMGEXP == '5', CROPDMG * 100000,
                    ifelse(CROPDMGEXP == '7', CROPDMG * 10000000,
                    ifelse(CROPDMGEXP == '8', CROPDMG * 100000000,
                    CROPDMG)))))))))

The exponent columns are no longer required for this analysis.

dd <- dd %>%
    select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)

Now I clean up the event types.

length(levels(dd$EVTYPE))

## [1] 985

There are 985 types of events in the original dataset. As we already reduced the dataset, let’s update the levels:

dd$EVTYPE <- factor(dd$EVTYPE)
length(levels(dd$EVTYPE))

## [1] 488

We’re still left with 488 types of events, that’s almost 10 times more than in the documentation. Let’s start by setting all the event types uppercase and removing the trailing white spaces:

dd <- dd %>%
    mutate(EVTYPE = as.factor(toupper(str_trim(EVTYPE))))
length(levels(dd$EVTYPE))

## [1] 444

We’re now at 444 types of events.

Here I make some assumption as per how the events should have been classified and make the necessary changes:

dd[grepl('THUNDERSTORM WIND', dd$EVTYPE),]$EVTYPE <- as.factor('THUNDERSTORM WIND')
dd[grepl('TSTM WIND', dd$EVTYPE),]$EVTYPE <- as.factor('THUNDERSTORM WIND')
dd[grepl('^FLOOD', dd$EVTYPE),]$EVTYPE <- as.factor('FLOOD')
dd[grepl('^FLASH FLOOD', dd$EVTYPE),]$EVTYPE <- as.factor('FLASH FLOOD')
dd[grepl('^HAIL', dd$EVTYPE),]$EVTYPE <- as.factor('HAIL')
dd[grepl('^FUNNEL', dd$EVTYPE),]$EVTYPE <- as.factor('FUNNEL CLOUD')
dd[grepl('^HIGH WIND', dd$EVTYPE),]$EVTYPE <- as.factor('HIGH WIND')
dd[grepl('^HURRICANE', dd$EVTYPE),]$EVTYPE <- as.factor('HURRICANE/TYPHOON')
dd[grepl('^LIGHTNING', dd$EVTYPE),]$EVTYPE <- as.factor('LIGHTNING')

dd$EVTYPE <- factor(dd$EVTYPE)
length(levels(dd$EVTYPE))

## [1] 323

That’s enough for the analysis.

Let’s sum all the columns to have for each event type their impact:

dd <- dd %>%
    group_by(EVTYPE) %>%
    summarise_each(funs(sum))

Let’s tidy the dataset to show the event type, the impact type and the impact value:

dd <- dd %>%
    gather(IMPACTTYPE, VALUE, FATALITIES, INJURIES, PROPDMG, CROPDMG)

Events which are most harmful with repect to population health

I will create a separate dataset with the 10 most harmful events with respect to the population health. This includes the fatalities and injuries. I use a temporary TOTAL column to properly order the events types by their total impact.

d_health <- dd %>%
    group_by(EVTYPE) %>%
    filter(IMPACTTYPE == 'FATALITIES' | IMPACTTYPE == 'INJURIES' ) %>%
    mutate(TOTAL = sum(VALUE)) %>%
    ungroup %>%
    arrange(desc(TOTAL)) %>%
    top_n(20, TOTAL) %>%
    select(-TOTAL)

d_health$EVTYPE <- factor(d_health$EVTYPE, levels = rev(unique(d_health$EVTYPE)))

ggplot(data=d_health, aes(x = EVTYPE, y = VALUE, fill=IMPACTTYPE)) +
    geom_bar(stat="identity") +
    ggtitle('Events which are most harmful with repect to population health, 1950-2011') +
    ylab('Population impacted') +
    xlab('Type of event') +
    scale_fill_discrete(name='Type of impact', labels=c('Fatalities', 'Injuries')) +
    coord_flip()

Events with the greatest economic impact

I will create a separate dataset with the 10 type of events having the greatest economic consequences. This includes property damages and crop damages. I use the same logic as for the health impacts.

d_eco <- dd %>%
    group_by(EVTYPE) %>%
    filter(IMPACTTYPE == 'PROPDMG' | IMPACTTYPE == 'CROPDMG' ) %>%
    mutate(TOTAL = sum(VALUE)) %>%
    ungroup %>%
    arrange(desc(TOTAL)) %>%
    top_n(20, TOTAL) %>%
    select(-TOTAL)

d_eco$EVTYPE <- factor(d_eco$EVTYPE, levels = rev(unique(d_eco$EVTYPE)))

ggplot(data=d_eco, aes(x = EVTYPE, y = VALUE, fill=IMPACTTYPE)) +
    geom_bar(stat="identity") +
    ggtitle('Types of events which have the greatest economic consequences, 1950-2011') +
    ylab('Costs of damages') +
    xlab('Type of event') +
    scale_fill_discrete(name='Type of damage', labels=c('Properties', 'Crops')) +
    scale_y_continuous(labels = comma) +
    coord_flip()

Results

From the plots above, we can draw the following conclusions :

Tornados have by far the higher impact on the population health, followed by thunderstorm winds and excessive heat.
In terms of fatalities, the excessive heat is more harmful than the thunderstorm winds.
Floods have the higher economic impact, followed by hurricanes (typhoons) and tornados.

Reproducible Research: Peer Assessment 2