Author: Kendall Giles
Contact: @kendallgiles

Synopsis

In this report we look at the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, analyzing the data to find weather events most harmful to people and from an economic standpoint. Since severe weather events can cause harm to human populations as well as impact local economies, understanding which weather events are more harmful than others can help governments prioritize resources.

Before analyzing the data using R, we perform simple data cleaning steps to account for missing and corrupted data values.

Our preliminary analysis indicates that, including injuries and fatalities, the most harmful weather events to humans are tornados; after combining property and crop damage, the most harmful economic weather events are floods.

Introduction

Severe weather events can cause harm to human populations as well as impact local economies–injuries, fatalities, property damage, and crop damage are of particular concern. Since severe weather events can cause harm to human populations as well as impact local economies, understanding which weather events are more harmful than others can help governments prioritize resources.

To help addresses these concenrs, we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and other weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property/crop damage.

Two main questions are the focus of this analysis:

Across the United States, which types of weather events are most harmful with respect to population health?
Across the United States, which types of weather events have the greatest economic consequences?

This analysis was conducted using R (version 3.1.2) in RStudio. This report, an example of literate (statistical) programming, was created with the package knitr and using the package dplyr, in addition to the default packages loaded by RStudio.

Note that all R processing code used in the analysis is presented in the text of this document–normally some non-essential-to-the-reader code segments are hidden, but for demonstration and transparancy purposes, all code and all outputs are presented herein.

Data Processing

The NOAA storm database covers recorded weather events in the US from 1950 to November 2011. Additional documentation about this dataset includes:

The raw NOAA Storm dataset, a csv file compressed via bzip2, was downloaded. This dataset was uncompressed and placed in the root of the R working directory. The resulting raw .csv data file is 561.6 MB.

The dataset was read into memory. Note that missing values were coded as empty strings, and the header was read to get the variable names.

raw.data <- read.csv("./noaa_storm_data.csv", header=TRUE, na.strings = "")

The raw dataset contains 902297 observations of 37 variables each, and takes up 409.4 Mb of memory:

dim(raw.data)

## [1] 902297     37

print(object.size(raw.data), units="Mb")

## 409.4 Mb

Here are the first few rows of the dataset:

head(raw.data)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 2 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 3 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 4 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 5 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 6 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0    <NA>       <NA>   14.0   100 3   0          0
## 2         NA         0    <NA>       <NA>    2.0   150 2   0          0
## 3         NA         0    <NA>       <NA>    0.1   123 2   0          0
## 4         NA         0    <NA>       <NA>    0.0   100 2   0          0
## 5         NA         0    <NA>       <NA>    0.0   150 2   0          0
## 6         NA         0    <NA>       <NA>    1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 2        0     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 3        2    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 4        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 5        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 6        6     2.5          K       0       <NA> <NA>       <NA>      <NA>
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806    <NA>      1
## 2     3042      8755          0          0    <NA>      2
## 3     3340      8742          0          0    <NA>      3
## 4     3458      8626          0          0    <NA>      4
## 5     3412      8642          0          0    <NA>      5
## 6     3450      8748          0          0    <NA>      6

With respect to weather event types, indicated by the variable EVTYPE, there are two main categories of event consequences that were tracked in the dataset:

Harm to the human population, either fatality or injury as the result of a storm, is quantified by the variables: FATALITIES and INJURIES.
Ecomonic harm, which in this dataset encompasses property damage and crop damage, is quantified by the variables PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

Some Initial Examples

Calculating some statistics about harm to human populations and property:

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

hail.data <- filter(raw.data, EVTYPE=="HAIL")
sum(hail.data$FATALITIES)

## [1] 15

water.data <- filter(raw.data, EVTYPE=="WATERSPOUT")
sum(water.data$INJURIES)

## [1] 29

wind.data <- filter(raw.data, EVTYPE=="WIND DAMAGE")
prop.damage.exponents <- wind.data %>% filter(PROPDMG>0) %>% select(PROPDMGEXP)
table(prop.damage.exponents)

## prop.damage.exponents
## - ? + 0 1 2 3 4 5 6 7 8 B h H K m M 
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0

sum(wind.data$PROPDMG)

## [1] 77

As an example, in the years encompassed by this dataset, 1950-2011, there were 15 recorded fatalities from hail, 29 recorded injuries from waterspouts, and $77K in recorded wind damage to property.

tsunami.data <- filter(raw.data, EVTYPE=="TSUNAMI")
sum(tsunami.data$FATALITIES)

## [1] 33

But, note that there were 33 recorded fatalities due to tsunamis during the same time period, indicating that some weather events do more harm than others.

Cleaning Based on EVTYPE

There are 985 weather event types recorded in the dataset, such as: TORNADO, TSTM WIND, HAIL, FREEZING RAIN, SNOW, ICE STORM/FLASH FLOOD, SNOW/ICE, WINTER STORM, HURRICANE OPAL/HIGH WINDS, THUNDERSTORM WINDS.

length(unique(raw.data$EVTYPE))

## [1] 985

as.character(unique(raw.data$EVTYPE)[1:10])

##  [1] "TORNADO"                   "TSTM WIND"                
##  [3] "HAIL"                      "FREEZING RAIN"            
##  [5] "SNOW"                      "ICE STORM/FLASH FLOOD"    
##  [7] "SNOW/ICE"                  "WINTER STORM"             
##  [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"

However, according to the National Weather Service Documentation, there are only meant to be 48 event types recorded. This means that the dataset is not clean–through poor user interface design and data collection methods the data in the dataset has incomplete, wrong, or missing values.

Also, from this document on the NOAA website describing the storm events database, it can be seen that from 1950-1955, only TORNADO events were recorded; from 1955-1996 only TORNADO, THUNDERSTORM WIND, and HAIL events were recorded. It was only starting in 1996 that all 48 event types were recorded. Thus, in the early years, the data are severely biased towards a subset of the weather events, since not all weather events were recorded.

So, for the purposes of this preliminary analysis, only weather events that started in or after 1996 will be used (using the BGN_DATE to get the date the weather event started):

library(dplyr)
relevant.data <- filter(raw.data, as.numeric(format(as.Date(as.character(raw.data$BGN_DATE), "%m/%d/%Y %H:%M:%S"), "%Y")) >= 1996)
dim(relevant.data)

## [1] 653530     37

Cleaning Economic Data

As noted previously the variables for economic harm are record in PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

According to the NWS documentation, the only valid damage estimate multipliers are “K” for thousands, “M” for millions, and “B” for billions–eg, 1.55 in the PROPDMG variable and B in the PROPDMGEXP variable means the property damage was $1,550,000,000. So, a PROPDMGEXP value of K corresponds to a PROPDMG multiplier of 1000, M a multiplier of 1000000, and B a multiplier of 1000000000.

However, in looking at the values in the PROPDMGEXP and CROPDMGEXP variables, we see that some values are missing, and the PROPDMPEXP variable has some eponents coded as 0. For this report, damage estimates with missing (NA) values will be removed, under the assumption the dollar amount of the damages was either unknown or unavailable. Exponents with 0 (i.e., indicating an exponent multiplier of 10 to the 0 = 1) will indicate that the damage estimates are in units of dollars.

unique(as.character(relevant.data$PROPDMGEXP))

## [1] "K" NA  "M" "B" "0"

unique(as.character(relevant.data$CROPDMGEXP))

## [1] "K" NA  "M" "B"

So, removing NA damage estimate records and multiplying the damage estimates by the appropriate exponent:

prop.data <- relevant.data %>% select(EVTYPE, PROPDMG, PROPDMGEXP)
crop.data <- relevant.data %>% select(EVTYPE, CROPDMG, CROPDMGEXP)
dim(prop.data)

## [1] 653530      3

dim(crop.data)

## [1] 653530      3

prop.data <- prop.data[!is.na(prop.data$PROPDMGEXP),]
dim(prop.data)

## [1] 377345      3

crop.data <- crop.data[!is.na(crop.data$CROPDMGEXP),]
dim(crop.data)

## [1] 280461      3

prop.0 <- prop.data %>% filter(as.character(PROPDMGEXP)=="0") %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.K <- prop.data %>% filter(as.character(PROPDMGEXP)=="K") %>% mutate(PROPDMG = PROPDMG*1000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.M <- prop.data %>% filter(as.character(PROPDMGEXP)=="M") %>% mutate(PROPDMG = PROPDMG*1000000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.B <- prop.data %>% filter(as.character(PROPDMGEXP)=="B") %>% mutate(PROPDMG = PROPDMG*1000000000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)

crop.K <- crop.data %>% filter(as.character(CROPDMGEXP)=="K") %>% mutate(CROPDMG = CROPDMG*1000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)
crop.M <- crop.data %>% filter(as.character(CROPDMGEXP)=="M") %>% mutate(CROPDMG = CROPDMG*1000000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)
crop.B <- crop.data %>% filter( as.character(CROPDMGEXP)=="B") %>% mutate(CROPDMG = CROPDMG*1000000000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)

economic.data <- rbind(prop.0, prop.K, prop.M, prop.B, crop.K, crop.M, crop.B )
dim(economic.data)

## [1] 657806      2

Results

Harm to Humans: Fatalities and Injuries

Here is a plot of the sum of the number of fatalities by event type over the years 1996-2011:

grouped.data <- group_by(relevant.data, EVTYPE)
sums.by.event <- summarise_each(grouped.data, funs(sum))
plot(1:nrow(sums.by.event), sums.by.event$FATALITIES, main="Sum of Fatalities by Weather Type, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Fatalities")

The x-axis of the plot is the index of the weather type event, while the y-axis is the sum of the fatalities over the years 1996-2011. While most of the fatality sums are near zero for most of the weather events, there is one weather event with a large number of fatalities (> 1500).

To find this weather event name, we find the index of the weather event with the largest sum of fatalities, and the resulting max fatal weather event is EXCESSIVE HEAT.

as.character(sums.by.event$EVTYPE[which.max(sums.by.event$FATALITIES)])

## [1] "EXCESSIVE HEAT"

If we perform a similar analysis, but including injuries as well as fatalities, then the most harmful weather event during 1996-2001 is TORNADO.

grouped.data <- relevant.data %>% group_by(EVTYPE) %>% select(EVTYPE, FATALITIES, INJURIES)
sums.by.event <- summarise_each(grouped.data, funs(sum))

plot(1:nrow(sums.by.event), sums.by.event$FATALITIES + sums.by.event$INJURIES, main="Sum of Fatalities + Injuries by Weather Type, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Fatalities + Injuries")

as.character(sums.by.event$EVTYPE[which.max(sums.by.event$FATALITIES + sums.by.event$INJURIES)])

## [1] "TORNADO"

To be sure, we arrange the data and look at the top 20 events by sum of fatalities and injuries:

ordered.sums.by.event <- sums.by.event %>% mutate(SUM = FATALITIES + INJURIES) %>% select(EVTYPE, SUM) %>% arrange(desc(SUM))
ordered.sums.by.event[1:20,]

## Source: local data frame [20 x 2]
## 
##               EVTYPE   SUM
## 1            TORNADO 22178
## 2     EXCESSIVE HEAT  8188
## 3              FLOOD  7172
## 4          LIGHTNING  4792
## 5          TSTM WIND  3870
## 6        FLASH FLOOD  2561
## 7  THUNDERSTORM WIND  1530
## 8       WINTER STORM  1483
## 9               HEAT  1459
## 10 HURRICANE/TYPHOON  1339
## 11         HIGH WIND  1318
## 12          WILDFIRE   986
## 13        HEAVY SNOW   805
## 14               FOG   772
## 15              HAIL   720
## 16  WILD/FOREST FIRE   557
## 17       RIP CURRENT   549
## 18      RIP CURRENTS   496
## 19          BLIZZARD   455
## 20         ICE STORM   400

We see for example one event recorded as TSTM WIND and one event labeled as THUNDERSTORM WIND. TSTM WIND does not appear as a valid weather event, though it could be that this event is a mis-labelling of the valid weather event THUNDERSTORM WIND, a corruption perhaps due to a poor user interface or data collection method. If they are in fact meant to be the same weather event, then the SUM values would need to be added together. However, the resulting combination of TSTM WIND and THUNDERSTORM WIND values is still less than the sum for the TORNADO weather event.

Due to the sharp dropoff in sums for all the following weather events, there appears to be no other mislabeled weather events that, when all summed together with the valid weather event label observations, would top the TORNADO sum.

Economic Harm

Along similar lines, we can plot the property and crop damage by weather event and find the weather event corresponding to the max property and crop damage sum:

grouped.data <- group_by(economic.data, EVTYPE)
sums.by.event <- summarise_each(grouped.data, funs(sum))
damage.sums <- sums.by.event$DMG


plot(1:nrow(sums.by.event), damage.sums, main="Sum of Crop and Property Damage, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Property + Crop Damage")

as.character(sums.by.event$EVTYPE[which.max(damage.sums)])

## [1] "FLOOD"

We find that the most harmful weather event from an economic standpoint is FLOOD.

And as with the analysis on weather event mislabellings with fatality/injury observations, from looking at the first 20 property and crop damage observations from the ordered set, we see no mislabelled observations that, when combined, would sum to greater than that for FLOOD events.

ordered.sums.by.event <- sums.by.event %>% arrange(desc(DMG))
ordered.sums.by.event[1:20,]

## Source: local data frame [20 x 2]
## 
##               EVTYPE          DMG
## 1              FLOOD 148919611950
## 2  HURRICANE/TYPHOON  71913712800
## 3        STORM SURGE  43193541000
## 4            TORNADO  24900370720
## 5               HAIL  17071172870
## 6        FLASH FLOOD  16557105610
## 7          HURRICANE  14554229010
## 8            DROUGHT  14413667000
## 9     TROPICAL STORM   8320186550
## 10         HIGH WIND   5881421660
## 11          WILDFIRE   5054139800
## 12         TSTM WIND   5031941790
## 13  STORM SURGE/TIDE   4642038000
## 14 THUNDERSTORM WIND   3780985440
## 15         ICE STORM   3657908810
## 16  WILD/FOREST FIRE   3108564830
## 17      WINTER STORM   1544687250
## 18        HEAVY RAIN   1313034240
## 19      EXTREME COLD   1308733400
## 20      FROST/FREEZE   1103566000

Conclusions

In this report we looked at the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, analyzing the data to find weather events most harmful to people and from an economic standpoint.

We discovered that this dataset was not clean–it contained missing and corrupted values. Some data cleaning was performed, but note that other cleaning decisions could be made with more knowledge of the dataset and data collection methods. With this particular dataset, it appeared that some weather event observations were mislabelled, but correcting for those mislabellings would not have changed our results for the specific questions we were asking.

Thus, for the purposes of this preliminary investigation, including injuries and fatalities, in this dataset the most harmful weather event to humans appears to be tornados; after combining property and crop damage, the most harmful economic event appears to be floods.

Exploring Harmful Weather Events in the NOAA Storm Database