NOAA storm data exploration of damage, fatalities and injuries

Synopsis

In this report, some interesting aspects of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database were explored.

In particular, these two questions were addressed:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

To address question 1, I aggregated all fatalities (held in column FATALITIES) and injuries (held in column INJURIES) on variable EVTYPE and processed the result to explore the most harmful event types. The main conclusion from this exploration is that Tornados are by far the most important cause of human health impact (fatalities and injuries).

To address question 2, I took a similar approach but focused on crop and property damage. Here, the main cause of economic impact is Flooding, followed by Hurricanes/Typhoons and Tornados.

Finally, the geographic distribution of fatalities and economic impact across the US was investigated.

Data Processing

Source data documentation

The Data sources for this analysis are listed below.

The raw data are available at this url https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The documentation for the data is available at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

The National Climatic Data Center Storm Events FAQ is available at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

Data loading

First read the data and explore the contents and size

## local variables
bz2_data_file <- "/Users/michiel/Desktop/repdata-data-StormData.csv.bz2"
remote_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

## fetch file
if (!file.exists(bz2_data_file)) {
    print(paste("downloading", bz2_data_file))
    download.file(url = remote_url, destfile = bz2_data_file)
} else {
    print(paste("reading cached copy of", bz2_data_file))
}

## [1] "reading cached copy of /Users/michiel/Desktop/repdata-data-StormData.csv.bz2"

## read from bz2 file
stormData <- read.table(
        bzfile(bz2_data_file),
        sep = ",",
        header = TRUE,
        na.strings = ""
    )
dim(stormData)

## [1] 902297     37

Thus, in total 902297 weather related events have been recorded into this database.

The dataset also holds date/time information, but these are not relevant for this exploration and are left as they are.

Results

Question 1

This is the question addressed here: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Variables to be considered:

EVTYPE the event type
FATALITIES the number of fatalities of the event
INJURIES the number of injuries of the event

To address this question, I chose to simply sum all fatalities and injuries per event type, and then select the types responsible for (1) the most fatalities and (2) the most injuries (in that order). Te dplyr package was used for data processing and the ggplot2 package for visualization.

First the data were explored a bit. There are 985 event types in total. The most frequent types of events are these:

## first look at the counts of predominant event types
head(sort(table(stormData$EVTYPE), decreasing = TRUE))

## 
##              HAIL         TSTM WIND THUNDERSTORM WIND           TORNADO 
##            288661            219940             82563             60652 
##       FLASH FLOOD             FLOOD 
##             54277             25326

## total number of event types
length(table(stormData$EVTYPE))

## [1] 985

There are so many event types, a selection will be made before the actual visualization. This is to prevent the picture being cluttered. This is OK, because the question asks for the most harmful event types. Therefore, only the 10 most harmful events were selected. This selection is based on two-way sorting: first on fatalities and then on injuries. A stacked bar chart is in my opinion the best way to visualize the summed effect of the combination of fatalities and injuries while still showing their individual effects.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

popHealth <- group_by(stormData, EVTYPE) %>%
    summarise("FATALITIES" = sum(FATALITIES), "INJURIES" = sum(INJURIES))

## convert into dataframe for further analysis
popHealth <- as.data.frame(popHealth)

## select 10 most lethal event types (secondary sorting on injuries)
selectedPopHealth <- head(
    popHealth[order(
        popHealth$FATALITIES, 
        popHealth$INJURIES, 
        decreasing = TRUE), ],
    n = 10)

## determine Event order; this is done to sort the bars of the 
## chart in decreasing FATALITY order
selectedPopHealth$event.type.f = factor(
            selectedPopHealth$EVTYPE, 
            levels = as.character(selectedPopHealth$EV))

## reformat to be able to create stacked bar chart
selPopHealthExpanded <- data.frame(
    "EventType" = rep(selectedPopHealth$event.type.f, 2), 
    "HarmType" = rep(
        names(selectedPopHealth)[2:3],
        each = nrow(selectedPopHealth)),
    "Frequency" = c(selectedPopHealth$FATALITIES, selectedPopHealth$INJURIES),
    "Log10Frequency" = c(log10(selectedPopHealth$FATALITIES), log10(selectedPopHealth$INJURIES)))

## the total counts of harmful events and the split over categories
ggplot(selPopHealthExpanded, aes(x = EventType, y = Frequency, fill=HarmType)) +
    geom_bar(stat = "identity") + 
    xlab("Event Type") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle("Population health impact by event type")

Figure 1: Population health impact. Visualized as a stacked bar chart of fatalities and injuries and plit over event types. The x-axis shows the top-10 most harmful event types. The y-axis (“Frequency”) is the number of fatalities/injuries.

As can be seen, Tornados are by far the most harmfull types of events, both in number of fatalities as in number of injuries. It is followed at great distance by Excessive heat. Not surprisingly, the number of injuries exceeds the number of fatalities by a factor of 9.2788379.

Question 2:

Across the United States, which types of events have the greatest economic consequences?

The documentation states “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”

Columns of interest here are, besides the EVTYPE variable, columns 25 - 28 of stormData:

PROPDMG: prperty damage amount
PROPDMGEXP: property damage magnitude - Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions
CROPDM: crop damage amount
CROPDMGEXP: crop damage magnitude

This allows for a similar strategy as in the previous question.

Explore and process the raw damage data

table(stormData$PROPDMGEXP, useNA = "ifany")

## 
##      -      ?      +      0      1      2      3      4      5      6 
##      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M   <NA> 
##      5      1     40      1      6 424665      7  11330 465934

table(stormData$CROPDMGEXP, useNA = "ifany")

## 
##      ?      0      2      B      k      K      m      M   <NA> 
##      7     19      1      9     21 281832      1   1994 618413

Apparently, this is a highly inconsistent dataset: The Magnitude factors “K”, “M” and “B” listed in the documentation are there, but also many others. I choose to convert the numeric values to the corresponding exponent; i.e. 2 equals 100 and 3 equals 1000 and so on. Also, lower case letters “k” and “m” will be treated as “K” and “M”, respectively. Question marks and + signs will simply be ignored (but these occur only a few times).

Now, the numeric value will be combined with the magnitude value to obtain a single numeric representation of the damage variables using the strategy outlined above.

## define a function that can combine numbers with the range of magnitude values 
## to get regular numeric values. All ambiguous values ("-","?", "+") will be 
## ignored - i.e. treated as magnitude 0.
combinedMagWithNum <- function(numbers, mag) {
    numericMags <- numeric(length(mag))
    numericMags[mag == "1"] <- 1
    numericMags[mag == "2" | mag == "h" | mag == "H"] <- 2
    numericMags[mag == "3" | mag == "k" | mag == "K"] <- 3
    numericMags[mag == "4"] <- 4
    numericMags[mag == "5"] <- 5
    numericMags[mag == "6" | mag == "m" | mag == "M"] <- 6
    numericMags[mag == "7"] <- 7
    numericMags[mag == "8"] <- 8
    numericMags[mag == "9" | mag == "b" | mag == "B"] <- 9
    numbers * 10 ^ numericMags
}

## create property.damage column combining both number and magnitude
stormData$property.damage <- combinedMagWithNum(stormData$PROPDMG, stormData$PROPDMGEXP)

## create crop.damage column combining both number and magnitude
stormData$crop.damage <- combinedMagWithNum(stormData$CROPDMG, stormData$CROPDMGEXP)

Looking at the distribution of property damage, this is around a mean of 4.745941410^{5}. The maximum event was 1.1510^{11} which was a FLOOD in CA, in 2006 but surprisingly without any recorded fatalities or injuries.

(Selected using stormData[stormData$property.damage == max(stormData$property.damage), ])

Next, the 10 event types that are economycally the most severe will be plotted.

damage <- group_by(stormData, EVTYPE) %>%
    summarise("property.damage" = sum(property.damage), "crop.damage" = sum(crop.damage))

## convert into dataframe for further analysis
damage <- as.data.frame(damage)

## select 10 most costly event types by summing property and crop damage
selectedDamage <- head(
    damage[order(
        damage$property.damage + 
        damage$crop.damage, 
        decreasing = TRUE), ],
    n = 10)

## determine Event order; this is done to sort the bars of the 
## chart in decreasing FATALITY order
selectedDamage$event.type.f = factor(
            selectedDamage$EVTYPE, 
            levels = as.character(selectedDamage$EV))

## reformat to be able to create stacked bar chart
damageExpanded <- data.frame(
    "EventType" = rep(selectedDamage$event.type.f, 2), 
    "DamageType" = rep(names(selectedDamage)[2:3], each = nrow(selectedDamage)),
    "Sum" = c(selectedDamage$property.damage, selectedDamage$crop.damage),
    "Log10Frequency" = c(log10(selectedDamage$property.damage), log10(selectedDamage$crop.damage)))

## stacked bar chart is in my opinion the best way to visualize both 
## the sums of damage events and the split over categories
ggplot(damageExpanded, aes(x = EventType, y = Sum, fill=DamageType)) +
    geom_bar(stat = "identity") + 
    xlab("Event Type") + 
    ylab("Sum of damage") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle("Economic consequences by event type")

Figure 2: Economic impact. Visualized as a stacked bar chart and plit over event types. The x-axis shows the top-10 economically most costly event types. The y-axis (“Frequency”) is the sum of the damage for the event type, split over crop and property damage.

Thus, with respect to economic impact, Flooding is the most severy type, followed by Hurricane/Typhoon and Tornado. Property damage outnumbers crop damage by a factor 8.7207395.

Overlap between the top 10 rankings of the two categories “Human health” and “damage” are these event types:

selectedPopHealth$event.type.f[selectedPopHealth$event.type.f %in% selectedDamage$event.type.f]

## [1] TORNADO     FLASH FLOOD FLOOD      
## 10 Levels: TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT LIGHTNING ... AVALANCHE

Distribution of fatalities and damages across the US

Since the event types are not equally spread across the United States, and since the questions also start with “Across the United states, …”, I decided to also explore the spatial distribution of Fatalities (as indicator of human health impact) and summed total of property damage and crop damage.

Here follows the creation of a panel of maps indicating the number of fatalities and total economic impact per state.

To convert the two-letter code of the US states to their full names, needed by the maps package, I borrowed a function from this blog post.

## this function can be used to convert 2-letter codes to state names
## 'x' is the column of a data.frame that holds 2 digit state codes
stateFromLower <-function(x) {
    #read 52 state codes into local variable [includes DC (Washington D.C. and PR (Puerto Rico)]
    st.codes<-data.frame(
        state=as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
            "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
            "MI", "MN", "MO", "MS",  "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
            "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
            "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
        full=as.factor(c("alaska","alabama","arkansas","arizona","california","colorado",
            "connecticut","district of columbia","delaware","florida","georgia",
            "hawaii","iowa","idaho","illinois","indiana","kansas","kentucky",
            "louisiana","massachusetts","maryland","maine","michigan","minnesota",
            "missouri","mississippi","montana","north carolina","north dakota",
            "nebraska","new hampshire","new jersey","new mexico","nevada",
            "new york","ohio","oklahoma","oregon","pennsylvania","puerto rico",
            "rhode island","south carolina","south dakota","tennessee","texas",
            "utah","virginia","vermont","washington","wisconsin",
            "west virginia","wyoming"))
        )
    ## create an nx1 data.frame of state codes from source column
    st.x<-data.frame(state=x)
    ## match source codes with codes from 'st.codes' local variable and use to return the full state name
    refac.x<-st.codes$full[match(st.x$state,st.codes$state)]
    ## return the full state names in the same order in which they appeared in the original source
    return(refac.x)
}

## mapproj package required for this functionality
library(maps)

## 
##  # ATTENTION: maps v3.0 has an updated 'world' map.        #
##  # Many country borders and names have changed since 1990. #
##  # Type '?world' or 'news(package="maps")'. See README_v3. #

## summarize all damage, fatalities and injuries per state
damageByState <- group_by(stormData, STATE) %>%
    summarise(
        "property.damage" = sum(property.damage), 
        "crop.damage" = sum(crop.damage), 
        "damage" = sum(property.damage) + sum(crop.damage),
        "fatalities" = sum(FATALITIES), 
        "injuries" = sum(INJURIES))

## convert to data frame
damageByState <- as.data.frame(damageByState)
## add correct state name
damageByState$full.state <- stateFromLower(damageByState$STATE)
## remove funny state code rows that cannot be displayed in this map
damageByState <- damageByState[!is.na(damageByState$full.state), ]
## load states map from package::maps
states_map <- map_data("state")
## create map data
damageMap <- merge(states_map, damageByState, by.x = "region", by.y = "full.state")
damageMap <- arrange(damageMap, group, order)

## create the plots
g1 <- ggplot(damageMap, aes(x=long, y=lat, group=group, fill=fatalities)) +
    geom_polygon(colour="black") +
    scale_fill_gradient2(low="#559999", mid="grey90", high="#BB650B",
                         midpoint=median(damageByState$fatalities)) +
    coord_map("polyconic")

## Warning: Non Lab interpolation is deprecated

g2 <- ggplot(damageMap, aes(x=long, y=lat, group=group, fill=damage)) +
    geom_polygon(colour="black") +
    scale_fill_gradient2(low="#559999", mid="grey90", high="#BB650B",
                         midpoint=median(damageByState$damage)) +
    coord_map("polyconic")

## Warning: Non Lab interpolation is deprecated

## now print the plots in one panel
library(gridExtra)
grid.arrange(g1, g2, nrow=2, top = "Total Fatalities (top) and Damage (in US$ - bottom) of weather events since 1950")

Figure 3. Geographical distribution across the united States of fatalities (as primary indicator of health impact) and summed damages for both property and crop damages (in US$).

Thus, as can be seen from these plots, there is quite a difference in geographical distribution between fatalities and economic damage. When planning precautionary and protective actions, this should be taken into account.

end