Severe Weather Analysis, Assignment 2

Synopsis

This analysis examines the severe weather events recorded in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from 1950 to 2011. The data contain information on fatalities and injuries as well as property and crop damage resulting from severe weather conditions. The information was grouped and aggregated to present the total health impact and total damage over the period for the most significant events. We considered clustering the weather events into the 48 standard NOAA weather event types, but a standard “pmatch” call, together with small adjustments to improve the grouping of key events, showed that our top 10 events cover roughly 88 to 89% of the total victims and damage. Hence more sophisticated clustering was deemed unnecessary. Tornadoes, thunderstorm winds and excessive heat are the key causes of weather victims, while floods are the most significant cause of damage, followed by hurricanes and tornadoes.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

Download, decompress and read data.

#    Download dataset
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("StormData.csv.bz2")) {
     download.file(fileUrl, destfile = "StormData.csv.bz2", method = "curl")
     message("Storm data downloaded on: ", date())
}
#    Unzip dataset and read into dataframe
connection <- bzfile("StormData.csv.bz2", "r")
stormData <- read.table(connection, sep = ",", header = TRUE, fill = TRUE)
close(connection)
#    Show dimensions of data
dim(stormData)

## [1] 902297     37
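
As a side note, `read.csv()` can read a bzip2-compressed file directly from its path, which avoids opening and closing the connection by hand. The sketch below shows this on a small temporary file; the file and its three columns are made up for illustration and are not the storm data itself.

```r
# Write a tiny CSV into a bzip2-compressed temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES,INJURIES",
             "TORNADO,1,15",
             "HAIL,0,0"), con)
close(con)

# read.csv() detects the compression and decompresses while reading
toy <- read.csv(tmp, stringsAsFactors = FALSE)
dim(toy)

# Remove the temporary file again
unlink(tmp)
```

Because `read.csv()` uses a comma separator and a header row by default, the same call would presumably work on StormData.csv.bz2 without further arguments.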

Load all required libraries and define a short function to format large numbers.

library(dplyr)
library(stringr)
library(scales)
library(tidyr)
library(ggplot2)
Print <- function(x) formatC(x, decimal.mark="", big.mark=",", digits = 0, format = "f")

Showing the health and economic effects with the current list of event types (EVTYPE) is not very effective as this list contains 985 different weather events.

To reduce and normalize this list the following steps are taken:

healthEffect <- filter(stormData, FATALITIES > 0 | INJURIES > 0)

For the health effects we keep only the observations where FATALITIES or INJURIES is larger than 0; the other observations are irrelevant. This reduces the original dataset from 902297 to 21929 observations, and the number of event types to 220.

damageEffect <- filter(stormData, PROPDMG > 0 | CROPDMG > 0)

For the economic effects we keep only the observations where property damage (PROPDMG) or crop damage (CROPDMG) is larger than 0; the other observations are irrelevant. This reduces the original dataset from 902297 to 245031 observations, and the number of event types to 431.
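
The effect of such a filter on the number of records and on the number of distinct event types can be checked with a few lines of base R. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration), not a rerun of the analysis.

```r
# Toy stand-in for stormData (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "HAIL", "FLOOD", "DENSE FOG"),
  FATALITIES = c(2, 0, 0, 1, 0),
  INJURIES   = c(10, 0, 3, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only records with at least one fatality or injury
toyHealth <- toy[toy$FATALITIES > 0 | toy$INJURIES > 0, ]

# Compare record counts and distinct event types before and after filtering
c(before = nrow(toy), after = nrow(toyHealth))
c(before = length(unique(toy$EVTYPE)), after = length(unique(toyHealth$EVTYPE)))
```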

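Calculate the damage in dollars by multiplying each value by the magnitude encoded in its exponent column (K for thousands, M for millions, B for billions; an empty exponent leaves the value unchanged). Records with other exponent codes are a small minority and are ignored to avoid making invalid damage assumptions.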

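#    Initialise the result columns so records with an unrecognised exponent count as 0
damageEffect$PROPDAMAGE <- 0
damageEffect$CROPDAMAGE <- 0
for (i in seq_along(damageEffect$EVTYPE)) {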
#    Property damage calculation
     if (damageEffect$PROPDMGEXP[i] == "K" | damageEffect$PROPDMGEXP[i] == "k") {
          damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000
     } else if (damageEffect$PROPDMGEXP[i] == "M" | damageEffect$PROPDMGEXP[i] == "m") {
          damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000000
     } else if (damageEffect$PROPDMGEXP[i] == "B" | damageEffect$PROPDMGEXP[i] == "b") {
          damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000000000
     } else if (damageEffect$PROPDMGEXP[i] == "") {
          damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i]
     }
     #    Crop damage calculation
     if (damageEffect$CROPDMGEXP[i] == "K" | damageEffect$CROPDMGEXP[i] == "k") {
          damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000
     } else if (damageEffect$CROPDMGEXP[i] == "M" | damageEffect$CROPDMGEXP[i] == "m") {
          damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000000
     } else if (damageEffect$CROPDMGEXP[i] == "B" | damageEffect$CROPDMGEXP[i] == "b") {
          damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000000000
     } else if (damageEffect$CROPDMGEXP[i] == "") {
          damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i]
     }
}
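
The same exponent logic can also be written without a row-by-row loop. The following is a minimal vectorised sketch on made-up records; the lookup vectors `codes` and `multipliers` are names introduced here for illustration, and treating unrecognised codes as 0 is an assumption that mirrors the "ignored" rule described above.

```r
# Toy damage records (made-up values); "+" stands in for an unrecognised exponent
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10, 7),
  PROPDMGEXP = c("K", "m", "B", "", "+"),
  stringsAsFactors = FALSE
)

# Lookup table: exponent code -> multiplier
codes       <- c("", "K", "M", "B")
multipliers <- c(1, 1e3, 1e6, 1e9)

# match() returns NA for codes that are not in the lookup table
idx <- match(toupper(toy$PROPDMGEXP), codes)

# Unrecognised codes contribute 0; all others are scaled by their multiplier
toy$PROPDAMAGE <- ifelse(is.na(idx), 0, toy$PROPDMG * multipliers[idx])
toy$PROPDAMAGE
```

On a data frame with a few hundred thousand rows, a vectorised lookup like this is typically much faster than assigning one element at a time.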

The total property damage over the measured period is 427,324,917,627 dollars. The total crop damage over the measured period is 49,104,191,921 dollars.

The number of severe weather event types is still unmanageably large. For comprehensive reporting we therefore need to normalize the events to the 48 standard Storm Data events as defined in paragraph 2.1.1 of the Storm Data Documentation referenced above. These standard types are provided in the dataset stormEventTable.csv. We match the weather event types against these standard weather event types to cluster them.

eventTypes <- read.table("stormEventTable.csv", sep = ",", header = TRUE)
eventTypes

##                     weather
## 1     Astronomical Low Tide
## 2                 Avalanche
## 3                  Blizzard
## 4             Coastal Flood
## 5           Cold/Wind Chill
## 6               Debris Flow
## 7                 Dense Fog
## 8               Dense Smoke
## 9                   Drought
## 10               Dust Devil
## 11               Dust Storm
## 12           Excessive Heat
## 13  Extreme Cold/Wind Chill
## 14              Flash Flood
## 15                    Flood
## 16             Frost/Freeze
## 17             Funnel Cloud
## 18             Freezing Fog
## 19                     Hail
## 20                     Heat
## 21               Heavy Rain
## 22               Heavy Snow
## 23                High Surf
## 24                High Wind
## 25      Hurricane (Typhoon)
## 26                Ice Storm
## 27         Lake-Effect Snow
## 28          Lakeshore Flood
## 29                Lightning
## 30              Marine Hail
## 31         Marine High Wind
## 32       Marine Strong Wind
## 33 Marine Thunderstorm Wind
## 34              Rip Current
## 35                   Seiche
## 36                    Sleet
## 37         Storm Surge/Tide
## 38              Strong Wind
## 39        Thunderstorm Wind
## 40                  Tornado
## 41      Tropical Depression
## 42           Tropical Storm
## 43                  Tsunami
## 44             Volcanic Ash
## 45               Waterspout
## 46                 Wildfire
## 47             Winter Storm
## 48           Winter Weather

Match the weather events in the records against the standard weather types as defined by NOAA. To improve matching, the following changes are made to the event types:

TSTM is expanded to THUNDERSTORM
HURRICANE/TYPHOON and HURRICANE OPAL are renamed to the standard event name Hurricane (Typhoon)
RIVER FLOOD is normalized to FLOOD
TORNADOES is normalized to TORNADO

#    First match health data
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "TSTM", "THUNDERSTORM")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "HURRICANE/TYPHOON", "Hurricane (Typhoon)")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "HURRICANE OPAL", "Hurricane (Typhoon)")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "RIVER FLOOD", "FLOOD")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "TORNADOES", "TORNADO")
healthEffect$STDTYP <- pmatch(toupper(healthEffect$EVTYPE), toupper(eventTypes$weather), duplicates.ok = TRUE)
#    Next match damage data
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "TSTM", "THUNDERSTORM")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "HURRICANE/TYPHOON", "Hurricane (Typhoon)")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "HURRICANE OPAL", "Hurricane (Typhoon)")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "RIVER FLOOD", "FLOOD")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "TORNADOES", "TORNADO")
damageEffect$STDTYP <- pmatch(toupper(damageEffect$EVTYPE), toupper(eventTypes$weather), duplicates.ok = TRUE)

After matching, 7.03% remained unmatched to a standard weather type.
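
To make the behaviour of pmatch more concrete: it first looks for an exact match and otherwise accepts the input only if it is the beginning of exactly one table entry. The sketch below runs it on a handful of names; the `standard` and `raw` vectors are a small subset chosen for illustration, not the full tables used in the analysis.

```r
# A handful of the standard event names (subset chosen for illustration)
standard <- c("Heat", "High Surf", "High Wind", "Hurricane (Typhoon)",
              "Storm Surge/Tide", "Tornado")

# Raw event names as they might appear in EVTYPE
raw <- c("HEAT", "HURRICANE", "STORM SURGE", "HIGH", "TORNADOES")

# Exact matches win; otherwise the raw name must be a unique prefix of a standard name
idx <- pmatch(raw, toupper(standard), duplicates.ok = TRUE)
data.frame(raw, matched = standard[idx])

# Share of raw names left without a standard type
mean(is.na(idx))
```

Here HIGH stays unmatched because it is a prefix of two standard names, and TORNADOES stays unmatched because it is longer than TORNADO, which is why plural forms need to be rewritten before matching.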

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Calculate the total number of victims and present them by (standard) weather event type in descending order. The variable Index gives the position of the event in the standard weather event table. If no standard event type was found, Index is NA; in that case additional matching rules are required.

mostVCT <- group_by(healthEffect, EVTYPE)
mostVCT <- summarise(mostVCT, Index = min(STDTYP), Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), Victims = sum(FATALITIES + INJURIES))
mostVCT <- arrange(mostVCT, desc(Victims))
head(mostVCT, n=10)

## Source: local data frame [10 x 5]
## 
##               EVTYPE Index Fatalities Injuries Victims
##                (chr) (int)      (dbl)    (dbl)   (dbl)
## 1            TORNADO    40       5633    91346   96979
## 2  THUNDERSTORM WIND    39        637     8445    9082
## 3     EXCESSIVE HEAT    12       1903     6525    8428
## 4              FLOOD    15        472     6791    7263
## 5          LIGHTNING    29        816     5230    6046
## 6               HEAT    20        937     2100    3037
## 7        FLASH FLOOD    14        978     1777    2755
## 8          ICE STORM    26         89     1975    2064
## 9       WINTER STORM    47        206     1321    1527
## 10         HIGH WIND    24        248     1137    1385

The total number of victims (fatalities plus injuries) is 155,673 people. The top 10 list above covers about 89% of the total victims (138,566 of 155,673), hence further modelling to cluster event types to standard types was not really necessary.
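
The share of a total that is covered by a top list is simply the sum of the largest values divided by the grand total. A minimal sketch, using a hypothetical helper named `coverage` and made-up counts rather than the figures from this report:

```r
# Share of the grand total captured by the n largest values
# (the helper name `coverage` is hypothetical, not part of the original analysis)
coverage <- function(x, n = 10) {
  sum(head(sort(x, decreasing = TRUE), n)) / sum(x)
}

# Made-up victim counts per event type
victims <- c(A = 500, B = 300, C = 150, D = 40, E = 10)

# Fraction of all victims covered by the three largest event types
coverage(victims, n = 3)
```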

top10vct <- gather(mostVCT[1:10, c(1,3,4)], "Impact", "Victims", convert = TRUE, Fatalities, Injuries)
ggplot(top10vct, aes(x = reorder(EVTYPE, -Victims), Victims, fill = factor(Impact))) + 
       geom_bar(stat = "identity", position = "dodge") + 
       theme(text = element_text(size = 15, face = "bold"), 
             legend.text = element_text(size = 12, face = "plain"),
             axis.text.x = element_text(size = 12, face = "plain", angle = 60, vjust = 0.5)) +
       guides(fill = guide_legend(title = "Type of victims")) + 
       ggtitle("Top 10 Severe weather impact on health") +
       scale_x_discrete("Severe weather event") +
       scale_y_log10("Total victims")

Across the United States, which types of events have the greatest economic consequences?

mostDMG <- group_by(damageEffect, EVTYPE)
mostDMG <- summarise(mostDMG, Index = min(STDTYP), Properties = sum(PROPDAMAGE), Crop = sum(CROPDAMAGE), Total.damage = sum(PROPDAMAGE + CROPDAMAGE))
mostDMG <- arrange(mostDMG, desc(Total.damage))
head(mostDMG, n=10)

## Source: local data frame [10 x 5]
## 
##                 EVTYPE Index   Properties        Crop Total.damage
##                  (chr) (int)        (dbl)       (dbl)        (dbl)
## 1                FLOOD    15 149776655307 10691427450 160468082757
## 2  Hurricane (Typhoon)    25  72478686000  2626872800  75105558800
## 3              TORNADO    40  56937435483   414953110  57352388593
## 4          STORM SURGE    37  43323536000        5000  43323541000
## 5                 HAIL    19  15732591777  3025954453  18758546230
## 6          FLASH FLOOD    14  16141136717  1421317100  17562453817
## 7              DROUGHT     9   1046106000 13972566000  15018672000
## 8            HURRICANE    25  11868319010  2741910000  14610229010
## 9            ICE STORM    26   3944952810  5022113500   8967066310
## 10   THUNDERSTORM WIND    39   7968149582   968850400   8936999982

The total damage (property plus crop) is 476,429,109,548 US$. The top 10 list above covers about 88.2% of the total damage (420,103,538,499 of 476,429,109,548 US$), hence further modelling to cluster event types to standard types was not really necessary.

top10dmg <- gather(mostDMG[1:10, c(1,3,4)], "Impact", "Damage", convert = TRUE, Properties, Crop)
ggplot(top10dmg, aes(x = reorder(EVTYPE, -Damage), Damage, fill = factor(Impact))) + 
       geom_bar(stat = "identity", position = "dodge") + 
       theme(text = element_text(size = 15, face = "bold"), 
             legend.text = element_text(size = 12, face = "plain"),
             axis.text.x = element_text(size = 12, face = "plain", angle = 60, vjust = 0.5)) +
       guides(fill = guide_legend(title = "Type of damage")) +
       ggtitle("Top 10 Severe weather impact on damage [in US$]") +
       scale_x_discrete("Severe weather event") +
       scale_y_log10("Total damage")

Note: With a few minor steps to improve clustering towards the standard weather types, the share of victims or damage covered by the top list can be increased further, but this won’t change which events lead the list. E.g. group all Flood or Hail events together.
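
A grouping step of the kind mentioned in the note could look roughly like this. It is a sketch on made-up event names and damage values, and the rule of collapsing every name that contains FLOOD or HAIL is an assumption for illustration.

```r
# Made-up event names and damage values, for illustration only
toy <- data.frame(
  EVTYPE = c("FLOOD", "FLASH FLOOD", "COASTAL FLOODING",
             "HAIL", "SMALL HAIL", "TORNADO"),
  Damage = c(100, 40, 5, 30, 2, 60),
  stringsAsFactors = FALSE
)

# Collapse every name containing FLOOD or HAIL into one group each
toy$GROUP <- toy$EVTYPE
toy$GROUP[grepl("FLOOD", toy$EVTYPE)] <- "FLOOD"
toy$GROUP[grepl("HAIL",  toy$EVTYPE)] <- "HAIL"

# Total damage per collapsed group, largest first
grouped <- aggregate(Damage ~ GROUP, data = toy, FUN = sum)
grouped[order(-grouped$Damage), ]
```

Note that this rule also merges Flash Flood and Coastal Flood into Flood, although they are separate entries in the standard event table, so it is coarser than matching against the 48 standard types.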

Appendix: Session and equipment information

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.1.0 tidyr_0.4.1   scales_0.4.0  stringr_1.0.0 dplyr_0.4.3  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      digest_0.6.8     assertthat_0.1   grid_3.2.2      
##  [5] plyr_1.8.3       R6_2.1.2         gtable_0.2.0     DBI_0.3.1       
##  [9] formatR_1.2.1    magrittr_1.5     evaluate_0.8     stringi_1.0-1   
## [13] lazyeval_0.1.10  rmarkdown_0.9.5  tools_3.2.2      munsell_0.4.3   
## [17] yaml_2.1.13      parallel_3.2.2   colorspace_1.2-6 htmltools_0.2.6 
## [21] knitr_1.11