Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

After analyzing the data with reproducible research techniques, we found extensive overlap between the main weather sources of casualties (injuries and fatalities) and of monetary damage (to both crops and property). In both cases, the main phenomena responsible are flash floods, tornadoes, excessive heat, thunderstorm winds, and hail.

Data Processing Section

Acquisition and Preparation

The basis of the analysis was the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database comes as a comma-separated value file compressed with the bzip2 algorithm to reduce its size, and can be downloaded from the following source:

National Weather Service Storm Data Documentation

The brief for data analysis included the following steps.

  1. Downloading the data from its source location
  2. Unzipping the file
  3. Loading the data into an R data frame
  4. Cleaning data fields such as date and selecting only data from the year 2000 onward

Initialization of Libraries

Several R libraries were used in preparation for the analysis.

library(R.utils)  
## Warning: package 'R.utils' was built under R version 3.2.5
library(dplyr)
library(xtable)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.4
library(reshape)

The libraries include utilities for file management, HTML output, and plotting.

Downloading the Data

To download the data from its source, the following procedure was used. Note that all data manipulation happens in code, without any manual transformation.

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bz2File <- "repdata_data_StormData.csv.bz2"
studyFile <- "StormData.csv"

download.file(URL, bz2File, mode = "wb")
bunzip2(bz2File, studyFile, remove = FALSE)

stormData <- read.csv("StormData.csv", as.is = TRUE, comment.char = "")
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
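As a side note on the unzipping step, base R can usually read a bzip2-compressed CSV directly, because read.csv() opens the path through file(), which detects the compression when reading. The sketch below illustrates this on a tiny temporary file; the file contents and the name toy are hypothetical and are not part of the analysis.

```r
# Write a tiny, hypothetical CSV into a bzip2-compressed temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,INJURIES", "TORNADO,15", "HAIL,0"), con)
close(con)

# Read the compressed file directly, with the same options used above.
toy <- read.csv(tmp, as.is = TRUE, comment.char = "")
str(toy)

# Clean up the temporary file.
unlink(tmp)
```

Under that assumption the explicit bunzip2() call becomes optional, although keeping it leaves an uncompressed copy on disk for inspection.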

Two major changes were made to the data. The first was to convert the dates in the BGN_DATE column from character strings to R date-time objects, storing the result in a new DATE variable so as not to disturb the original data structure. The second was to use DATE to keep only events on or after January 1, 2000. This restricts the analysis to the last eleven years of weather phenomena and gives a much clearer picture of recent weather patterns and their disruption of economic activity and human life.

stormData$DATE <- as.POSIXct(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
stormData <- filter(stormData, DATE >= "2000/01/01")
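The conversion and filter can be illustrated on a few made-up values in the same "month/day/year time" layout as BGN_DATE. This is only a sketch: the vector bgn and the data frame toy are hypothetical, and it uses as.Date() rather than as.POSIXct(), an alternative that yields calendar dates and so does not depend on the session's time zone.

```r
# Hypothetical date strings in the same layout as BGN_DATE.
bgn <- c("4/18/1950 0:00:00", "12/31/1999 0:00:00",
         "1/1/2000 0:00:00", "6/8/2011 0:00:00")
toy <- data.frame(BGN_DATE = bgn, stringsAsFactors = FALSE)

# Parse only the date part; trailing characters after the format are ignored.
toy$DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events on or after January 1, 2000, comparing Date with Date.
recent <- toy[toy$DATE >= as.Date("2000-01-01"), , drop = FALSE]
print(recent)
```

Comparing against an explicit Date object, rather than a character string, avoids relying on implicit coercion in the comparison.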

For the purpose of the analysis, several constants were calculated. All of them are totals later used as denominators when computing percentages.

## Counter Variables Used in Later Calculations
totalFatalities <- sum(stormData$FATALITIES)
totalInjuries <- sum(stormData$INJURIES)
totalPropDmg <- sum(stormData$PROPDMG)
totalCropDmg <- sum(stormData$CROPDMG)
totalDmg <- (sum(stormData$PROPDMG) + sum(stormData$CROPDMG))

This is all the data wrangling the analysis requires; the questions are rather straightforward to answer with the statistics obtained on injuries, fatalities, and types of damage from storms and other weather phenomena.
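To show how such totals are used, the sketch below computes each event type's share of a grand total on a small made-up data set. The data frame toy and the name totalInjuriesToy are hypothetical; only the arithmetic (group sum divided by overall sum) mirrors the analysis.

```r
# Hypothetical toy data with the same column names as the storm data.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
  INJURIES = c(6, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Grand total, playing the role of totalInjuries.
totalInjuriesToy <- sum(toy$INJURIES)

# Per-event sums and their share of the grand total.
sums   <- tapply(toy$INJURIES, toy$EVTYPE, sum)
shares <- sums / totalInjuriesToy

# Shares sorted from largest to smallest.
print(sort(shares, decreasing = TRUE))
```

Because every row belongs to exactly one event type, the shares add up to one, which makes a convenient sanity check.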

Results Section

Sources of Injuries and Fatalities from Weather Phenomena

To determine which weather phenomena have the most negative impact on human welfare, the data set was grouped by weather phenomenon and then ranked according to the tally of injuries and fatalities. Injuries were our starting point. The variable INJURIES records the number of injuries caused by a particular weather event (identified by the variable EVTYPE) at a given point in time (identified by the transformed variable DATE). By grouping and summing the records, we can analyze the top ten sources of injuries as described by the following code:

## Extract most injuries by event type since year 2000
injuries <- group_by(stormData, EVTYPE) %>%
    summarize(sumInjuries = sum(INJURIES),
              percInjuries = sum(INJURIES) / totalInjuries) %>%
    arrange(desc(sumInjuries))
head(injuries, 10)
## Source: local data frame [10 x 3]
## 
##               EVTYPE sumInjuries percInjuries
##                (chr)       (dbl)        (dbl)
## 1            TORNADO       15213   0.43301170
## 2     EXCESSIVE HEAT        3708   0.10554180
## 3          LIGHTNING        2993   0.08519056
## 4          TSTM WIND        1753   0.04989611
## 5  THUNDERSTORM WIND        1400   0.03984858
## 6  HURRICANE/TYPHOON        1275   0.03629067
## 7               HEAT        1222   0.03478211
## 8           WILDFIRE         911   0.02593004
## 9        FLASH FLOOD         812   0.02311217
## 10         HIGH WIND         677   0.01926963

Some preliminary observations from the previous table.

  • The table above shows the dangerous effects of tornadoes, with over fifteen thousand injuries accumulated through the years and 43% of all injuries incurred in weather phenomena.
  • The second phenomenon is excessive heat, with over three thousand seven hundred injuries and 10% of all injuries incurred.
  • The third weather event by injuries is lightning, with close to three thousand injuries and 8% of all injuries incurred.
  • The fourth and fifth most dangerous phenomena carry different labels but appear to be the same one, thunderstorm wind. Together they account for some three thousand one hundred fifty injuries. It was a conscious decision not to change the labels in the data set, since the change in labels is evidence of either database integrity issues or a change in the categorization of the phenomenon.
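For readers who would rather see the two thunderstorm wind labels combined, the sketch below shows one possible way to normalize them before aggregating. It is not part of the analysis, which deliberately keeps the original labels; the helper normalize_evtype and the toy data are hypothetical, and the single mapping rule is an assumption about which labels are equivalent.

```r
# Hypothetical toy data; the counts are made up.
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "TORNADO", " tstm wind"),
  INJURIES = c(3, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim, upper-case, and map one known alias.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  sub("^TSTM WIND$", "THUNDERSTORM WIND", x)
}

toy$EVTYPE_NORM <- normalize_evtype(toy$EVTYPE)

# Aggregate on the normalized label and rank by injuries.
merged <- aggregate(INJURIES ~ EVTYPE_NORM, data = toy, FUN = sum)
merged <- merged[order(-merged$INJURIES), ]
print(merged)
```

Writing the result to a new column keeps the original EVTYPE intact, in line with the decision to leave the source labels unchanged.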

Injuries are a key classifier for weather phenomena. Their impact on people's lives (in this case not a few or hundreds of people but thousands) is pivotal for assessing contingency plans and proper budgets for future weather events. However, even more serious than injuries are the fatalities resulting from acute weather patterns.

The tally of fatalities from extreme weather follows almost exactly the same procedure used for injuries. The variable FATALITIES records the number of fatalities caused by a particular weather event (identified by the variable EVTYPE) at a given point in time (identified by the transformed variable DATE). By grouping and summing the records, we can analyze the top ten sources of fatalities as described by the following code:

## Extract most fatalities in event type since year 2000
fatalities <- group_by(stormData, EVTYPE) %>%
    summarize(sumFatalities = sum(FATALITIES), 
              percFatalities = sum(FATALITIES)/totalFatalities) %>%
    arrange(desc(sumFatalities))
head(fatalities, 10)
## Source: local data frame [10 x 3]
## 
##               EVTYPE sumFatalities percFatalities
##                (chr)         (dbl)          (dbl)
## 1            TORNADO          1193     0.19903237
## 2     EXCESSIVE HEAT          1013     0.16900234
## 3        FLASH FLOOD           600     0.10010010
## 4          LIGHTNING           466     0.07774441
## 5        RIP CURRENT           340     0.05672339
## 6              FLOOD           266     0.04437771
## 7               HEAT           231     0.03853854
## 8          AVALANCHE           179     0.02986320
## 9          HIGH WIND           131     0.02185519
## 10 THUNDERSTORM WIND           130     0.02168836

Not surprisingly, many of the weather phenomena repeat themselves in the table above. We can observe that:

  • Tornadoes, with over one thousand one hundred deaths, are the number one phenomenon for fatalities, much as they were for injuries. Given the strong trend in both variables, it is safe to assume that tornadoes should be the number one priority when preparing and forecasting for weather effect mitigation.
  • Excessive heat is the number two source of fatalities, with over one thousand deaths and not far behind tornadoes, just as it is the second source of injuries. As with tornadoes, ranking second on both lists makes excessive heat an easy second priority when planning.
  • Flash floods are the third source of fatalities with six hundred cases; this event type was not discussed above but is listed as a source of injuries in ninth position.
  • Lightning, the fourth source of fatalities in our table with over four hundred sixty cases, was the third source of injuries, making it another strong contender when it comes to prioritization and planning.
  • Finally, rip currents rank as the fifth most common source of fatalities, with three hundred and forty deaths, but they are not on the top ten list for injuries.

There is a lot of overlap among sources of fatalities and injuries. Tables make for clear definitions, but a plot comparing both classifications is probably a more visual way to understand the importance and weight of each phenomenon. Bar plots were created for this purpose.

## Plot to compare effects of events on fatalities and injuries
plot1 <- ggplot(data = fatalities[1:10,], aes(x = factor(EVTYPE), y = sumFatalities)) + 
    geom_bar(stat = "identity") + coord_flip() + 
    scale_x_discrete(limits = fatalities$EVTYPE[10:1]) +
    xlab("Type of Event") + ylab("Sum of Fatalities") + 
    ggtitle("Ranking of Fatalities by Event Type (2000-Present)") +
    scale_fill_brewer(palette = "Greys") +
    theme(axis.text=element_text(size=8), axis.title=element_text(size=11,face="bold"))


plot2 <- ggplot(data = injuries[1:10,], aes(x = factor(EVTYPE), y = sumInjuries)) + 
    geom_bar(stat = "identity") + coord_flip() +
    scale_x_discrete(limits = injuries$EVTYPE[10:1]) +
    xlab("Type of Event") + ylab("Sum of Injuries") + 
    ggtitle("Ranking of Injuries by Event Type (2000-Present)") +
    scale_fill_brewer(palette = "Greys") +   
    theme(axis.text=element_text(size=8), axis.title=element_text(size=11,face="bold"))

grid.arrange(plot1, plot2, nrow = 2)

The scope of this work is to analyze, in a reproducible manner, the major sources of fatalities and injuries from weather events, without necessarily making recommendations on the matter. However, we thought it interesting to offer a way to rank weather events using both injuries and fatalities as criteria. We call this classification casualties. It combines the injury and fatality counts per weather phenomenon, ranked by a) injuries first, and b) fatalities second.

## Extract the most fatalities and injuries by event since year 2000
casualties <- group_by(stormData, EVTYPE) %>%
    summarize(sumInjuries = sum(INJURIES),
              sumFatalities = sum(FATALITIES),
              percInjuries = sum(INJURIES) / totalInjuries,
              percFatalities = sum(FATALITIES) / totalFatalities) %>%
    arrange(desc(sumInjuries), desc(sumFatalities))
head(casualties, 10)
## Source: local data frame [10 x 5]
## 
##               EVTYPE sumInjuries sumFatalities percInjuries percFatalities
##                (chr)       (dbl)         (dbl)        (dbl)          (dbl)
## 1            TORNADO       15213          1193   0.43301170     0.19903237
## 2     EXCESSIVE HEAT        3708          1013   0.10554180     0.16900234
## 3          LIGHTNING        2993           466   0.08519056     0.07774441
## 4          TSTM WIND        1753           116   0.04989611     0.01935269
## 5  THUNDERSTORM WIND        1400           130   0.03984858     0.02168836
## 6  HURRICANE/TYPHOON        1275            64   0.03629067     0.01067734
## 7               HEAT        1222           231   0.03478211     0.03853854
## 8           WILDFIRE         911            75   0.02593004     0.01251251
## 9        FLASH FLOOD         812           600   0.02311217     0.10010010
## 10         HIGH WIND         677           131   0.01926963     0.02185519

This is perhaps a more opinionated way to prioritize effects of weather phenomena, but one which we feel helps those in charge of making contingency plans and forecasting resources accordingly. It is also easier to visualize ranking of weather effects like so:

## Build a plot of the top 10 casualty types since year 2000, comparing injuries and fatalities side by side
subset <- data.frame(casualties$EVTYPE, casualties$sumInjuries, casualties$sumFatalities)
colnames(subset) <- c("TYPE", "INJURIES", "FATALITIES")
subset <- melt(subset[1:10,], id = c("TYPE"))
ggplot(subset, aes(factor(TYPE), value, fill = variable)) + 
    geom_bar(stat="identity", position = "dodge") + 
    scale_x_discrete(limits = unique(as.character(subset$TYPE))) + 
    scale_fill_brewer(palette = "Paired") +
    xlab("Event Types") + ylab("Casualties") + ggtitle("Sources of Casualties by Event Type (2000-Present)") +
    theme(axis.text=element_text(size=10), axis.title=element_text(size=12,face="bold"))

From the above plot it’s clear that resources and contingency plans against weather events should be concentrated on tornadoes, events of excessive heat, flash floods, and thunderstorm winds.
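The two-key ranking behind the casualties table (injuries first, fatalities as the tie-breaker) can be illustrated with base R on a made-up example. The data frame toy and its values are hypothetical; in dplyr the equivalent is to wrap each key separately, as in arrange(desc(sumInjuries), desc(sumFatalities)), since desc() takes a single vector.

```r
# Hypothetical summary rows; A and B tie on injuries.
toy <- data.frame(
  EVTYPE        = c("A", "B", "C"),
  sumInjuries   = c(10, 10, 4),
  sumFatalities = c(1, 5, 9),
  stringsAsFactors = FALSE
)

# Sort by injuries (descending), breaking ties by fatalities (descending).
ranked <- toy[order(-toy$sumInjuries, -toy$sumFatalities), ]
print(ranked)
```

In this toy case B ranks above A because it has more fatalities at equal injuries, while C stays last despite having the most fatalities. This shows how strongly the ordering favors the first key.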

Sources of Damage to Property and Crops from Weather Phenomena

The effect of weather events on human life can be devastating. But the after-effects of damage to property and crops must also be taken into consideration. Even when not fatal to human life, such damage has a devastating effect on communities and government bodies, who have to deal with the financial consequences.

The NOAA database records damage figures for both property and crops. Because both variables are monetary, they are more directly comparable than injuries and fatalities, which simplifies the analysis. We tally the losses recorded for crop damage (variable CROPDMG) and property damage (variable PROPDMG) by weather event type (grouping on EVTYPE). Note that these columns are summed as recorded, without applying the magnitude codes stored in PROPDMGEXP and CROPDMGEXP. We also build a new variable, totalDmg, as the sum of the two, again grouped by event type. The ranking uses three variables: total damage first (totalDmg), property damage second (sumPropDmg), and crop damage as the tie-breaker (sumCropDmg).

## What type of events cause the most damage since the year 2000
damages <- group_by(stormData, EVTYPE) %>% 
    summarize(totalDmg = sum(PROPDMG) + sum(CROPDMG),
              sumPropDmg = sum(PROPDMG), 
              sumCropDmg = sum(CROPDMG), 
              percDmg = (sum(PROPDMG) + sum(CROPDMG)) / (totalPropDmg + totalCropDmg),
              percPropDmg = sum(PROPDMG) / totalPropDmg, 
              percCropDmg = sum(CROPDMG) / totalCropDmg) %>%
    arrange(desc(totalDmg), desc(sumPropDmg), desc(sumCropDmg)) 
head(damages, 10)
## Source: local data frame [10 x 7]
## 
##               EVTYPE   totalDmg sumPropDmg sumCropDmg    percDmg
##                (chr)      (dbl)      (dbl)      (dbl)      (dbl)
## 1        FLASH FLOOD 1131715.05  999333.42  132381.63 0.16492912
## 2            TORNADO  980746.61  907111.70   73634.91 0.14292792
## 3  THUNDERSTORM WIND  928920.36  862257.36   66663.00 0.13537509
## 4          TSTM WIND  865286.92  811528.22   53758.70 0.12610154
## 5               HAIL  815812.65  452533.47  363279.18 0.11889147
## 6              FLOOD  792567.18  671747.56  120819.62 0.11550382
## 7          LIGHTNING  397297.29  395884.69    1412.60 0.05789964
## 8          HIGH WIND  259038.15  247108.53   11929.62 0.03775061
## 9       WINTER STORM   97746.93   97093.93     653.00 0.01424503
## 10          WILDFIRE   87371.54   83007.34    4364.20 0.01273299
## Variables not shown: percPropDmg (dbl), percCropDmg (dbl)

The table includes not only total damage to crops and property by event type, but also a percentage indicator to help prioritize events. Again we see overlap with the event types that drove injuries and fatalities. Flash floods are the number one cause of economic damage to property and crops combined, followed by tornadoes. Thunderstorm winds occupy third and fourth place (again, this may just be a nomenclature discrepancy in the NOAA database), while hail is fifth and floods are sixth.
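The data set also carries the columns PROPDMGEXP and CROPDMGEXP (visible in the str() output earlier), which hold a magnitude code for each damage figure. The sketch below shows one way such codes could be applied before summing. It assumes that "K", "M", and "B" stand for thousands, millions, and billions, and that any other code leaves the value unchanged; the helper scale_damage, the lookup multipliers, and the toy rows are all hypothetical and are not used in the results reported here.

```r
# Hypothetical toy rows using the real column names.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed meaning of the magnitude codes.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a value by its code, defaulting to 1.
scale_damage <- function(value, exp_code) {
  mult <- unname(multipliers[toupper(exp_code)])
  mult[is.na(mult)] <- 1  # unknown or empty codes: assumption, left unscaled
  value * mult
}

toy$propUSD <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)

# Scaled property damage per event type.
scaled <- aggregate(propUSD ~ EVTYPE, data = toy, FUN = sum)
print(scaled)
```

Scaling this way could change the ranking considerably, because a single event coded in millions or billions would outweigh many events coded in thousands.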

A plot bearing total damage by source of event makes the analysis easy to grasp. For purposes of easy visual understanding, we subset the data using just three variables (event type, property damage, and crop damage) to accelerate and simplify the plot.

## Build a plot of the top 10 damage types since year 2000, comparing property & crop damage
subset2 <- data.frame(damages$EVTYPE, damages$sumPropDmg, damages$sumCropDmg)
colnames(subset2) <- c("TYPE", "PROPERTY", "CROP")
subset2 <- melt(subset2[1:10,], id = c("TYPE"))
ggplot(subset2, aes(factor(TYPE), value, fill = variable)) + 
    geom_bar(stat="identity", position = "stack") + 
    scale_fill_brewer(palette = "Paired") + scale_x_discrete(limits = unique(as.character(subset2$TYPE))) +
    xlab("Event Type") + ylab("Monetary Damage USD") + ggtitle("Sources of Economic Damage by Weather Event (2000-Present)") +
    theme(axis.text=element_text(size=10), axis.title=element_text(size=12,face="bold"))

The plot only reinforces the fact that the most costly weather events measured from a property and crop damage perspective are:

  1. floods, normal and flash floods,
  2. tornadoes,
  3. thunderstorm winds as measured by both variables, and
  4. hail.

With these weather events in mind, we deem it much easier to plan ahead for severe weather and prioritize resource allocation.
