Synopsis

The goal of this project is to analyze the National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database, in order to answer two questions:

First question requires drawing a total of fatalities and injuries for each event type from 1996 to 2011. Tornadoes appear to be a leader among the most harmful events, with 1511 fatal and 20667 injury cases. Second question requires calculating the total damage dealt to property and crops, expressed in billion USD. Floods being the event with the greatest economic consequences of all, totaling 177.6 Bn USD from 1996 to 2011, expressed in prices of 2015.

Dataset

The data for this report comes in the form of a comma-separated-value file compressed via the bzip2 algorithm. The file is downloadable from the course (Reproducible research - Peer Assesment 2) web site:

There is also some documentation of the database available. Here you may find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.

Special note

Also it should be noted that data for all event types were recorded only since 1996 as clearly indicated here.

Data Processing

For the entire analysis and making of this report “dplyr”, “magrittr”, “knitr”, “lubridate” and “markdown” R libraries are used. The code makes extensive use of pipe operators (%>%, %<>%) from “magrittr” R library to streamline function calls.

library(dplyr)
library(magrittr)
library(knitr)
library(markdown)
library(lubridate)
Sys.setlocale("LC_ALL","English")

Loading data into R

The data is loaded into R with the following code:

if (!"StormData.bz2" %in% dir()) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url, "StormData.bz2")
}

if (!"stormdata_raw" %in% ls()) {
  stormdata_raw <- read.csv("StormData.bz2")
}

The code checks if the file is already downloaded and if not it downloads it, then it checks if the data is loaded into R workspace and if not, it loads it. Please make notice that R function read.csv facilitates handling bz2 archives, so unzipping is not required.

Subsetting variables

After loading the original data a subsetting has been done. For the purpose of this project only following variables would be used for further analysis:

  • BGN_DATE - date of the event;
  • EVTYPE - event type;
  • FATALITIES - number of fatalities, caused by event;
  • INJURIES - number of injuries, caused by event;
  • PROPDMG - damage to property, caused by event (in USD);
  • PROPDMGEXP - characters, used to indicate magnitude of the damage (the details would be refered to later);
  • CROPDMG - damage to crops, caused by event;
  • CROPDMGEXP - characters, used to indicate magnitude of the damage.
stormdata <- stormdata_raw %>% 
  select(BGN_DATE,
         EVTYPE, 
         FATALITIES, INJURIES, 
         PROPDMG, PROPDMGEXP,
         CROPDMG, CROPDMGEXP)

Subsetting years

As it has been mentioned in a special note of “Dataset” section data for all event types were recorded only since 1996 as it is explicitly indicated at “Event types” section of NOAA Website.

In order to carry out a proper comparison between all event types the data before 1996 is disregarded by the following code:

stormdata %<>% 
  mutate(YEAR = year(strptime(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))) %>%
  filter(YEAR >= 1996) %>%
  select(-BGN_DATE)

Cleaning Event Types (EVTYPE)

EVTYPE variable is presented by 516 unique values. It suggests considerable cleaning.

stormdata$EVTYPE %>% unique %>% length
## [1] 516

In order to clean the variable following steps should be taken:

  • All letters should be brought up to upper case;
  • Multiple space characters should be replaced by a single space character;
  • Leading and trailing space characters should be removed;
  • Event types should be brought as close to NOAA definitions as possible (this step includes extensive use of grep function).
# Bringing all letters to upper case
stormdata$EVTYPE %<>%  as.character %>% toupper

# Multiple space characters to a single space character.
stormdata$EVTYPE <- gsub("[[:space:]]+", " ", stormdata$EVTYPE)

# Fixing leading and trailing space characters
stormdata$EVTYPE <- gsub("^[[:space:]]+|[[:space:]]+$", "", stormdata$EVTYPE)

# Bringing event types to NOAA standardized definitions as close as possible
stormdata$EVTYPE[stormdata$EVTYPE %in% "AVALANCE"] <- "AVALANCHE"
stormdata$EVTYPE[grep("ASTRONOMICAL HIGH TIDE|SURF|SWELLS", stormdata$EVTYPE)] <- "HIGH SURF"
stormdata$EVTYPE[grep("BLIZZARD", stormdata$EVTYPE)] <- "BLIZZARD"
stormdata$EVTYPE[grep("COASTAL|CSTL", stormdata$EVTYPE)] <- "COASTAL FLOOD"
stormdata$EVTYPE[stormdata$EVTYPE %in% c("DAM BREAK", "FLOOD FLASH")] <- "FLASH FLOOD"
stormdata$EVTYPE[grep("DROUGHT", stormdata$EVTYPE)] <- "DROUGHT"
stormdata$EVTYPE[grep("DUST DEVIL", stormdata$EVTYPE)] <- "DUST DEVIL"
stormdata$EVTYPE[intersect(grep("DUST", stormdata$EVTYPE), 
                           grep("DEVIL", stormdata$EVTYPE,
                                invert = TRUE))] <- "DUST STORM"
stormdata$EVTYPE[grep("(EXTREME|EXCESSIVE|RECORD).*(COLD|WIND CHILL)", 
                      stormdata$EVTYPE)] <- "EXTREME COLD WIND CHILL"
stormdata$EVTYPE[intersect(grep("EXTREME", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("COLD|WIND CHILL|HYPOTHERMIA", stormdata$EVTYPE))] <- "COLD WIND CHILL"
stormdata$EVTYPE[grep("(EXTREME|EXCESSIVE|RECORD).*HEAT", stormdata$EVTYPE)] <- "EXCESSIVE HEAT"
stormdata$EVTYPE[intersect(grep("EXCESSIVE", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("HEAT|HYPERTHERMIA|UNSEASONABLY WARM|WARM", stormdata$EVTYPE))] <- "HEAT"
stormdata$EVTYPE[grep("(EXCESSIVE|HEAVY|HVY|RECORD).*RAIN", stormdata$EVTYPE)] <- "HEAVY RAIN"
stormdata$EVTYPE[grep("(EXCESSIVE|HEAVY|HVY|RECORD).*SNOW", stormdata$EVTYPE)] <- "HEAVY SNOW"
stormdata$EVTYPE[grep("FLASH.*(FLOOD|FLD)", stormdata$EVTYPE)] <- "FLASH FLOOD"
stormdata$EVTYPE[intersect(grep("COASTAL|FLASH", stormdata$EVTYPE, 
                                invert = TRUE),
                           grep("(FLOOD|FLD)", stormdata$EVTYPE))] <- "FLOOD"
stormdata$EVTYPE[grep("FREEZ.*FOG", stormdata$EVTYPE)] <- "FREEZING FOG"
stormdata$EVTYPE[intersect(grep("FREEZING", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("FOG", stormdata$EVTYPE))] <- "DENSE FOG"
stormdata$EVTYPE[grep("(FREEZE|FROST)", stormdata$EVTYPE)] <- "FROST FREEZE"
stormdata$EVTYPE[grep("GLAZE|ICE STORM|FREEZING RAIN|FREEZING DRIZZLE", 
                      stormdata$EVTYPE)] <- "ICE STORM"
stormdata$EVTYPE[grep("HURRICANE|TYPHOON", stormdata$EVTYPE)] <- "HURRICANE TYPHOON"
stormdata$EVTYPE[intersect(grep("MARINE", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("HAIL", stormdata$EVTYPE))] <- "HAIL"
stormdata$EVTYPE[grep("LIGHTNING", stormdata$EVTYPE)] <- "LIGHTNING"
stormdata$EVTYPE[grep("RIP CURRENT", stormdata$EVTYPE)] <- "RIP CURRENT"
stormdata$EVTYPE[intersect(grep("ASTRONOMICAL", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("STORM SURGE|TIDE", stormdata$EVTYPE))] <- "STORM SURGE TIDE"
stormdata$EVTYPE[grep("MARINE.*HAIL", stormdata$EVTYPE)] <- "MARINE HAIL"
stormdata$EVTYPE[grep("MARINE.*(THUNDERSTORM|TSTM|MICROBURST|DOWNBURST|GUSTNADO)", 
                      stormdata$EVTYPE)] <- "MARINE THUNDERSTORM WIND"
stormdata$EVTYPE[intersect(grep("MARINE", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("THUNDERSTORM|TSTM|MICROBURST|DOWNBURST|GUSTNADO", 
                                stormdata$EVTYPE))] <- "THUNDERSTORM WIND"
stormdata$EVTYPE[grep("(MUD|LAND|ROCK).*(SLIDE|SLUMP)", stormdata$EVTYPE)] <- "DEBRIS FLOW"
stormdata$EVTYPE[grep("TORNADO|TORNDAO", stormdata$EVTYPE)] <- "TORNADO"
stormdata$EVTYPE[grep("TROPICAL STORM", stormdata$EVTYPE)] <- "TROPICAL STORM"
stormdata$EVTYPE[grep("MARINE.*WIND", stormdata$EVTYPE)] <- "MARINE HIGH WIND"
stormdata$EVTYPE[intersect(grep("COLD|THUNDERSTORM", stormdata$EVTYPE, 
                                invert = TRUE), 
                           grep("WIND|WND", stormdata$EVTYPE))] <- "HIGH WIND"
stormdata$EVTYPE[grep("WATERSPOUT", stormdata$EVTYPE)] <- "WATERSPOUT"
stormdata$EVTYPE[grep("(WILD|FOREST|BRUSH|GRASS).*FIRE", stormdata$EVTYPE)] <- "WILDFIRE"
stormdata$EVTYPE[grep("WINTER WEATHER|WINTRY MIX", stormdata$EVTYPE)] <- "WINTER WEATHER"
stormdata$EVTYPE[grep("WINTER STORM", stormdata$EVTYPE)] <- "WINTER STORM"
stormdata$EVTYPE[grep("SUMMARY", stormdata$EVTYPE)] <- "OTHER"

Even though the list now has 160 values the data is ready for further analysis. The most prominent events, in terms of questions addressed to the data, have already taken shape as seen further in course of analysis.

stormdata$EVTYPE %>% unique %>% length
## [1] 160

Calculating damage magnitude

Values in CROPDMGEXP and PROPDMGEXP variables in original dataset include alphabetical characters to signify magnitude: “K” for thousands, “M” for millions, and “B” for billions. Also the variables include integers 0-8 and “?”, “-”, “+” characters. All of them occur earlier then 1996 and at the moment the data set has only one observation with such character. Since damage values are zeroes for the observation it is removed from the dataset.

exp_values <- c(0:8, "?", "-", "+")
exp_stormdata_prop <- stormdata %>%
  filter(PROPDMGEXP %in% exp_values)

exp_stormdata_crop <- stormdata %>%
  filter(CROPDMGEXP %in% exp_values)

exp_stormdata <- rbind(exp_stormdata_prop, 
                       exp_stormdata_crop)

exp_stormdata
##   BGN_DATE      EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1     2011 FLASH FLOOD          0        0       0          0       0
##   CROPDMGEXP
## 1          K
rm(exp_stormdata_prop, 
   exp_stormdata_crop, 
   exp_stormdata)

With the following code the abovementioned record is deleted from dataset.

stormdata %<>% 
  filter(!(CROPDMGEXP %in% exp_values), 
         !(PROPDMGEXP %in% exp_values))

As it was mentioned earlier alphabetical characters are used to signify magnitude (“K” for thousands, “M” for millions, and “B” for billions). Following code is used to translate characters into magnitude factors.

stormdata$PROPDMGEXP %<>% as.character %>% toupper
stormdata$CROPDMGEXP %<>% as.character %>% toupper

stormdata %<>% 
  mutate(PROPDMGEXP = ifelse(PROPDMGEXP == "K",
                             yes = 1000,
                             no = ifelse(PROPDMGEXP == "M",
                                         yes = 1000000,
                                         no = ifelse(PROPDMGEXP == "B", 
                                                     yes = 1000000000,
                                                     no = 1))),
         CROPDMGEXP = ifelse(CROPDMGEXP == "K",
                             yes = 1000,
                             no = ifelse(CROPDMGEXP == "M",
                                         yes = 1000000,
                                         no = ifelse(CROPDMGEXP == "B", 
                                                     yes = 1000000000,
                                                     no = 1))))

Adjusting for inflation

In order to provide better understanding of the damage extent it is important to adjust total damage for inflation. For the purpose of this report figures are adjusted to the values of 2015. Inflation factors were taken from US Inflation Calculator.

During this step both EXP(damage magnitude extent) and InfFACTOR(inflation factor) are applied to property and crops damage values. For the purpose of further analysis damage is calculated in billion USD.

Inf_factors <- matrix(data = c(1996, 1.513,
                               1997, 1.479,
                               1998, 1.456,
                               1999, 1.425, 
                               2000, 1.378,
                               2001, 1.341,
                               2002, 1.319,
                               2003, 1.290,
                               2004, 1.256,
                               2005, 1.215,
                               2006, 1.177,
                               2007, 1.145, 
                               2008, 1.102,
                               2009, 1.106, 
                               2010, 1.088,
                               2011, 1.055), 
                      nrow = 16, ncol = 2, byrow = T) %>% data.frame(stringsAsFactors = F)
colnames(Inf_factors) <- c("YEAR", "InfFACTOR")

stormdata %<>% 
  merge(Inf_factors, by = "YEAR") %>% 
  mutate(PROPDMG = PROPDMG*InfFACTOR*PROPDMGEXP/1000000000,
         CROPDMG = CROPDMG*InfFACTOR*CROPDMGEXP/1000000000) %>% 
  select(-YEAR, 
         -InfFACTOR, 
         -PROPDMGEXP, 
         -CROPDMGEXP)

This step concludes data processing operations. Further manipulations with data would be carried out in order to answer 2 main questions of this report.

Most harmful types of events with respect to population health of the United States

In order to find out what are the most harmful event types with respect to population health of the United States it is reasonable to use data from FATALITIES and INJURIES variables. Also the cumulative harm, expressed in sum of the variables, would be calculated.

Next chunk of code helps to ouline five most harmful event types across United States. EVTYPE, FATALITIES and INJURIES variables are selected. Sums of all injuries and fatalities is calulated across each event type. As mentioned earlier, only five event types are selected, which represent the most harmful ones.

five_most_harmful <- stormdata %>%
  select(EVTYPE, 
         FATALITIES, 
         INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(Fatalities = sum(FATALITIES),
            Injuries = sum(INJURIES),
            Total_F_I = sum(FATALITIES, INJURIES)) %>%
  top_n(5, Total_F_I) %>%
  arrange(Total_F_I)

Plotting the data

Below is a barplot interpretation of the data, calculated in previous step.

par(mar=c(5, 8, 2, 2),
    las=1, 
    cex.axis=1)

barplot(five_most_harmful$Total_F_I,
        names.arg = five_most_harmful$EVTYPE,
        main = "Fig.1: Top five most harmful events 1996 - 2011",
        xlab = "Injuries and fatalities combined",
        horiz = TRUE, xlim = c(0, 25000),
        col = c(rep("gray", length(five_most_harmful$EVTYPE)-1), "dark red"),
        cex.names = 0.7, space = 0.4, border = NA, xaxt = "n")

axis(side = 1, at = seq(0, 25000, 5000), 
     labels = c("0", "5k", "10k", "15k", "20k", "25k"))

text(five_most_harmful$Total_F_I[5]-6000, 6.475, 
     paste("Fatalities: ", five_most_harmful$Fatalities[5], "\n", "Injuries: ",
           five_most_harmful$Injuries[5]), cex=0.8, font=2, pos=4, col = "white")

for (i in 1:4) {
  text(five_most_harmful$Total_F_I[i] + 0.25, 6.475-1.4*(abs(i-5)), 
       paste("Fatalities: ", five_most_harmful$Fatalities[i], "\n", "Injuries: ", 
             five_most_harmful$Injuries[i]), 
       cex=0.7, font=2, pos=4)
}

Plot clearly indicates, that Tornado is a definite leader among most harmful for population health type of weather events. Tornadoes have taken away lives of 1511 and injured 20677 people during 1996-2011.

Events, which caused the greatest damage to economy of the United States.

Since all preparations for finding out which events caused the greatest consequences to the economy of United States has been done, the only thing left is to subset the data and calculate total damage to both property and crops. The following code serves this exact purpose.

economic_impact <- stormdata %>%
  select(EVTYPE, PROPDMG, CROPDMG) %>%
  group_by(EVTYPE) %>%
  summarise(TotalDMG = sum(PROPDMG, CROPDMG)) %>%
  top_n(10, TotalDMG) %>%
  arrange(TotalDMG)

Plotting the data

Below is a barplot interpretation of the data, calculated in previous step.

par(mar=c(5, 8, 2, 2),
    las=1, 
    cex.axis=1)

barplot(economic_impact$TotalDMG,
        names.arg = economic_impact$EVTYPE,
        main = "Fig.2: Top ten most economically impactful events 1996 - 2011 \n (in 2015 prices)",
        xlab = "Total damage to property and crops, billion USD",
        horiz = TRUE, xlim = c(0, 175),
        col = c(rep("gray", length(economic_impact$EVTYPE)-1), "dark red"),
        cex.names = 0.7, space = 0.4, border = NA, xaxt = "n")

axis(side = 1, at = seq(0, 175, 25), 
     labels = c("0", "25", "50", "75", "100", "125", "150", "175"))

text(economic_impact$TotalDMG[10] - 34.5, 13.525, 
     paste(round(economic_impact$TotalDMG[10], digits = 2), "Bn"), 
     cex=1, font=2, pos=4, col = "white")

for (i in 1:9) {
  text(economic_impact$TotalDMG[i] + 0.25, 13.525-1.4*(abs(i-10)), 
       paste(round(economic_impact$TotalDMG[i], digits = 2), "Bn"), 
       cex=0.7, font=2, pos=4)
  }

As seen from the plot, Floods caused the most damage to US economy.

Results