Title: Basic Exploration of the NOAA Storm Database

1. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The goal of this assignment is to explore the NOAA Storm Database and answer two basic questions about severe weather events:

    1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
    2. Across the United States, which types of events have the greatest economic consequences?

2. Data Processing

2.1 Preparing Environment for the Project

To avoid misusing data or mixing files, it is recommended to create a dedicated folder for the project inside your working directory:

mainDir <- getwd()
subDir <- "storm_data"

if (file.exists(subDir)){
    setwd(file.path(mainDir, subDir))
} else {
    dir.create(file.path(mainDir, subDir))
    setwd(file.path(mainDir, subDir))
}
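
The same setup can be written a little more compactly. The sketch below is one possible variant, not the code used for this report: it relies on `dir.exists()` (available since R 3.2.0), calls `setwd()` only once, and returns the previous working directory so it can be restored later. The helper name `enter_project_dir` is hypothetical, and the demo runs inside a temporary directory so no real folder is touched.

```r
# Hypothetical helper: create the project folder if needed, then switch to it.
# Returns (invisibly) the previous working directory.
enter_project_dir <- function(sub_dir, main_dir = getwd()) {
    target <- file.path(main_dir, sub_dir)
    if (!dir.exists(target)) {
        dir.create(target, recursive = TRUE)
    }
    # setwd() invisibly returns the directory that was current before the change
    old_dir <- setwd(target)
    invisible(old_dir)
}

# Demo in a temporary location (assumption: stands in for the real working directory)
demo_root <- tempfile("storm_demo_")
dir.create(demo_root)
previous <- enter_project_dir("storm_data", main_dir = demo_root)
created <- dir.exists(file.path(demo_root, "storm_data"))
setwd(previous)  # restore the original working directory
```

Returning the old directory from the helper avoids keeping a separate `mainDir` variable in sync by hand.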

Loading required packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(knitr)

Session information, for reproducibility:

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.4
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.15.1  tidyr_0.6.1   ggplot2_2.2.1 dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9      codetools_0.2-15 digest_0.6.12    rprojroot_1.2   
##  [5] assertthat_0.1   plyr_1.8.4       grid_3.3.2       R6_2.2.0        
##  [9] gtable_0.2.0     DBI_0.6          backports_1.0.5  magrittr_1.5    
## [13] scales_0.4.1     evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0  
## [17] rmarkdown_1.3    tools_3.3.2      stringr_1.2.0    munsell_0.4.3   
## [21] yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5  tibble_1.2

2.2. Importing Data

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be downloaded from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Documentation of the database is also available; it describes how some of the variables are constructed or defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

## Download dataset from website into new working directory

if (!file.exists("repdata_data_StormData.csv.bz2")){
   download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
   destfile="repdata_data_StormData.csv.bz2", quiet = FALSE,
   mode = "wb", cacheOK = TRUE, method="libcurl")
}

## Read file
FullStormData <- read.csv("repdata_data_StormData.csv.bz2")
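
To check the remark that early years hold fewer records, one can count events per year. The sketch below is only an illustration: it assumes the begin date is stored in a `BGN_DATE` column as text such as "4/18/1950 0:00:00" (an assumption about the file layout; check with `names(FullStormData)`), and it runs on a small invented data frame rather than the real data.

```r
# Toy stand-in for FullStormData; all values are invented for illustration
toy <- data.frame(
    BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
                 "5/22/2011 0:00:00", "4/27/2011 0:00:00",
                 "11/28/2011 0:00:00"),
    EVTYPE = c("TORNADO", "TORNADO", "TORNADO", "HAIL", "FLOOD"),
    stringsAsFactors = FALSE
)

# Parse only the date part; the trailing time text is ignored by the format
event_date <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")
event_year <- as.integer(format(event_date, "%Y"))

# Number of recorded events per year
events_per_year <- table(event_year)
print(events_per_year)
```

On the real data, the same two steps applied to `FullStormData$BGN_DATE` would give a per-year count that can be plotted or inspected.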

2.4. Calculating Number of Harmful Events

For the purpose of this project an event is considered harmful when it causes injuries and/or fatalities.

Create data frame with the necessary data and remove NA values:

harmfulevents <- FullStormData
harmfulevents$FATALITIES <- as.numeric(harmfulevents$FATALITIES)
harmfulevents$INJURIES <- as.numeric(harmfulevents$INJURIES)
harmfulevents <- harmfulevents[(!is.na(harmfulevents$FATALITIES)) & (!is.na(harmfulevents$INJURIES)), c("EVTYPE","FATALITIES","INJURIES")]

Aggregate the numbers of injuries and fatalities for each type of event and arrange them in descending order:

sumharmfulevents <- aggregate(. ~ EVTYPE, harmfulevents, sum)
sumharmfulevents <- arrange(sumharmfulevents, desc(FATALITIES + INJURIES))
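
As a small illustration of what the aggregation step does, the sketch below applies the same `aggregate()` call to an invented four-row data frame. It sorts with base R `order()` instead of `dplyr::arrange()` so that it runs without extra packages; the numbers are made up.

```r
# Invented example data: two event types, two records each
toy <- data.frame(
    EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "FLOOD"),
    FATALITIES = c(2, 0, 1, 3),
    INJURIES = c(10, 1, 5, 0)
)

# Sum every numeric column within each event type
toy_sum <- aggregate(. ~ EVTYPE, toy, sum)

# Sort by combined harm, largest first
toy_sum <- toy_sum[order(-(toy_sum$FATALITIES + toy_sum$INJURIES)), ]
print(toy_sum)
```

Note that the formula interface of `aggregate()` drops rows containing `NA` by default (`na.action = na.omit`).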

## Remove no longer necessary data frame from memory
rm(harmfulevents)

Create a plot for the 5 events with the greatest numbers of injuries and fatalities:

sumharmfulevents <- sumharmfulevents[1:5,]
sumharmfulevents <- gather(sumharmfulevents, HMTYPE, TOTAL, FATALITIES:INJURIES)
hfeplot <- ggplot(sumharmfulevents, aes(x = reorder(EVTYPE, -TOTAL),
    y = TOTAL, fill = HMTYPE)) + 
        geom_bar(stat = "identity") +
        labs(x = "Weather Event", y = "Number of Occurrences") +
        labs(title = "Top 5 Most Harmful Weather Events") +
        labs(fill = "Occurrence Type") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))

2.5. Calculating Economic Consequences

For the purpose of this project economic consequences are extracted from the following data frame columns:

  • PROPDMG: Property damages;
  • PROPDMGEXP: Exponent value for property damages;
  • CROPDMG: Crop damages;
  • CROPDMGEXP: Exponent value for crop damages.

Create data frame with the necessary data and remove NA values:

economicconsequences <- FullStormData
economicconsequences$PROPDMG <- as.numeric(economicconsequences$PROPDMG)
economicconsequences$CROPDMG <- as.numeric(economicconsequences$CROPDMG)
economicconsequences <- economicconsequences[(!is.na(economicconsequences$PROPDMG)) & 
    (!is.na(economicconsequences$CROPDMG)) & (!is.na(economicconsequences$PROPDMGEXP)) &
    (!is.na(economicconsequences$CROPDMGEXP)), c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]

## Remove no longer necessary data frame from memory
rm(FullStormData)

Before adding the values of property and crop damages, it is necessary to convert all of them to the same unit (US dollars) by applying the exponent columns.

Magnitudes for property damages:

levels(as.factor(economicconsequences$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

Assuming that:

  • “B” refers to billions;
  • “m” or “M” refer to millions;
  • “K” refers to thousands;
  • “h” or “H” refer to hundreds;
  • “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7” and “8” are 10^n, where “n” is PROPDMGEXP; and
  • “”, “-”, “?” and “+” are treated as a multiplier of 1 (plain values or entry errors).

Then:

for(i in 1:length(economicconsequences$PROPDMGEXP)) {
    if (economicconsequences$PROPDMGEXP[i] == "B") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 1000000000
    } else if (economicconsequences$PROPDMGEXP[i] == "8") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 100000000
    } else if (economicconsequences$PROPDMGEXP[i] == "7") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 10000000
    } else if (economicconsequences$PROPDMGEXP[i] == "m" | economicconsequences$PROPDMGEXP[i] == "M" | economicconsequences$PROPDMGEXP[i] == "6") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 1000000
    } else if (economicconsequences$PROPDMGEXP[i] == "5") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 100000
    } else if (economicconsequences$PROPDMGEXP[i] == "4") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 10000
    } else if (economicconsequences$PROPDMGEXP[i] == "K"  | economicconsequences$PROPDMGEXP[i] == "3") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 1000
    } else if (economicconsequences$PROPDMGEXP[i] == "h" | economicconsequences$PROPDMGEXP[i] == "H"  | economicconsequences$PROPDMGEXP[i] == "2") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 100
    } else if (economicconsequences$PROPDMGEXP[i] == "1") {
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 10
    } else
        economicconsequences$PROPDMG[i] <- economicconsequences$PROPDMG[i] * 1
}
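
The row-by-row loop above can be slow on a data frame of this size. A vectorized lookup is one alternative; the sketch below is an illustration under the same assumptions as listed above, not the code used for this report. The function name `exp_multiplier` is hypothetical and the demo vectors are invented.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
exp_multiplier <- function(exp_code) {
    code <- toupper(as.character(exp_code))
    lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
    mult <- unname(lookup[code])          # NA for codes not in the lookup
    # Digit codes "0" to "8" are read as powers of ten
    is_digit <- grepl("^[0-8]$", code)
    mult[is_digit] <- 10^as.numeric(code[is_digit])
    # Everything else ("", "-", "?", "+") leaves the value unchanged
    mult[is.na(mult)] <- 1
    mult
}

# Invented demo values: 2.5 K, 3 m, 1.2 B, 7 with no code, 4 with "2", 9 with "?"
demo_dmg <- c(2.5, 3, 1.2, 7, 4, 9)
demo_exp <- c("K", "m", "B", "", "2", "?")
demo_total <- demo_dmg * exp_multiplier(demo_exp)
print(demo_total)
```

On the real data this would amount to `economicconsequences$PROPDMG * exp_multiplier(economicconsequences$PROPDMGEXP)`, and the same helper would also cover the crop exponent column.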

Magnitudes for crop damages:

levels(as.factor(economicconsequences$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Assuming that:

  • “B” refers to billions;
  • “m” or “M” refer to millions;
  • “k” or “K” refer to thousands;
  • “0” and “2” are 10^n, where “n” is CROPDMGEXP; and
  • “” and “?” are treated as a multiplier of 1 (plain values or entry errors).

Then:

for(i in 1:length(economicconsequences$CROPDMGEXP)) {
    if (economicconsequences$CROPDMGEXP[i] == "B") {
        economicconsequences$CROPDMG[i] <- economicconsequences$CROPDMG[i] * 1000000000
    } else if (economicconsequences$CROPDMGEXP[i] == "m" | economicconsequences$CROPDMGEXP[i] == "M") {
        economicconsequences$CROPDMG[i] <- economicconsequences$CROPDMG[i] * 1000000
    } else if (economicconsequences$CROPDMGEXP[i] == "k" | economicconsequences$CROPDMGEXP[i] == "K") {
        economicconsequences$CROPDMG[i] <- economicconsequences$CROPDMG[i] * 1000
    } else if (economicconsequences$CROPDMGEXP[i] ==  "2") {
        economicconsequences$CROPDMG[i] <- economicconsequences$CROPDMG[i] * 100
    } else
        economicconsequences$CROPDMG[i] <- economicconsequences$CROPDMG[i] * 1
}

Aggregate damages for each type of event and arrange them in descending order:

economicconsequences <- economicconsequences[c("EVTYPE","PROPDMG","CROPDMG")]
sumeconomicconsequences <- aggregate(. ~ EVTYPE, economicconsequences, sum)
sumeconomicconsequences <- arrange(sumeconomicconsequences, desc(PROPDMG + CROPDMG))

## Remove no longer necessary data frame from memory
rm(economicconsequences)

Create a plot for the top 10 events with highest economic consequences:

sumeconomicconsequences <- sumeconomicconsequences[1:10,]
sumeconomicconsequences <- gather(sumeconomicconsequences, DMTYPE, TOTAL, PROPDMG:CROPDMG)
ecplot <- ggplot(sumeconomicconsequences, aes(x = reorder(EVTYPE, -TOTAL),
    y = TOTAL/10^9, fill = DMTYPE)) + 
        geom_bar(stat = "identity") +
        labs(x = "Event", y = "Damages in billions of US$") +
        labs(title = "Top 10 Weather Events with Highest Economic Consequences") +
        labs(fill = "Damage Type") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))

2.6 Restoring Environment

Since we have changed the default working directory, it is now recommended that we change it back to the previous one:

setwd(mainDir)

3. Results

3.1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

From the plot created in Section 2.4:

print(hfeplot)

sumharmfulevents
##            EVTYPE     HMTYPE TOTAL
## 1         TORNADO FATALITIES  5633
## 2  EXCESSIVE HEAT FATALITIES  1903
## 3       TSTM WIND FATALITIES   504
## 4           FLOOD FATALITIES   470
## 5       LIGHTNING FATALITIES   816
## 6         TORNADO   INJURIES 91346
## 7  EXCESSIVE HEAT   INJURIES  6525
## 8       TSTM WIND   INJURIES  6957
## 9           FLOOD   INJURIES  6789
## 10      LIGHTNING   INJURIES  5230
tornadoinjuries <- subset(sumharmfulevents, EVTYPE=="TORNADO"&HMTYPE=="INJURIES", TOTAL)
tornadoinjuries <- tornadoinjuries/10^3
tornadofatalities <- subset(sumharmfulevents, EVTYPE=="TORNADO"&HMTYPE=="FATALITIES", TOTAL)
tornadofatalities <- tornadofatalities/10^3

It is possible to see that, across the United States, tornadoes (with about 5.6 thousand fatalities and 91.35 thousand injuries) are by far the most harmful weather events with respect to population health.
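
The gap between tornadoes and the other event types can be made explicit by adding fatalities and injuries per event. The sketch below is a quick check that re-enters the ten values from the printed table by hand (it does not depend on the objects created earlier); the variable names are illustrative.

```r
# Values copied from the printed sumharmfulevents table
printed <- data.frame(
    EVTYPE = rep(c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"), 2),
    HMTYPE = rep(c("FATALITIES", "INJURIES"), each = 5),
    TOTAL = c(5633, 1903, 504, 470, 816, 91346, 6525, 6957, 6789, 5230)
)

# Fatalities plus injuries per event type, largest first
combined <- sort(tapply(printed$TOTAL, printed$EVTYPE, sum), decreasing = TRUE)
print(combined)
```

With these figures, tornadoes account for 96,979 combined fatalities and injuries, against 8,428 for excessive heat, the next event type in the table.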

3.2. Across the United States, which types of events have the greatest economic consequences?

From the plot created in Section 2.5:

print(ecplot)

sumeconomicconsequences
##               EVTYPE  DMTYPE        TOTAL
## 1              FLOOD PROPDMG 144657709807
## 2  HURRICANE/TYPHOON PROPDMG  69305840000
## 3            TORNADO PROPDMG  56947380676
## 4        STORM SURGE PROPDMG  43323536000
## 5               HAIL PROPDMG  15735267513
## 6        FLASH FLOOD PROPDMG  16822673978
## 7            DROUGHT PROPDMG   1046106000
## 8          HURRICANE PROPDMG  11868319010
## 9        RIVER FLOOD PROPDMG   5118945500
## 10         ICE STORM PROPDMG   3944927860
## 11             FLOOD CROPDMG   5605254720
## 12 HURRICANE/TYPHOON CROPDMG   2605174701
## 13           TORNADO CROPDMG    380761846
## 14       STORM SURGE CROPDMG            5
## 15              HAIL CROPDMG   2850955098
## 16       FLASH FLOOD CROPDMG   1380145763
## 17           DROUGHT CROPDMG  13965343230
## 18         HURRICANE CROPDMG   2741910000
## 19       RIVER FLOOD CROPDMG   5028185275
## 20         ICE STORM CROPDMG   5021610504
floodpropdmg <- subset(sumeconomicconsequences, EVTYPE=="FLOOD"&DMTYPE=="PROPDMG", TOTAL)
floodpropdmg <- floodpropdmg/10^9
floodtotaldmg <-  subset(sumeconomicconsequences, EVTYPE=="FLOOD"&DMTYPE=="PROPDMG", TOTAL) +
                  subset(sumeconomicconsequences, EVTYPE=="FLOOD"&DMTYPE=="CROPDMG", TOTAL)
floodtotaldmg <- floodtotaldmg/10^9
droughtcrpdmg <- subset(sumeconomicconsequences, EVTYPE=="DROUGHT"&DMTYPE=="CROPDMG", TOTAL)
droughtcrpdmg <- droughtcrpdmg/10^9

It is possible to see that, across the United States, floods (around $144.7 billion) caused the highest property damage and droughts (around $13.97 billion) caused the highest crop damage. With a combined total of about $150.3 billion, floods are the event type with the greatest economic consequences.
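
The rounded figures quoted above follow directly from the printed table. As a quick arithmetic check, the sketch below re-enters the three relevant totals by hand and converts them to billions of US$; the variable names are illustrative.

```r
# Totals copied from the printed sumeconomicconsequences table (US$)
flood_prop   <- 144657709807
flood_crop   <- 5605254720
drought_crop <- 13965343230

# Convert to billions of US$
flood_prop_bn   <- flood_prop / 10^9
flood_total_bn  <- (flood_prop + flood_crop) / 10^9
drought_crop_bn <- drought_crop / 10^9

print(round(flood_prop_bn, 1))    # property damage from floods
print(round(flood_total_bn, 1))   # property plus crop damage from floods
print(round(drought_crop_bn, 2))  # crop damage from droughts
```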