Synopsis

This report documents my analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. After downloading, decompressing, and importing the data, I aggregated it by weather event type to compare total injuries and fatalities as well as the economic damage caused by each type. Tornadoes cause the most fatalities and injuries, whereas floods inflict the highest economic damage. These results are visualized in simple bar charts.

Data Processing

This part describes the data extraction, loading, and transformation steps performed to obtain a tidy data set for further analysis.

Reading in the data

The data file is available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

The following code downloads the data from the specified URL and saves it locally before loading it into R.

library(data.table)
library(R.utils)

## Warning: package 'R.utils' was built under R version 3.2.5

## Loading required package: R.oo

## Loading required package: R.methodsS3

## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.

## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.

## 
## Attaching package: 'R.oo'

## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods

## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save

## R.utils v2.3.0 (2016-04-13) successfully loaded. See ?R.utils for help.

## 
## Attaching package: 'R.utils'

## The following object is masked from 'package:utils':
## 
##     timestamp

## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings

if(!dir.exists("data")){
        dir.create("data")
}
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              "data/StormData.csv.bz2")
bunzip2("data/StormData.csv.bz2", "data/StormData.csv", remove=FALSE, overwrite = TRUE)
data <- fread("data/StormData.csv")

## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:11

## Warning in fread("data/StormData.csv"): Read less rows (902297) than were
## allocated (967216). Run again with verbose=TRUE and please report.
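Because downloading and decompressing the archive is slow, it can help to skip the download when the file is already present. The following is only a minimal sketch of that idea and not part of the original analysis; the helper name `fetch_once` and its `downloader` argument are hypothetical and introduced purely for illustration.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only if it is not already present locally.
fetch_once <- function(url, dest, downloader = utils::download.file) {
        # Create the target directory if it does not exist yet.
        if (!dir.exists(dirname(dest))) {
                dir.create(dirname(dest), recursive = TRUE)
        }
        if (!file.exists(dest)) {
                downloader(url, dest, mode = "wb")  # binary mode for the .bz2 archive
                return(TRUE)   # a download was performed
        }
        FALSE                  # file already present, nothing to do
}

# Assumed usage with the URL and path from the code above:
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "data/StormData.csv.bz2")
```

The function returns TRUE when it downloaded the file and FALSE when it found an existing copy, so repeated runs of the report would not fetch the archive again.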

Next, the event type variable (EVTYPE) is converted into a factor, and an additional variable, POP_HEALTH_CONSEQUENCES, is introduced as the sum of the fatalities and injuries caused by each weather event.

library(magrittr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- data %>%
        mutate(EVTYPE = as.factor(EVTYPE), 
               POP_HEALTH_CONSEQUENCES = FATALITIES + INJURIES)
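To make the effect of this transformation concrete, here is a small sketch on made-up data. It uses base R rather than dplyr so that it runs without extra packages; the data frame `toy` and its values are assumptions for illustration, and only the column names are taken from the storm data.

```r
# Toy data with the same column names as the storm data (values are made up).
toy <- data.frame(
        EVTYPE     = c("TORNADO", "FLOOD", "TORNADO"),
        FATALITIES = c(2, 0, 1),
        INJURIES   = c(10, 3, 0),
        stringsAsFactors = FALSE
)

# Base R equivalent of the mutate() call above.
toy$EVTYPE <- as.factor(toy$EVTYPE)
toy$POP_HEALTH_CONSEQUENCES <- toy$FATALITIES + toy$INJURIES

toy$POP_HEALTH_CONSEQUENCES
# 12 3 1
```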

Results

This section presents the results of my analysis.

Events with the highest harmful impact on population health

In a first step, the data was analyzed to find the events that caused the most injuries and fatalities in the US over the considered period. To this end, I aggregated the data by event type and selected the ten event types with the highest impact on population health:

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, last

dataPlot.pop <- data %>%
        select(EVTYPE, POP_HEALTH_CONSEQUENCES) %>%
        aggregate(. ~ EVTYPE,data = ., FUN = sum) %>%
        data.table

dataPlot.pop <- dataPlot.pop[order(POP_HEALTH_CONSEQUENCES, decreasing = TRUE)] %>%
        head(10) %>%
        melt("EVTYPE", "POP_HEALTH_CONSEQUENCES")

The prepared data now serves as the basis for the subsequent plot:

str(dataPlot.pop)

## Classes 'data.table' and 'data.frame':   10 obs. of  3 variables:
##  $ EVTYPE  : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 130 856 170 464 275 153 427 760 972
##  $ variable: Factor w/ 1 level "POP_HEALTH_CONSEQUENCES": 1 1 1 1 1 1 1 1 1 1
##  $ value   : num  96979 8428 7461 7259 6046 ...
##  - attr(*, ".internal.selfref")=<externalptr>
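The aggregate-then-rank pattern used above can be sketched on a few made-up rows. This is an illustration in base R only; the data frame `toy`, its values, and the choice of keeping two rows instead of ten are assumptions made to keep the example small.

```r
# Made-up casualty counts per event record (illustration only).
toy <- data.frame(
        EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
        POP_HEALTH_CONSEQUENCES = c(12, 3, 1, 7, 2)
)

# Sum the casualties per event type, as in the aggregate() call above.
totals <- aggregate(POP_HEALTH_CONSEQUENCES ~ EVTYPE, data = toy, FUN = sum)

# Sort in decreasing order and keep the top two event types.
top <- head(totals[order(totals$POP_HEALTH_CONSEQUENCES, decreasing = TRUE), ], 2)
top
```

Here TORNADO (13) and HEAT (7) come out on top, ahead of FLOOD (5).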

The following code plots the ten event types with the highest combined number of fatalities and injuries:

library(ggplot2)
g <- dataPlot.pop %>%
        filter(variable == "POP_HEALTH_CONSEQUENCES") %>%
        ggplot(aes(x = EVTYPE, y = value)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab("") + 
        xlab("") + 
        ggtitle("Number of total injuries and fatalities in the US per weather event")
g

Tornadoes harmed by far the most people in the US over the considered period: 96,979 combined fatalities and injuries, compared with 8,428 for the second-ranked event type.
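ggplot2 arranges a factor on the x axis in the order of its levels, which for EVTYPE is alphabetical by default, so the bars in the chart above are not sorted by height. As a sketch only, and not something the original analysis does, the levels could be reordered by value with `reorder()` before plotting; the data frame `toy` and its values are made up for illustration.

```r
# Made-up totals per event type (illustration only).
toy <- data.frame(
        EVTYPE = factor(c("FLOOD", "HEAT", "TORNADO")),
        value  = c(5, 7, 13)
)

# reorder() sorts factor levels by a numeric summary; negating the value
# puts the event type with the largest total first.
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$value)
levels(toy$EVTYPE)
# "TORNADO" "HEAT" "FLOOD"
```

Under this assumption, the same expression could be used inside `aes()`, for example `aes(x = reorder(EVTYPE, -value), y = value)`.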

Events with the highest economic impacts

To calculate the overall economic impact, the magnitude abbreviations in the damage EXP columns (PROPDMGEXP and CROPDMGEXP) must be converted into numeric multipliers before the damage figures can be aggregated. I converted the entries as follows: “h” and “H” into hundreds, “k” and “K” into thousands, “m” and “M” into millions, and “b” and “B” into billions. All other entries are treated as a multiplier of 1.

data <- data[, ':='(PROPDMG_MULT = 1, CROPDMG_MULT = 1)] %>%
        mutate(PROPDMG_MULT = ifelse(tolower(PROPDMGEXP) == "h", 100,
                              ifelse(tolower(PROPDMGEXP) == "k", 1000,
                              ifelse(tolower(PROPDMGEXP) == "m", 1000000,
                              ifelse(tolower(PROPDMGEXP) == "b", 1000000000, 1)))),
               CROPDMG_MULT = ifelse(tolower(CROPDMGEXP) == "h", 100,
                              ifelse(tolower(CROPDMGEXP) == "k", 1000,
                              ifelse(tolower(CROPDMGEXP) == "m", 1000000,
                              ifelse(tolower(CROPDMGEXP) == "b", 1000000000, 1)))))
data <- data %>%
        mutate(PROPDMG_NUM = PROPDMG*PROPDMG_MULT,
               CROPDMG_NUM = CROPDMG*CROPDMG_MULT) %>%
        mutate(ECON_DMG = PROPDMG_NUM + CROPDMG_NUM)
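The mapping from magnitude codes to multipliers can also be expressed as a named lookup vector, which some readers may find easier to check than nested `ifelse()` calls. The following is a sketch of that alternative, not the code used in this analysis; the names `exp_lookup` and `exp_to_mult` are hypothetical. Unrecognized codes (and, in this sketch, missing values as well) map to 1.

```r
# Multipliers for the magnitude codes described above.
exp_lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: convert a vector of EXP codes to numeric multipliers.
exp_to_mult <- function(exp_code) {
        # Look up each lower-cased code; codes without an entry yield NA.
        mult <- unname(exp_lookup[tolower(exp_code)])
        # Fall back to a multiplier of 1 for anything not in the lookup.
        mult[is.na(mult)] <- 1
        mult
}

exp_to_mult(c("K", "m", "", "B", "5", "h"))
# 1e+03 1e+06 1e+00 1e+09 1e+00 1e+02
```

With such a helper, the damage in dollars would be, for example, `PROPDMG * exp_to_mult(PROPDMGEXP)`.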

With the damage figures converted into dollar amounts, I aggregate them by event type and keep the ten event types with the highest total damage:

library(data.table)
dataPlot.econ <- data %>%
        select(EVTYPE, ECON_DMG) %>%
        aggregate(. ~ EVTYPE,data = ., FUN = sum) %>%
        data.table

dataPlot.econ <- dataPlot.econ[order(ECON_DMG, decreasing = TRUE)] %>%
        head(10) %>%
        melt("EVTYPE", "ECON_DMG")

The structure of the resulting data set is shown here:

str(dataPlot.econ)

## Classes 'data.table' and 'data.frame':   10 obs. of  3 variables:
##  $ EVTYPE  : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 170 411 834 670 244 153 95 402 590 427
##  $ variable: Factor w/ 1 level "ECON_DMG": 1 1 1 1 1 1 1 1 1 1
##  $ value   : num  1.50e+11 7.19e+10 5.74e+10 4.33e+10 1.88e+10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The ten event types with the highest total economic damage can now be plotted:

library(ggplot2)
g <- dataPlot.econ %>%
        ggplot(aes(x = EVTYPE, y = value)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ylab("Economic damage in $") + 
        xlab("") + 
        ggtitle("Total economic damage in the US per weather event")
g

Whereas tornadoes hurt the most people, floods caused the highest economic damage, with a total of about 1.50e+11 dollars (roughly $150 billion).

Analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database