Synopsis

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database includes information about event types causing damage on public health and the economy. The following analysis uses this database in order to tackle the two questions:

The database includes comparable data since 1996, therefore the analysis will use data since 1996. The coding of the event types differs from instructions from the original code book. Therefore, a matching of the event types from the dataset and the original codebook had to be done and is described below. Impacts on public health are separated in injuries and fatalities. Damage on the economy are separated in crop and property damage.

The results are:

The most harmful events concerning injuries and fatalities are tornados. However, the most harmful event concerning fatalities is excessive heat.

The most harmful events concerning economic damage are floods. However, the most harmful event concerning crop damage is drought.

Analysis

Loading Required Libraries

#required libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.5.2

Getting data

The data is available via a bz2-file, which can be opened with read.csv.

# Getting data
data <- read.csv("repdata_data_StormData.csv.bz2")

Data Processing

Before 1996 the dataset gradually were added new event types. Since 1996 the dataset wasn’t changed concerning new event types. Therefore we will focus on the years from 1996.

# Transform BGN_DATE to date-variable and filter only Events after 01/01/1996
data0 <- data
data0$date <- as.Date(data[["BGN_DATE"]], "%m/%d/%Y %H:%M:%S")
data0 <- data0 %>% filter(date > "1996-01-01")

However, the event types are coded in a messy way. First, we will lower all cases in order to get a more comprehensive data set. In the second step we will erase all event types including the word “summary” which do not appear to be events

# Lower all cases in EVTYPE
data0$EVTYPE <- tolower(data0$EVTYPE)

# Delete all Events containing "Summary" in EVTYPE
data0 <- data0[!grepl("Summary", data0$EVTYPE),]

In the third step a variable eventtypes48 is created in order to match the event types from the original codebook the the event types from the dataset. The matching is done with amatch and a maximal distance of 3.

# Creating eventtypes48 which includes all 48 Event Types from the code book.
eventtypes48 <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Freezing Fog","Frost/Freeze","Funnel Cloud","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane/Typhoon","Ice Storm","Lakeshore Flood","Lake-Effect Snow","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")

# Via amatch the event types from the dataset are matched to the original Codebook
data0$match <- amatch(data0$EVTYPE, eventtypes48, maxDist = 3)
data0 <- data0[is.na(data0$match) == FALSE,]

# A dataframe df48 is created for further matching
df48 <- as.data.frame(eventtypes48)
df48$match <- 1:nrow(df48)

# The original dataset and df48 get merged in order to match the original Event Type from the codebook
data0 <- (merge(df48, data0, by = "match"))
data0 <- subset(data0, select=-c(EVTYPE,match))
colnames(data0)[colnames(data0) == "eventtypes48"] <- "EVTYPE"

Processing the data for the health impact analysis

From the processed data0 the columns evtype, injuries and fatalities are extracted which indicate the impact of events on public health. The steps include creating a new column which sums the incidents of injuries and fatalities and preparing the data for visualization. The dataset healthdata4 is used for further visualization.

# Select three relevant columns: EVTYPE, INJURIES, FATALITIES
healthdata1 <- data0 %>% select(EVTYPE, INJURIES, FATALITIES)

# Aggregate Injuries and Fatalities to new column Sum for further sorting
healthdata1 <- aggregate(healthdata1[-1], healthdata1["EVTYPE"], sum)

# Create Sum of Injuries and Fatalities Incidents
healthdata1$SUM <- healthdata1$INJURIES + healthdata1$FATALITIES

# Order data by Sum of Injuries and Fatalities
healthdata2 <- healthdata1[order(healthdata1$SUM, decreasing=TRUE), ]

# Select first twenty rows for better visualization
healthdata3 <- healthdata2[1:20, ]

# Erase before created sum column
healthdata3 <- healthdata3[1:3]

# Melt data from wide to long format
healthdata4 <- melt(healthdata3, id.var="EVTYPE")

Processing the data for the economic impact analysis

From the processed data0 the columns evtype, cropdmg, cropdmgexp, propdmg and propdmgexp are extracted which indicate the impact of events the economy.

In the first step new columns need to be calculated from the information of the dmg and dmgexp columns. dmgexp includes information about the exponent values of the dmg columns.

# Calculation of total economic impact
dmgdata1 <- data0 %>% select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP)

dmgdata1$CROPDMGCAL <- 0
dmgdata1$PROPDMGCAL <- 0

dmgdata1$CROPDMGCAL <- ifelse(grepl("", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("K", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("M", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("B", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000000000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl(" ", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("0", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)

dmgdata1$PROPDMGCAL <- ifelse(grepl("", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("K", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("M", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("B", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000000000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl(" ", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("0", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)

The following steps include creating new column which sums the economic impact on property and crop. The further steps are needed for preparing the dataset for visualization. The dataset dmgdata6 is used for further visualization.

# Erase DMG and DMGEXP columns
dmgdata2 <- subset(dmgdata1, select=-c(CROPDMG,CROPDMGEXP, PROPDMG, PROPDMGEXP))

# Aggregate Injuries and Fatalities to new column Sum for further sorting
dmgdata3 <- aggregate(dmgdata2[-1], dmgdata2["EVTYPE"], sum)

# Create Sum of Injuries and Fatalities Incidents
dmgdata3$SUM <- dmgdata3$CROPDMGCAL + dmgdata3$PROPDMGCAL

# Order data by Sum of Injuries and Fatalities
dmgdata4 <- dmgdata3[order(dmgdata3$SUM, decreasing=TRUE), ]

# Select first twenty rows for better visualization
dmgdata5 <- dmgdata4[1:20, ]

# Erase before created sum column
dmgdata5 <- dmgdata5[1:3]
colnames(dmgdata5)[colnames(dmgdata5) == "CROPDMGCAL"] <- "Crop Damage"
colnames(dmgdata5)[colnames(dmgdata5) == "PROPDMGCAL"] <- "Prop Damage"

# Melt data from wide to long format
dmgdata6 <- melt(dmgdata5, id.var="EVTYPE")

Results

Impact on public health

### HEALTH ANALYSIS 

# 6 Most harmful events concerning Injuries
head(healthdata1[order(healthdata1$INJURIES, decreasing=TRUE), ])
##               EVTYPE INJURIES FATALITIES   SUM
## 37           Tornado    20667       1511 22178
## 13             Flood     6770        432  7202
## 11    Excessive Heat     6391       1797  8188
## 27         Lightning     4140        650  4790
## 12       Flash Flood     1674        887  2561
## 36 Thunderstorm Wind     1400        130  1530
# 6 Most harmful events concerning Fatalities
head(healthdata1[order(healthdata1$FATALITIES, decreasing=TRUE), ])
##            EVTYPE INJURIES FATALITIES   SUM
## 11 Excessive Heat     6391       1797  8188
## 37        Tornado    20667       1511 22178
## 12    Flash Flood     1674        887  2561
## 27      Lightning     4140        650  4790
## 32    Rip Current      503        542  1045
## 13          Flood     6770        432  7202
# 6 Most harmful events concerning sum of Injuries & Fatalities
head(healthdata1[order(healthdata1$SUM, decreasing=TRUE), ])
##               EVTYPE INJURIES FATALITIES   SUM
## 37           Tornado    20667       1511 22178
## 11    Excessive Heat     6391       1797  8188
## 13             Flood     6770        432  7202
## 27         Lightning     4140        650  4790
## 12       Flash Flood     1674        887  2561
## 36 Thunderstorm Wind     1400        130  1530

The most harmful events concerning injuries and fatalities are tornados. However, the most harmful event concerning fatalities is excessive heat.

The following barchart shows the 20 most harmful events concerning public health in decreasing order.

# Plot stacked barchart 
ggplot(healthdata4, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) + 
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(x = "Event Type") +
        labs(y = "Incidents") +
        labs(title = "20 most harmful Events since 01/01/1996", subtitle = "A summary of all injuries and fatalities segmented by Events")

Impact on Economy

### ECONOMY ANALYSIS 

# 6 Most harmful events concerning Crop damage
head(dmgdata4[order(dmgdata4$CROPDMGCAL, decreasing=TRUE), ])
##               EVTYPE  CROPDMGCAL   PROPDMGCAL          SUM
## 8            Drought 13367566000   1046101000  14413667000
## 13             Flood  4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon  2607872800  69305840000  71913712800
## 17              Hail  2476279450  14595443420  17071722870
## 12       Flash Flood  1334901700  15222253910  16557155610
## 15      Frost/Freeze  1094186000     10480000   1104666000
# 6 Most harmful events concerning Property damage
head(dmgdata4[order(dmgdata4$PROPDMGCAL, decreasing=TRUE), ])
##               EVTYPE CROPDMGCAL   PROPDMGCAL          SUM
## 13             Flood 4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon 2607872800  69305840000  71913712800
## 37           Tornado  283425010  24616905710  24900330720
## 12       Flash Flood 1334901700  15222253910  16557155610
## 17              Hail 2476279450  14595443420  17071722870
## 39    Tropical Storm  677711000   7642475550   8320186550
# 6 Most harmful events concerning sum of crop & property damage
head(dmgdata4[order(dmgdata4$SUM, decreasing=TRUE), ])
##               EVTYPE  CROPDMGCAL   PROPDMGCAL          SUM
## 13             Flood  4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon  2607872800  69305840000  71913712800
## 37           Tornado   283425010  24616905710  24900330720
## 17              Hail  2476279450  14595443420  17071722870
## 12       Flash Flood  1334901700  15222253910  16557155610
## 8            Drought 13367566000   1046101000  14413667000

The most harmful events concerning economic damage are floods. However, the most harmful event concerning crop damage is drought.

The following barchart shows the 20 most harmful events concerning economic damage in decreasing order.

# Plot stacked barchart 
ggplot(dmgdata6, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) + 
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(x = "Event Type") +
        labs(y = "Economic Damage") +
        labs(title = "20 Events with highest economic damage since 01/01/1996", subtitle = "A summary of crop damage and property damage segmented by Events")