The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database includes information about event types causing damage on public health and the economy. The following analysis uses this database in order to tackle the two questions:
The database includes comparable data since 1996, therefore the analysis will use data since 1996. The coding of the event types differs from instructions from the original code book. Therefore, a matching of the event types from the dataset and the original codebook had to be done and is described below. Impacts on public health are separated in injuries and fatalities. Damage on the economy are separated in crop and property damage.
The results are:
The most harmful events concerning injuries and fatalities are tornados. However, the most harmful event concerning fatalities is excessive heat.
The most harmful events concerning economic damage are floods. However, the most harmful event concerning crop damage is drought.
#required libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.5.2
The data is available via a bz2-file, which can be opened with read.csv.
# Getting data
data <- read.csv("repdata_data_StormData.csv.bz2")
Before 1996 the dataset gradually were added new event types. Since 1996 the dataset wasn’t changed concerning new event types. Therefore we will focus on the years from 1996.
# Transform BGN_DATE to date-variable and filter only Events after 01/01/1996
data0 <- data
data0$date <- as.Date(data[["BGN_DATE"]], "%m/%d/%Y %H:%M:%S")
data0 <- data0 %>% filter(date > "1996-01-01")
However, the event types are coded in a messy way. First, we will lower all cases in order to get a more comprehensive data set. In the second step we will erase all event types including the word “summary” which do not appear to be events
# Lower all cases in EVTYPE
data0$EVTYPE <- tolower(data0$EVTYPE)
# Delete all Events containing "Summary" in EVTYPE
data0 <- data0[!grepl("Summary", data0$EVTYPE),]
In the third step a variable eventtypes48 is created in order to match the event types from the original codebook the the event types from the dataset. The matching is done with amatch and a maximal distance of 3.
# Creating eventtypes48 which includes all 48 Event Types from the code book.
eventtypes48 <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Freezing Fog","Frost/Freeze","Funnel Cloud","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane/Typhoon","Ice Storm","Lakeshore Flood","Lake-Effect Snow","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
# Via amatch the event types from the dataset are matched to the original Codebook
data0$match <- amatch(data0$EVTYPE, eventtypes48, maxDist = 3)
data0 <- data0[is.na(data0$match) == FALSE,]
# A dataframe df48 is created for further matching
df48 <- as.data.frame(eventtypes48)
df48$match <- 1:nrow(df48)
# The original dataset and df48 get merged in order to match the original Event Type from the codebook
data0 <- (merge(df48, data0, by = "match"))
data0 <- subset(data0, select=-c(EVTYPE,match))
colnames(data0)[colnames(data0) == "eventtypes48"] <- "EVTYPE"
From the processed data0 the columns evtype, injuries and fatalities are extracted which indicate the impact of events on public health. The steps include creating a new column which sums the incidents of injuries and fatalities and preparing the data for visualization. The dataset healthdata4 is used for further visualization.
# Select three relevant columns: EVTYPE, INJURIES, FATALITIES
healthdata1 <- data0 %>% select(EVTYPE, INJURIES, FATALITIES)
# Aggregate Injuries and Fatalities to new column Sum for further sorting
healthdata1 <- aggregate(healthdata1[-1], healthdata1["EVTYPE"], sum)
# Create Sum of Injuries and Fatalities Incidents
healthdata1$SUM <- healthdata1$INJURIES + healthdata1$FATALITIES
# Order data by Sum of Injuries and Fatalities
healthdata2 <- healthdata1[order(healthdata1$SUM, decreasing=TRUE), ]
# Select first twenty rows for better visualization
healthdata3 <- healthdata2[1:20, ]
# Erase before created sum column
healthdata3 <- healthdata3[1:3]
# Melt data from wide to long format
healthdata4 <- melt(healthdata3, id.var="EVTYPE")
From the processed data0 the columns evtype, cropdmg, cropdmgexp, propdmg and propdmgexp are extracted which indicate the impact of events the economy.
In the first step new columns need to be calculated from the information of the dmg and dmgexp columns. dmgexp includes information about the exponent values of the dmg columns.
# Calculation of total economic impact
dmgdata1 <- data0 %>% select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP)
dmgdata1$CROPDMGCAL <- 0
dmgdata1$PROPDMGCAL <- 0
dmgdata1$CROPDMGCAL <- ifelse(grepl("", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("K", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("M", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("B", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 1000000000, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl(" ", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)
dmgdata1$CROPDMGCAL <- ifelse(grepl("0", dmgdata1$CROPDMGEXP), dmgdata1$CROPDMG * 0, dmgdata1$CROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("K", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("M", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("B", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 1000000000, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl(" ", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)
dmgdata1$PROPDMGCAL <- ifelse(grepl("0", dmgdata1$PROPDMGEXP), dmgdata1$PROPDMG * 0, dmgdata1$PROPDMGCAL)
The following steps include creating new column which sums the economic impact on property and crop. The further steps are needed for preparing the dataset for visualization. The dataset dmgdata6 is used for further visualization.
# Erase DMG and DMGEXP columns
dmgdata2 <- subset(dmgdata1, select=-c(CROPDMG,CROPDMGEXP, PROPDMG, PROPDMGEXP))
# Aggregate Injuries and Fatalities to new column Sum for further sorting
dmgdata3 <- aggregate(dmgdata2[-1], dmgdata2["EVTYPE"], sum)
# Create Sum of Injuries and Fatalities Incidents
dmgdata3$SUM <- dmgdata3$CROPDMGCAL + dmgdata3$PROPDMGCAL
# Order data by Sum of Injuries and Fatalities
dmgdata4 <- dmgdata3[order(dmgdata3$SUM, decreasing=TRUE), ]
# Select first twenty rows for better visualization
dmgdata5 <- dmgdata4[1:20, ]
# Erase before created sum column
dmgdata5 <- dmgdata5[1:3]
colnames(dmgdata5)[colnames(dmgdata5) == "CROPDMGCAL"] <- "Crop Damage"
colnames(dmgdata5)[colnames(dmgdata5) == "PROPDMGCAL"] <- "Prop Damage"
# Melt data from wide to long format
dmgdata6 <- melt(dmgdata5, id.var="EVTYPE")
### HEALTH ANALYSIS
# 6 Most harmful events concerning Injuries
head(healthdata1[order(healthdata1$INJURIES, decreasing=TRUE), ])
## EVTYPE INJURIES FATALITIES SUM
## 37 Tornado 20667 1511 22178
## 13 Flood 6770 432 7202
## 11 Excessive Heat 6391 1797 8188
## 27 Lightning 4140 650 4790
## 12 Flash Flood 1674 887 2561
## 36 Thunderstorm Wind 1400 130 1530
# 6 Most harmful events concerning Fatalities
head(healthdata1[order(healthdata1$FATALITIES, decreasing=TRUE), ])
## EVTYPE INJURIES FATALITIES SUM
## 11 Excessive Heat 6391 1797 8188
## 37 Tornado 20667 1511 22178
## 12 Flash Flood 1674 887 2561
## 27 Lightning 4140 650 4790
## 32 Rip Current 503 542 1045
## 13 Flood 6770 432 7202
# 6 Most harmful events concerning sum of Injuries & Fatalities
head(healthdata1[order(healthdata1$SUM, decreasing=TRUE), ])
## EVTYPE INJURIES FATALITIES SUM
## 37 Tornado 20667 1511 22178
## 11 Excessive Heat 6391 1797 8188
## 13 Flood 6770 432 7202
## 27 Lightning 4140 650 4790
## 12 Flash Flood 1674 887 2561
## 36 Thunderstorm Wind 1400 130 1530
The most harmful events concerning injuries and fatalities are tornados. However, the most harmful event concerning fatalities is excessive heat.
The following barchart shows the 20 most harmful events concerning public health in decreasing order.
# Plot stacked barchart
ggplot(healthdata4, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Event Type") +
labs(y = "Incidents") +
labs(title = "20 most harmful Events since 01/01/1996", subtitle = "A summary of all injuries and fatalities segmented by Events")
### ECONOMY ANALYSIS
# 6 Most harmful events concerning Crop damage
head(dmgdata4[order(dmgdata4$CROPDMGCAL, decreasing=TRUE), ])
## EVTYPE CROPDMGCAL PROPDMGCAL SUM
## 8 Drought 13367566000 1046101000 14413667000
## 13 Flood 4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon 2607872800 69305840000 71913712800
## 17 Hail 2476279450 14595443420 17071722870
## 12 Flash Flood 1334901700 15222253910 16557155610
## 15 Frost/Freeze 1094186000 10480000 1104666000
# 6 Most harmful events concerning Property damage
head(dmgdata4[order(dmgdata4$PROPDMGCAL, decreasing=TRUE), ])
## EVTYPE CROPDMGCAL PROPDMGCAL SUM
## 13 Flood 4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon 2607872800 69305840000 71913712800
## 37 Tornado 283425010 24616905710 24900330720
## 12 Flash Flood 1334901700 15222253910 16557155610
## 17 Hail 2476279450 14595443420 17071722870
## 39 Tropical Storm 677711000 7642475550 8320186550
# 6 Most harmful events concerning sum of crop & property damage
head(dmgdata4[order(dmgdata4$SUM, decreasing=TRUE), ])
## EVTYPE CROPDMGCAL PROPDMGCAL SUM
## 13 Flood 4974778400 143945387550 148920165950
## 23 Hurricane/Typhoon 2607872800 69305840000 71913712800
## 37 Tornado 283425010 24616905710 24900330720
## 17 Hail 2476279450 14595443420 17071722870
## 12 Flash Flood 1334901700 15222253910 16557155610
## 8 Drought 13367566000 1046101000 14413667000
The most harmful events concerning economic damage are floods. However, the most harmful event concerning crop damage is drought.
The following barchart shows the 20 most harmful events concerning economic damage in decreasing order.
# Plot stacked barchart
ggplot(dmgdata6, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Event Type") +
labs(y = "Economic Damage") +
labs(title = "20 Events with highest economic damage since 01/01/1996", subtitle = "A summary of crop damage and property damage segmented by Events")