Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from this site.
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Import and process the data:
library(tidyverse)
# Data Processing
# Import Data
raw_data <- read.csv(file = "./repdata_data_StormData.csv.bz2")
# Remove unnecessary features
data <- raw_data %>% select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP, FATALITIES, INJURIES)
data$EVTYPE <- factor(data$EVTYPE)

Replace abbreviations with full word descriptors:
## Standardize abbreviations
data$EVTYPE <- gsub("CSTL", "COASTAL", data$EVTYPE)
data$EVTYPE <- gsub("FLDG", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("HVY", "HEAVY", data$EVTYPE)
data$EVTYPE <- gsub("SML", "SMALL", data$EVTYPE)
data$EVTYPE <- gsub("STRM", "STREAM", data$EVTYPE)
data$EVTYPE <- gsub("TSTMW", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data$EVTYPE)
data$EVTYPE <- gsub("VOG", "FOG", data$EVTYPE)
data$EVTYPE <- gsub("WND", "WIND", data$EVTYPE)

Replace the exponent codes with their respective multipliers for crop and property damage:
# Set CROPDMGEXP and PROPDMGEXP values to multiplier number
## digits 0-9 -> 10
data$CROPDMGEXP <- gsub("[[:digit:]]", "10", data$CROPDMGEXP)
data$PROPDMGEXP <- gsub("[[:digit:]]", "10", data$PROPDMGEXP)
## + -> 1
data$CROPDMGEXP <- gsub("\\+", "1", data$CROPDMGEXP)
data$PROPDMGEXP <- gsub("\\+", "1", data$PROPDMGEXP)
## -,? -> 0
data$CROPDMGEXP <- gsub("[-\\?]", "0", data$CROPDMGEXP)
data$PROPDMGEXP <- gsub("[-\\?]", "0", data$PROPDMGEXP)
## H,h -> 100
## K,k -> 1000
## M,m -> 1000000
## B,b -> 1000000000
### Crop Damage
data$CROPDMGEXP <- gsub("[Hh]", "100", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Kk]", "1000", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Mm]", "1000000", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Bb]", "1000000000", data$CROPDMGEXP)
### Property Damage
data$PROPDMGEXP <- gsub("[Hh]", "100", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Kk]", "1000", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Mm]", "1000000", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Bb]", "1000000000", data$PROPDMGEXP)
## clean exponent data
### empty character as 0
data$CROPDMGEXP[data$CROPDMGEXP == ""] <- 0
data$PROPDMGEXP[data$PROPDMGEXP == ""] <- 0
### convert to numeric class
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
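The chain of gsub() calls above amounts to a lookup from exponent code to multiplier. As a minimal sketch, the same mapping could be written as a single named vector. The helper name exp_to_multiplier and the toy input are hypothetical and not part of the original analysis; the conventions (any digit -> 10, "+" -> 1, "-", "?" and empty -> 0) are taken from the steps above.

```r
# Hypothetical helper: map damage exponent codes to numeric multipliers.
# Mirrors the conventions used above; codes not listed here become NA.
exp_to_multiplier <- function(code) {
  lookup <- c(
    "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
    "+" = 1, "-" = 0, "?" = 0
  )
  code <- toupper(code)                    # treat "k" and "K" alike
  out <- unname(lookup[code])              # NA for anything not in the table
  out[grepl("^[0-9]$", code)] <- 10        # any single digit -> 10
  out[which(code == "")] <- 0              # empty code -> 0
  out
}

# Toy input, not taken from the database
codes <- c("K", "m", "", "5", "+", "?", "B")
multipliers <- exp_to_multiplier(codes)
multipliers
```

One possible advantage of a lookup table is that an unexpected code surfaces as NA instead of passing through as text and failing later in as.numeric().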
## calculate dollar value = damage figure * multiplier
data <- data %>% mutate(`Crop Damage` = CROPDMG * CROPDMGEXP)
data <- data %>% mutate(`Property Damage` = PROPDMG * PROPDMGEXP)

Using the processed data from above, we analyse the effect of each event type on health outcomes (injuries and fatalities) and economic consequences (crop and property damage).
Total count of each event between 1950 and 2011:
library(tidyverse)
data %>%
select(`Event Type` = EVTYPE) %>%
group_by(`Event Type`) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
filter(Count >= 10) # Filter for events that have been registered at least 10 times

Total count of fatalities and injuries for each event between 1950 and 2011:
data %>%
select(`Event Type` = EVTYPE, Fatalities = FATALITIES, Injuries = INJURIES) %>%
group_by(`Event Type`) %>%
summarise(`Total Fatalities` = sum(Fatalities),
`Total Injuries` = sum(Injuries)) %>% # Calculate total number of injuries and fatalities for each event type
arrange(desc(`Total Fatalities`), desc(`Total Injuries`)) %>% # arrange rows by decreasing number of total fatalities and injuries
filter(`Total Fatalities` >= 5 & `Total Injuries` >= 5 ) # Remove entries with fewer than 5 fatalities or fewer than 5 injuries

library(RColorBrewer)
health_data <- data %>%
select(`Event Type` = EVTYPE, Fatalities = FATALITIES, Injuries = INJURIES) %>%
group_by(`Event Type`) %>%
summarise(`Total Fatalities` = sum(Fatalities),
`Total Injuries` = sum(Injuries)) %>% # Calculate total number of injuries and fatalities for each event type
arrange(desc(`Total Fatalities`), desc(`Total Injuries`)) %>% # arrange rows by decreasing number of total fatalities and injuries
filter(`Total Fatalities` >= 5 & `Total Injuries` >= 5 ) %>% # Remove entries with fewer than 5 fatalities or fewer than 5 injuries
.[1:20, ] %>%
gather(key = "Parameter", value = "Value", -`Event Type`)
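The gather() call reshapes the two total columns into key/value pairs so that ggplot2 can stack them in one bar per event type. The tidyr documentation lists gather() as superseded by pivot_longer(); a minimal sketch of the equivalent reshape follows, on a toy table whose values are invented for illustration.

```r
library(tidyr)

# Toy summary table; the numbers are made up for illustration only
toy <- data.frame(
  `Event Type` = c("TORNADO", "FLOOD"),
  `Total Fatalities` = c(10, 4),
  `Total Injuries` = c(50, 7),
  check.names = FALSE  # keep the column names with spaces as written
)

# Same reshape as gather(key = "Parameter", value = "Value", -`Event Type`)
toy_long <- pivot_longer(
  toy,
  cols = -`Event Type`,
  names_to = "Parameter",
  values_to = "Value"
)
toy_long
```

The row order differs between the two functions (gather() stacks column by column, pivot_longer() goes row by row), which should not matter for a stacked bar chart.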
health_data %>%
ggplot(mapping = aes(x = reorder(`Event Type`, Value), y = Value, fill = Parameter)) +
geom_bar(stat="identity", position="stack") +
labs(title = "Total Fatalities and Injuries by Event type",
subtitle = "U.S. National Oceanic and Atmospheric Administration's (NOAA),\n1950-2011",
caption = "Top 20 of event types.",
y = NULL) +
theme_minimal() +
theme(axis.title.y = element_blank()) +
coord_flip()

data %>%
select(`Event Type` = EVTYPE, Fatalities = FATALITIES, Injuries = INJURIES) %>%
group_by(`Event Type`) %>%
summarise(`Total Fatalities` = sum(Fatalities),
`Total Injuries` = sum(Injuries)) %>% # Calculate total number of injuries and fatalities for each event type
arrange(desc(`Total Fatalities`), desc(`Total Injuries`)) %>% # arrange rows by decreasing number of total fatalities and injuries
filter(`Total Fatalities` >= 5 & `Total Injuries` >= 5 ) %>% # Remove entries with fewer than 5 fatalities or fewer than 5 injuries
# gather(key = "Parameter", value = "Value", -`Event Type`) %>%
ggplot(mapping = aes(x = `Event Type`, y = 1, fill = `Total Fatalities`)) +
geom_tile() +
labs(title = "Total Fatalities by Event type",
subtitle = "U.S. National Oceanic and Atmospheric Administration's (NOAA),\n1950-2011",
y = NULL) +
coord_flip() +
theme_minimal() +
theme(aspect.ratio = 3, axis.ticks.x = element_blank(), axis.text.x = element_blank())

library(tidyverse)
damage_data <- data %>%
select(`Event Type` = EVTYPE, `Crop Damage`, `Property Damage`) %>%
group_by(`Event Type`) %>%
summarise(`Total Crop Damage` = sum(`Crop Damage`),
`Total Property Damage` = sum(`Property Damage`)) %>% # Calculate totals for each event type
arrange(desc(`Total Property Damage`), desc(`Total Crop Damage`)) %>% # Arrange rows by decreasing values
filter(`Total Property Damage` > 0 & `Total Crop Damage` > 0 ) %>% # Keep event types with both property and crop damage above zero
.[1:20, ] %>%
gather(key = "Parameter", value = "Value", -`Event Type`)
damage_data %>%
ggplot(mapping = aes(x = reorder(`Event Type`, Value), y = Value, fill = Parameter)) +
geom_bar(stat="identity", position="stack") +
labs(title = "Total Damage by Event type",
subtitle = "U.S. National Oceanic and Atmospheric Administration's (NOAA),\n1950-2011",
caption = "Top 20 of event types.",
y = NULL) +
theme_minimal() +
theme(axis.title.y = element_blank()) +
coord_flip()

Summary of total damages by type of event:
data %>%
select(`Event Type` = EVTYPE, `Crop Damage`, `Property Damage`) %>%
group_by(`Event Type`) %>%
summarise(`Total Crop Damage` = sum(`Crop Damage`),
`Total Property Damage` = sum(`Property Damage`)) %>% # Calculate totals for each event type
arrange(desc(`Total Property Damage`), desc(`Total Crop Damage`)) %>% # Arrange rows by decreasing values
filter(`Total Property Damage` > 0 & `Total Crop Damage` > 0 ) %>% # Keep event types with both property and crop damage above zero
.[1:30, ] %>%
mutate(Total = `Total Crop Damage` + `Total Property Damage`) %>%
arrange(desc(Total))
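The summary above takes the first 30 rows by property damage and only then ranks them by combined damage. As a sketch of one alternative, the combined total could be computed first and the top rows picked with dplyr's slice_max(). The toy table and its values are invented for illustration and are not results from the database.

```r
library(dplyr)

# Toy totals; values invented for illustration only
toy_damage <- data.frame(
  `Event Type` = c("FLOOD", "HAIL", "TORNADO", "DROUGHT"),
  `Total Crop Damage` = c(5, 3, 1, 9),
  `Total Property Damage` = c(100, 10, 50, 1),
  check.names = FALSE  # keep the column names with spaces as written
)

top_damage <- toy_damage %>%
  mutate(Total = `Total Crop Damage` + `Total Property Damage`) %>%
  slice_max(Total, n = 3, with_ties = FALSE)  # three largest totals, sorted descending
top_damage
```

Unlike indexing with .[1:30, ], slice_max() does not pad the result with NA rows when fewer rows than requested are available.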