This is an analysis of severe weather events using the NOAA Storm Database. The goal is to identify the weather events that have the greatest economic consequences and those that are most harmful to population health.
The conclusion is that FLOOD has the greatest economic consequence and TORNADO is the most harmful to population health.
The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
Documentation for the database is also available. It explains how some of the variables are constructed and defined:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
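One way to check this coverage pattern is to count records per year. The sketch below uses a handful of hypothetical dates standing in for the `BGN_DATE` column (they are not taken from the NOAA data), and the name `eventsPerYear` is only illustrative.

```r
# Hypothetical begin dates standing in for the BGN_DATE column
dates <- as.Date(c("1950-04-18", "1996-01-06", "1996-07-12",
                   "2011-05-22", "2011-05-24", "2011-11-28"))

# Extract the year of each record and count records per year
eventsPerYear <- table(format(dates, "%Y"))
eventsPerYear
```

Applied to the full data set, the same count would show how sparse the early years are compared with recent ones.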
First, we load the packages used in this analysis:
# library() stops with an error if a package is missing,
# whereas require() only returns FALSE
library(dplyr)
library(lubridate)
library(ggplot2)
Use the code below to read the raw data:
# This step takes a while, so please be patient...
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = FALSE)
# Store the data in a variable called df, where we will make transformations.
# That way, the raw data remain intact in the data variable if we need them later
df <- data
According to the documentation, the data collected before 1996 are incomplete, so this analysis keeps only the events that occurred in 1996 or later.
# Convert the variable to date format
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")
# Keep only events from 1996 onward
df <- dplyr::filter(df, year(BGN_DATE) >= 1996)
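The date conversion and year filter above can be tried on a few values in base R. This is a minimal sketch that assumes the raw `BGN_DATE` strings carry a trailing time stamp, which the text above does not confirm; `as.Date()` ignores any characters left over after the format has been matched. The sample strings are hypothetical.

```r
# Assumed raw format (hypothetical values): a date followed by a time stamp
raw <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Trailing characters after the matched format are ignored
parsed <- as.Date(raw, "%m/%d/%Y")

# Base-R counterpart of lubridate::year()
years <- as.integer(format(parsed, "%Y"))

# Same condition as the filter above: keep 1996 and later
keep <- years >= 1996
raw[keep]
```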
In this analysis, we use the total number of fatalities plus injuries as the indicator of harm to population health.
# Filter only events where fatalities or injuries happened
dfHealth <- select(df, EVTYPE, FATALITIES, INJURIES) %>% filter(FATALITIES > 0 | INJURIES > 0)
# Create a new variable with the sum of fatalities and injuries
# (mutate() is vectorized, so rowwise() is not needed here)
dfHealth <- dfHealth %>% mutate(TotalValue = FATALITIES + INJURIES)
# Aggregate the data by event type, sum the totals, and keep the 10 most harmful events
dfTopDamageHealth <- dfHealth %>% group_by(EVTYPE) %>% summarise(Total = sum(TotalValue)) %>% arrange(desc(Total)) %>% top_n(10, Total)
# Order the events for plotting
dfTopDamageHealth <- transform(dfTopDamageHealth, EVTYPE = reorder(EVTYPE, Total))
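To see what the aggregation and the `reorder()` call do, here is a small base-R sketch on hypothetical toy records (the numbers are invented for illustration, and `aggregate()` stands in for the `group_by()` and `summarise()` steps).

```r
# Hypothetical toy records (not taken from the NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 0),
  INJURIES   = c(10, 20, 0, 2, 0),
  stringsAsFactors = FALSE
)

# Same indicator as above: fatalities plus injuries per record
toy$TotalValue <- toy$FATALITIES + toy$INJURIES

# Base-R stand-in for group_by() + summarise(): one total per event type
totals <- aggregate(TotalValue ~ EVTYPE, data = toy, FUN = sum)

# reorder() sorts the factor levels by the total, in increasing order
totals$EVTYPE <- reorder(totals$EVTYPE, totals$TotalValue)
levels(totals$EVTYPE)
```

Because the levels end up in increasing order of the total, a bar chart flipped with `coord_flip()` places the largest bar at the top.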
Now that the data are processed, we create a plot of the top 10 events that are most harmful to population health, based on fatalities plus injuries. The plot is shown in the results section.
healthPlot <- ggplot(dfTopDamageHealth) +
geom_bar(aes(x = EVTYPE, y = Total, fill = Total), stat="identity") +
coord_flip() +
theme(axis.text.y = element_text(size=rel(0.8))) +
ggtitle("Most harmful events with respect to population health") +
theme(plot.title = element_text(lineheight=.8, face="bold")) +
ylab("Total harm (fatalities + injuries)") +
xlab("Event Type")
Next, we analyze economic damage. To rank the events, we use the sum of the dollar amounts of property damage and crop damage.
The documentation shows that the variables CROPDMGEXP and PROPDMGEXP express the values in thousands, millions and billions, so we create a function that returns the full dollar amount based on that exponent code.
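# Convert a damage amount to dollars using its exponent code.
# Expects a single value per call (hence the rowwise() call later on).
# Any code other than "K", "M" or "B" yields a multiplier of 0.
calculateTotalValue <- function(x, y){
  multiplier <- 0
  if (y == "K") {
    multiplier <- 1000
  } else if (y == "M") {
    multiplier <- 1000000
  } else if (y == "B") {
    multiplier <- 1000000000
  }
  as.numeric(x * multiplier)
}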
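As an alternative, the same mapping can be written as a lookup table that handles whole columns at once. This is only a sketch: the name `calculateTotalValueVec` is hypothetical, and it keeps the behavior described above, where any code other than "K", "M" or "B" counts as 0.

```r
# Hypothetical vectorized variant (the function name is an assumption)
calculateTotalValueVec <- function(x, y) {
  # Lookup table from exponent code to multiplier
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  multiplier <- unname(multipliers[y])
  # Codes missing from the table (including the empty string) count as 0
  multiplier[is.na(multiplier)] <- 0
  x * multiplier
}

# Illustrative amounts and codes, not taken from the NOAA data
calculateTotalValueVec(c(25, 2.5, 1.2, 10), c("K", "M", "B", ""))
```

A function of this shape could be called inside `mutate()` without `rowwise()`, since it accepts vectors of amounts and codes.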
Next, we process the damage data:
# Filter only events where damages, property or crop, happened
dfEconomy <- select(df, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% filter(PROPDMG > 0 | CROPDMG > 0)
# Creating new variables with the sum of the amounts of damages
dfEconomy <- dfEconomy %>% rowwise() %>%
mutate(
TotalPropValue = calculateTotalValue(PROPDMG, PROPDMGEXP),
TotalCropValue = calculateTotalValue(CROPDMG, CROPDMGEXP),
TotalValue = TotalPropValue + TotalCropValue
)
# Aggregate the data by event type, sum the totals (in millions of dollars), and keep the 10 most costly events
dfTopDamageEconomy <- dfEconomy %>% group_by(EVTYPE) %>% summarise(Total = round((sum(TotalValue)/1000000))) %>% arrange(desc(Total)) %>% top_n(10)
## Warning: Grouping rowwise data frame strips rowwise nature
## Selecting by Total
# Order the events for plotting
dfTopDamageEconomy <- transform(dfTopDamageEconomy, EVTYPE = reorder(EVTYPE, Total))
Now that the data are processed, we create a plot of the top 10 events that have the greatest economic consequences, based on damage cost. The plot is shown in the results section.
economyPlot <- ggplot(dfTopDamageEconomy) +
geom_bar(aes(x = EVTYPE, y = Total, fill = Total), stat="identity") +
coord_flip() +
theme(axis.text.y = element_text(size=rel(0.8))) +
ggtitle("Events with greatest economic consequences") +
theme(plot.title = element_text(lineheight=.8, face="bold")) +
ylab("Total Damage Amount (in millions)") +
xlab("Event Type")
Now that our data are processed, we will plot the results to answer the questions of this analysis.
print(healthPlot)
Based on the plot, we can conclude that TORNADO is the most harmful event to population health, followed by EXCESSIVE HEAT and FLOOD.
print(economyPlot)
Based on the plot, we can conclude that FLOOD has the greatest economic consequence, followed by HURRICANE/TYPHOON and STORM SURGE.