Synopsis

This is an analysis of severe weather events using the NOAA Storm Database. The goal is to identify which weather events have the greatest economic consequences and which are most harmful to population health.

The conclusion is that FLOOD has the greatest economic consequence and TORNADO is the most harmful to population health.

About the data

The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

Documentation for the database is also available. It describes how some of the variables are constructed and defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

First, we load the packages used in this analysis:

library(dplyr)
library(lubridate)
library(ggplot2)

Use the code below to read the raw data:

# This step takes a while, so be patient...
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = FALSE)
# Store the data in a variable called df, where we will make transformations.
# That way, the raw data stay intact in the data variable if we need them later
df <- data
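
The read step can be tried on a small scale before loading the full file. The following is a minimal sketch that writes a few made-up rows (not the storm data) to a temporary bzip2-compressed CSV and reads them back through bzfile(); the toy columns and the temporary file are assumptions for illustration only.

```r
# Hypothetical toy data standing in for the storm file
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO"),
                  FATALITIES = c(1, 3),
                  stringsAsFactors = FALSE)

# Write a bzip2-compressed CSV to a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection, as in the analysis
roundTrip <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
```

If the round trip works, roundTrip should hold the same two rows as toy.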

According to the documentation, data collected before 1996 are largely incomplete, so this analysis keeps only events that occurred in 1996 or later.

# Convert the begin date to Date format
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")

# Keep only events from 1996 onward
df <- dplyr::filter(df, year(BGN_DATE) >= 1996)
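
To see what the conversion and the year filter do, here is a small sketch on a few made-up values. It assumes the raw BGN_DATE strings carry a month/day/year date followed by a time component; the sample values are illustrative, not taken from the database. Base R is used to extract the year so the snippet runs without extra packages.

```r
# Made-up raw values in the assumed BGN_DATE style (date followed by a time)
raw <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# as.Date() parses the leading month/day/year and ignores the trailing time
dates <- as.Date(raw, "%m/%d/%Y")

# Extract the year with base R
years <- as.integer(format(dates, "%Y"))

# Keep only dates from 1996 onward
kept <- dates[years >= 1996]
```

Only the 1996 and 2011 dates remain in kept; the 1950 value is dropped.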

Population health

In this analysis, we use the total number of fatalities plus injuries as the indicator of harm to population health.

# Keep only events with at least one fatality or injury
dfHealth <- select(df, EVTYPE, FATALITIES, INJURIES) %>% filter(FATALITIES > 0 | INJURIES > 0)

# Create a new variable with the sum of Fatalities and Injuries
dfHealth <- dfHealth %>% rowwise() %>% mutate(TotalValue = FATALITIES + INJURIES)

# Aggregate the data by event type, sum the totals and keep the 10 most harmful events
dfTopDamageHealth <- dfHealth %>% group_by(EVTYPE) %>% summarise(Total = sum(TotalValue)) %>% arrange(desc(Total)) %>% top_n(10)
## Warning: Grouping rowwise data frame strips rowwise nature
## Selecting by Total
# Order the events for plotting
dfTopDamageHealth <- transform(dfTopDamageHealth, EVTYPE = reorder(EVTYPE, Total))
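
As a possible alternative, the same aggregation can be sketched without rowwise(), since addition is already vectorised, which would also avoid the warning shown above. The toy data frame below is made up for illustration, and slice_max() (available in dplyr 1.0.0 and later) is swapped in for top_n(); neither is part of the original analysis.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
    FATALITIES = c(2, 1, 0, 4),
    INJURIES   = c(10, 0, 5, 1),
    stringsAsFactors = FALSE
)

# Addition is vectorised, so mutate() works column-wise without rowwise()
toyTop <- toy %>%
    mutate(TotalValue = FATALITIES + INJURIES) %>%
    group_by(EVTYPE) %>%
    summarise(Total = sum(TotalValue)) %>%
    slice_max(Total, n = 2)   # the two largest totals, ties included
```

On this toy input, TORNADO ranks first and HEAT second.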

Now that the data are processed, we create a plot of the top 10 events most harmful to population health, based on fatalities plus injuries. The plot is shown in the Results section.

healthPlot <- ggplot(dfTopDamageHealth) + 
    geom_bar(aes(x = EVTYPE, y = Total, fill = Total), stat="identity") +
    coord_flip() +
    theme(axis.text.y = element_text(size=rel(0.8))) +
    ggtitle("Most harmful events with respect to population health") +
    theme(plot.title = element_text(lineheight=.8, face="bold")) + 
    ylab("Total harm (fatalities + injuries)") +
    xlab("Event Type")

Economic consequences

Next, we analyze economic damage. To rank the events, we use the combined dollar value of property damage and crop damage.

The documentation shows that the variables CROPDMGEXP and PROPDMGEXP indicate whether the values are expressed in thousands (K), millions (M) or billions (B), so we create a function that returns the full dollar amount based on that code.

calculateTotalValue <- function(x, y){
    # Map the exponent code to a multiplier; any other code leaves it at 0
    multiplier <- 0

    if (y == "K") {
        multiplier <- 1000
    } else if (y == "M") {
        multiplier <- 1000000
    } else if (y == "B") {
        multiplier <- 1000000000
    }
    as.numeric(x * multiplier)
}
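
For comparison, the same mapping can be sketched as a vectorised lookup on a named vector, which would work on whole columns at once. The helper name damageInDollars and the sample inputs are hypothetical; like the function above, it treats any code other than K, M or B as a multiplier of 0.

```r
# Lookup table for the exponent codes K, M and B
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorised helper: codes outside K/M/B get a multiplier of 0
damageInDollars <- function(amount, exponent) {
    m <- unname(multipliers[exponent])  # NA for any unknown or empty code
    m[is.na(m)] <- 0
    amount * m
}

# Made-up amounts and codes for illustration
example <- damageInDollars(c(25, 2.5, 1, 10), c("K", "M", "B", ""))
```

Here 25 with code K becomes 25,000 dollars, 2.5 with code M becomes 2.5 million, and the entry with an empty code contributes 0.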

Now we process the damage data:

# Keep only events with property or crop damage
dfEconomy <- select(df, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% filter(PROPDMG > 0 | CROPDMG > 0)

# Creating new variables with the sum of the amounts of damages
dfEconomy <- dfEconomy %>% rowwise() %>% 
            mutate(
                TotalPropValue = calculateTotalValue(PROPDMG, PROPDMGEXP),
                TotalCropValue = calculateTotalValue(CROPDMG, CROPDMGEXP),
                TotalValue = TotalPropValue + TotalCropValue
            )

# Aggregate the data by event type, sum the totals (in millions) and keep the 10 most costly events
dfTopDamageEconomy <- dfEconomy %>% group_by(EVTYPE) %>% summarise(Total = round((sum(TotalValue)/1000000))) %>% arrange(desc(Total)) %>% top_n(10)
## Warning: Grouping rowwise data frame strips rowwise nature
## Selecting by Total
# Order the events for plotting
dfTopDamageEconomy <- transform(dfTopDamageEconomy, EVTYPE = reorder(EVTYPE, Total))

Now that the data are processed, we create a plot of the top 10 events with the greatest economic consequences, based on damage cost. The plot is shown in the Results section.

economyPlot <- ggplot(dfTopDamageEconomy) + 
    geom_bar(aes(x = EVTYPE, y = Total, fill = Total), stat="identity") +
    coord_flip() +
    theme(axis.text.y = element_text(size=rel(0.8))) +
    ggtitle("Events with greatest economic consequences") +
    theme(plot.title = element_text(lineheight=.8, face="bold")) + 
    ylab("Total Damage Amount (in millions)") +
    xlab("Event Type")

Results

Now that our data are processed, we will plot the results to answer the questions of this analysis.

Which types of events are most harmful with respect to population health?

print(healthPlot)

Based on the plot, we can conclude that TORNADO is the most harmful event to population health, followed by EXCESSIVE HEAT and FLOOD.

Which types of events have the greatest economic consequences?

print(economyPlot)

Based on the plot, we can conclude that FLOOD has the greatest economic consequence, followed by HURRICANE/TYPHOON and STORM SURGE.