Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

In this analysis, we determine which event type is the most economically damaging in terms of property and crop damage, and which is the most harmful to population health in terms of injuries and fatalities. We also use barplots to show the ten event types responsible for the most fatalities, injuries, crop damage, and property damage.

Information about the dataset

Finding the dataset on the web

The data for this assignment come as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site; the URL is given in the download code under Data Processing.

Some documentation of the database is also available, describing how some of the variables are constructed and defined.
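As a side note on the compressed format described above, base R can usually read a bzip2-compressed CSV directly from its path. The sketch below uses a small made-up data frame and a temporary file (both are assumptions for illustration, not part of the NOAA data) to show the round trip.

```r
# Write a small, made-up CSV through a bzip2-compressed connection
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() is given the compressed file's path; no manual decompression step
toy_back <- read.csv(tmp)
print(toy_back)
```

This is why the analysis can pass the downloaded .bz2 file straight to read.csv().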

Variables of interest in the dataset

Not all of the variables available in the dataset are of interest to us for answering the required questions.

  • For measuring harm to population health, only the variables representing the number of injuries and fatalities (INJURIES and FATALITIES, respectively) will be considered.
  • For measuring economic damage, the variables representing property damage (PROPDMG and PROPDMGEXP) and crop damage (CROPDMG and CROPDMGEXP) will be considered.

Loading required libraries

Before we begin, we need to install (if required) and load the necessary libraries that will be used in this analysis.

# Installing and loading required packages
if(!require(dplyr))
{
    install.packages("dplyr")
    library(dplyr)
}
if(!require(ggplot2))
{
    install.packages("ggplot2")
    library(ggplot2)
}
if(!require(gridExtra))
{
    install.packages("gridExtra")
    library(gridExtra)
}
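A possible alternative to the install-and-load pattern above is to first check which packages are missing without attaching anything. The helper name missing_packages below is hypothetical; it relies only on base R's requireNamespace().

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

# Packages used in this analysis
needed <- c("dplyr", "ggplot2", "gridExtra")
to_install <- missing_packages(needed)

# Report anything missing instead of installing it silently
if (length(to_install) > 0) {
    message("Not installed: ", paste(to_install, collapse = ", "))
}
```

Reporting rather than installing leaves the decision to the reader, which may be preferable on shared machines.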

Data Processing

Now, we will download and read the raw data.

fileURL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dest = "repdata_data_StormData.csv.bz2"
if(!file.exists(dest))
    download.file(fileURL,dest)
storm = read.csv(dest)

The following chunks of code will process the data so that it can be used for analysis.

Select only the variables of interest as described earlier.

stormData = select(storm, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

1. Which types of events are most harmful to population health?

To address this question, we need to focus on the FATALITIES and INJURIES variables as described earlier. We will group the data by event type and calculate the total number of fatalities and injuries caused by each type.

fatalities = stormData %>% 
    group_by(EVTYPE) %>% 
    summarize(total_fat = sum(FATALITIES)) %>%
    arrange(desc(total_fat))

injuries = stormData %>% 
    group_by(EVTYPE) %>% 
    summarize(total_inj = sum(INJURIES)) %>%
    arrange(desc(total_inj))
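To make the group-and-sum step concrete, here is a minimal base-R sketch on made-up records (the toy data frame is an assumption for illustration, not taken from the NOAA file). It mirrors the group_by, summarize, and arrange steps used above.

```r
# Toy data: hypothetical event records
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
    FATALITIES = c(3, 1, 2, 4, 0)
)

# Sum fatalities within each event type (base-R counterpart of group_by + summarize)
toy_totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in decreasing order of the total, like arrange(desc(...))
toy_totals <- toy_totals[order(-toy_totals$FATALITIES), ]
print(toy_totals)
```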

We need to re-level the event type factor (EVTYPE) according to the number of fatalities and injuries. Without this step, the bars would not be drawn in order of the number of fatalities and injuries.

# Releveling the factors according to number of fatalities and injuries
fatalities$EVTYPE = factor(fatalities$EVTYPE, levels = fatalities$EVTYPE)
injuries$EVTYPE = factor(injuries$EVTYPE, levels = injuries$EVTYPE)
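The effect of re-leveling can be seen on a small made-up table (the totals data frame below is hypothetical). By default, factor() sorts its levels alphabetically; passing levels explicitly keeps the ranking.

```r
# Hypothetical totals, already sorted in decreasing order
totals <- data.frame(
    EVTYPE = c("TORNADO", "HEAT", "FLOOD"),
    total  = c(5, 4, 1)
)

# Default factor(): levels are sorted alphabetically, losing the ranking
alphabetical <- factor(totals$EVTYPE)
print(levels(alphabetical))

# Explicit levels: the level order follows the row order (i.e. the ranking)
ranked <- factor(totals$EVTYPE, levels = totals$EVTYPE)
print(levels(ranked))

# Reversed levels: the largest total becomes the last level
ranked_rev <- factor(totals$EVTYPE, levels = rev(totals$EVTYPE))
print(levels(ranked_rev))
```

With coord_flip(), ggplot2 draws the first level at the bottom of the axis, so the reversed variant is an option if the largest bar should appear at the top.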

The following multipanel barplot shows the ten event types that caused the most fatalities and the most injuries.

plot1 = ggplot(data = fatalities[1:10,], aes(y = total_fat, x = EVTYPE)) + 
    geom_bar(stat="identity") + xlab("Event") + ylab("Total number of fatalities") +
    ggtitle("Events most harmful to population health by fatalities") + coord_flip()

plot2 = ggplot(data = injuries[1:10,], aes(y = total_inj, x = EVTYPE)) + 
    geom_bar(stat="identity") + xlab("Event") + ylab("Total number of injuries") +
    ggtitle("Events most harmful to population health by injuries") + coord_flip()

grid.arrange(plot1, plot2, nrow=2)

From the above plot, we can clearly observe that tornadoes are by far the most harmful to population health.

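# Share of the top-ranked event (first row) in the overall total, in percent
pfat_tornado = round(fatalities$total_fat[1] / sum(fatalities$total_fat) * 100, 2)
pinj_tornado = round(injuries$total_inj[1] / sum(injuries$total_inj) * 100, 2)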

In fact, tornadoes are responsible for 37.19% of total fatalities and 65% of total injuries.
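The percentage calculation can be checked on made-up numbers. The helper name share_pct and the toy totals below are assumptions for illustration only.

```r
# Hypothetical helper: each element's share of the overall total, in percent
share_pct <- function(x, digits = 2) {
    round(x / sum(x) * 100, digits)
}

# Made-up totals per event type, sorted in decreasing order
toy_totals <- c(TORNADO = 5, HEAT = 4, FLOOD = 1)

shares <- share_pct(toy_totals)
print(shares)
```

The first element of the result corresponds to the top-ranked event, which is the quantity reported for tornadoes above.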

2. Which types of events have the greatest economic consequences?

This time we need to focus on the four damage variables described earlier: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. PROPDMGEXP and CROPDMGEXP denote the power of 10 (think “exponent” for “EXP”) by which PROPDMG and CROPDMG, respectively, should be multiplied to obtain the actual damage in USD. Besides digits, the exponent columns contain symbols with the following meanings:

  • H or h = multiply by 10^2 (hundred)
  • K or k = multiply by 10^3 (thousand)
  • M or m = multiply by 10^6 (million)
  • B or b = multiply by 10^9 (billion)
  • garbage symbols (+, -, ?) and blank entries = treated as exponent 0 (multiply by 1)

The final calculated value of crop and property damage will be stored in the newly created variables “crop_dmg” and “prop_dmg” respectively.
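One way to sketch the mapping in the list above is with a named lookup vector. The function name exp_to_power and the toy values are hypothetical; treating digits as direct exponents and everything unrecognized as exponent 0 is an assumption chosen to match the rules listed above.

```r
# Hypothetical helper: map an EXP code to a power of ten
exp_to_power <- function(code) {
    code <- toupper(as.character(code))
    lookup <- c(H = 2, K = 3, M = 6, B = 9)
    power <- unname(lookup[code])            # NA for codes that are not in the table
    digit <- grepl("^[0-9]$", code)
    power[digit] <- as.numeric(code[digit])  # digits are used as the exponent directly
    power[is.na(power)] <- 0                 # blank and symbol codes: exponent 0
    power
}

# Made-up damage figures and exponent codes
toy <- data.frame(
    DMG = c(25, 2.5, 1, 10, 3),
    EXP = c("K", "m", "B", "", "?")
)
toy$damage_usd <- toy$DMG * 10^exp_to_power(toy$EXP)
print(toy)
```

A lookup keeps the code-to-exponent table in one place, which can be easier to audit than a sequence of pattern replacements.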

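# Exponent that replaces each coded symbol; the hyphen is listed first in the
# last pattern so that it is matched literally rather than read as a range
actual_value = c("2","3","6","9","0")
coded_value = c("[Hh]","[Kk]","[Mm]","[Bb]","[-+?]")
for(i in seq_along(actual_value))
{
    stormData$CROPDMGEXP = gsub(coded_value[i], actual_value[i], stormData$CROPDMGEXP)
    stormData$PROPDMGEXP = gsub(coded_value[i], actual_value[i], stormData$PROPDMGEXP)
}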
stormData$CROPDMGEXP[stormData$CROPDMGEXP == ""] = "0"
stormData$PROPDMGEXP[stormData$PROPDMGEXP == ""] = "0"
stormData$CROPDMGEXP = as.numeric(stormData$CROPDMGEXP)
stormData$PROPDMGEXP = as.numeric(stormData$PROPDMGEXP)

stormData = stormData %>%
    mutate(crop_dmg = CROPDMG * 10^CROPDMGEXP, prop_dmg = PROPDMG * 10^PROPDMGEXP)

Now that we have calculated the value of the damages in USD, we group the data by event type and calculate the total crop and property damage caused by each type.

crop = stormData %>%
    group_by(EVTYPE) %>%
    summarize(crop_dmg = sum(crop_dmg)) %>%
    arrange(desc(crop_dmg))

prop = stormData %>%
    group_by(EVTYPE) %>%
    summarize(prop_dmg = sum(prop_dmg)) %>%
    arrange(desc(prop_dmg))

We have to re-level the factors again to help with the barplots.

# Releveling the factors according to crop and property damages
crop$EVTYPE = factor(crop$EVTYPE, levels = crop$EVTYPE)
prop$EVTYPE = factor(prop$EVTYPE, levels = prop$EVTYPE)

The following multipanel barplot shows the ten events that caused the most crop and property damage.

plot3 = ggplot(data = crop[1:10,], aes(y = crop_dmg, x = EVTYPE)) + 
    geom_bar(stat="identity") + xlab("Event") + ylab("Total crop damage in USD") +
    ggtitle("Most economically damaging events by crop damage") + coord_flip()

plot4 = ggplot(data = prop[1:10,], aes(y = prop_dmg, x = EVTYPE)) + 
    geom_bar(stat="identity") + xlab("Event") + ylab("Total property damage in USD") +
    ggtitle("Most economically damaging events by property damage") + coord_flip()

grid.arrange(plot3, plot4, nrow=2)

Finally, we calculate the total damage (sum of crop and property damage) caused by each event type and plot the top ten.

total_dmg = stormData %>%
    group_by(EVTYPE) %>%
    summarize(damage = sum(prop_dmg) + sum(crop_dmg)) %>%
    arrange(desc(damage))

# Releveling the factors according to total damage
total_dmg$EVTYPE = factor(total_dmg$EVTYPE, levels = total_dmg$EVTYPE)

ggplot(data = total_dmg[1:10,], aes(y = damage, x = EVTYPE)) + 
    geom_bar(stat="identity") + xlab("Event") + ylab("Total damages in USD") +
    ggtitle("Most economically damaging events by total damage") + coord_flip()

From the above plot, we can see that floods are by far the most economically damaging event.

# Share of the top-ranked event (first row) in the overall damage, in percent
percentage_flood = round(total_dmg$damage[1] / sum(total_dmg$damage) * 100, 2)

In fact, floods are responsible for 31.49% of total damages.

Results

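From the barplots and analysis presented here, we can conclude that:

  • Tornadoes are by far the most harmful events to population health, accounting for 37.19% of total fatalities and 65% of total injuries.
  • Floods are by far the most economically damaging events, accounting for 31.49% of total damages.

The top ten most damaging events with respect to total fatalities, injuries, and total property and crop damage can be seen in the barplots drawn in this analysis.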