Synopsis

In this analysis we aim to identify which types of events are most harmful to population health and to the economy across the United States. To investigate this, we explored the NOAA Storm Database. We found that tornadoes are the most harmful events for population health, in terms of both injuries and fatalities. With respect to economic consequences, floods rank first.

Data Processing

Dataset

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be downloaded from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

There is also some documentation of the database available:

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Loading the data

First, let’s download the required file:

file_name <- "repdata_data_StormData.csv.bz2"
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download only once; mode = "wb" keeps the binary archive intact on Windows
if (!file.exists(file_name)) {
    download.file(file_url, destfile = file_name, mode = "wb")
}

We have obtained a file in “bz2” format. According to the instructions, it’s a compressed csv file where:

  • separator character is “,”
  • decimal character is “.”
  • missing values are written as empty strings (“”)

With that information, we can load the csv file into a “df_storm_data” data frame. According to the R documentation, read.table can read directly from a bz2-compressed file, so we don’t need to decompress it first. We keep the headers and read strings as factors.

Note: it’s a big file and reading it takes quite a long time. We can’t afford to read it every time we process the document. On the other hand, the file is not going to change, so we can cache the result.

df_storm_data <- read.table(file = file_name, 
                            header = TRUE, 
                            sep = ",", 
                            dec = ".",
                            na.strings = "",
                            stringsAsFactors = TRUE)
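
As a quick sanity check of the loading step, the sketch below writes a tiny bzip2-compressed CSV and reads it back with the same read.table arguments. It is only an illustration on toy data: the temporary file, the three columns and the two rows are made up for the example and are not taken from the storm database.

```r
# Toy check (not the real dataset): write a tiny bzip2-compressed CSV,
# then read it back with the same read.table arguments used above.
tmp_file <- tempfile(fileext = ".csv.bz2")  # hypothetical temporary file
con <- bzfile(tmp_file, open = "wt")
writeLines(c("EVTYPE,INJURIES,PROPDMGEXP",
             "TORNADO,3,K",
             "flood,0,"), con)
close(con)

# read.table opens the compressed file directly; no manual decompression
toy <- read.table(file = tmp_file,
                  header = TRUE,
                  sep = ",",
                  dec = ".",
                  na.strings = "",
                  stringsAsFactors = TRUE)
unlink(tmp_file)

str(toy)
```

The empty last field of the second row should come back as NA, which is the effect of na.strings = "".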

Let’s check the dimension and field names:

dim(df_storm_data)
## [1] 902297     37
names(df_storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Cleaning the data

Using the dplyr package, we are going to do the following:

  • We’ll keep only 7 fields of the df_storm_data data frame, as we need no more:
    • EVTYPE: type of event
    • INJURIES: number of injuries
    • FATALITIES: number of fatalities
    • PROPDMG: monetary amount of damage on properties
    • PROPDMGEXP: measure units for PROPDMG
    • CROPDMG: monetary amount of damage on crops
    • CROPDMGEXP: measure units for CROPDMG
  • We’ll make a new field PROPERTIES_DAMAGE using the amount in PROPDMG and the character in PROPDMGEXP: h or H for hundreds of dollars, k or K for thousands, m or M for millions, b or B for billions. Any other character will be treated as an indicator of “zero dollars”.
  • We’ll make a new field CROPS_DAMAGE using the amount in CROPDMG and the character in CROPDMGEXP, with the same rules: h or H for hundreds of dollars, k or K for thousands, m or M for millions, b or B for billions. Any other character will be treated as an indicator of “zero dollars”.
  • We’ll convert EVTYPE to upper case, so that event types differing only in case are not counted separately, and rename it to something more descriptive: EVENT_TYPE.
  • Finally, we’ll make a new “df_mydata” data frame with the fields: EVENT_TYPE, PROPERTIES_DAMAGE, CROPS_DAMAGE, INJURIES, FATALITIES.

if (!require("dplyr")) {
  install.packages("dplyr")
  require("dplyr")
}
df_mydata <- df_storm_data %>% select(EVTYPE,
                                      INJURIES, FATALITIES,
                                      PROPDMG, PROPDMGEXP,
                                      CROPDMG, CROPDMGEXP) %>%
    mutate(PROPERTIES_DAMAGE = 
               ifelse(
                   PROPDMGEXP %in% c("h","H"), 
                   PROPDMG * 10^2, 
                   ifelse(
                       PROPDMGEXP %in% c("k","K"), 
                       PROPDMG * 10^3,
                       ifelse(
                           PROPDMGEXP %in% c("m","M"),
                           PROPDMG * 10^6,
                           ifelse(
                               PROPDMGEXP %in% c("b","B"),
                               PROPDMG * 10^9,
                               0
                           ))))) %>% 
    mutate(CROPS_DAMAGE = 
               ifelse(
                   CROPDMGEXP %in% c("h","H"), 
                   CROPDMG * 10^2, 
                   ifelse(
                       CROPDMGEXP %in% c("k","K"), 
                       CROPDMG * 10^3,
                       ifelse(
                           CROPDMGEXP %in% c("m","M"),
                           CROPDMG * 10^6,
                           ifelse(
                               CROPDMGEXP %in% c("b","B"),
                               CROPDMG * 10^9,
                               0
                           ))))) %>%
    mutate(EVENT_TYPE = toupper(EVTYPE)) %>%
    select(EVENT_TYPE, PROPERTIES_DAMAGE, CROPS_DAMAGE,
           INJURIES, FATALITIES)
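
The unit conversion can also be written as a lookup table, which some readers may find easier to check than nested ifelse calls. The sketch below is an alternative illustration in base R, not the code used in this analysis: damage_in_dollars is a hypothetical helper name, and it is intended to give the same result as the rules listed above for non-missing amounts.

```r
# Hypothetical helper (not part of the original analysis):
# convert an amount and its unit character into dollars.
damage_in_dollars <- function(amount, exp_code) {
  # Multipliers for the recognised unit characters (case-insensitive)
  multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Look up each code by name; unknown, empty or missing codes give NA
  mult <- multipliers[toupper(as.character(exp_code))]
  # Any other character is treated as "zero dollars"
  mult[is.na(mult)] <- 0
  amount * unname(mult)
}

# Toy values, made up for the example
example_damage <- damage_in_dollars(c(25, 2.5, 10, 1, 5),
                                    c("K", "m", "B", "h", "+"))
print(example_damage)
```

Because the function accepts factors as well as character vectors, it could be applied to PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP alike.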

Results

Question 1: Across the United States, which types of events are most harmful with respect to population health?

We’ll look at the two kinds of harm, injuries and fatalities, separately.

Injuries

To address the injury risk, we’ll make a “df_injuries” data frame as follows:

  • We’ll summarise injuries by event type.
  • We’ll arrange the data frame by total injuries, from greatest to least.

df_injuries <- df_mydata %>% 
    select(EVENT_TYPE, INJURIES) %>%
    group_by(EVENT_TYPE) %>%
    summarise(INJURIES = sum(INJURIES)) %>%
    arrange(-INJURIES)
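
To see why the upper-casing step matters for this kind of summary, here is a small sketch on made-up data. It uses base R (aggregate and order) instead of dplyr so that it runs without extra packages; the toy data frame and its values are assumptions for illustration only.

```r
# Toy data (invented): the same event type appears with different casing
toy <- data.frame(EVTYPE = c("Tornado", "TORNADO", "Flood", "FLOOD", "Heat"),
                  INJURIES = c(10, 5, 2, 1, 0),
                  stringsAsFactors = FALSE)

# Normalise the event type, then sum injuries per type
toy$EVENT_TYPE <- toupper(toy$EVTYPE)
totals <- aggregate(INJURIES ~ EVENT_TYPE, data = toy, FUN = sum)

# Sort from greatest to least, as in the analysis
totals <- totals[order(-totals$INJURIES), ]
print(totals)
```

Without toupper, "Tornado" and "TORNADO" would be reported as two separate event types with smaller totals each.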

Let’s plot it.

The ggplot2 graphics package will be needed:

if (!require("ggplot2")) {
  install.packages("ggplot2")
  require("ggplot2")
}
ggplot(data = df_injuries[1:10,], 
       aes(x = reorder(EVENT_TYPE, -INJURIES), 
           y = INJURIES)) +
    geom_bar(stat = "identity", fill = "green") + 
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    labs(title = "Top 10 injury causing events", 
         x = "Event type",
         y = "Number of injuries") 

Tornadoes are clearly the main injury-causing event, followed by thunderstorm winds (TSTM WIND), floods and excessive heat.

Fatalities

To address the death risk, we’ll make a “df_fatalities” data frame as follows:

  • We’ll summarise fatalities by event type.
  • We’ll arrange the data frame by total fatalities, from greatest to least.

df_fatalities <- df_mydata %>% 
    select(EVENT_TYPE, FATALITIES) %>%
    group_by(EVENT_TYPE) %>%
    summarise(FATALITIES = sum(FATALITIES)) %>%
    arrange(-FATALITIES)

Let’s plot it.

ggplot(data = df_fatalities[1:10,], 
       aes(x = reorder(EVENT_TYPE, -FATALITIES), 
           y = FATALITIES)) +
    geom_bar(stat = "identity", fill = "green") + 
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    labs(title = "Top 10 death causing events", 
         x = "Event type",
         y = "Number of fatalities") 

Tornadoes are clearly the main death-causing event, followed by excessive heat, flash floods and heat.

Question 2: Across the United States, which types of events have the greatest economic consequences?

To address the economic impact, we’ll make a “df_economics” data frame as follows:

  • We’ll summarise the damage on properties and the damage on crops by event type. We’ll put the sum of both damage types in a new field TOTAL_DAMAGE.
  • We’ll keep the EVENT_TYPE and TOTAL_DAMAGE fields as they are and melt PROPERTIES_DAMAGE and CROPS_DAMAGE into a variable/value (long) format.
  • We’ll arrange the data frame by total damage, from greatest to least.

The “reshape2” package will be needed for melting:

if (!require("reshape2")) {
  install.packages("reshape2")
  require("reshape2")
}
df_economics <- df_mydata %>% 
    select(EVENT_TYPE, PROPERTIES_DAMAGE, CROPS_DAMAGE) %>%
    group_by(EVENT_TYPE) %>%
    summarise(PROPERTIES_DAMAGE = sum(PROPERTIES_DAMAGE), 
           CROPS_DAMAGE = sum(CROPS_DAMAGE)) %>%
    mutate(TOTAL_DAMAGE = PROPERTIES_DAMAGE + CROPS_DAMAGE)

df_economics <- melt(df_economics, 
                     id.vars = c("EVENT_TYPE", "TOTAL_DAMAGE"),
                     measure.vars = c("PROPERTIES_DAMAGE", "CROPS_DAMAGE"))

df_economics <- df_economics %>% arrange(-TOTAL_DAMAGE)
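
The long format produced by melting has two rows per event type, one for property damage and one for crop damage. The sketch below reproduces that shape by hand in base R on invented numbers, only to show what the melted data frame looks like; it does not use reshape2 and is not the code of this analysis.

```r
# Toy wide data (invented values): one row per event type
wide <- data.frame(EVENT_TYPE = c("TORNADO", "FLOOD"),
                   PROPERTIES_DAMAGE = c(50, 100),
                   CROPS_DAMAGE = c(5, 10),
                   stringsAsFactors = FALSE)
wide$TOTAL_DAMAGE <- wide$PROPERTIES_DAMAGE + wide$CROPS_DAMAGE

# Long (variable/value) format built by hand: the identifier columns are
# repeated once per measured variable
long <- rbind(
  data.frame(EVENT_TYPE = wide$EVENT_TYPE,
             TOTAL_DAMAGE = wide$TOTAL_DAMAGE,
             variable = "PROPERTIES_DAMAGE",
             value = wide$PROPERTIES_DAMAGE,
             stringsAsFactors = FALSE),
  data.frame(EVENT_TYPE = wide$EVENT_TYPE,
             TOTAL_DAMAGE = wide$TOTAL_DAMAGE,
             variable = "CROPS_DAMAGE",
             value = wide$CROPS_DAMAGE,
             stringsAsFactors = FALSE)
)

# Sort by total damage, greatest first; both rows of an event stay together
long <- long[order(-long$TOTAL_DAMAGE), ]
print(long)
```

With this layout, taking the first 2 × n rows of the sorted data selects the top n event types, each with its property and crop components.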

Let’s plot it.

# After melting, each event type has two rows (properties and crops),
# so the first 20 rows correspond to the top 10 event types.
ggplot(data = df_economics[1:20,], 
       aes(x = reorder(EVENT_TYPE, -TOTAL_DAMAGE), 
           y = value,
           fill = variable)) +
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    labs(title = "Top 10 property and crop damaging events", 
         x = "Event type",
         y = "Total damage in dollars") 

Floods clearly have the greatest impact on the economy, followed by hurricanes/typhoons, tornadoes and storm surges. In most cases, the impact comes from damage to property rather than to crops.