Reproducible Research - NOAA Storm Events Data Analysis

Assignment 2

by Joshua Jin

June 16, 2014

Synopsis

This assignment is to analyze and report public health and material damage of severe weather events. This document describes the data handling, data transformation, data analysis and reporting. It produced bar charts showing the top 10 most significant events in term of fatalities, injuries and economic damage caused. The result of the analysis is sufficient to answer the two questions listed below. Overall, tornadoes are the most hazardous to human health with 5,633 reported fatalities and 91,346 reported injuries, and floods have been responsible for the most economic damage, more than $150 billion US dollars.

Context

The goal of the data analysis is to address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. cross the United States, which types of events have the greatest economic consequences?

Data

Storm Data is an official publication of the National Oceanic and Atmospheric Administration. It receives data from Storm Data from the National Weather Service. The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

Data descriptions and justification for data transformations

According to the documentation provided, 7 out of 37 available variables in the dataset are used for data analysis:

  • EVTYPE: event type (e.g. tornado, flood, etc.)
  • FATALITIES: number directly killed as a measure of harm to human health
  • INJURIES: number directly injured as a measure of harm to human health
  • PROPDMG: property damage as a economic damage in USD
  • PROPDMGEXP: magnitude of property damage (e.g. thousands, millions USD, etc.)
  • CROPDMG: crop damage and as a economic damage in USD
  • CROPDMGEXP: magnitude of crop damage (e.g. thousands, millions USD, etc.)

The data processing contains the following steps:

  1. Load referenced libraries and input data
  2. Subset fatalities data. Create a data frame (fatalities.df) that contains the 10 most harmful events based on the total fatalities
  3. Subset injuries data. Create a data frame (injuries.df) that contains the 10 harmful events based on the total injuries
  4. Subset property and crop damages data. Create a data frame damaged.event.data. It contains all records that have caused fatalities, injuries, property damages, and crop damages, including dollar multipliers (PROPDMGEXP and CROPDMGEXP).
  5. Calculate the total damages for each of harmful events by multiply the base dollar amount to its multiplier. Store the result to a data frame summary.dmg.df for charting.
  6. Present bar charts

Load referenced libraries and input data

Note: From R, set the working directory first. Here some of data analysis is relying on sqldf package to leverage someone who is familiar with SQL language already. The charting is depended on ggplot2 package.

setwd("~/My_Projects/JohnsHDS/05_Reproducible_Research/Assignment_2")

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required namespace: tcltk
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
library(ggplot2)

z <- bzfile("repdata_data_StormData.csv.bz2", open = "r")
data <- read.csv(z)
close(z)

Subset fatalities data

Create a data frame (fatalities.df) that contains the 10 most harmful events based on the total fatalities.

fatalities.df <- sqldf('
    SELECT 
        SUM(FATALITIES) as TotalFatalities,
        EVTYPE
    FROM data
    GROUP BY EVTYPE
    ORDER BY TotalFatalities  DESC
    LIMIT 10')
## Loading required package: tcltk
# Report the top 10 fatalities and event types
fatalities.df
##    TotalFatalities         EVTYPE
## 1             5633        TORNADO
## 2             1903 EXCESSIVE HEAT
## 3              978    FLASH FLOOD
## 4              937           HEAT
## 5              816      LIGHTNING
## 6              504      TSTM WIND
## 7              470          FLOOD
## 8              368    RIP CURRENT
## 9              248      HIGH WIND
## 10             224      AVALANCHE

Subset fatalities data

Create a data frame (injuries.df) that contains the 10 harmful events based on the total injuries.

injuries.df <- sqldf('
    SELECT 
        SUM(INJURIES) as TotalInjuries,
        EVTYPE
    FROM data
    GROUP BY EVTYPE
    ORDER BY TotalInjuries  DESC
    LIMIT 10')

# Report the top 10 injuries and event types
injuries.df
##    TotalInjuries            EVTYPE
## 1          91346           TORNADO
## 2           6957         TSTM WIND
## 3           6789             FLOOD
## 4           6525    EXCESSIVE HEAT
## 5           5230         LIGHTNING
## 6           2100              HEAT
## 7           1975         ICE STORM
## 8           1777       FLASH FLOOD
## 9           1488 THUNDERSTORM WIND
## 10          1361              HAIL

Subset property and crop damages data

Create a data frame damaged.event.data. It contains all records that has caused fatalities, injuries, property damages, and crop damages, including dollar multipliers (PROPDMGEXP and CROPDMGEXP).

damaged.event.data <- sqldf('
    SELECT 
        EVTYPE,
        FATALITIES, 
        INJURIES, 
        PROPDMG,
        CROPDMG,
        PROPDMGEXP, 
        CROPDMGEXP
    FROM data
    WHERE FATALITIES + INJURIES + PROPDMG + CROPDMG > 0
    ORDER BY FATALITIES DESC  ')

Calculate total property and crop damages

Calculate the total damages for each of harmful events by multiply the base dollar amount to its multiplier. Store the result to a data frame summary.dmg.df for charting.

# Set multipliers (e.g. M for thousands) that will be used for calculating damages
multiplier <- c( "0"= 10^0, "2"=10^2, "3"=10^3, "4"=10^4, "5"=10^5, "6"=10^6, 
                 "7"=10^7, H=10^2, h=10^2, M=10^6, m=10^6, K=10^3, k=10^3, 
                 B=10^9, b= 10^9)

damaged.event.data$PROPDMGDollar <- damaged.event.data$PROPDMG * 
                       multiplier[as.character(damaged.event.data$PROPDMGEXP)]
damaged.event.data$PROPDMGDollar[is.na(damaged.event.data$PROPDMGDollar)] <- 0 

damaged.event.data$CROPDMGDollar <- damaged.event.data$CROPDMG * 
                       multiplier[as.character(damaged.event.data$CROPDMGEXP)]
damaged.event.data$CROPDMGDollar[is.na(damaged.event.data$CROPDMGDollar)] <- 0

summary.dmg.df <- sqldf('
    SELECT 
        SUM(PROPDMGDollar) as TotalDollarPropDMG,
        SUM(CROPDMGDollar) as TotalDollarCropDMG,
        SUM(PROPDMGDollar) + SUM(CROPDMGDollar) as TotalDollarDMG,
        EVTYPE
    FROM "damaged.event.data"
    GROUP BY EVTYPE
    ORDER BY TotalDollarDMG  DESC
    LIMIT 10')

# Report the top 10 event types caused property and crop damages
summary.dmg.df[,3:4]
##    TotalDollarDMG            EVTYPE
## 1       1.503e+11             FLOOD
## 2       7.191e+10 HURRICANE/TYPHOON
## 3       5.736e+10           TORNADO
## 4       4.332e+10       STORM SURGE
## 5       1.876e+10              HAIL
## 6       1.824e+10       FLASH FLOOD
## 7       1.502e+10           DROUGHT
## 8       1.461e+10         HURRICANE
## 9       1.015e+10       RIVER FLOOD
## 10      8.967e+09         ICE STORM

Plot bar charts

p1 <- ggplot(fatalities.df, aes(reorder(EVTYPE,TotalFatalities),
                                TotalFatalities, fill=factor(EVTYPE))) +
    geom_bar(stat="identity") +
    xlab("Event Type") +
    ylab("Total Fatalities") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
print(p1)

plot of chunk unnamed-chunk-6

p2 <- ggplot(injuries.df, aes(reorder(EVTYPE,TotalInjuries), 
                              TotalInjuries, fill=factor(EVTYPE))) +
    geom_bar(stat="identity") +
    xlab("Event Type") +
    ylab("Total Injuries") +    
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
print(p2)

plot of chunk unnamed-chunk-6

# Plot damage 
p3 <- ggplot(summary.dmg.df, aes(reorder(EVTYPE,TotalDollarDMG),
                                 TotalDollarDMG, fill=factor(EVTYPE))) +
    geom_bar(stat="identity") +
    xlab("Event Type") +
    ylab("Total Damages ($)") +    
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
print(p3)

plot of chunk unnamed-chunk-6

Results

By looking at result numbers and charts, it is obvious that tornadoes are the most hazardous to human health with 5,633 reported fatalities and 91,346 reported injuries. Floods have been responsible for the most economic damage with more than $150 billion US dollars.

In this assignment, it provided a step-by-step approach for analyzing data. The document was prepared with R version 3.0.2 on Mac OS X v10.9.3.