Reproducible Research Assignment #2

Sunday, September 21, 2014

This the assignment #2 of Reproducible Research Assignment for Storm Data Analysis.


Damange Analysis of National Weather Service Storm Data (1950-2011)

Synopsis

To understand the damages (public health and economic problems) for communities and municipalities in different types of natural disasters and events, we explored and ran simple analysis on storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). The database of NOAA contains records of damages in different types of events from 1950 to 2011.

In our first data analysis, we filtered and then processed the valid dataset to accumulate the number of fatalities and injuries by different types of events. So, we can understand how these events impact health of people. The second analysis about the economic lost. We accumulate the damage cost (product s and crops) by different types of events.
The results show hail, thunderstorm and flood are ranked the top natural disasters of health of the citizens. The hurricane, storm and flood are the top threats of economic lost.

CAVEAT: Many records in the database can be either lost or incorrect (human typing error), the valid data are less than 10% of the original data. The analyzed results might not reflect the reality.

Data Processing

## load library and declare trim function
library(xtable)
library(ggplot2)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.33.0 (2014-08-24) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
  1. Load the data:
    • Download storm data (repdata_data_StormData.csv.bz2 ), extract repdata_data_StormData.csv from repdata_data_StormData.csv.bz2 and read dataset.

      library(knitr)
      opts_chunk$set(cache=TRUE)
      source_url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
      source_file="repdata_data_StormData.csv.bz2"
      data_file="repdata_data_StormData.csv"
      ## download file and extract file
      if (!file.exists(source_file)) {
          ## download repdata_data_StormData.csv.bz2
          download.file(source_url, source_file, method="curl", quiet=T)        
      }
      if (!file.exists(data_file)){
          ## extract bz2 file
          bunzip2(source_file,data_file)
      }
      
      ## read csv data
      data<-read.csv(data_file, sep=",", header=T, quote ="", nrows=902297)
    • Remove unused rows, empty events, keep (EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) and trim character rows

      ## remove unused rows
      wdata <- data[,c(8,23:28)]
      ## trim and set to upper case
      wdata$EVTYPE=toupper(trim(wdata$EVTYPE))
      wdata$PROPDMGEXP=toupper(trim(wdata$PROPDMGEXP))
      wdata$CROPDMGEXP=toupper(trim(wdata$CROPDMGEXP))
      
      ## remove NA in EVTYPE
      wdata <- subset(wdata,!(wdata$EVTYPE=="" | wdata$EVTYPE=="?"))
      
      ## transform factor columns to numeric columns
      wdata$FATALITIES <- as.numeric(wdata$FATALITIES)
      wdata$INJURIES <- as.numeric(wdata$INJURIES)
      wdata$PROPDMG <- as.numeric(wdata$PROPDMG)
      wdata$CROPDMG <- as.numeric(wdata$CROPDMG)
      
      ## remove rows of no damages
      wdata <- subset(wdata,!(wdata$FATALITIES==0 
                    & wdata$INJURIES==0
                    & wdata$PROPDMG==0
                    & wdata$CROPDMG==0))
    • Fix values in PROPDMGEXP and CROPDMGEXP

      ## values of PROPDMGEXP & CROPDMGEXP
      ## 0 = 1
      ## H = 100
      ## K = 1000
      ## M = 1000000
      ## B = 1000000000
      ## Others = (unknown or typo, ignore these rows)
      
      wdata <- subset(wdata, wdata$PROPDMGEXP %in% c("0","H","K","M","B"))
      wdata <- subset(wdata, wdata$CROPDMGEXP %in% c("0","H","K","M","B"))
  2. Process and calculate impact of population health
    • Add new column for total impacted persons
    • Calculate total impacted persons
    • Aggregate impacted persons by EVTYPE

      ## add new column TOTAL_PEOPLE to calculate sum op FATALITIES INJURIES
      wdata$TOTAL_PEOPLE = wdata$FATALITIES+wdata$INJURIES
      
      ## aggregate sum of TOTAL_PEOPLE by different EVTYPE
      people_data <- aggregate(TOTAL_PEOPLE~EVTYPE, data=wdata, FUN=sum)
      people_data <- people_data[order(people_data$TOTAL_PEOPLE,decreasing=T),]       
  3. Process and calculate damage of products and crops
    • Add new columns for total damage cost
    • Calculate total damage cost
    • Aggregate total damage cost by EVTYPE

      ## add new columns PROD_LOST, CROP_LOST, TOTAL_LOST
      ## calculate total damange of product and crop
      wdata$PROD_LOST=wdata$PROPDMG*sapply(wdata$PROPDMGEXP, function(x) {
          if (x=="0") {1} 
          else if (x=="H") {100}
          else if (x=="K") {1000}
          else if (x=="M") {1000000}
          else if (x=="B") {1000000000}
      })
      wdata$CROP_LOST=wdata$CROPDMG*sapply(wdata$CROPDMGEXP, function(x) {
          if (x=="0") {1} 
          else if (x=="H") {100}
          else if (x=="K") {1000}
          else if (x=="M") {1000000}
          else if (x=="B") {1000000000}
      })
      wdata$TOTAL_LOST=wdata$PROD_LOST+wdata$CROP_LOST
      
      ## aggregate sum of TOTAL_LOST by different EVTYPE
      cost_data <- aggregate(TOTAL_LOST~EVTYPE, data=wdata, FUN=sum)
      cost_data <- cost_data[order(cost_data$TOTAL_LOST,decreasing=T),]

Results

  1. Which types of events are most harmful with respect to population health?
    • The top types of events and impacted population
    names(people_data) <- c("Event", "Population")
    xt <- xtable(head(people_data[,1:2],10))
    Event Population
    HAIL 3292226.00
    TSTM WIND 2572767.00
    FLASH FLOOD 998069.00
    TORNADO 544724.00
    THUNDERSTORM WINDS 470629.00
    FLOOD 313161.00
    LIGHTNING 149042.00
    EXCESSIVE HEAT 105759.00
    HIGH WIND 104744.00
    URBAN/SML STREAM FLD 93152.00
     
    • Diagram of fatalities and injuries of different types of events
    hplot <- ggplot(data=people_data[1:10,], 
                    aes(x=reorder(Event,-Population), y=Population)) + 
        geom_bar(stat="identity", colour="grey", fill="grey",
                 position=position_dodge()) +
        theme(axis.text.x=element_text(angle=90,hjust=1)) +
        xlab("") + ylab("Fatalities & Injuries")
    print(hplot)

    plot of chunk unnamed-chunk-6

  2. Which types of events have the greatest economic consequences?
    • The top types of events and economic lost
    names(cost_data) <- c("Event", "Lost")
    xt <- xtable(head(cost_data[,1:2],10))
    Event Lost
    HURRICANE 1006281994000.00
    HURRICANE OPAL 520677000000.00
    ICE STORM 499939764981.00
    FLASH FLOOD 385171102648.00
    WILDFIRE 291079702000.00
    HURRICANE OPAL/HIGH WINDS 244233000000.00
    “TORNADOES 223000000607.00
    FLOOD 158942458000.00
    HAIL 127807687180.00
    TSTM WIND 81978601000.00
     
    • Diagram of economic lost of different types of events
    hplot <- ggplot(data=cost_data[1:10,], 
                    aes(x=reorder(Event,-Lost), y=Lost)) + 
        geom_bar(stat="identity", colour="grey", fill="grey",
                 position=position_dodge()) +
        theme(axis.text.x=element_text(angle=90,hjust=1)) +
        xlab("") + ylab("Economic Lost (USD)")
    print(hplot)

    plot of chunk unnamed-chunk-8