Synopsis

The goal of this analysis is to identify which weather events in the U.S. have the greatest impact in terms of public health and economic damage. The data set was imported from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and the processing details are described below in multiple steps. The script cleans up the event names and uses only records after 1995.

The analysis shows that tornadoes are the most critical events in terms of combined fatalities and injuries. The property and crop damage analysis indicates that floods have the largest impact on the economy. I would also like to draw attention to excessive heat, because its ratio of fatalities to injuries is notably higher than for the other top events. Within the same analysis, drought is the only event whose crop damage exceeds its property damage.

Data Processing

In this section, data processing steps are presented.

Setup

First, the working directory needs to be set. If you are using RStudio, you can also set it from Session > Set Working Directory > To Source File Location. Here is my example below.

setwd("~/work/r/class/datasciencecoursera/notes/class5_repdata/assignment2")

Below is the list of required R packages for this analysis:

library(data.table)
library(parallel)
library(stringdist)
library(R.utils)
library(xtable)
library(ggplot2)
library(reshape2)

Download the compressed storm data file and the list of valid event names if they have not been downloaded yet

if (!file.exists("storm.bz2")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm.bz2",method="curl")
}
if (!file.exists("types.csv")){
  download.file("https://raw.githubusercontent.com/kiichi/datasciencecoursera/master/notes/class5_repdata/assignment2/valid.csv","valid.csv",method="curl")
}

Note: NOAA’s documentation is available here

Loading Data

Next, load the data file. Since I got an error with the fread function, I use read.csv first and then convert the result into a data.table.

d<-data.table(read.csv("storm.bz2"))

Utility Functions

These functions are created for the preprocessing phase. cleanup trims white space, removes anything after the first parenthesis, and forces characters to upper case. The unit_num function maps the damage unit characters (K, M and B) to dollar multipliers. More detailed descriptions are available in the following section.

# Trim spaces, and remove after first parenthesis.
# Also force the string to uppercase.
cleanup <- function (x)  {
  toupper(gsub("^\\s+|\\s+$", "",gsub("\\(.+","",x)))
}

# Convert the unit characters K, M and B to multipliers 1,000, 1,000,000 and 1,000,000,000.
# Anything else (including NA) maps to 1.
unit_num<-function(unit){
  if (is.na(unit)){
    1
  }
  else if (unit == "K"){
    1000
  }
  else if (unit=="M"){
    1000000
  }
  else if (unit=="B"){
    1000000000
  }
  else {
    1
  }
}
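
A quick illustration of the two helpers on made-up inputs (these example values are hypothetical and not taken from the dataset):

cleanup("  Thunderstorm Wind (G40) ")  # "THUNDERSTORM WIND"
unit_num("M")                          # 1e+06
unit_num(NA)                           # 1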

Data Transformation

Data transformation policy

  • Convert everything to upper case
  • Remove all events that contain “summary” in the description
  • Remove text after the first parenthesis
  • Correct similar event names using the list of valid event names (see valid.csv, derived from the PDF reference above)
  • Remove all rows in which the PROPDMG, CROPDMG, INJURIES and FATALITIES columns are all zero
  • Multiply the property / crop damage by a factor that depends on the unit. Reference: NATIONAL WEATHER SERVICE INSTRUCTION, pdf page 12, second paragraph - “Alphabetical characters used to signify magnitude include K for thousands, M for millions, and B for billions.”
# Keep only records from 1996 onward
d<-d[as.Date(d$BGN_DATE,"%m/%d/%Y") >= as.Date("1/1/1996","%m/%d/%Y"),]

# Remove if all of PROPDMG,CROPDMG,INJURIES and FATALITIES columns are zero
d<-d[(PROPDMG+CROPDMG+INJURIES+FATALITIES)>0,]

# Upper case and Trim
d[,c("EVTYPE")] <- cleanup(d$EVTYPE)
d[,c("PROPDMGEXP")] <- cleanup(d$PROPDMGEXP)
d[,c("CROPDMGEXP")] <- cleanup(d$CROPDMGEXP)

# Remove summary
d<-d[!(EVTYPE %like% 'SUMMARY'),]

Load the valid event names and find similar words that need to be corrected. The default osa (optimal string alignment) algorithm was chosen for the string distance calculation, and the threshold is 1. If the distance is greater than 1, the logic keeps the original event name. In this way, I expect to catch differences such as singular vs. plural forms, hyphens, etc.

uni<-unique(d$EVTYPE)
valid<-cleanup(read.csv("valid.csv")$x)

# find minimum distance of words
# if the distance is too big (>= 2), skip it.
find_word<-function(original){
  dst<-stringdist(original,valid)
  if (min(dst)>=2){
    original
  }
  else {
    valid[which.min(dst)]
  }
}

#Build translation table
trn<-data.table(original=uni)
trn[,c("translated")]<-sapply(uni,find_word)
trn<-trn[original!=translated,]

By computing distances between the valid event type dictionary and the data file’s original event types, I found the following list of similar words.

trn
##            original       translated
## 1:     RIP CURRENTS      RIP CURRENT
## 2:     STRONG WINDS      STRONG WIND
## 3: LAKE EFFECT SNOW LAKE-EFFECT SNOW
## 4:       HIGH WINDS        HIGH WIND
# Look up the translation table and replace the
# event name if a correction exists.
replace_event<-function(event_type){
  if (any(trn$original==event_type)){
    trn[original==event_type,translated]
  }
  else {
    event_type
  }
}

d[,c("EVTYPE")]<-sapply(d$EVTYPE,replace_event)
# Unique Number of Events in the dataset
length(unique(d$EVTYPE))
## [1] 179
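
As an aside, the row-by-row sapply above could also be written as a single data.table update join; this is only a sketch of an equivalent approach using the trn table built earlier, not the method used in this analysis:

# Vectorized alternative sketch: update EVTYPE in place where it matches trn$original
# d[trn, EVTYPE := i.translated, on = c(EVTYPE = "original")]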

Finally, we calculate the total number of fatalities and injuries, and the total economic damage, which is the sum of property damage and crop damage.

d[,c("PHTTL"):=FATALITIES+INJURIES]                    #Public Health Total
d[,c("PROPDMGMULT")] <- sapply(d$PROPDMGEXP,unit_num)  # Multiplier for Property Damage
d[,c("CROPDMGMULT")] <- sapply(d$CROPDMGEXP,unit_num)  # Multiplier for Crop Damage
d[,c("PROPDMG"):=PROPDMG*PROPDMGMULT]                  # Update Property Damage
d[,c("CROPDMG"):=CROPDMG*CROPDMGMULT]                  # Update Crop Damage
d[,c("DMGTTL"):=PROPDMG+CROPDMG]                       # Damage Total

Results

Results of processed data are presented below.

Ranking: Public Health

Generate the top 10 with the following steps:

  1. Group by events
  2. Sort and extract top 10
  3. Display
rk_ph<-d[,list(TOTAL=sum(PHTTL),FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES)),by=EVTYPE]
rk_ph<-head(rk_ph[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ph,digits=0),type="html",format.args=list(big.mark = ","))
EVTYPE TOTAL FATALITIES INJURIES
1 TORNADO 22,178 1,511 20,667
2 EXCESSIVE HEAT 8,188 1,797 6,391
3 FLOOD 7,172 414 6,758
4 LIGHTNING 4,792 651 4,141
5 TSTM WIND 3,870 241 3,629
6 FLASH FLOOD 2,561 887 1,674
7 THUNDERSTORM WIND 1,530 130 1,400
8 WINTER STORM 1,483 191 1,292
9 HEAT 1,459 237 1,222
10 HURRICANE/TYPHOON 1,339 64 1,275


m_ph<-melt(rk_ph[,list(EVTYPE,FATALITIES,INJURIES)], id=c("EVTYPE"))
ggplot(m_ph,aes(factor(EVTYPE,levels=rk_ph[order(rk_ph$TOTAL),]$EVTYPE),fill=variable,y=value)) + geom_bar(stat="identity") + coord_flip() + labs(title="Public Health",x="Event Type",y="Population Affected by Events")

[Plot "plotph": Public Health - population affected (fatalities and injuries) by event type, top 10]

Ranking: Economic Damage

Generate the top 10 with the following steps:

  1. Group by events
  2. Sort and extract top 10
  3. Display
rk_ed<-d[,list(TOTAL=sum(DMGTTL),PROPDMG=sum(PROPDMG),CROPDMG=sum(CROPDMG)),by=EVTYPE]
rk_ed<-head(rk_ed[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ed,digits=0),type="html",format.args=list(big.mark = ","))
EVTYPE TOTAL PROPDMG CROPDMG
1 FLOOD 148,919,611,950 143,944,833,550 4,974,778,400
2 HURRICANE/TYPHOON 71,913,712,800 69,305,840,000 2,607,872,800
3 STORM SURGE 43,193,541,000 43,193,536,000 5,000
4 TORNADO 24,900,370,720 24,616,945,710 283,425,010
5 HAIL 17,071,172,870 14,595,143,420 2,476,029,450
6 FLASH FLOOD 16,557,155,610 15,222,253,910 1,334,901,700
7 HURRICANE 14,554,229,010 11,812,819,010 2,741,410,000
8 DROUGHT 14,413,667,000 1,046,101,000 13,367,566,000
9 TROPICAL STORM 8,320,186,550 7,642,475,550 677,711,000
10 HIGH WIND 5,881,921,660 5,248,360,360 633,561,300


m_ed<-melt(rk_ed[,list(EVTYPE,PROPDMG,CROPDMG)], id=c("EVTYPE"))
ggplot(m_ed,aes(factor(EVTYPE,levels=rk_ed[order(rk_ed$TOTAL),]$EVTYPE),fill=variable,y=value/1000000000.0)) + geom_bar(stat="identity") + coord_flip() + labs(title="Economic Damage",x="Event Type",y="Damage in $ (Billion)")

[Plot "ploted": Economic Damage - property and crop damage in $ billions by event type, top 10]

Miscellaneous

Generate the .md document using the knitr library as needed

#library(knitr)
#knit2html("assignment2_noaa_analysis.Rmd")
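
Alternatively (a sketch, assuming the rmarkdown package is installed; it is not used elsewhere in this analysis), rmarkdown::render can produce the HTML output in one step:

#library(rmarkdown)
#render("assignment2_noaa_analysis.Rmd")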

The published URL is:

Github URL