The goal of this analysis is to identify which weather events in the U.S. have the greatest impact on public health and on the economy. The data set comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and the processing details are described below in multiple steps. The script cleans up the event names and uses only records from 1996 onward.
The analysis shows that tornadoes are the most critical events in terms of combined fatalities and injuries. The property and crop damage analysis indicates that floods have the largest impact on the economy. Excessive heat also deserves attention: it causes the most fatalities of any event type, and its ratio of fatalities to injuries is far higher than that of tornadoes. Within the same analysis, drought is the only event whose crop damage exceeds its property damage.
In this section, the data processing steps are presented.
First, the working directory needs to be set up. If you are using RStudio, you can also set it from Session > Set Working Directory > To Source File Location. Here is my example below.
setwd("~/work/r/class/datasciencecoursera/notes/class5_repdata/assignment2")
Below is the list of required R packages for this analysis:
library(data.table)
library(parallel)
library(stringdist)
library(R.utils)
library(xtable)
library(ggplot2)
library(reshape2)
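If any of these packages are missing, they can be installed first as a one-time setup step (note that parallel ships with base R and does not need to be installed):
install.packages(c("data.table","stringdist","R.utils","xtable","ggplot2","reshape2"))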
Download the compressed data file and the list of valid event types if they are not present yet. (The bz2 archive does not need to be extracted; read.csv can read it directly.)
if (!file.exists("storm.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm.bz2",method="curl")
}
if (!file.exists("types.csv")){
download.file("https://raw.githubusercontent.com/kiichi/datasciencecoursera/master/notes/class5_repdata/assignment2/valid.csv","valid.csv",method="curl")
}
Note: NOAA’s documentation is available here
Next, load the data file. Since I got an error with the fread function on this file, I use read.csv first and then convert the result into a data.table.
d<-data.table(read.csv("storm.bz2"))
These functions are created for the preprocessing phase. cleanup removes surrounding white space, strips anything after the first parenthesis, and forces characters to uppercase. The unit_num function maps the damage unit characters (K, M and B) to dollar multipliers. More detailed descriptions are available in the following sections.
# Trim surrounding spaces, remove everything after the first
# parenthesis, and force the string to uppercase.
cleanup <- function (x) {
toupper(gsub("^\\s+|\\s+$", "",gsub("\\(.+","",x)))
}
# Convert the characters K, M and B to multipliers of
# 1000, 1000000 and 1000000000; anything else maps to 1.
unit_num<-function(unit){
if (is.na(unit)){
1
}
else if (unit == "K"){
1000
}
else if (unit=="M"){
1000000
}
else if (unit=="B"){
1000000000
}
else {
1
}
}
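A quick sanity check of both helpers on a few sample inputs (illustrative values, not taken from the dataset):
cleanup("  lake effect snow (heavy) ") # "LAKE EFFECT SNOW"
sapply(c("K","M","B",NA),unit_num)     # 1000, 1e+06, 1e+09, 1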
Data transformation policy
# Keep only records from 1996 onward (drop 1995 and earlier)
d<-d[as.Date(d$BGN_DATE,"%m/%d/%Y") >= as.Date("1/1/1996","%m/%d/%Y"),]
# Remove if all of PROPDMG,CROPDMG,INJURIES and FATALITIES columns are zero
d<-d[(PROPDMG+CROPDMG+INJURIES+FATALITIES)>0,]
# Upper case and Trim
d[,c("EVTYPE")] <- cleanup(d$EVTYPE)
d[,c("PROPDMGEXP")] <- cleanup(d$PROPDMGEXP)
d[,c("CROPDMGEXP")] <- cleanup(d$CROPDMGEXP)
# Remove summary
d<-d[!(EVTYPE %like% 'SUMMARY'),]
Load the valid event names, and find similar words that need to be corrected. The default osa algorithm was chosen for the string distance calculation, and the threshold is 1: if the distance is more than 1, the logic keeps the original event name. In this way, I expect to detect differences such as singular vs. plural forms, hyphens, etc.
uni<-unique(d$EVTYPE)
valid<-cleanup(read.csv("valid.csv")$x)
# Find the valid word with the minimum distance;
# if the distance is too big (>= 2), keep the original.
find_word<-function(original){
dst<-stringdist(original,valid)
if (min(dst)>=2){
original
}
else {
valid[which.min(dst)]
}
}
# Build translation table
trn<-data.table(original=uni)
trn[,c("translated")]<-sapply(uni,find_word)
trn<-trn[original!=translated,]
By computing distances between the valid event type dictionary and the original event types in the data file, I found the following list of similar words.
trn
## original translated
## 1: RIP CURRENTS RIP CURRENT
## 2: STRONG WINDS STRONG WIND
## 3: LAKE EFFECT SNOW LAKE-EFFECT SNOW
## 4: HIGH WINDS HIGH WIND
# Look up the event name in the translation table
# and replace it if a correction exists.
replace_event<-function(event_type){
if (any(trn$original==event_type)){
trn[original==event_type,translated]
}
else {
event_type
}
}
d[,c("EVTYPE")]<-sapply(d$EVTYPE,replace_event)
# Unique Number of Events in the dataset
length(unique(d$EVTYPE))
## [1] 179
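As a spot check, replace_event corrects names found in the translation table and leaves everything else alone (expected values follow from the table printed above):
replace_event("RIP CURRENTS") # "RIP CURRENT"
replace_event("TORNADO")      # "TORNADO" (unchanged)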
Finally, we calculate the total number of fatalities and injuries, and the total economic damage, which combines property damage and crop damage.
d[,c("PHTTL"):=FATALITIES+INJURIES] #Public Health Total
d[,c("PROPDMGMULT")] <- sapply(d$PROPDMGEXP,unit_num) # Multiplier for Property Damage
d[,c("CROPDMGMULT")] <- sapply(d$CROPDMGEXP,unit_num) # Multiplier for Crop Damage
d[,c("PROPDMG"):=PROPDMG*PROPDMGMULT] # Update Property Damage
d[,c("CROPDMG"):=CROPDMG*CROPDMGMULT] # Update Crop Damage
d[,c("DMGTTL"):=PROPDMG+CROPDMG] # Damage Total
Results from the processed data are presented below.
Generate the top 10 events by public health impact (fatalities plus injuries) with the following steps:
rk_ph<-d[,list(TOTAL=sum(PHTTL),FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES)),by=EVTYPE]
rk_ph<-head(rk_ph[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ph,digits=0),type="html",format.args=list(big.mark = ","))
| | EVTYPE | TOTAL | FATALITIES | INJURIES |
|---|---|---|---|---|
| 1 | TORNADO | 22,178 | 1,511 | 20,667 |
| 2 | EXCESSIVE HEAT | 8,188 | 1,797 | 6,391 |
| 3 | FLOOD | 7,172 | 414 | 6,758 |
| 4 | LIGHTNING | 4,792 | 651 | 4,141 |
| 5 | TSTM WIND | 3,870 | 241 | 3,629 |
| 6 | FLASH FLOOD | 2,561 | 887 | 1,674 |
| 7 | THUNDERSTORM WIND | 1,530 | 130 | 1,400 |
| 8 | WINTER STORM | 1,483 | 191 | 1,292 |
| 9 | HEAT | 1,459 | 237 | 1,222 |
| 10 | HURRICANE/TYPHOON | 1,339 | 64 | 1,275 |
m_ph<-melt(rk_ph[,list(EVTYPE,FATALITIES,INJURIES)], id=c("EVTYPE"))
ggplot(m_ph,aes(factor(EVTYPE,levels=rk_ph[order(rk_ph$TOTAL),]$EVTYPE),fill=variable,y=value)) + geom_bar(stat="identity") + coord_flip() + labs(title="Public Health",x="Event Type",y="Population Affected by Events")
Generate the top 10 events by economic damage (property plus crop damage) with the following steps:
rk_ed<-d[,list(TOTAL=sum(DMGTTL),PROPDMG=sum(PROPDMG),CROPDMG=sum(CROPDMG)),by=EVTYPE]
rk_ed<-head(rk_ed[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ed,digits=0),type="html",format.args=list(big.mark = ","))
| | EVTYPE | TOTAL ($) | PROPDMG ($) | CROPDMG ($) |
|---|---|---|---|---|
| 1 | FLOOD | 148,919,611,950 | 143,944,833,550 | 4,974,778,400 |
| 2 | HURRICANE/TYPHOON | 71,913,712,800 | 69,305,840,000 | 2,607,872,800 |
| 3 | STORM SURGE | 43,193,541,000 | 43,193,536,000 | 5,000 |
| 4 | TORNADO | 24,900,370,720 | 24,616,945,710 | 283,425,010 |
| 5 | HAIL | 17,071,172,870 | 14,595,143,420 | 2,476,029,450 |
| 6 | FLASH FLOOD | 16,557,155,610 | 15,222,253,910 | 1,334,901,700 |
| 7 | HURRICANE | 14,554,229,010 | 11,812,819,010 | 2,741,410,000 |
| 8 | DROUGHT | 14,413,667,000 | 1,046,101,000 | 13,367,566,000 |
| 9 | TROPICAL STORM | 8,320,186,550 | 7,642,475,550 | 677,711,000 |
| 10 | HIGH WIND | 5,881,921,660 | 5,248,360,360 | 633,561,300 |
m_ed<-melt(rk_ed[,list(EVTYPE,PROPDMG,CROPDMG)], id=c("EVTYPE"))
ggplot(m_ed,aes(factor(EVTYPE,levels=rk_ed[order(rk_ed$TOTAL),]$EVTYPE),fill=variable,y=value/1000000000.0)) + geom_bar(stat="identity") + coord_flip() + labs(title="Economic Damage",x="Event Type",y="Damage in $ (Billion)")
Generate the .md document using the knitr library as needed:
#library(knitr)
#knit2html("assignment2_noaa_analysis.Rmd")
The published document is available at the following location:
Github URL