The goal of this analysis is to identify which weather events in the U.S. have the greatest impact on public health and on the economy. The data set comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and the processing details are described below in multiple steps. The script cleans up the event names and uses only records from 1996 onward.
The analysis shows that tornadoes are the most critical events in terms of combined fatalities and injuries. The property and crop damage analysis indicates that floods have the largest impact on the economy. Excessive heat also deserves attention: it causes the most fatalities of any event type, and its ratio of fatalities to injuries is far higher than that of tornadoes. Within the same analysis, drought is the only event whose crop damage exceeds its property damage.
In this section, the data processing steps are presented.
First, the working directory needs to be set up. If you are using RStudio, you can also set it from Session > Set Working Directory > To Source File Location. Here is my example below.
setwd("~/work/r/class/datasciencecoursera/notes/class5_repdata/assignment2")
Below is the list of required R packages for this analysis:
library(data.table)
library(parallel)
library(stringdist)
library(R.utils)
library(xtable)
library(ggplot2)
library(reshape2)
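If any of these packages are missing, they can be installed first as a one-time setup step (note that parallel ships with base R and does not need to be installed):
install.packages(c("data.table","stringdist","R.utils","xtable","ggplot2","reshape2"))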
Download the compressed data file and the list of valid event types if they are not present yet. (The bz2 archive does not need to be extracted; read.csv can read it directly.)
if (!file.exists("storm.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm.bz2",method="curl")
}
if (!file.exists("types.csv")){
download.file("https://raw.githubusercontent.com/kiichi/datasciencecoursera/master/notes/class5_repdata/assignment2/valid.csv","valid.csv",method="curl")
}
Note: NOAA’s documentation is available here
Next, load the data file. Since I got an error with the fread function on this file, I use read.csv first and then convert the result into a data.table.
d<-data.table(read.csv("storm.bz2"))
These functions are created for the preprocessing phase. cleanup removes surrounding white space, strips anything after the first parenthesis, and forces characters to uppercase. The unit_num function maps the damage unit characters (K, M and B) to dollar multipliers. More detailed descriptions are available in the following sections.
# Trim surrounding spaces, remove everything after the first
# parenthesis, and force the string to uppercase.
cleanup <- function (x) {
toupper(gsub("^\\s+|\\s+$", "",gsub("\\(.+","",x)))
}
# Convert the characters K, M and B to multipliers of
# 1000, 1000000 and 1000000000; anything else maps to 1.
unit_num<-function(unit){
if (is.na(unit)){
1
}
else if (unit == "K"){
1000
}
else if (unit=="M"){
1000000
}
else if (unit=="B"){
1000000000
}
else {
1
}
}
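A quick sanity check of both helpers on a few sample inputs (illustrative values, not taken from the dataset):
cleanup("  lake effect snow (heavy) ") # "LAKE EFFECT SNOW"
sapply(c("K","M","B",NA),unit_num)     # 1000, 1e+06, 1e+09, 1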
Data transformation policy
# Keep only records from 1996 onward (drop 1995 and earlier)
d<-d[as.Date(d$BGN_DATE,"%m/%d/%Y") >= as.Date("1/1/1996","%m/%d/%Y"),]
# Remove if all of PROPDMG,CROPDMG,INJURIES and FATALITIES columns are zero
d<-d[(PROPDMG+CROPDMG+INJURIES+FATALITIES)>0,]
# Upper case and Trim
d[,c("EVTYPE")] <- cleanup(d$EVTYPE)
d[,c("PROPDMGEXP")] <- cleanup(d$PROPDMGEXP)
d[,c("CROPDMGEXP")] <- cleanup(d$CROPDMGEXP)
# Remove summary
d<-d[!(EVTYPE %like% 'SUMMARY'),]
Load the valid event names, and find similar words that need to be corrected. The default osa algorithm was chosen for the string distance calculation, and the threshold is 1: if the distance is more than 1, the logic keeps the original event name. In this way, I expect to detect differences such as singular vs. plural forms, hyphens, etc.
uni<-unique(d$EVTYPE)
valid<-cleanup(read.csv("valid.csv")$x)
# Find the valid word with the minimum distance;
# if the distance is too big (>= 2), keep the original.
find_word<-function(original){
dst<-stringdist(original,valid)
if (min(dst)>=2){
original
}
else {
valid[which.min(dst)]
}
}
# Build translation table
trn<-data.table(original=uni)
trn[,c("translated")]<-sapply(uni,find_word)
trn<-trn[original!=translated,]
By computing distances between the valid event type dictionary and the original event types in the data file, I found the following list of similar words.
trn
## original translated
## 1: RIP CURRENTS RIP CURRENT
## 2: STRONG WINDS STRONG WIND
## 3: LAKE EFFECT SNOW LAKE-EFFECT SNOW
## 4: HIGH WINDS HIGH WIND
# Look up the event name in the translation table
# and replace it if a correction exists.
replace_event<-function(event_type){
if (any(trn$original==event_type)){
trn[original==event_type,translated]
}
else {
event_type
}
}
d[,c("EVTYPE")]<-sapply(d$EVTYPE,replace_event)
# Unique Number of Events in the dataset
length(unique(d$EVTYPE))
## [1] 179
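As a spot check, replace_event corrects names found in the translation table and leaves everything else alone (expected values follow from the table printed above):
replace_event("RIP CURRENTS") # "RIP CURRENT"
replace_event("TORNADO")      # "TORNADO" (unchanged)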
Finally, we calculate the total number of fatalities and injuries, and the total economic damage, which combines property damage and crop damage.
d[,c("PHTTL"):=FATALITIES+INJURIES] #Public Health Total
d[,c("PROPDMGMULT")] <- sapply(d$PROPDMGEXP,unit_num) # Multiplier for Property Damage
d[,c("CROPDMGMULT")] <- sapply(d$CROPDMGEXP,unit_num) # Multiplier for Crop Damage
d[,c("PROPDMG"):=PROPDMG*PROPDMGMULT] # Update Property Damage
d[,c("CROPDMG"):=CROPDMG*CROPDMGMULT] # Update Crop Damage
d[,c("DMGTTL"):=PROPDMG+CROPDMG] # Damage Total
Results from the processed data are presented below.
Generate the top 10 events by public health impact (fatalities plus injuries) with the following steps:
rk_ph<-d[,list(TOTAL=sum(PHTTL),FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES)),by=EVTYPE]
rk_ph<-head(rk_ph[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ph,digits=0),type="html",format.args=list(big.mark = ","))
| | EVTYPE | TOTAL | FATALITIES | INJURIES |
|---|---|---|---|---|
| 1 | TORNADO | 22,178 | 1,511 | 20,667 |
| 2 | EXCESSIVE HEAT | 8,188 | 1,797 | 6,391 |
| 3 | FLOOD | 7,172 | 414 | 6,758 |
| 4 | LIGHTNING | 4,792 | 651 | 4,141 |
| 5 | TSTM WIND | 3,870 | 241 | 3,629 |
| 6 | FLASH FLOOD | 2,561 | 887 | 1,674 |
| 7 | THUNDERSTORM WIND | 1,530 | 130 | 1,400 |
| 8 | WINTER STORM | 1,483 | 191 | 1,292 |
| 9 | HEAT | 1,459 | 237 | 1,222 |
| 10 | HURRICANE/TYPHOON | 1,339 | 64 | 1,275 |
m_ph<-melt(rk_ph[,list(EVTYPE,FATALITIES,INJURIES)], id=c("EVTYPE"))
ggplot(m_ph,aes(factor(EVTYPE,levels=rk_ph[order(rk_ph$TOTAL),]$EVTYPE),fill=variable,y=value)) + geom_bar(stat="identity") + coord_flip() + labs(title="Public Health",x="Event Type",y="Population Affected by Events")
Generate the top 10 events by economic damage (property plus crop damage) with the following steps:
rk_ed<-d[,list(TOTAL=sum(DMGTTL),PROPDMG=sum(PROPDMG),CROPDMG=sum(CROPDMG)),by=EVTYPE]
rk_ed<-head(rk_ed[order(TOTAL,decreasing=T),],10)
print(xtable(rk_ed,digits=0),type="html",format.args=list(big.mark = ","))
| | EVTYPE | TOTAL ($) | PROPDMG ($) | CROPDMG ($) |
|---|---|---|---|---|
| 1 | FLOOD | 148,919,611,950 | 143,944,833,550 | 4,974,778,400 |
| 2 | HURRICANE/TYPHOON | 71,913,712,800 | 69,305,840,000 | 2,607,872,800 |
| 3 | STORM SURGE | 43,193,541,000 | 43,193,536,000 | 5,000 |
| 4 | TORNADO | 24,900,370,720 | 24,616,945,710 | 283,425,010 |
| 5 | HAIL | 17,071,172,870 | 14,595,143,420 | 2,476,029,450 |
| 6 | FLASH FLOOD | 16,557,155,610 | 15,222,253,910 | 1,334,901,700 |
| 7 | HURRICANE | 14,554,229,010 | 11,812,819,010 | 2,741,410,000 |
| 8 | DROUGHT | 14,413,667,000 | 1,046,101,000 | 13,367,566,000 |
| 9 | TROPICAL STORM | 8,320,186,550 | 7,642,475,550 | 677,711,000 |
| 10 | HIGH WIND | 5,881,921,660 | 5,248,360,360 | 633,561,300 |
m_ed<-melt(rk_ed[,list(EVTYPE,PROPDMG,CROPDMG)], id=c("EVTYPE"))
ggplot(m_ed,aes(factor(EVTYPE,levels=rk_ed[order(rk_ed$TOTAL),]$EVTYPE),fill=variable,y=value/1000000000.0)) + geom_bar(stat="identity") + coord_flip() + labs(title="Economic Damage",x="Event Type",y="Damage in $ (Billion)")
Generate the .md document using the knitr library as needed:
#library(knitr)
#knit2html("assignment2_noaa_analysis.Rmd")
The published document is available at the following location:
Github URL