Synopsis

Storm and other significant weather phenomena can be life threatening and may lead to fatalities, injuries, property damages, and crop damages. The purpose of this data analysis is to articulate the types of events that are most harmful with respect to population health and have the greatest economic impact in the US using the NOAA Storm Database. This analysis uses the storm data events from 1995-2011 and exclude events that do not follows NWS Directive 10-1605 “Permitted Storm Data Events”.

Tornado was identified as the major cause of injuries and fatalities in overall. Tornado has caused the greatest number of fatalities, whereas excessive heat has caused the greatest number of injuries. Heat, excessive heat, and tsunami were the top three events that likely to cause more injuries and fatalities per event since 1995.

Flood caused the greatest economic impact overall. Flood caused the most property damages and the second most crop damages. Drought was the greatest cause of crop damages since 1995.

Data Processing

  1. Setup
knitr::opts_chunk$set(echo = TRUE)

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(stringr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(xtable)

setwd("~/study/DataScience/john hopskin/ReproducibleResearch/assignment2")
  1. Load data
zipfile<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- tempfile()
download.file(zipfile,temp,mode="wb")
storm_original<- read.csv(temp,na.strings=c("","NA"))
unlink(temp)
rm(temp,zipfile)
  1. Tranformation

According to the NWS Directive 10-1605, the only events permitted in Storm Data are listed in [Table 1 of Section 2.1.1] (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf). The original storm data contains many events that are not permitted.

  1. The first step is to remove the records that are not permitted.
#obtain permitted events (just copy and paste)
valid_events<-"Astronomical Low Tide Z
Avalanche Z
Blizzard Z
Coastal Flood Z
Cold/Wind Chill Z
Debris Flow C
Dense Fog Z
Dense Smoke Z
Drought Z
Dust Devil C
Dust Storm Z
Excessive Heat Z
Extreme Cold/Wind Chill Z
Flash Flood C
Flood C
Frost/Freeze Z
Funnel Cloud C
Freezing Fog Z
Hail C
Heat Z
Heavy Rain C
Heavy Snow Z
High Surf Z
High Wind Z
Hurricane (Typhoon) Z
Ice Storm Z
Lake-Effect Snow Z
Lakeshore Flood Z
Lightning C
Marine Hail M
Marine High Wind M
Marine Strong Wind M
Marine Thunderstorm Wind M
Rip Current Z
Seiche Z
Sleet Z
Storm Surge/Tide Z
Strong Wind Z
Thunderstorm Wind C
Tornado C
Tropical Depression Z
Tropical Storm Z
Tsunami Z
Volcanic Ash Z
Waterspout M
Wildfire Z
Winter Storm Z
Winter Weather Z"
valid_events<-toupper(substr(unlist(strsplit(valid_events,"\\n")),1,nchar(unlist(strsplit(valid_events,"\\n")))-2))

#keep only valid events from storm data
storm_valid<- storm_original %>% mutate(EVTYPE=toupper(EVTYPE)) %>%
filter(EVTYPE %in% valid_events)

#invalid count and %
invalidpct<-round(100*(nrow(storm_valid)/nrow(storm_original)),2)
validcnt<-nrow(storm_valid)-nrow(storm_original)
rm(storm_original,valid_events)
There were -266946 valid recors which is 70.41% of original records.
  1. The second step is to assess how much data can be used from 1950. Because the intereste is in the events that are most harmful and have economical impact, zero fatality/injury, proparty and crop damage are removed.
    storm_events<-storm_valid %>% mutate(year=format(as.Date(BGN_DATE,format="%m/%d/%Y"),"%Y")) %>%
    filter(FATALITIES+INJURIES>0 | PROPDMG+CROPDMG>0) %>%
    select(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,BGN_DATE,year)
Out of 635351 valid storm event records, only 172872 or 27.21% caused some forms of harm.
  g<-ggplot(storm_events,aes(as.Date(BGN_DATE,format="%m/%d/%Y")))+
    geom_histogram(bins=2011-1950)+
    labs(title="Histogram of Valid Storm Event Data, 1950-2011",x="Year")+
  geom_segment(aes(x=as.Date("1995-01-01"), xend=as.Date("2011-12-31"),y=-2,yend=-2),col="red", arrow=arrow(),size=2)

Figure 1: Histogram of Storm Events, 1950-2011

Based on the distribution of data shown in Figure 1, more data have been generated since 1995, hence this analysis will use the data from 1995 or later.

c.The third step is to generate data for 1995-2011 and remove invalid exponential values

  ads_storm<-storm_events %>% filter(year>=1995) %>% 
  filter(!PROPDMGEXP %in% c("+","-"))
  1. The fourth step is to normalize exponential values for property and crop damages
e1<-c("b","m","k","h","B","M","K","H","2","4","5","7","0")
e2<-c(10**9,10**6,10**3,10**2,10**9,10**6,10**3,10**2,10**2,10**4,10**5,10**7,1)
ep<-data.frame(e1,e2)
names(ep)<-c("PROPDMGEXP","PROPDMGEXPn")
ec<-data.frame(e1,e2)
names(ec)<-c("CROPDMGEXP","CROPDMGEXPn")
ads_storm1<-merge(ads_storm,ep,by="PROPDMGEXP",all.x=T)
ads_storm1<-merge(ads_storm1,ec,by="CROPDMGEXP",all.x=T)

ads_storm1$PROPDMGEXPn[is.na(ads_storm1$PROPDMGEXPn)] <- 0
ads_storm1$CROPDMGEXPn[is.na(ads_storm1$CROPDMGEXPn)] <- 0
ads_storm1$PROPDMG[is.na(ads_storm1$PROPDMG)] <- 0
ads_storm1$CROPDMG[is.na(ads_storm1$CROPDMG)] <- 0
  1. The fifth step is to derive the actual damage
ads_storm2<-ads_storm1 %>% 
mutate(PROPDMGEXPactual=PROPDMGEXPn*PROPDMG) %>%
mutate(CROPDMGEXPactual=CROPDMGEXPn*CROPDMG) 

meandamage1<-aggregate(ads_storm2$PROPDMGEXPactual+ads_storm2$CROPDMGEXPactual,list(ads_storm2$EVTYPE),mean)
names(meandamage1)<-c("EVTYPE","Mean Both")
meandamage2<-aggregate(ads_storm2$PROPDMGEXPactual,list(ads_storm2$EVTYPE),mean)
names(meandamage2)<-c("EVTYPE","Mean Property")
meandamage3<-aggregate(ads_storm2$CROPDMGEXPactual,list(ads_storm2$EVTYPE),mean)
names(meandamage3)<-c("EVTYPE","Mean Crop")
actualdamage1<-aggregate(ads_storm2$PROPDMGEXPactual+ads_storm2$CROPDMGEXPactual,list(ads_storm2$EVTYPE),sum)
names(actualdamage1)<-c("EVTYPE","Both")
actualdamage2<-aggregate(ads_storm2$PROPDMGEXPactual,list(ads_storm2$EVTYPE),sum)
names(actualdamage2)<-c("EVTYPE","Property")
actualdamage3<-aggregate(ads_storm2$CROPDMGEXPactual,list(ads_storm2$EVTYPE),sum)
names(actualdamage3)<-c("EVTYPE","Crop")
meandamage<-merge(meandamage1,meandamage2,by="EVTYPE")
meandamage<-merge(meandamage,meandamage3,by="EVTYPE")

damage<-merge(actualdamage1,actualdamage2,by="EVTYPE")
damage<-merge(damage,actualdamage3,by="EVTYPE")
damage<-merge(meandamage,damage,by="EVTYPE")
cnt_e<-as.data.frame(table(ads_storm2$EVTYPE))
 names(cnt_e)<-c("EVTYPE","EventN") 
damage<-merge(cnt_e,damage,by="EVTYPE")
damage<-filter(damage,Both>0)

ads_damage<-melt(damage,id=c("EVTYPE","EventN"))
rm(meandamage1,meandamage2,meandamage3,actualdamage1,actualdamage2,actualdamage3,meandamage)
  1. The sixth step is to derive the ratio of injuries, fatalities, combined relative to the events in a separate dataset
#event rate
rate1<-aggregate(ads_storm2$INJURIES+ads_storm2$FATALITIES,list(ads_storm2$EVTYPE),mean) 
names(rate1)<-c("EVTYPE","Both")
rate2<-  aggregate(ads_storm2$INJURIES,list(ads_storm2$EVTYPE),mean)
 names(rate2)<-c("EVTYPE","Injuries") 
rate3<-  aggregate(ads_storm2$FATALITIES,list(ads_storm2$EVTYPE),mean) 
 names(rate3)<-c("EVTYPE","Fatalities") 
rate<-merge(rate1,rate2,by="EVTYPE")
rate<-merge(rate,rate3,by="EVTYPE")
#injury,fatality count
cnt1<-aggregate(ads_storm2$INJURIES+ads_storm2$FATALITIES,list(ads_storm2$EVTYPE),sum)
 names(cnt1)<-c("EVTYPE","BothN")
cnt2<-aggregate(ads_storm2$INJURIES,list(ads_storm2$EVTYPE),sum)
 names(cnt2)<-c("EVTYPE","InjuriesN") 
cnt3<-aggregate(ads_storm2$FATALITIES,list(ads_storm2$EVTYPE),sum)
 names(cnt3)<-c("EVTYPE","FatalitiesN") 
cnt<-merge(cnt1,cnt2,by="EVTYPE")
cnt<-merge(cnt,cnt3,by="EVTYPE") 
rate<-merge(cnt,rate,by="EVTYPE") 
rm() 
#event count
cnt_e<-as.data.frame(table(ads_storm2$EVTYPE))
 names(cnt_e)<-c("EVTYPE","EventN") 
rate<-merge(cnt_e,rate,by="EVTYPE") 
ads_rate<-melt(rate,id=c("EVTYPE","EventN"))
rm(cnt_e,cnt1,cnt2,cnt3,rate1,rate2,rate3,cnt) 

Results

Across the US, which types of events are most harmful with respect to population health?

To answer this question, the number of injuries, fatalities, and both combined as well as the rate of occurance of injuries, fatalities, and both combined for each event were reviewed.

ads_rate$variable <- factor(ads_rate$variable, levels=c("BothN","InjuriesN","FatalitiesN","Both","Injuries","Fatalities"), labels=c("Both (N)","Injuries (N)","Fatalities (N)","Both (%)","Injuries (%)","Fatalities (%)"))

#plot for count
orderedEvtypeN<-arrange(rate, BothN)
nplot<-ggplot(filter(ads_rate,ads_rate$variable %in% c("Both (N)","Injuries (N)","Fatalities (N)")),aes(value,reorder(EVTYPE,value),fill=variable)) + 
  geom_point(aes(color=variable)) +
labs(title="Number of Reported Injuries and Fatilities from Storm Events in the US",subtitle="1995-2011", caption = "Data Source: NOAA Storm Database")+
  xlab("Number of Injuries and Fatilities")+ 
  scale_y_discrete(name="Storm Events, Events Count", labels=paste(orderedEvtypeN$EVTYPE,", n=", orderedEvtypeN$EventN))+
  theme(axis.text.y = element_text(size=7))
e1<-as.character(orderedEvtypeN$EVTYPE[length(orderedEvtypeN$EVTYPE)])

#plot for ratio
orderedEvtypeP<-arrange(rate, Both)
rplot<-ggplot(filter(ads_rate,ads_rate$variable %in% c("Both (%)","Injuries (%)","Fatalities (%)")),aes(value,reorder(EVTYPE,value),fill=variable)) + 
  geom_point(aes(color=variable)) +
  labs(title="Injuries and Fatilities Rate from Storm Events in the US",subtitle="1995-2011", caption = "Data Source: NOAA Storm Database")+
  xlab("% of Injuries and Fatilities")+ 
  scale_y_discrete(name="Storm Events, Events Count", labels=paste(orderedEvtypeP$EVTYPE,", n=", orderedEvtypeP$EventN))+
  theme(axis.text.y = element_text(size=7))


#print(rplot)

Figure 2: Number of Reported Injuries and Fatilities from Storm Events in the US

Table 1: Injury and Fatality Rate per Event

xt1<-xtable(head(arrange(orderedEvtypeP,desc(Both))[c(1,2,6:8)],n=10))
print(xt1,type="html")
EVTYPE EventN Both Injuries Fatalities
1 HEAT 203 14.55 10.00 4.55
2 EXCESSIVE HEAT 698 12.07 9.35 2.73
3 TSUNAMI 14 11.57 9.21 2.36
4 DUST STORM 98 4.50 4.29 0.21
5 DENSE FOG 64 4.47 4.31 0.16
6 SLEET 1 2.00 0.00 2.00
7 BLIZZARD 230 1.98 1.67 0.31
8 HIGH SURF 132 1.95 1.17 0.77
9 TORNADO 12999 1.79 1.67 0.12
10 RIP CURRENT 389 1.51 0.58 0.93

Based on above Figure 2, tornado was identified as the major cause of injuries and fatalities in total. It caused 2.33110^{4} of injuries and fatalities since 1995. Tornado has caused the greatest number of fatalities, and excessive heat has caused the greatest number of injuries.

Heat, excessive heat, and tsunami were the top three events that likely to cause more injuries and fatalities per event since 1995 as shown in Table 1.

Across the US, which types of events have the greatest economic consequences?

To answer this question, the property and crop damage, and both combined as well as the mean damage of property and crop damage, and both combined per each event were reviewed.

#plot for count
orderedEvtypeN<-arrange(damage, desc(Both))
nplot2<-ggplot(filter(ads_damage,ads_damage$variable %in% c("Both","Property","Crop")),aes(log10(value),reorder(EVTYPE,log10(value)),fill=variable)) + facet_grid (.~variable)+
  geom_point(aes(color=variable),position =   position_jitter(w = 0.02, h = 0.05))+
  labs(title="Reported Property and Crop Damages (Log10 USD) from Storm Events in the US",subtitle="1995-2011", caption = "Data Source: NOAA Storm Database")+
  xlab("Reported Property and Crop Damages (Log10 USD)")+
  ylab("Storm Events, Events Count")+
#  scale_y_discrete(name="Storm Events, Events Count", labels=paste(orderedEvtypeN$EVTYPE,", n=", orderedEvtypeN$EventN))+
  theme(axis.text.y = element_text(size=6))
e1<-as.character(orderedEvtypeN$EVTYPE[length(orderedEvtypeN$EVTYPE)])

Figure 3: Reported Property and Crop Damages (Log10 USD) from Storm Events in the US

Table 2: Property and Crop Damage Rate per Event

xt<-xtable(head(orderedEvtypeN[,c(1,2,6:8)],n=10))
print(xt,type="html")
EVTYPE EventN Both Property Crop
1 FLOOD 9712 149444847450.00 144022037050.00 5422810400.00
2 TORNADO 12999 25232535255.00 24935939485.00 296595770.00
3 HAIL 23860 17662849120.70 15048722050.70 2614127070.00
4 FLASH FLOOD 19850 17391709364.00 16047794364.00 1343915000.00
5 DROUGHT 264 14968172000.00 1046106000.00 13922066000.00
6 TROPICAL STORM 412 8331171550.00 7653335550.00 677836000.00
7 HIGH WIND 5415 5893346660.00 5259785360.00 633561300.00
8 WILDFIRE 853 5054536800.00 4759064000.00 295472800.00
9 STORM SURGE/TIDE 47 4642038000.00 4641188000.00 850000.00
10 THUNDERSTORM WIND 43170 3813636990.00 3399282990.00 414354000.00

Based on above Table 2 and Figure 3, it is obvious that flood has the greatest economic impact overall from the storm data. Flood has caused the most property damages and the second most crop damages. Drought has been the greatest cause of crop damages since 1995.