Overview

This article is a submission for the Coursera course “Reproducible Research”. It is an analysis of US Storm data from the National Oceanic and Atmospheric Administration’s (NOAA) storm database answering
set questions:

Synopsis

The weather event types are washed and the the period since 2000 looked at in greater detail. Tornadoes (under various desciptions) were the events responsible for the most Fatalities, most injuries and the highest economic impact. The year of 2006 had the highest weather related costs, but 2011 had the highest numbers of fatalities and injuries.

The most dangerous States are those in Tornado alley. Although for Injuries the northern states start ranking. The distribution of impacts is skewed with larger events resulting in the bulk of claims.

Data Processing

The code below loads the storms data from a Data subfolder and converts some of the fields to dates. As we are looking at time series, and time relevant data, entries which have a missing beginning and end dates are removed.

The data records contain summary information that is already included in other rows. The documentation states:

Additional summary information on cumulative fatalities, injuries, or damages from previous months can be explained in the event narrative of the Storm Data entry for the final month of the event.

To prevent double counting any record with the word Summary in the EVTYPE field is dropped.

Finally the data.frame is converted to use the data.table package.

Packages

require(data.table)
require(xtable)
## Warning: package 'xtable' was built under R version 3.1.2
require(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
require(datasets)

Load the data

setwd('C:/Coursera/Reproducible_Research/Assignment2')
storms <- read.csv(bzfile('./Data/repdata-data-StormData.csv.bz2'), stringsAsFactors=FALSE)

#Convert the date fields to dates
storms$BGN_DATE <- as.Date(storms$BGN_DATE, "%m/%d/%Y")
storms$END_DATE <- as.Date(storms$END_DATE, "%m/%d/%Y")
storms$END_DATE <- as.Date(storms$END_DATE, "%m/%d/%Y")

#Drop missing date records and summary record
drop <- grepl('summary', storms$EVTYPE, ignore.case=TRUE) | is.na(storms$BGN_DATE)
storms <- storms[!drop,] 

#Convert the EVTYPE to a factor
storms$EVTYPE<-as.factor(storms$EVTYPE)

#Convert storms to a data.table
storms <- as.data.table(storms)

It is clear from a list of the different types of events that there are many alternate names for the same type of event. For example the list shown below is for all events including the words WATERSPOUT and TORNADO

event.types <- levels(storms$EVTYPE)
tornadoes <- event.types[grepl('WATERSPOUT', event.types)|grepl('WATERSPOUT', event.types)]
print(xtable(as.data.frame(tornadoes)), type='html')
tornadoes
1 WATERSPOUT
2 DUST DEVIL WATERSPOUT
3 TORNADO/WATERSPOUT
4 WATERSPOUT
5 WATERSPOUT-
6 WATERSPOUT-TORNADO
7 WATERSPOUT FUNNEL CLOUD
8 WATERSPOUT TORNADO
9 WATERSPOUT/
10 WATERSPOUT/ TORNADO
11 WATERSPOUT/TORNADO
12 WATERSPOUTS

The events were grouped based on their wording as follows:

event.groups <- rep(NA_character_, length(event.types))  #mapping from event.type to a group name
names(event.groups) <- event.types

#Populate the map definition based on specific words/regular expressions
event.map <- matrix(c(
    #group name             #regular Expression
    'TORNADO'               ,'(tornado)|(spout)|(dust.*devil)|(whirlwind)|(torndao)'
    ,'TORNADO'              ,'(wall cloud)|(funnel)'
    ,'VOLCANIC'             ,'volcan(o|ic)'
    ,'TIDE/SURGE'           ,'(beach)|(coastal)|(surge)|(tsunami)'
    ,'TIDE/SURGE'           ,'(wave)|(surf)|(rip)|(high water)|(tide)|(seiche)'
    ,'SEA CONDITIONS'       ,'(swell)|(heavy sea)|(drown)|(high sea)|(rough sea)|(marine)'
    ,'FLOOD'                ,'(fl(o)*d)|(urban)|(rising water)|(stream)'
    ,'SNOW/COLD'            ,'(snow)|(blizzard)|(winter)|(wintry)|(chill)|(cold)|(frost)'
    ,'SNOW/COLD'            ,'(cool)|(low temp)|(ice)|(icy)|(sleet)|(hypothermia)|(freez)|(record low)'
    ,'AVALANCHE/LANDSLIDE'  ,'(avalanc)|((land)|(rock)|(mud).*(slide)|(slump)|(slip))|(dam (break|fail))'
    ,'HAIL'                 ,'hail'
    ,'HURRICANE/STORM/RAIN' ,'(t(h)?under)|(storm)|(depression)|(hurricane)|(shower)'
    ,'HURRICANE/STORM/RAIN' ,'(rain)|(microburst)|(wet)|(precip)|(downburst)|(typhoon)'
    ,'LIGHTNING'            ,'LIG(H)?T(N)?ING'
    ,'HOT/DRY/DUST'         ,'(hot)|(heat)|(warm)|(high temp)|(hyperthermia)|(dry)|(record temp)|(record high)'
    ,'HOT/DRY/DUST'         ,'(dust)|(driest)|(drought)|(mild)'
    ,'FIRE/SMOKE'           ,'(fire)|(smoke)'
    ,'WIND'                 ,'(wi?nd)|(gust)|(gail)|(turbulence)'
    ,'FOG/CLOUD'            ,'(fog)|(vog)|(cloud)'
    ,'OTHER'                ,'.*'                 #anything left over
    ), ncol=2, byrow=TRUE)

for (i in seq(nrow(event.map))){
    event.groups[is.na(event.groups) & grepl(event.map[i,2], event.types, ignore.case=TRUE)] <- event.map[i,1]
}

#Add the Event_group as a new field in storms
storms[, Event_group:= event.groups[EVTYPE]]

Analysis

Human Impact

What types of Events cause the most fatalities? The table below shows the top 10 event groups, the number of events with at least one fatality, and the number of recorded fatalities (since the start of the data in 1950)

table1 <- storms[FATALITIES>0
            ,.(Fatalities=sum(FATALITIES), Fatal_Events=.N)
            ,by=Event_group][order(-Fatalities)][1:10]

#print the table with xtable
print(xtable(table1, caption='Table 1: Top 10 Event types causing fatalities', digits=0)
      ,type='html')
Table 1: Top 10 Event types causing fatalities
Event_group Fatalities Fatal_Events
1 TORNADO 5667 1609
2 HOT/DRY/DUST 2998 771
3 FLOOD 1548 1005
4 SNOW/COLD 1130 766
5 TIDE/SURGE 993 664
6 WIND 935 744
7 LIGHTNING 817 760
8 HURRICANE/STORM/RAIN 531 318
9 AVALANCHE/LANDSLIDE 269 190
10 FIRE/SMOKE 90 38

If we limit the data to events since 2000 (as the older data is less reliable), the Top 10 changes to:

table2 <- storms[FATALITIES>0 & BGN_DATE >= as.Date("1/1/2000", "%m/%d/%Y")
           ,.(Fatalities=sum(FATALITIES), Fatal_Events=.N)
           ,by=Event_group][order(-Fatalities)][1:10]

#print the table
print(xtable(table2, caption='Table 2: Top 10 Event types causing fatalities since 2000',digits=0)
      ,type='html')
Table 2: Top 10 Event types causing fatalities since 2000
Event_group Fatalities Fatal_Events
1 HOT/DRY/DUST 1244 501
2 TORNADO 1197 356
3 FLOOD 878 612
4 TIDE/SURGE 635 501
5 SNOW/COLD 516 380
6 LIGHTNING 466 434
7 WIND 358 321
8 HURRICANE/STORM/RAIN 299 192
9 AVALANCHE/LANDSLIDE 216 154
10 FIRE/SMOKE 84 34

Ranking the impact of events by numbers of Injuries instead of Fatalities is as follows:

table3 <- storms[INJURIES>0 & BGN_DATE >= as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Injuries=sum(INJURIES), Events_with_Injuries=.N)
            ,by=Event_group][order(-Injuries)][1:10]

#print the table
print(xtable(table3, caption='Table 3: Top 10 Event types causing injury since 2000', digits=0)
      ,type='html')
Table 3: Top 10 Event types causing injury since 2000
Event_group Injuries Events_with_Injuries
1 TORNADO 15250 1349
2 HOT/DRY/DUST 4937 134
3 HURRICANE/STORM/RAIN 3387 702
4 LIGHTNING 2993 1574
5 WIND 2761 1227
6 SNOW/COLD 1414 212
7 FIRE/SMOKE 1199 256
8 FLOOD 1155 283
9 TIDE/SURGE 733 227
10 HAIL 579 136

So, the types of events that rank in the top ten for both deaths and injuries are:

in.both <- table2$Event_group[table2$Event_group %in% table3$Event_group]
table4 <- storms[Event_group %in% in.both & BGN_DATE >= as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Fatalities=sum(FATALITIES), Injuries=sum(INJURIES), Events=.N)
            ,by=Event_group][order(-Fatalities-Injuries)]

#print the table
print(xtable(table4, caption='Table 4: Major Human Impacts Event types since 2000', digits=0)
      ,type='html')
Table 4: Major Human Impacts Event types since 2000
Event_group Fatalities Injuries Events
1 TORNADO 1197 15250 25004
2 HOT/DRY/DUST 1244 4937 4145
3 HURRICANE/STORM/RAIN 299 3387 92458
4 LIGHTNING 466 2993 9686
5 WIND 358 2761 105209
6 FLOOD 878 1155 61938
7 SNOW/COLD 516 1414 37120
8 TIDE/SURGE 635 733 2673
9 FIRE/SMOKE 84 1199 3687

Economic Impact

An estimate of the economic impact is included in the data for each event. This is in the form of a numeric value as well as an exponent(B,M,K,…). Unfortunately the data contains many non-defined values for the exponent. For the purpose of this analysis the values of B,M,K are used, and all other values are assumed to be units.

scale <- c(
    "K" = 1000
    ,"k" = 1000
    ,"M" = 1000000
    ,"m" = 1000000
    ,"B" = 1000000000
    ,"b" = 1000000000
    )

#Add in a dollar figure for the Property and Crop Damage Amounts 
storms[PROPDMG>0, property_damage:= scale[PROPDMGEXP] * PROPDMG]        #scale up units if valid
storms[PROPDMG>0 & is.na(property_damage), property_damage:= PROPDMG]   #otherwise scale=1
storms[CROPDMG>0, crop_damage:= scale[PROPDMGEXP] * PROPDMG]
storms[CROPDMG>0 & is.na(crop_damage), crop_damage:= PROPDMG]

What are the crop and property damage estimates yearly since 2000?

The table below shows the damage for crops, property and combined (in $millions) for each year since 2000.

table5 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Property=sum(property_damage, na.rm=TRUE)/1000000
                ,Crops=sum(crop_damage, na.rm=TRUE)/1000000
                ,Combined=(sum(property_damage, na.rm=TRUE)+sum(crop_damage, na.rm=TRUE))/1000000
                ,Fatalities = sum(FATALITIES)
                ,Injuries = sum(INJURIES))
            ,by=year(BGN_DATE)]

#print the table
print(xtable(table5, caption='Table 5: Economic Impacts', digits=0)
      ,type='html')
Table 5: Economic Impacts
year Property Crops Combined Fatalities Injuries
1 2000 5621 659 6280 477 2801
2 2001 10027 495 10522 469 2721
3 2002 4101 729 4830 498 3155
4 2003 10255 3106 13361 443 2931
5 2004 25347 19591 44938 370 2426
6 2005 96790 8251 105041 469 1834
7 2006 121937 115994 237931 599 3368
8 2007 5789 875 6664 421 2191
9 2008 15568 2286 17854 488 2703
10 2009 5227 1090 6317 333 1354
11 2010 9246 2222 11468 425 1855
12 2011 20889 2348 23237 1002 7792

The distribution of damages, injuries and fatalities is very skewed. The calculation below shows the proportion of crop damages, property damages, fatalities and injuries that the top 100 events in each category account for.

#Property damage top 100
prop.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0, .N]
prop.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0, sum(property_damage)]
prop.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0][order(-property_damage)][1:100, sum(property_damage)]

#Crop damage top 100
crop.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0, .N]
crop.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0, sum(crop_damage)]
crop.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0][order(-crop_damage)][1:100, sum(crop_damage)]

#Fatalities
fatal.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0, .N]
fatal.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0, sum(FATALITIES)]
fatal.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0][order(-FATALITIES)][1:100, sum(FATALITIES)]

#Injuries
injury.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0, .N]
injury.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0, sum(INJURIES)]
injury.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0][order(-INJURIES)][1:100, sum(INJURIES)]

#Report percentages
cat('Top 100 Property events (',format(10000/prop.num, digits=2)
    ,'%) account for' ,format(100*prop.100/prop.tot, digits=0)
    ,'% of damages\n')
cat('Top 100 Crop damaging events (',format(10000/crop.num, digits=2)
    ,'%) account for', format(100*crop.100/crop.tot, digits=0)
    ,'% of damages\n')
cat('Top 100 Fatal events (', format(10000/fatal.num, digits=2)
    ,'%) account for', format(100*fatal.100/fatal.tot, digits=0)
    ,'% of fatalities\n')
cat('Top 100 Injuring events (', format(10000/injury.num, digits=2)
    ,'%) account for', format(100*injury.100/injury.tot, digits=0)
    ,'% of injuries\n')
## Top 100 Property events ( 0.068 %) account for 83 % of damages
## Top 100 Crop damaging events ( 1 %) account for 98 % of damages
## Top 100 Fatal events ( 2.8 %) account for 24 % of fatalities
## Top 100 Injuring events ( 1.6 %) account for 40 % of injuries

To look at which states are most affected by Weather events, the three graph below shows each state with the colour of the state depedent on the damages, injuries and fatalities per unit of population in the state

data(state)
abb.population <- data.table(STATE=state.abb, POP=state.x77[,"Population"], STATE_NAME=tolower(state.name))

#summarise the storms since 200 by state
storms_state <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Property=sum(property_damage, na.rm=TRUE)/1000000
                ,Crops=sum(crop_damage, na.rm=TRUE)/1000000
                ,Combined=(sum(property_damage, na.rm=TRUE)+sum(crop_damage, na.rm=TRUE))/1000000
                ,Fatalities=sum(FATALITIES)
                ,Injuries=sum(INJURIES)) 
            ,by=STATE]

#add in the population to the table
storms_state <- merge(storms_state,abb.population, by='STATE')

#add in values per unit population
storms_state[,Property_pp:= Property*1000/POP/11]
storms_state[,Crop_pp:= Crops*1000/POP/11]
storms_state[,Fatalities_pm:= Fatalities*1000/POP/11]
storms_state[,Injuries_pm:= Injuries*1000/POP/11]


#add in the boundaries file
boundaries <- as.data.table(map_data("state"))
setnames(boundaries, old='region', new='STATE_NAME')
plot.data <- merge(boundaries, storms_state, by='STATE_NAME')

#map of Total Damages
gg_damages <- ggplot(data=plot.data) + 
                geom_polygon(aes(x=long, y=lat, group = group, fill=Property_pp + Crop_pp),colour="white") +
                scale_fill_gradient2(name="$Damages"
                    ,low = "darkgreen"
                    ,mid = "grey50"
                    ,high = "darkred"
                    ,midpoint = median(storms_state$Property_pp + storms_state$Property_pp)
                    ,limits = quantile(storms_state$Property_pp + storms_state$Property_pp, c(0.25,0.75))
                    ,space = "rgb"
                    , na.value = "grey50"
                    ) +
                labs(title='Total Crop and Property Damages since 2000 per person p.a.')

#map of Fatalities per million population
gg_fatalities <- ggplot(data=plot.data) + 
                    geom_polygon(aes(x=long, y=lat, group = group, fill=Fatalities_pm),colour="white") +
                    scale_fill_gradient2(name="Fatalities"
                        ,low = "darkgreen"
                        ,mid = "grey50"
                        ,high = "darkred"
                        ,midpoint = median(storms_state$Fatalities_pm)
                        ,limits = quantile(storms_state$Fatalities_pm, c(0.25,0.75))
                        ,space = "rgb"
                        , na.value = "grey50"
                    ) +
                labs(title='Weather related Fatalities per million population p.a. since 2000')

#map of Injuries per million population
gg_injuries <- ggplot(data=plot.data) + 
                    geom_polygon(aes(x=long, y=lat, group = group, fill=Injuries_pm),colour="white") +
                    scale_fill_gradient2(name="Injuries"
                    ,low = "darkgreen"
                    ,mid = "grey50"
                    ,high = "darkred"
                    ,midpoint = median(storms_state$Injuries_pm)
                    ,limits = quantile(storms_state$Injuries_pm, c(0.25,0.75))
                    ,space = "rgb"
                    , na.value = "grey50"
                ) +
            labs(title='Weather related Injuries per million population p.a. since 2000')
print(gg_damages)    

print(gg_fatalities)

print(gg_injuries)

Conclusion

Tornadoes are the most damaging weather events. The Tornado Alley states clearly show up in heat maps of damages, fatalities and (to a lesser extent) injuries. 2006 was an extreme year for damages for both high crop and property damages but was only the second highest for injuries and Fatalities.

From a public policy perspective, the 100 worst events cause disproportionately the most damage, so authorities should concentrate their efforts on catastrophe planning and preparation for major events, particularly in the Tornado alley states.