Analysis of US Storm Data

Overview

This article is a submission for the Coursera course “Reproducible Research”. It is an analysis of US Storm data from the National Oceanic and Atmospheric Administration’s (NOAA) storm database answering
set questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Synopsis

The weather event types are washed and the the period since 2000 looked at in greater detail. Tornadoes (under various desciptions) were the events responsible for the most Fatalities, most injuries and the highest economic impact. The year of 2006 had the highest weather related costs, but 2011 had the highest numbers of fatalities and injuries.

The most dangerous States are those in Tornado alley. Although for Injuries the northern states start ranking. The distribution of impacts is skewed with larger events resulting in the bulk of claims.

Data Processing

The code below loads the storms data from a Data subfolder and converts some of the fields to dates. As we are looking at time series, and time relevant data, entries which have a missing beginning and end dates are removed.

The data records contain summary information that is already included in other rows. The documentation states:

Additional summary information on cumulative fatalities, injuries, or damages from previous months can be explained in the event narrative of the Storm Data entry for the final month of the event.

To prevent double counting any record with the word Summary in the EVTYPE field is dropped.

Finally the data.frame is converted to use the data.table package.

Packages

require(data.table)
require(xtable)

## Warning: package 'xtable' was built under R version 3.1.2

require(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.2

require(datasets)

Load the data

setwd('C:/Coursera/Reproducible_Research/Assignment2')
storms <- read.csv(bzfile('./Data/repdata-data-StormData.csv.bz2'), stringsAsFactors=FALSE)

#Convert the date fields to dates
storms$BGN_DATE <- as.Date(storms$BGN_DATE, "%m/%d/%Y")
storms$END_DATE <- as.Date(storms$END_DATE, "%m/%d/%Y")
storms$END_DATE <- as.Date(storms$END_DATE, "%m/%d/%Y")

#Drop missing date records and summary record
drop <- grepl('summary', storms$EVTYPE, ignore.case=TRUE) | is.na(storms$BGN_DATE)
storms <- storms[!drop,] 

#Convert the EVTYPE to a factor
storms$EVTYPE<-as.factor(storms$EVTYPE)

#Convert storms to a data.table
storms <- as.data.table(storms)

It is clear from a list of the different types of events that there are many alternate names for the same type of event. For example the list shown below is for all events including the words WATERSPOUT and TORNADO

event.types <- levels(storms$EVTYPE)
tornadoes <- event.types[grepl('WATERSPOUT', event.types)|grepl('WATERSPOUT', event.types)]
print(xtable(as.data.frame(tornadoes)), type='html')

	tornadoes
1	WATERSPOUT
2	DUST DEVIL WATERSPOUT
3	TORNADO/WATERSPOUT
4	WATERSPOUT
5	WATERSPOUT-
6	WATERSPOUT-TORNADO
7	WATERSPOUT FUNNEL CLOUD
8	WATERSPOUT TORNADO
9	WATERSPOUT/
10	WATERSPOUT/ TORNADO
11	WATERSPOUT/TORNADO
12	WATERSPOUTS

The events were grouped based on their wording as follows:

event.groups <- rep(NA_character_, length(event.types))  #mapping from event.type to a group name
names(event.groups) <- event.types

#Populate the map definition based on specific words/regular expressions
event.map <- matrix(c(
    #group name             #regular Expression
    'TORNADO'               ,'(tornado)|(spout)|(dust.*devil)|(whirlwind)|(torndao)'
    ,'TORNADO'              ,'(wall cloud)|(funnel)'
    ,'VOLCANIC'             ,'volcan(o|ic)'
    ,'TIDE/SURGE'           ,'(beach)|(coastal)|(surge)|(tsunami)'
    ,'TIDE/SURGE'           ,'(wave)|(surf)|(rip)|(high water)|(tide)|(seiche)'
    ,'SEA CONDITIONS'       ,'(swell)|(heavy sea)|(drown)|(high sea)|(rough sea)|(marine)'
    ,'FLOOD'                ,'(fl(o)*d)|(urban)|(rising water)|(stream)'
    ,'SNOW/COLD'            ,'(snow)|(blizzard)|(winter)|(wintry)|(chill)|(cold)|(frost)'
    ,'SNOW/COLD'            ,'(cool)|(low temp)|(ice)|(icy)|(sleet)|(hypothermia)|(freez)|(record low)'
    ,'AVALANCHE/LANDSLIDE'  ,'(avalanc)|((land)|(rock)|(mud).*(slide)|(slump)|(slip))|(dam (break|fail))'
    ,'HAIL'                 ,'hail'
    ,'HURRICANE/STORM/RAIN' ,'(t(h)?under)|(storm)|(depression)|(hurricane)|(shower)'
    ,'HURRICANE/STORM/RAIN' ,'(rain)|(microburst)|(wet)|(precip)|(downburst)|(typhoon)'
    ,'LIGHTNING'            ,'LIG(H)?T(N)?ING'
    ,'HOT/DRY/DUST'         ,'(hot)|(heat)|(warm)|(high temp)|(hyperthermia)|(dry)|(record temp)|(record high)'
    ,'HOT/DRY/DUST'         ,'(dust)|(driest)|(drought)|(mild)'
    ,'FIRE/SMOKE'           ,'(fire)|(smoke)'
    ,'WIND'                 ,'(wi?nd)|(gust)|(gail)|(turbulence)'
    ,'FOG/CLOUD'            ,'(fog)|(vog)|(cloud)'
    ,'OTHER'                ,'.*'                 #anything left over
    ), ncol=2, byrow=TRUE)

for (i in seq(nrow(event.map))){
    event.groups[is.na(event.groups) & grepl(event.map[i,2], event.types, ignore.case=TRUE)] <- event.map[i,1]
}

#Add the Event_group as a new field in storms
storms[, Event_group:= event.groups[EVTYPE]]

Analysis

Human Impact

What types of Events cause the most fatalities? The table below shows the top 10 event groups, the number of events with at least one fatality, and the number of recorded fatalities (since the start of the data in 1950)

table1 <- storms[FATALITIES>0
            ,.(Fatalities=sum(FATALITIES), Fatal_Events=.N)
            ,by=Event_group][order(-Fatalities)][1:10]

#print the table with xtable
print(xtable(table1, caption='Table 1: Top 10 Event types causing fatalities', digits=0)
      ,type='html')

Table 1: Top 10 Event types causing fatalities
	Event_group	Fatalities	Fatal_Events
1	TORNADO	5667	1609
2	HOT/DRY/DUST	2998	771
3	FLOOD	1548	1005
4	SNOW/COLD	1130	766
5	TIDE/SURGE	993	664
6	WIND	935	744
7	LIGHTNING	817	760
8	HURRICANE/STORM/RAIN	531	318
9	AVALANCHE/LANDSLIDE	269	190
10	FIRE/SMOKE	90	38

If we limit the data to events since 2000 (as the older data is less reliable), the Top 10 changes to:

table2 <- storms[FATALITIES>0 & BGN_DATE >= as.Date("1/1/2000", "%m/%d/%Y")
           ,.(Fatalities=sum(FATALITIES), Fatal_Events=.N)
           ,by=Event_group][order(-Fatalities)][1:10]

#print the table
print(xtable(table2, caption='Table 2: Top 10 Event types causing fatalities since 2000',digits=0)
      ,type='html')

Table 2: Top 10 Event types causing fatalities since 2000
	Event_group	Fatalities	Fatal_Events
1	HOT/DRY/DUST	1244	501
2	TORNADO	1197	356
3	FLOOD	878	612
4	TIDE/SURGE	635	501
5	SNOW/COLD	516	380
6	LIGHTNING	466	434
7	WIND	358	321
8	HURRICANE/STORM/RAIN	299	192
9	AVALANCHE/LANDSLIDE	216	154
10	FIRE/SMOKE	84	34

Ranking the impact of events by numbers of Injuries instead of Fatalities is as follows:

table3 <- storms[INJURIES>0 & BGN_DATE >= as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Injuries=sum(INJURIES), Events_with_Injuries=.N)
            ,by=Event_group][order(-Injuries)][1:10]

#print the table
print(xtable(table3, caption='Table 3: Top 10 Event types causing injury since 2000', digits=0)
      ,type='html')

Table 3: Top 10 Event types causing injury since 2000
	Event_group	Injuries	Events_with_Injuries
1	TORNADO	15250	1349
2	HOT/DRY/DUST	4937	134
3	HURRICANE/STORM/RAIN	3387	702
4	LIGHTNING	2993	1574
5	WIND	2761	1227
6	SNOW/COLD	1414	212
7	FIRE/SMOKE	1199	256
8	FLOOD	1155	283
9	TIDE/SURGE	733	227
10	HAIL	579	136

So, the types of events that rank in the top ten for both deaths and injuries are:

in.both <- table2$Event_group[table2$Event_group %in% table3$Event_group]
table4 <- storms[Event_group %in% in.both & BGN_DATE >= as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Fatalities=sum(FATALITIES), Injuries=sum(INJURIES), Events=.N)
            ,by=Event_group][order(-Fatalities-Injuries)]

#print the table
print(xtable(table4, caption='Table 4: Major Human Impacts Event types since 2000', digits=0)
      ,type='html')

Table 4: Major Human Impacts Event types since 2000
	Event_group	Fatalities	Injuries	Events
1	TORNADO	1197	15250	25004
2	HOT/DRY/DUST	1244	4937	4145
3	HURRICANE/STORM/RAIN	299	3387	92458
4	LIGHTNING	466	2993	9686
5	WIND	358	2761	105209
6	FLOOD	878	1155	61938
7	SNOW/COLD	516	1414	37120
8	TIDE/SURGE	635	733	2673
9	FIRE/SMOKE	84	1199	3687

Economic Impact

An estimate of the economic impact is included in the data for each event. This is in the form of a numeric value as well as an exponent(B,M,K,…). Unfortunately the data contains many non-defined values for the exponent. For the purpose of this analysis the values of B,M,K are used, and all other values are assumed to be units.

scale <- c(
    "K" = 1000
    ,"k" = 1000
    ,"M" = 1000000
    ,"m" = 1000000
    ,"B" = 1000000000
    ,"b" = 1000000000
    )

#Add in a dollar figure for the Property and Crop Damage Amounts 
storms[PROPDMG>0, property_damage:= scale[PROPDMGEXP] * PROPDMG]        #scale up units if valid
storms[PROPDMG>0 & is.na(property_damage), property_damage:= PROPDMG]   #otherwise scale=1
storms[CROPDMG>0, crop_damage:= scale[PROPDMGEXP] * PROPDMG]
storms[CROPDMG>0 & is.na(crop_damage), crop_damage:= PROPDMG]

What are the crop and property damage estimates yearly since 2000?

The table below shows the damage for crops, property and combined (in $millions) for each year since 2000.

table5 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Property=sum(property_damage, na.rm=TRUE)/1000000
                ,Crops=sum(crop_damage, na.rm=TRUE)/1000000
                ,Combined=(sum(property_damage, na.rm=TRUE)+sum(crop_damage, na.rm=TRUE))/1000000
                ,Fatalities = sum(FATALITIES)
                ,Injuries = sum(INJURIES))
            ,by=year(BGN_DATE)]

#print the table
print(xtable(table5, caption='Table 5: Economic Impacts', digits=0)
      ,type='html')

Table 5: Economic Impacts
	year	Property	Crops	Combined	Fatalities	Injuries
1	2000	5621	659	6280	477	2801
2	2001	10027	495	10522	469	2721
3	2002	4101	729	4830	498	3155
4	2003	10255	3106	13361	443	2931
5	2004	25347	19591	44938	370	2426
6	2005	96790	8251	105041	469	1834
7	2006	121937	115994	237931	599	3368
8	2007	5789	875	6664	421	2191
9	2008	15568	2286	17854	488	2703
10	2009	5227	1090	6317	333	1354
11	2010	9246	2222	11468	425	1855
12	2011	20889	2348	23237	1002	7792

The distribution of damages, injuries and fatalities is very skewed. The calculation below shows the proportion of crop damages, property damages, fatalities and injuries that the top 100 events in each category account for.

#Property damage top 100
prop.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0, .N]
prop.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0, sum(property_damage)]
prop.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & property_damage>0][order(-property_damage)][1:100, sum(property_damage)]

#Crop damage top 100
crop.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0, .N]
crop.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0, sum(crop_damage)]
crop.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & crop_damage>0][order(-crop_damage)][1:100, sum(crop_damage)]

#Fatalities
fatal.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0, .N]
fatal.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0, sum(FATALITIES)]
fatal.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & FATALITIES>0][order(-FATALITIES)][1:100, sum(FATALITIES)]

#Injuries
injury.num <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0, .N]
injury.tot <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0, sum(INJURIES)]
injury.100 <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y') & INJURIES>0][order(-INJURIES)][1:100, sum(INJURIES)]

#Report percentages
cat('Top 100 Property events (',format(10000/prop.num, digits=2)
    ,'%) account for' ,format(100*prop.100/prop.tot, digits=0)
    ,'% of damages\n')
cat('Top 100 Crop damaging events (',format(10000/crop.num, digits=2)
    ,'%) account for', format(100*crop.100/crop.tot, digits=0)
    ,'% of damages\n')
cat('Top 100 Fatal events (', format(10000/fatal.num, digits=2)
    ,'%) account for', format(100*fatal.100/fatal.tot, digits=0)
    ,'% of fatalities\n')
cat('Top 100 Injuring events (', format(10000/injury.num, digits=2)
    ,'%) account for', format(100*injury.100/injury.tot, digits=0)
    ,'% of injuries\n')

## Top 100 Property events ( 0.068 %) account for 83 % of damages
## Top 100 Crop damaging events ( 1 %) account for 98 % of damages
## Top 100 Fatal events ( 2.8 %) account for 24 % of fatalities
## Top 100 Injuring events ( 1.6 %) account for 40 % of injuries

To look at which states are most affected by Weather events, the three graph below shows each state with the colour of the state depedent on the damages, injuries and fatalities per unit of population in the state

data(state)
abb.population <- data.table(STATE=state.abb, POP=state.x77[,"Population"], STATE_NAME=tolower(state.name))

#summarise the storms since 200 by state
storms_state <- storms[BGN_DATE>as.Date('1/1/2000', '%m/%d/%Y')
            ,.(Property=sum(property_damage, na.rm=TRUE)/1000000
                ,Crops=sum(crop_damage, na.rm=TRUE)/1000000
                ,Combined=(sum(property_damage, na.rm=TRUE)+sum(crop_damage, na.rm=TRUE))/1000000
                ,Fatalities=sum(FATALITIES)
                ,Injuries=sum(INJURIES)) 
            ,by=STATE]

#add in the population to the table
storms_state <- merge(storms_state,abb.population, by='STATE')

#add in values per unit population
storms_state[,Property_pp:= Property*1000/POP/11]
storms_state[,Crop_pp:= Crops*1000/POP/11]
storms_state[,Fatalities_pm:= Fatalities*1000/POP/11]
storms_state[,Injuries_pm:= Injuries*1000/POP/11]


#add in the boundaries file
boundaries <- as.data.table(map_data("state"))
setnames(boundaries, old='region', new='STATE_NAME')
plot.data <- merge(boundaries, storms_state, by='STATE_NAME')

#map of Total Damages
gg_damages <- ggplot(data=plot.data) + 
                geom_polygon(aes(x=long, y=lat, group = group, fill=Property_pp + Crop_pp),colour="white") +
                scale_fill_gradient2(name="$Damages"
                    ,low = "darkgreen"
                    ,mid = "grey50"
                    ,high = "darkred"
                    ,midpoint = median(storms_state$Property_pp + storms_state$Property_pp)
                    ,limits = quantile(storms_state$Property_pp + storms_state$Property_pp, c(0.25,0.75))
                    ,space = "rgb"
                    , na.value = "grey50"
                    ) +
                labs(title='Total Crop and Property Damages since 2000 per person p.a.')

#map of Fatalities per million population
gg_fatalities <- ggplot(data=plot.data) + 
                    geom_polygon(aes(x=long, y=lat, group = group, fill=Fatalities_pm),colour="white") +
                    scale_fill_gradient2(name="Fatalities"
                        ,low = "darkgreen"
                        ,mid = "grey50"
                        ,high = "darkred"
                        ,midpoint = median(storms_state$Fatalities_pm)
                        ,limits = quantile(storms_state$Fatalities_pm, c(0.25,0.75))
                        ,space = "rgb"
                        , na.value = "grey50"
                    ) +
                labs(title='Weather related Fatalities per million population p.a. since 2000')

#map of Injuries per million population
gg_injuries <- ggplot(data=plot.data) + 
                    geom_polygon(aes(x=long, y=lat, group = group, fill=Injuries_pm),colour="white") +
                    scale_fill_gradient2(name="Injuries"
                    ,low = "darkgreen"
                    ,mid = "grey50"
                    ,high = "darkred"
                    ,midpoint = median(storms_state$Injuries_pm)
                    ,limits = quantile(storms_state$Injuries_pm, c(0.25,0.75))
                    ,space = "rgb"
                    , na.value = "grey50"
                ) +
            labs(title='Weather related Injuries per million population p.a. since 2000')

print(gg_damages)

print(gg_fatalities)

print(gg_injuries)

Conclusion

Tornadoes are the most damaging weather events. The Tornado Alley states clearly show up in heat maps of damages, fatalities and (to a lesser extent) injuries. 2006 was an extreme year for damages for both high crop and property damages but was only the second highest for injuries and Fatalities.

From a public policy perspective, the 100 worst events cause disproportionately the most damage, so authorities should concentrate their efforts on catastrophe planning and preparation for major events, particularly in the Tornado alley states.