What types of weather causes the greatest human and financial losses?

Synopsis

In this report I use data from the US National Weather Service to examine the human and financial costs of extreme weather events in the USA between 1950 and 2011. The weather events recorded by the National Weather Service are classified into 11 groups and the total loss of life, injuries, value of property damage, and value of crop damage are examined per group. I also look at how the loss of life caused by each type of extreme weather varies by season. The aim is not to focus on the detailed numbers, but to look at the scale of the losses and the relative damage cause by different types of weather.

The value of conclusions drawn from any analysis depends on the quality of the data and ensuring the analysis is undertaken with an understanding of the form of the data. I therefore spend some time investigating the data before grouping the weather events. In particular the groups chosen should be discussed to ensure they are appropriate for the use to which the analysis is put.

The broad conclusion of this analysis is that wind related events (tornados, hurricanes, etc.) cause the most loss of life, injuries while floods cause the most financial loss.

Data processing

Loading and preprocessing the data

I start by loading the necessary packages and setting some options.

# (message=FALSE suppresses package startup messages.)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
options(scipen = 6, digits = 1)

Next I read in the data. The tbl_df wrapper to the data frame is a convenience to prevent accidentally printing excessively large amount of data to the console. I used colClasses in read.table to define the columns I wanted and to set the type of the variables - this speeds up the process by about 50% (which I checked with system.time)

setwd("~/Dropbox/jhDataScience/RepData/RepData_PeerAssessment2")
# Read in data set.
col_classes <- rep("NULL",37)
col_classes[2] <- "character"  # BGN_DATE
col_classes[8] <- "factor"     # EVTYPE
col_classes[23] <- "numeric"   # FATALITES
col_classes[24] <- "numeric"   # INJURIES
col_classes[25] <- "numeric"   # PROPDMG
col_classes[26] <- "character" # PROPDMGEXP
col_classes[27] <- "numeric"   # CROPDMG
col_classes[28] <- "character" # CROPDMGEXP
storm1 <- tbl_df(read.table("repdata_data_StormData.csv.bz2", sep = ",", 
                           header=T, nrows=902300, comment.char="",
                           colClasses=col_classes))

Then filter out the rows where EVTYPE contains “summary” since these probably do not refer to a single event. (In this and all subsequent operations involving EVTYPE I will ignore case, i.e. references to “summary” include “Summary” and “SUMMARY” or even “summarY” etc.)

storm <- filter(storm1, !grepl("summary", EVTYPE, ignore.case=TRUE))

As part of the analysis I will look at how the events vary by season, so here I create a factor variable SEASON, defining winter as the first three months of the year and the other seasons consistently with that. Other definitions of the seasons are, of course possible - this is the one used by the UK Met Office.

storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
seasons <- c("winter", "spring", "summer", "autumn")
storm <- mutate(storm, SEASON=seasons[(month(BGN_DATE)+2)%/%3])
storm$SEASON <- factor(storm$SEASON, 
                       levels=(c("winter", "spring", "summer", "autumn")))

Data analysis and grouping

Before we go any further, let’s look at how the number of extreme weather events recorded has evolved over time.

evol <- hist(year(storm$BGN_DATE), plot=F)
evolu <- data.frame("Start of period"=evol$breaks[-length(evol$breaks)],
                    "Number of events"=evol$counts)
kable(evolu, caption="Table 1,
      Number of extreme events in five year intervals")

Table 1, Number of extreme events in five year intervals
Start.of.period	Number.of.events
1950	3278
1955	9858
1960	11806
1965	14529
1970	20463
1975	21578
1980	35285
1985	44706
1990	87264
1995	164762
2000	189554
2005	236964
2010	62174

The number of recorded events increases in every interval (apart from the last where there isn’t a full five year’s worth of data. This might be because the number of extreme weather events has actually increased, or it might be because proportion of events captured by the recording process has increased. From this data set we have no way of knowing, but it is worthy of further investigation. For the purpose of this report I’ll work with the aggregated data over the whole time period.

The next step is to apply the “multipliers” in PROPDMGEXP and CROPDMGEXP. (Ignoring anything that isn’t K, M, or B.)

storm$PROPDMG[tolower(storm$PROPDMGEXP)=="k"] = 
    storm$PROPDMG[tolower(storm$PROPDMGEXP)=="k"]*1e3
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="m"] = 
    storm$PROPDMG[tolower(storm$PROPDMGEXP)=="m"]*1e6
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="b"] = 
    storm$PROPDMG[tolower(storm$PROPDMGEXP)=="b"]*1e9
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="k"] = 
    storm$CROPDMG[tolower(storm$CROPDMGEXP)=="k"]*1e3
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="m"] = 
    storm$CROPDMG[tolower(storm$CROPDMGEXP)=="m"]*1e6
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="b"] = 
    storm$CROPDMG[tolower(storm$CROPDMGEXP)=="b"]*1e9

Now I add a column called “WEATHERGRP”. This column will eventually contain a factor variable which will clasify the events into a manageable number of groups. Initially it’s populated with NAs.

storm$WEATHERGRP <- rep(NA, nrow(storm))

We now start to classify the events into groups. We will define a list of groups (see Table 2)

The order of these groups is important. As we will see, to classify the events we will work through the list, searching EVTYPE for the keywords we have associated with each weather group. Once an event has been classified it won’t be moved so, for example, something containing the words “coastal” and “flood” will be classified as “coast” because that word will be picked up before flood is searched for.

# Group names are the first entry for each vector in the list defined below.
# Subsequent entries are used to assign weather events to that group.
# Where a "stem" has been entered rather than a whole word, that is to catch
# events in the original data that might have similar but not identical labels
# (including mis-spellings.)
wgroups <- list(c("marine", "sea"),
                c("coast", "tide", "surf", "current", "tsunami", "surge", 
                  "seiche", "swell", "high water"),
                c("flood", "fld", "high water"),
                c("wind", "hurricane", "tornado", "typhoon", "tropical storm",
                  "spout", "tropical depression", "funnel", "gustnado"),
                c("precip","snow", "sleet", "blizzard", "rain", "precip", "wet",
                  "shower", "burst", "hail"),
                c("lightning", "thun", "tstm"),
                c("visibility", "fog", "dust", "cloud", "smoke", "ash"),
                c("cold", "cool", "frost", "freez", "ice", "icy", "hypoth",
                  "wint", "glaze", "low"),
                c("heat", "hot", "warm", "fire", "dry", "drought", "high"),
                c("landslide", "avalanche", "slide"))

Now classify each event:

matched <- 0
records <- nrow(storm)
matches <- data.frame(Group=character(),
                      Match_pattern=character(),
                      Matched=numeric(),
                      Percent_matched=numeric(),
                      Total_percent_matched=numeric())
for (v in wgroups){
    pattern <- paste(v, collapse="|")
    test <- grepl(pattern, storm$EVTYPE, ignore.case=TRUE) & 
            is.na(storm$WEATHERGRP)
    storm$WEATHERGRP[test] <- v[1]
    this_match <- sum(test)
    matched <- matched + this_match
    prop_matched <- this_match/records*100
    cuml_prop_matched <- matched/records*100
    new_matches <- data.frame(Group=v[1],
                              Match_pattern=pattern,
                              Matched=this_match,
                              Percent_matched=prop_matched,
                              Total_percent_matched=cuml_prop_matched)
    matches <- rbind(matches, new_matches)
}
storm$WEATHERGRP[is.na(storm$WEATHERGRP)] <- "other"
storm$WEATHERGRP <- factor(storm$WEATHERGRP)
kable(matches, caption="Table 2, Weather group search terms and percentage in each group")

Table 2, Weather group search terms and percentage in each group
Group	Match_pattern	Matched	Percent_matched	Total_percent_matched
marine	marine\|sea	12907	1.4	1
coast	coast\|tide\|surf\|current\|tsunami\|surge\|seiche\|swell\|high water	3454	0.4	2
flood	flood\|fld\|high water	85275	9.5	11
wind	wind\|hurricane\|tornado\|typhoon\|tropical storm\|spout\|tropical depression\|funnel\|gustnado	425304	47.1	58
precip	precip\|snow\|sleet\|blizzard\|rain\|precip\|wet\|shower\|burst\|hail	321775	35.7	94
lightning	lightning\|thun\|tstm	15856	1.8	96
visibility	visibility\|fog\|dust\|cloud\|smoke\|ash	2529	0.3	96
cold	cold\|cool\|frost\|freez\|ice\|icy\|hypoth\|wint\|glaze\|low	24283	2.7	99
heat	heat\|hot\|warm\|fire\|dry\|drought\|high	9618	1.1	100
landslide	landslide\|avalanche\|slide	1032	0.1	100

Our classification of events for further analysis is given in Table 2 above. This table is an important part of this report because the classification should be discussed with weather experts and users of the report and amended if desired.

Now we have classified everything, let’s look at the entries dumped into “other” that contain fatalities or injuries. (Many of these have EVTYPE=“OTHER”, I’ve removed them from the list below to make it more readable.)

storm %>%
    filter(WEATHERGRP=="other", FATALITIES>0|INJURIES>0|PROPDMG>0|CROPDMG>0,
                                !EVTYPE=="OTHER", !EVTYPE=="Other") %>%
    select(BGN_DATE, EVTYPE, WEATHERGRP,FATALITIES,INJURIES,PROPDMG,CROPDMG)

## Source: local data frame [26 x 7]
## 
##      BGN_DATE                EVTYPE WEATHERGRP FATALITIES INJURIES PROPDMG
## 1  1993-03-31     SEVERE TURBULENCE      other          0        0   50000
## 2  1994-07-30         APACHE COUNTY      other          0        0    5000
## 3  1995-09-07             LIGNTNING      other          0        0    5000
## 4  1994-03-24               TORNDAO      other          0        0     500
## 5  1993-01-09              AVALANCE      other          1        0       0
## 6  1993-08-27       URBAN AND SMALL      other          0        0    5000
## 7  1994-12-10             HEAVY MIX      other          0        0  100000
## 8  1995-02-26             HEAVY MIX      other          0        0   50000
## 9  1994-04-06             HEAVY MIX      other          0        0    5000
## 10 1994-11-27             HEAVY MIX      other          0        0   50000
## 11 1994-02-21              LIGHTING      other          0        0    5000
## 12 1995-06-30  RAPIDLY RISING WATER      other          1        0       0
## 13 1994-04-06             HEAVY MIX      other          0        0   50000
## 14 1995-02-27             HEAVY MIX      other          0        0  500000
## 15 1994-11-27             HEAVY MIX      other          0        0   50000
## 16 1994-03-09             HEAVY MIX      other          0        0  500000
## 17 1994-02-09                     ?      other          0        0    5000
## 18 1994-08-28    URBAN/SMALL STREAM      other          0        0    5000
## 19 1994-10-13           URBAN SMALL      other          0        0      50
## 20 1996-11-15         Beach Erosion      other          0        0  100000
## 21 1996-04-16             Landslump      other          0        0  570000
## 22 1997-05-04             DAM BREAK      other          0        0    2000
## 23 1998-11-11 HYPERTHERMIA/EXPOSURE      other          1        0       0
## 24 2000-07-17             DAM BREAK      other          0        0 1000000
## 25 2001-01-04            ROGUE WAVE      other          0        2       0
## 26 2002-05-02              DROWNING      other          1        0       0
## Variables not shown: CROPDMG (dbl)

One or two of these look like they can be classified unambiguously, but I haven’t done so at this stage.

Results

We are now in a position to group our events by the variable WEATHERGRP and also by SEASON. For clarity I have split the data into two: one data set summarizes the human costs and the other summarizes property and crop damage.

storm <- mutate(storm, DAMAGE=PROPDMG + CROPDMG)
by_wgroup <- group_by(storm, WEATHERGRP, SEASON)
human_cost <- summarize(by_wgroup, total_fatalities=sum(FATALITIES), 
                            total_injuries=sum(INJURIES))
human_cost$WEATHERGRP <- reorder(human_cost$WEATHERGRP, -human_cost$total_fatalities)
human_cost <- gather(human_cost, loss, people, -WEATHERGRP, -SEASON)
damage_cost <- summarize(by_wgroup, property_damage=sum(PROPDMG)/1e9,
                            crop_damage=sum(CROPDMG)/1e9,
                            total_damage=sum(DAMAGE)/1e9)
damage_cost$WEATHERGRP <- reorder(damage_cost$WEATHERGRP, -damage_cost$total_damage)
damage_cost <- gather(damage_cost, loss, dollars, -WEATHERGRP, -SEASON)

Human costs

ggplot(human_cost, aes(x=WEATHERGRP, y=people, fill="red")) + 
    geom_bar(stat="identity") +
    facet_grid(loss~., scales="free_y") +
    scale_x_discrete(human_cost$WEATHERGRP, name="Extreme weather group") +
    ggtitle('Figure 1, Deaths and injuries from extreme weather 1950 - 2009') +
    ylab('Number of deaths or injuries') +
    theme(legend.position = "none")

Figure 1, above, shows the total number of deaths and injuries for each type of weather over the whole period. It is clear that wind related events (particularly typhoons and hurricanes) cause most deaths and injuries.

Note that the vertical scale is different in the top and bottom charts.

Property and crop damage

ggplot(damage_cost, aes(x=WEATHERGRP, y=dollars, fill="blue")) + 
    geom_bar(stat="identity") +
    facet_grid(loss~., scales="free_y") +
    #scale_x_discrete(damage_cost$WEATHERGRP, name="Extreme weather group") +
    ggtitle('Figure 2, Cost of damage from extreme weather 1950 - 2009') +
    ylab('Cost (USD bn)') +
    theme(legend.position = "none")

Figure 2, above, shows that flood is the biggest cause of total financial losses due to property and crop damage, closely followed by wind. The biggest cause of crop damage is heat. (As with Figure 1, the vertical scale varies. In particular the scale on the top and bottom charts is ten times that of the middle chart - which is why the total losses closely follow the property damage costs.)

Variation of deaths by season

ggplot(filter(human_cost, loss=="total_fatalities"), aes(x=WEATHERGRP, y=people, fill=SEASON)) + 
    geom_bar(stat="identity", position="dodge") +
    #facet_grid(SEASON~., scales="free_y") +
    #facet_grid(SEASON~.) +
    scale_x_discrete(human_cost$WEATHERGRP, name="Extreme weather group") +
    ggtitle('Figure 3, Deaths from extreme weather 1950 - 2009') +
    #xlab('Extreme weather group') +
    ylab('Number of deaths') +
    theme(legend.position = "right")

Figure 3 shows how the number of deaths varies by season.