In this report I use data from the US National Weather Service to examine the human and financial costs of extreme weather events in the USA between 1950 and 2011. The weather events recorded by the National Weather Service are classified into 11 groups and the total loss of life, injuries, value of property damage, and value of crop damage are examined per group. I also look at how the loss of life caused by each type of extreme weather varies by season. The aim is not to focus on the detailed numbers, but to look at the scale of the losses and the relative damage cause by different types of weather.
The value of conclusions drawn from any analysis depends on the quality of the data and ensuring the analysis is undertaken with an understanding of the form of the data. I therefore spend some time investigating the data before grouping the weather events. In particular the groups chosen should be discussed to ensure they are appropriate for the use to which the analysis is put.
The broad conclusion of this analysis is that wind related events (tornados, hurricanes, etc.) cause the most loss of life, injuries while floods cause the most financial loss.
I start by loading the necessary packages and setting some options.
# (message=FALSE suppresses package startup messages.)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
options(scipen = 6, digits = 1)
Next I read in the data. The tbl_df wrapper to the data frame is a convenience to prevent accidentally printing excessively large amount of data to the console. I used colClasses in read.table to define the columns I wanted and to set the type of the variables - this speeds up the process by about 50% (which I checked with system.time)
setwd("~/Dropbox/jhDataScience/RepData/RepData_PeerAssessment2")
# Read in data set.
col_classes <- rep("NULL",37)
col_classes[2] <- "character" # BGN_DATE
col_classes[8] <- "factor" # EVTYPE
col_classes[23] <- "numeric" # FATALITES
col_classes[24] <- "numeric" # INJURIES
col_classes[25] <- "numeric" # PROPDMG
col_classes[26] <- "character" # PROPDMGEXP
col_classes[27] <- "numeric" # CROPDMG
col_classes[28] <- "character" # CROPDMGEXP
storm1 <- tbl_df(read.table("repdata_data_StormData.csv.bz2", sep = ",",
header=T, nrows=902300, comment.char="",
colClasses=col_classes))
Then filter out the rows where EVTYPE contains “summary” since these probably do not refer to a single event. (In this and all subsequent operations involving EVTYPE I will ignore case, i.e. references to “summary” include “Summary” and “SUMMARY” or even “summarY” etc.)
storm <- filter(storm1, !grepl("summary", EVTYPE, ignore.case=TRUE))
As part of the analysis I will look at how the events vary by season, so here I create a factor variable SEASON, defining winter as the first three months of the year and the other seasons consistently with that. Other definitions of the seasons are, of course possible - this is the one used by the UK Met Office.
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
seasons <- c("winter", "spring", "summer", "autumn")
storm <- mutate(storm, SEASON=seasons[(month(BGN_DATE)+2)%/%3])
storm$SEASON <- factor(storm$SEASON,
levels=(c("winter", "spring", "summer", "autumn")))
Before we go any further, let’s look at how the number of extreme weather events recorded has evolved over time.
evol <- hist(year(storm$BGN_DATE), plot=F)
evolu <- data.frame("Start of period"=evol$breaks[-length(evol$breaks)],
"Number of events"=evol$counts)
kable(evolu, caption="Table 1,
Number of extreme events in five year intervals")
| Start.of.period | Number.of.events |
|---|---|
| 1950 | 3278 |
| 1955 | 9858 |
| 1960 | 11806 |
| 1965 | 14529 |
| 1970 | 20463 |
| 1975 | 21578 |
| 1980 | 35285 |
| 1985 | 44706 |
| 1990 | 87264 |
| 1995 | 164762 |
| 2000 | 189554 |
| 2005 | 236964 |
| 2010 | 62174 |
The number of recorded events increases in every interval (apart from the last where there isn’t a full five year’s worth of data. This might be because the number of extreme weather events has actually increased, or it might be because proportion of events captured by the recording process has increased. From this data set we have no way of knowing, but it is worthy of further investigation. For the purpose of this report I’ll work with the aggregated data over the whole time period.
The next step is to apply the “multipliers” in PROPDMGEXP and CROPDMGEXP. (Ignoring anything that isn’t K, M, or B.)
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="k"] =
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="k"]*1e3
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="m"] =
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="m"]*1e6
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="b"] =
storm$PROPDMG[tolower(storm$PROPDMGEXP)=="b"]*1e9
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="k"] =
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="k"]*1e3
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="m"] =
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="m"]*1e6
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="b"] =
storm$CROPDMG[tolower(storm$CROPDMGEXP)=="b"]*1e9
Now I add a column called “WEATHERGRP”. This column will eventually contain a factor variable which will clasify the events into a manageable number of groups. Initially it’s populated with NAs.
storm$WEATHERGRP <- rep(NA, nrow(storm))
We now start to classify the events into groups. We will define a list of groups (see Table 2)
The order of these groups is important. As we will see, to classify the events we will work through the list, searching EVTYPE for the keywords we have associated with each weather group. Once an event has been classified it won’t be moved so, for example, something containing the words “coastal” and “flood” will be classified as “coast” because that word will be picked up before flood is searched for.
# Group names are the first entry for each vector in the list defined below.
# Subsequent entries are used to assign weather events to that group.
# Where a "stem" has been entered rather than a whole word, that is to catch
# events in the original data that might have similar but not identical labels
# (including mis-spellings.)
wgroups <- list(c("marine", "sea"),
c("coast", "tide", "surf", "current", "tsunami", "surge",
"seiche", "swell", "high water"),
c("flood", "fld", "high water"),
c("wind", "hurricane", "tornado", "typhoon", "tropical storm",
"spout", "tropical depression", "funnel", "gustnado"),
c("precip","snow", "sleet", "blizzard", "rain", "precip", "wet",
"shower", "burst", "hail"),
c("lightning", "thun", "tstm"),
c("visibility", "fog", "dust", "cloud", "smoke", "ash"),
c("cold", "cool", "frost", "freez", "ice", "icy", "hypoth",
"wint", "glaze", "low"),
c("heat", "hot", "warm", "fire", "dry", "drought", "high"),
c("landslide", "avalanche", "slide"))
Now classify each event:
matched <- 0
records <- nrow(storm)
matches <- data.frame(Group=character(),
Match_pattern=character(),
Matched=numeric(),
Percent_matched=numeric(),
Total_percent_matched=numeric())
for (v in wgroups){
pattern <- paste(v, collapse="|")
test <- grepl(pattern, storm$EVTYPE, ignore.case=TRUE) &
is.na(storm$WEATHERGRP)
storm$WEATHERGRP[test] <- v[1]
this_match <- sum(test)
matched <- matched + this_match
prop_matched <- this_match/records*100
cuml_prop_matched <- matched/records*100
new_matches <- data.frame(Group=v[1],
Match_pattern=pattern,
Matched=this_match,
Percent_matched=prop_matched,
Total_percent_matched=cuml_prop_matched)
matches <- rbind(matches, new_matches)
}
storm$WEATHERGRP[is.na(storm$WEATHERGRP)] <- "other"
storm$WEATHERGRP <- factor(storm$WEATHERGRP)
kable(matches, caption="Table 2, Weather group search terms and percentage in each group")
| Group | Match_pattern | Matched | Percent_matched | Total_percent_matched |
|---|---|---|---|---|
| marine | marine|sea | 12907 | 1.4 | 1 |
| coast | coast|tide|surf|current|tsunami|surge|seiche|swell|high water | 3454 | 0.4 | 2 |
| flood | flood|fld|high water | 85275 | 9.5 | 11 |
| wind | wind|hurricane|tornado|typhoon|tropical storm|spout|tropical depression|funnel|gustnado | 425304 | 47.1 | 58 |
| precip | precip|snow|sleet|blizzard|rain|precip|wet|shower|burst|hail | 321775 | 35.7 | 94 |
| lightning | lightning|thun|tstm | 15856 | 1.8 | 96 |
| visibility | visibility|fog|dust|cloud|smoke|ash | 2529 | 0.3 | 96 |
| cold | cold|cool|frost|freez|ice|icy|hypoth|wint|glaze|low | 24283 | 2.7 | 99 |
| heat | heat|hot|warm|fire|dry|drought|high | 9618 | 1.1 | 100 |
| landslide | landslide|avalanche|slide | 1032 | 0.1 | 100 |
Our classification of events for further analysis is given in Table 2 above. This table is an important part of this report because the classification should be discussed with weather experts and users of the report and amended if desired.
Now we have classified everything, let’s look at the entries dumped into “other” that contain fatalities or injuries. (Many of these have EVTYPE=“OTHER”, I’ve removed them from the list below to make it more readable.)
storm %>%
filter(WEATHERGRP=="other", FATALITIES>0|INJURIES>0|PROPDMG>0|CROPDMG>0,
!EVTYPE=="OTHER", !EVTYPE=="Other") %>%
select(BGN_DATE, EVTYPE, WEATHERGRP,FATALITIES,INJURIES,PROPDMG,CROPDMG)
## Source: local data frame [26 x 7]
##
## BGN_DATE EVTYPE WEATHERGRP FATALITIES INJURIES PROPDMG
## 1 1993-03-31 SEVERE TURBULENCE other 0 0 50000
## 2 1994-07-30 APACHE COUNTY other 0 0 5000
## 3 1995-09-07 LIGNTNING other 0 0 5000
## 4 1994-03-24 TORNDAO other 0 0 500
## 5 1993-01-09 AVALANCE other 1 0 0
## 6 1993-08-27 URBAN AND SMALL other 0 0 5000
## 7 1994-12-10 HEAVY MIX other 0 0 100000
## 8 1995-02-26 HEAVY MIX other 0 0 50000
## 9 1994-04-06 HEAVY MIX other 0 0 5000
## 10 1994-11-27 HEAVY MIX other 0 0 50000
## 11 1994-02-21 LIGHTING other 0 0 5000
## 12 1995-06-30 RAPIDLY RISING WATER other 1 0 0
## 13 1994-04-06 HEAVY MIX other 0 0 50000
## 14 1995-02-27 HEAVY MIX other 0 0 500000
## 15 1994-11-27 HEAVY MIX other 0 0 50000
## 16 1994-03-09 HEAVY MIX other 0 0 500000
## 17 1994-02-09 ? other 0 0 5000
## 18 1994-08-28 URBAN/SMALL STREAM other 0 0 5000
## 19 1994-10-13 URBAN SMALL other 0 0 50
## 20 1996-11-15 Beach Erosion other 0 0 100000
## 21 1996-04-16 Landslump other 0 0 570000
## 22 1997-05-04 DAM BREAK other 0 0 2000
## 23 1998-11-11 HYPERTHERMIA/EXPOSURE other 1 0 0
## 24 2000-07-17 DAM BREAK other 0 0 1000000
## 25 2001-01-04 ROGUE WAVE other 0 2 0
## 26 2002-05-02 DROWNING other 1 0 0
## Variables not shown: CROPDMG (dbl)
One or two of these look like they can be classified unambiguously, but I haven’t done so at this stage.
We are now in a position to group our events by the variable WEATHERGRP and also by SEASON. For clarity I have split the data into two: one data set summarizes the human costs and the other summarizes property and crop damage.
storm <- mutate(storm, DAMAGE=PROPDMG + CROPDMG)
by_wgroup <- group_by(storm, WEATHERGRP, SEASON)
human_cost <- summarize(by_wgroup, total_fatalities=sum(FATALITIES),
total_injuries=sum(INJURIES))
human_cost$WEATHERGRP <- reorder(human_cost$WEATHERGRP, -human_cost$total_fatalities)
human_cost <- gather(human_cost, loss, people, -WEATHERGRP, -SEASON)
damage_cost <- summarize(by_wgroup, property_damage=sum(PROPDMG)/1e9,
crop_damage=sum(CROPDMG)/1e9,
total_damage=sum(DAMAGE)/1e9)
damage_cost$WEATHERGRP <- reorder(damage_cost$WEATHERGRP, -damage_cost$total_damage)
damage_cost <- gather(damage_cost, loss, dollars, -WEATHERGRP, -SEASON)
ggplot(human_cost, aes(x=WEATHERGRP, y=people, fill="red")) +
geom_bar(stat="identity") +
facet_grid(loss~., scales="free_y") +
scale_x_discrete(human_cost$WEATHERGRP, name="Extreme weather group") +
ggtitle('Figure 1, Deaths and injuries from extreme weather 1950 - 2009') +
ylab('Number of deaths or injuries') +
theme(legend.position = "none")
Figure 1, above, shows the total number of deaths and injuries for each type of weather over the whole period. It is clear that wind related events (particularly typhoons and hurricanes) cause most deaths and injuries.
Note that the vertical scale is different in the top and bottom charts.
ggplot(damage_cost, aes(x=WEATHERGRP, y=dollars, fill="blue")) +
geom_bar(stat="identity") +
facet_grid(loss~., scales="free_y") +
#scale_x_discrete(damage_cost$WEATHERGRP, name="Extreme weather group") +
ggtitle('Figure 2, Cost of damage from extreme weather 1950 - 2009') +
ylab('Cost (USD bn)') +
theme(legend.position = "none")
Figure 2, above, shows that flood is the biggest cause of total financial losses due to property and crop damage, closely followed by wind. The biggest cause of crop damage is heat. (As with Figure 1, the vertical scale varies. In particular the scale on the top and bottom charts is ten times that of the middle chart - which is why the total losses closely follow the property damage costs.)
ggplot(filter(human_cost, loss=="total_fatalities"), aes(x=WEATHERGRP, y=people, fill=SEASON)) +
geom_bar(stat="identity", position="dodge") +
#facet_grid(SEASON~., scales="free_y") +
#facet_grid(SEASON~.) +
scale_x_discrete(human_cost$WEATHERGRP, name="Extreme weather group") +
ggtitle('Figure 3, Deaths from extreme weather 1950 - 2009') +
#xlab('Extreme weather group') +
ylab('Number of deaths') +
theme(legend.position = "right")
Figure 3 shows how the number of deaths varies by season.