We sift through the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database in look for insights as to which type of weather events are more harmful in terms of loss of life, injuries, and damage to crops or property. We find that, by a large margin, tornados cause the most fatalities and injuries. Crops are most affected by drought, although floods (including river floods) do not lag far behind in havoc-wreaking power. Floods are also the greatest cause of property damage, accounting for more than twice as much losses as the second worst event, hurricane/typhoon. Our results are based on publicly available data from the period 1950-2011, including recorded events for all U.S. counties. All data processing was done in R, a free programming language and software environment for statistical computing and graphics.
In this section we document in detail every data manipulation we performed in order to arrive at our conclusions.
We begin by loading some useful packages, setting the working directory,
and reading the data into R
(we use a mild overestimate for nrow in order to speed up the process
without assuming that we know beforehand the exact number of rows):
library("dplyr")
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("stringr")
library("tidyr")
library("ggplot2")
library("RColorBrewer")
setwd("/home/eduardo/datascience/coursera/5-repdata/PA2")
storms <- read.csv("repdata-data-StormData.csv.bz2", nrow = 903000)
The data set includes 902297 observations of 37 variables:
dim(storms)
## [1] 902297 37
names(storms)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We'll process these data using functions from the dplyr package.
First we select only those variables that we're interested in, which are:
EVTYPE: event typeFATALITIES: number of fatalitiesINJURIES: number of injured peoplePROPDMG: property damagePROPDMGEXP: units (K, M, B) for property damageCROPDMG: crop damageCROPDMGEXP: units (K, M, B) for crop damageThere are some weird prefixes included in the data, different from K, M, or B (or nothing). Since we have no way of knowing what is meant by them, we regard them as NAs and eliminate those observations from the processed data.
The pref2num() function, defined below, takes a vector of (character)
prefixes and returns a numeric vector, where each number corresponds to the
value of the prefix. It recognizes K, M, and B (uppercase or lowercase);
everything else is mapped to 1.
We use this function to construct numeric columns containing property/crop
damage directly in USD.
Event types in the original data are recorded sometimes in capital letters, producing some repetition. We convert all-capital event types into lowercase, capitalizing the first letter of every word. Trailing and leading white space is also removed. We have not attempted to fiddle with the data by, e.g., merging together categories like “Tornado” and “Hurricane/Typhoon”. We do have, however, replaced the “tstm” abbreviation with “Thunderstorm”, because it seemed clear to us that it made no sense to, e.g., count “Tstm wind” and “Thunderstorm wind” as two separate event types.
weirdprefs <- c(0:8, "-", "+", "?", "h", "H")
pref2num <- function(prefs) {
num <- rep(1, length(prefs))
num[prefs == "k" | prefs == "K"] <- 1e3
num[prefs == "m" | prefs == "M"] <- 1e6
num[prefs == "b" | prefs == "B"] <- 1e9
num
}
## capwords() function from the R help page for
## Character Translation and Casefolding (see ?toupper)
capwords <- function(s, strict = FALSE) {
cap <- function(s) {
paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
}
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
storms <- storms %>%
## Select only those columns we're interested in
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
## Eliminate observations with weird prefixes
filter(! PROPDMGEXP %in% weirdprefs) %>%
filter(! CROPDMGEXP %in% weirdprefs) %>%
## Build numeric columns for crop/property damage
mutate(property = PROPDMG * pref2num(as.character(PROPDMGEXP))) %>%
mutate(crop = CROPDMG * pref2num(as.character(CROPDMGEXP))) %>%
## Delete trailing and leading spaces in EVTYPE
mutate(EVTYPE = str_trim(as.character(EVTYPE))) %>%
## Convert evtype to lowercase, capitalizing every word
mutate(EVTYPE = capwords(EVTYPE, strict = TRUE)) %>%
## Find and replace some abbreviations used in evtype
mutate(EVTYPE = sub("Tstm", "Thunderstorm", EVTYPE, ignore.case = TRUE)) %>%
## Rename columns to lowercase
rename(evtype = EVTYPE, fatalities = FATALITIES, injuries = INJURIES) %>%
## Eliminate superseded columns
select(-PROPDMG, -PROPDMGEXP, -CROPDMG, -CROPDMGEXP)
str(storms)
## 'data.frame': 901949 obs. of 5 variables:
## $ evtype : chr "Tornado" "Tornado" "Tornado" "Tornado" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ property : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ crop : num 0 0 0 0 0 0 0 0 0 0 ...
There are 877 unique weather events (“storms”). The next step is to group the data by event type, and compute the sum of all other variables for each event type.
storms <- storms %>%
group_by(evtype) %>%
summarise_each(funs(sum))
str(storms)
## Classes 'tbl_df', 'tbl' and 'data.frame': 877 obs. of 5 variables:
## $ evtype : chr "?" "Abnormal Warmth" "Abnormally Dry" "Abnormally Wet" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 0 1 ...
## $ injuries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ property : num 5000 0 0 0 0 ...
## $ crop : num 0 0 0 0 0 ...
## - attr(*, "drop")= logi TRUE
summary(storms)
## evtype fatalities injuries
## Length:877 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 0.00 1st Qu.: 0.0
## Mode :character Median : 0.00 Median : 0.0
## Mean : 17.26 Mean : 160.2
## 3rd Qu.: 0.00 3rd Qu.: 0.0
## Max. :5630.00 Max. :91321.0
## property crop
## Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median :0.000e+00 Median :0.000e+00
## Mean :4.872e+08 Mean :5.590e+07
## 3rd Qu.:8.000e+04 3rd Qu.:0.000e+00
## Max. :1.447e+11 Max. :1.397e+10
From the summary of the data we learn that at least 75 percent of all “storms” produce no fatalities, no injuries, and no crop damage, and that at least 50 percent of all “storms” also produce no property damage.
Total numbers of fatalities, injuries, and crop/property damage are:
total_fatalities <- sum(storms$fatalities)
total_injuries <- sum(storms$injuries)
total_property <- sum(storms$property)
total_crop <- sum(storms$crop)
This gives 1.51 × 104 fatalities, 1.4 × 105 injuries, 4.27 × 1011 USD in property losses, and 4.9 × 1010 USD in crop damage.
Since we're only interested in those events that produce the most damage, we ruthlessly supress all observations where fatalities, injuries, and crop/property damage are all equal to zero.
storms <- storms %>%
filter(fatalities != 0 | injuries != 0 | property != 0 | crop != 0)
summary(storms)
## evtype fatalities injuries
## Length:432 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 0.00 1st Qu.: 0.0
## Mode :character Median : 0.00 Median : 0.0
## Mean : 35.03 Mean : 325.2
## 3rd Qu.: 2.00 3rd Qu.: 2.0
## Max. :5630.00 Max. :91321.0
## property crop
## Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:5.000e+03 1st Qu.:0.000e+00
## Median :8.750e+04 Median :0.000e+00
## Mean :9.892e+08 Mean :1.135e+08
## 3rd Qu.:1.860e+06 3rd Qu.:1.000e+04
## Max. :1.447e+11 Max. :1.397e+10
We are finally left with 432 observations where at least one damage indicator is different from zero.
To assess which event types are more harmful to humans
(meaning they cause more fatalities and injuries),
we arrange the data according to fatalities and injuries
(in descending order), which gives us a list of the worst events:
topfatalities <- arrange(storms, desc(fatalities))
topinjuries <- arrange(storms, desc(injuries))
print(topfatalities)
## Source: local data frame [432 x 5]
##
## evtype fatalities injuries property crop
## 1 Tornado 5630 91321 56936985483 364950110
## 2 Excessive Heat 1903 6525 7753700 492402000
## 3 Flash Flood 978 1777 16140861717 1420727100
## 4 Heat 937 2100 1797000 401461500
## 5 Lightning 816 5230 928659283 12092090
## 6 Thunderstorm Wind 637 8445 7976179582 968850400
## 7 Flood 470 6789 144657709807 5661968450
## 8 Rip Current 368 232 1000 0
## 9 High Wind 246 1137 5270046260 638571300
## 10 Avalanche 224 170 3721800 0
## .. ... ... ... ... ...
print(topinjuries)
## Source: local data frame [432 x 5]
##
## evtype fatalities injuries property crop
## 1 Tornado 5630 91321 56936985483 364950110
## 2 Thunderstorm Wind 637 8445 7976179582 968850400
## 3 Flood 470 6789 144657709807 5661968450
## 4 Excessive Heat 1903 6525 7753700 492402000
## 5 Lightning 816 5230 928659283 12092090
## 6 Heat 937 2100 1797000 401461500
## 7 Ice Storm 89 1975 3944927810 5022110000
## 8 Flash Flood 978 1777 16140861717 1420727100
## 9 Hail 15 1358 15732261777 3000954453
## 10 Winter Storm 206 1321 6688497250 26944000
## .. ... ... ... ... ...
By a large margin, tornados turn out to be the events that cause more fatalities and injuries. Tornados cause almost 3 times more fatalities than the second most fatal event, excessive heat. Tornados also cause about 10 times more injuries than the second most dangerous event, thunderstorm wind.
To represent the worst events in a plot, we first decide on a short list that includes the five more damaging events in terms of both fatalities and injuries:
tophumanevs <- unique(c(topfatalities$evtype[1:5], topinjuries$evtype[1:5]))
print(tophumanevs)
## [1] "Tornado" "Excessive Heat" "Flash Flood"
## [4] "Heat" "Lightning" "Thunderstorm Wind"
## [7] "Flood"
Now we recast the data in a format that's more suitable for plotting:
tophuman <- storms %>%
select(-property, -crop) %>%
filter(evtype %in% tophumanevs) %>%
mutate(fatalities = 100*fatalities/total_fatalities) %>%
mutate(injuries = 100*injuries/total_injuries) %>%
gather(Damage, Percentage, -evtype) %>%
rename(Event = evtype) %>%
mutate(Event = factor(Event), Damage = factor(Damage)) %>%
arrange(desc(Percentage))
tophuman
## Source: local data frame [14 x 3]
##
## Event Damage Percentage
## 1 Tornado injuries 65.010109
## 2 Tornado fatalities 37.198546
## 3 Excessive Heat fatalities 12.573505
## 4 Flash Flood fatalities 6.461843
## 5 Heat fatalities 6.190948
## 6 Thunderstorm Wind injuries 6.011874
## 7 Lightning fatalities 5.391477
## 8 Flood injuries 4.832992
## 9 Excessive Heat injuries 4.645054
## 10 Thunderstorm Wind fatalities 4.208788
## 11 Lightning injuries 3.723162
## 12 Flood fatalities 3.105385
## 13 Heat injuries 1.494960
## 14 Flash Flood injuries 1.265021
str(tophuman)
## Classes 'tbl_df', 'tbl' and 'data.frame': 14 obs. of 3 variables:
## $ Event : Factor w/ 7 levels "Excessive Heat",..: 7 7 1 2 4 6 5 3 1 6 ...
## $ Damage : Factor w/ 2 levels "fatalities","injuries": 2 1 1 1 1 2 1 2 2 1 ...
## $ Percentage: num 65.01 37.2 12.57 6.46 6.19 ...
The following stacked bar plot shows the fraction (percentage) of fatalities and injuries caused by the seven deadliest events. Note that these account for over 75 percent of all human damage.
## Change plot order of Events
tophuman$Event <- factor(tophuman$Event,
c("Tornado", "Excessive Heat", "Flash Flood", "Heat",
"Thunderstorm Wind", "Lightning", "Flood"))
## Plot
g <- ggplot(data = tophuman, aes(x = Damage, y = Percentage, fill = Event)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of fatalities and injuries\ncaused by most catastrofic weather events") +
labs(x = "", y = "") +
ylim(0, 100) +
scale_fill_manual(values = brewer.pal(length(tophumanevs), "Set1"))
print(g)
To assess which event types are more harmful to the economy
(meaning they cause large crop and property damage),
we arrange the data according to property and crop
(in descending order), which gives us a list of the worst events:
topproperty <- arrange(storms, desc(property))
topcrop <- arrange(storms, desc(crop))
print(topproperty)
## Source: local data frame [432 x 5]
##
## evtype fatalities injuries property crop
## 1 Flood 470 6789 144657709807 5661968450
## 2 Hurricane/typhoon 64 1275 69305840000 2607872800
## 3 Tornado 5630 91321 56936985483 364950110
## 4 Storm Surge 13 38 43323536000 5000
## 5 Flash Flood 978 1777 16140861717 1420727100
## 6 Hail 15 1358 15732261777 3000954453
## 7 Hurricane 61 46 11868319010 2741910000
## 8 Thunderstorm Wind 637 8445 7976179582 968850400
## 9 Tropical Storm 58 340 7703890550 678346000
## 10 Winter Storm 206 1321 6688497250 26944000
## .. ... ... ... ... ...
print(topcrop)
## Source: local data frame [432 x 5]
##
## evtype fatalities injuries property crop
## 1 Drought 0 4 1046106000 13972566000
## 2 Flood 470 6789 144657709807 5661968450
## 3 River Flood 2 2 5118945500 5029459000
## 4 Ice Storm 89 1975 3944927810 5022110000
## 5 Hail 15 1358 15732261777 3000954453
## 6 Hurricane 61 46 11868319010 2741910000
## 7 Hurricane/typhoon 64 1275 69305840000 2607872800
## 8 Flash Flood 978 1777 16140861717 1420727100
## 9 Extreme Cold 162 231 67737400 1312973000
## 10 Frost/freeze 0 0 10480000 1094186000
## .. ... ... ... ... ...
Interestingly, Floods and Droughts turn out to be the events that cause greater economic damage. Floods cause twice as much property damage as the second worst event, Hurricane/typhoon. In crop damage, however, Droughts cause 30 percent more losses than Floods and River Floods, the second and third worst crop killers, taken together.
To represent the worst events in a plot, we first decide on a short list that includes the five more damaging events in terms of both crop and economy damage:
topeconomyevs <- unique(c(topcrop$evtype[1:5], topproperty$evtype[1:5]))
print(topeconomyevs)
## [1] "Drought" "Flood" "River Flood"
## [4] "Ice Storm" "Hail" "Hurricane/typhoon"
## [7] "Tornado" "Storm Surge" "Flash Flood"
Now we recast the data in a format that's more suitable for plotting:
topeconomy <- storms %>%
select(-fatalities, -injuries) %>%
filter(evtype %in% topeconomyevs) %>%
mutate(property = 100*property/total_property) %>%
mutate(crop = 100*crop/total_crop) %>%
gather(Damage, Percentage, -evtype) %>%
rename(Event = evtype) %>%
mutate(Event = factor(Event), Damage = factor(Damage)) %>%
arrange(desc(Percentage))
topeconomy
## Source: local data frame [18 x 3]
##
## Event Damage Percentage
## 1 Flood property 3.385252e+01
## 2 Drought crop 2.849881e+01
## 3 Hurricane/typhoon property 1.621882e+01
## 4 Tornado property 1.332429e+01
## 5 Flood crop 1.154830e+01
## 6 River Flood crop 1.025822e+01
## 7 Ice Storm crop 1.024323e+01
## 8 Storm Surge property 1.013849e+01
## 9 Hail crop 6.120825e+00
## 10 Hurricane/typhoon crop 5.319086e+00
## 11 Flash Flood property 3.777254e+00
## 12 Hail property 3.681634e+00
## 13 Flash Flood crop 2.897752e+00
## 14 River Flood property 1.197926e+00
## 15 Ice Storm property 9.231845e-01
## 16 Tornado crop 7.443618e-01
## 17 Drought property 2.448077e-01
## 18 Storm Surge crop 1.019813e-05
str(topeconomy)
## Classes 'tbl_df', 'tbl' and 'data.frame': 18 obs. of 3 variables:
## $ Event : Factor w/ 9 levels "Drought","Flash Flood",..: 3 1 5 9 3 7 6 8 4 5 ...
## $ Damage : Factor w/ 2 levels "property","crop": 1 2 1 1 2 2 2 1 2 2 ...
## $ Percentage: num 33.9 28.5 16.2 13.3 11.5 ...
The following stacked bar plot shows the fraction (percentage) of crop and property damage caused by the nine severest events. Note that these account for over 75 percent of all economic damage.
## Change plot order of Events
topeconomy$Event <- factor(topeconomy$Event,
c("Flood", "Drought", "Hurricane/typhoon", "Tornado",
"River Flood", "Ice Storm", "Storm Surge", "Hail",
"Flash Flood"))
## Plot
h <- ggplot(data = topeconomy, aes(x = Damage, y = Percentage, fill = Event)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of property and crop damages\ncaused by most catastrofic weather events") +
labs(x = "", y = "") +
ylim(0, 100) +
scale_fill_manual(values = brewer.pal(length(topeconomyevs), "Set1"))
print(h)