We analyzed the NOAA storm database to answer two questions: which type of weather event is most harmful with respect to population health and which events have the greatest economic impact? For the first question we assigned weather event categories to the data and calculated the sums of the number of fatalities (FATALITIES
) and injuries (INJURIES
) by event type (EVTYPE
). Categories were used instead of the straight EVTYPE
because of duplicate or similar entries. For example there were 213 events types with “wind” in their name. In this analysis Tornados, Wind and Heat were the three most harmful weather events, assuming the sum of injuries and deaths was the best metric for harmfulness. A similar analysis was performed looking at the property damage (PROPDMG
) and agricultural damage (CROPDMG
). The most expensive weather events are Floods, Hurricanes and Tornados.
We began by downloading the file from the course webpage. The file path to the data was saved to dlDest
.
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
dl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dlDest <- "./data/StormData.csv.bz2"
if(!dir.exists("./data")){
dir.create("./data")
}
if(!file.exists(dlDest)){
download.file(url = dl,
destfile = dlDest)
}
The first line was read in, split up and turned into a character vector of column names. The data was then loaded into the variable storm
using readr::read_csv
with column types set based on the types of data found therein.
cnames <- read_lines(dlDest,
n_max = 1)
cnames <- unlist(strsplit(cnames, ",", fixed = TRUE))
cnames <- gsub("\"", "", cnames)
storm <- read_csv(dlDest,
col_types = "ncccncffcfccccccfcdncnnndfdfffcnnnncn")
names(storm) <- make.names(cnames)
To examine how harmful each event type (EVTYPE
) was, we calculated total fatalities (Event.F
), injuries (Event.I
) and both (Event.T
), grouped by event type. Event.r.F
, Event.r.I
and Event.r.T
give the rate of death and/or injury per event.
EventSummary <- storm %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarize(Event.Count = n(),
Event.F = sum(FATALITIES),
Event.I = sum(INJURIES),
Event.T = Event.I + Event.F,
Event.r.F = format(Event.F / Event.Count, digits = 3),
Event.r.I = format(Event.I / Event.Count, digits = 3),
Event.r.T = format(Event.T / Event.Count, digits = 3)) %>%
arrange(desc(Event.T))
kable(head(EventSummary[,1:5], 10),
caption = "Table of 10 Most Harmful Weather Events, by EVTYPE")
EVTYPE | Event.Count | Event.F | Event.I | Event.T |
---|---|---|---|---|
TORNADO | 60652 | 5633 | 91346 | 96979 |
EXCESSIVE HEAT | 1678 | 1903 | 6525 | 8428 |
TSTM WIND | 219944 | 504 | 6957 | 7461 |
FLOOD | 25326 | 470 | 6789 | 7259 |
LIGHTNING | 15755 | 816 | 5230 | 6046 |
HEAT | 767 | 937 | 2100 | 3037 |
FLASH FLOOD | 54278 | 978 | 1777 | 2755 |
ICE STORM | 2006 | 89 | 1975 | 2064 |
THUNDERSTORM WIND | 82563 | 133 | 1488 | 1621 |
WINTER STORM | 11433 | 206 | 1321 | 1527 |
Looking at the top ten list, sorted by total fatalities and injuries (Event.T
), there are a few interesting features. First off, those data indicated that tornadoes were clearly responsible for the most deaths and injuries (being approximately 12 times more harmful than the next entry). It also suggests that human error or inconsistent (and poor) data entry practices resulted in similar EVTYPE
factors (e.g. “FLOOD” and “FLASH FLOOD”) and for there to be duplicates (“TSTM WIND” and “THUNDERSTORM WIND”). We had to take these issue into consideration.
There are 977 factor levels in EVTYPE
, but as noted above the data quality looked to be low, with duplicate entries and inconsistent naming. To improve this we utilized grepl()
to append columns of logical values that indicate whether a given search string has a hit for a grouping term.
See the appendix for the method used for choosing the search terms. The following nine categories of event were determined to cover the majority of the events: WIND
, TORNADO
, HEAT
, FLOOD
, HURRICANE
, LIGHTNING
, CURRENTS
, FIRE
and SNOW
. grepl()
returns a TRUE
if the string is found in the EVTYPE
, which allowed us to use that column as a filter. SNOW
includes events that with names containing ice, snow, blizzard or winter. All of the values not caught by one of the search terms (i.e. was FALSE
for WIND:SNOW
) was loaded into subEV
.
EVSum2 <- EventSummary %>%
mutate(WIND = grepl("WIND", EVTYPE),
TORNADO = grepl("TORNADO", EVTYPE),
HEAT = grepl("HEAT", EVTYPE),
FLOOD = grepl("FLOOD", EVTYPE),
HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
LIGHTNING = grepl("LIGHTNING", EVTYPE),
CURRENTS = grepl("RIP CURRENT", EVTYPE),
FIRE = grepl("FIRE", EVTYPE),
SNOW = grepl("ICE|SNOW|BLIZZARD|WINTER|COLD", EVTYPE))
subEV <- EVSum2[!(EVSum2$WIND|
EVSum2$HEAT|
EVSum2$TORNADO|
EVSum2$FLOOD|
EVSum2$HURRICANE|
EVSum2$LIGHTNING|
EVSum2$CURRENTS|
EVSum2$FIRE|
EVSum2$SNOW),] %>%
arrange(desc(Event.F))
subEV %>%
select(EVTYPE:Event.T) %>%
head(10) %>%
kable(caption = "Table of Top 10 Ungrouped Weather Events")
EVTYPE | Event.Count | Event.F | Event.I | Event.T |
---|---|---|---|---|
AVALANCHE | 386 | 224 | 170 | 394 |
HIGH SURF | 725 | 101 | 152 | 253 |
HEAVY RAIN | 11723 | 98 | 251 | 349 |
FOG | 538 | 62 | 734 | 796 |
HEAVY SURF/HIGH SURF | 228 | 42 | 48 | 90 |
LANDSLIDE | 600 | 38 | 52 | 90 |
TSUNAMI | 20 | 33 | 129 | 162 |
UNSEASONABLY WARM AND DRY | 13 | 29 | 0 | 29 |
URBAN/SML STREAM FLD | 3392 | 28 | 79 | 107 |
DUST STORM | 427 | 22 | 440 | 462 |
The ungrouped categories covered only 5.78% of the total number of fatalities and 3.49% of all fatalities and injuries. This is low enough that we can ignore those events for now, although it would be simple to include more categories if deemed necessary.
Because it was possible for more than one of the category grepl()
to be true (e.g. an event of “SNOW/WIND/FLOODING” would be true for each of SNOW, WIND and FLOOD) it was important to establish the potential impact from choices made about how to handle these special cases. OVERLAP
counts the number of categories. By summing the combined fatalities and injuries over the subset of EVTYPES with more than one category we can see how many cases are effected.
EVSum2 <- EVSum2 %>%
mutate(OVERLAP = WIND +
TORNADO +
HEAT +
FLOOD +
HURRICANE +
LIGHTNING +
CURRENTS +
FIRE +
SNOW)
sum(EVSum2[EVSum2$OVERLAP > 1,]$Event.T)
## [1] 354
With only 354 total injuries or death impacted by the multiple category designations we won’t worry about it too much. The method used below for assigning the category imposes an ordering/hierachy and prevents double counting, but other methods (like splitting values) might make sense as well and aren’t examined in this analysis.
Using a series of nested ifelse()
the following categories were assigned, with the given order:
HURRICANE > SNOW > FLOOD > TORNADO > LIGHTNING > WIND > FIRE > HEAT > OTHER
This ordering was chosen because WIND
was commonly one of the two or three overlapping categories, and in cases where it was WIND + TORNADO
or WIND + HURRICANE
it made the most sense to group it with the primary event (the tornado or hurricane, in those cases). HURRICANE
, SNOW
, FLOOD
, TORNADO
and LIGHTNING
are primary descriptors with minimal internal overlap, while FIRE
and HEAT
have little to no overlap with other categories. Anything not assigned a category was called OTHER
and contains the same events as subEV
.
CatSum <- EVSum2 %>%
mutate(CATEGORY = ifelse(HURRICANE,
"HURRICANE",
ifelse(SNOW,
"SNOW",
ifelse(FLOOD,
"FLOOD",
ifelse(TORNADO,
"TORNADO",
ifelse(LIGHTNING,
"LIGHTNING",
ifelse(CURRENTS,
"RIP CURRENTS",
ifelse(WIND,
"WIND",
ifelse(FIRE,
"FIRE",
ifelse(HEAT,
"HEAT",
"OTHER")))))))))) %>%
select(-(WIND:SNOW)) %>%
group_by(CATEGORY) %>%
summarize(`Count` = sum(Event.Count),
`Fatalities` = sum(Event.F),
`Injuries` = sum(Event.I),
`Total` = `Injuries` + `Fatalities`,
`Fatality Rate` = format(`Fatalities` / `Count`, digits = 3),
`Injury Rate` = format(`Injuries` / `Count`, digits = 3),
`Total Rate` = format(`Total` / `Count`, digits = 3)) %>%
arrange(desc(`Total`))
The result of these calculations can be found in the Results section.
Economic impact was calculated in a similar manner, albeit with some modifications. Because of the range of values on the economic impact property and crop damage were each encoded into two columns, one for the coefficient (PROPDMG
or CROPDMG
) and one for the scale/exponent indicator (PROPDMGEXP
or CROPDMGEXP
). For example, a damage may be written as 5.2 M, representing 5.2 million USD or $5,200,000. Unfortunately PROPDMGEXP
encodes a weird range of characters. As a result of the US Government partial shutdown it is not possible to access the NOAA website for supplementary data and tables, but we were able to find a reference to someone else who has analyzed the same thing here: https://rpubs.com/flyingdisc/PROPDMGEXP. Those levels were used to replace the factor values with the correct multiplier.
EconSummary <- storm %>%
select(EVTYPE, PROPDMG:CROPDMGEXP)
levels(EconSummary$PROPDMGEXP) <- c(1000, 1000000, 1000000000, 1000000, 1, 0,
10, 10, 0, 10, 10, 10, 100, 10, 100, 0, 10,
10)
levels(EconSummary$CROPDMGEXP) <- c(1000000, 1000, 1000000, 1000000000, 0, 0,
1000, 10)
EconSummary <- EconSummary %>%
mutate(PropertyDamage = PROPDMG * as.numeric(as.character(PROPDMGEXP)),
CropDamage = CROPDMG * as.numeric(as.character(CROPDMGEXP)),
TotalDamage = ifelse(is.na(PropertyDamage), 0, PropertyDamage) +
ifelse(is.na(CropDamage), 0, CropDamage))
A similar grouping process was run, although a different set of terms were used. Hurricanes, typhoons, cyclones are the same thing, and tropical storms (and depressions) represent lower category versions of the same type of storm, so those were grouped together. Freeze and cold were added to the SNOW search term.
EconSum2 <- EconSummary %>%
mutate(WIND = grepl("WIND", EVTYPE),
TORNADO = grepl("TORNADO", EVTYPE),
FLOOD = grepl("FLOOD", EVTYPE),
HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
FIRE = grepl("FIRE", EVTYPE),
DROUGHT = grepl("DROUGHT", EVTYPE),
HAIL = grepl("HAIL", EVTYPE),
SURGE = grepl("SURGE", EVTYPE),
SNOW = grepl("ICE|SNOW|BLIZZARD|WINTER|FREEZE|COLD",EVTYPE))
subEC <- EconSum2[!(EconSum2$WIND|
EconSum2$TORNADO|
EconSum2$FLOOD|
EconSum2$HURRICANE|
EconSum2$FIRE|
EconSum2$DROUGHT|
EconSum2$HAIL|
EconSum2$SURGE|
EconSum2$SNOW),] %>% arrange(desc(TotalDamage))
subEC %>%
select(EVTYPE, PropertyDamage:TotalDamage ) %>%
head(10) %>%
kable(caption = "Table of Top 10 Ungrouped Weather Events")
EVTYPE | PropertyDamage | CropDamage | TotalDamage |
---|---|---|---|
HEAVY RAIN/SEVERE WEATHER | 2.500e+09 | NA | 2500000000 |
SEVERE THUNDERSTORM | 1.200e+09 | NA | 1200000000 |
EXCESSIVE HEAT | 1.700e+05 | 492400000 | 492570000 |
HEAT | NA | 400000000 | 400000000 |
HEAVY RAIN | NA | 200000000 | 200000000 |
HEAVY RAIN | NA | 190000000 | 190000000 |
EXCESSIVE WETNESS | NA | 142000000 | 142000000 |
River Flooding | 7.874e+07 | 26840000 | 105580000 |
TSUNAMI | 8.100e+07 | 20000 | 81020000 |
HEAVY RAIN | NA | 73600000 | 73600000 |
Like subEV before, subEC contained the ungrouped levels of EVTYPE
. Those ungrouped EVTYPE
factors only covered 1.76% of the total damage and were not worth the time to pull apart any further.
The dataset was then assigned CATEGORY
values using a similar ifelse()
structure, and the damage amounts were then subtotaled for each entry.
CatEconSum <- EconSum2 %>%
mutate(CATEGORY = ifelse(HURRICANE,
"HURRICANE",
ifelse(SURGE,
"SURGE",
ifelse(SNOW,
"SNOW",
ifelse(FLOOD,
"FLOOD",
ifelse(HAIL,
"HAIL",
ifelse(TORNADO,
"TORNADO",
ifelse(WIND,
"WIND",
ifelse(DROUGHT,
"DROUGHT",
ifelse(FIRE,
"FIRE",
"OTHER")))))))))) %>%
select(-(WIND:SNOW)) %>%
group_by(CATEGORY) %>%
summarize(`Property Damage` = sum(PropertyDamage, na.rm = TRUE),
`Crop Damage` = sum(CropDamage, na.rm = TRUE),
`Total Damage` = `Property Damage` + `Crop Damage`) %>%
arrange(desc(`Total Damage`))
Answering the question of which weather event is most harmful first requires that we choose the criteria for “most harmful.” If want to look at fatalities primarily we would find that Tornados, Heat and Floods are the three most deadly types of events.
kable(arrange(CatSum, desc(CatSum$Fatalities)),
padding = 2,
caption = "Most harmful weather events, ranked by fatalities")
CATEGORY | Count | Fatalities | Injuries | Total | Fatality Rate | Injury Rate | Total Rate |
---|---|---|---|---|---|---|---|
TORNADO | 60699 | 5661 | 91407 | 97068 | 0.0933 | 1.51 | 1.6 |
HEAT | 2645 | 3138 | 9154 | 12292 | 1.19 | 3.46 | 4.65 |
FLOOD | 82648 | 1523 | 8601 | 10124 | 0.0184 | 0.104 | 0.122 |
WIND | 363224 | 1192 | 11397 | 12589 | 0.00328 | 0.0314 | 0.0347 |
SNOW | 44462 | 1071 | 6331 | 7402 | 0.0241 | 0.142 | 0.166 |
OTHER | 326773 | 875 | 4555 | 5430 | 0.00268 | 0.0139 | 0.0166 |
LIGHTNING | 15776 | 817 | 5232 | 6049 | 0.0518 | 0.332 | 0.383 |
RIP CURRENTS | 777 | 577 | 529 | 1106 | 0.743 | 0.681 | 1.42 |
HURRICANE | 1054 | 201 | 1714 | 1915 | 0.191 | 1.63 | 1.82 |
FIRE | 4239 | 90 | 1608 | 1698 | 0.0212 | 0.379 | 0.401 |
Another way we can think about how harmful a weather events is by looking at the number of dead and injured per event. The top three events that have the highest Total Rate
are hurricanes, heat and tornado. While hurricanes and heat are much less common than tornados they are 3.2 and 2.9 times as likely to cause harm, compared against tornadoes.
kable(arrange(CatSum, desc(CatSum$`Total Rate`)),
padding = 2,
caption = "Most harmful weather events, ranked by total death and injury rate")
CATEGORY | Count | Fatalities | Injuries | Total | Fatality Rate | Injury Rate | Total Rate |
---|---|---|---|---|---|---|---|
HEAT | 2645 | 3138 | 9154 | 12292 | 1.19 | 3.46 | 4.65 |
HURRICANE | 1054 | 201 | 1714 | 1915 | 0.191 | 1.63 | 1.82 |
TORNADO | 60699 | 5661 | 91407 | 97068 | 0.0933 | 1.51 | 1.6 |
RIP CURRENTS | 777 | 577 | 529 | 1106 | 0.743 | 0.681 | 1.42 |
FIRE | 4239 | 90 | 1608 | 1698 | 0.0212 | 0.379 | 0.401 |
LIGHTNING | 15776 | 817 | 5232 | 6049 | 0.0518 | 0.332 | 0.383 |
SNOW | 44462 | 1071 | 6331 | 7402 | 0.0241 | 0.142 | 0.166 |
FLOOD | 82648 | 1523 | 8601 | 10124 | 0.0184 | 0.104 | 0.122 |
WIND | 363224 | 1192 | 11397 | 12589 | 0.00328 | 0.0314 | 0.0347 |
OTHER | 326773 | 875 | 4555 | 5430 | 0.00268 | 0.0139 | 0.0166 |
Finally, If we instead wanted to find the events with the highest total impact (fatalities and injuries combined) then we would get tornados, wind and heat as the most harmful.
kable(arrange(CatSum, desc(CatSum$Total)),
padding = 2,
caption = "Most harmful weather events, ranked by total death and injury")
CATEGORY | Count | Fatalities | Injuries | Total | Fatality Rate | Injury Rate | Total Rate |
---|---|---|---|---|---|---|---|
TORNADO | 60699 | 5661 | 91407 | 97068 | 0.0933 | 1.51 | 1.6 |
WIND | 363224 | 1192 | 11397 | 12589 | 0.00328 | 0.0314 | 0.0347 |
HEAT | 2645 | 3138 | 9154 | 12292 | 1.19 | 3.46 | 4.65 |
FLOOD | 82648 | 1523 | 8601 | 10124 | 0.0184 | 0.104 | 0.122 |
SNOW | 44462 | 1071 | 6331 | 7402 | 0.0241 | 0.142 | 0.166 |
LIGHTNING | 15776 | 817 | 5232 | 6049 | 0.0518 | 0.332 | 0.383 |
OTHER | 326773 | 875 | 4555 | 5430 | 0.00268 | 0.0139 | 0.0166 |
HURRICANE | 1054 | 201 | 1714 | 1915 | 0.191 | 1.63 | 1.82 |
FIRE | 4239 | 90 | 1608 | 1698 | 0.0212 | 0.379 | 0.401 |
RIP CURRENTS | 777 | 577 | 529 | 1106 | 0.743 | 0.681 | 1.42 |
This can also be seen graphically:
CatSum %>%
select(CATEGORY,`Fatalities`, `Injuries`) %>%
gather(key = `Casualty Type`, value = `Count`, -CATEGORY) %>%
ggplot() +
geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Count`),
y = `Count`,
fill = `Casualty Type`)) +
coord_flip() +
labs(y = "Number of Casualties (Injured or Killed)",
x = "",
title = "Cost to Health of Extreme Weather Events",
subtitle = "U.S. NOAA Storm Database (1950-2011)",
fill = "Casualty Type") +
theme_bw() +
guides(fill=guide_legend(title = NULL)) +
theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))
The same type of analysis was performed on the economic data, specifically the property damage and agricultural damage. Flooding was by far the most costly, almost twice the damage of hurricanes. Since the values were so large we scaled them back by \(10^9\), making the values representative of billions of dollars.
CatEconSum$`Property Damage` <- CatEconSum$`Property Damage`/1000000000
CatEconSum$`Crop Damage` <- CatEconSum$`Crop Damage`/1000000000
CatEconSum$`Total Damage` <- CatEconSum$`Total Damage`/1000000000
kable(CatEconSum,
caption = "Summary of Economic Damages, by Category")
CATEGORY | Property Damage | Crop Damage | Total Damage |
---|---|---|---|
FLOOD | 167.366829 | 12.3470691 | 179.71390 |
HURRICANE | 93.072538 | 6.2110138 | 99.28355 |
TORNADO | 56.993099 | 0.4149613 | 57.40806 |
SURGE | 47.965224 | 0.0008550 | 47.96608 |
SNOW | 12.699465 | 8.5572980 | 21.25676 |
HAIL | 17.619992 | 3.1142128 | 20.73420 |
WIND | 15.802228 | 1.9564790 | 17.75871 |
DROUGHT | 1.046306 | 13.9726218 | 15.01893 |
FIRE | 8.501629 | 0.4032816 | 8.90491 |
OTHER | 6.251340 | 2.1263994 | 8.37774 |
A column chart can be made to show the how the totals relate and how much the property and crop components contribute to the total.
CatEconSum %>%
select(CATEGORY:`Crop Damage`) %>%
gather(key = `Damage Type`, value = `Amount`, -CATEGORY) %>%
ggplot() +
geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Amount`),
y = `Amount`,
fill = `Damage Type`)) +
coord_flip() +
labs(y = "Cost of Damages (Billions $)",
x = "",
title = "Cost of Damages of Extreme Weather Events",
subtitle = "U.S. NOAA Storm Database (1950-2011)",
fill = "Damage Type") +
theme_bw() +
guides(fill=guide_legend(title = NULL)) +
theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))
The following page has a data dictionary was useful in determining the column types for the initial loading of the data.
http://rstudio-pubs-static.s3.amazonaws.com/23028_821a9591f7de4289a21c4ec8a41604c3.html
As stated above, the following website has a useful analysis of the meanings of the exponent values for the damages in this dataset.
Choosing the groups was done manually. From Table 1 it was clear that there were name-based issues, so terms like “heat”, “flood” and “wind” looked like they should catch multiple top 10 events. After using mutate to add the logical columns for the search terms, those terms were used to exclude “found” rows.
EventExample <- EventSummary %>%
mutate(WIND = grepl("WIND", EVTYPE),
HEAT = grepl("HEAT", EVTYPE),
FLOOD = grepl("FLOOD", EVTYPE))
subEvent <- EventExample[!(EventExample$WIND|
EventExample$HEAT|
EventExample$FLOOD),] %>%
arrange(desc(Event.F))
kable(subEvent[1:10,1:8])
EVTYPE | Event.Count | Event.F | Event.I | Event.T | Event.r.F | Event.r.I | Event.r.T |
---|---|---|---|---|---|---|---|
TORNADO | 60652 | 5633 | 91346 | 96979 | 0.0929 | 1.51 | 1.6 |
LIGHTNING | 15755 | 816 | 5230 | 6046 | 0.0518 | 0.332 | 0.384 |
RIP CURRENT | 470 | 368 | 232 | 600 | 0.783 | 0.494 | 1.28 |
AVALANCHE | 386 | 224 | 170 | 394 | 0.58 | 0.44 | 1.02 |
WINTER STORM | 11433 | 206 | 1321 | 1527 | 0.018 | 0.116 | 0.134 |
RIP CURRENTS | 304 | 204 | 297 | 501 | 0.671 | 0.977 | 1.65 |
EXTREME COLD | 655 | 160 | 231 | 391 | 0.244 | 0.353 | 0.597 |
HEAVY SNOW | 15708 | 127 | 1021 | 1148 | 0.00809 | 0.065 | 0.0731 |
BLIZZARD | 2719 | 101 | 805 | 906 | 0.0371 | 0.296 | 0.333 |
HIGH SURF | 725 | 101 | 152 | 253 | 0.139 | 0.21 | 0.349 |
This was continued until most outcomes were accounted for.