In this report we analyze the weather event types that have caused the most severe damage on population and properties across the U.S. in the period between 1950 - 2011. The study is based on the historical data managed by the U.S. National Oceanic and Atmospheric Administration’s (NOAA) in the storm database. From this data, we found that the event type with the highest impact on population damage (fatalities and injuries) is by far a Tornado. However, from the point of view of economic consequences (properties damages and crop losses), Flood followed by Hurricane/Typhoon are the most dangerous. Considering that NOAA registers more than 900 different event types, these conclusions may be relevant for public officers in preparing and prioritizing resources.
Before we proceed we must load some packages that are required to clean and process the data. The subsetting and restructuring of the raw dataset is based on dplyr and tidyr, and the graphs are produced with ggplot2.
# Cheking if some packages are installed and loading them
resp <- require("dplyr", quietly = TRUE)
if (!resp) {stop("Please, install 'dplyr' package before running this script")}
resp <- require("tidyr", quietly = TRUE)
if (!resp) {stop("Please, install 'tidyr' package before running this script")}
resp <- require("ggplot2", quietly = TRUE)
if (!resp) {stop("Please, install 'ggplot2' package before running this script")}
resp <- require("knitr", quietly = TRUE)
if (!resp) {stop("Please, install 'knitr' package before running this script")}
The storm data have been obtained from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.
This is a comma-separated-value file with headers, compressed with the bzip2 algorithm. It is a format than can be read directly with the function read.csv().
Let’s proceed to reading the compressed downloaded file (it must be in the same directory where this script is executed). As it is a long file, and in order to reduce processing time in subsequent executions of this document, resulting dataset will be cached (cache=TRUE in the r code chunk).
# Reading the original raw dataset
stormdata_raw <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=FALSE)
This dataset includes 37 variables of 902,297 observations for 985 different event types. Let’s see a partial view of the dataset and its structure.
# Showing some data from the raw datset
kable(head(stormdata_raw[, 1:9]), caption = "Storm raw dataset (partial)")
| STATE__ | BGN_DATE | BGN_TIME | TIME_ZONE | COUNTY | COUNTYNAME | STATE | EVTYPE | BGN_RANGE |
|---|---|---|---|---|---|---|---|---|
| 1 | 4/18/1950 0:00:00 | 0130 | CST | 97 | MOBILE | AL | TORNADO | 0 |
| 1 | 4/18/1950 0:00:00 | 0145 | CST | 3 | BALDWIN | AL | TORNADO | 0 |
| 1 | 2/20/1951 0:00:00 | 1600 | CST | 57 | FAYETTE | AL | TORNADO | 0 |
| 1 | 6/8/1951 0:00:00 | 0900 | CST | 89 | MADISON | AL | TORNADO | 0 |
| 1 | 11/15/1951 0:00:00 | 1500 | CST | 43 | CULLMAN | AL | TORNADO | 0 |
| 1 | 11/15/1951 0:00:00 | 2000 | CST | 77 | LAUDERDALE | AL | TORNADO | 0 |
str(stormdata_raw)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
For the purpose of our study we will work with a subset of the columns from the raw dataset. In fact, the only relevant columns for analyzing impact on people and properties are:
EVTYPE: name of the event type.FATALITIES: number of people involved in fatalities.INJURIES: number of people who suffered as injuries.PROPDMG: relative amount of property damages in USD.PROPDMGEXP: multiplier of the previous measurements (as powers of ten).CROPDMG: relative amount of crop damages in USD.CROPDMGEXP: multiplier of the previous measurements ((powers of ten).Moreover, many reported events in NOAA database have no consequences at all. According to these aspects we create a subset for those events that produce some kind of damage and only with the relevant columns (it will be also cached):
# Selecting events with impact; subsetting it to the variables of interest
storm_damage <- stormdata_raw %>%
filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0) %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
kable(head(storm_damage), caption = "Storm with damages (initial rows)")
| EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
|---|---|---|---|---|---|---|
| TORNADO | 0 | 15 | 25.0 | K | 0 | |
| TORNADO | 0 | 0 | 2.5 | K | 0 | |
| TORNADO | 0 | 2 | 25.0 | K | 0 | |
| TORNADO | 0 | 2 | 2.5 | K | 0 | |
| TORNADO | 0 | 2 | 2.5 | K | 0 | |
| TORNADO | 0 | 6 | 2.5 | K | 0 |
In terms of cleaning the biggest issue is the way that the amount of properties and crops damages are reported. In both cases a column indicates the base and another one the exponent as a power of ten, probably to save space in a historical moment (from the 50’s to the 70’s) where the memory available in processing systems was limited.
Moreover there are inconsistencies in the way that the exponent has been coded over the period under study. Let’s see the these. values:
# Showing the different exponent used with USD quantities
cat("PROPDMGEXP values: ", unique(storm_damage$PROPDMGEXP))
PROPDMGEXP values: K M B m + 0 5 6 4 h 2 7 3 H -
cat("CROPDMGEXP values: ", unique(storm_damage$CROPDMGEXP))
CROPDMGEXP values: M K m B ? 0 k
As it can be seen there is a mixture of digits, letters (uppercase and lowercase), and special characters, including blanks. Ufortunately, there is not a codebook or document in NOAA web site that describes the meaning of these symbols and the criteria used in the different periods, so the following replacement is based in sound assumptions:
There are an extremely low number of cases with “+”, “-” (PROPDMGEXP) or “?” (CROPDMGEXP) that can be discarded without impact in final estimations:
num_disc <- nrow(subset(storm_damage,
PROPDMGEXP == "+" | PROPDMGEXP == "-" |
CROPDMGEXP == "?"))
num_obs <- nrow(storm_damage)
storm_damage <- storm_damage %>%
filter(PROPDMGEXP != "+" & PROPDMGEXP != "-" &
CROPDMGEXP != "?")
cat("Discarded observations:", num_disc, "(out of", num_obs, "observations)")
Discarded observations: 12 (out of 254633 observations)
Taking this into account and after the cleaning now we can replace in every row the symbol for the right numeric exponent in order to normalize it:
# For the sake of clarity original symbols and correspondind replacement are visually aligned
# Searching and replacement will ignore upper and lowercase differences
exp_ini <- c("^b", "^m", "^k", "^h", "^$")
exp_rep <- c( 9, 6, 3, 2, 0)
# Replacing original symbols by numeric exponents according previous table
for (i in exp_ini) {
storm_damage$PROPDMGEXP <- sub(i, exp_rep[which(exp_ini == i)],
storm_damage$PROPDMGEXP,
ignore.case = TRUE)
storm_damage$CROPDMGEXP <- sub(i, exp_rep[which(exp_ini == i)],
storm_damage$CROPDMGEXP,
ignore.case = TRUE)
}
storm_damage$PROPDMGEXP <- as.integer(storm_damage$PROPDMGEXP)
storm_damage$CROPDMGEXP <- as.integer(storm_damage$CROPDMGEXP)
cat("PROPDMGEXP normalized: ", unique(storm_damage$PROPDMGEXP))
cat("\nCROPDMGEXP normalized: ", unique(storm_damage$CROPDMGEXP))
PROPDMGEXP normalized: 3 6 0 9 5 4 2 7
CROPDMGEXP normalized: 0 6 3 9
The last step before we proceed is to reorganize and tidy the dataset for getting the most appropriate representation for the subsequent analysis:
Gathering all kind of damages (people or properties) and their values in two columns (damage and value).
Grouping and summarizing all the events by type of event and type of damage.
Keeping only the groups where the total value is bigger than zero (not every type of event causes every type of damage).
# Restructuring events in a tidy way
events_tidy <- storm_damage %>%
mutate(PROPDMG = PROPDMG * 10 ^ PROPDMGEXP,
CROPDMG = CROPDMG * 10 ^ CROPDMGEXP) %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG) %>%
gather(DAMAGE, VALUE, -EVTYPE) %>%
group_by(EVTYPE, DAMAGE) %>%
summarise(TOT_VALUE = sum(VALUE)) %>%
filter(TOT_VALUE > 0)
Let’s see a sample of this tidy dataset:
# Printing a sample of the tidy dataset
set.seed(2371)
kable(events_tidy[sample(1:nrow(events_tidy), 5), ], caption = "Tidy working Event/Damages dataset (sample)")
| EVTYPE | DAMAGE | TOT_VALUE |
|---|---|---|
| STORM SURGE/TIDE | FATALITIES | 11 |
| RAIN | PROPDMG | 5300050 |
| WARM WEATHER | INJURIES | 2 |
| HAIL 125 | CROPDMG | 10000 |
| MUD SLIDES | PROPDMG | 50000 |
This dataset will be the basic working dataset in obtaining the results that are explained in the next section.
Starting from our tidy dataset events_tidy we can determine and plot which are the most harmful event types for people. First, we will consider together injuries and fatalities.
Let’s do a concrete subset to answer the question grouping and summarizing by event type and selecting the first 10 in descending order:
# Selecting the 10 events types with the highest impact on population
events_people <- events_tidy %>%
filter(DAMAGE == "FATALITIES" | DAMAGE == "INJURIES") %>%
group_by(EVTYPE) %>%
summarise(PEOPLE = sum(TOT_VALUE)) %>%
arrange(desc(PEOPLE)) %>%
slice(1:10)
We will see now the results:
g <- ggplot(events_people,
aes(x = reorder(EVTYPE, PEOPLE), y = PEOPLE, fill = EVTYPE)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Total population damaged (people number)",
x = "Event Type",
title = "Event types with the highest impact on population") +
coord_flip() +
theme(legend.position = "none",
plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 12),
plot.margin = unit(c(1,0,1,0), "cm"))
print(g)
Fig 1. Damage on populations by event type
As it can be seen from the graph, TORNADO is the event type that has caused more people being impacted by fatalities and injuries over these years, with a great difference compared to the rest of them.
Let’s see now if the result are the same when we consider Fatalities and Injuries separately. We have to select the 10 highest for each class of people damage.
# Selecting the 10 events types that cause the highest number of fatalities
events_fatal <- events_tidy %>%
filter(DAMAGE == "FATALITIES") %>%
group_by(EVTYPE) %>%
summarise(DAMAGE, PEOPLE = sum(TOT_VALUE)) %>%
arrange(desc(PEOPLE)) %>%
slice(1:10)
# Selecting the 10 events types that cause the highest number of injuried people
events_injur <- events_tidy %>%
filter(DAMAGE == "INJURIES") %>%
group_by(EVTYPE) %>%
summarise(DAMAGE, PEOPLE = sum(TOT_VALUE)) %>%
arrange(desc(PEOPLE)) %>%
slice(1:10)
We can put and plot both subsets togeteher for comparison:
events_people_bydmg <- rbind(events_fatal, events_injur)
g <- ggplot(events_people_bydmg,
aes(x = reorder(EVTYPE, PEOPLE), y = PEOPLE, fill = EVTYPE)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~DAMAGE , nrow = 2, ncol = 1) +
labs(y = "Total population damaged (people number)",
x = "Event Type",
title = "Event types with the highest impact on population by type of damage") +
coord_flip() +
theme(legend.position = "none",
plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 12),
panel.margin = unit(0.8, "cm"),
plot.margin = unit(c(1,0,1,0), "cm"))
print(g)
Fig 2. Comparing fatalities vs injuries on populations by event type
We see that the number of fatalities is, logically, small compared with injuries (about a tenth). The difference between event types in the case of fatalities is not as high as it is in injuries.
The second question is about the economic consequences of the different event types. We will proceed in the same way that in the previous section. First grouping and summarizing by event type the data related with properties and crops and then selecting the first 10 results in descending order:
# Selecting the 10 events types that cause the highest economic impact
events_proper <- events_tidy %>%
filter(DAMAGE == "PROPDMG" | DAMAGE == "CROPDMG") %>%
group_by(EVTYPE) %>%
summarise(MONEY = sum(TOT_VALUE) / 10 ^ 9) %>%
arrange(desc(MONEY)) %>%
slice(1:10)
Let’s plot the results:
g <- ggplot(events_proper,
aes(x = reorder(EVTYPE, MONEY), y = MONEY, fill = MONEY)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous("Total economic damage (Billion USD)",
breaks = seq(0, 150, by = 30)) +
scale_fill_continuous(guide = "colourbar",
low = "orangered", high = "darkred",
name = "Damage", breaks = seq(0, 150, by = 30)) +
labs(x = "Event Type",
title = "Events with the highest economic impact (properties + crops)") +
coord_flip() +
theme(plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 12),
panel.margin = unit(0.8, "cm"),
plot.margin = unit(c(1,0,1,0), "cm"))
print(g)
Fig 3. Damage on properties and crops by event type
In this cases the most harmful event type is FLOOD, which accounts for approximateky 150 billion USD losses, followed by HURRICANE/TYPHOON.