We analyzed the U.S. National Oceanic and Atmospheric Administration’s storm database covering the past 60 years. Our aim was to identify the hydrometeorological events that were most harmful to human health (injuries and fatalities) and that had the greatest economic consequences in terms of property damage. We found that the most harmful meteorological event was the tornado, which caused over 90,000 direct injuries in the last 60 years. Tornadoes were also the deadliest events, with about 5,600 deaths during the evaluated period. Finally, flooding had the greatest economic consequences, with over 150 billion dollars in property damage.
In this section we describe (in words and code) how the data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database were loaded into R and processed for analysis. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. Data processing and analysis were done using R version 3.2.0 “Full of Ingredients” (R Foundation for Statistical Computing, Vienna, Austria).
library(knitr)
opts_chunk$set(message = FALSE, fig.width = 9)
library(R.utils)
library(readr)
library(dplyr)
library(stringr)
library(ggvis)
We first downloaded the NOAA dataset from the Johns Hopkins Reproducible Research course link on Coursera’s web site. The data for this assignment came in the form of a CSV file compressed with the bzip2 algorithm to reduce its size. Along with the dataset, we downloaded 2 more files that describe how the variables in the dataset are defined: the National Weather Service Storm Data Documentation (referenced here as the NWS Manual) and the National Climatic Data Center Storm Events FAQ. We then decompressed the Storm_Data.bz2 file and saved the result as Storm_Data.csv in the Files subdirectory of the working directory. Finally, we loaded the dataset into a data frame named Data and selected the variables to be used.
# Data file
URL_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL_data, "Files/Storm_Data.bz2", method = "wget")
# Storm data documentation
URL_Manual <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(URL_Manual, "Files/NMS_Manual.pdf", method = "wget")
# FAQ
URL_FAQ <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(URL_FAQ, "Files/FAQ.pdf", method = "wget")
# Unzipping the dataset
bunzip2("Files/Storm_Data.bz2", "Files/Storm_Data.csv")
# Loading the dataset
Data <- read_csv("Files/Storm_Data.csv")
# Selecting the variables
Data <- Data %>%
  select(EVTYPE, PROPDMGEXP, PROPDMG, FATALITIES, INJURIES)
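As a side note on the decompression step, base R can usually read a bzip2-compressed CSV without unpacking it first, because `read.csv()` opens a file path through `file()`, which detects the compression when reading. The sketch below illustrates this on a tiny made-up file; the temporary file and its three lines are assumptions for illustration, not part of the NOAA data. `readr::read_csv()` likewise decompresses files whose names end in `.bz2`.

```r
# Build a tiny bzip2-compressed CSV in a temporary file
# (a made-up stand-in for the real storm data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES,INJURIES",
             "TORNADO,0,15",
             "HAIL,0,0"), con)
close(con)

# read.csv() opens the path via file(), which detects bzip2 compression
toy <- read.csv(tmp, stringsAsFactors = FALSE)
nrow(toy)
```

Skipping the explicit decompression saves disk space, at the cost of decompressing again on every read.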
The major challenge in this analysis was the inconsistent recording of events in the EVTYPE variable. The NWS Manual specifies 48 events (page 6), whereas we found 985 levels in EVTYPE. Our first task was therefore to recode these 985 levels of EVTYPE into the 48 predefined events, using a combination of string replacements and regular expressions. When it was not possible to assign a predefined event to a particular level, we coded that level as NA. We finally filtered the dataset to exclude rows with NA values.
source("Files/Fix_EVTYPE.R")
Data <- Data %>%
  filter(EVTYPE != "NA")
The final R script used for recoding, as indicated in the previous R chunk, is available here. If the link doesn’t open directly please right click on it and select “Open link in new …”.
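To give a flavour of the kind of regular-expression recoding described above, here is a minimal sketch in base R. The helper `recode_evtype()` and its four patterns are hypothetical and chosen only for illustration; they are not the rules in Fix_EVTYPE.R, which covers many more cases.

```r
# Hypothetical helper: map free-text event labels onto official event names.
# The patterns are illustrative only, not the rules used in Fix_EVTYPE.R.
recode_evtype <- function(x) {
  x <- toupper(trimws(x))
  out <- rep(NA_character_, length(x))
  # Order matters: the first matching rule wins,
  # so "FLASH.*FLOOD" must come before the generic "FLOOD"
  rules <- c("TORNADO"           = "Tornado",
             "FLASH.*FLOOD"      = "Flash Flood",
             "FLOOD"             = "Flood",
             "TSTM|THUNDERSTORM" = "Thunderstorm Wind")
  for (pattern in names(rules)) {
    hit <- is.na(out) & grepl(pattern, x)
    out[hit] <- rules[[pattern]]
  }
  out  # labels that match no rule stay NA
}

raw <- c("TSTM WIND", "Flash Flooding", "  tornado f1",
         "RIVER FLOOD", "SUMMARY OF MAY 22")
recode_evtype(raw)
```

Labels that match no rule remain `NA`, which mirrors the approach of discarding events that cannot be assigned to a predefined category.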
We also noted similar inconsistencies in the PROPDMGEXP variable, which has only 3 predefined levels (NWS Manual, page 12) but 18 levels in the downloaded dataset. Our second task was therefore to filter and recode PROPDMGEXP to include only the predefined levels. For this, we kept only the levels that made sense, i.e., m, M, K, and B. We then replaced these string values with the corresponding numerical values. Finally, to estimate the total economic damage we combined PROPDMG and PROPDMGEXP, creating a new variable PROPDMGTOTAL:
Data <- Data %>%
  filter(PROPDMGEXP %in% c("m", "M", "K", "B"))
Data$PROPDMGEXP <- Data$PROPDMGEXP %>%
  plyr::revalue(c("m" = 1e+6, "M" = 1e+6, "K" = 1e+3, "B" = 1e+9)) %>%
  as.numeric()
Data <- Data %>%
  mutate(PROPDMGTOTAL = PROPDMG * PROPDMGEXP)
This is the dataset we used for the data analysis. All datasets, scripts and associated files are available here.
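The exponent recoding described above can also be expressed with a named lookup vector in base R. The following is only a sketch on a made-up 4-row data frame (the values in `toy` are assumptions for illustration); it uses the same four accepted codes and multipliers as the analysis.

```r
# Named lookup vector: exponent code -> multiplier (same codes as in the analysis)
multiplier <- c(m = 1e6, M = 1e6, K = 1e3, B = 1e9)

# Made-up example rows; "?" stands for one of the undocumented codes
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "m", "B", "?"),
                  stringsAsFactors = FALSE)

# Keep only rows whose code is one of the recognised levels
toy <- toy[toy$PROPDMGEXP %in% names(multiplier), ]

# Index the lookup by name, drop the names, and scale the damage figure
toy$PROPDMGTOTAL <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
toy$PROPDMGTOTAL
```

For example, a PROPDMG of 25 with exponent K becomes 25,000 dollars, and 1.2 with exponent B becomes 1.2 billion dollars.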
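We first summarized the total numbers of injuries and fatalities and the property damage estimates by event type. For this, we created a summary table containing the following variables: EVTYPE (event type), ALL_INJURIES (total injuries), ALL_FATALITIES (total fatalities), and ALL_PROPDMG (total property damage, in dollars).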
Harmful <- Data %>%
  group_by(EVTYPE) %>%
  summarize(ALL_INJURIES = sum(INJURIES),
            ALL_FATALITIES = sum(FATALITIES),
            ALL_PROPDMG = sum(PROPDMGTOTAL))
Harmful %>%
  mutate(ALL_PROPDMG = format(ALL_PROPDMG, big.mark = ",")) %>%
  kable(align = c("l", "c", "c", "c"))
| EVTYPE | ALL_INJURIES | ALL_FATALITIES | ALL_PROPDMG |
|---|---|---|---|
| Astronomical Low Tide | 0 | 0 | 9,745,000 |
| Avalanche | 71 | 103 | 3,721,800 |
| Blizzard | 779 | 70 | 664,913,950 |
| Coastal Flood | 6 | 6 | 449,682,060 |
| Cold/Wind Chill | 177 | 188 | 245,579,450 |
| Debris Flow | 49 | 29 | 327,408,100 |
| Dense Fog | 709 | 51 | 22,829,500 |
| Dense Smoke | 0 | 0 | 100,000 |
| Drought | 28 | 4 | 1,053,038,600 |
| Dust Devil | 31 | 1 | 719,130 |
| Dust Storm | 184 | 5 | 5,619,000 |
| Extreme Cold/Wind Chill | 0 | 0 | 755,000 |
| Flash Flood | 1608 | 776 | 16,991,195,460 |
| Flood | 6757 | 410 | 150,113,968,500 |
| Frost/Freeze | 2055 | 62 | 3,999,037,010 |
| Funnel Cloud | 2 | 0 | 194,600 |
| Hail | 705 | 32 | 17,619,950,720 |
| Heat | 2208 | 435 | 20,125,750 |
| Heavy Rain | 137 | 50 | 3,253,891,190 |
| Heavy Snow | 876 | 75 | 1,027,749,740 |
| High Surf | 178 | 81 | 101,510,500 |
| High Wind | 1167 | 195 | 5,881,880,960 |
| Hurricane (Typhoon) | 1328 | 106 | 85,356,410,010 |
| Lakeshore Flood | 0 | 0 | 7,570,000 |
| Lightning | 3825 | 410 | 7,365,530,370 |
| Marine High/Strong/Thunderstorm Wind | 26 | 16 | 7,186,340 |
| Rip Current | 157 | 216 | 163,000 |
| Seiche | 0 | 0 | 980,000 |
| Sleet | 0 | 0 | 0 |
| Storm Surge/Tide | 13 | 24 | 47,965,474,000 |
| Strong Wind | 265 | 84 | 188,401,740 |
| Thunderstorm Wind | 2746 | 169 | 4,494,356,940 |
| Tornado | 90447 | 5588 | 56,941,932,180 |
| Tropical Depression | 0 | 0 | 1,737,000 |
| Tropical Storm | 380 | 56 | 7,714,390,550 |
| Tsunami | 129 | 33 | 144,062,000 |
| Volcanic Ash | 0 | 0 | 500,000 |
| Waterspout | 71 | 5 | 61,235,200 |
| Wildfire | 1320 | 79 | 8,496,628,500 |
| Winter Weather | 1438 | 144 | 6,776,307,750 |
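As a cross-check that does not depend on dplyr, the same kind of per-event totals can be computed with `aggregate()` from base R. The sketch below runs on a made-up 5-row data frame (the values in `toy` are assumptions for illustration, not NOAA figures).

```r
# Made-up example rows with the same column names as the analysis dataset
toy <- data.frame(
  EVTYPE       = c("Tornado", "Flood", "Tornado", "Flood", "Hail"),
  INJURIES     = c(10, 2, 5, 0, 1),
  FATALITIES   = c(1, 0, 2, 1, 0),
  PROPDMGTOTAL = c(1e6, 5e6, 2e6, 1e6, 5e4),
  stringsAsFactors = FALSE
)

# Sum the three outcome columns within each event type
totals <- aggregate(cbind(INJURIES, FATALITIES, PROPDMGTOTAL) ~ EVTYPE,
                    data = toy, FUN = sum)
totals
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, so the two approaches can differ if any outcome column contains NA.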
We then focused on answering the 2 main questions of this study.
To answer the first question (which events were most harmful to human health), we identified the 5 events with the highest total number of injuries and the 5 events with the highest total number of deaths as a direct consequence of the event.
The 5 events that caused the most injuries are shown in this table:
Most_Injuries <- Harmful %>%
  select(EVTYPE, ALL_INJURIES) %>%
  arrange(desc(ALL_INJURIES)) %>%
  head(5)
Most_Injuries %>%
  mutate(ALL_INJURIES = format(ALL_INJURIES, big.mark = ",")) %>%
  kable(align = c("l", "c"))
| EVTYPE | ALL_INJURIES |
|---|---|
| Tornado | 90,447 |
| Flood | 6,757 |
| Lightning | 3,825 |
| Thunderstorm Wind | 2,746 |
| Heat | 2,208 |
The following plot shows the previous results:
Most_Injuries %>%
  ggvis(~EVTYPE, ~ALL_INJURIES) %>%
  layer_bars(fill = ~EVTYPE) %>%
  add_axis("y", title_offset = 60)
The 5 events that caused the most fatalities are shown in this table:
Most_Fatalities <- Harmful %>%
  select(EVTYPE, ALL_FATALITIES) %>%
  arrange(desc(ALL_FATALITIES)) %>%
  head(5)
Most_Fatalities %>%
  mutate(ALL_FATALITIES = format(ALL_FATALITIES, big.mark = ",")) %>%
  kable(align = c("l", "c"))
| EVTYPE | ALL_FATALITIES |
|---|---|
| Tornado | 5,588 |
| Flash Flood | 776 |
| Heat | 435 |
| Flood | 410 |
| Lightning | 410 |
The following plot shows the previous results:
Most_Fatalities %>%
  ggvis(~EVTYPE, ~ALL_FATALITIES) %>%
  layer_bars(fill = ~EVTYPE) %>%
  add_axis("y", title_offset = 60)
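In the fatalities table above, Flood and Lightning are tied at 410 deaths. With a cut-off of 5 both happen to be kept, but `head(n)` would silently drop one of them for a smaller n. The sketch below shows one possible way to keep ties at the cut-off, using `rank()` from base R; the helper `top_n_with_ties()` is hypothetical, and the figures are copied from the summary table.

```r
# Fatality totals copied from the summary table above
fatal <- data.frame(
  EVTYPE         = c("Tornado", "Flash Flood", "Heat",
                     "Flood", "Lightning", "Rip Current"),
  ALL_FATALITIES = c(5588, 776, 435, 410, 410, 216),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep every row whose rank is within the top n,
# where tied values share the smallest rank of their group
top_n_with_ties <- function(df, column, n) {
  ranks <- rank(-df[[column]], ties.method = "min")
  keep <- df[ranks <= n, ]
  keep[order(-keep[[column]]), ]
}

# Asking for the top 4 returns 5 rows, because Flood and Lightning tie for 4th
top4 <- top_n_with_ties(fatal, "ALL_FATALITIES", 4)
nrow(top4)
```

Whether ties should be kept or broken is a reporting choice; the point is only to make that choice explicit.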
To answer the second question (which events had the greatest economic consequences), we identified the 5 events with the highest property damage costs, as shown in this table:
Most_Damage <- Harmful %>%
  select(EVTYPE, ALL_PROPDMG) %>%
  arrange(desc(ALL_PROPDMG)) %>%
  head(5)
Most_Damage %>%
  mutate(ALL_PROPDMG = format(ALL_PROPDMG, big.mark = ",")) %>%
  kable(align = c("l", "c"))
| EVTYPE | ALL_PROPDMG |
|---|---|
| Flood | 150,113,968,500 |
| Hurricane (Typhoon) | 85,356,410,010 |
| Tornado | 56,941,932,180 |
| Storm Surge/Tide | 47,965,474,000 |
| Hail | 17,619,950,720 |
The following plot shows the previous results:
Most_Damage %>%
  ggvis(~EVTYPE, ~ALL_PROPDMG) %>%
  layer_bars(fill = ~EVTYPE) %>%
  add_axis("y", title_offset = 120)
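Dollar totals of this size are easier to read when expressed in billions. The following sketch rescales the 5 damage totals from the table above; the vector name `damage_billions` is an assumption introduced here for illustration.

```r
# Property damage totals (in dollars) copied from the table above
damage <- c("Flood"               = 150113968500,
            "Hurricane (Typhoon)" = 85356410010,
            "Tornado"             = 56941932180,
            "Storm Surge/Tide"    = 47965474000,
            "Hail"                = 17619950720)

# Rescale to billions of dollars, rounded to one decimal place
damage_billions <- round(damage / 1e9, 1)
damage_billions
```

Plotting the rescaled values would also shorten the axis labels, which may reduce the need for a large title offset on the y axis.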
By far, the most harmful meteorological event has been the tornado, which has caused over 90,000 injuries in the last 60 years. Floods, lightning, thunderstorm wind, and heat followed tornadoes as the events causing the most injuries. Tornadoes were also the deadliest events, with about 5,600 deaths in the past 60 years, followed by flash floods, heat, floods, and lightning. Finally, flooding had the greatest economic consequences, with over 150 billion dollars in property damage. Other hydrometeorological events that caused great property damage were hurricanes, tornadoes, storm surges/tides, and hail.