We analyzed the U.S. National Oceanic and Atmospheric Administration's storm database, which covers roughly the past 60 years. Our aim was to identify the hydrometeorological events that were most harmful to human health (injuries and fatalities) and had the greatest economic consequences in terms of property damage. We found that the most harmful meteorological event was the tornado, which caused over 90,000 direct injuries in the last 60 years. Tornadoes were also the deadliest events, with about 5,600 deaths during the evaluated period. Finally, flooding had the greatest economic consequences, with over 150 billion dollars in property damage.
In this section we describe (in words and code) how the data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database were loaded into R and processed for analysis. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. Data processing and analysis were done using R version 3.1.0 (R Foundation for Statistical Computing, Vienna, Austria).
The NOAA database was downloaded from here. This database came in the form of a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The file was saved as StormData.csv.bz2 in the working directory.
Two more files were downloaded, indicating how the variables in the dataset are constructed/defined:
We then unzipped the StormData.csv.bz2 dataset using the bunzip2 function from the R.utils R package and saved the unzipped dataset to a file named StormData.csv in the working directory. We then loaded the dataset into a data.frame, storm.data, containing all 902,297 observations of 37 variables.
storm.data <- read.csv("StormData.csv", na.strings = "")
We decided to convert all blank fields (empty strings) into NAs. After viewing the data.frame we noted that the variable EVTYPE contained some backslashes, so we converted these backslashes into forward slashes to avoid errors while recoding levels.
storm.data$EVTYPE <- gsub("\\", "/", storm.data$EVTYPE, fixed = TRUE)
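The loading and cleaning steps above can be tried on a small scale without the full download. The sketch below is only an illustration: the file contents and the `toy` names are made up, and it reads the compressed file directly through a `bzfile()` connection instead of unzipping it first with `bunzip2`.

```r
# Write a tiny bzip2-compressed CSV to a temporary file.
# The rows are invented for illustration; they are not taken from the NOAA data.
toy.path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy.path, open = "w")
writeLines(c("EVTYPE,PROPDMGEXP",
             "TORNADO,K",
             "HEAVY RAIN\\SNOW,",   # one literal backslash, empty second field
             "FLOOD,M"), con)
close(con)

# Read it back; na.strings = "" turns empty fields into NA.
toy <- read.csv(bzfile(toy.path), na.strings = "", stringsAsFactors = FALSE)

# fixed = TRUE makes gsub() treat "\\" as a literal backslash, not a regex escape.
toy$EVTYPE <- gsub("\\", "/", toy$EVTYPE, fixed = TRUE)
print(toy)
```

With `fixed = TRUE` the pattern is matched character for character, which avoids having to escape the backslash a second time for the regular-expression engine.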
The major challenge for the analysis was the inconsistent reporting of events in the EVTYPE variable. Indeed, the NWS Manual specifies 48 permitted events (page 6), while we found 985 levels in EVTYPE. So, our first task was to recode these 985 levels of EVTYPE into the 48 permitted levels. When a level could not be assigned to a permitted level, we coded it as NA.
We also noted similar inconsistencies in the PROPDMGEXP variable, which has only 3 permitted levels (NWS Manual, page 12) but 18 levels in the dataset. So, our second task was to recode PROPDMGEXP to include only the permitted levels. Again, when a level could not be assigned to a permitted level, we coded it as NA.
To keep this document readable, we moved all of this recoding into a separate R script:
source("RecodingNWS.R")
## Loading required package: plyr
This script is published here in GitHub, along with the R script used as the basis of this report.
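The recoding script itself is not reproduced in this report. As a rough idea of what such a recoding can look like, the sketch below maps free-text labels onto permitted event names with a small table of regular expressions. The `rules` table, the `recode.event` helper, and the sample labels are all hypothetical and are not taken from RecodingNWS.R.

```r
# Hypothetical pattern -> permitted-level table (only three of the 48 levels shown).
rules <- c("^(TSTM|THUNDERSTORM) WIND" = "Thunderstorm Wind",
           "^TORNADO"                  = "Tornado",
           "^FLASH FLOOD"              = "Flash Flood")

# Assign the first matching permitted level; labels with no match stay NA.
recode.event <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (pattern in names(rules)) {
    hit <- is.na(out) & grepl(pattern, x, ignore.case = TRUE)
    out[hit] <- rules[[pattern]]
  }
  out
}

# Illustrative raw labels, including one that matches no rule.
raw <- c("TSTM WIND", "Thunderstorm Winds", "TORNADO F0",
         "FLASH FLOODING", "SUMMARY OF MAY 10")
recoded <- recode.event(raw, rules)
print(data.frame(raw, recoded))
```

Because earlier rules win, the order of the table matters when a label could match more than one pattern.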
After fixing the levels of EVTYPE and PROPDMGEXP we recoded these variables as factor and numeric variables, respectively:
storm.data$EVTYPE <- as.factor(storm.data$EVTYPE)
storm.data$PROPDMGEXP <- as.numeric(levels(storm.data$PROPDMGEXP))[storm.data$PROPDMGEXP]
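The second line above relies on a common idiom for turning a factor with numeric labels into the numbers those labels spell out. A minimal sketch of why the idiom is needed follows; it assumes, without having checked the recoding script, that the recoded PROPDMGEXP levels are numeric strings such as "1000". The toy factor is invented for illustration.

```r
# Toy factor with numeric labels (assumed to resemble the recoded PROPDMGEXP).
exp.factor <- factor(c("1000", "1e+06", "1000", "1e+09"))

# as.numeric() on a factor returns the internal integer codes, not the labels.
codes <- as.numeric(exp.factor)

# Converting the levels first and then indexing by the factor recovers the values.
values <- as.numeric(levels(exp.factor))[exp.factor]

print(data.frame(label = as.character(exp.factor), codes, values))
```

An equivalent but slower alternative is `as.numeric(as.character(exp.factor))`, which converts every element rather than only the distinct levels.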
To estimate the total property damage we combined PROPDMG and PROPDMGEXP, creating a new variable PROPDMGTOTAL:
storm.data$PROPDMGTOTAL <- storm.data$PROPDMG * storm.data$PROPDMGEXP
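As a worked example of this multiplication, the sketch below starts from the letter codes K, M, and B (thousands, millions, and billions of dollars) and looks up a numeric multiplier for each. The lookup vector and the toy values are assumptions made for illustration; the actual mapping lives in the recoding script.

```r
# Assumed multipliers for the three permitted magnitude codes.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Toy records: a damage figure and its magnitude code (NA = not reported).
propdmg    <- c(25, 2.5, 10, 1.5)
propdmgexp <- c("K", "M", NA, "B")

# Named-vector lookup; an NA code yields an NA multiplier.
mult  <- unname(multiplier[propdmgexp])
total <- propdmg * mult
print(data.frame(propdmg, propdmgexp, total))
```

The third record shows that a missing magnitude code makes the product NA, so missing values have to be handled explicitly before any totals are summed.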
Finally, we created a new data frame with the variables that were used for estimating the harmful effects and economic consequences of the hydrometeorological events included in the NOAA dataset:
dataset <- storm.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMGTOTAL")]
dataset$PROPDMGTOTAL[is.na(dataset$PROPDMGTOTAL)] <- 0
As shown, we recoded the NAs to 0s: these cells were originally empty, so we assumed that the reported event caused no property damage. We then removed the original data.frame:
rm(storm.data)
The dataset data.frame was used for analyzing the data and reporting the Results.
We first summarized the total number of injuries and fatalities, together with the property damage estimates, by event type. For this, we created a new R object named harmful using the ddply function from the plyr R package:
require(plyr)
harmful <- ddply(dataset, "EVTYPE", summarize, ALL.INJURIES = sum(INJURIES),
ALL.FATALITIES = sum(FATALITIES), ALL.PROPDMG = sum(PROPDMGTOTAL))
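For readers without plyr installed, the same kind of per-event summary can be sketched with base R's `aggregate()`. This is a swapped-in technique, not the method used in this report, and the toy data frame below is invented for illustration.

```r
# Toy stand-in for the `dataset` data.frame (values are made up).
toy <- data.frame(
  EVTYPE       = c("Tornado", "Flood", "Tornado", "Flood", "Hail"),
  INJURIES     = c(10, 0, 5, 2, 1),
  FATALITIES   = c(1, 0, 2, 0, 0),
  PROPDMGTOTAL = c(1e6, 5e6, 2e5, 1e6, 3e4)
)

# Sum the three measures within each event type.
harmful.toy <- aggregate(cbind(INJURIES, FATALITIES, PROPDMGTOTAL) ~ EVTYPE,
                         data = toy, FUN = sum)
names(harmful.toy) <- c("EVTYPE", "ALL.INJURIES", "ALL.FATALITIES", "ALL.PROPDMG")
print(harmful.toy)
```

One difference to keep in mind: the formula interface of `aggregate()` drops rows with missing values by default, so events whose EVTYPE is NA would not get a row of their own here.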
The harmful object includes the following variables:
The full table is shown here:
format(harmful, big.mark = ",", scientific = FALSE)
## EVTYPE ALL.INJURIES ALL.FATALITIES ALL.PROPDMG
## 1 Astronomical Low Tide 0 0 320,000
## 2 Avalanche 171 225 8,721,800
## 3 Blizzard 805 101 659,913,950
## 4 Coastal Flood 7 6 459,107,060
## 5 Cold/Wind Chill 60 165 2,544,000
## 6 Debris Flow 55 44 327,258,100
## 7 Dense Fog 342 18 9,674,000
## 8 Dense Smoke 0 0 100,000
## 9 Drought 48 10 1,053,038,600
## 10 Dust Devil 43 2 719,130
## 11 Dust Storm 440 22 5,619,000
## 12 Excessive Heat 6,749 2,059 7,869,200
## 13 Extreme Cold/Wind Chill 260 316 133,290,400
## 14 Flash Flood 1,880 1,065 16,991,233,460
## 15 Flood 6,794 482 150,129,365,500
## 16 Freezing Fog 735 63 15,337,500
## 17 Frost/Freeze 234 26 51,246,700
## 18 Funnel Cloud 3 0 199,600
## 19 Hail 1,372 15 15,975,650,720
## 20 Heat 2,479 1,114 12,257,050
## 21 Heavy Rain 280 101 3,238,397,690
## 22 Heavy Snow 1,162 149 979,442,740
## 23 High Surf 251 179 102,000,000
## 24 High Wind 1,476 295 5,992,380,960
## 25 Hurricane (Typhoon) 1,333 135 85,356,410,010
## 26 Ice Storm 2,208 96 3,950,832,310
## 27 Lake-Effect Snow 0 0 40,682,000
## 28 Lakeshore Flood 0 0 7,570,000
## 29 Lightning 5,231 817 933,732,280
## 30 Marine Hail 0 0 4,000
## 31 Marine High Wind 9 9 1,312,510
## 32 Marine Strong Wind 22 14 418,330
## 33 Marine Thunderstorm Wind 34 19 5,857,400
## 34 Rip Current 529 572 163,000
## 35 Seiche 0 0 980,000
## 36 Sleet 0 2 1,901,000
## 37 Storm Surge/Tide 45 28 47,965,274,000
## 38 Strong Wind 395 135 188,106,740
## 39 Thunderstorm Wind 9,510 714 10,970,557,630
## 40 Tornado 91,407 5,661 58,593,098,230
## 41 Tropical Depression 0 0 1,737,000
## 42 Tropical Storm 383 66 7,714,390,550
## 43 Tsunami 129 33 144,062,000
## 44 Volcanic Ash 0 0 500,000
## 45 Waterspout 29 3 9,564,200
## 46 Wildfire 1,608 90 8,496,628,500
## 47 Winter Storm 1,353 217 6,749,497,250
## 48 Winter Weather 615 62 27,310,500
## 49 NA 42 15 2,365,500
We then focused on answering the 2 main questions of this study.
To answer this question we identified the 5 event types with the highest total number of injured people and the 5 with the highest total number of people who died as a direct consequence of the event. For the total number of injured people we created a Q1 R object in which we selected the 5 most harmful events:
Q1 <- order(harmful$ALL.INJURIES, decreasing = TRUE)[1:5]
The 5 most harmful events that caused injuries are shown in this table:
format(harmful[Q1, c(1, 2)], big.mark = ",")
## EVTYPE ALL.INJURIES
## 40 Tornado 91,407
## 39 Thunderstorm Wind 9,510
## 15 Flood 6,794
## 12 Excessive Heat 6,749
## 29 Lightning 5,231
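The order-then-subset pattern used for Q1 can be wrapped in a small helper. The `top.n` function below is a hypothetical convenience, not part of the original analysis; the injury totals are copied from the full table above for a handful of event types.

```r
# A few rows of the summary table, deliberately out of order.
injuries <- data.frame(
  EVTYPE       = c("Hail", "Flood", "Tornado", "Heat", "Lightning",
                   "Thunderstorm Wind", "Excessive Heat"),
  ALL.INJURIES = c(1372, 6794, 91407, 2479, 5231, 9510, 6749)
)

# Return the n rows with the largest values in `column`.
# head() is used instead of [1:n] so that asking for more rows than exist
# does not produce rows full of NA.
top.n <- function(df, column, n = 5) {
  idx <- order(df[[column]], decreasing = TRUE)
  df[head(idx, n), c("EVTYPE", column)]
}

top3 <- top.n(injuries, "ALL.INJURIES", n = 3)
print(top3)
```

The same helper could be reused for fatalities or property damage by changing the `column` argument.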
The results are shown in this plot:
barplot(harmful[Q1, 2], xlab = "Event Type", ylab = "Total No. Injuries", cex.lab = 1.5,
names.arg = harmful$EVTYPE[Q1])
For the total number of people who died we created a Q2 R object in which we selected the 5 most deadly events:
Q2 <- order(harmful$ALL.FATALITIES, decreasing = TRUE)[1:5]
The 5 most deadly events are shown in this table:
format(harmful[Q2, c(1, 3)], big.mark = ",")
## EVTYPE ALL.FATALITIES
## 40 Tornado 5,661
## 12 Excessive Heat 2,059
## 20 Heat 1,114
## 14 Flash Flood 1,065
## 29 Lightning 817
These results are shown in this plot:
barplot(harmful[Q2, 3], xlab = "Event Type", ylab = "Total No. Fatalities",
cex.lab = 1.5, names.arg = harmful$EVTYPE[Q2])
To answer this question we identified the 5 event types that caused the highest total property damage. For the total property damage we created a Q3 R object in which we selected these 5 events:
Q3 <- order(harmful$ALL.PROPDMG, decreasing = TRUE)[1:5]
The events that caused the highest property damage are shown in this table:
format(harmful[Q3, c(1, 4)], big.mark = ",", scientific = FALSE)
## EVTYPE ALL.PROPDMG
## 15 Flood 150,129,365,500
## 25 Hurricane (Typhoon) 85,356,410,010
## 40 Tornado 58,593,098,230
## 37 Storm Surge/Tide 47,965,274,000
## 14 Flash Flood 16,991,233,460
These results are shown in this plot:
barplot(harmful[Q3, 4], xlab = "Event Type", ylab = "Total Property Damage",
cex.lab = 1.5, names.arg = harmful$EVTYPE[Q3])
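With totals in the tens of billions and long event names, the default axis labels can be hard to read. The sketch below is one possible variant of the plot, not the one used in this report: it rescales the damage to billions of dollars and rotates the bar labels. The figures are copied from the table above; the margin and label settings are arbitrary choices.

```r
# Top 5 property-damage totals, as reported in the table above.
top.damage <- data.frame(
  EVTYPE      = c("Flood", "Hurricane (Typhoon)", "Tornado",
                  "Storm Surge/Tide", "Flash Flood"),
  ALL.PROPDMG = c(150129365500, 85356410010, 58593098230,
                  47965274000, 16991233460)
)
top.damage$BILLIONS <- top.damage$ALL.PROPDMG / 1e9

pdf(NULL)                        # off-screen device; omit in interactive use
op <- par(mar = c(9, 5, 2, 1))   # wider bottom margin for rotated labels
mids <- barplot(top.damage$BILLIONS, names.arg = top.damage$EVTYPE,
                las = 2, cex.names = 0.8,
                ylab = "Total property damage (billion dollars)")
par(op)                          # restore the previous margins
invisible(dev.off())
```

`barplot()` returns the x positions of the bar midpoints, which is handy if value labels are later added with `text()`.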
By far, the most harmful meteorological event has been the tornado, which has caused over 90,000 injuries in the last 60 years. Thunderstorm wind, flood, excessive heat, and lightning followed tornadoes as the next most harmful events. Tornadoes were also the deadliest events, with about 5,600 deaths in the past 60 years, followed by excessive heat, heat, flash flood, and lightning. Finally, flooding had the greatest economic consequences, with over 150 billion dollars in property damage. Other hydrometeorological events that caused great property damage were hurricanes, tornadoes, storm surges/tides, and flash flooding.