The goal of this analysis is to determine which types of weather events are most harmful to population health and the United States economy, to inform policy-makers about preventative measures and preparation for severe weather events, and to guide the prioritization and allocation of resources across event types. This report uses data distributed by the National Oceanic and Atmospheric Administration (NOAA). It follows a simple approach to produce two rankings, each listing the most dangerous or destructive weather event types observed in the United States between 1996 and 2011. A list of the storm events and references used can be found in the Appendix.
Storms and other severe weather events can cause both public health and economic problems for communities. Many severe events result in fatalities, injuries, and property damage, so preparing for those outcomes is crucial. This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage [1].
Policy makers need to make informed decisions for allocating resources in response to these weather events. This report attempts to answer two specific questions:
- Across the United States, which types of weather events are most harmful with respect to population health?
- Across the United States, which types of weather events have the most severe economic consequences?
The analysis uses a simple ranking approach with the following steps:
1. Start with the raw data set of recorded weather events loaded into the R environment.
2. Choose variables that indicate population health (e.g. number of injuries the event caused) and variables that indicate economic consequences (e.g. monetary damage caused).
3. Create a subset of the raw data containing only the relevant variables.
4. Exclude the observations before the 1996 standardization.
5. Reformat/calculate the variables to be analyzed.
6. Aggregate the values per event type (e.g. total number of injuries or total monetary damage), then sort the data based on the aggregated values.
7. Select the top 10 observations: the event types with the most impact on health or the most economic damage.
The result of the analysis is two rankings: The 10 types of weather events most harmful to population health across the United States and the 10 types of weather events with the most severe economic consequences across the United States.
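The aggregate, sort, and select steps can be sketched on a small made-up data frame. This is only an illustration: the `toy` values are hypothetical, and base R's `aggregate()` stands in for the aggregation function used later in the report.

```r
# Toy data: hypothetical events, not taken from the NOAA file
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO"),
  FATALITIES = c(1, 3, 0, 0, 2),
  INJURIES   = c(4, 10, 1, 0, 7),
  stringsAsFactors = FALSE
)

# Aggregate per event type, then sort by the combined total
toy.sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
toy.sum$total <- toy.sum$FATALITIES + toy.sum$INJURIES
toy.sum <- toy.sum[order(-toy.sum$total), ]

# Keep the top n event types (n = 2 here; the report uses 10)
toy.top <- head(toy.sum, 2)
toy.top
```

In this toy case the tornado rows sum to the largest total, so TORNADO ranks first and FLOOD second.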
The data set used in this analysis is provided by the National Oceanic and Atmospheric Administration (NOAA) as a zipped CSV file called “Storm Data”. The Storm Data is the NOAA’s official publication which documents the occurrence of storms and other significant natural hazards having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce [3].
The document titled NWS Directive 10-1605, published by NOAA, corresponds to this data, and is used as the authoritative source of information. Documentation of the database used in this analysis and how variables are constructed/defined can be found in the “Storm Event Types” section of the Appendix.
The Storm Events Database contains records used to create the official NOAA Storm Data publication including: (1) The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce; (2) Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and (3) Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event [1].
The NOAA dataset contains observations from 1950 to 2011. However, the details page in reference [1] notes that only events recorded from 1996 onward use the 48 standardized event types specified in the NWS Directive [1]. Since the data before 1996 does not have standardized event types, observations before 1996 are removed for this analysis.
Furthermore, only observations whose event type matches one of the 48 events defined in the NWS Directive are used. As an exception, for event types that contain a slash (e.g. “Cold/Wind Chill”), observations whose event type is one of the constituent terms (e.g. “Cold” or “Wind Chill”) are also kept.
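One way to derive the constituent terms automatically, rather than typing them by hand, is to split the official names on the slash. The sketch below is an assumption about how this could be done, not the method used in this report; `official` holds only three of the 48 names and `raw` is made up.

```r
# A few official names for illustration (a subset of the 48 event types)
official <- c("Cold/Wind Chill", "Frost/Freeze", "Tornado")

# Split each name on "/" to get its constituent terms
parts <- unlist(strsplit(official, "/", fixed = TRUE))

# Accepted names: the official names plus their constituents, upper-cased
accepted <- unique(toupper(c(official, parts)))

# Hypothetical raw EVTYPE values to test against the accepted set
raw <- c("cold", " Wind Chill ", "TORNADO", "Summary of May 5")
keep <- toupper(trimws(raw)) %in% accepted
keep
```

Here the first three raw values are kept and the free-text summary entry is dropped.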
This weather event analysis uses the following variables:
| Variable | Data Type | Description |
|---|---|---|
| BGN_DATE | Date | Begin Date; used to subset for observations between 1996 and 2011. |
| EVTYPE | Categorical | Event Type: type of weather event for each observation |
| FATALITIES | Integer | Number of fatalities caused by event (population health) |
| INJURIES | Integer | Number of injuries caused by event (population health) |
| PROPDMG | Numeric ($) | Est. monetary value of damage to property in U.S. dollars |
| PROPDMGEXP | Categorical | Multiplier code for PROPDMG [K = 1,000; M = 1,000,000; B = 1,000,000,000]* |
| CROPDMG | Numeric ($) | Est. monetary value of damage to agricultural property in U.S. dollars |
| CROPDMGEXP | Categorical | Multiplier code for CROPDMG [K = 1,000; M = 1,000,000; B = 1,000,000,000]* |
Note: Multiplier values come from NWS Directive 10-1605, and estimated monetary values are in U.S. Dollars ($), rounded to three significant digits.
For reproducibility, a timestamp and the version of R used for the analysis are provided. Begin by setting the working directory, loading the required packages into R, and suppressing scientific notation.
message(sprintf("Run time: %s\nR version: %s", Sys.time(), R.Version()$version.string))
setwd("C:/Users/20292/Documents/Coursera Files/Reproducible Research/Project 2")
library(knitr); library(ggplot2); library(plyr); library(R.utils); library(gridExtra)
options(scipen = 1) # Bias printed output toward fixed (non-scientific) notation
After setting the working directory, the first steps in analysis are to load the dataset into R, look at the data structure, and to pre-process and scrub the raw data into a useable and clean dataset.
destfile <- "data/stormData.csv.bz2" # Relative to the working directory set above
if (!dir.exists("data")) dir.create("data") # Make sure the target folder exists
if (!file.exists(destfile)) { # Download storm data file from course website
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile, mode = "wb") # Binary mode keeps the archive intact on Windows
bunzip2(destfile, overwrite = TRUE, remove = FALSE)} # Writes data/stormData.csv
If the dataset is not already defined in the current workspace, load the data into the R environment as a data frame and look at the dimensions and variable names.
if (!exists("storm.raw")) { # Skip the slow read if the data is already loaded
storm.raw <- read.csv("data/stormData.csv", sep = ",")}
dim(storm.raw)
## [1] 902297 37
There are 902,297 observations (weather events) and 37 variables in the raw Storm dataset.
names(storm.raw) # Look at dimension and variable names
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The list above contains the names of the columns (variables) in the raw dataset. Note that only a fraction of these will be used in this report.
Using the NOAA documentation, subset the raw data to the desired variables, including only the beginning date and those variables related to event type, population health and economic damage. This reduces the dataset to eight variables.
columns.to.keep <- which(colnames(storm.raw) %in% c("BGN_DATE", "EVTYPE", "FATALITIES",
"INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
storm.data <- (storm.raw[, columns.to.keep])
The beginning date variable is converted to the Date class, and any observation before 1996 is removed, as explained above.
storm.data$BGN_DATE = as.Date(storm.data$BGN_DATE, "%m/%d/%Y %H:%M:%S")
storm <- storm.data[ which( as.numeric(format(storm.data$BGN_DATE, "%Y")) >= 1996), ]
c(min(storm$BGN_DATE), max(storm$BGN_DATE)) # Verify correct dates.
## [1] "1996-01-01" "2011-11-30"
Since the earliest date is January 1, 1996, the storm data now includes only observations recorded after standardization.
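The date filter can be checked on a few hand-written strings. The values in `raw.dates` are hypothetical and only mimic a month/day/year layout followed by a time; `as.Date()` ignores characters after the part matched by the format.

```r
# Hypothetical date strings in a "month/day/year time" layout
raw.dates <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00")

# Parse the date part; trailing characters beyond the format are ignored
parsed <- as.Date(raw.dates, format = "%m/%d/%Y")

# Keep only dates whose year is 1996 or later
year <- as.integer(format(parsed, "%Y"))
kept <- parsed[year >= 1996]
kept
```

Only the 1995 entry is dropped, which mirrors the cutoff applied to the storm data.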
Next, the data is subset to include only the 48 weather event types classified and defined by the NOAA. The storm_events vector contains the official list of weather event types defined in NWS Directive 10-1605, and event types that are part of one of the slash (/) classifications.
storm_events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold/Wind Chill", "Cold", "Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
"Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
"Extreme Cold", "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze", "Frost",
"Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
"High Wind", "Hurricane/Typhoon","Hurricane","Typhoon", "Ice Storm", "Lakeshore Flood",
"Lake-Effect Snow","Lightning", "Marine Hail","Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
To match the event types in the dataset to the list above, convert all characters in the EVTYPE column to upper case and remove leading and trailing white space for more consistency in the event types. Then reduce the data to only the defined weather event types and list the distinct event types that remain.
storm$EVTYPE <- toupper(storm$EVTYPE) # Convert to upper-case
storm$EVTYPE <- trimws(storm$EVTYPE, which = "both") # Remove leading/trailing white space
storm_events <- toupper(storm_events)
storm <- subset(storm, EVTYPE %in% storm_events)
sort(unique(storm$EVTYPE)) # List the distinct event types that remain
dim(storm)/dim(storm.raw) # Compare datasets
Comparing the cleaned data set to the raw data set, the number of observations after data cleaning is about 56% of the original, and the new data has only 22% of the variables.
Looking at the structure of the data, some conversion and calculation is necessary.
str(storm)
## 'data.frame': 505533 obs. of 8 variables:
## $ BGN_DATE : Date, format: "1996-01-06" "1996-01-11" ...
## $ EVTYPE : chr "WINTER STORM" "TORNADO" "HAIL" "HIGH WIND" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 1 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 3 0 0 ...
## $ PROPDMG : num 380 100 0 400 0 75 0 100 150 0 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 1 17 1 17 1 17 17 1 ...
## $ CROPDMG : num 38 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 1 1 1 1 1 1 1 1 1 ...
The property damage and crop damage exponent codes need to be converted from their alphabetic coding to numeric multipliers, using the following conversion function.
conversion <- function(x) { # Convert an exponent code to its numeric multiplier
x <- as.character(x) # Factor input would otherwise yield level codes in as.numeric()
if (x == "" | x == "-" | x == "+" | x == "?") {return(0)}
else if (x == "h" | x == "H") {return(10^2)}
else if (x == "k" | x == "K") {return(10^3)}
else if (x == "m" | x == "M") {return(10^6)}
else if (x == "b" | x == "B") {return(10^9)}
else {return(10^as.numeric(x))}} # Digit codes are treated as powers of ten
Finally, calculate variables containing the estimated amounts of property damage and crop damage for each weather event in U.S. dollars by multiplying each damage value by its multiplier.
storm$CropDamage <- storm$CROPDMG * sapply(storm$CROPDMGEXP, conversion)
storm$PropertyDamage <- storm$PROPDMG * sapply(storm$PROPDMGEXP, conversion)
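Calling a function once per row with `sapply()` can be slow on half a million observations. A vectorized lookup is one possible alternative; the sketch below is an assumption, not part of the original analysis, and the names `mult` and `to_multiplier` are hypothetical. It keeps the same convention as `conversion()`: blank and symbol codes map to 0.

```r
# Named lookup vector of multipliers; codes not listed here yield NA at first
mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

to_multiplier <- function(code) {
  code <- toupper(as.character(code))        # Works for factors and strings
  out <- unname(mult[code])                  # Letter codes via name lookup
  digit <- grepl("^[0-9]$", code)
  out[digit] <- 10^as.numeric(code[digit])   # Digit codes are powers of ten
  out[is.na(out)] <- 0                       # "", "-", "+", "?" map to 0
  out
}

to_multiplier(c("K", "m", "B", "", "?", "2"))
```

The whole column is converted in one call, for example `storm$PROPDMG * to_multiplier(storm$PROPDMGEXP)`.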
Since the data is clean and all the variables are formatted, we can calculate totals for event types. Using ddply() from the {plyr} package loaded earlier, the total number of injuries, the total number of fatalities, and the overall total of negative health consequences are calculated for each weather event type. Similarly, the total property damage, total crop damage, and overall total economic damage are calculated for each weather event type in the U.S.
health.full <- ddply(storm, .(EVTYPE), summarize, totalFatality = sum(FATALITIES),
totalInjury = sum(INJURIES))
health.full$totalHealthDamage = health.full$totalFatality + health.full$totalInjury
damage.full <- ddply(storm, .(EVTYPE), summarize, totalPropDmg = sum(PropertyDamage),
totalCropDmg = sum(CropDamage))
damage.full$totalEconDamage <- damage.full$totalPropDmg + damage.full$totalCropDmg
There is now data for each weather event type that summarizes the totals of the health consequences and economic damages.
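A quick sanity check for any aggregation step is that the per-type totals add back up to the grand total. The sketch below shows the idea on made-up numbers, using base R's `rowsum()` as a stand-in for the aggregation; the `toy` values are hypothetical.

```r
# Toy damage records (hypothetical values in U.S. dollars)
toy <- data.frame(
  EVTYPE         = c("FLOOD", "HAIL", "FLOOD", "HAIL", "DROUGHT"),
  PropertyDamage = c(5e6, 2e3, 1e9, 0, 1e4),
  CropDamage     = c(1e5, 5e3, 0, 0, 3e7)
)

# Sum both damage columns within each event type
per.type <- rowsum(toy[, c("PropertyDamage", "CropDamage")], group = toy$EVTYPE)
per.type$totalEconDamage <- per.type$PropertyDamage + per.type$CropDamage

# Grouping must not change the grand total
grand.total <- sum(toy$PropertyDamage + toy$CropDamage)
isTRUE(all.equal(sum(per.type$totalEconDamage), grand.total))
```

The same comparison could be run on `damage.full` against the `PropertyDamage` and `CropDamage` columns of `storm`.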
The analysis of the storm dataset consists mainly of ranking the weather event types by impact on health and the economy. Since the objective is to help policy-makers allocate resources efficiently, the 10 weather event types with the greatest impact will be examined.
To determine which events have the most damaging impacts, sort each summary in descending order by its overall total (combined fatalities and injuries for health; combined property and crop damage in U.S. dollars for the economy), rank the event types, and then select the top 10 for the final subsets to analyze.
health.order <- health.full[ order(-health.full[,4]), ] # Sort by column index
damage.order <- damage.full[ order(-damage.full[,4]), ]
health.order$rank <- seq(1, to = dim(health.order)[1], by = 1) # Rank event types
damage.order$rank <- seq(1, to = dim(damage.order)[1], by = 1)
health.top <- health.order[1:10,] # Subset top 10
damage.top <- damage.order[1:10,]
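Assigning ranks by row position gives tied totals different ranks. If ties matter, `rank()` is one alternative; the sketch below uses made-up totals and is not part of the original analysis.

```r
# Hypothetical totals per event type, with a tie between FLOOD and HAIL
totals <- c(TORNADO = 22, FLOOD = 6, HAIL = 6, DROUGHT = 0)

# Rank in descending order; tied values share the smallest applicable rank
ranks <- rank(-totals, ties.method = "min")
ranks
```

With `ties.method = "min"`, FLOOD and HAIL both receive rank 2 and DROUGHT receives rank 4.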
The ranked event types are now determined and we can use graphics to help understand the data.
The following bar charts are created with the {ggplot2} and {gridExtra} packages to show totals for health (fatalities and injuries) by event type. To save space, only the top five event types are shown. The next two bar charts show U.S. economic damage (property damage and crop damage) by weather event type. Note that the values in those two plots are in millions of U.S. dollars.
g1 <- ggplot(data=health.top[1:5,], aes(x=reorder(EVTYPE,-totalFatality),y=totalFatality,
fill=EVTYPE)) + geom_bar(stat = "identity") + labs(x = "Event Types", y = "") +
ggtitle("Number of Fatalities") +
theme_bw() + theme(axis.ticks = element_blank(), axis.text.x = element_blank())
g2 <- ggplot(data=health.top[1:5,], aes(x=reorder(EVTYPE,-totalInjury),y=totalInjury,
fill=EVTYPE) ) + geom_bar(stat = "identity") + labs(x = "Event Types", y = "" ) +
ggtitle("Number of Injuries") +
theme_bw() + theme(axis.ticks = element_blank(), axis.text.x = element_blank())
grid.arrange(g1, g2, ncol=1)
Figure 1 above shows that, among the five event types with the greatest total health impact, excessive heat events caused the most fatalities, while tornados caused the most injuries in the U.S.
x1 <- damage.top[1:5,] # Top five event types by total economic damage
x1$Prop <- x1$totalPropDmg/1000000; x1$Crop <- x1$totalCropDmg/1000000 # Convert to millions
g3 <- ggplot(data = x1, aes(x = reorder(EVTYPE,-totalPropDmg), y = Prop,
fill = EVTYPE)) + geom_bar(stat = "identity") + labs(x="Event Types", y="Dollars") +
ggtitle("Property Damage (in Millions)") +
theme_bw() + theme(axis.ticks = element_blank(), axis.text.x = element_blank())
g4 <- ggplot(data = x1, aes(x = reorder(EVTYPE,-totalCropDmg), y = Crop,
fill = EVTYPE)) + geom_bar(stat = "identity") + labs(x="Event Types", y="Dollars") +
ggtitle("Crop Damage (in Millions)") +
theme_bw() + theme(axis.ticks = element_blank(), axis.text.x = element_blank())
grid.arrange(g3, g4, ncol=1) # Draw both plots in a single column
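Figure 2 shows that, among the five event types plotted, floods caused the most property damage and crop damage. Floods alone account for almost half of the monetary damage of the top 10 most economically damaging weather event types in the U.S. Across all of the top 10, however, drought caused the most crop damage (see the economic ranking table below).
The weather event types that cause the most negative consequences to population health are ranked as follows: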
health.top
## EVTYPE totalFatality totalInjury totalHealthDamage rank
## 43 TORNADO 1511 20667 22178 1
## 12 EXCESSIVE HEAT 1797 6391 8188 2
## 16 FLOOD 414 6758 7172 3
## 33 LIGHTNING 651 4141 4792 4
## 15 FLASH FLOOD 887 1674 2561 5
## 42 THUNDERSTORM WIND 130 1400 1530 6
## 52 WINTER STORM 191 1292 1483 7
## 23 HEAT 237 1222 1459 8
## 29 HURRICANE/TYPHOON 64 1275 1339 9
## 27 HIGH WIND 235 1083 1318 10
The weather event types that cause the most damage to the U.S. economy are ranked as follows:
damage.top
## EVTYPE totalPropDmg totalCropDmg totalEconDamage rank
## 16 FLOOD 143944833550 4974778400 148919611950 1
## 29 HURRICANE/TYPHOON 69305840000 2607872800 71913712800 2
## 43 TORNADO 24616945710 283425010 24900370720 3
## 22 HAIL 14595143420 2476029450 17071172870 4
## 15 FLASH FLOOD 15222253910 1334901700 16557155610 5
## 28 HURRICANE 11812819010 2741410000 14554229010 6
## 9 DROUGHT 1046101000 13367566000 14413667000 7
## 45 TROPICAL STORM 7642475550 677711000 8320186550 8
## 27 HIGH WIND 5247860360 633561300 5881421660 9
## 50 WILDFIRE 4758667000 295472800 5054139800 10
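To see how concentrated the damage is, each event type's share of the top-10 total can be computed directly from the table above. The sketch below copies the `totalEconDamage` column by hand; the vector name `econ` is hypothetical.

```r
# totalEconDamage values copied from the damage.top output above (U.S. dollars)
econ <- c(
  "FLOOD" = 148919611950, "HURRICANE/TYPHOON" = 71913712800,
  "TORNADO" = 24900370720, "HAIL" = 17071172870,
  "FLASH FLOOD" = 16557155610, "HURRICANE" = 14554229010,
  "DROUGHT" = 14413667000, "TROPICAL STORM" = 8320186550,
  "HIGH WIND" = 5881421660, "WILDFIRE" = 5054139800
)

# Share of the top-10 total attributable to each event type, in percent
share <- 100 * econ / sum(econ)
round(share, 1)
```

Floods come out at roughly 45% of the top-10 total, and floods plus hurricanes/typhoons together account for about two thirds.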
The distribution of the data for the top 10 impacting event types can be depicted by pie charts.
Now, look at the overall damage from the weather events. There are 5 event types that are in the top 10 most harmful to human health and top 10 most damaging to the U.S. economy, listed below.
intersect(damage.top$EVTYPE, health.top$EVTYPE)
## [1] "FLOOD" "HURRICANE/TYPHOON" "TORNADO"
## [4] "FLASH FLOOD" "HIGH WIND"
Overall, floods, hurricanes/typhoons, tornados, flash floods, and high winds have the most costly effects on both population health and the economy. Specifically, excessive heat events caused the most fatalities, tornados caused the most injuries, floods caused the most property damage, and drought caused the most crop damage in the U.S. Thus, policy-makers should prioritize resources for these weather events.
From the analysis above, this report has determined which events have the most impact on U.S. health and the economy in order to inform policy on preventative measures and resource allocation against the most harmful and damaging weather events in the United States. The data is provided by the National Oceanic and Atmospheric Administration (NOAA) storm database, and the analysis uses data from the years 1996 to 2011 to produce rankings for each event type.
Across the United States, the weather event types most harmful with respect to population health, listed by decreasing impact, are: tornado, excessive heat, flood, lightning, flash flood, thunderstorm wind, winter storm, heat, hurricane / typhoon, and high wind.
Across the United States, the weather event types that have the greatest economic consequences, listed by decreasing damage, are: flood, hurricane/typhoon, tornado, hail, flash flood, hurricane, drought, tropical storm, high wind, and wildfire.
Therefore, the events listed above should receive priority in the allocation of resources, with the most resources for health going to excessive heat centers and tornado rescue, and the most economic damage prevention allocated to floods.
| Number | Reference Information |
|---|---|
| 1. | “Storm Events Database.” NOAA - National Centers for Environmental Information. Web. 13 May 2016. <*http://www.ncdc.noaa.gov/stormevents/*>. |
| 2. | “Storm Data Preparation.” National Weather Service Instruction 10-1605. 1-97. Department of Commerce - National Oceanic & Atmospheric Administration - National Weather Service, 17 Aug. 2007. Web. 13 May 2016. <*https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf*> and <*http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf*>. |
| 3. | “Storm Data FAQ Page.” NOAA Satellite and Information Service. National Climatic Data Center - U.S. Department of Commerce, 8 Aug. 2008. Web. 13 May 2016. <*https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf*>. |
The event types listed in this table come from NOAA’s Storm Events Database documentation [1]. The code column gives each event type's designator: C = County/Parish, Z = Zone, M = Marine.
| Event Name | Code | Event Name | Code | Event Name | Code |
|---|---|---|---|---|---|
| Astronomical Low Tide | Z | Funnel Cloud | C | Marine Thunderstorm Wind | M |
| Avalanche | Z | Freezing Fog | Z | Rip Current | Z |
| Blizzard | Z | Hail | C | Seiche | Z |
| Coastal Flood | Z | Heat | Z | Sleet | Z |
| Cold/Wind Chill | Z | Heavy Rain | C | Storm Surge/Tide | Z |
| Debris Flow | C | Heavy Snow | Z | Strong Wind | Z |
| Dense Fog | Z | High Surf | Z | Thunderstorm Wind | C |
| Dense Smoke | Z | High Wind | Z | Tornado | C |
| Drought | Z | Hurricane (Typhoon) | Z | Tropical Depression | Z |
| Dust Devil | C | Ice Storm | Z | Tropical Storm | Z |
| Dust Storm | Z | Lake-Effect Snow | Z | Tsunami | Z |
| Excessive Heat | Z | Lakeshore Flood | Z | Volcanic Ash | Z |
| Extreme Cold/Wind Chill | Z | Lightning | C | Waterspout | M |
| Flash Flood | C | Marine Hail | M | Wildfire | Z |
| Flood | C | Marine High Wind | M | Winter Storm | Z |
| Frost/Freeze | Z | Marine Strong Wind | M | Winter Weather | Z |