In this report, we use the NOAA Storm Database to evaluate which kinds of events are most harmful to population health and have the greatest economic consequences. The data set covers the periods of 1950 to 2011. We measure human health impacts by the number of fatalities and injuries, and economic consequences based on estimated property and crop damage. We find that tornadoes have been, by far, the most harmful to human health, and that flooding has been associated with the most economic harm. As an additional analysis, we apply the GDP Deflator to evaluate economic consequences in real 2020 dollars. In 2020 dollars, flooding remains the top event in terms of economic harm, but the total damages associated with tornadoes increases significantly and moves into second place ahead of hurricanes/typhoons. This report was prepared for Course Project 2 in the Coursera “Reproducible Research” course.
We obtained our data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (link). We begin by downloading the data.
download.file(url=
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="StormData.csv.bz2", method="curl")
Next, we read the data into the stormData data frame for analysis and look at the number of observations and variables. Note that the data set includes headers, which are read in by default.
stormData<-read.csv("StormData.csv.bz2")
dim(stormData)
## [1] 902297 37
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To work with the data, we will be using the dplyr, ggplot2, and tidyr packages, so we will load them now.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
We are interested in the following variables for our analysis: EVTYPE: event type FATALITIES: number of fatalities INJURIES: number of injuries PROPDMG: property damage value PROPDMGEXP: property damage exponent CROPDMG: crop damage value CROPDMGEXP: crop damage exponent
Looking at the EVTYPE column, we can see that there are 985 event types.
length(unique(stormData$EVTYPE))
## [1] 985
There are a lot of codes, and a lot of opportunity to further summarize the data, but for now we will simply trim the leading and trailing white spaces, to eliminate any duplicates due to spaces.
stormData$EVTYPE<-trimws(stormData$EVTYPE)
This, in fact, reduced the number of event types to 977.
length(unique(stormData$EVTYPE))
## [1] 977
We create a new variable to contain the total number of people injured or killed.
stormData$TOTALPEOPLE<-stormData$FATALITIES+stormData$INJURIES
We create new variables to contain the property and crop damage estimates to incorporate both the value and exponent. We begin by examining the damage exponents to ensure they are clean data.
Property Damage:
unique(stormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
Crop Damage:
unique(stormData$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
In addition to K for thousands, M for millions, etc., there are other characters that appear to be in error, as well as numeric entries that do not require additional processing. We applied corrections to convert the alphabetical exponents into numerical powers of ten, to remain consistent with the column names.
stormData$PROPDMGEXP[stormData$PROPDMGEXP %in% c("B")] <- "9"
stormData$PROPDMGEXP[stormData$PROPDMGEXP %in% c("m", "M")] <- "6"
stormData$PROPDMGEXP[stormData$PROPDMGEXP %in% c("K")] <- "3"
stormData$PROPDMGEXP[stormData$PROPDMGEXP %in% c("h", "H")] <- "2"
stormData$PROPDMGEXP[stormData$PROPDMGEXP %in% c("", "+", "-", "?")] <- "0"
stormData$CROPDMGEXP[stormData$CROPDMGEXP %in% c("B")] <- "9"
stormData$CROPDMGEXP[stormData$CROPDMGEXP %in% c("m","M")] <- "6"
stormData$CROPDMGEXP[stormData$CROPDMGEXP %in% c("k","K")] <- "3"
stormData$CROPDMGEXP[stormData$CROPDMGEXP %in% c("","?")] <- "0"
We can now verify that the columns use consistent values equal to powers of ten.
Property Damage:
unique(stormData$PROPDMGEXP)
## [1] "3" "6" "0" "9" "5" "4" "2" "7" "1" "8"
Crop Damage:
unique(stormData$CROPDMGEXP)
## [1] "0" "6" "3" "9" "2"
We next create new variables to contain the full values of damage estimates, as well as a new total damage estimate.
PROPDMGEST: estimated property damage in dollars (nominal) CROPDMGEST: estimated crop damage in dollars (nominal) TOTLDMGEST: total estimated damage in dollars (nominal)
stormData$PROPDMGEST<-stormData$PROPDMG*10^as.numeric(stormData$PROPDMGEXP)
stormData$CROPDMGEST<-stormData$CROPDMG*10^as.numeric(stormData$CROPDMGEXP)
stormData$TOTLDMGEST<-stormData$PROPDMGEST+stormData$CROPDMGEST
Having prepared our data set, we now address the two questions posed by this analysis:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To evaluate the events most harmful to human health, we will look at injuries and fatalities individually and in total. After creating peopleSummary to summarize the information, we’ll convert it to a tidy data set.
peopleSummary<-stormData %>% group_by(EVTYPE) %>%
summarize(INJURIES=sum(INJURIES), FATALITIES=sum(FATALITIES),
TOTALPEOPLE=sum(TOTALPEOPLE)) %>% arrange(desc(TOTALPEOPLE))
Looking at the top 5 events, we see that TORNADO is by far the largest cause of injury and death (9.1346^{4} injuries and 5633 deaths), followed by EXCESSIVE HEAT with 6525 injuries and 1903 deaths.
peopleSummary[1:5,]
Now let’s tidy up the data (for our plot), subset to the top 5 event types, and plot it.
peopleSummaryTidy<-pivot_longer(peopleSummary, cols=2:3, names_to="TYPE",
values_to="COUNT")
peopleSubsetTidy<-peopleSummaryTidy[1:10,c(1,3:4)]
ggplot(peopleSubsetTidy, aes(x=EVTYPE, y=COUNT)) +
geom_col() +
facet_grid(TYPE~., scales="free") +
labs(x="Event Type", y="Number of Injuries or Fatalities",
title="Events Most Harmful to Human Health")
Previously, we had calculated damage estimates that combined the estimated values and the exponents. Now we will repeat the analysis we just did for human health, but focusing on economic damages.
We will look at property and crop damages individually and in total. After creating damageSummary to summarize the information, we’ll convert it to a tidy data set.
damageSummary<-stormData %>% group_by(EVTYPE) %>%
summarize(PROPDMGEST=sum(PROPDMGEST), CROPDMGEST=sum(CROPDMGEST),
TOTLDMGEST=sum(TOTLDMGEST)) %>% arrange(desc(TOTLDMGEST))
Looking at the top 5 events, we see that FLOOD is the largest cause of damage (1.4465771^{11} of property damage and 5.6619684^{9} of crops damage), followed by HURRICANE/TYPHOON with 6.930584^{10} (property) and 2.6078728^{9} crops.
damageSummary[1:5,]
Now let’s tidy up the data (for our plot), subset to the top 5 event types, and plot it.
damageSummary<-rename(damageSummary, PROPERTY=PROPDMGEST, CROP=CROPDMGEST, TOTAL=TOTLDMGEST)
damageSummaryTidy<-pivot_longer(damageSummary, cols=2:3, names_to="TYPE", values_to="AMOUNT")
damageSubsetTidy<-damageSummaryTidy[1:10,c(1,3:4)]
ggplot(damageSubsetTidy, aes(x=EVTYPE, y=AMOUNT)) +
geom_col() +
facet_grid(TYPE~., scales="free") +
labs(x="Event Type", y="Estimated Damage (Dollars)",
title="Events with Greatest Economic Consequences")
When working with cost data over a long period (61 years in this data set), it is important to adjust dollar values into real dollars for comparison. It appears that damage estimates are in nominal dollars (indicating the estimate is in the dollars of the year in which the event’s damage was estimated). For this additional investigation, we use the GDP Deflator (link) from the St. Louis Fed to adjust all estimates to 2020 dollars.
First, we need to download the data. [Note: for publication purposes, the URL is split. The code is not evaluated since it does not work with lines added.]
download.file(url="https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=
%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=
%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=
%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=
yes&show_axis_titles=yes&show_tooltip=yes&id=GDPDEF&scale=
left&cosd=1947-01-01&coed=2021-04-01&line_color=
%234572a7&link_values=false&line_style=solid&mark_type=none&mw=
3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Quarterly&fam=
avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=
lin&vintage_date=2021-09-09&revision_date=2021-09-09&nd=
1947-01-01", destfile="GDPDEF.csv", method="curl")
We then read it into the gdpDeflator data frame.
gdpDeflator<-read.csv("GDPDEF.csv")
Let’s look at the data.
head(gdpDeflator)
The data set contains quarterly deflators. To adjust to 2020 dollars, we will first convert the DATE column to a Date value and then create a number column with the year.
gdpDeflator$DATE<-as.Date(gdpDeflator$DATE, "%Y-%m-%d")
gdpDeflator$YEAR<-format(gdpDeflator$DATE,"%Y")
Next, we calculate the annual average for each year into a new data set.
gdpAnnDef<-gdpDeflator %>% group_by(YEAR) %>% summarize(ANNDEF=mean(GDPDEF))
Finally, as an approximation, we will assume that all events are adjusted based on the average deflator for each calendar year. We’ll create a new ANNMULT variable to generate the multiplier for each event’s damage estimates to bring them to 2020 dollars.
gdpAnnDef$ANNMULT<-gdpAnnDef$ANNDEF[gdpAnnDef$YEAR==2020]/gdpAnnDef$ANNDEF
This seems to be a reasonable assumption since we know the date of the event but we don’t know when the total damage estimate was reported. Therefore, while we could pinpoint the quarter in which the event occurred, there’s no telling what year the estimate was finalized. We’ll accept this uncertainty for purposes of the analysis.
Let’s look at the first and last six rows of this new data set. As we can see, the annual multiplier starts out at 9.2704875 for 1947 and declines to 0.9751311 in 2021.
head(gdpAnnDef)
tail(gdpAnnDef)
We can now apply the annual multiplier to the property and crop damage estimates and create new 2020 values. We’ll also calculate the total again.
stormData$PROPDMGEST2020<-stormData$PROPDMGEST *
gdpAnnDef$ANNMULT[match(format(as.Date(stormData$BGN_DATE, "%m/%d/%Y"),"%Y"),
gdpAnnDef$YEAR)]
stormData$CROPDMGEST2020<-stormData$CROPDMGEST *
gdpAnnDef$ANNMULT[match(format(as.Date(stormData$BGN_DATE, "%m/%d/%Y"),"%Y"),
gdpAnnDef$YEAR)]
stormData$TOTLDMGEST2020<-stormData$PROPDMGEST2020+stormData$CROPDMGEST2020
Using the same steps as before, we will look at property and crop damages in total. After creating damageSummary2020 to summarize the information, we’ll convert it to a tidy data set.
damageSummary2020<-stormData %>% group_by(EVTYPE) %>%
summarize(PROPDMGEST2020=sum(PROPDMGEST2020),
CROPDMGEST2020=sum(CROPDMGEST2020),
TOTLDMGEST2020=sum(TOTLDMGEST2020)) %>%
arrange(desc(TOTLDMGEST2020))
Looking at the top 5 events, we see that FLOOD is the largest cause of damage (1.8421909^{11} of property damage and 7.5674877^{9} of crops damage), followed by TORNADO with 1.3834928^{11} (property) and 5.8443117^{8} crops.
damageSummary2020[1:5,]
Now let’s tidy up the data (for our plot), subset to the top 5 event types, and plot it.
damageSummary2020<-rename(damageSummary2020, PROPERTY=PROPDMGEST2020,
CROP=CROPDMGEST2020, TOTAL=TOTLDMGEST2020)
damageSummaryTidy2020<-pivot_longer(damageSummary2020, cols=2:3,
names_to="TYPE", values_to="AMOUNT")
damageSubsetTidy2020<-damageSummaryTidy2020[1:10,c(1,3:4)]
ggplot(damageSubsetTidy2020, aes(x=EVTYPE, y=AMOUNT)) +
geom_col() +
facet_grid(TYPE~., scales="free") +
labs(x="Event Type", y="Estimated Damage (in 2020 Dollars)",
title="Events with Greatest Economic Consequences (2020 Dollars)")
A future analysis could attempt to group event types in order to reduce the number of types from 977 down to something more conducive to a high-level analysis.
For example, it looks like plurals, abbreviations, and combined codes could be splitting the data more than is needed.
This is a topic for further study.