In this document, we use the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database to answer the two following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
As will be shown below, the top 3 types of events most harmful with respect to population health are tornados, excessive heat, and thunderstorms.
We also find that the top 3 types of events that have the greatest economic consequences are floods, hurricanes/typhoons, and tornados.
See the Results section for a longer, ordered list of the most harmful events.
The NOAA storm database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data come in the form of a compressed comma-separated-value file we last downloaded from this link on April 22, 2015.
Documentation regarding how the features of this database were constructed/defined can be found at:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011.
Data was loaded into R using the following code:
# Install and/or load out dependencies
if ("dplyr" %in% installed.packages()[,"Package"] == FALSE) install.packages("dplyr")
library("dplyr")
if ("ggplot2" %in% installed.packages()[,"Package"] == FALSE) install.packages("ggplot2")
library("ggplot2")
# Load data
remotezip <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
localzip <- "repdata-data-StormData.csv.bz2"
localcsv <- "repdata-data-StormData.csv"
if (!(file.exists(localcsv)) && !(file.exists(localzip))) downloadfile(remotezip, destfile = localzip, method = "curl")
if (!(file.exists(localcsv)) && file.exists(localzip)) bzfile(localzip, localcsv)
if (!(exists("StormDFRaw")) && file.exists(localcsv)) StormDFRaw <- read.csv(localcsv, stringsAsFactors=F)
After taking a quick look at the data, it is clear that the main issue that must be addressed before we can use the data in our anlaysis is to fix the lack of standardazation in describing event types.
The following example illustrates how a lack of systematic standardization in the compilation of this data has lead to the recording of similar/identical events using very different terms. The following example shows all the event types that amount to a form of “excessive heat”:
# List ways to describe excessive heat:
heat <- unique(StormDFRaw$EVTYPE[grep("WARM|HEAT|HYPERTHERMIA", StormDFRaw$EVTYPE, ignore.case=T)])
# Report number of ways to describe excessive heat in next sentence
numheatdesc <- length(heat)
heat
## [1] "HEAT" "RECORD WARMTH"
## [3] "EXTREME HEAT" "EXCESSIVE HEAT"
## [5] "RECORD HEAT" "HEAT WAVE"
## [7] "UNSEASONABLY WARM" "DROUGHT/EXCESSIVE HEAT"
## [9] "WARM DRY CONDITIONS" "RECORD HEAT WAVE"
## [11] "RECORD/EXCESSIVE HEAT" "HEAT WAVES"
## [13] "HEAT WAVE DROUGHT" "UNSEASONABLY WARM AND DRY"
## [15] "HEAT/DROUGHT" "HEAT DROUGHT"
## [17] "RECORD WARM TEMPS." "Heatburst"
## [19] "Record Heat" "Heat Wave"
## [21] "Record Warmth" "ABNORMAL WARMTH"
## [23] "UNUSUAL WARMTH" "UNUSUAL/RECORD WARMTH"
## [25] "UNSEASONABLY WARM YEAR" "HYPERTHERMIA/EXPOSURE"
## [27] "RECORD WARM" "UNSEASONABLY WARM/WET"
## [29] "UNUSUALLY WARM" "WARM WEATHER"
## [31] "UNSEASONABLY WARM & WET" "PROLONG WARMTH"
## [33] "EXCESSIVE HEAT/DROUGHT" "VERY WARM"
What amounts to “Excessive Heat” is described using a total of 34 different terms!
This is true of many other categories of events in this dataset. Hence, we must process the data in a uniform and consistent way to combine related event type descriptions. However, it makes little sense to do this over the entire original dataset. Indeed, we’re only interested in measuring casualties and damages associated with each meteorological event, so we first remove any observation where there are no casualties (fatalities and injuries) or no economical damages (property damages and crop damages) and only keep those variables relevant to our analysis using the following code:
# Subset data and report before/after num observations in next sentence
obervationsbefore <- nrow(StormDFRaw)
StormDF <- subset(StormDFRaw, INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
observationsafter <- nrow(StormDF)
# Free up memory of unneeded objects
rm(StormDFRaw)
The number of observations has been reduced from 902297 to 254633
Now we can combine related event type descriptions the following cleaning code:
# Cleanup DF and report before/after num evt types in next sentence
eventypesbefore <- length(unique(StormDF$EVTYPE))
StormDF$EVTYPE <- toupper(StormDF$EVTYPE)
StormDF$EVTYPE[grep("AVALANC", StormDF$EVTYPE, ignore.case=T)] <- "AVALANCHE"
StormDF$EVTYPE[grep("EROSION", StormDF$EVTYPE, ignore.case=T)] <- "BEACH/COASTAL EROSION"
StormDF$EVTYPE[grep("COASTAL FL", StormDF$EVTYPE, ignore.case=T)] <- "COASTAL FLOODING"
StormDF$EVTYPE[grep("COASTALST", StormDF$EVTYPE, ignore.case=T)] <- "COASTAL STORM"
StormDF$EVTYPE[grep("COLD|HYPOTHERMIA|LOW TEMP|WIND CHILL|WINDCHILL", StormDF$EVTYPE, ignore.case=T)] <- "COLD"
StormDF$EVTYPE[grep("DROUGHT", StormDF$EVTYPE, ignore.case=T)] <- "DROUGHT"
StormDF$EVTYPE[grep("DRY MI", StormDF$EVTYPE, ignore.case=T)] <- "DRY MICROBURST"
StormDF$EVTYPE[grep("DUST", StormDF$EVTYPE, ignore.case=T)] <- "DUST STORM/DEVIL"
StormDF$EVTYPE[grep("FLASH|RISING WATER", StormDF$EVTYPE, ignore.case=T)] <- "FLASH FLOOD"
StormDF$EVTYPE[grep("FLOODING|FLOODS|LAKE FLOOD|LAKESHORE FLOOD|MAJOR FLOOD|RIVER FLOOD|STREAM FL|URBAN", StormDF$EVTYPE, ignore.case=T)] <- "FLOOD"
StormDF$EVTYPE[grep("FOG", StormDF$EVTYPE, ignore.case=T)] <- "FOG"
StormDF$EVTYPE[grep("FREEZ", StormDF$EVTYPE, ignore.case=T)] <- "FREEZING RAIN"
StormDF$EVTYPE[grep("ICE STORM", StormDF$EVTYPE, ignore.case=T)] <- "GLAZE"
StormDF$EVTYPE[grep("GUSTY|HIGH WIND|HIGH WINDS|MICROBURST|STRONG WIND", StormDF$EVTYPE, ignore.case=T)] <- "GUSTY/HIGH/STRONG WIND"
StormDF$EVTYPE[grep("HAIL", StormDF$EVTYPE, ignore.case=T)] <- "HAIL"
StormDF$EVTYPE[grep("WARM|HEAT|HYPERTHERMIA", StormDF$EVTYPE, ignore.case=T)] <- "HEAT"
StormDF$EVTYPE[grep("HEAVY RAIN|HEAVY PRECIPITATION|HEAVY SHOWER|HVY RAIN|EXCESSIVE RAINFALL|EXCESSIVE WETNESS|RAINSTORM|RECORD RAINFALL|TORRENTIAL|UNSEASONAL RAIN", StormDF$EVTYPE, ignore.case=T)] <- "HEAVY RAIN"
StormDF$EVTYPE[grep("SURF|COASTAL SURGE", StormDF$EVTYPE, ignore.case=T)] <- "HEAVY/HIGH SURF"
StormDF$EVTYPE[grep("HEAVY SEAS|HIGH SEAS|HIGH SWELLS|HIGH WATER|HIGH WAVES|ROGUE WAVE|ROUGH SEAS", StormDF$EVTYPE, ignore.case=T)] <- "HIGH/ROUGH SEAS"
StormDF$EVTYPE[grep("TIDE", StormDF$EVTYPE, ignore.case=T)] <- "HIGH/LOW TIDE"
StormDF$EVTYPE[grep("HURRICANE|TYPHOON", StormDF$EVTYPE, ignore.case=T)] <- "HURRICANE/TYPHOON"
StormDF$EVTYPE[grep("ICE|ICY|FROST", StormDF$EVTYPE, ignore.case=T)] <- "ICE"
StormDF$EVTYPE[grep("LANDSLIDE|LANDSLUMP", StormDF$EVTYPE, ignore.case=T)] <- "LANDSLIDE"
StormDF$EVTYPE[grep("LIGHTING|LIGHTNING|LIGNTNING", StormDF$EVTYPE, ignore.case=T)] <- "LIGHTNING"
StormDF$EVTYPE[grep("MARINE", StormDF$EVTYPE, ignore.case=T)] <- "MARINE MISHAP"
StormDF$EVTYPE[grep("MUDSLIDE|MUD SLIDE", StormDF$EVTYPE, ignore.case=T)] <- "MUDSLIDE"
StormDF$EVTYPE[grep("RIP CU", StormDF$EVTYPE, ignore.case=T)] <- "RIP CURRENT"
StormDF$EVTYPE[grep("SNOW", StormDF$EVTYPE, ignore.case=T)] <- "SNOW"
StormDF$EVTYPE[grep("THUNDER|TSTM|THUDERSTORM|THUNDEERSTORM|THUNERSTORM|TUNDERSTORM", StormDF$EVTYPE, ignore.case=T)] <- "THUNDERSTORM"
StormDF$EVTYPE[grep("TORNADO|TORNDAO", StormDF$EVTYPE, ignore.case=T)] <- "TORNADO"
StormDF$EVTYPE[grep("TROPICAL S", StormDF$EVTYPE, ignore.case=T)] <- "TROPICAL STORM"
StormDF$EVTYPE[grep("WATERSPOUT", StormDF$EVTYPE, ignore.case=T)] <- "WATERSPOUT"
StormDF$EVTYPE[grep("BRUSH FIRE|FOREST FIRES|WILD FIRE|WILDFIRE", StormDF$EVTYPE, ignore.case=T)] <- "WILD/FOREST FIRE"
StormDF$EVTYPE[grep("SLEET|WINTER STORM|MIXED PRECIP|WINTER WEATHER|WINTRY MIX", StormDF$EVTYPE, ignore.case=T)] <- "WINTER WEATHER/MIX"
eventtypesafter <- length(unique(StormDF$EVTYPE))
This reduces the number of unique descriptors for event types from 488 to 69
To measure casualties by event type, we simply add the number or injuries and fatalities by event type and order them in descending order:
TotalCasualties <- summarize (group_by(StormDF, EVTYPE), casualties = sum (INJURIES + FATALITIES))
TotalCasualties <- arrange(TotalCasualties, desc(casualties))
Top10Casualties <- TotalCasualties[1:10,]
# Free up memory of unneeded objects
rm(TotalCasualties)
Top10Casualties$EVTYPE <- factor(Top10Casualties$EVTYPE, levels=Top10Casualties$EVTYPE)
print(Top10Casualties)
## Source: local data frame [10 x 2]
##
## EVTYPE casualties
## 1 TORNADO 97043
## 2 HEAT 12401
## 3 THUNDERSTORM 10119
## 4 FLOOD 7393
## 5 LIGHTNING 6049
## 6 FLASH FLOOD 2838
## 7 GUSTY/HIGH/STRONG WIND 2315
## 8 GLAZE 2302
## 9 WINTER WEATHER/MIX 2261
## 10 WILD/FOREST FIRE 1698
We need to perform additional data processing to accurately report total damages. First, the dataset report damages to property (PROPDMG and PROPDMGEXP) and crops (CROPDMG and CROPDMGEXP) saperately, so we will need to sum them together to get total damages. Second, per the NOAA website documentation: “Damages should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e.,1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include”K" for thousands, “M” for millions, and “B” for billions."
Based on the above, we would expect to see in PROPDMGEXP and CROPDMGEXP the following values: “K” or “k” for thousands, “M” or “m” for millions, “B” or “b” for billions, or nothing if the numbers PROPDMG or CROPDMG are to be read as-is. But a quick look at the data suggests there are other values for these variables in the dataset:
unique(c(unique(StormDF$PROPDMGEXP), unique(StormDF$CROPDMGEXP)))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-" "?"
## [18] "k"
For each “exponent” above whose meaning we cannot ascertain, we assume the PROPDMG and CROPDMG variable represent damages in actual dollars and the PROPDMGEXP and CROPDMGEXP are to be ignored. We can now generate two new variables to account for property damages and crop damages in actual dollars:
StormDF$PROPDMGEXP <- toupper(StormDF$PROPDMGEXP)
StormDF$PROPDAMAGES <- StormDF$PROPDMG
StormDF$PROPDAMAGES <- StormDF$PROPDMG * 10^3 * as.numeric(StormDF$PROPDMGEXP == "K") + StormDF$PROPDMG * 10^6 * as.numeric(StormDF$PROPDMGEXP == "M") + StormDF$PROPDMG * 10^9 * as.numeric(StormDF$PROPDMGEXP == "B")
StormDF$CROPDMGEXP <- toupper(StormDF$CROPDMGEXP)
StormDF$CROPDAMAGES <- StormDF$CROPDMG
StormDF$CROPDAMAGES <- StormDF$CROPDMG * 10^3 * as.numeric(StormDF$CROPDMGEXP == "K") + StormDF$CROPDMG * 10^6 * as.numeric(StormDF$CROPDMGEXP == "M") + StormDF$CROPDMG * 10^9 * as.numeric(StormDF$CROPDMGEXP == "B")
To measure total damages (crop damages + property damages) by event type, we simply add crop damages and property damages by event type and order them in descending order:
TotalDamages <- summarize (group_by(StormDF, EVTYPE), damages = sum (PROPDAMAGES + CROPDAMAGES))
TotalDamages <- arrange(TotalDamages, desc(damages))
Top10Damages <- TotalDamages[1:10,]
rm(TotalDamages)
Top10Damages$EVTYPE <- factor(Top10Damages$EVTYPE, levels=Top10Damages$EVTYPE)
print(Top10Damages)
## Source: local data frame [10 x 2]
##
## EVTYPE damages
## 1 FLOOD 1.6e+11
## 2 HURRICANE/TYPHOON 9.1e+10
## 3 TORNADO 5.7e+10
## 4 STORM SURGE 4.3e+10
## 5 HAIL 2.1e+10
## 6 FLASH FLOOD 1.8e+10
## 7 DROUGHT 1.5e+10
## 8 THUNDERSTORM 1.2e+10
## 9 GLAZE 9.0e+09
## 10 WILD/FOREST FIRE 8.9e+09
The following figure shows the top 10 types of events that are the most harmful with respect to the overall US population health (injuries + fatalities). Most harmful events are presented in decreasing order of influence, from left to right:
casplot <- ggplot(Top10Casualties, aes(EVTYPE, casualties))
casplot <- casplot + geom_bar(color="black", fill="#FF0000", width=0.95, stat="identity")
casplot <- casplot + labs(x="Weather Event Type") + labs(y="Total Casualties") + labs(title="Total US Casualties per Type of Weather Event (1950-2011)")
casplot <- casplot + theme(axis.text.x = element_text(angle = 15))
print(casplot)
This next figure shows the top 10 types of events that have the largest economic impact on the US economy. Most impactful events are presented in decreasing order of influence, from left to right:
damplot <- ggplot(Top10Damages, aes(EVTYPE, damages))
damplot <- damplot + geom_bar(color="black", fill="#FF0000", width=0.95, stat="identity")
damplot <- damplot + labs(x="Weather Event Type") + labs(y="Total Damages (in $)") + labs(title="Total US Damages per Type of Weather Event (1950-2011)")
damplot <- damplot + theme(axis.text.x = element_text(angle = 15))
print(damplot)