In this report we study the severe weather events across the United States between 1996 and 2011, using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. More specifically, we want to find out which types of events are the most harmful with respect to population health, and which have the greatest economic consequences. The data between the years 1996 and 2011 are used, excluding the earlier years data, because recent years data are more complete. Our analysis found that the most harmful event types are Tornado, Excessive Heat, and Flood. On the other hand, event types having the greatest economic consequences are Flood, Hurricane, and Storm Surge/Tide.
First we manually download the data from the link provided in the course website, save it in the working directory and unzip it. Then we read the raw csv.bz2 file where the missing data are indicated by empty strings.
df <- read.csv("repdata-data-StormData.csv.bz2", na.strings = "")
First we plot an exploratory histogram about the years of the events, from 1950 to 2011.
library(lubridate)
df_year <- year(mdy_hms(df$BGN_DATE))
hist(df_year, xlab="Year", main="Histogram of Event Years")
We see that in the earlier years there are much fewer events recorded, which is probably due to a lack of good records. On the other hand, recent years should be considered more complete. Therefore for a fair comparison among the different types of events, and also for speeding up our data cleaning process, we make a decision of using only the recent years data. The cutoff year we choose is 1995.
library(dplyr)
df <- df %>% filter(year(mdy_hms(BGN_DATE)) > 1995)
Notice that since we are only interested in finding out which types of events are the most harmful and have the greatest economic consequences, discarding the fewer, earlier years data would have little effect on our conclusion.
Once again that we are only interested in finding out which types of events are the most harmful and have the greatest economic consequences, hence we only keep data with fatalities/injuries, or with property/crop damages of at least one million dollars. Since there are data with property/crop damages of billions of dollars, damages of less than one million dollars (less than 0.1 percent of the highest damage) are not significant to our anaylsis.
## Creating a new function to convert the exponential symbols to numbers.
exp_convert <- function(x) {
if (x[1] == "B") {
1000000000
} else if (x[1] == "M") {
1000000
} else {
0
}
}
## Setting NA to zero
df[is.na(df$PROPDMGEXP),]$PROPDMGEXP <- 0
df[is.na(df$CROPDMGEXP),]$CROPDMGEXP <- 0
## Computing the actual damage by multiplying the DMG columns
## with the corresponding converted exponential numbers
df <- df %>% group_by(PROPDMGEXP) %>%
mutate(PROPDMGEXPCON = exp_convert(PROPDMGEXP)) %>%
mutate(PROPDMGMULTIPLIED = PROPDMG * PROPDMGEXPCON)
df <- df %>% group_by(CROPDMGEXP) %>%
mutate(CROPDMGEXPCON = exp_convert(CROPDMGEXP)) %>%
mutate(CROPDMGMULTIPLIED = CROPDMG * CROPDMGEXPCON)
## Filtering
df <- df %>% filter(FATALITIES + INJURIES > 0 |
PROPDMGMULTIPLIED >= 1000000 |
CROPDMGMULTIPLIED >= 1000000)
The next step of our data processing is cleaning the \(EVTYPE\) variable (which indicates the type of events). Our goal in this section is to match the data in \(ENTYPE\) with the event names listed in Table 1 of Section 2.1.1 of the National Weather Service Storm Data Documentation.
The following is a list of raw data entries in \(EVTYPE\):
sort(unique(df$EVTYPE))
## [1] TSTM WIND AGRICULTURAL FREEZE
## [3] ASTRONOMICAL HIGH TIDE AVALANCHE
## [5] BLACK ICE BLIZZARD
## [7] blowing snow BRUSH FIRE
## [9] COASTAL FLOODING/EROSION Coastal Flood
## [11] COASTAL FLOOD Coastal Flooding
## [13] COASTAL FLOODING COASTAL FLOODING/EROSION
## [15] Coastal Storm COASTAL STORM
## [17] COASTALSTORM Cold
## [19] COLD COLD AND SNOW
## [21] Cold Temperature COLD WEATHER
## [23] COLD/WIND CHILL DAM BREAK
## [25] Damaging Freeze DAMAGING FREEZE
## [27] DENSE FOG DROUGHT
## [29] DROWNING DRY MICROBURST
## [31] Dust Devil DUST DEVIL
## [33] DUST STORM Early Frost
## [35] Erosion/Cstl Flood EXCESSIVE HEAT
## [37] EXCESSIVE SNOW Extended Cold
## [39] Extreme Cold EXTREME COLD
## [41] EXTREME COLD/WIND CHILL EXTREME WINDCHILL
## [43] FALLING SNOW/ICE FLASH FLOOD
## [45] FLOOD FOG
## [47] Freeze FREEZE
## [49] FREEZING DRIZZLE FREEZING FOG
## [51] FREEZING RAIN Freezing Spray
## [53] FROST Frost/Freeze
## [55] FROST/FREEZE FUNNEL CLOUD
## [57] GLAZE GUSTY WIND
## [59] Gusty winds Gusty Winds
## [61] GUSTY WINDS HAIL
## [63] HARD FREEZE HAZARDOUS SURF
## [65] HEAT Heat Wave
## [67] HEAVY RAIN Heavy Rain/High Surf
## [69] HEAVY SEAS HEAVY SNOW
## [71] Heavy snow shower Heavy Surf
## [73] HEAVY SURF Heavy surf and wind
## [75] HEAVY SURF/HIGH SURF HIGH SEAS
## [77] High Surf HIGH SURF
## [79] HIGH SWELLS HIGH WATER
## [81] HIGH WIND HURRICANE
## [83] Hurricane Edouard HURRICANE/TYPHOON
## [85] HYPERTHERMIA/EXPOSURE Hypothermia/Exposure
## [87] HYPOTHERMIA/EXPOSURE ICE ON ROAD
## [89] ICE ROADS ICE STORM
## [91] ICY ROADS LAKE-EFFECT SNOW
## [93] LAKESHORE FLOOD LANDSLIDE
## [95] LANDSLIDES LIGHT SNOW
## [97] LIGHTNING Marine Accident
## [99] MARINE HIGH WIND MARINE STRONG WIND
## [101] MARINE THUNDERSTORM WIND MARINE TSTM WIND
## [103] MIXED PRECIP Mudslide
## [105] MUDSLIDE Mudslides
## [107] NON TSTM WIND NON-SEVERE WIND DAMAGE
## [109] OTHER RAIN/SNOW
## [111] RECORD HEAT RIP CURRENT
## [113] RIP CURRENTS RIVER FLOOD
## [115] River Flooding RIVER FLOODING
## [117] ROGUE WAVE ROUGH SEAS
## [119] ROUGH SURF SMALL HAIL
## [121] Snow SNOW
## [123] SNOW AND ICE SNOW SQUALL
## [125] Snow Squalls STORM SURGE
## [127] STORM SURGE/TIDE STRONG WIND
## [129] Strong Winds STRONG WINDS
## [131] THUNDERSTORM THUNDERSTORM WIND
## [133] THUNDERSTORM WIND (G40) TIDAL FLOODING
## [135] TORNADO Torrential Rainfall
## [137] TROPICAL DEPRESSION TROPICAL STORM
## [139] TSTM WIND TSTM WIND (G35)
## [141] TSTM WIND (G40) TSTM WIND (G45)
## [143] TSTM WIND/HAIL TSUNAMI
## [145] TYPHOON Unseasonable Cold
## [147] UNSEASONABLY COLD UNSEASONABLY WARM
## [149] UNSEASONAL RAIN URBAN/SML STREAM FLD
## [151] WARM WEATHER WATERSPOUT
## [153] Whirlwind WILD/FOREST FIRE
## [155] WILDFIRE WIND
## [157] WIND AND WAVE WINDS
## [159] WINTER STORM WINTER WEATHER
## [161] WINTER WEATHER MIX WINTER WEATHER/MIX
## [163] WINTRY MIX
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
The data corresponding to “Other” and “Record Heat” are not needed, so we take them out.
df <- df %>% filter(!grepl("other|record", EVTYPE, ignore.case=TRUE))
Next we will try to go along the list and relabel the entries.
The first relabelling action we do is to translate all the characters into upper case.
df <- df %>% mutate(EVTYPE = toupper(EVTYPE))
We group together some data by using regular expressions on some keywords in the event type names. Note that this is done only for event types with a unique keyword in their names. On the other hand, event types like Flood, Flash Flood, etc have to be dealt with separately.
df$EVTYPE[grepl("FROST|FREEZE", df$EVTYPE)] <- "FROST/FREEZE"
df$EVTYPE[grepl("FIRE", df$EVTYPE)] <- "WILDFIRE"
df$EVTYPE[grepl("HURRICANE|TYPHOON", df$EVTYPE)] <- "HURRICANE"
df$EVTYPE[grepl("WINTER WEATHER|WINTRY", df$EVTYPE)] <- "WINTER WEATHER"
df$EVTYPE[grepl("RIP CURRENT", df$EVTYPE)] <- "RIP CURRENT"
We recongnize that “TSTM” is an abbreviation of Thunderstorm, and we use regular expression to group all the thunderstorm data together (but not Marine Thunderstorm Wind). Same process is done for Marine Thunderstorm Wind.
df$EVTYPE[grepl("^ *TSTM|^THUNDERSTORM", df$EVTYPE)] <- "THUNDERSTORM WIND"
df$EVTYPE[grepl("MARINE TSTM WIND", df$EVTYPE)] <- "MARINE THUNDERSTORM WIND"
We realize that high tide and storm surge falls under the category of Storm Surge/Tide.
df$EVTYPE[grepl("HIGH TIDE|STORM SURGE", df$EVTYPE)] <- "STORM SURGE/TIDE"
Black Ice and other ice related entries (but not ice storem) fall under the category of Winter Weather.
df$EVTYPE[grepl(".+ICE|IC.*ROAD", df$EVTYPE)] <- "WINTER WEATHER"
Here we deal with the data containing the word “snow”. Rain/Snow is relabelled as Sleet, Light Snow goes under Winter Weather, Cold and Snow under Cold/Wind Chill, and finally the rest under Heavy Snow except Lake-Effect Snow.
df$EVTYPE[grepl("RAIN/SNOW", df$EVTYPE)] <- "SLEET"
df$EVTYPE[grepl("LIGHT SNOW", df$EVTYPE)] <- "WINTER WEATHER"
df$EVTYPE[grepl("COLD AND SNOW", df$EVTYPE)] <- "COLD/WIND CHILL"
df$EVTYPE[grepl("SNOW", df$EVTYPE) &
!grepl("LAKE-EFFECT SNOW", df$EVTYPE)] <- "HEAVY SNOW"
After relabelling Rain/Snow, we can put the rest of data containing “rain” under the event type Heavy Rain.
df$EVTYPE[grepl("RAIN", df$EVTYPE)] <- "HEAVY RAIN"
We recongnize that “CSTL” is an abbreviation of Coastal, and we also put Tital Flooding under Coastal Flood.
df$EVTYPE[grepl("COASTAL +FLOOD|CSTL|TIDAL", df$EVTYPE)] <- "COASTAL FLOOD"
We put the coastal storm data under Tropical Storm.
df$EVTYPE[grepl("COASTAL ?STORM", df$EVTYPE)] <- "TROPICAL STORM"
All data containing the word “extreme” are put under the category of Extreme Cold/Wind Chill, while the rest of data containing “cold” are put under Cold/Wind Chill.
df$EVTYPE[grepl("EXTREME", df$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
df$EVTYPE[grepl("COLD", df$EVTYPE) &
!grepl("EXTREME COLD", df$EVTYPE)] <- "COLD/WIND CHILL"
Dam Break is relabelled as Flash Flood.
df$EVTYPE[grepl("DAM BREAK", df$EVTYPE)] <- "FLASH FLOOD"
Fog is relabelled as Dense Fog.
df$EVTYPE[grepl("^FOG", df$EVTYPE)] <- "DENSE FOG"
All data containing “freezing” are put under Frost/Freeze, except Freezing Fog. Glaze is put under Frost/Freeze as well.
df$EVTYPE[grepl("FREEZING|GLAZE", df$EVTYPE) &
!grepl("FREEZING FOG", df$EVTYPE)] <- "FROST/FREEZE"
Gusty wind(s), Wind(s), Non TSTM Wind, Non-Severe Wind Damage, and Strong Winds are relabelled as Strong Wind.
df$EVTYPE[grepl("^WIND.?$|GUSTY|NON.*WIND|^STRONG WIND", df$EVTYPE)] <-
"STRONG WIND"
Next we put all data containing “surf”, “swells”, “wave”, together with those ending with “water” and “seas” under High Surf.
df$EVTYPE[grepl("SURF|SWELLS|WAVE|SEAS$|WATER$", df$EVTYPE)] <- "HIGH SURF"
We relabel Heat Wave as Heat, and we also put data containing “warm” under Heat.
df$EVTYPE[grepl("HEAT WAVE|WARM", df$EVTYPE)] <- "HEAT"
They are put under Heat and Cold/Wind Chill respectively.
df$EVTYPE[grepl("HYPERTHERMIA", df$EVTYPE)] <- "HEAT"
df$EVTYPE[grepl("HYPOTHERMIA", df$EVTYPE)] <- "COLD/WIND CHILL"
Landslide(s) and Mudslide(s) are relabelled as Debris Flow.
df$EVTYPE[grepl("LANDSLIDE|MUDSLIDE", df$EVTYPE)] <- "DEBRIS FLOW"
We recognize that “Mixed Precip” stands for Mixed Precipitation, hence it should be put under Sleet.
df$EVTYPE[grepl("MIXED PRECIP", df$EVTYPE)] <- "SLEET"
River Flood(ing) belongs to the category of Flood, as well as Urban/Small Stream Flooding which is abbreviated as “Urban/Sml Stream Fld”.
df$EVTYPE[grepl("RIVER FLOOD|FLD", df$EVTYPE)] <- "FLOOD"
Small Hail is now put under the event type Hail.
df$EVTYPE[grepl("SMALL HAIL", df$EVTYPE)] <- "HAIL"
Whirlwind is relabelled as Tornado.
df$EVTYPE[grepl("WHIRLWIND", df$EVTYPE)] <- "TORNADO"
Finally we filter out the data that do not fit into any category listed in National Weather Service Storm Data Documentation. Those are Drowning, Dry MicroBurst, and Marine Accident
df <- df %>% filter(!grepl("DROWNING|MICROBURST|MARINE ACCIDENT", EVTYPE))
In order to find out which types of events are the most harmful with respect to population health, we add up the fatalities and injuries number for each event type. Then we add up the 2 numbers to form a column named Total. The events that have the highest total number are shown below.
df1 <- df %>% group_by(EVTYPE) %>%
summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES),
TOTAL = sum(FATALITIES) + sum(INJURIES)) %>%
arrange(desc(TOTAL))
head(df1, n=10)
## Source: local data frame [10 x 4]
##
## EVTYPE FATALITIES INJURIES TOTAL
## 1 TORNADO 1512 20667 22179
## 2 EXCESSIVE HEAT 1797 6391 8188
## 3 FLOOD 444 6838 7282
## 4 THUNDERSTORM WIND 379 5129 5508
## 5 LIGHTNING 651 4141 4792
## 6 FLASH FLOOD 887 1674 2561
## 7 WILDFIRE 87 1458 1545
## 8 WINTER STORM 191 1292 1483
## 9 HEAT 238 1241 1479
## 10 HURRICANE 125 1328 1453
We see that Tornado is the most harmful event by a large margin. It is followed by Excessive Heat, Flood, Thunderstorm Wind and Lightning. After that the number drops significantly. We remark that even though the difference in the injuries numbers between Tornado and Excessive Heat is huge, Excessive Heat has the highest number of fatalities so we can consider it to be more lethal than Tornado. We conlude this subsection by plotting the above table in a bar chart.
library(reshape2)
df1_long <- melt(df1[1:10,1:3], id.vars="EVTYPE")
library(ggplot2)
ggplot(df1_long, aes(x = EVTYPE, y = value, fill = variable)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=90)) +
labs(title="Most Harmful Event Types") +
labs(x="Event Type", y="TOTAL")
Similar to the previous part, we add up the property damage and the crop damage for each event type, and then we sum the 2 numbers to form a column named Total Damage. The events with the highest total number are shown below, where the numbers are in billion dollars.
df2 <- df %>% group_by(EVTYPE) %>%
summarize(PROP_DAMAGE = sum(PROPDMGMULTIPLIED)/10^9,
CROP_DAMAGE = sum(CROPDMGMULTIPLIED)/10^9,
TOTAL_DAMAGE= sum(PROPDMGMULTIPLIED)/10^9 +
sum(CROPDMGMULTIPLIED)/10^9) %>%
arrange(desc(TOTAL_DAMAGE))
head(df2, n=10)
## Source: local data frame [10 x 4]
##
## EVTYPE PROP_DAMAGE CROP_DAMAGE TOTAL_DAMAGE
## 1 FLOOD 143.27821 4.86032 148.13853
## 2 HURRICANE 81.70724 5.34348 87.05072
## 3 STORM SURGE/TIDE 47.81798 0.00000 47.81798
## 4 TORNADO 23.43980 0.19349 23.63329
## 5 HAIL 14.02981 1.99875 16.02856
## 6 FLASH FLOOD 13.98040 1.17501 15.15541
## 7 DROUGHT 1.04305 13.34662 14.38967
## 8 TROPICAL STORM 7.59392 0.67312 8.26704
## 9 WILDFIRE 7.64324 0.39410 8.03734
## 10 THUNDERSTORM WIND 5.71324 0.83765 6.55089
The event type that causes the greatest economic damage is Flood, followed by Hurricane and Storm Surge/Tide. Notice that the differences among the total damages of these 3 event types are large. We also remark that the event type with the highest crop damage is Drought, which is also the only one in the above table with its crop damage higher than its property damage. Finally we conclude this subsection with a bar chart of the above result.
df2_long <- melt(df2[1:10,1:3], id.vars="EVTYPE")
library(ggplot2)
ggplot(df2_long, aes(x = EVTYPE, y = value, fill = variable)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=90)) +
labs(title="Event Types with Greatest Economic Consequences") +
labs(x="Event Type", y="TOTAL DAMAGE (billion dollars)")