This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Based on the analysis of data from 1996 onward, the three event types most harmful to population health are, in order, Tornado, Excessive Heat, and Flood. The three event types with the greatest economic consequences are, in order, Flood, Hurricane (Typhoon), and Storm Surge/Tide. Flood is the only event type that appears in both top-3 lists.
The data file is the Storm Data file. Some documentation of the database is also available:
. National Weather Service Storm Data Documentation
. National Climatic Data Center Storm Events FAQ
First download the data file, then read the data directly from the compressed (bzip2) CSV file. Fields are delimited with the ‘,’ character and missing values are coded as blank fields. Some string fields are double-quoted (“). The file also contains a header line.
# Packages used throughout this report
library(data.table)
library(dplyr)
library(ggplot2)
fileLocal <- "StormData.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(fileLocal)) {
  download.file(fileURL, destfile = fileLocal)
}
# Reads data file into data table
data <- data.table(read.csv(fileLocal, header = TRUE, sep = ",", quote = "\"", na.strings = ""))
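To illustrate how the `quote` and `na.strings` arguments behave, here is a minimal sketch on an inline sample. The three lines of `sample_csv` are made up for illustration and are not taken from the dataset.

```r
# Hypothetical sample mimicking the file layout: comma-delimited,
# a header line, blank fields for missing values, and a
# double-quoted string that itself contains a comma.
sample_csv <- 'EVTYPE,FATALITIES,REMARKS
TORNADO,0,"Roof damage, trees down"
HAIL,,
'
sample_rows <- read.csv(text = sample_csv, header = TRUE, sep = ",",
                        quote = "\"", na.strings = "")
# Blank fields become NA; the quoted comma stays inside one field.
str(sample_rows)
```

The quoted field is read as a single value, and both blank fields in the second row come back as `NA`.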
After reading in the data, we check its dimensions and the first few rows. The dataset has a total of 902,297 rows and 37 columns.
dim(data)
## [1] 902297 37
head(data[, 2:8, with = FALSE], n = 6)
## BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1: 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2: 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3: 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4: 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5: 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6: 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
According to NOAA's National Climatic Data Center, only a few storm and weather event types were recorded prior to 1996. From 1996 onward, more complete data were recorded, covering the 48 event types defined in NWS Directive 10-1605. For the purpose of this analysis, and to keep the comparison fair, only data from 1996 and later are used:
data[["DATE_OBJ"]] <- as.POSIXct(strptime(data$BGN_DATE, "%m/%d/%Y"))
data1 <- subset(data, DATE_OBJ >= as.POSIXct("1996-01-01"))
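The date handling can be checked on a few values in isolation. The sketch below uses two `BGN_DATE` strings from the output above plus one hypothetical 1996 value, and parses them with `as.Date` rather than `as.POSIXct`. That is an alternative chosen here because a plain date carries no time zone, not the approach used in the report.

```r
# The first two strings are copied from the head() output above;
# the third is a hypothetical value added to show a row that is kept.
bgn <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00", "1/6/1996 0:00:00")
# The format covers only the date part; trailing characters are ignored.
bgn_date <- as.Date(bgn, format = "%m/%d/%Y")
keep <- bgn_date >= as.Date("1996-01-01")
data.frame(bgn, bgn_date, keep)
```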
To further simplify and speed up the cleaning and processing in later steps, only the observations that meet at least one of the following criteria are kept:
. Has fatalities, or
. Has injuries, or
. Has property damage with valid monetary multiplier, or
. Has crop damage with valid monetary multiplier.
data1 <- subset(data1, (FATALITIES > 0 | INJURIES > 0) |
(PROPDMG > 0 & grepl("b|B|h|H|k|K|m|M", PROPDMGEXP)) |
(CROPDMG > 0 & grepl("b|B|h|H|k|K|m|M", CROPDMGEXP)))
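Note that `grepl("b|B|h|H|k|K|m|M", ...)` is a substring match, so any code that merely contains one of these letters passes. If the exponent codes are single characters, an exact-match test should select the same rows. The sketch below shows that stricter variant; `valid_exp` and `has_valid_exp` are hypothetical names, and the sample codes are illustrative.

```r
# Hypothetical helper: TRUE only when the code is exactly one of
# the multiplier letters, regardless of case.
valid_exp <- c("H", "K", "M", "B")
has_valid_exp <- function(x) toupper(as.character(x)) %in% valid_exp

# Illustrative codes, including blank, numeric, symbol, and missing values.
codes <- c("K", "m", "", "0", "+", NA, "B")
has_valid_exp(codes)
```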
The recorded event types (“EVTYPE”) are not as clean as expected, so they need to be cleaned up and recoded. First, create a whitelist of event types based on the National Weather Service Storm Data Documentation. Then keep the event types that match the whitelist exactly (after trimming whitespace and converting to upper case), and mark the unmatched ones as “OTHER”.
weatherEventsStd <- c(
"ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD",
"COASTAL FLOOD", "COLD/WIND CHILL", "DEBRIS FLOW",
"DENSE FOG", "DENSE SMOKE", "DROUGHT",
"DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD",
"FROST/FREEZE", "FUNNEL CLOUD", "FREEZING FOG",
"HAIL", "HEAT", "HEAVY RAIN",
"HEAVY SNOW", "HIGH SURF", "HIGH WIND",
"HURRICANE (TYPHOON)", "ICE STORM", "LAKE-EFFECT SNOW",
"LAKESHORE FLOOD", "LIGHTNING", "MARINE HAIL",
"MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND",
"RIP CURRENT", "SEICHE", "SLEET",
"STORM SURGE/TIDE", "STRONG WIND", "THUNDERSTORM WIND",
"TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM",
"TSUNAMI", "VOLCANIC ASH", "WATERSPOUT",
"WILDFIRE", "WINTER STORM", "WINTER WEATHER"
)
data1$EVENT_SDT <- ifelse(trimws(toupper(data1$EVTYPE)) %in% weatherEventsStd,
trimws(toupper(as.character(data1$EVTYPE))), "OTHER")
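The matching step can be seen on a toy example. In this sketch, `std` is a three-item stand-in for `weatherEventsStd` and `raw` holds illustrative spellings; both are assumptions made for the demonstration.

```r
# Short stand-in for the full whitelist.
std <- c("TORNADO", "FLOOD", "HIGH WIND")
# Illustrative raw values with stray whitespace and mixed case.
raw <- c(" Tornado", "flood ", "TSTM WIND", "High Wind")

# Normalize first, then keep whitelisted values and flag the rest.
clean <- trimws(toupper(raw))
standardized <- ifelse(clean %in% std, clean, "OTHER")
standardized
```

Only differences in case and surrounding whitespace are forgiven. An abbreviation such as "TSTM WIND" still falls into "OTHER".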
Quite a few “non-standard” event types remain:
length(unique(filter(data1, EVENT_SDT == "OTHER")$EVTYPE))
## [1] 171
head(unique(filter(data1, EVENT_SDT == "OTHER")$EVTYPE), n = 10)
## [1] TSTM WIND FREEZING RAIN EXTREME COLD
## [4] TSTM WIND/HAIL RIP CURRENTS Other
## [7] WILD/FOREST FIRE STORM SURGE Ice jam flood (minor
## [10] Tstm Wind
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
Before trying to categorize every entry in this list, let’s look at it from the casualty and economic-consequence perspectives. Perhaps only a handful of event types account for the majority of casualties or economic consequences. If so, only those event types need to be categorized, and the remaining uncategorized event types will have no significant impact on the analysis.
First, extract the observations whose event types were categorized as “OTHER” in the earlier step. Then convert EVENT_SDT to a factor for grouping purposes:
dataOther <- filter(data1, EVENT_SDT %in% "OTHER")
dataOther$EVENT_SDT <- as.factor(dataOther$EVENT_SDT)
Top 30 uncategorized event types (OTHER) that are most harmful to population health:
# Summarizes the data "dataOther" to get top harmful events to population health.
dataPopHealth_other <- dataOther %>% group_by(EVTYPE) %>%
summarize("POPULATION" = sum(FATALITIES + INJURIES)) %>%
arrange(desc(POPULATION))
head(dataPopHealth_other, n = 30)
## EVTYPE POPULATION
## 1 TSTM WIND 3870
## 2 HURRICANE/TYPHOON 1339
## 3 FOG 772
## 4 WILD/FOREST FIRE 557
## 5 RIP CURRENTS 496
## 6 GLAZE 213
## 7 EXTREME COLD 192
## 8 URBAN/SML STREAM FLD 107
## 9 HURRICANE 107
## 10 WIND 102
## 11 TSTM WIND/HAIL 100
## 12 WINTER WEATHER/MIX 100
## 13 HEAVY SURF/HIGH SURF 90
## 14 LANDSLIDE 89
## 15 WINTRY MIX 78
## 16 Heat Wave 70
## 17 WINTER WEATHER MIX 68
## 18 HEAVY SURF 45
## 19 STORM SURGE 39
## 20 SNOW SQUALL 37
## 21 DRY MICROBURST 28
## 22 MIXED PRECIP 28
## 23 COLD 27
## 24 STRONG WINDS 27
## 25 ICY ROADS 26
## 26 BLACK ICE 25
## 27 EXTREME WINDCHILL 22
## 28 UNSEASONABLY WARM 17
## 29 MARINE TSTM WIND 17
## 30 FREEZING DRIZZLE 15
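The idea that a handful of event types dominate can be made concrete with a cumulative share. The sketch below uses the first ten rows of the table above, so the shares are relative to those ten rows only, not to all uncategorized events.

```r
# Casualty counts for the first ten rows of the table above.
pop <- c("TSTM WIND" = 3870, "HURRICANE/TYPHOON" = 1339, "FOG" = 772,
         "WILD/FOREST FIRE" = 557, "RIP CURRENTS" = 496, "GLAZE" = 213,
         "EXTREME COLD" = 192, "URBAN/SML STREAM FLD" = 107,
         "HURRICANE" = 107, "WIND" = 102)

# Running total divided by the grand total of these ten rows.
cum_share <- cumsum(pop) / sum(pop)
round(cum_share, 3)
```

Within these ten rows, the first five event types already account for more than 90% of the casualties.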
Top 20 uncategorized event types (OTHER) that have the greatest economic consequences:
# Summarizes the data "dataOther" to get the top events with the greatest economic consequences
dataEcoDmg_other <- dataOther %>% mutate(prop_dmg = ifelse(
grepl("b|B", PROPDMGEXP), PROPDMG * 1e+09,
ifelse(
grepl("m|M", PROPDMGEXP), PROPDMG * 1e+06,
ifelse(
grepl("k|K", PROPDMGEXP), PROPDMG * 1e+03,
ifelse(grepl("h|H", PROPDMGEXP), PROPDMG * 1e+02, 0)
)
)
)) %>%
mutate(crop_dmg = ifelse(
grepl("b|B", CROPDMGEXP), CROPDMG * 1e+09,
ifelse(
grepl("m|M", CROPDMGEXP), CROPDMG * 1e+06,
ifelse(
grepl("k|K", CROPDMGEXP), CROPDMG * 1e+03,
ifelse(grepl("h|H", CROPDMGEXP), CROPDMG * 1e+02, 0)
)
)
)) %>%
group_by(EVTYPE) %>%
summarize("Total_DMG" = sum(prop_dmg + crop_dmg)) %>%
arrange(desc(Total_DMG))
head(dataEcoDmg_other, n = 20)
## EVTYPE Total_DMG
## 1 HURRICANE/TYPHOON 71913712800
## 2 STORM SURGE 43193541000
## 3 HURRICANE 14554229010
## 4 TSTM WIND 5031941790
## 5 WILD/FOREST FIRE 3108564830
## 6 EXTREME COLD 1308733400
## 7 TYPHOON 601055000
## 8 LANDSLIDE 344595000
## 9 FREEZE 146425000
## 10 River Flooding 134175000
## 11 TSTM WIND/HAIL 109031750
## 12 COASTAL FLOODING 97484000
## 13 URBAN/SML STREAM FLD 66797750
## 14 Early Frost 42000000
## 15 Damaging Freeze 34130000
## 16 AGRICULTURAL FREEZE 28820000
## 17 UNSEASONABLY COLD 25042500
## 18 RIVER FLOOD 22157000
## 19 SMALL HAIL 20863000
## 20 COASTAL FLOODING/EROSION 20030000
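The nested `ifelse` calls above map an exponent code to a multiplier. One possible alternative is a named lookup vector, sketched below. `exp_multiplier` and `to_dollars` are hypothetical names, and unlike the `grepl` version this variant matches the code exactly rather than as a substring.

```r
# Multipliers keyed by upper-case exponent code.
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: converts a damage figure and its exponent code
# to dollars; unknown, blank, or missing codes contribute 0.
to_dollars <- function(value, exp_code) {
  mult <- unname(exp_multiplier[toupper(as.character(exp_code))])
  mult[is.na(mult)] <- 0
  value * mult
}

# Illustrative values: 25K, 2.5M, a blank code, and 3b.
dollars <- to_dollars(c(25, 2.5, 10, 3), c("K", "M", "", "b"))
dollars
```

Because the same helper can be applied to both the property and the crop columns, it would also avoid repeating the nested block.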
Let’s perform a brute force recoding of these event types. The following helper function is used:
recodeEventType
## function (data, pattern, eventType)
## {
## ifelse(data$EVENT_SDT == "OTHER", ifelse(grepl(pattern,
## trimws(data$EVTYPE), ignore.case = TRUE), eventType,
## data$EVENT_SDT), data$EVENT_SDT)
## }
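A self-contained sketch of the same idea, written so that it depends only on its own argument and can be tried on a toy data frame. `recode_event` and `toy` are hypothetical names used for illustration.

```r
# Hypothetical variant of the helper: only rows still marked "OTHER"
# whose raw EVTYPE matches the pattern receive the new event type.
recode_event <- function(df, pattern, event_type) {
  is_other <- df$EVENT_SDT == "OTHER"
  matches <- grepl(pattern, trimws(df$EVTYPE), ignore.case = TRUE)
  ifelse(is_other & matches, event_type, df$EVENT_SDT)
}

# Toy data: two spellings to recode, one standard type, one unmatched.
toy <- data.frame(EVTYPE = c("TSTM WIND", "Tstm Wind", "TORNADO", "Other"),
                  EVENT_SDT = c("OTHER", "OTHER", "TORNADO", "OTHER"),
                  stringsAsFactors = FALSE)
toy$EVENT_SDT <- recode_event(toy, "^TSTM", "THUNDERSTORM WIND")
toy
```

Rows that were already categorized are left untouched, so the calls can be applied one after another in any order without overwriting earlier matches.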
To perform the recoding, the top 20 uncategorized event types by casualties and the top 10 uncategorized event types by monetary loss are chosen. These two cutoffs were found by re-running this markdown file several times until the remaining uncategorized event types were confirmed to have no significant impact on the analysis.
# Top 20 - Non-standard event types with highest casualties.
data1$EVENT_SDT <- recodeEventType(data1, "^TSTM", "THUNDERSTORM WIND")
data1$EVENT_SDT <- recodeEventType(data1, "^HURRICANE|^TYPHOON", "HURRICANE (TYPHOON)")
data1$EVENT_SDT <- recodeEventType(data1, "^FOG$", "DENSE FOG")
data1$EVENT_SDT <- recodeEventType(data1, "^WILD.*FIRE$", "WILDFIRE")
data1$EVENT_SDT <- recodeEventType(data1, "^RIP CURRENT", "RIP CURRENT")
data1$EVENT_SDT <- recodeEventType(data1, "^GLAZE$|^FREEZE$", "FROST/FREEZE")
data1$EVENT_SDT <- recodeEventType(data1, "^EXTREME COLD$", "EXTREME COLD/WIND CHILL")
data1$EVENT_SDT <- recodeEventType(data1, "^URBAN/SML STREAM FLD$|^RIVER FLOODING$", "FLOOD")
data1$EVENT_SDT <- recodeEventType(data1, "^WIND$", "HIGH WIND")
data1$EVENT_SDT <- recodeEventType(data1, "^WINTER WEATHER|^WINTRY MIX$", "WINTER WEATHER")
data1$EVENT_SDT <- recodeEventType(data1, "^HEAVY SURF|HIGH SURF$", "HIGH SURF")
data1$EVENT_SDT <- recodeEventType(data1, "^STORM SURGE$", "STORM SURGE/TIDE")
data1$EVENT_SDT <- recodeEventType(data1, "^LANDSLIDE$", "DEBRIS FLOW")
data1$EVENT_SDT <- recodeEventType(data1, "^HEAT WAVE$", "EXCESSIVE HEAT")
data1$EVENT_SDT <- recodeEventType(data1, "^SNOW SQUALL$", "BLIZZARD")
# Top 10 - Non-standard event types with greatest economic consequences.
# All event types were already covered by the above function calls.
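The anchors `^` and `$` decide how much each pattern catches, and `|` binds more loosely than either anchor. The sketch below checks three patterns against event type strings that appear in the tables above. The third, parenthesized pattern is an illustrative variant, not one used in the report.

```r
# Event type strings taken from the tables above.
types <- c("TSTM WIND", "TSTM WIND/HAIL", "MARINE TSTM WIND",
           "HEAVY SURF", "HEAVY SURF/HIGH SURF")

# Prefix match: anything starting with TSTM.
starts_tstm <- grepl("^TSTM", types, ignore.case = TRUE)
# Starts with HEAVY SURF, or ends with HIGH SURF.
surf_loose <- grepl("^HEAVY SURF|HIGH SURF$", types, ignore.case = TRUE)
# Whole string must be exactly one of the two alternatives.
surf_exact <- grepl("^(HEAVY SURF|HIGH SURF)$", types, ignore.case = TRUE)

data.frame(types, starts_tstm, surf_loose, surf_exact)
```

"MARINE TSTM WIND" is not caught by `^TSTM`, so such rows stay in "OTHER" unless a separate pattern is added for them.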
Now convert EVENT_SDT to a factor for summary and plotting purposes:
data1$EVENT_SDT <- as.factor(data1$EVENT_SDT)
Summarize the data for the top 10 most harmful events:
dataPopHealth <- data1 %>% group_by(EVENT_SDT) %>%
summarize("POPULATION" = sum(FATALITIES + INJURIES)) %>%
arrange(desc(POPULATION))
dataPopHealth_10 <- head(dataPopHealth, 10)
List the top 10 most harmful events:
dataPopHealth_10
## EVENT_SDT POPULATION
## 1 TORNADO 22178
## 2 EXCESSIVE HEAT 8258
## 3 FLOOD 7281
## 4 THUNDERSTORM WIND 5505
## 5 LIGHTNING 4792
## 6 FLASH FLOOD 2561
## 7 WILDFIRE 1543
## 8 WINTER STORM 1483
## 9 HEAT 1459
## 10 HURRICANE (TYPHOON) 1453
List the uncategorized events (OTHER) to check that their casualty count is insignificant compared with at least the top 3 categorized events:
filter(dataPopHealth, EVENT_SDT %in% "OTHER")
## Source: local data table [1 x 2]
##
## EVENT_SDT POPULATION
## (fctr) (dbl)
## 1 OTHER 423
It turns out that the casualties from the remaining uncategorized events (OTHER) are insignificant compared with the top 6 categorized events.
Summarize the data for the top 10 events with the greatest economic consequences:
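To quantify this comparison, the OTHER total can be expressed as a percentage of each top-10 total. The figures below are copied from the two outputs above.

```r
# Casualty totals for the top 10 categorized events (from the table above).
top10 <- c(TORNADO = 22178, "EXCESSIVE HEAT" = 8258, FLOOD = 7281,
           "THUNDERSTORM WIND" = 5505, LIGHTNING = 4792, "FLASH FLOOD" = 2561,
           WILDFIRE = 1543, "WINTER STORM" = 1483, HEAT = 1459,
           "HURRICANE (TYPHOON)" = 1453)
# Remaining uncategorized casualties.
other <- 423

# OTHER as a percentage of each categorized total.
pct_of_top <- round(100 * other / top10, 1)
pct_of_top
```

OTHER is below 2% of the TORNADO total and below 6% of each of the top 3 totals.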
dataEcoDmg <- data1 %>% mutate(prop_dmg = ifelse(
grepl("b|B", PROPDMGEXP), PROPDMG * 1e+09,
ifelse(
grepl("m|M", PROPDMGEXP), PROPDMG * 1e+06,
ifelse(
grepl("k|K", PROPDMGEXP), PROPDMG * 1e+03,
ifelse(grepl("h|H", PROPDMGEXP), PROPDMG * 1e+02, 0)
)
)
)) %>%
mutate(crop_dmg = ifelse(
grepl("b|B", CROPDMGEXP), CROPDMG * 1e+09,
ifelse(
grepl("m|M", CROPDMGEXP), CROPDMG * 1e+06,
ifelse(
grepl("k|K", CROPDMGEXP), CROPDMG * 1e+03,
ifelse(grepl("h|H", CROPDMGEXP), CROPDMG * 1e+02, 0)
)
)
)) %>%
group_by(EVENT_SDT) %>%
summarize("Total_DMG" = sum(prop_dmg + crop_dmg)) %>%
arrange(desc(Total_DMG))
dataEcoDmg_10 <- head(dataEcoDmg, 10)
# Gets the monetary damages in billions of dollars.
dataEcoDmg_10$Total_DMG_BIL <- round(dataEcoDmg_10$Total_DMG / 1e+09, digits = 2)
List the top 10 events with the greatest economic consequences:
dataEcoDmg_10
## EVENT_SDT Total_DMG Total_DMG_BIL
## 1 FLOOD 149120584700 149.12
## 2 HURRICANE (TYPHOON) 87068996810 87.07
## 3 STORM SURGE/TIDE 47835579000 47.84
## 4 TORNADO 24900370720 24.90
## 5 HAIL 17071172870 17.07
## 6 FLASH FLOOD 16557155610 16.56
## 7 DROUGHT 14413667000 14.41
## 8 THUNDERSTORM WIND 8930498480 8.93
## 9 TROPICAL STORM 8320186550 8.32
## 10 WILDFIRE 8162704630 8.16
List the uncategorized events (OTHER) to check that their property and crop damage is insignificant compared with at least the top 3 categorized events:
filter(dataEcoDmg, EVENT_SDT %in% "OTHER")
## Source: local data table [1 x 2]
##
## EVENT_SDT Total_DMG
## (fctr) (dbl)
## 1 OTHER 420551090
The damage from the remaining uncategorized events (OTHER) is insignificant compared with every one of the top 10 categorized events.
The code and plot below show the top 10 most harmful events against the number of casualties:
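As a quick check of scale, the OTHER damage can be compared with the smallest top-10 total and with the combined total. The figures are copied from the two outputs above.

```r
# Total damage in dollars for the top 10 categorized events, in table order.
top10_dmg <- c(149120584700, 87068996810, 47835579000, 24900370720,
               17071172870, 16557155610, 14413667000, 8930498480,
               8320186550, 8162704630)
# Remaining uncategorized damage.
other_dmg <- 420551090

# Ratio to the smallest top-10 entry (WILDFIRE).
ratio_to_smallest <- other_dmg / min(top10_dmg)
# Share of the combined top-10 plus OTHER total.
share_of_total <- other_dmg / (sum(top10_dmg) + other_dmg)
c(ratio_to_smallest = ratio_to_smallest, share_of_total = share_of_total)
```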
ggplot(dataPopHealth_10, aes(EVENT_SDT, POPULATION, fill=EVENT_SDT)) +
geom_bar(stat = "identity", width = 0.8) + coord_flip() +
labs(title = "Top 10 Most Harmful Event Types w/. Respect to Population Health (1996 - 2011)") +
labs(x = "Event Type", y = "Number of Casualties") + theme_bw() +
  geom_text(aes(y = POPULATION, label = POPULATION),
            position = position_dodge(width = 0.9), vjust = .5, hjust = -0.1, color = "black") +
  ##scale_fill_discrete(name ="Event Type") +
  # Hides the redundant fill legend.
  guides(fill = "none") +
  # The casualty axis is numeric, so a continuous scale is used.
  scale_y_continuous(expand = c(0, 4000), breaks = seq(0, 40000, 5000))
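Bars in a chart like this follow the factor level order, which is alphabetical by default. One option for sorting the bars by value is to reorder the levels with `reorder()`. The sketch below uses the top five rows of the casualty table. Passing `reorder(EVENT_SDT, POPULATION)` as the x aesthetic would be the corresponding change to the plot, which is a suggestion rather than what the report does.

```r
# Top five event types and casualty totals from the table above.
events <- c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "THUNDERSTORM WIND", "LIGHTNING")
casualties <- c(22178, 8258, 7281, 5505, 4792)

# reorder() sorts the factor levels by increasing value, so after
# coord_flip() the largest bar would end up at the top.
ordered_events <- reorder(factor(events), casualties)
levels(ordered_events)
```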
The three storm and weather event types most harmful to population health are, in order, Tornado, Excessive Heat, and Flood.
The code and plot below show the top 10 events with the greatest economic consequences against the amount of damage in billions of dollars:
ggplot(dataEcoDmg_10, aes(EVENT_SDT, Total_DMG_BIL, fill=EVENT_SDT)) +
geom_bar(stat = "identity", width = 0.8) + coord_flip() +
labs(title = "Top 10 Event Types Have Greatest Economic Consequences (1996 - 2011)") +
labs(x = "Event Type", y = "Amount of Damage (in Billion Dollars)") + theme_bw() +
  geom_text(aes(y = Total_DMG_BIL, label = Total_DMG_BIL),
            position = position_dodge(width = 0.9), vjust = .5, hjust = -0.1, color = "black") +
  ##scale_fill_discrete(name ="Event Type") +
  # Hides the redundant fill legend.
  guides(fill = "none") +
  # The damage axis is numeric, so a continuous scale is used.
  scale_y_continuous(expand = c(0, 18), breaks = seq(0, 200, 40))
The three storm and weather event types with the greatest economic consequences are, in order, Flood, Hurricane (Typhoon), and Storm Surge/Tide.