In this analysis, U.S. storm events data from 1955 to 2011 are explored to show which types of events are most harmful with respect to population health and the economy. The data source is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The bulk of this research is about data cleaning, including standardizing event types, calculating damages, etc. Cleaned data are split into 2 subsets with the cutoff point at January 1st, 1993, before which only 3 types of storm events were recorded. For each storm event, its effects in a single occurrence as well as its cumulative effects over either analyzed period are both assessed.
We found that before 1993 tornado was the most harmful one among the 3 recorded storm event types. After 1993, ice storm, flood and tornado can all be the most harmful one among the 48 standard event types, depending on the criterion selected. The detailed results are presented in the last part of this analysis.
library(rmarkdown)
library(knitr)
library(data.table)
library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
Download data and check the header (first line of the data file) to determine which variables should be kept for the analysis.
if (!file.exists("stormdata.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
}
table(readLines(bzfile("StormData.csv.bz2"), 1))
##
## "STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
## 1
# fread() is used instead of read.csv() for reading large dataset.
dt <-fread(sprintf("bzcat %s", "StormData.csv.bz2"),
select = c("STATE__", "BGN_DATE", "COUNTY", "COUNTYNAME", "STATE", "EVTYPE",
"FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
dt$BGN_DATE <- colsplit(string = dt$BGN_DATE, pattern=" ", names = c("Part1", "Part2"))$Part1
dt <- rename(dt, STATE_ID = STATE__)
dt$BGN_DATE <- mdy(dt$BGN_DATE)
It is stated on the instruction in NOAA official site that from 1950 through 1954, only tornado events were recorded. Since no comparison can be drawn, data before January 1955 are not considered in this analysis.
dt <- dt[dt$BGN_DATE >= ymd(19550101), ]
nrow(dt)
## [1] 900432
The following cleaning of crop and property damage data is based on a research of NOAA raw data by David Hood and Eddie Song. The goal is to generate 2 new variables showing the 2 kinds of damages in dollar amount.
# To keep the preliminarily processed dataset "dt" intact, its copy "alldata" is used for the further cleaning.
alldata = dt
table(alldata$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 422909 7 11221
table(alldata$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 616548 7 19 1 9 21 281832 1 1994
mapping <- function(x) {
# Numbers need to be dealt with first, otherwise the other regular expressions applied would lead to erroneous result.
x <- gsub("[0-8]", 10, x)
x <- gsub("\\?", 0, x)
x <- gsub("^$", 0, x)
x <- gsub("\\-", 0, x)
x <- gsub("\\+", 1, x)
x <- gsub("b", 1000000000, x, ignore.case = T)
x <- gsub("m", 1000000, x, ignore.case = T)
x <- gsub("k", 1000, x, ignore.case = T)
x <- gsub("h", 100, x, ignore.case = T)
return(as.numeric(x))
}
# Must assign the values back to the original variables.
alldata$PROPDMGEXP <- mapping(alldata$PROPDMGEXP)
alldata$CROPDMGEXP <- mapping(alldata$CROPDMGEXP)
table(alldata$PROPDMGEXP)
##
## 0 1 10 100 1000 1e+06 1e+09
## 465943 5 300 7 422909 11228 40
table(alldata$CROPDMGEXP)
##
## 0 10 1000 1e+06 1e+09
## 616555 20 281853 1995 9
# sapply(alldata, class)
alldata <- mutate(alldata, CROP_DAMAGE = CROPDMG * CROPDMGEXP, PROP_DAMAGE = PROPDMG * PROPDMGEXP)
alldata <- select(alldata, -one_of(c("PROPDMG", "PROPDMGEXP", "CROPDMG","CROPDMGEXP")))
Only keep the records that had economic and/or population health consequences.
impactdata <- filter(alldata, FATALITIES !=0 | INJURIES !=0 | CROP_DAMAGE !=0 | PROP_DAMAGE !=0)
Event types are cleaned base on Storm Data Event Table in the NOAA Storm Data Preparation Document.
# Compile the official event types.
officialtypes <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
# One entry has EVTYPE == "?", delete it because it's useless for the analysis. (Readers can check the NOAA database to comfirm this.)
impactdata <- filter(impactdata, EVTYPE != "?")
nrow(impactdata)
## [1] 252975
# Transform all EVTYPEs to uppercase strings.
impactdata$EVTYPE <- toupper(impactdata$EVTYPE)
Create a “matched” column in “impactdata” indicating whether each observation’s event type is already absolutely precise. Then split the dataset based on it.
impactdata$matched <- impactdata$EVTYPE %in% officialtypes
eventsclean <- impactdata[impactdata$matched == 1, ]
eventsdirty <- impactdata[impactdata$matched == 0, ]
print(c(oldclean <-nrow(eventsclean), olddirty <- nrow(eventsdirty)))
## [1] 171256 81719
For records whose event type does not exactly match any of the official terms, we first correct the apparent typos.
# Punctuation cleaning
eventsdirty$EVTYPE <- gsub("[\\. \\, \\( \\)]", " ", eventsdirty$EVTYPE)
# Replace multiple spaces with single space.
eventsdirty$EVTYPE <- gsub(" +", " ", eventsdirty$EVTYPE)
# Delete space at the beginning or the end of each string.
eventsdirty$EVTYPE <- gsub("^\\s* | \\s*$", "", eventsdirty$EVTYPE)
# Delete all numbers.
eventsdirty$EVTYPE <- gsub("[0-9]", "", eventsdirty$EVTYPE)
Type-specific regular expressions are then used to manually clean entries
head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)
##
## TSTM WIND THUNDERSTORM WINDS URBAN/SML STREAM FLD
## 63237 12057 702
## HIGH WINDS TSTM WIND/HAIL WILD/FOREST FIRE
## 657 441 388
## FLASH FLOODING FLOOD/FLASH FLOOD RIP CURRENTS
## 301 279 241
## EXTREME COLD LANDSLIDE STORM SURGE
## 199 193 177
## LIGHT SNOW URBAN FLOOD WINTER WEATHER/MIX
## 141 139 139
## HURRICANE MARINE TSTM WIND FOG
## 129 109 107
## RIVER FLOOD WIND
## 107 84
The majority of event-type-imprecise data can be easily determined as thunderstorm wind data.
If a observation’s EVTYPE string contains “TSTM” or “THUNDERSTORM” but not “MARINE”, it is considered as a thunderstorm wind event. This creates bias toward the effect of thunderstorm wind, but it significantly decreases the workload for further event type matching.
# (?<!...) means negative lookbehind. Must use pert = T argument.
eventsdirty$EVTYPE[grepl("(?<!MARINE )TSTM|THUNDERSTORM", eventsdirty$EVTYPE, perl = T)] <- "THUNDERSTORM WIND"
# Marine Thunderstorm Wind is a unique type that should not be classified as plain thunderstrom wind.
eventsdirty$EVTYPE[eventsdirty$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"
Next, some of the flood-related data are manipulated.
eventsdirty$EVTYPE[eventsdirty$EVTYPE == "URBAN/SML STREAM FLD"] <- "FLOOD"
eventsdirty$EVTYPE <- gsub("FLOODING", "FLOOD", eventsdirty$EVTYPE)
head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)
##
## THUNDERSTORM WIND FLOOD HIGH WINDS
## 75979 760 657
## WILD/FOREST FIRE FLASH FLOOD FLOOD/FLASH FLOOD
## 388 302 279
## RIP CURRENTS URBAN FLOOD EXTREME COLD
## 241 219 199
## LANDSLIDE STORM SURGE LIGHT SNOW
## 193 177 141
## WINTER WEATHER/MIX HURRICANE RIVER FLOOD
## 139 129 125
## MARINE THUNDERSTORM WIND FOG WIND
## 109 107 84
## DRY MICROBURST HURRICANE/TYPHOON
## 78 72
Now we can shift the just refined data in “eventsdirty” to the “eventsclean” dataset.
eventsclean <- rbind(eventsdirty[eventsdirty$EVTYPE %in% officialtypes, ], eventsclean)
eventsdirty <- filter(eventsdirty, !(EVTYPE %in% officialtypes))
print(c(newclean <- nrow(eventsclean), newdirty <- nrow(eventsdirty)))
## [1] 248468 4507
types_before_stringdist <- length(unique(eventsdirty$EVTYPE))
print(paste("Number of unique unidentified event types:", types_before_stringdist))
## [1] "Number of unique unidentified event types: 307"
The number of data in need of further manipulation has shrunk from 81719 to 4507.
What we need to do next is to find the most likely event types for the 4507 records (307 unique uncleaned event types) using approximate string matching.
A distance matrix is created to show the generalized Levenshtein (edit) distance between the official event types vector and the EVTYPE column in the “eventsdirty” dataset. For each observation, we find an offical event type whose distance to that observation is the shortest among all the official types. Then this offical event type is assumed as the true event type of that observation. The results are stored in a new column “ESTTYPE” (estimated type) in dataset “eventsdirty”.
# Must set partial = T to enable matching of substrings of each EVTYPE value!!!
distmatrix = as.data.frame(adist(eventsdirty$EVTYPE, officialtypes, partial = T))
# The column names of the matrix should be the 48 official event types.
names(distmatrix) <- officialtypes
eventsdirty$ESTTYPE <- NA
for (i in 1:nrow(eventsdirty)) {
# Find out which official type(s) is/are most plausible.
candis <- colnames(distmatrix)[which.min(distmatrix[i, 1:48])]
# In case there are ties between multiple types, only use the first type matched.
eventsdirty$ESTTYPE[i] <- candis[1]
}
The algorithm applied in adist() function does not necessarily guarantee the most probable true event type. Any little error in the identification of the real event type may lead to serious mistake in the subsequent data analysis part. Therefore 2 groups of most important observations in the “eventsdirty” set are filtered out for double check. The 2 groups are those
whose “EVTYPE” is among the 10 most frequent “EVTYPE”s, or
whose effect in terms of fatalities, injuries, crop damage or property damage ranks in the top 1‰.
The “EVTYPE” column of the 2 groups of data is selected and juxtaposed with its “ESTTYPE” counterpart.
For the first group:
# Show the 10 types of most frequently happened events.
print(mostfreq <- head(sort(table(eventsdirty$EVTYPE), decreasing = T), 10))
##
## HIGH WINDS WILD/FOREST FIRE FLOOD/FLASH FLOOD
## 657 388 279
## RIP CURRENTS URBAN FLOOD EXTREME COLD
## 241 219 199
## LANDSLIDE STORM SURGE LIGHT SNOW
## 193 177 141
## WINTER WEATHER/MIX
## 139
mostfreqnames <- names(mostfreq)
highfreqgroup <- eventsdirty[eventsdirty$EVTYPE %in% mostfreqnames, c("EVTYPE", "ESTTYPE")]
# Only show the unique entries for comparison.
highfreqgroup[!duplicated(highfreqgroup), ]
## EVTYPE ESTTYPE
## 7 HIGH WINDS HIGH WIND
## 17 EXTREME COLD EXTREME COLD/WIND CHILL
## 102 FLOOD/FLASH FLOOD FLASH FLOOD
## 156 URBAN FLOOD COASTAL FLOOD
## 189 STORM SURGE STORM SURGE/TIDE
## 1114 RIP CURRENTS RIP CURRENT
## 1459 WILD/FOREST FIRE WILDFIRE
## 1660 LANDSLIDE ASTRONOMICAL LOW TIDE
## 2118 LIGHT SNOW LAKE-EFFECT SNOW
## 3964 WINTER WEATHER/MIX WINTER WEATHER
As we can see, the estimated true event types are mostly correct. But 3 of the 10 most frequent events are incorrectly identified. They are “URBAN FLOOD”, “LANDSLIDE” and “LIGHT SNOW”.
For the second group:
toprankgroup <- eventsdirty %>% filter(FATALITIES > quantile(FATALITIES, 0.999) |
INJURIES > quantile(INJURIES, 0.999) |
CROP_DAMAGE > quantile(CROP_DAMAGE, 0.999) |
PROP_DAMAGE > quantile(PROP_DAMAGE, 0.999)) %>%
select(EVTYPE, ESTTYPE)
# Only show the unique entries.
toprankgroup[!duplicated(toprankgroup), ]
## EVTYPE ESTTYPE
## 1 WILD FIRES WILDFIRE
## 2 RIVER FLOOD LAKESHORE FLOOD
## 3 EXTREME COLD EXTREME COLD/WIND CHILL
## 4 UNSEASONABLY WARM AND DRY MARINE THUNDERSTORM WIND
## 5 HEAT WAVE EXCESSIVE HEAT
## 8 EXTREME HEAT EXTREME COLD/WIND CHILL
## 10 HURRICANE HURRICANE (TYPHOON)
## 11 HURRICANE/TYPHOON HURRICANE (TYPHOON)
## 16 STORM SURGE STORM SURGE/TIDE
In the most influential group, there’re also 3 types of events mis-identified, namely “RIVER FLOOD”, “UNSEASONABLY WARM AND DRY” and “EXTREME HEAT”.
These event type estimates are to be manually rectified.
# Write a function to save some key strokes because eventsdirty$COLUMNNAME is too long to type.
manualreplace <- function(ev, est) {
# From high frequency group:
est[ev == "URBAN FLOOD"] <- "FLOOD"
est[ev == "LANDSLIDE"] <- "DEBRIS FLOW"
est[ev == "LIGHT SNOW"] <- "HEAVY SNOW"
# From most influential group:
est[ev == "RIVER FLOOD"] <- "FLOOD"
est[ev == "UNSEASONABLY WARM AND DRY" | ev == "EXTREME HEAT"] <- "EXCESSIVE HEAT"
return (est)
}
eventsdirty$ESTTYPE <- manualreplace(eventsdirty$EVTYPE, eventsdirty$ESTTYPE)
So far all the event type refining is completed. For the 4507 records that went through approximate string matching, some of their mapping could still be erroneous as long as they did not undergo manual check. But considering that the 4507 records take up only 1.78% of the final dataset, no more resource in this analysis would be devoted to this matter.
The “eventsclean” and “eventsdirty” datasets are to be combined together for the data analysis in the next section.
dirtypart <- eventsdirty %>% select(-EVTYPE, -matched) %>% rename(EVTYPE = ESTTYPE)
cleanpart <- select(eventsclean, - matched)
refineddata <- rbind(cleanpart, dirtypart)
nrow(refineddata)
## [1] 252975
sapply(refineddata, class)
## STATE_ID BGN_DATE COUNTY COUNTYNAME STATE EVTYPE
## "numeric" "Date" "numeric" "character" "character" "character"
## FATALITIES INJURIES CROP_DAMAGE PROP_DAMAGE
## "numeric" "numeric" "numeric" "numeric"
# Alter the class of some columns.
refineddata[c(1, 3:6)] <- lapply(refineddata[c(1, 3:6)], factor)
We add 2 new variables “HEALTH_TOTAL” and “ECON_TOTAL” to the dataset “refineddata” storing for each event occurrence its total effect on population health and on economy.
refineddata <- refineddata %>% mutate(HEALTH_EFFECT = FATALITIES + INJURIES,
ECON_EFFECT = CROP_DAMAGE + PROP_DAMAGE)
From NOAA Storm Events Database Collection Source Clarification we can see that:
From 1955 through 1992, only tornado, thunderstorm wind and hail events were recorded.
From January 1993 through November 2011, NOAA recorded all of the 48 types of storm events.
Therefore, data will be split into 2 subsets accordingly and analyses would be carried out on each of them.
One subset (“before1993”) includes all the data in between 1955 and 1992 and a comparison will be drawn between tornado, thunderstorm wind and hail. Data in the period from 1993 to 2011 are put in another subset (“after1993”) to make a more complete comparison between all the 48 types of events. (Data before 1955 are already deleted in step 1 of data processing section above.)
before1993 = subset(refineddata, BGN_DATE >= ymd(19550101) & BGN_DATE < ymd(19930101))
after1993 = subset(refineddata, BGN_DATE >= ymd(19930101))
singleworstbefore1993 <- before1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
ECON_EFFECT == max(ECON_EFFECT)) %>%
arrange(BGN_DATE)
kable(singleworstbefore1993, format = "markdown")
| STATE_ID | BGN_DATE | COUNTY | COUNTYNAME | STATE | EVTYPE | FATALITIES | INJURIES | CROP_DAMAGE | PROP_DAMAGE | HEALTH_EFFECT | ECON_EFFECT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 1965-04-11 | 67 | HOWARD | IN | TORNADO | 17 | 560 | 0 | 2.5e+08 | 577 | 2.5e+08 |
| 18 | 1965-04-11 | 53 | GRANT | IN | TORNADO | 8 | 275 | 0 | 2.5e+08 | 283 | 2.5e+08 |
| 26 | 1965-04-11 | 23 | BRANCH | MI | TORNADO | 9 | 200 | 0 | 2.5e+08 | 209 | 2.5e+08 |
| 20 | 1966-06-08 | 177 | SHAWNEE | KS | TORNADO | 16 | 450 | 0 | 2.5e+08 | 466 | 2.5e+08 |
| 48 | 1970-05-11 | 303 | LUBBOCK | TX | TORNADO | 26 | 500 | 0 | 2.5e+08 | 526 | 2.5e+08 |
| 13 | 1973-03-31 | 63 | CLAYTON | GA | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 13 | 1973-03-31 | 297 | WALTON | GA | TORNADO | 1 | 50 | 0 | 2.5e+08 | 51 | 2.5e+08 |
| 13 | 1973-03-31 | 219 | OCONEE | GA | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 13 | 1973-03-31 | 59 | CLARKE | GA | TORNADO | 1 | 50 | 0 | 2.5e+08 | 51 | 2.5e+08 |
| 13 | 1973-03-31 | 221 | OGLETHORPE | GA | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 18 | 1974-04-03 | 123 | PERRY | IN | TORNADO | 2 | 6 | 0 | 2.5e+08 | 8 | 2.5e+08 |
| 18 | 1974-04-03 | 19 | CLARK | IN | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 18 | 1974-04-03 | 7 | BENTON | IN | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 39 | 1974-04-03 | 57 | GREENE | OH | TORNADO | 36 | 1150 | 0 | 2.5e+08 | 1186 | 2.5e+08 |
| 13 | 1975-03-24 | 121 | FULTON | GA | TORNADO | 3 | 152 | 0 | 2.5e+08 | 155 | 2.5e+08 |
| 31 | 1975-05-06 | 55 | DOUGLAS | NE | TORNADO | 3 | 118 | 0 | 2.5e+08 | 121 | 2.5e+08 |
| 22 | 1978-12-03 | 15 | BOSSIER | LA | TORNADO | 2 | 266 | 0 | 2.5e+08 | 268 | 2.5e+08 |
| 48 | 1979-04-10 | 485 | WICHITA | TX | TORNADO | 42 | 1700 | 0 | 2.5e+08 | 1742 | 2.5e+08 |
| 9 | 1979-10-03 | 3 | HARTFORD | CT | TORNADO | 3 | 500 | 0 | 2.5e+08 | 503 | 2.5e+08 |
| 31 | 1980-06-03 | 79 | HALL | NE | TORNADO | 3 | 110 | 0 | 2.5e+08 | 113 | 2.5e+08 |
| 42 | 1980-06-03 | 3 | ALLEGHENY | PA | TORNADO | 0 | 20 | 0 | 2.5e+08 | 20 | 2.5e+08 |
| 42 | 1980-06-03 | 129 | WESTMORELAND | PA | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 42 | 1980-06-03 | 5 | ARMSTRONG | PA | TORNADO | 0 | 120 | 0 | 2.5e+08 | 120 | 2.5e+08 |
| 42 | 1980-07-21 | 39 | CRAWFORD | PA | TORNADO | 0 | 1 | 0 | 2.5e+08 | 1 | 2.5e+08 |
| 48 | 1980-08-10 | 453 | TRAVIS | TX | TORNADO | 0 | 4 | 0 | 2.5e+08 | 4 | 2.5e+08 |
| 40 | 1981-04-19 | 143 | TULSA | OK | TORNADO | 0 | 7 | 0 | 2.5e+08 | 7 | 2.5e+08 |
| 40 | 1982-05-11 | 65 | JACKSON | OK | TORNADO | 0 | 41 | 0 | 2.5e+08 | 41 | 2.5e+08 |
| 17 | 1982-05-29 | 199 | WILLIAMSON | IL | TORNADO | 10 | 181 | 0 | 2.5e+08 | 191 | 2.5e+08 |
| 39 | 1985-05-31 | 133 | PORTAGE | OH | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 39 | 1985-05-31 | 155 | TRUMBULL | OH | TORNADO | 10 | 250 | 0 | 2.5e+08 | 260 | 2.5e+08 |
| 39 | 1985-05-31 | 155 | TRUMBULL | OH | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 19 | 1986-07-28 | 193 | WOODBURY | IA | TORNADO | 0 | 1 | 0 | 2.5e+08 | 1 | 2.5e+08 |
| 19 | 1986-07-28 | 133 | MONONA | IA | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 37 | 1988-11-28 | 183 | WAKE | NC | TORNADO | 2 | 105 | 0 | 2.5e+08 | 107 | 2.5e+08 |
| 9 | 1989-07-10 | 9 | NEW HAVEN | CT | TORNADO | 0 | 40 | 0 | 2.5e+08 | 40 | 2.5e+08 |
| 1 | 1989-11-15 | 89 | MADISON | AL | TORNADO | 21 | 463 | 0 | 2.5e+08 | 484 | 2.5e+08 |
| 1 | 1989-11-15 | 89 | MADISON | AL | TORNADO | 0 | 0 | 0 | 2.5e+08 | 0 | 2.5e+08 |
| 17 | 1990-08-28 | 197 | WILL | IL | TORNADO | 29 | 350 | 0 | 2.5e+08 | 379 | 2.5e+08 |
| 20 | 1991-04-26 | 173 | SEDGWICK | KS | TORNADO | 4 | 75 | 0 | 2.5e+08 | 79 | 2.5e+08 |
| 20 | 1991-04-26 | 15 | BUTLER | KS | TORNADO | 13 | 150 | 0 | 2.5e+08 | 163 | 2.5e+08 |
| 48 | 1992-11-21 | 201 | HARRIS | TX | TORNADO | 0 | 15 | 0 | 2.5e+08 | 15 | 2.5e+08 |
As we can see, it was always tornado that caused the most severe damages in its single occurrences.
maxsinglebefore1993 <- singleworstbefore1993[which.max(singleworstbefore1993$HEALTH_EFFECT), ]
kable(maxsinglebefore1993, format = "markdown")
| STATE_ID | BGN_DATE | COUNTY | COUNTYNAME | STATE | EVTYPE | FATALITIES | INJURIES | CROP_DAMAGE | PROP_DAMAGE | HEALTH_EFFECT | ECON_EFFECT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 48 | 1979-04-10 | 485 | WICHITA | TX | TORNADO | 42 | 1700 | 0 | 2.5e+08 | 1742 | 2.5e+08 |
In particular, the tornado that took place on 1979-04-10 was most harmful both in terms of population health and economic consequences. It caused 42 death, 1700 injuries, $0 crop damage and $250000000 property damage.
# Write a function "maxcumu()" that can be applied to both "before1993" and "after1993" datasets. The nested function "cumueffects()" will later be utilized in plotting "after1993" dataset.
cumueffects <- function(x) {
x %>% group_by(EVTYPE) %>%
summarise(FATALITIES_TOTAL = sum(FATALITIES), INJURIES_TOTAL = sum(INJURIES),
CROP_TOTAL = sum(CROP_DAMAGE), PROP_TOTAL = sum(PROP_DAMAGE),
HEALTH_TOTAL = sum(HEALTH_EFFECT), ECON_TOTAL = sum(ECON_EFFECT))
}
maxcumu <- function(x) {
cumueffects(x) %>% filter(HEALTH_TOTAL == max(HEALTH_TOTAL) | ECON_TOTAL == max(ECON_TOTAL))
}
maxcumubefore1993 <- maxcumu(before1993)
kable(maxcumubefore1993, format = "markdown")
| EVTYPE | FATALITIES_TOTAL | INJURIES_TOTAL | CROP_TOTAL | PROP_TOTAL | HEALTH_TOTAL | ECON_TOTAL |
|---|---|---|---|---|---|---|
| TORNADO | 3123 | 59092 | 0 | 29722198670 | 62215 | 29722198670 |
From the table above we can see that from January 1955 to December 1992, tornado caused 3123 death, 59092 injuries, $0 crop damage and $29722198670 property damage all together. Hence tornado had the most harmful cumulative effect both in terms of population health and economic consequences.
singleworstafter1993 <- after1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
ECON_EFFECT == max(ECON_EFFECT)) %>% arrange(BGN_DATE)
kable(singleworstafter1993, format = "markdown")
| STATE_ID | BGN_DATE | COUNTY | COUNTYNAME | STATE | EVTYPE | FATALITIES | INJURIES | CROP_DAMAGE | PROP_DAMAGE | HEALTH_EFFECT | ECON_EFFECT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 1994-02-08 | 0 | OHZ42>088 | OH | ICE STORM | 1 | 1568 | 5000000 | 5.00e+07 | 1569 | 55000000 |
| 6 | 2006-01-01 | 55 | NAPA | CA | FLOOD | 0 | 0 | 32500000 | 1.15e+11 | 0 | 115032500000 |
In their single occurrences, ice storm was most harmful to the population health, causing 1 death and 1568 injuries on 1994-02-08, and flood caused the most economic damage as $32500000 crop as well as $115000000000 were damaged in a flood on 2006-01-01.
maxcumuafter1993 <- maxcumu(after1993) %>% arrange(EVTYPE)
kable(maxcumuafter1993, format = "markdown")
| EVTYPE | FATALITIES_TOTAL | INJURIES_TOTAL | CROP_TOTAL | PROP_TOTAL | HEALTH_TOTAL | ECON_TOTAL |
|---|---|---|---|---|---|---|
| FLOOD | 508 | 6873 | 10738086050 | 150083407610 | 7381 | 160821493660 |
| TORNADO | 1621 | 23328 | 414962910 | 26343736027 | 24949 | 26758698937 |
From the table above we can see that from January 1993 to November 2011, flood in total caused $10738086050 crop damage and $150083407610 property damage. Flood was most harmful event type with respect to the economy.
On the other hand, tornado in this period caused 1621 death and 23328 injuries, making it the worst event in terms of cumulative effect on the population health.
We are to present the impact of 10 most harmful event types with respect to the public health and the economy in 2 plots. Let’s first take a look at the data excerpt below. This is from the summary of each event type’s cumulative harm in the period from January 1, 1993 to November 30, 2011.
cumueffectsafter1993 <- cumueffects(after1993)
kable(head(cumueffectsafter1993, 3), format = "markdown")
| EVTYPE | FATALITIES_TOTAL | INJURIES_TOTAL | CROP_TOTAL | PROP_TOTAL | HEALTH_TOTAL | ECON_TOTAL |
|---|---|---|---|---|---|---|
| ASTRONOMICAL LOW TIDE | 16 | 38 | 5000000 | 13126150 | 54 | 18126150 |
| AVALANCHE | 225 | 170 | 0 | 4291800 | 395 | 4291800 |
| BLIZZARD | 101 | 805 | 112060000 | 659313950 | 906 | 771373950 |
Relevant columns from the above table are then selected. After the needed data reshaping, 2 plots would be made by ggplot2.
# Select the top 10 population-health-harmful events.
# Use gather() to reshape data.
after1993_healthtop10 <- filter(cumueffectsafter1993, HEALTH_TOTAL %in% tail(sort(HEALTH_TOTAL), 10)) %>%
select(1:3) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_healthtop10)
## # A tibble: 6 × 3
## EVTYPE EFFECT VALUE
## <fctr> <chr> <dbl>
## 1 EXCESSIVE HEAT FATALITIES_TOTAL 2228
## 2 FLASH FLOOD FATALITIES_TOTAL 1035
## 3 FLOOD FATALITIES_TOTAL 508
## 4 HEAT FATALITIES_TOTAL 937
## 5 HIGH WIND FATALITIES_TOTAL 299
## 6 ICE STORM FATALITIES_TOTAL 97
# Plot the top 10 health-impacting event types.
g_health <- ggplot(after1993_healthtop10, aes(x = EVTYPE, y = VALUE, fill = EFFECT)) + geom_col( ) +
labs(title = "Effects of the 10 Event Types Most Harmful to the Population Health",
subtitle = "Data from January 1993 to November 2011",
x = "Event Type", y = "Number of Fatalities and Injuries") +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.title = element_blank()) +
scale_fill_manual(values = c("lightblue3", "lightcoral"),
labels = c("Total Fatalities", "Total Injuries"))
g_health
# Select the top 10 economically consequential events.
after1993_econtop10 <- filter(cumueffectsafter1993, ECON_TOTAL %in% tail(sort(ECON_TOTAL), 10)) %>%
select(c(1, 4, 5)) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_econtop10)
## # A tibble: 6 × 3
## EVTYPE EFFECT VALUE
## <fctr> <chr> <dbl>
## 1 DROUGHT CROP_TOTAL 13972631000
## 2 FLASH FLOOD CROP_TOTAL 1532197150
## 3 FLOOD CROP_TOTAL 10738086050
## 4 HAIL CROP_TOTAL 3026044650
## 5 HURRICANE (TYPHOON) CROP_TOTAL 5506117800
## 6 ICE STORM CROP_TOTAL 5022113500
# Plot the top 10 economy-impacting event types.
g_econ <- ggplot(after1993_econtop10, aes(x = EVTYPE, y = VALUE/1000000000, fill = EFFECT)) + geom_col( ) +
labs(title = "Effects of the 10 Event Types with Greatest Economic Consequences",
subtitle = "Data from January 1993 to November 2011",
x = "Event Type", y = "Crop and Property Damage (in Billions of Dollars)") +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.title = element_blank()) +
scale_fill_manual(values = c("coral", "cornflowerblue"),
labels = c("Total Crop Damage", "Total Property Damage"))
g_econ
In this analysis, we found that: