Synopsis

In this analysis, U.S. storm events data from 1955 to 2011 are explored to show which types of events are most harmful with respect to population health and the economy. The data source is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The bulk of this research is about data cleaning, including standardizing event types, calculating damages, etc. Cleaned data are split into 2 subsets with the cutoff point at January 1st, 1993, before which only 3 types of storm events were recorded. For each storm event, its effects in a single occurrence as well as its cumulative effects over either analyzed period are both assessed.

We found that before 1993 tornado was the most harmful one among the 3 recorded storm event types. After 1993, ice storm, flood and tornado can all be the most harmful one among the 48 standard event types, depending on the criterion selected. The detailed results are presented in the last part of this analysis.



Data Processing

library(rmarkdown)
library(knitr)
library(data.table)
library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)


Step 1: Downloading Raw Data

Download data and check the header (first line of the data file) to determine which variables should be kept for the analysis.

if (!file.exists("stormdata.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2") 
}
table(readLines(bzfile("StormData.csv.bz2"), 1))
## 
## "STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM" 
##                                                                                                                                                                                                                                                                                                                                                                                                            1
# fread() is used instead of read.csv() for reading large dataset.
dt <-fread(sprintf("bzcat %s", "StormData.csv.bz2"), 
           select = c("STATE__", "BGN_DATE", "COUNTY", "COUNTYNAME", "STATE", "EVTYPE", 
                      "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))


Step 2: Data cleaning

(1) Parsing Dates

dt$BGN_DATE <- colsplit(string = dt$BGN_DATE, pattern=" ", names = c("Part1", "Part2"))$Part1
dt <- rename(dt, STATE_ID = STATE__)
dt$BGN_DATE <- mdy(dt$BGN_DATE)

It is stated on the instruction in NOAA official site that from 1950 through 1954, only tornado events were recorded. Since no comparison can be drawn, data before January 1955 are not considered in this analysis.

dt <- dt[dt$BGN_DATE >= ymd(19550101), ]
nrow(dt)
## [1] 900432


(2) Manipulating Crop and Property Damage Data

The following cleaning of crop and property damage data is based on a research of NOAA raw data by David Hood and Eddie Song. The goal is to generate 2 new variables showing the 2 kinds of damages in dollar amount.

# To keep the preliminarily processed dataset "dt" intact, its copy "alldata" is used for the further cleaning.
alldata = dt
table(alldata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 422909      7  11221
table(alldata$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 616548      7     19      1      9     21 281832      1   1994
mapping <- function(x) {
# Numbers need to be dealt with first, otherwise the other regular expressions applied would lead to erroneous result.
    x <- gsub("[0-8]", 10, x)
    
    x <- gsub("\\?", 0, x)
    x <- gsub("^$", 0, x)    
    x <- gsub("\\-", 0, x)
    x <- gsub("\\+", 1, x)
    
    x <- gsub("b", 1000000000, x, ignore.case = T)
    x <- gsub("m", 1000000, x, ignore.case = T)
    x <- gsub("k", 1000, x, ignore.case = T)
    x <- gsub("h", 100, x, ignore.case = T)

    return(as.numeric(x))
}

# Must assign the values back to the original variables.
alldata$PROPDMGEXP <- mapping(alldata$PROPDMGEXP)
alldata$CROPDMGEXP <- mapping(alldata$CROPDMGEXP) 

table(alldata$PROPDMGEXP)
## 
##      0      1     10    100   1000  1e+06  1e+09 
## 465943      5    300      7 422909  11228     40
table(alldata$CROPDMGEXP)
## 
##      0     10   1000  1e+06  1e+09 
## 616555     20 281853   1995      9
# sapply(alldata, class)
alldata <- mutate(alldata, CROP_DAMAGE = CROPDMG * CROPDMGEXP, PROP_DAMAGE = PROPDMG * PROPDMGEXP)
alldata <- select(alldata, -one_of(c("PROPDMG", "PROPDMGEXP", "CROPDMG","CROPDMGEXP")))

Only keep the records that had economic and/or population health consequences.

impactdata <- filter(alldata, FATALITIES !=0 | INJURIES !=0 | CROP_DAMAGE !=0 | PROP_DAMAGE !=0)


(3) Event Types Cleaning

Event types are cleaned base on Storm Data Event Table in the NOAA Storm Data Preparation Document.

# Compile the official event types.
officialtypes <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))

# One entry has EVTYPE == "?", delete it because it's useless for the analysis. (Readers can check the NOAA database to comfirm this.)
impactdata <- filter(impactdata, EVTYPE != "?")
nrow(impactdata)
## [1] 252975
# Transform all EVTYPEs to uppercase strings. 
impactdata$EVTYPE <- toupper(impactdata$EVTYPE)

Create a “matched” column in “impactdata” indicating whether each observation’s event type is already absolutely precise. Then split the dataset based on it.

impactdata$matched <- impactdata$EVTYPE %in% officialtypes
eventsclean <- impactdata[impactdata$matched == 1, ]
eventsdirty <- impactdata[impactdata$matched == 0, ]
print(c(oldclean <-nrow(eventsclean), olddirty <- nrow(eventsdirty)))
## [1] 171256  81719
i) Preliminary Manual Processing

For records whose event type does not exactly match any of the official terms, we first correct the apparent typos.

# Punctuation cleaning
eventsdirty$EVTYPE <- gsub("[\\. \\, \\( \\)]", " ", eventsdirty$EVTYPE)

# Replace multiple spaces with single space.
eventsdirty$EVTYPE <- gsub(" +", " ", eventsdirty$EVTYPE)

# Delete space at the beginning or the end of each string.
eventsdirty$EVTYPE <- gsub("^\\s* | \\s*$", "", eventsdirty$EVTYPE)

# Delete all numbers.
eventsdirty$EVTYPE <- gsub("[0-9]", "", eventsdirty$EVTYPE)

Type-specific regular expressions are then used to manually clean entries

  • whose event type has a high repetition among all the observations, and
  • whose exact event type can be easily inferred.
head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)
## 
##            TSTM WIND   THUNDERSTORM WINDS URBAN/SML STREAM FLD 
##                63237                12057                  702 
##           HIGH WINDS       TSTM WIND/HAIL     WILD/FOREST FIRE 
##                  657                  441                  388 
##       FLASH FLOODING    FLOOD/FLASH FLOOD         RIP CURRENTS 
##                  301                  279                  241 
##         EXTREME COLD            LANDSLIDE          STORM SURGE 
##                  199                  193                  177 
##           LIGHT SNOW          URBAN FLOOD   WINTER WEATHER/MIX 
##                  141                  139                  139 
##            HURRICANE     MARINE TSTM WIND                  FOG 
##                  129                  109                  107 
##          RIVER FLOOD                 WIND 
##                  107                   84

The majority of event-type-imprecise data can be easily determined as thunderstorm wind data.
If a observation’s EVTYPE string contains “TSTM” or “THUNDERSTORM” but not “MARINE”, it is considered as a thunderstorm wind event. This creates bias toward the effect of thunderstorm wind, but it significantly decreases the workload for further event type matching.

# (?<!...) means negative lookbehind. Must use pert = T argument.
eventsdirty$EVTYPE[grepl("(?<!MARINE )TSTM|THUNDERSTORM", eventsdirty$EVTYPE, perl = T)] <- "THUNDERSTORM WIND"
# Marine Thunderstorm Wind is a unique type that should not be classified as plain thunderstrom wind.
eventsdirty$EVTYPE[eventsdirty$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"

Next, some of the flood-related data are manipulated.

eventsdirty$EVTYPE[eventsdirty$EVTYPE == "URBAN/SML STREAM FLD"] <- "FLOOD"
eventsdirty$EVTYPE <- gsub("FLOODING", "FLOOD", eventsdirty$EVTYPE)
head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)
## 
##        THUNDERSTORM WIND                    FLOOD               HIGH WINDS 
##                    75979                      760                      657 
##         WILD/FOREST FIRE              FLASH FLOOD        FLOOD/FLASH FLOOD 
##                      388                      302                      279 
##             RIP CURRENTS              URBAN FLOOD             EXTREME COLD 
##                      241                      219                      199 
##                LANDSLIDE              STORM SURGE               LIGHT SNOW 
##                      193                      177                      141 
##       WINTER WEATHER/MIX                HURRICANE              RIVER FLOOD 
##                      139                      129                      125 
## MARINE THUNDERSTORM WIND                      FOG                     WIND 
##                      109                      107                       84 
##           DRY MICROBURST        HURRICANE/TYPHOON 
##                       78                       72

Now we can shift the just refined data in “eventsdirty” to the “eventsclean” dataset.

eventsclean <- rbind(eventsdirty[eventsdirty$EVTYPE %in% officialtypes, ], eventsclean)
eventsdirty <- filter(eventsdirty, !(EVTYPE %in% officialtypes))

print(c(newclean <- nrow(eventsclean), newdirty <- nrow(eventsdirty)))
## [1] 248468   4507
types_before_stringdist <- length(unique(eventsdirty$EVTYPE))
print(paste("Number of unique unidentified event types:", types_before_stringdist))
## [1] "Number of unique unidentified event types: 307"

The number of data in need of further manipulation has shrunk from 81719 to 4507.

ii) Approximate String Matching by adist() Function

What we need to do next is to find the most likely event types for the 4507 records (307 unique uncleaned event types) using approximate string matching.
A distance matrix is created to show the generalized Levenshtein (edit) distance between the official event types vector and the EVTYPE column in the “eventsdirty” dataset. For each observation, we find an offical event type whose distance to that observation is the shortest among all the official types. Then this offical event type is assumed as the true event type of that observation. The results are stored in a new column “ESTTYPE” (estimated type) in dataset “eventsdirty”.

# Must set partial = T to enable matching of substrings of each EVTYPE value!!!
distmatrix = as.data.frame(adist(eventsdirty$EVTYPE, officialtypes, partial = T))

# The column names of the matrix should be the 48 official event types.
names(distmatrix) <- officialtypes
eventsdirty$ESTTYPE <- NA

for (i in 1:nrow(eventsdirty)) {
    # Find out which official type(s) is/are most plausible.
    candis <- colnames(distmatrix)[which.min(distmatrix[i, 1:48])]
    # In case there are ties between multiple types, only use the first type matched.
    eventsdirty$ESTTYPE[i] <- candis[1]
}
iii) Manual Double-check

The algorithm applied in adist() function does not necessarily guarantee the most probable true event type. Any little error in the identification of the real event type may lead to serious mistake in the subsequent data analysis part. Therefore 2 groups of most important observations in the “eventsdirty” set are filtered out for double check. The 2 groups are those

  • whose “EVTYPE” is among the 10 most frequent “EVTYPE”s, or

  • whose effect in terms of fatalities, injuries, crop damage or property damage ranks in the top 1‰.

The “EVTYPE” column of the 2 groups of data is selected and juxtaposed with its “ESTTYPE” counterpart.

For the first group:

# Show the 10 types of most frequently happened events.
print(mostfreq <- head(sort(table(eventsdirty$EVTYPE), decreasing = T), 10))
## 
##         HIGH WINDS   WILD/FOREST FIRE  FLOOD/FLASH FLOOD 
##                657                388                279 
##       RIP CURRENTS        URBAN FLOOD       EXTREME COLD 
##                241                219                199 
##          LANDSLIDE        STORM SURGE         LIGHT SNOW 
##                193                177                141 
## WINTER WEATHER/MIX 
##                139
mostfreqnames <- names(mostfreq)
highfreqgroup <- eventsdirty[eventsdirty$EVTYPE %in% mostfreqnames, c("EVTYPE", "ESTTYPE")]

# Only show the unique entries for comparison.
highfreqgroup[!duplicated(highfreqgroup), ]
##                  EVTYPE                 ESTTYPE
## 7            HIGH WINDS               HIGH WIND
## 17         EXTREME COLD EXTREME COLD/WIND CHILL
## 102   FLOOD/FLASH FLOOD             FLASH FLOOD
## 156         URBAN FLOOD           COASTAL FLOOD
## 189         STORM SURGE        STORM SURGE/TIDE
## 1114       RIP CURRENTS             RIP CURRENT
## 1459   WILD/FOREST FIRE                WILDFIRE
## 1660          LANDSLIDE   ASTRONOMICAL LOW TIDE
## 2118         LIGHT SNOW        LAKE-EFFECT SNOW
## 3964 WINTER WEATHER/MIX          WINTER WEATHER

As we can see, the estimated true event types are mostly correct. But 3 of the 10 most frequent events are incorrectly identified. They are “URBAN FLOOD”, “LANDSLIDE” and “LIGHT SNOW”.

For the second group:

toprankgroup <- eventsdirty %>% filter(FATALITIES > quantile(FATALITIES, 0.999) |
                                       INJURIES > quantile(INJURIES, 0.999) |
                                       CROP_DAMAGE > quantile(CROP_DAMAGE, 0.999) |
                                       PROP_DAMAGE > quantile(PROP_DAMAGE, 0.999)) %>%
                                select(EVTYPE, ESTTYPE)

# Only show the unique entries.
toprankgroup[!duplicated(toprankgroup), ]
##                       EVTYPE                  ESTTYPE
## 1                 WILD FIRES                 WILDFIRE
## 2                RIVER FLOOD          LAKESHORE FLOOD
## 3               EXTREME COLD  EXTREME COLD/WIND CHILL
## 4  UNSEASONABLY WARM AND DRY MARINE THUNDERSTORM WIND
## 5                  HEAT WAVE           EXCESSIVE HEAT
## 8               EXTREME HEAT  EXTREME COLD/WIND CHILL
## 10                 HURRICANE      HURRICANE (TYPHOON)
## 11         HURRICANE/TYPHOON      HURRICANE (TYPHOON)
## 16               STORM SURGE         STORM SURGE/TIDE

In the most influential group, there’re also 3 types of events mis-identified, namely “RIVER FLOOD”, “UNSEASONABLY WARM AND DRY” and “EXTREME HEAT”.

These event type estimates are to be manually rectified.

# Write a function to save some key strokes because eventsdirty$COLUMNNAME is too long to type.
manualreplace <- function(ev, est) {
    # From high frequency group:
    est[ev == "URBAN FLOOD"] <- "FLOOD"
    est[ev == "LANDSLIDE"] <- "DEBRIS FLOW"
    est[ev == "LIGHT SNOW"] <- "HEAVY SNOW"
    
    # From most influential group:
    est[ev == "RIVER FLOOD"] <- "FLOOD"
    est[ev == "UNSEASONABLY WARM AND DRY" | ev == "EXTREME HEAT"] <- "EXCESSIVE HEAT"

    return (est)
}

eventsdirty$ESTTYPE <- manualreplace(eventsdirty$EVTYPE, eventsdirty$ESTTYPE)

So far all the event type refining is completed. For the 4507 records that went through approximate string matching, some of their mapping could still be erroneous as long as they did not undergo manual check. But considering that the 4507 records take up only 1.78% of the final dataset, no more resource in this analysis would be devoted to this matter.


(4) Joining Data

The “eventsclean” and “eventsdirty” datasets are to be combined together for the data analysis in the next section.

dirtypart <- eventsdirty %>% select(-EVTYPE, -matched) %>% rename(EVTYPE = ESTTYPE)
cleanpart <- select(eventsclean, - matched)
refineddata <- rbind(cleanpart, dirtypart)
nrow(refineddata)
## [1] 252975
sapply(refineddata, class)
##    STATE_ID    BGN_DATE      COUNTY  COUNTYNAME       STATE      EVTYPE 
##   "numeric"      "Date"   "numeric" "character" "character" "character" 
##  FATALITIES    INJURIES CROP_DAMAGE PROP_DAMAGE 
##   "numeric"   "numeric"   "numeric"   "numeric"
# Alter the class of some columns.
refineddata[c(1, 3:6)] <- lapply(refineddata[c(1, 3:6)], factor)



Data Analysis

Extreme Events Definition and Data Split

Events most harmful with respect to population health are defined as those who:
  1. caused greatest injury and fatality in one single occurrence, or
  2. happened frequently enough to have caused the greatest cumulative injury and fatality during a period of time.
Events having greatest economic consequences are defined as those who:
  1. caused the greatest of crop or property damage in one single occurrence, or
  2. happened frequently enough to have caused the greatest cummulative crop or property damage during a period of time.

We add 2 new variables “HEALTH_TOTAL” and “ECON_TOTAL” to the dataset “refineddata” storing for each event occurrence its total effect on population health and on economy.

refineddata <- refineddata %>% mutate(HEALTH_EFFECT = FATALITIES + INJURIES, 
                                      ECON_EFFECT = CROP_DAMAGE + PROP_DAMAGE)

From NOAA Storm Events Database Collection Source Clarification we can see that:

  • From 1955 through 1992, only tornado, thunderstorm wind and hail events were recorded.

  • From January 1993 through November 2011, NOAA recorded all of the 48 types of storm events.

Therefore, data will be split into 2 subsets accordingly and analyses would be carried out on each of them.

One subset (“before1993”) includes all the data in between 1955 and 1992 and a comparison will be drawn between tornado, thunderstorm wind and hail. Data in the period from 1993 to 2011 are put in another subset (“after1993”) to make a more complete comparison between all the 48 types of events. (Data before 1955 are already deleted in step 1 of data processing section above.)

before1993 = subset(refineddata, BGN_DATE >= ymd(19550101) & BGN_DATE < ymd(19930101))
after1993 = subset(refineddata, BGN_DATE >= ymd(19930101))


Exploration of Data BEFORE Year 1993

(1) Most Influential Events in Their Single Occurrences

singleworstbefore1993 <- before1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
                                               ECON_EFFECT == max(ECON_EFFECT)) %>%
                         arrange(BGN_DATE)

kable(singleworstbefore1993, format = "markdown")
STATE_ID BGN_DATE COUNTY COUNTYNAME STATE EVTYPE FATALITIES INJURIES CROP_DAMAGE PROP_DAMAGE HEALTH_EFFECT ECON_EFFECT
18 1965-04-11 67 HOWARD IN TORNADO 17 560 0 2.5e+08 577 2.5e+08
18 1965-04-11 53 GRANT IN TORNADO 8 275 0 2.5e+08 283 2.5e+08
26 1965-04-11 23 BRANCH MI TORNADO 9 200 0 2.5e+08 209 2.5e+08
20 1966-06-08 177 SHAWNEE KS TORNADO 16 450 0 2.5e+08 466 2.5e+08
48 1970-05-11 303 LUBBOCK TX TORNADO 26 500 0 2.5e+08 526 2.5e+08
13 1973-03-31 63 CLAYTON GA TORNADO 0 0 0 2.5e+08 0 2.5e+08
13 1973-03-31 297 WALTON GA TORNADO 1 50 0 2.5e+08 51 2.5e+08
13 1973-03-31 219 OCONEE GA TORNADO 0 0 0 2.5e+08 0 2.5e+08
13 1973-03-31 59 CLARKE GA TORNADO 1 50 0 2.5e+08 51 2.5e+08
13 1973-03-31 221 OGLETHORPE GA TORNADO 0 0 0 2.5e+08 0 2.5e+08
18 1974-04-03 123 PERRY IN TORNADO 2 6 0 2.5e+08 8 2.5e+08
18 1974-04-03 19 CLARK IN TORNADO 0 0 0 2.5e+08 0 2.5e+08
18 1974-04-03 7 BENTON IN TORNADO 0 0 0 2.5e+08 0 2.5e+08
39 1974-04-03 57 GREENE OH TORNADO 36 1150 0 2.5e+08 1186 2.5e+08
13 1975-03-24 121 FULTON GA TORNADO 3 152 0 2.5e+08 155 2.5e+08
31 1975-05-06 55 DOUGLAS NE TORNADO 3 118 0 2.5e+08 121 2.5e+08
22 1978-12-03 15 BOSSIER LA TORNADO 2 266 0 2.5e+08 268 2.5e+08
48 1979-04-10 485 WICHITA TX TORNADO 42 1700 0 2.5e+08 1742 2.5e+08
9 1979-10-03 3 HARTFORD CT TORNADO 3 500 0 2.5e+08 503 2.5e+08
31 1980-06-03 79 HALL NE TORNADO 3 110 0 2.5e+08 113 2.5e+08
42 1980-06-03 3 ALLEGHENY PA TORNADO 0 20 0 2.5e+08 20 2.5e+08
42 1980-06-03 129 WESTMORELAND PA TORNADO 0 0 0 2.5e+08 0 2.5e+08
42 1980-06-03 5 ARMSTRONG PA TORNADO 0 120 0 2.5e+08 120 2.5e+08
42 1980-07-21 39 CRAWFORD PA TORNADO 0 1 0 2.5e+08 1 2.5e+08
48 1980-08-10 453 TRAVIS TX TORNADO 0 4 0 2.5e+08 4 2.5e+08
40 1981-04-19 143 TULSA OK TORNADO 0 7 0 2.5e+08 7 2.5e+08
40 1982-05-11 65 JACKSON OK TORNADO 0 41 0 2.5e+08 41 2.5e+08
17 1982-05-29 199 WILLIAMSON IL TORNADO 10 181 0 2.5e+08 191 2.5e+08
39 1985-05-31 133 PORTAGE OH TORNADO 0 0 0 2.5e+08 0 2.5e+08
39 1985-05-31 155 TRUMBULL OH TORNADO 10 250 0 2.5e+08 260 2.5e+08
39 1985-05-31 155 TRUMBULL OH TORNADO 0 0 0 2.5e+08 0 2.5e+08
19 1986-07-28 193 WOODBURY IA TORNADO 0 1 0 2.5e+08 1 2.5e+08
19 1986-07-28 133 MONONA IA TORNADO 0 0 0 2.5e+08 0 2.5e+08
37 1988-11-28 183 WAKE NC TORNADO 2 105 0 2.5e+08 107 2.5e+08
9 1989-07-10 9 NEW HAVEN CT TORNADO 0 40 0 2.5e+08 40 2.5e+08
1 1989-11-15 89 MADISON AL TORNADO 21 463 0 2.5e+08 484 2.5e+08
1 1989-11-15 89 MADISON AL TORNADO 0 0 0 2.5e+08 0 2.5e+08
17 1990-08-28 197 WILL IL TORNADO 29 350 0 2.5e+08 379 2.5e+08
20 1991-04-26 173 SEDGWICK KS TORNADO 4 75 0 2.5e+08 79 2.5e+08
20 1991-04-26 15 BUTLER KS TORNADO 13 150 0 2.5e+08 163 2.5e+08
48 1992-11-21 201 HARRIS TX TORNADO 0 15 0 2.5e+08 15 2.5e+08

As we can see, it was always tornado that caused the most severe damages in its single occurrences.

maxsinglebefore1993 <- singleworstbefore1993[which.max(singleworstbefore1993$HEALTH_EFFECT), ]
kable(maxsinglebefore1993, format = "markdown")
STATE_ID BGN_DATE COUNTY COUNTYNAME STATE EVTYPE FATALITIES INJURIES CROP_DAMAGE PROP_DAMAGE HEALTH_EFFECT ECON_EFFECT
18 48 1979-04-10 485 WICHITA TX TORNADO 42 1700 0 2.5e+08 1742 2.5e+08

In particular, the tornado that took place on 1979-04-10 was most harmful both in terms of population health and economic consequences. It caused 42 death, 1700 injuries, $0 crop damage and $250000000 property damage.


(2) Events Whose Cumulative Influence is the Most Significant

# Write a function "maxcumu()" that can be applied to both "before1993" and "after1993" datasets. The nested function "cumueffects()" will later be utilized in plotting "after1993" dataset. 

cumueffects <- function(x) {
    x %>% group_by(EVTYPE) %>% 
          summarise(FATALITIES_TOTAL = sum(FATALITIES), INJURIES_TOTAL = sum(INJURIES),
                    CROP_TOTAL = sum(CROP_DAMAGE), PROP_TOTAL = sum(PROP_DAMAGE),
                    HEALTH_TOTAL = sum(HEALTH_EFFECT), ECON_TOTAL = sum(ECON_EFFECT))
}

maxcumu <- function(x) {
    cumueffects(x) %>% filter(HEALTH_TOTAL == max(HEALTH_TOTAL) | ECON_TOTAL == max(ECON_TOTAL))
}

maxcumubefore1993 <- maxcumu(before1993)

kable(maxcumubefore1993, format = "markdown")
EVTYPE FATALITIES_TOTAL INJURIES_TOTAL CROP_TOTAL PROP_TOTAL HEALTH_TOTAL ECON_TOTAL
TORNADO 3123 59092 0 29722198670 62215 29722198670

From the table above we can see that from January 1955 to December 1992, tornado caused 3123 death, 59092 injuries, $0 crop damage and $29722198670 property damage all together. Hence tornado had the most harmful cumulative effect both in terms of population health and economic consequences.


Exploration of Data AFTER Year 1993

(1) Most Influential Events in Their Single Occurrences

singleworstafter1993 <- after1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
                                             ECON_EFFECT == max(ECON_EFFECT)) %>% arrange(BGN_DATE)

kable(singleworstafter1993, format = "markdown")
STATE_ID BGN_DATE COUNTY COUNTYNAME STATE EVTYPE FATALITIES INJURIES CROP_DAMAGE PROP_DAMAGE HEALTH_EFFECT ECON_EFFECT
39 1994-02-08 0 OHZ42>088 OH ICE STORM 1 1568 5000000 5.00e+07 1569 55000000
6 2006-01-01 55 NAPA CA FLOOD 0 0 32500000 1.15e+11 0 115032500000

In their single occurrences, ice storm was most harmful to the population health, causing 1 death and 1568 injuries on 1994-02-08, and flood caused the most economic damage as $32500000 crop as well as $115000000000 were damaged in a flood on 2006-01-01.


(2) Events Whose Cumulative Influence is the Most Significant

maxcumuafter1993 <- maxcumu(after1993) %>% arrange(EVTYPE)

kable(maxcumuafter1993, format = "markdown")
EVTYPE FATALITIES_TOTAL INJURIES_TOTAL CROP_TOTAL PROP_TOTAL HEALTH_TOTAL ECON_TOTAL
FLOOD 508 6873 10738086050 150083407610 7381 160821493660
TORNADO 1621 23328 414962910 26343736027 24949 26758698937

From the table above we can see that from January 1993 to November 2011, flood in total caused $10738086050 crop damage and $150083407610 property damage. Flood was most harmful event type with respect to the economy.

On the other hand, tornado in this period caused 1621 death and 23328 injuries, making it the worst event in terms of cumulative effect on the population health.


(3) Plotting the Facts

We are to present the impact of 10 most harmful event types with respect to the public health and the economy in 2 plots. Let’s first take a look at the data excerpt below. This is from the summary of each event type’s cumulative harm in the period from January 1, 1993 to November 30, 2011.

cumueffectsafter1993 <- cumueffects(after1993)
kable(head(cumueffectsafter1993, 3), format = "markdown")
EVTYPE FATALITIES_TOTAL INJURIES_TOTAL CROP_TOTAL PROP_TOTAL HEALTH_TOTAL ECON_TOTAL
ASTRONOMICAL LOW TIDE 16 38 5000000 13126150 54 18126150
AVALANCHE 225 170 0 4291800 395 4291800
BLIZZARD 101 805 112060000 659313950 906 771373950

Relevant columns from the above table are then selected. After the needed data reshaping, 2 plots would be made by ggplot2.

# Select the top 10 population-health-harmful events.
# Use gather() to reshape data.
after1993_healthtop10 <- filter(cumueffectsafter1993, HEALTH_TOTAL %in% tail(sort(HEALTH_TOTAL), 10)) %>% 
                         select(1:3) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_healthtop10)
## # A tibble: 6 × 3
##           EVTYPE           EFFECT VALUE
##           <fctr>            <chr> <dbl>
## 1 EXCESSIVE HEAT FATALITIES_TOTAL  2228
## 2    FLASH FLOOD FATALITIES_TOTAL  1035
## 3          FLOOD FATALITIES_TOTAL   508
## 4           HEAT FATALITIES_TOTAL   937
## 5      HIGH WIND FATALITIES_TOTAL   299
## 6      ICE STORM FATALITIES_TOTAL    97
# Plot the top 10 health-impacting event types.
g_health <- ggplot(after1993_healthtop10, aes(x = EVTYPE, y = VALUE, fill = EFFECT)) + geom_col( ) +
            labs(title = "Effects of the 10 Event Types Most Harmful to the Population Health", 
                 subtitle = "Data from January 1993 to November 2011",
                 x = "Event Type", y = "Number of Fatalities and Injuries") +
            theme(axis.text.x = element_text(angle = 30, hjust = 1), 
                  legend.title = element_blank()) +
            scale_fill_manual(values = c("lightblue3", "lightcoral"),
                              labels = c("Total Fatalities", "Total Injuries"))

g_health 

# Select the top 10 economically consequential events.
after1993_econtop10 <- filter(cumueffectsafter1993, ECON_TOTAL %in% tail(sort(ECON_TOTAL), 10)) %>% 
                         select(c(1, 4, 5)) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_econtop10)
## # A tibble: 6 × 3
##                EVTYPE     EFFECT       VALUE
##                <fctr>      <chr>       <dbl>
## 1             DROUGHT CROP_TOTAL 13972631000
## 2         FLASH FLOOD CROP_TOTAL  1532197150
## 3               FLOOD CROP_TOTAL 10738086050
## 4                HAIL CROP_TOTAL  3026044650
## 5 HURRICANE (TYPHOON) CROP_TOTAL  5506117800
## 6           ICE STORM CROP_TOTAL  5022113500
# Plot the top 10 economy-impacting event types.
g_econ <- ggplot(after1993_econtop10, aes(x = EVTYPE, y = VALUE/1000000000, fill = EFFECT)) + geom_col( ) +
          labs(title = "Effects of the 10 Event Types with Greatest Economic Consequences", 
               subtitle = "Data from January 1993 to November 2011",
               x = "Event Type", y = "Crop and Property Damage (in Billions of Dollars)") +
          theme(axis.text.x = element_text(angle = 30, hjust = 1),
                legend.title = element_blank()) +
          scale_fill_manual(values = c("coral", "cornflowerblue"),
                            labels = c("Total Crop Damage", "Total Property Damage"))
    
g_econ



Results

In this analysis, we found that:

  1. From January 1, 1955 to December 31, 1992:
  1. From January 1, 1993 to November 30, 2011: