Synopsis

In this analysis, U.S. storm events data from 1955 to 2011 are explored to show which types of events are most harmful with respect to population health and the economy. The data source is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The bulk of this research is about data cleaning, including standardizing event types, calculating damages, etc. Cleaned data are split into 2 subsets with the cutoff point at January 1st, 1993, before which only 3 types of storm events were recorded. For each storm event, its effects in a single occurrence as well as its cumulative effects over either analyzed period are both assessed.

We found that before 1993 tornado was the most harmful one among the 3 recorded storm event types. After 1993, ice storm, flood and tornado can all be the most harmful one among the 48 standard event types, depending on the criterion selected. The detailed results are presented in the last part of this analysis.

Data Processing

library(rmarkdown)
library(knitr)
library(data.table)
library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)

Step 1: Downloading Raw Data

Download data and check the header (first line of the data file) to determine which variables should be kept for the analysis.

if (!file.exists("stormdata.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2") 
}
table(readLines(bzfile("StormData.csv.bz2"), 1))

## 
## "STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM" 
##                                                                                                                                                                                                                                                                                                                                                                                                            1

# fread() is used instead of read.csv() for reading large dataset.
dt <-fread(sprintf("bzcat %s", "StormData.csv.bz2"), 
           select = c("STATE__", "BGN_DATE", "COUNTY", "COUNTYNAME", "STATE", "EVTYPE", 
                      "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

Step 2: Data cleaning

(1) Parsing Dates

dt$BGN_DATE <- colsplit(string = dt$BGN_DATE, pattern=" ", names = c("Part1", "Part2"))$Part1
dt <- rename(dt, STATE_ID = STATE__)
dt$BGN_DATE <- mdy(dt$BGN_DATE)

It is stated on the instruction in NOAA official site that from 1950 through 1954, only tornado events were recorded. Since no comparison can be drawn, data before January 1955 are not considered in this analysis.

dt <- dt[dt$BGN_DATE >= ymd(19550101), ]
nrow(dt)

## [1] 900432

(2) Manipulating Crop and Property Damage Data

The following cleaning of crop and property damage data is based on a research of NOAA raw data by David Hood and Eddie Song. The goal is to generate 2 new variables showing the 2 kinds of damages in dollar amount.

# To keep the preliminarily processed dataset "dt" intact, its copy "alldata" is used for the further cleaning.
alldata = dt
table(alldata$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 422909      7  11221

table(alldata$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 616548      7     19      1      9     21 281832      1   1994

mapping <- function(x) {
# Numbers need to be dealt with first, otherwise the other regular expressions applied would lead to erroneous result.
    x <- gsub("[0-8]", 10, x)
    
    x <- gsub("\\?", 0, x)
    x <- gsub("^$", 0, x)    
    x <- gsub("\\-", 0, x)
    x <- gsub("\\+", 1, x)
    
    x <- gsub("b", 1000000000, x, ignore.case = T)
    x <- gsub("m", 1000000, x, ignore.case = T)
    x <- gsub("k", 1000, x, ignore.case = T)
    x <- gsub("h", 100, x, ignore.case = T)

    return(as.numeric(x))
}

# Must assign the values back to the original variables.
alldata$PROPDMGEXP <- mapping(alldata$PROPDMGEXP)
alldata$CROPDMGEXP <- mapping(alldata$CROPDMGEXP) 

table(alldata$PROPDMGEXP)

## 
##      0      1     10    100   1000  1e+06  1e+09 
## 465943      5    300      7 422909  11228     40

table(alldata$CROPDMGEXP)

## 
##      0     10   1000  1e+06  1e+09 
## 616555     20 281853   1995      9

# sapply(alldata, class)
alldata <- mutate(alldata, CROP_DAMAGE = CROPDMG * CROPDMGEXP, PROP_DAMAGE = PROPDMG * PROPDMGEXP)
alldata <- select(alldata, -one_of(c("PROPDMG", "PROPDMGEXP", "CROPDMG","CROPDMGEXP")))

Only keep the records that had economic and/or population health consequences.

impactdata <- filter(alldata, FATALITIES !=0 | INJURIES !=0 | CROP_DAMAGE !=0 | PROP_DAMAGE !=0)

(3) Event Types Cleaning

Event types are cleaned base on Storm Data Event Table in the NOAA Storm Data Preparation Document.

# Compile the official event types.
officialtypes <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))

# One entry has EVTYPE == "?", delete it because it's useless for the analysis. (Readers can check the NOAA database to comfirm this.)
impactdata <- filter(impactdata, EVTYPE != "?")
nrow(impactdata)

## [1] 252975

# Transform all EVTYPEs to uppercase strings. 
impactdata$EVTYPE <- toupper(impactdata$EVTYPE)

Create a “matched” column in “impactdata” indicating whether each observation’s event type is already absolutely precise. Then split the dataset based on it.

impactdata$matched <- impactdata$EVTYPE %in% officialtypes
eventsclean <- impactdata[impactdata$matched == 1, ]
eventsdirty <- impactdata[impactdata$matched == 0, ]
print(c(oldclean <-nrow(eventsclean), olddirty <- nrow(eventsdirty)))

## [1] 171256  81719

i) Preliminary Manual Processing

For records whose event type does not exactly match any of the official terms, we first correct the apparent typos.

# Punctuation cleaning
eventsdirty$EVTYPE <- gsub("[\\. \\, \\( \\)]", " ", eventsdirty$EVTYPE)

# Replace multiple spaces with single space.
eventsdirty$EVTYPE <- gsub(" +", " ", eventsdirty$EVTYPE)

# Delete space at the beginning or the end of each string.
eventsdirty$EVTYPE <- gsub("^\\s* | \\s*$", "", eventsdirty$EVTYPE)

# Delete all numbers.
eventsdirty$EVTYPE <- gsub("[0-9]", "", eventsdirty$EVTYPE)

Type-specific regular expressions are then used to manually clean entries

whose event type has a high repetition among all the observations, and
whose exact event type can be easily inferred.

head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)

## 
##            TSTM WIND   THUNDERSTORM WINDS URBAN/SML STREAM FLD 
##                63237                12057                  702 
##           HIGH WINDS       TSTM WIND/HAIL     WILD/FOREST FIRE 
##                  657                  441                  388 
##       FLASH FLOODING    FLOOD/FLASH FLOOD         RIP CURRENTS 
##                  301                  279                  241 
##         EXTREME COLD            LANDSLIDE          STORM SURGE 
##                  199                  193                  177 
##           LIGHT SNOW          URBAN FLOOD   WINTER WEATHER/MIX 
##                  141                  139                  139 
##            HURRICANE     MARINE TSTM WIND                  FOG 
##                  129                  109                  107 
##          RIVER FLOOD                 WIND 
##                  107                   84

The majority of event-type-imprecise data can be easily determined as thunderstorm wind data.
If a observation’s EVTYPE string contains “TSTM” or “THUNDERSTORM” but not “MARINE”, it is considered as a thunderstorm wind event. This creates bias toward the effect of thunderstorm wind, but it significantly decreases the workload for further event type matching.

# (?<!...) means negative lookbehind. Must use pert = T argument.
eventsdirty$EVTYPE[grepl("(?<!MARINE )TSTM|THUNDERSTORM", eventsdirty$EVTYPE, perl = T)] <- "THUNDERSTORM WIND"
# Marine Thunderstorm Wind is a unique type that should not be classified as plain thunderstrom wind.
eventsdirty$EVTYPE[eventsdirty$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"

Next, some of the flood-related data are manipulated.

eventsdirty$EVTYPE[eventsdirty$EVTYPE == "URBAN/SML STREAM FLD"] <- "FLOOD"
eventsdirty$EVTYPE <- gsub("FLOODING", "FLOOD", eventsdirty$EVTYPE)
head(sort(table(eventsdirty$EVTYPE), decreasing = T), 20)

## 
##        THUNDERSTORM WIND                    FLOOD               HIGH WINDS 
##                    75979                      760                      657 
##         WILD/FOREST FIRE              FLASH FLOOD        FLOOD/FLASH FLOOD 
##                      388                      302                      279 
##             RIP CURRENTS              URBAN FLOOD             EXTREME COLD 
##                      241                      219                      199 
##                LANDSLIDE              STORM SURGE               LIGHT SNOW 
##                      193                      177                      141 
##       WINTER WEATHER/MIX                HURRICANE              RIVER FLOOD 
##                      139                      129                      125 
## MARINE THUNDERSTORM WIND                      FOG                     WIND 
##                      109                      107                       84 
##           DRY MICROBURST        HURRICANE/TYPHOON 
##                       78                       72

Now we can shift the just refined data in “eventsdirty” to the “eventsclean” dataset.

eventsclean <- rbind(eventsdirty[eventsdirty$EVTYPE %in% officialtypes, ], eventsclean)
eventsdirty <- filter(eventsdirty, !(EVTYPE %in% officialtypes))

print(c(newclean <- nrow(eventsclean), newdirty <- nrow(eventsdirty)))

## [1] 248468   4507

types_before_stringdist <- length(unique(eventsdirty$EVTYPE))
print(paste("Number of unique unidentified event types:", types_before_stringdist))

## [1] "Number of unique unidentified event types: 307"

The number of data in need of further manipulation has shrunk from 81719 to 4507.

ii) Approximate String Matching by adist() Function

What we need to do next is to find the most likely event types for the 4507 records (307 unique uncleaned event types) using approximate string matching.
A distance matrix is created to show the generalized Levenshtein (edit) distance between the official event types vector and the EVTYPE column in the “eventsdirty” dataset. For each observation, we find an offical event type whose distance to that observation is the shortest among all the official types. Then this offical event type is assumed as the true event type of that observation. The results are stored in a new column “ESTTYPE” (estimated type) in dataset “eventsdirty”.

# Must set partial = T to enable matching of substrings of each EVTYPE value!!!
distmatrix = as.data.frame(adist(eventsdirty$EVTYPE, officialtypes, partial = T))

# The column names of the matrix should be the 48 official event types.
names(distmatrix) <- officialtypes
eventsdirty$ESTTYPE <- NA

for (i in 1:nrow(eventsdirty)) {
    # Find out which official type(s) is/are most plausible.
    candis <- colnames(distmatrix)[which.min(distmatrix[i, 1:48])]
    # In case there are ties between multiple types, only use the first type matched.
    eventsdirty$ESTTYPE[i] <- candis[1]
}

iii) Manual Double-check

The algorithm applied in adist() function does not necessarily guarantee the most probable true event type. Any little error in the identification of the real event type may lead to serious mistake in the subsequent data analysis part. Therefore 2 groups of most important observations in the “eventsdirty” set are filtered out for double check. The 2 groups are those

whose “EVTYPE” is among the 10 most frequent “EVTYPE”s, or
whose effect in terms of fatalities, injuries, crop damage or property damage ranks in the top 1‰.

The “EVTYPE” column of the 2 groups of data is selected and juxtaposed with its “ESTTYPE” counterpart.

For the first group:

# Show the 10 types of most frequently happened events.
print(mostfreq <- head(sort(table(eventsdirty$EVTYPE), decreasing = T), 10))

## 
##         HIGH WINDS   WILD/FOREST FIRE  FLOOD/FLASH FLOOD 
##                657                388                279 
##       RIP CURRENTS        URBAN FLOOD       EXTREME COLD 
##                241                219                199 
##          LANDSLIDE        STORM SURGE         LIGHT SNOW 
##                193                177                141 
## WINTER WEATHER/MIX 
##                139

mostfreqnames <- names(mostfreq)
highfreqgroup <- eventsdirty[eventsdirty$EVTYPE %in% mostfreqnames, c("EVTYPE", "ESTTYPE")]

# Only show the unique entries for comparison.
highfreqgroup[!duplicated(highfreqgroup), ]

##                  EVTYPE                 ESTTYPE
## 7            HIGH WINDS               HIGH WIND
## 17         EXTREME COLD EXTREME COLD/WIND CHILL
## 102   FLOOD/FLASH FLOOD             FLASH FLOOD
## 156         URBAN FLOOD           COASTAL FLOOD
## 189         STORM SURGE        STORM SURGE/TIDE
## 1114       RIP CURRENTS             RIP CURRENT
## 1459   WILD/FOREST FIRE                WILDFIRE
## 1660          LANDSLIDE   ASTRONOMICAL LOW TIDE
## 2118         LIGHT SNOW        LAKE-EFFECT SNOW
## 3964 WINTER WEATHER/MIX          WINTER WEATHER

As we can see, the estimated true event types are mostly correct. But 3 of the 10 most frequent events are incorrectly identified. They are “URBAN FLOOD”, “LANDSLIDE” and “LIGHT SNOW”.

For the second group:

toprankgroup <- eventsdirty %>% filter(FATALITIES > quantile(FATALITIES, 0.999) |
                                       INJURIES > quantile(INJURIES, 0.999) |
                                       CROP_DAMAGE > quantile(CROP_DAMAGE, 0.999) |
                                       PROP_DAMAGE > quantile(PROP_DAMAGE, 0.999)) %>%
                                select(EVTYPE, ESTTYPE)

# Only show the unique entries.
toprankgroup[!duplicated(toprankgroup), ]

##                       EVTYPE                  ESTTYPE
## 1                 WILD FIRES                 WILDFIRE
## 2                RIVER FLOOD          LAKESHORE FLOOD
## 3               EXTREME COLD  EXTREME COLD/WIND CHILL
## 4  UNSEASONABLY WARM AND DRY MARINE THUNDERSTORM WIND
## 5                  HEAT WAVE           EXCESSIVE HEAT
## 8               EXTREME HEAT  EXTREME COLD/WIND CHILL
## 10                 HURRICANE      HURRICANE (TYPHOON)
## 11         HURRICANE/TYPHOON      HURRICANE (TYPHOON)
## 16               STORM SURGE         STORM SURGE/TIDE

In the most influential group, there’re also 3 types of events mis-identified, namely “RIVER FLOOD”, “UNSEASONABLY WARM AND DRY” and “EXTREME HEAT”.

These event type estimates are to be manually rectified.

# Write a function to save some key strokes because eventsdirty$COLUMNNAME is too long to type.
manualreplace <- function(ev, est) {
    # From high frequency group:
    est[ev == "URBAN FLOOD"] <- "FLOOD"
    est[ev == "LANDSLIDE"] <- "DEBRIS FLOW"
    est[ev == "LIGHT SNOW"] <- "HEAVY SNOW"
    
    # From most influential group:
    est[ev == "RIVER FLOOD"] <- "FLOOD"
    est[ev == "UNSEASONABLY WARM AND DRY" | ev == "EXTREME HEAT"] <- "EXCESSIVE HEAT"

    return (est)
}

eventsdirty$ESTTYPE <- manualreplace(eventsdirty$EVTYPE, eventsdirty$ESTTYPE)

So far all the event type refining is completed. For the 4507 records that went through approximate string matching, some of their mapping could still be erroneous as long as they did not undergo manual check. But considering that the 4507 records take up only 1.78% of the final dataset, no more resource in this analysis would be devoted to this matter.

(4) Joining Data

The “eventsclean” and “eventsdirty” datasets are to be combined together for the data analysis in the next section.

dirtypart <- eventsdirty %>% select(-EVTYPE, -matched) %>% rename(EVTYPE = ESTTYPE)
cleanpart <- select(eventsclean, - matched)
refineddata <- rbind(cleanpart, dirtypart)
nrow(refineddata)

## [1] 252975

sapply(refineddata, class)

##    STATE_ID    BGN_DATE      COUNTY  COUNTYNAME       STATE      EVTYPE 
##   "numeric"      "Date"   "numeric" "character" "character" "character" 
##  FATALITIES    INJURIES CROP_DAMAGE PROP_DAMAGE 
##   "numeric"   "numeric"   "numeric"   "numeric"

# Alter the class of some columns.
refineddata[c(1, 3:6)] <- lapply(refineddata[c(1, 3:6)], factor)

Data Analysis

Extreme Events Definition and Data Split

Events most harmful with respect to population health are defined as those who:

caused greatest injury and fatality in one single occurrence, or
happened frequently enough to have caused the greatest cumulative injury and fatality during a period of time.

Events having greatest economic consequences are defined as those who:

caused the greatest of crop or property damage in one single occurrence, or
happened frequently enough to have caused the greatest cummulative crop or property damage during a period of time.

We add 2 new variables “HEALTH_TOTAL” and “ECON_TOTAL” to the dataset “refineddata” storing for each event occurrence its total effect on population health and on economy.

refineddata <- refineddata %>% mutate(HEALTH_EFFECT = FATALITIES + INJURIES, 
                                      ECON_EFFECT = CROP_DAMAGE + PROP_DAMAGE)

From NOAA Storm Events Database Collection Source Clarification we can see that:

From 1955 through 1992, only tornado, thunderstorm wind and hail events were recorded.
From January 1993 through November 2011, NOAA recorded all of the 48 types of storm events.

Therefore, data will be split into 2 subsets accordingly and analyses would be carried out on each of them.

One subset (“before1993”) includes all the data in between 1955 and 1992 and a comparison will be drawn between tornado, thunderstorm wind and hail. Data in the period from 1993 to 2011 are put in another subset (“after1993”) to make a more complete comparison between all the 48 types of events. (Data before 1955 are already deleted in step 1 of data processing section above.)

before1993 = subset(refineddata, BGN_DATE >= ymd(19550101) & BGN_DATE < ymd(19930101))
after1993 = subset(refineddata, BGN_DATE >= ymd(19930101))

Exploration of Data BEFORE Year 1993

(1) Most Influential Events in Their Single Occurrences

singleworstbefore1993 <- before1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
                                               ECON_EFFECT == max(ECON_EFFECT)) %>%
                         arrange(BGN_DATE)

kable(singleworstbefore1993, format = "markdown")

STATE_ID	BGN_DATE	COUNTY	COUNTYNAME	STATE	EVTYPE	FATALITIES	INJURIES	PROP_DAMAGE	HEALTH_EFFECT	ECON_EFFECT
18	1965-04-11	67	HOWARD	IN	TORNADO	17	560	2.5e+08	577	2.5e+08
18	1965-04-11	53	GRANT	IN	TORNADO	8	275	2.5e+08	283	2.5e+08
26	1965-04-11	23	BRANCH	MI	TORNADO	9	200	2.5e+08	209	2.5e+08
20	1966-06-08	177	SHAWNEE	KS	TORNADO	16	450	2.5e+08	466	2.5e+08
48	1970-05-11	303	LUBBOCK	TX	TORNADO	26	500	2.5e+08	526	2.5e+08
13	1973-03-31	63	CLAYTON	GA	TORNADO	0	0	2.5e+08	0	2.5e+08
13	1973-03-31	297	WALTON	GA	TORNADO	1	50	2.5e+08	51	2.5e+08
13	1973-03-31	219	OCONEE	GA	TORNADO	0	0	2.5e+08	0	2.5e+08
13	1973-03-31	59	CLARKE	GA	TORNADO	1	50	2.5e+08	51	2.5e+08
13	1973-03-31	221	OGLETHORPE	GA	TORNADO	0	0	2.5e+08	0	2.5e+08
18	1974-04-03	123	PERRY	IN	TORNADO	2	6	2.5e+08	8	2.5e+08
18	1974-04-03	19	CLARK	IN	TORNADO	0	0	2.5e+08	0	2.5e+08
18	1974-04-03	7	BENTON	IN	TORNADO	0	0	2.5e+08	0	2.5e+08
39	1974-04-03	57	GREENE	OH	TORNADO	36	1150	2.5e+08	1186	2.5e+08
13	1975-03-24	121	FULTON	GA	TORNADO	3	152	2.5e+08	155	2.5e+08
31	1975-05-06	55	DOUGLAS	NE	TORNADO	3	118	2.5e+08	121	2.5e+08
22	1978-12-03	15	BOSSIER	LA	TORNADO	2	266	2.5e+08	268	2.5e+08
48	1979-04-10	485	WICHITA	TX	TORNADO	42	1700	2.5e+08	1742	2.5e+08
9	1979-10-03	3	HARTFORD	CT	TORNADO	3	500	2.5e+08	503	2.5e+08
31	1980-06-03	79	HALL	NE	TORNADO	3	110	2.5e+08	113	2.5e+08
42	1980-06-03	3	ALLEGHENY	PA	TORNADO	0	20	2.5e+08	20	2.5e+08
42	1980-06-03	129	WESTMORELAND	PA	TORNADO	0	0	2.5e+08	0	2.5e+08
42	1980-06-03	5	ARMSTRONG	PA	TORNADO	0	120	2.5e+08	120	2.5e+08
42	1980-07-21	39	CRAWFORD	PA	TORNADO	0	1	2.5e+08	1	2.5e+08
48	1980-08-10	453	TRAVIS	TX	TORNADO	0	4	2.5e+08	4	2.5e+08
40	1981-04-19	143	TULSA	OK	TORNADO	0	7	2.5e+08	7	2.5e+08
40	1982-05-11	65	JACKSON	OK	TORNADO	0	41	2.5e+08	41	2.5e+08
17	1982-05-29	199	WILLIAMSON	IL	TORNADO	10	181	2.5e+08	191	2.5e+08
39	1985-05-31	133	PORTAGE	OH	TORNADO	0	0	2.5e+08	0	2.5e+08
39	1985-05-31	155	TRUMBULL	OH	TORNADO	10	250	2.5e+08	260	2.5e+08
39	1985-05-31	155	TRUMBULL	OH	TORNADO	0	0	2.5e+08	0	2.5e+08
19	1986-07-28	193	WOODBURY	IA	TORNADO	0	1	2.5e+08	1	2.5e+08
19	1986-07-28	133	MONONA	IA	TORNADO	0	0	2.5e+08	0	2.5e+08
37	1988-11-28	183	WAKE	NC	TORNADO	2	105	2.5e+08	107	2.5e+08
9	1989-07-10	9	NEW HAVEN	CT	TORNADO	0	40	2.5e+08	40	2.5e+08
1	1989-11-15	89	MADISON	AL	TORNADO	21	463	2.5e+08	484	2.5e+08
1	1989-11-15	89	MADISON	AL	TORNADO	0	0	2.5e+08	0	2.5e+08
17	1990-08-28	197	WILL	IL	TORNADO	29	350	2.5e+08	379	2.5e+08
20	1991-04-26	173	SEDGWICK	KS	TORNADO	4	75	2.5e+08	79	2.5e+08
20	1991-04-26	15	BUTLER	KS	TORNADO	13	150	2.5e+08	163	2.5e+08
48	1992-11-21	201	HARRIS	TX	TORNADO	0	15	2.5e+08	15	2.5e+08

As we can see, it was always tornado that caused the most severe damages in its single occurrences.

maxsinglebefore1993 <- singleworstbefore1993[which.max(singleworstbefore1993$HEALTH_EFFECT), ]
kable(maxsinglebefore1993, format = "markdown")

	STATE_ID	BGN_DATE	COUNTY	COUNTYNAME	STATE	EVTYPE	FATALITIES	INJURIES	CROP_DAMAGE	PROP_DAMAGE	HEALTH_EFFECT	ECON_EFFECT
18	48	1979-04-10	485	WICHITA	TX	TORNADO	42	1700	0	2.5e+08	1742	2.5e+08

In particular, the tornado that took place on 1979-04-10 was most harmful both in terms of population health and economic consequences. It caused 42 death, 1700 injuries, $0 crop damage and $250000000 property damage.

(2) Events Whose Cumulative Influence is the Most Significant

# Write a function "maxcumu()" that can be applied to both "before1993" and "after1993" datasets. The nested function "cumueffects()" will later be utilized in plotting "after1993" dataset. 

cumueffects <- function(x) {
    x %>% group_by(EVTYPE) %>% 
          summarise(FATALITIES_TOTAL = sum(FATALITIES), INJURIES_TOTAL = sum(INJURIES),
                    CROP_TOTAL = sum(CROP_DAMAGE), PROP_TOTAL = sum(PROP_DAMAGE),
                    HEALTH_TOTAL = sum(HEALTH_EFFECT), ECON_TOTAL = sum(ECON_EFFECT))
}

maxcumu <- function(x) {
    cumueffects(x) %>% filter(HEALTH_TOTAL == max(HEALTH_TOTAL) | ECON_TOTAL == max(ECON_TOTAL))
}

maxcumubefore1993 <- maxcumu(before1993)

kable(maxcumubefore1993, format = "markdown")

EVTYPE	FATALITIES_TOTAL	INJURIES_TOTAL	CROP_TOTAL	PROP_TOTAL	HEALTH_TOTAL	ECON_TOTAL
TORNADO	3123	59092	0	29722198670	62215	29722198670

From the table above we can see that from January 1955 to December 1992, tornado caused 3123 death, 59092 injuries, $0 crop damage and $29722198670 property damage all together. Hence tornado had the most harmful cumulative effect both in terms of population health and economic consequences.

Exploration of Data AFTER Year 1993

(1) Most Influential Events in Their Single Occurrences

singleworstafter1993 <- after1993 %>% filter(HEALTH_EFFECT == max(HEALTH_EFFECT) |
                                             ECON_EFFECT == max(ECON_EFFECT)) %>% arrange(BGN_DATE)

kable(singleworstafter1993, format = "markdown")

STATE_ID	BGN_DATE	COUNTY	COUNTYNAME	STATE	EVTYPE	FATALITIES	INJURIES	CROP_DAMAGE	PROP_DAMAGE	HEALTH_EFFECT	ECON_EFFECT
39	1994-02-08	0	OHZ42>088	OH	ICE STORM	1	1568	5000000	5.00e+07	1569	55000000
6	2006-01-01	55	NAPA	CA	FLOOD	0	0	32500000	1.15e+11	0	115032500000

In their single occurrences, ice storm was most harmful to the population health, causing 1 death and 1568 injuries on 1994-02-08, and flood caused the most economic damage as $32500000 crop as well as $115000000000 were damaged in a flood on 2006-01-01.

(2) Events Whose Cumulative Influence is the Most Significant

maxcumuafter1993 <- maxcumu(after1993) %>% arrange(EVTYPE)

kable(maxcumuafter1993, format = "markdown")

EVTYPE	FATALITIES_TOTAL	INJURIES_TOTAL	CROP_TOTAL	PROP_TOTAL	HEALTH_TOTAL	ECON_TOTAL
FLOOD	508	6873	10738086050	150083407610	7381	160821493660
TORNADO	1621	23328	414962910	26343736027	24949	26758698937

From the table above we can see that from January 1993 to November 2011, flood in total caused $10738086050 crop damage and $150083407610 property damage. Flood was most harmful event type with respect to the economy.

On the other hand, tornado in this period caused 1621 death and 23328 injuries, making it the worst event in terms of cumulative effect on the population health.

(3) Plotting the Facts

We are to present the impact of 10 most harmful event types with respect to the public health and the economy in 2 plots. Let’s first take a look at the data excerpt below. This is from the summary of each event type’s cumulative harm in the period from January 1, 1993 to November 30, 2011.

cumueffectsafter1993 <- cumueffects(after1993)
kable(head(cumueffectsafter1993, 3), format = "markdown")

EVTYPE	FATALITIES_TOTAL	INJURIES_TOTAL	CROP_TOTAL	PROP_TOTAL	HEALTH_TOTAL	ECON_TOTAL
ASTRONOMICAL LOW TIDE	16	38	5000000	13126150	54	18126150
AVALANCHE	225	170	0	4291800	395	4291800
BLIZZARD	101	805	112060000	659313950	906	771373950

Relevant columns from the above table are then selected. After the needed data reshaping, 2 plots would be made by ggplot2.

# Select the top 10 population-health-harmful events.
# Use gather() to reshape data.
after1993_healthtop10 <- filter(cumueffectsafter1993, HEALTH_TOTAL %in% tail(sort(HEALTH_TOTAL), 10)) %>% 
                         select(1:3) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_healthtop10)

## # A tibble: 6 × 3
##           EVTYPE           EFFECT VALUE
##           <fctr>            <chr> <dbl>
## 1 EXCESSIVE HEAT FATALITIES_TOTAL  2228
## 2    FLASH FLOOD FATALITIES_TOTAL  1035
## 3          FLOOD FATALITIES_TOTAL   508
## 4           HEAT FATALITIES_TOTAL   937
## 5      HIGH WIND FATALITIES_TOTAL   299
## 6      ICE STORM FATALITIES_TOTAL    97

# Plot the top 10 health-impacting event types.
g_health <- ggplot(after1993_healthtop10, aes(x = EVTYPE, y = VALUE, fill = EFFECT)) + geom_col( ) +
            labs(title = "Effects of the 10 Event Types Most Harmful to the Population Health", 
                 subtitle = "Data from January 1993 to November 2011",
                 x = "Event Type", y = "Number of Fatalities and Injuries") +
            theme(axis.text.x = element_text(angle = 30, hjust = 1), 
                  legend.title = element_blank()) +
            scale_fill_manual(values = c("lightblue3", "lightcoral"),
                              labels = c("Total Fatalities", "Total Injuries"))

g_health

# Select the top 10 economically consequential events.
after1993_econtop10 <- filter(cumueffectsafter1993, ECON_TOTAL %in% tail(sort(ECON_TOTAL), 10)) %>% 
                         select(c(1, 4, 5)) %>% gather(key = EFFECT, value = VALUE, -EVTYPE)
head(after1993_econtop10)

## # A tibble: 6 × 3
##                EVTYPE     EFFECT       VALUE
##                <fctr>      <chr>       <dbl>
## 1             DROUGHT CROP_TOTAL 13972631000
## 2         FLASH FLOOD CROP_TOTAL  1532197150
## 3               FLOOD CROP_TOTAL 10738086050
## 4                HAIL CROP_TOTAL  3026044650
## 5 HURRICANE (TYPHOON) CROP_TOTAL  5506117800
## 6           ICE STORM CROP_TOTAL  5022113500

# Plot the top 10 economy-impacting event types.
g_econ <- ggplot(after1993_econtop10, aes(x = EVTYPE, y = VALUE/1000000000, fill = EFFECT)) + geom_col( ) +
          labs(title = "Effects of the 10 Event Types with Greatest Economic Consequences", 
               subtitle = "Data from January 1993 to November 2011",
               x = "Event Type", y = "Crop and Property Damage (in Billions of Dollars)") +
          theme(axis.text.x = element_text(angle = 30, hjust = 1),
                legend.title = element_blank()) +
          scale_fill_manual(values = c("coral", "cornflowerblue"),
                            labels = c("Total Crop Damage", "Total Property Damage"))
    
g_econ

Results

In this analysis, we found that:

From January 1, 1955 to December 31, 1992:

We only compared tornado, thunderstorm wind and hail events.
The storm event that caused greatest damage to the public health and economy in the U.S. was tornado.
Its consequence in a single occurrence and its cumulative effect over that period were both the worst among the 3 event types.

From January 1, 1993 to November 30, 2011:

48 types of storm events are compared.
In their single occurrence, ice storm was the most harmful to the population health while flood caused the most economic damage.
In terms of their cumulative effect over the whole period, flood was the most devastating event to the economy and tornado was most detrimental to the public health.

Analysis of the Impact of Storm Events in the U.S.

Evidence from 1955 to 2011

Feng Hong

2/21/2017

Synopsis

Data Processing

Step 1: Downloading Raw Data

Step 2: Data cleaning

(1) Parsing Dates

(2) Manipulating Crop and Property Damage Data

(3) Event Types Cleaning

i) Preliminary Manual Processing

ii) Approximate String Matching by adist() Function

iii) Manual Double-check

(4) Joining Data

Data Analysis

Extreme Events Definition and Data Split

Events most harmful with respect to population health are defined as those who:

Events having greatest economic consequences are defined as those who:

Exploration of Data BEFORE Year 1993

(1) Most Influential Events in Their Single Occurrences

(2) Events Whose Cumulative Influence is the Most Significant

Exploration of Data AFTER Year 1993

(1) Most Influential Events in Their Single Occurrences

(2) Events Whose Cumulative Influence is the Most Significant

(3) Plotting the Facts

Results