Health and Economic Impacts of the most hazardous Weather Events in the USA

Synopsis

This report describes the exploration of the NOAA Storm Database and answers some questions about severe weather events with the greatest impact on the population health and economics between 1996 and 2011 years. Analysis was based on the standard of the weather events types declared in the NWS Storm Data Documentation.

The Data processing part of this report explains how the time period of exploration was chosen and describes how the weather event names form the source dataset was brought in accordance with the standard event names.

At the Summary part shown that the most dangerous weather events with impact on the populations health are tornadoes, excessive heat and floods. By the economic consequences the most hazard weather events are floods, hurricanes/typhoons, droughts, storm surges/tides and tornadoes.

Data Processing

Data downloading and prerocessing

data_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_file <- "repdata-data-StormData.csv.bz2"
download.file(data_url, destfile = data_file, method = "auto")
data_date <- date()

The original dataset was downloaded Sun Dec 27 15:57:06 2015 as repdata-data-StormData.csv.bz2 archive:

Storm Data [47Mb]

Next code unpacks the downloaded .bz2 archive and reads the original dataset:

data <- read.csv(bzfile(data_file))
data_size <- dim(data)

Supplied dataset contains 902297 observations of weather events and 37 variables. Next variables were used in the present exploration:

EVTYPE - the type (name) of the weather event;
BGN_DATE - the date of the weather event;
FATALITIES - the number of fatalities concerned with the weather event;
INJURIES - the number of injuries concerned with the weather event;
PROPDMG - the amount of property damages in US dollars;
PROPDMGEXP - the multiplier (thousands, millions or billions) for the PROPDMG;
CROPDMG - the amount of crop losses in US dollars;
CROPDMGEXP - the multiplier (thousands, millions or billions) for the CROPDMG.

data <- data[, c("EVTYPE", "BGN_DATE",
                 "FATALITIES", "INJURIES",
                 "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

Table of standart events

The supplied NOAA’s dataset contains information about storms and other severe weather events. In the purpose of data processing all event types were represented with 48 unique weather events and defined in National Weather Service Storm Data Documentation. This set of the standard types of weather events was established since 1996 year.

To provide further analysis in accordance with NWS standards I prepared a dataframe with standart names of weather events:

events <- data.frame(
                EventName = c("Astronomical Low Tide", "Avalanche", "Blizzard",
                              "Coastal Flood", "Cold/Wind Chill", "Debris Flow",
                              "Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
                              "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
                              "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze",
                              "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
                              "High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm",
                              "Lakeshore Flood", "Lake-Effect Snow", "Lightning",
                              "Marine Hail", "Marine High Wind", "Marine Strong Wind",
                              "Marine Thunderstorm Wind", "Rip Current", "Seiche",
                              "Sleet", "Storm Surge/Tide", "Strong Wind",
                              "Thunderstorm Wind", "Tornado", "Tropical Depression",
                              "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
                              "Wildfire", "Winter Storm", "Winter Weather"
        )
)

After that I converted all event names to uppercase:

data$EVTYPE <- toupper(data$EVTYPE)
events$EventName <- toupper(events$EventName)

Choosing of the time period

The events in the supplied dataset start in the year 1950 and end in November 2011.

As it is defined in the task to the present assignment we should consider writing report “as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events”. On the one hand, this means that chosen data period should contain quite fresh data (old data could be irrelevant because of different climate changes or measures already taken to prevent the consequences of hazardous weather events). On the other hand, analyzed data should contain enough information for statistical analysis because frequency of the different weather events varies year to year.

Furthermore, as it shown at the [Supplimental Information page] website](http://www.ncdc.noaa.gov/stormevents/details.jsp?type=supplemental) at the NOAA’s the number of matched events till 1993 is only 61% and not all event types were recorded:

from 1950 through 1954 - only tornado events;
from 1955 through 1992 - only tornado, thunderstorm wind and hail events.

If take into account that the standard of weather event names was accepted since 1996 I made a decision to exclude older events from the analisis:

require(lubridate)
data$YEAR <- year(as.Date(data$BGN_DATE, "%m/%d/%Y"))
subdata <- data[data$YEAR >= 1996, ]

After this transformation the amount of events reduced to 653530 (about 72.4% of the source dataset).

Excluding events without consequences

Based on the goal of analysis I also filtered out all weather events with absence of population health and economic consequences:

subdata <- subdata[subdata$FATALITIES > 0 |
                   subdata$INJURIES > 0 |
                   subdata$PROPDMG > 0 |
                   subdata$CROPDMG > 0, ]

Refactoring the values of damages into absolute values

Information about property and crop damages is presented in the variables PROPDMG and CROPDMG respectively. Corresponding multipliers lie in the variables PROPDMGEXP and CROPDMGEXP:

table(subdata$PROPDMGEXP)[table(subdata$PROPDMGEXP) > 0]

## 
##             B      K      M 
##   8448     32 185474   7364

table(subdata$CROPDMGEXP)[table(subdata$CROPDMGEXP) > 0]

## 
##             B      K      M 
## 102767      2  96787   1762

To calculate absolute values of damages I used the next refactoring:

multiplier_table <- c("1" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)
subdata$PROPDMGEXP <- ifelse(subdata$PROPDMGEXP == "",
                             "1", as.character(subdata$PROPDMGEXP))
subdata$CROPDMGEXP <- ifelse(subdata$CROPDMGEXP == "",
                             "1", as.character(subdata$CROPDMGEXP))
subdata$PROPDMG <- as.numeric(subdata$PROPDMG) *
                              multiplier_table[subdata$PROPDMGEXP]
subdata$CROPDMG <- as.numeric(subdata$CROPDMG) *
                              multiplier_table[subdata$CROPDMGEXP]

Summing up total health or economic Impacts

At the end of preprocessing I calculated total health and economic impacts and saved them into the corresponding variables (they will be useful for the further analysis):

fatalities_total = sum(subdata$FATALITIES)
injuries_total = sum(subdata$INJURIES)
propdmg_total = sum(subdata$PROPDMG)
cropdmg_total = sum(subdata$CROPDMG)

Event names correction

The supplied dataset was collected by people and as we know everybody makes mistakes. In our case it is possible to estimate the impact of difference between event names in the dataset and standard event names.

To understand the size of the problem I merged dataset and the list of standard event names:

tmp <- merge(subdata, events, by.x = "EVTYPE", by.y = "EventName")

…and calculated the losses in terms of health and economy impact:

fatalities_before = 100 * (1 - sum(tmp$FATALITIES) / fatalities_total)
injuries_before = 100 * (1 - sum(tmp$INJURIES) / injuries_total)
propdmg_before = 100 * (1 - sum(tmp$PROPDMG) / propdmg_total)
cropdmg_before = 100 * (1 - sum(tmp$CROPDMG) / cropdmg_total)

By the reason of inconsistent event names next amount of values could be lost:

11.66% of all fatalities;
11.31% of all injuries;
17.43% of all property demages;
14.94% of all crop demages.

As you can see this values are quite big and could have meaningfull influence on the results. Therefore I made a decision to bring the source event names in accordance with the standard ones as far as it is possible.

Looking through the source dataset and using of definitions from the NWS Storm Data Documentation I noticed three types of problem with event names:

using of non standard names (“WILD/FOREST FIRE” instead of “WILDFIRE”, “GLAZE” instead of “FREEZING FOG” e.t.c.)
abbreviations (“TSTM” instead of “THUNDERSTORM”)
different spelling-errors and other similar cases (“TUNDERSTORM” instead of “THUNDERSTORM”, “LAKE EFFECT SNOW” instead of “LAKE-EFFECT SNOW”, “STRONG WINDS” instead of “STRONG WIND”, “COASTAL FLOODING” instead of “COASTAL FLOOD” e.t.c.)

Non standard event names and abbriviations

The most frequent cases with using of abbreviations and non standard names I solved by direct substitution:

subdata$EVTYPE <- gsub("TSTM", "THUNDERSTORM", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^URBAN/SML STREAM FLD$", "HEAVY RAIN", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WILD/FOREST FIRE$", "WILDFIRE", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^EXTREME COLD$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^TYPHOON$", "HURRICANE/TYPHOON", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COLD$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^LANDSLIDE$", "DEBRIS FLOW", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^FOG$", "DENSE FOG", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^GLAZE$", "FREEZING FOG", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WIND$", "HIGH WIND", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^HEAVY SURF/HIGH SURF$", "HIGH SURF", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COLD AND SNOW$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WINTER WEATHER/MIX$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^THUNDERSTORM WIND/HAIL$", "THUNDERSTORM WIND", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^EXTREME WINDCHILL$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^RIVER FLOODING$", "FLOOD", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COASTAL FLOODING/EROSION$", "COASTAL FLOOD", subdata$EVTYPE)

Correction of spelling-errors

The idea of spelling-errors correction was taken from the article by Gregory V. Bard “Spelling-Error Tolerant, Order-Independent Pass-Phrases via the Damerau-Levenshtein String-Edit Distance Metric”. But in place of algorithm described in the article I used the more simple method based on the Optimal String Aligment metric (restricted Damerau-Levenshtein distance) from the R-package stringdist.

At the begining I calculated the pairwise distances between event names from the dataset and standard event names:

require(stringdist)
raw_event_names <- as.data.frame(table(subdata$EVTYPE))
names(raw_event_names) <- c("EventOldName", "Count")
dist_matrix <- stringdistmatrix(raw_event_names$EventOldName, events$EventName)

After that for every original event name I chose the closest standard event name by the OSA-distaance metric:

raw_event_names$DistOSA <- apply(dist_matrix, 1, min)
raw_event_names$EventName <- events[apply(dist_matrix, 1, which.min), ]

If the OSA-distance is equal to zero the original event name is equal to standard one, otherwise there are three possible cases:

it was a spelling-error and it was correctly corrected;
it was a spelling-error and it was not correcly corrected;
it was not a spelling-error, in other words it was a non standard event name which was not corrected by direct substitution.

The third case was not frequent so I focused on the first and second cases. The good way to reduce the wrong error corrections is to limit the value of OSA-distance. Strong limitations (by setting the maximum of OSA-distans <= 2 ) could slightly reduce the recall of method but at the same time would highly reduce its fullness. To avoid this I calculate the normalized OSA-distance by dividing it to the length of corrected word:

raw_event_names$DistNorm <- raw_event_names$DistOSA / 
                                nchar(as.character(raw_event_names$EventName))

It helped to avoid such wrong correction of short lexically-similar terms like “HEAVY SEAS” -> “HEAVY SNOW”, “RAIN” -> “HAIL” e t.c. After the set of attempts I have chosen the next pair of limitations:

max_osa_dist <- 4
max_norm_dist <- 0.3

…and used them to correct of the event names marked all residual names as “UNCLASSIFIED”:

raw_event_names$EventNewName <- ifelse(raw_event_names$DistOSA <= max_osa_dist & 
                                       raw_event_names$DistNorm < max_norm_dist,
                                       raw_event_names$EventName,
                                       "UNCLASSIFIED")
subdata <- merge(subdata, raw_event_names[, c("EventOldName", "EventNewName")],
                 by.x = "EVTYPE",
                 by.y = "EventOldName")

After that I calculated the losses in terms of health and economy impact again:

delta <- subdata[subdata$EventNewName != "UNCLASSIFIED", ]
fatalities_after = 100 * (1 - sum(delta$FATALITIES) / fatalities_total)
injuries_after = 100 * (1 - sum(delta$INJURIES) / injuries_total)
propdmg_after = 100 * (1 - sum(delta$PROPDMG) / propdmg_total)
cropdmg_after = 100 * (1 - sum(delta$CROPDMG) / cropdmg_total)

The new estimation is:

1.11% of all fatalities;
0.77% of all injuries;
0.03% of all property demages;
0.98% of all crop demages.

As we can see all cleaning procedures have redused the loses of data available for analysis more then 10 times. From my point of view the dataset is ready to the final investigation now.

Final calculations

To plot the final results I summed up all amounts of fatalities, injuries, property and crop damages grouped by event types and converted the values of property and crop damages into millions of dollars:

total_impact <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EventNewName,
                          data = subdata, sum)
total_impact_table <- total_impact
total_impact$PROPDMG <- round(total_impact$PROPDMG / 1e6)
total_impact$CROPDMG <- round(total_impact$CROPDMG / 1e6)

Bar plot function

Next my_barplot() function was written to represent the pairs of “fatalities/injuries” and “property/crop demages” in one plot. The idea of such type of plotting was taken from the site stackoverflow.com and modified to the conditions of present investigation. (For details, please, see Two horizontal bar charts with shared axis in ggplot2.

require(ggplot2)
require(scales)
require(grid)
require(gridExtra)

my_barplot = function(datatable, title = "", lsubtitle = "", rsubtitle = "") {
        datatable <- as.data.frame(datatable)
        names(datatable) <- c("Events", "Left", "Right")
        datatable <- datatable[order(-(datatable$Left + datatable$Right)), ]
        datatable$Events <- factor(datatable$Events,
                                   levels = datatable[order(datatable$Left + 
                                                            datatable$Right),
                                                      "Events"])
        datatable <- datatable[1:15, ]

        g.mid <- ggplot(datatable, aes(x = 1, y = Events)) +
                geom_text(aes(label = Events), position = "identity", size = 3) +
                ggtitle("") +
                ylab(NULL) +
                scale_x_continuous(expand = c(0,0),
                                   limits = c(0.94,1.065)) +
                theme(
                        axis.title = element_blank(),
                        panel.grid = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks.y = element_blank(),
                        panel.background = element_blank(),
                        axis.text.x = element_text(color = NA),
                        axis.ticks.x = element_line(color = NA),
                        plot.margin = unit(c(1,-1,1,-1), "mm")
                )
        
        g1 <- ggplot(datatable, aes(x = Events, y = Left)) +
                geom_bar(stat = "identity") +
                geom_text(aes(label = Left), hjust = 1.4, size = 3) +
                ggtitle(lsubtitle) +
                theme(
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks.y = element_blank(),
                        plot.margin = unit(c(1,-1,1,0), "mm")
                ) +
                scale_y_reverse(limits = c(max(datatable$Left) * 1.4, 0)) +
                coord_flip()
        
        g2 <- ggplot(datatable, aes(x = Events, y = Right)) +
                xlab(NULL) +
                geom_bar(stat = "identity") +
                ggtitle(rsubtitle) +
                geom_text(aes(label = Right), hjust = -.3, size = 3) +
                theme(
                        axis.title.x = element_blank(), 
                        axis.title.y = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks.y = element_blank(),
                        plot.margin = unit(c(1,0,1,-1), "mm")
                ) +
                scale_y_continuous(limits = c(0,max(datatable$Right) * 1.3)) +
                coord_flip()

        gg.mid <- ggplot_gtable(ggplot_build(g.mid))
        gg1 <- ggplot_gtable(ggplot_build(g1))
        gg2 <- ggplot_gtable(ggplot_build(g2))
        
        grid.arrange(gg1,gg.mid,gg2,
                     top = textGrob(title, gp=gpar(cex=1.5)),
                     ncol = 3,
                     widths = c(0.35, 0.25, 0.35))
}

Results

The most harmful events with respect to population health

Figure 1 shows the most harmful weather events with respect to the population health.

Middle column explains the standatd names of events with the biggest sums of fatalities and injuries sorted by the decrease of values. The left and the right bar plots represent the amount of fatalities and injuries separately:

my_barplot(total_impact[, c("EventNewName", "FATALITIES", "INJURIES")],
           "Fig 1. The most harmful weather events with respect to population health\n(for the period of 1996 - 2011 years)",
           "Fatalities", "Injuries")

As we can see the most harmful weather event with the biggest total influence on the population health is TORNADO which causes the biggest amount of injuries. Then goes EXCESSIVE HEAT and FLOOD. At the same time the most fatal weather event is EXCESSIVE HEAT.

The most harmful events with the biggest economic consequences

Figure 2 shows the most harmful weather events with respect to the property and crop damages.

Middle column represents the standatd names of events with the biggest sums of property and crop damages sorted by the decrease of values. The left and the right bar plots represent the amount of property damages and crop damages separately:

my_barplot(total_impact[, c("EventNewName", "PROPDMG", "CROPDMG")],
           "Fig 2. The most harmful events with the biggest economic consequences\n(for the period of 1996 - 2011 years, in millions USD)",
           "Property Demage", "Crop Demage")

As we can see the most harmful weather event with the biggest total economic consequence is FLOOD which also causes the largest property damages. Then goes HURRICANE/TYPHOON and STORM SURGE/TIDE. TORNADO is at the forth place. All this weather events also caused the the biggest property damages.

But the situation with the crop damages is different. The most harmfull weather event here is DROUGHT, then goes HURRICANE/TYPHOON and FLOOD.

The result data

Here is the final data-table of the aggregate values of the weather events consequences for the years between 1996 and 2001 sorted by the standard event names. The “UNCLASSIFIED” event corresponds to the all source events which names was not associated with the standard weather event names.

library(knitr)
total_impact_table$PROPDMG <- round(total_impact_table$PROPDMG / 1e6, digits = 3)
total_impact_table$CROPDMG <- round(total_impact_table$CROPDMG / 1e6, digits = 3)
names(total_impact_table) <- c("Event Fame", "Fatalities", "Injuries",
                               "Property Damages, mln USD", "Crop Damages, mln USD")
kable(total_impact_table, digits=3)

Event Fame	Fatalities	Injuries	Property Damages, mln USD	Crop Damages, mln USD
ASTRONOMICAL LOW TIDE	0	0	9.745	0.000
AVALANCHE	223	156	3.712	0.000
BLIZZARD	70	385	525.659	7.060
COASTAL FLOOD	6	7	375.240	0.000
COLD/WIND CHILL	95	12	1.990	0.600
DEBRIS FLOW	37	52	324.578	20.017
DENSE FOG	69	855	20.465	0.000
DENSE SMOKE	0	0	0.100	0.000
DROUGHT	0	4	1046.101	13367.566
DUST DEVIL	2	39	0.664	0.000
DUST STORM	11	376	5.474	3.100
EXCESSIVE HEAT	1797	6393	9.659	492.402
EXTREME COLD/WIND CHILL	317	192	36.089	1326.023
FLASH FLOOD	887	1674	15222.254	1334.902
FLOOD	415	6759	144050.989	5002.798
FREEZING FOG	1	212	2.332	0.000
FROST/FREEZE	0	0	10.480	1094.186
FUNNEL CLOUD	0	1	0.134	0.000
HAIL	7	713	14595.143	2476.029
HEAT	237	1222	1.520	0.176
HEAVY RAIN	122	309	643.174	736.658
HEAVY SNOW	107	698	634.418	71.122
HIGH SURF	132	198	93.775	0.000
HIGH WIND	253	1167	5250.650	633.861
HURRICANE/TYPHOON	125	1326	81718.889	5350.108
ICE STORM	82	318	3642.249	15.660
LAKE-EFFECT SNOW	0	0	40.182	0.000
LAKESHORE FLOOD	0	0	7.540	0.000
LIGHTNING	651	4141	743.077	6.898
MARINE HAIL	0	0	0.004	0.000
MARINE HIGH WIND	1	1	1.297	0.000
MARINE STRONG WIND	14	22	0.418	0.000
MARINE THUNDERSTORM WIND	19	34	5.857	0.050
RIP CURRENT	542	503	0.163	0.000
SEICHE	0	0	0.980	0.000
STORM SURGE/TIDE	13	42	47834.724	0.855
STRONG WIND	110	299	176.994	64.954
THUNDERSTORM WIND	376	5125	7913.198	1016.943
TORNADO	1511	20667	24616.946	283.425
TROPICAL DEPRESSION	0	0	1.737	0.000
TROPICAL STORM	57	338	7642.476	677.711
TSUNAMI	33	129	144.062	0.020
UNCLASSIFIED	97	445	92.130	340.405
VOLCANIC ASH	0	0	0.500	0.000
WATERSPOUT	2	2	5.730	0.000
WILDFIRE	87	1456	7760.449	402.255
WINTER STORM	191	1292	1532.743	11.944
WINTER WEATHER	33	411	20.926	15.000