Weather events: isolating the major causes of fatalities, injuries and economic costs

Synopsis

The goal of the present report is to address two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

To do this, we look into the NOAA Storm Database and follow the steps:

[data reduction] classify each event in the database as being one (or a combination) of the following: “TORNADO”, “LIGHTNING”, “WIND”, “HAIL”, “COLD”, “HEAT”, “FLOOD”, “STORM”, “TSUNAMI”, “HURRICANE”, “ICE”, “AVALANCHE”, “DROUGHT”, “RAIN”, “FOG”, “OTHER”
[data analysis: fatalities and injuries] identify which event types caused more fatalities and injuries
[data analysis: economic damage] idenfity which event types caused more economic damage

Limitations: The data reduction step is crucial and could be improved upon. The data analysis was not exaustive: only the most relevant events were explicitly looked at, all the others were combined; a more detailed analysis is possible.

Data Processing

We start by loading the data and seing what it looks like:

setInternet2(TRUE)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
    "data.csv")
data = read.csv("data.csv")
head(data)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Notice, in particular, that there is a great variety of event types:

str(data$EVTYPE)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

Before we continue, it is important to reduce the number of different event types considered. To reduce the data, we will employ a two step strategy:

define a smaller set of categories
check, for each entry of the EVTYPE variable, which category(ies) it fits into

For the first step, we will simply consider the following events:

events = c("TORNADO", "LIGHTNING", "WIND", "HAIL", "COLD", "HEAT", "FLOOD", 
    "STORM", "TSUNAMI", "HURRICANE", "ICE", "AVALANCHE", "DROUGHT", "RAIN", 
    "FOG", "OTHER")

Definitely, we could do better. The way this list is defined is a crucial step in the current analysis and it is also a serious limitation. Bare in mind that:

The set of categories “events” may definitely be improved upon. In particular, one may take into account that there are typos in the original .csv file.
an event in the original .csv file may actually fit in more than one category. For instance, if we have a cold storm, it will be matched with two categories in the events list: “STORM” and “COLD”. Definitely this could be improved upon.
more rare events in the original .csv file will be categorized as “OTHER”.

We will now create a new table (is_related_to_event, you can check it below) with the categorization of the events in the original .csv file. The variables of the table will be the events we defined. The rows will be the same as in the original database. So, for each entry of the original database, we will classify it as being one (or mode) of the event types defined by us:

is_related_to_event = matrix(nrow = nrow(data), ncol = length(events))
for (j in seq(1:length(events) - 1)) {
    is_related_to_event[, j] = grepl(toupper(events[j]), data$EVTYPE)
}
is_related_to_event[, length(events)] = rowSums(is_related_to_event) == 0
colnames(is_related_to_event) = events

We thus obtain the following matrix (where you can see that the first six events are all tornados):

head(is_related_to_event)

##      TORNADO LIGHTNING  WIND  HAIL  COLD  HEAT FLOOD STORM TSUNAMI
## [1,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
## [2,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
## [3,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
## [4,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
## [5,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
## [6,]    TRUE     FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE
##      HURRICANE   ICE AVALANCHE DROUGHT  RAIN   FOG OTHER
## [1,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE
## [2,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE
## [3,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE
## [4,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE
## [5,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE
## [6,]     FALSE FALSE     FALSE   FALSE FALSE FALSE FALSE

Right, we know, for each entry of the original database we downloaded, what events it is related to. We can now build our final table containing the information we are interested in analysing. Notably, for each event type, we want to know

How many times it occured
The number of fatalities associated
The number of injuries associated
The financial costs it envolved

This table is rather easy to build - and we can see how it looks like:

final_info = matrix(nrow = length(events), ncol = 4)
rownames(final_info) = events
colnames(final_info) = c("n", "fatalities", "injuries", "total_cost (in billions)")

exponents = data["PROPDMGEXP"]

cost = matrix(nrow = nrow(data), ncol = 1)
cost = 0
cost[exponents == "K" | exponents == "k"] = 1000 * data$PROPDMG[exponents == 
    "K" | exponents == "k"]
cost[exponents == "M" | exponents == "m"] = 1e+06 * data$PROPDMG[exponents == 
    "M" | exponents == "m"]
cost[exponents == "B" | exponents == "b"] = 1e+09 * data$PROPDMG[exponents == 
    "B" | exponents == "b"]
# all costs now in dollars

for (j in seq(1:length(events))) {
    final_info[j, 1] = sum(is_related_to_event[, j])
    final_info[j, 2] = sum(data$FATALITIES[is_related_to_event[, j] == TRUE])
    final_info[j, 3] = sum(data$INJURIES[is_related_to_event[, j] == TRUE])
    final_info[j, 4] = round(sum(cost[is_related_to_event[, j] == TRUE], na.rm = TRUE)/1e+09, 
        digits = 2)
}

final_info = as.data.frame(final_info)
final_info

##                n fatalities injuries total_cost (in billions)
## TORNADO    60700       5661    91407                    58.59
## LIGHTNING  15776        817     5232                     0.94
## WIND      364847       1446    11495                    17.74
## HAIL      290394         45     1467                    17.62
## COLD        2431        443      320                     0.25
## HEAT        2645       3138     9154                     0.02
## FLOOD      82665       1523     8603                   167.38
## STORM     124619        633     6691                    73.06
## TSUNAMI       20         33      129                     0.14
## HURRICANE    286        135     1326                    84.76
## ICE         2185        102     2166                     3.97
## AVALANCHE    387        224      171                     0.01
## DROUGHT     2512          6       19                     1.05
## RAIN       12202        114      301                     3.26
## FOG         1880         81     1077                     0.03
## OTHER      53012       1327     5567                    11.50

In the results section, We will want look at the number of fatalities and injuries caused by the events. Clearly not all events have the same impact. To focus on those that have the larger impacts, we will only consider events that caused 3000 or more fatalities or 6000 or more injuries:

final_harm_info = final_info[!(rownames(final_info) == "OTHER") & (final_info$fatalities > 
    2999 | final_info$injuries > 5999), c(2, 3)]
final_harm_info[nrow(final_harm_info) + 1, ] = colSums(final_info[!(!(rownames(final_info) == 
    "OTHER") & (final_info$fatalities > 2999 | final_info$injuries > 5999)), 
    c(2, 3)])
rownames(final_harm_info) = c(rownames(final_harm_info[seq(1:(nrow(final_harm_info) - 
    1)), ]), "OTHER")
final_harm_info

##         fatalities injuries
## TORNADO       5661    91407
## WIND          1446    11495
## HEAT          3138     9154
## FLOOD         1523     8603
## STORM          633     6691
## OTHER         3327    17775

We will also want to look at the total costs related to each event type, which is easier to do if we order our table

final_info = final_info[order(final_info$total_cost), ]
final_info

##                n fatalities injuries total_cost (in billions)
## AVALANCHE    387        224      171                     0.01
## HEAT        2645       3138     9154                     0.02
## FOG         1880         81     1077                     0.03
## TSUNAMI       20         33      129                     0.14
## COLD        2431        443      320                     0.25
## LIGHTNING  15776        817     5232                     0.94
## DROUGHT     2512          6       19                     1.05
## RAIN       12202        114      301                     3.26
## ICE         2185        102     2166                     3.97
## OTHER      53012       1327     5567                    11.50
## HAIL      290394         45     1467                    17.62
## WIND      364847       1446    11495                    17.74
## TORNADO    60700       5661    91407                    58.59
## STORM     124619        633     6691                    73.06
## HURRICANE    286        135     1326                    84.76
## FLOOD      82665       1523     8603                   167.38

Clearly some events cause much more economic consequences than others. Maybe we should just look at those that cause the greater consequences. Events with associated costs below 20 billion dollars will now just go to our “other” column:

final_cost_info = final_info[final_info$total_cost > 20, ]
final_cost_info[nrow(final_cost_info) + 1, ] = colSums(final_info[final_info$total_cost < 
    20, ])
rownames(final_cost_info) = c(rownames(final_cost_info[seq(1:(nrow(final_cost_info) - 
    1)), ]), "OTHER")
final_cost_info

##                n fatalities injuries total_cost (in billions)
## TORNADO    60700       5661    91407                    58.59
## STORM     124619        633     6691                    73.06
## HURRICANE    286        135     1326                    84.76
## FLOOD      82665       1523     8603                   167.38
## OTHER     748291       7776    37098                    56.53

Results

Which types of events are most harmful with respect to population health?

This information is included in the final_harm_info table:

final_harm_info

##         fatalities injuries
## TORNADO       5661    91407
## WIND          1446    11495
## HEAT          3138     9154
## FLOOD         1523     8603
## STORM          633     6691
## OTHER         3327    17775

Let's plot this info:

par(mfrow = c(2, 1))
barplot(final_harm_info$fatalities, names.arg = rownames(final_harm_info), main = "Fatalities per event type", 
    ylab = "Fatalities", xlab = "Event type")
barplot(final_harm_info$injuries, names.arg = rownames(final_harm_info), main = "Injuries per event type", 
    ylab = "Injuries", xlab = "Event type")

plot of chunk unnamed-chunk-11

Figure 1. Injuries and fatalities associated to event types

Clearly tornados caused more fatalities and injuries than any other even type. Heat cause a great number of fatalities as well, followed by floods and wind. Heat, wind, storm and flood caused approximately the same number of injuries.

Which types of events have the greatest economic consequences?

The relevant information is stored in final_cost_info:

final_cost_info

##                n fatalities injuries total_cost (in billions)
## TORNADO    60700       5661    91407                    58.59
## STORM     124619        633     6691                    73.06
## HURRICANE    286        135     1326                    84.76
## FLOOD      82665       1523     8603                   167.38
## OTHER     748291       7776    37098                    56.53

We can plot this information

barplot(final_cost_info$total_cost, names.arg = rownames(final_cost_info), main = "Cost (in billion dollars) per event type", 
    ylab = "Cost (in billion dollars", xlab = "Event type")

plot of chunk unnamed-chunk-13

Figure 2. Economic costs associated to event types

Floods, hurricanes, storms and tornados, by this order, are the events that caused the most economic damage, more than any other event type.

Acknowledgment

The discussion forums were terribly helpful. Thank you all of you who posted your suggestions there.