The goal of the present report is to address two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To do this, we look into the NOAA Storm Database and follow the steps:
[data reduction] classify each event in the database as being one (or a combination) of the following: “TORNADO”, “LIGHTNING”, “WIND”, “HAIL”, “COLD”, “HEAT”, “FLOOD”, “STORM”, “TSUNAMI”, “HURRICANE”, “ICE”, “AVALANCHE”, “DROUGHT”, “RAIN”, “FOG”, “OTHER”
[data analysis: fatalities and injuries] identify which event types caused more fatalities and injuries
[data analysis: economic damage] idenfity which event types caused more economic damage
Limitations: The data reduction step is crucial and could be improved upon. The data analysis was not exaustive: only the most relevant events were explicitly looked at, all the others were combined; a more detailed analysis is possible.
We start by loading the data and seing what it looks like:
setInternet2(TRUE)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"data.csv")
data = read.csv("data.csv")
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Notice, in particular, that there is a great variety of event types:
str(data$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
Before we continue, it is important to reduce the number of different event types considered. To reduce the data, we will employ a two step strategy:
For the first step, we will simply consider the following events:
events = c("TORNADO", "LIGHTNING", "WIND", "HAIL", "COLD", "HEAT", "FLOOD",
"STORM", "TSUNAMI", "HURRICANE", "ICE", "AVALANCHE", "DROUGHT", "RAIN",
"FOG", "OTHER")
Definitely, we could do better. The way this list is defined is a crucial step in the current analysis and it is also a serious limitation. Bare in mind that:
We will now create a new table (is_related_to_event, you can check it below) with the categorization of the events in the original .csv file. The variables of the table will be the events we defined. The rows will be the same as in the original database. So, for each entry of the original database, we will classify it as being one (or mode) of the event types defined by us:
is_related_to_event = matrix(nrow = nrow(data), ncol = length(events))
for (j in seq(1:length(events) - 1)) {
is_related_to_event[, j] = grepl(toupper(events[j]), data$EVTYPE)
}
is_related_to_event[, length(events)] = rowSums(is_related_to_event) == 0
colnames(is_related_to_event) = events
We thus obtain the following matrix (where you can see that the first six events are all tornados):
head(is_related_to_event)
## TORNADO LIGHTNING WIND HAIL COLD HEAT FLOOD STORM TSUNAMI
## [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## HURRICANE ICE AVALANCHE DROUGHT RAIN FOG OTHER
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Right, we know, for each entry of the original database we downloaded, what events it is related to. We can now build our final table containing the information we are interested in analysing. Notably, for each event type, we want to know
This table is rather easy to build - and we can see how it looks like:
final_info = matrix(nrow = length(events), ncol = 4)
rownames(final_info) = events
colnames(final_info) = c("n", "fatalities", "injuries", "total_cost (in billions)")
exponents = data["PROPDMGEXP"]
cost = matrix(nrow = nrow(data), ncol = 1)
cost = 0
cost[exponents == "K" | exponents == "k"] = 1000 * data$PROPDMG[exponents ==
"K" | exponents == "k"]
cost[exponents == "M" | exponents == "m"] = 1e+06 * data$PROPDMG[exponents ==
"M" | exponents == "m"]
cost[exponents == "B" | exponents == "b"] = 1e+09 * data$PROPDMG[exponents ==
"B" | exponents == "b"]
# all costs now in dollars
for (j in seq(1:length(events))) {
final_info[j, 1] = sum(is_related_to_event[, j])
final_info[j, 2] = sum(data$FATALITIES[is_related_to_event[, j] == TRUE])
final_info[j, 3] = sum(data$INJURIES[is_related_to_event[, j] == TRUE])
final_info[j, 4] = round(sum(cost[is_related_to_event[, j] == TRUE], na.rm = TRUE)/1e+09,
digits = 2)
}
final_info = as.data.frame(final_info)
final_info
## n fatalities injuries total_cost (in billions)
## TORNADO 60700 5661 91407 58.59
## LIGHTNING 15776 817 5232 0.94
## WIND 364847 1446 11495 17.74
## HAIL 290394 45 1467 17.62
## COLD 2431 443 320 0.25
## HEAT 2645 3138 9154 0.02
## FLOOD 82665 1523 8603 167.38
## STORM 124619 633 6691 73.06
## TSUNAMI 20 33 129 0.14
## HURRICANE 286 135 1326 84.76
## ICE 2185 102 2166 3.97
## AVALANCHE 387 224 171 0.01
## DROUGHT 2512 6 19 1.05
## RAIN 12202 114 301 3.26
## FOG 1880 81 1077 0.03
## OTHER 53012 1327 5567 11.50
In the results section, We will want look at the number of fatalities and injuries caused by the events. Clearly not all events have the same impact. To focus on those that have the larger impacts, we will only consider events that caused 3000 or more fatalities or 6000 or more injuries:
final_harm_info = final_info[!(rownames(final_info) == "OTHER") & (final_info$fatalities >
2999 | final_info$injuries > 5999), c(2, 3)]
final_harm_info[nrow(final_harm_info) + 1, ] = colSums(final_info[!(!(rownames(final_info) ==
"OTHER") & (final_info$fatalities > 2999 | final_info$injuries > 5999)),
c(2, 3)])
rownames(final_harm_info) = c(rownames(final_harm_info[seq(1:(nrow(final_harm_info) -
1)), ]), "OTHER")
final_harm_info
## fatalities injuries
## TORNADO 5661 91407
## WIND 1446 11495
## HEAT 3138 9154
## FLOOD 1523 8603
## STORM 633 6691
## OTHER 3327 17775
We will also want to look at the total costs related to each event type, which is easier to do if we order our table
final_info = final_info[order(final_info$total_cost), ]
final_info
## n fatalities injuries total_cost (in billions)
## AVALANCHE 387 224 171 0.01
## HEAT 2645 3138 9154 0.02
## FOG 1880 81 1077 0.03
## TSUNAMI 20 33 129 0.14
## COLD 2431 443 320 0.25
## LIGHTNING 15776 817 5232 0.94
## DROUGHT 2512 6 19 1.05
## RAIN 12202 114 301 3.26
## ICE 2185 102 2166 3.97
## OTHER 53012 1327 5567 11.50
## HAIL 290394 45 1467 17.62
## WIND 364847 1446 11495 17.74
## TORNADO 60700 5661 91407 58.59
## STORM 124619 633 6691 73.06
## HURRICANE 286 135 1326 84.76
## FLOOD 82665 1523 8603 167.38
Clearly some events cause much more economic consequences than others. Maybe we should just look at those that cause the greater consequences. Events with associated costs below 20 billion dollars will now just go to our “other” column:
final_cost_info = final_info[final_info$total_cost > 20, ]
final_cost_info[nrow(final_cost_info) + 1, ] = colSums(final_info[final_info$total_cost <
20, ])
rownames(final_cost_info) = c(rownames(final_cost_info[seq(1:(nrow(final_cost_info) -
1)), ]), "OTHER")
final_cost_info
## n fatalities injuries total_cost (in billions)
## TORNADO 60700 5661 91407 58.59
## STORM 124619 633 6691 73.06
## HURRICANE 286 135 1326 84.76
## FLOOD 82665 1523 8603 167.38
## OTHER 748291 7776 37098 56.53
This information is included in the final_harm_info table:
final_harm_info
## fatalities injuries
## TORNADO 5661 91407
## WIND 1446 11495
## HEAT 3138 9154
## FLOOD 1523 8603
## STORM 633 6691
## OTHER 3327 17775
Let's plot this info:
par(mfrow = c(2, 1))
barplot(final_harm_info$fatalities, names.arg = rownames(final_harm_info), main = "Fatalities per event type",
ylab = "Fatalities", xlab = "Event type")
barplot(final_harm_info$injuries, names.arg = rownames(final_harm_info), main = "Injuries per event type",
ylab = "Injuries", xlab = "Event type")
Figure 1. Injuries and fatalities associated to event types
Clearly tornados caused more fatalities and injuries than any other even type. Heat cause a great number of fatalities as well, followed by floods and wind. Heat, wind, storm and flood caused approximately the same number of injuries.
The relevant information is stored in final_cost_info:
final_cost_info
## n fatalities injuries total_cost (in billions)
## TORNADO 60700 5661 91407 58.59
## STORM 124619 633 6691 73.06
## HURRICANE 286 135 1326 84.76
## FLOOD 82665 1523 8603 167.38
## OTHER 748291 7776 37098 56.53
We can plot this information
barplot(final_cost_info$total_cost, names.arg = rownames(final_cost_info), main = "Cost (in billion dollars) per event type",
ylab = "Cost (in billion dollars", xlab = "Event type")
Figure 2. Economic costs associated to event types
Floods, hurricanes, storms and tornados, by this order, are the events that caused the most economic damage, more than any other event type.
The discussion forums were terribly helpful. Thank you all of you who posted your suggestions there.