In this project, I explore the NOAA's storm database, which contains records of major storms and other weather events along with associated fatalities, injuries and damage estimates to crops and property. From an exploration of the severity of individual events I develop a measure of risk associated with each event type, which I feel would be of greater significance for decisions involving distribution of resources to preventing or reducing the harm resulting from weather events.
It is assumed that the storm database has been downloaded into the R working directory. For this project I only read in the variables:
EVTYPEFATALITIESINJURIESPROPDMGCROPDMGFor the variables PROPDMG and CROPDMG I'm not interested in the units, only in that the data may - like number of fatalities or injuries - be interpreted as a measure of severity of the associated event.
setwd('/Users/rambler/Documents/Coursera/Data Science/05_ReproducibleResearch/Project2')
data <- subset(
read.csv('repdata-data-StormData.csv.bz2'),
select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG),
)
# remove "summary" events
# These appear to be summaries of event data already accounted for.
data$EVTYPE <- tolower(data$EVTYPE)
data <- subset(
data,
subset = !grepl("summary", data$EVTYPE)
)
There are several issues with the variable EVTYPE. To illustrate a few, I've written a function that will display the head of means of a selected variable grouped by the raw category in EVTYPE arranged in descending order.
disp.head.by.mean.cat <- function(variable) {
mfbt <- data.frame()
invisible(
lapply(
split(
subset(data, select = c('EVTYPE', variable)),
data$EVTYPE
),
function(x) mfbt <<- rbind(
mfbt,
data.frame(
type = x$EVTYPE[1],
mean.var = mean(x[[variable]]),
instances = nrow(x)
)
)
)
)
head(mfbt[order(-mfbt$mean.var), ], 30)
}
Let us use the function to explore the mean value of FATALITIES by raw category.
disp.head.by.mean.cat('FATALITIES')
type mean.var instances
699 tornadoes, tstm wind, hail 25.000 1
61 cold and snow 14.000 1
708 tropical storm gordon 8.000 1
518 record/excessive heat 5.667 3
126 extreme heat 4.364 22
245 heat wave drought 4.000 1
333 high wind/seas 4.000 1
440 marine mishap 3.500 2
825 winter storms 3.333 3
302 heavy surf and wind 3.000 1
326 high wind and seas 3.000 1
533 rough seas 2.667 3
246 heat waves 2.500 2
526 rip currents/heavy surf 2.500 2
244 heat wave 2.293 75
744 unseasonably warm and dry 2.231 13
369 hurricane opal/high winds 2.000 1
733 tsunami 1.650 20
270 heavy seas 1.500 2
66 cold weather 1.250 4
242 heat 1.222 767
375 hypothermia/exposure 1.167 6
115 excessive heat 1.134 1678
18 avalance 1.000 1
55 coastalstorm 1.000 1
63 cold temperature 1.000 2
65 cold wave 1.000 3
69 cold/winds 1.000 1
85 drowning 1.000 1
123 extended cold 1.000 1
Notice that many of the “categories” of events are actually unique cases (e.g., the event categorised as TORNADOES, TSTM WIND, HAIL with 25 fatalities) or very small, either as a result of rarity or inconsistent labelling techniques. Further, many of the categories, e.g. HEAT, HEAT WAVE and EXCESSIVE HEAT appear to refer to the same category of event, and it might make sense to merge them.
It is still a useful exploration, however, in that it suggests which types of events are particularly severe. Let us similarly explore mean values of INJURIES, PROPDMG and CROPDMG.
disp.head.by.mean.cat('INJURIES')
type mean.var instances
708 tropical storm gordon 43.000 1
805 wild fires 37.500 4
679 thunderstormw 27.000 1
326 high wind and seas 20.000 1
584 snow/high winds 18.000 2
196 glaze/ice storm 15.000 1
245 heat wave drought 15.000 1
822 winter storm high winds 15.000 1
371 hurricane/typhoon 14.489 88
827 winter weather mix 11.333 6
126 extreme heat 7.045 22
472 non-severe wind damage 7.000 1
733 tsunami 6.450 20
825 winter storms 5.667 3
695 tornado f2 5.333 3
119 excessive rainfall 5.250 4
795 waterspout/tornado 5.250 8
244 heat wave 5.053 75
194 glaze 5.023 43
703 torrential rainfall 4.000 1
115 excessive heat 3.889 1678
242 heat 2.738 767
452 mixed precip 2.600 10
440 marine mishap 2.500 2
376 ice 2.246 61
282 heavy snow shower 2.000 1
294 heavy snow/ice 2.000 5
355 high winds/snow 2.000 3
388 ice storm/flash flood 2.000 1
437 marine accident 2.000 1
disp.head.by.mean.cat('PROPDMG')
type mean.var instances
47 coastal erosion 766.0 1
254 heavy rain and flood 600.0 1
527 river and stream flood 600.0 2
35 blizzard/winter storm 500.0 1
142 flash flood/ 500.0 1
150 flash flooding/thunderstorm wi 500.0 1
165 flood/river flood 500.0 1
187 frost\\freeze 500.0 1
264 heavy rain/snow 500.0 1
299 heavy snow/winter storm 500.0 1
333 high wind/seas 500.0 1
367 hurricane gordon 500.0 1
548 sleet/ice storm 500.0 1
568 snow and ice storm 500.0 1
581 snow/cold 500.0 2
583 snow/heavy snow 500.0 1
708 tropical storm gordon 500.0 1
696 tornado f3 362.5 2
169 floods 335.0 3
817 wind storm 300.0 1
401 landslump 285.0 2
52 coastal surge 250.0 2
237 hail/winds 250.0 2
366 hurricane felix 250.0 2
149 flash flooding/flood 218.8 8
694 tornado f1 217.9 4
685 thundertorm winds 202.0 3
709 tropical storm jerry 201.3 3
695 tornado f2 200.3 3
1 high surf advisory 200.0 1
disp.head.by.mean.cat('CROPDMG')
type mean.var instances
105 dust storm/high winds 500.00 1
172 forest fires 500.00 1
708 tropical storm gordon 500.00 1
352 high winds/cold 401.00 5
366 hurricane felix 250.00 2
825 winter storms 166.67 3
121 excessive wetness 142.00 1
735 typhoon 75.00 11
62 cold and wet conditions 66.00 1
809 wildfires 62.50 8
371 hurricane/typhoon 54.53 88
75 damaging freeze 53.26 8
245 heat wave drought 50.00 1
616 thunderstorm hail 50.00 1
484 rain 46.88 16
529 river flooding 41.79 29
269 heavy rains/flooding 39.22 9
551 small hail 32.68 53
362 hurricane 30.69 174
167 flooding 28.01 120
145 flash flood/flood 25.23 22
237 hail/winds 25.02 2
221 hail 150 25.00 2
707 tropical storm dean 25.00 2
149 flash flooding/flood 21.88 8
268 heavy rains 21.54 26
108 early frost 21.00 2
365 hurricane erin 20.86 7
528 river flood 20.17 173
292 heavy snow/high winds & flood 20.00 1
Based on the output above, let us create a new variable cat (category) that is a bit more in line with the directives (p. 6) with respect to categorising events. The following code assigns each event a category based on pattern matching on the value stored in EVTYPE.
I generally stuck to the directives (p. 6). See comments for exceptions and notes.
# *mudslide changed name to debris flow
# *instance of "devil" spelled incorrectly
# cold/wind chill and extreme cold/wind chill merged:
# the deaths/injuries/damage appear to be very similarly sustained
# strong high thunderstorm etc. winds to "Strong Wind"
# issues: mutually exclusive treatment - some events of more than one
# type categorised arbitrarily as one or the other by the order
# of the pattern matching below
data$category <- sapply(
data$EVTYPE,
function(x) {
if (grepl("astronomical", x)) "Astronomical High Tide"
else if (grepl("avalanche", x)) "Avalanche"
else if (grepl("blizzard", x)) "Blizzard"
else if (grepl("coastal", x) & grepl("flood", x)) "Coastal Flood"
else if (grepl("cold/wind", x) | grepl("wind chill", x)) "Cold/Wind Chill"
else if (grepl("debris", x) | grepl("slide", x)) "Debris Flow"
else if (grepl("dense fog", x)) "Dense Fog"
else if (grepl("smoke", x)) "Dense Smoke"
else if (grepl("drought", x)) "Drought"
else if (grepl("dust dev", x)) "Dust Devil"
else if (grepl("dust storm", x)) "Dust Storm"
else if (grepl("heat", x)) "Heat"
else if (grepl("flash", x)) "Flash Flood"
else if (grepl("lakeshore", x)) "Lakeshore Flood"
else if (grepl("flood", x)) "Flood"
else if (grepl("frost", x) | grepl("freeze", x)) "Frost/Freeze"
else if (grepl("funnel", x)) "Funnel Cloud"
else if (grepl("freezing", x) | grepl("sleet", x)) "Freezing Precipitation"
else if (grepl("hail", x)) "Hail"
else if (grepl("rain", x)) "Heavy Rain"
else if (grepl("snow", x)) "Heavy Snow"
else if (grepl("surf", x)) "High Surf"
else if (grepl("wind", x)) "Strong Wind"
else if (grepl("hurricane", x) | grepl("typhoon", x)) "Hurricane"
else if (grepl("tornado", x)) "Tornado"
else if (grepl("waterspout", x)) "Waterspout"
else if (grepl("lightning", x)) "Lightning"
else if (grepl("seiche", x)) "Seiche"
else if (grepl("storm surge", x)) "Storm Surge/Tide"
else if (grepl("tsunami", x)) "Tsunami"
else if (grepl("volcanic ash", x)) "Volcanic Ash"
else if (grepl("fire", x)) "Wildfire"
else if (grepl("tropical storm", x)) "Tropical Storm"
else if (grepl("depression", x)) "Tropical Depression"
else if (grepl("rip", x)) "Rip Current"
else if (grepl("ice storm", x)) "Ice Storm"
else if (grepl("winter storm", x)) "Winter Storm"
else if (grepl("winter weather", x)) "Winter Weather"
else "Not Classified"
}
)
Based on the above categorisation, let us create a new data set that stores the mean values by category for the respective variables:
means.by.category <- data.frame()
invisible(
lapply(
split(data, data$category),
function(x) {
means.by.category <<- rbind(
means.by.category,
data.frame(
cat = x$category[1],
freq = nrow(x),
fat = mean(x$FATALITIES),
inj = mean(x$INJURIES),
prop = mean(x$PROPDMG),
crop = mean(x$CROPDMG)
)
)
}
)
)
# Place "Not Classified" category in last row...
# These are events that are not categorised by the pattern matching above.
means.by.category <- rbind(
means.by.category[means.by.category$cat != "Not Classified", ],
means.by.category[means.by.category$cat == "Not Classified", ]
)
means.by.category
cat freq fat inj prop crop
1 Astronomical High Tide 277 0.0000000 0.0000000 4.52527 0.000000
2 Avalanche 387 0.5788114 0.4418605 4.20904 0.000000
3 Blizzard 2743 0.0368210 0.2934743 9.48541 0.062705
4 Coastal Flood 852 0.0070423 0.0082160 21.24529 0.065728
5 Cold/Wind Chill 1584 0.1395202 0.0227273 2.96338 0.410354
6 Debris Flow 653 0.0673813 0.0842266 31.00925 0.056662
7 Dense Fog 1296 0.0138889 0.2638889 6.34680 0.000000
8 Dense Smoke 21 0.0000000 0.0000000 4.76190 0.000000
9 Drought 2512 0.0023885 0.0075637 1.71141 13.516879
10 Dust Devil 151 0.0132450 0.2847682 4.76245 0.000000
11 Dust Storm 429 0.0512821 1.0256410 11.88695 4.898601
12 Flash Flood 55675 0.0185900 0.0323664 26.48088 3.349514
13 Flood 26176 0.0184902 0.2595889 36.04845 6.798843
14 Freezing Precipitation 504 0.0257937 0.0753968 9.42514 0.000000
15 Frost/Freeze 1512 0.0013228 0.0019841 1.12270 6.375933
16 Funnel Cloud 6991 0.0000000 0.0004291 0.02855 0.000000
17 Hail 290399 0.0001550 0.0050517 2.40814 2.017764
18 Heat 2631 1.1904219 3.5001900 1.15274 0.538731
19 Heavy Rain 11875 0.0088421 0.0237474 4.55311 1.049036
20 Heavy Snow 17620 0.0093076 0.0660045 8.59663 0.123480
21 High Surf 1064 0.1560150 0.2312030 6.06214 0.000000
22 Hurricane 298 0.4463087 4.4731544 84.51862 39.019430
23 Ice Storm 2007 0.0443448 0.9915296 32.88524 0.841530
24 Lakeshore Flood 23 0.0000000 0.0000000 2.06522 0.000000
25 Lightning 15761 0.0518368 0.3319586 38.28353 0.227182
27 Rip Current 774 0.7390181 0.6834625 0.21059 0.000000
28 Seiche 21 0.0000000 0.0000000 46.66667 0.000000
29 Storm Surge/Tide 409 0.0586797 0.1051345 63.98665 2.090465
30 Strong Wind 362120 0.0032890 0.0312686 8.61428 0.602366
31 Tornado 60698 0.0928531 1.5059310 52.97947 1.647942
32 Tropical Depression 60 0.0000000 0.0000000 12.30000 0.000000
33 Tropical Storm 697 0.0946915 0.5494978 71.63943 9.275638
34 Tsunami 20 1.6500000 6.4500000 45.26500 1.000000
35 Volcanic Ash 27 0.0000000 0.0000000 18.51852 0.000000
36 Waterspout 3845 0.0007802 0.0075423 2.48744 0.000000
37 Wildfire 4239 0.0212314 0.3793347 29.53958 2.256603
38 Winter Storm 11436 0.0188877 0.1169990 11.64923 0.216771
39 Winter Weather 8155 0.0074801 0.0659718 2.07338 0.001839
26 Not Classified 6279 0.0675267 0.2739290 9.73426 1.641263
I wrote the following function to draw a bar plot of the means for the n greatest categories.
plot.top.n <- function(variable, n, ...) {
order.by.variable <- means.by.category[order(-means.by.category[[variable]]), ]
r <- nrow(order.by.variable)
order.by.variable <- rbind(
order.by.variable[1:n,],
data.frame(
cat = "Other",
freq = sum(order.by.variable$freq[(n + 1):r]),
fat = sum(order.by.variable$fat[(n + 1):r]),
inj = sum(order.by.variable$inj[(n + 1):r]),
prop = sum(order.by.variable$prop[(n + 1):r]),
crop = sum(order.by.variable$crop[(n + 1):r])
)
)
barplot(
order.by.variable[[variable]][order.by.variable$cat != "Other"],
names.arg = order.by.variable$cat[order.by.variable$cat != "Other"],
...
)
}
We can now use this function to draw the bar plots showing the five events with the greatest nett severity in each category.
par(mfrow = c(2, 2))
plot.top.n('fat', 5, main = "Fatalities")
plot.top.n('inj', 5, main = "Injuries")
plot.top.n('prop', 5, main = "Property damage")
plot.top.n('crop', 5, main = "Crop damage")
The graphical summary above might not be most appropriate for decisions regarding distribution of resources. For example, even though The category Tsunami has a high associated fatality rate, it is actually a relatively rare event; it might make more sense - that is, save more lives in the long term - to distribute resources to the reduction of fatalities during less deadly - but more frequent - events.
Thus, I feel that the risk - by which I mean the product of the severity and the relative frequency - associated with each event would be a more important measure of how resources might be directed than raw severity.
In order to provide a summary that provides some indication of both frequency and severity of events, I wrote the following function to position each event type according to its log-severity and log-frequency (the logarithmic scale was used because of the great range of values in each variable). Categories that appear near the top right corner may then be, in terms of both frequency and severity, the most appropriate candidates for our attention.
plot.risk.n <- function(variable, ...) {
cond <- means.by.category[[variable]] != 0 & means.by.category$freq != 0
x <- log(means.by.category[[variable]][cond])
y <- log(means.by.category$freq[cond] / sum(means.by.category$freq[cond]))
cats <- (means.by.category$cat[cond])
min.x <- min(x)
max.x <- max(x)
space.x <- 0.1 * (max.x - min.x)
min.y <- min(y)
max.y <- max(y)
space.y <- 0.1 * (max.y - min.y)
dists <- sapply(
1: length(x),
function(i) {
x0 <- x[i]
y0 <- y[i]
sqrt((x0 - min.x)^2 + (y0 - min.y)^2)
}
)
max.dist <- max(dists)
min.dist <- min(dists)
my.col.scheme <- sapply(dists, function(d) {
p <- (d - min.dist) / (max.dist - min.dist)
rgb(p, 0.4 * (1 - p), 0.2 * (1 - p))
})
my.size.scheme <- sapply(dists, function(d) {
p <- (d - min.dist) / (max.dist - min.dist)
1 + 0.5 * p
})
plot(
x,
y,
type = "n",
bty = "n",
xlab = "Log Mean Severity",
ylab = "Log Frequency",
xlim = c(min.x - space.x, max.x + space.x),
ylim = c(min.y - space.y, max.y + space.y),
...
)
text(
x,
y,
cats,
col = my.col.scheme,
cex = my.size.scheme
)
}
Here are the associated plots. I've included some redundancy in that the position (top-right greater), colour (red greater) and relative size of the category names are selected to convey severity in terms of risk associated with each event.
par(mfrow = c(2, 2))
plot.risk.n('fat', main = "Fatalities")
plot.risk.n('inj', main = "Injuries")
plot.risk.n('prop', main = "Property damage")
plot.risk.n('crop', main = "Crop damage")
Notice, to return to the previous example, that even though Tsunamis are the deadliest event types in terms of net fatality per event, the risk of fatality associated with Tsunami events is not the greatest. Tornado, Strong Wind and Heat appear to have greater associated risks.
To demonstrate this more clearly, let us reproduce the plot in figure 1, but this time with associated risk rather than raw severity.
plot.top.risk.n <- function(variable, n, ...) {
s <- sum(means.by.category$freq)
x <- -means.by.category[[variable]] * means.by.category$freq / s
order.by.variable <- means.by.category[
order(
x
), ]
r <- nrow(order.by.variable)
order.by.variable <- rbind(
order.by.variable[1:n,],
data.frame(
cat = "Other",
freq = sum(order.by.variable$freq[(n + 1):r]),
fat = sum(order.by.variable$fat[(n + 1):r]),
inj = sum(order.by.variable$inj[(n + 1):r]),
prop = sum(order.by.variable$prop[(n + 1):r]),
crop = sum(order.by.variable$crop[(n + 1):r])
)
)
barplot(
order.by.variable[[variable]][order.by.variable$cat != "Other"]
* order.by.variable$freq[order.by.variable$cat != "Other"] / s,
names.arg = order.by.variable$cat[order.by.variable$cat != "Other"],
...
)
}
par(mfrow = c(2, 2))
plot.top.risk.n('fat', 5, main = "Fatalities")
plot.top.risk.n('inj', 5, main = "Injuries")
plot.top.risk.n('prop', 5, main = "Property damage")
plot.top.risk.n('crop', 5, main = "Crop damage")
Although figure 3 does not necessarily show the five most severe events in each category on a per-event basis (as does figure 1), it does show the five events that would produce the best pay-off (in terms of lives saved, injuries prevented, costs avoided in the long term) if we were able to somehow curb the mean severity of the event. Thus, in terms of policies regarding the allocation of resources to prevent or reduce fatalities, injuries or damage to property and crops, the results from figures 2 and 3 would be of greater importance.
Once again, these are not necessarily the most deadly or dangerous per event, but can be expected to produce the most fatalities or injuries over the long term.
TornadoHeatStrong WindFlash FloodLightningOnce again, these events are not necessarily the most damaging per event, but can be expected to produce the most damage over the long term.
TornadoStrong WindFlash FloodFloodHail