Severe weather event data from the NOAA storm database is examined to find which event types have the greatest impact on public health and the economy in the US. The most relevant event types are identified.
The data is supplied as a .bz2 archive. It is loaded with read.csv(), which reads bzip2-compressed files directly.
For this analysis, the relevant fields are the event type, the number of fatalities and injuries, and the property and crop damage figures with their magnitude symbols. To make better use of memory, the data is narrowed down to these fields.
csv = read.csv("repdata_data_StormData.csv.bz2",
header = TRUE,
stringsAsFactors = FALSE)
csv <- csv[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# since there are only
length(unique(csv$EVTYPE))
## [1] 985
# 985 values, it makes sense to convert EVTYPE to a factor
csv$EVTYPE <- factor(csv$EVTYPE)
nrow(csv)
## [1] 902297
# there are a lot (> 900k) of records; this may need to be taken into account later
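The loading step can be tried on a small file without the full archive. The following is a minimal sketch, assuming a toy data frame with a hypothetical REMARKS column standing in for the unused fields. It also shows one possible way to drop columns while parsing, through the colClasses argument of read.csv(), instead of subsetting after the whole file has been read.

```r
# Write a tiny CSV into a bzip2-compressed temp file (toy data, not NOAA records)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("FLOOD", "TORNADO"),
                     FATALITIES = c(1, 3),
                     REMARKS = c("a", "b")),   # hypothetical unused column
          con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects bzip2 compression
toy <- read.csv(tmp, stringsAsFactors = FALSE)

# colClasses = "NULL" skips a column during parsing, so it is never stored
slim <- read.csv(tmp, stringsAsFactors = FALSE,
                 colClasses = c("character", "numeric", "NULL"))
names(slim)
```

Skipping columns at read time only pays off when the unused columns are large; for a one-off analysis, subsetting after loading is equally valid.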
The data provides fatality and injury counts. Combining them into a single measure is not straightforward, so they are examined separately. A top 10 list is constructed for each measure and displayed on a chart.
injuries.by.event = aggregate(csv$INJURIES,
by = list(type = csv$EVTYPE),
FUN = sum)
injuries.by.event = injuries.by.event[order(injuries.by.event$x,
decreasing = TRUE), ]
total.injuries = sum(injuries.by.event$x)
fatalities.by.event = aggregate(csv$FATALITIES,
by = list(type = csv$EVTYPE),
FUN = sum)
fatalities.by.event = fatalities.by.event[order(fatalities.by.event$x,
decreasing = TRUE), ]
total.fatalities = sum(fatalities.by.event$x)
par(las = 2, mar = c(2, 12, 2, 2), mfrow = c(2, 1))
barplot(fatalities.by.event$x[1:10],
names.arg =
paste0(fatalities.by.event$type[1:10]),
horiz = TRUE,
xaxt = "n",
main = "Major causes of fatalities")
text(fatalities.by.event$x[1] / 2, 0.7,
paste0(format(fatalities.by.event$x[1] / total.fatalities * 100,
digits = 4), " %")
)
barplot(injuries.by.event$x[1:10],
names.arg =
paste0(injuries.by.event$type[1:10]),
horiz = TRUE,
xaxt = "n",
main = "Major causes of injuries"
)
text(injuries.by.event$x[1] / 2, 0.7,
paste0(format(injuries.by.event$x[1] / total.injuries * 100,
digits = 4), " %")
)
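The aggregate-then-sort pattern used for both lists can be checked by hand on a few rows. This is a minimal sketch on made-up counts, not on the storm data. The share column is an illustrative addition that mirrors the percentage printed on the top bar.

```r
# Toy records: event type and injury count (made-up values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  INJURIES = c(10, 2, 5, 3, 0))

# Sum the counts per event type; the result has columns 'type' and 'x'
by.event <- aggregate(toy$INJURIES, by = list(type = toy$EVTYPE), FUN = sum)

# Sort so that the largest total comes first
by.event <- by.event[order(by.event$x, decreasing = TRUE), ]

# Share of each event type in the overall total, in percent
by.event$share <- by.event$x / sum(by.event$x) * 100

head(by.event, 10)
```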
To examine the economic effects, crop and property damages are added together and the total is broken down by event type. The 10 event types with the largest totals are shown on a chart.
# Examine what exponent symbols are in use
unique(csv$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
# "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(csv$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
# "" "M" "K" "m" "B" "?" "0" "k" "2"
# utility to create multipliers from the observed exponent characters.
exp.to.mult = function(exp.char) {
exps = c("m", "", "K", "k", "M", "B", "0", "1", "2", "3", "4", "5", "6", "7", "8",
"9", "?", "+", "-", "H", "h")
vals = c(6, 0, 3, 3, 6, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 0, 0, 0, 2, 2)
# "k" (seen in CROPDMGEXP) is mapped to thousands like "K";
# a symbol missing from exps would yield NA
# indexes are looked up more quickly when converting to a factor
# than via the 'names(...) <- ...' lookup
fac = factor(exp.char, levels = exps)
return(10 ^ vals[as.integer(fac)])
}
# calculate total economic damage and attach the column
ECONDMG = csv$PROPDMG * exp.to.mult(csv$PROPDMGEXP) +
csv$CROPDMG * exp.to.mult(csv$CROPDMGEXP)
csv$ECONDMG <- ECONDMG
# create breakdown of total economic damage over US by event type
econ.damage.by.event =
aggregate(csv$ECONDMG, by = list(EVTTYPE = csv$EVTYPE), FUN = sum)
econ.damage.by.event =
econ.damage.by.event[order(econ.damage.by.event$x, decreasing = TRUE), ]
par(mfrow = c(1, 1), mar = c(2, 12, 2, 2), las = 2)
total.econ.damage = sum(econ.damage.by.event$x, na.rm = TRUE)
barplot(econ.damage.by.event[1:10, ]$x, horiz = TRUE, xaxt = "n",
names.arg = econ.damage.by.event[1:10, ]$EVTTYPE,
main = "Economic damage by event type")
text(econ.damage.by.event$x[1] / 2, 0.7,
paste0(format(econ.damage.by.event$x[1] / total.econ.damage * 100,
digits = 4), " %")
)
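The conversion from a damage figure and its magnitude symbol to a dollar amount can be illustrated with a few values. This is a minimal sketch, not the function used in the analysis: to.dollars, symbols and powers are hypothetical names, the table covers only a subset of the symbols, and match() stands in for the factor-based lookup.

```r
# Hypothetical lookup table: symbol -> power of ten (subset, for illustration)
symbols <- c("", "H", "K", "M", "B")
powers  <- c(0,  2,   3,   6,   9)

to.dollars <- function(amount, symbol) {
  # toupper() folds "k", "m" and "h" onto their upper-case forms
  idx <- match(toupper(symbol), symbols)
  # symbols absent from the table give NA instead of a silent guess
  amount * 10 ^ powers[idx]
}

# 25 K, 2.5 m, 1.2 B, and an unknown symbol
res <- to.dollars(c(25, 2.5, 1.2, 7), c("K", "m", "B", "?"))
res
```

Returning NA for unknown symbols makes unmapped rows visible; they can then be counted with sum(is.na(res)) before deciding how to treat them.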
When the data is entered, the person performing the entry is encouraged to estimate these numbers if they are missing. For this reason the numbers may be approximate, and the exact order of the lists may be sensitive to this. Even so, the two lists are clearly similar: the six major contributors in one list all appear in the other top 10.
The members common to both lists are: tornado, tstm wind, flood, excessive heat, lightning, heat, flash flood. “Tstm wind” is probably the same as thunderstorm wind, but this makes no relevant difference to the key findings. The most significant public health impact is clearly associated with tornadoes: 65% of all weather-event-related injuries and 37.19% of the fatalities in the examined data are attributed to them.
Floods are the leading cause of economic damage among the examined events, accounting for 32.96% of the overall losses. However, hurricanes/typhoons, tornadoes, and storm surges are also comparably significant hazards.
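The overlap between two ranked lists can be computed rather than read off the charts. This is a minimal sketch using intersect() on two short placeholder vectors. The entries are illustrative and are not the computed top 10 lists.

```r
# Placeholder rankings (illustrative values, not the actual results)
top.fatal <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING")
top.injur <- c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING")

# Event types present in both lists, in the order of the first list
common <- intersect(top.fatal, top.injur)
common
```

With the data frames built earlier, the same call could presumably be applied to the type columns of the first 10 rows of each list.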