This investigation explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The investigation attempts to assess the impact of different event types in both human and economic terms. A number of data issues obscure the results, but the report does provide a useful starting point for a more thorough investigation.
The first step, after loading the required libraries, was to download and read in the raw data.
library(dplyr, warn.conflicts = FALSE)
## Warning: package 'dplyr' was built under R version 3.1.2
library(ggplot2)
#This line required to get https download working inside knitr on Windows 8
setInternet2(TRUE)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.bz2")
handle <- bzfile("StormData.bz2")
raw <- read.csv(handle)
The next step was to multiply the property damage and crop damage figures by the appropriate exponents. Page 12 of the Storm Data Documentation states that the valid entries for the exponent fields are B (billion), M (million) and K (thousand). There is some discussion on the course forums about accepting values outside these, but I feel the safest approach for this analysis is to restrict to those three plus a blank entry meaning no exponent. The strongest candidate for inclusion beyond these, 0, first occurs with a PROPDMG of 150 and a PROPDMGEXP of 0, but the narrative in REMARKS suggests a true damage value of $1500, and the use of H or lower-case values (this only occurs for m) suggests a sloppiness in data entry that makes such records suspect.
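For reference, this kind of suspect record can be pulled out and its narrative read directly. The check below is only a sketch, using the column names from the raw data:
# Sketch: inspect the first record with an exponent of 0 and read its REMARKS
# narrative to judge the intended damage value
raw %>%
    filter(PROPDMGEXP == "0") %>%
    select(REFNUM, EVTYPE, PROPDMG, PROPDMGEXP, REMARKS) %>%
    head(1)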
Restricting to B/M/K/blank values gives us over 99.9% coverage of the dataset.
valid_exps <- c("B","M","K","")
valids <- raw$PROPDMGEXP %in% valid_exps & raw$CROPDMGEXP %in% valid_exps
sum(valids) / length(raw$PROPDMGEXP)
## [1] 0.9995833
The following steps extract the variables relevant to the questions of interest. Surplus information such as location and timing is discarded, the exponent symbols are converted to the numeric factors they represent, and the damage values are scaled accordingly. Human cost is derived by adding fatalities to injuries, with an injury weighted at 10% of a fatality. This weighting is arbitrary, but seems more reasonable than letting an event with 100 injuries and no fatalities count as more extreme than an event with 99 fatalities.
clean <- raw[valids,]
clean <- select(clean, EVTYPE, FATALITIES, INJURIES, PROPDMG,
PROPDMGEXP, CROPDMG, CROPDMGEXP, REFNUM, REMARKS)
exponents <- data.frame(letter = c("B", "M", "K", ""),
factor = c(10^9, 10^6, 10^3, 1))
clean <- left_join(clean, rename(exponents, propfactor = factor),
by = c("PROPDMGEXP" = "letter"))
clean <- left_join(clean, rename(exponents, cropfactor = factor),
by = c("CROPDMGEXP" = "letter"))
clean <- clean %>% mutate(human_cost = FATALITIES + 0.1*INJURIES,
economic_cost = PROPDMG * propfactor +
CROPDMG * cropfactor) %>%
select(Event_type = EVTYPE,
human_cost,
economic_cost,
REFNUM,
REMARKS)
At this point (or even earlier) further data cleaning could be performed, since numerous problems remain with the data, for instance multiple event types that appear to be different spellings of the same thing, or outright misspellings. After reading the course forum I decided it is sufficient to a) uppercase all event types, so that simple differences in capitalisation have no effect, and b) scale event 605943 down by a factor of 1000, as discussed in the original forum post. My interpretation is that the goal of this project is to provide reproducible research, so that a reviewer can see exactly how much cleaning has been performed, rather than to do a perfect cleaning job.
clean$economic_cost[clean$REFNUM == 605943] <- clean$economic_cost[clean$REFNUM == 605943] / 1000
clean$Event_type <- factor(toupper(clean$Event_type))
For ease of analysis, a summarised dataset is produced containing the mean, maximum and total values for human harm and economic cost, grouped by event type. Event types with five or fewer data points are discarded because they should generally be aggregated into larger categories. A fuller investigation would map them to the appropriate places, but that is beyond the scope of this exercise. In particular, note that this removes extremely rare but very damaging event types from the rest of the investigation.
summarised <- clean %>% group_by(Event_type) %>%
select(Event_type, human_cost, economic_cost) %>%
summarise(avg_harm = mean(human_cost),
max_harm = max(human_cost),
sum_harm = sum(human_cost),
avg_cost = mean(economic_cost),
max_cost = max(economic_cost),
sum_cost = sum(economic_cost),
count = n()) %>%
filter(count > 5)
From the processed data, we can now look into the questions of interest.
There are too many event types to plot usefully, so the event types with the five highest values for average harm and for maximum harm are identified and used to select the data to be plotted. A boxplot is then drawn for visualisation.
top_avg_harm <- order(summarised$avg_harm, decreasing = TRUE)[1:5]
top_max_harm <- order(summarised$max_harm, decreasing = TRUE)[1:5]
top_harm <- union(top_avg_harm, top_max_harm)
harm_events <- summarised$Event_type[top_harm]
harm_data <- clean %>% filter(Event_type %in% harm_events)
p <- ggplot(harm_data, aes(Event_type, human_cost))
p + geom_boxplot() + theme(axis.text.x=element_text(angle = -90, hjust = 0))
This plot shows that the single event with the highest human cost was categorised as HEAT. It is also clear that tornadoes as an event type have more high-damage occurrences than most other categories. The summarised data shown below suggests that heat-related event types have the highest average harm to humans per event. One caveat is that hot days which kill nobody are perhaps rarely recorded, whereas many tornadoes are recorded even when they do minimal damage.
summarised[union(top_avg_harm, top_max_harm),c("Event_type","avg_harm","max_harm","count", "sum_harm")]
## Source: local data frame [9 x 5]
##
## Event_type avg_harm max_harm count sum_harm
## 1 EXTREME HEAT 5.0681818 57.0 22 111.5
## 2 HEAT WAVE 2.7986667 33.0 75 209.9
## 3 TSUNAMI 2.2950000 44.9 20 45.9
## 4 UNSEASONABLY WARM AND DRY 2.2307692 29.0 13 29.0
## 5 HURRICANE/TYPHOON 2.1761364 85.0 88 191.5
## 6 HEAT 1.4954368 583.0 767 1147.0
## 7 TORNADO 0.2434392 273.0 60625 14758.5
## 8 ICE STORM 0.1428928 157.8 2005 286.5
## 9 EXCESSIVE HEAT 1.5229440 99.0 1678 2555.5
Similar filtering was performed, identifying the top 5 event types by average and maximum economic cost before plotting.
top_avg_cost <- order(summarised$avg_cost, decreasing = TRUE)[1:5]
top_max_cost <- order(summarised$max_cost, decreasing = TRUE)[1:5]
cost_events <- summarised$Event_type[union(top_avg_cost, top_max_cost)]
cost_data <- clean %>% filter(Event_type %in% cost_events)
p <- ggplot(cost_data, aes(Event_type, economic_cost))
p + geom_boxplot() + theme(axis.text.x=element_text(angle = -90, hjust = 0))
From this plot it can be seen that the highest-cost single event was of type STORM SURGE, but that HURRICANE/TYPHOON contains more events causing greater than $5 billion of damage. Breaking out the summary details below, hurricane-related event types have the greatest average cost per event. We also see that a large number of ICE STORM events have been recorded.
summarised[union(top_avg_cost, top_max_cost),c("Event_type","avg_cost","max_cost","count", "sum_cost")]
## Source: local data frame [8 x 5]
##
## Event_type avg_cost max_cost count sum_cost
## 1 HURRICANE/TYPHOON 817201282 1.6930e+10 88 71913712800
## 2 HURRICANE OPAL 395230750 2.1050e+09 8 3161846000
## 3 STORM SURGE 165990579 3.1300e+10 261 43323541000
## 4 SEVERE THUNDERSTORM 92735385 1.2000e+09 13 1205560000
## 5 HURRICANE 83966833 3.5000e+09 174 14610229010
## 6 RIVER FLOOD 58661298 1.0000e+10 173 10148404500
## 7 TROPICAL STORM 12148169 5.1500e+09 690 8382236550
## 8 ICE STORM 4472338 5.0005e+09 2005 8967037810
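As a quick cross-check of the claim about events above $5 billion, the plotted subset can be counted directly (a sketch only):
# Sketch: count the events in each plotted category that exceed $5 billion
cost_data %>%
    filter(economic_cost > 5e9) %>%
    group_by(Event_type) %>%
    summarise(events_over_5bn = n())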
Comments
To decide how to allocate funds it would be worthwhile to produce a cleaner data set, or to work closely with NOAA to understand the limitations better. As a minimum, the categories with five or fewer events that were discarded in the processing step should be investigated and, where appropriate, associated with one of the more frequently occurring event types. A second step might be to separate tornadoes by severity, as their averages are dragged down by a large number of low-damage events. It is also worth considering the frequency of different types of events in particular regions - northern areas will likely benefit more from measures to combat cold weather, whereas southern areas are more likely to be affected by extreme heat and tropical storms.
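To illustrate the sort of mapping meant here, something like the following could fold variant spellings into canonical categories; the patterns are assumptions chosen for the example rather than a vetted cleaning scheme:
# Illustrative sketch only: fold some variant spellings into canonical types.
# The regular expressions are assumptions, not an exhaustive mapping.
canonicalise <- function(ev) {
    ev <- as.character(ev)
    ev[grepl("TSTM|THUNDERSTORM", ev)] <- "THUNDERSTORM WIND"
    ev[grepl("HURRICANE|TYPHOON", ev)] <- "HURRICANE/TYPHOON"
    factor(ev)
}
# e.g. clean$Event_type <- canonicalise(clean$Event_type)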
Filtering should also be done to put the different event types on a comparable basis: reading around, one learns that tornado data has been collected for longer than data for other sorts of weather events, so the total cost and harm values are almost certainly not comparable across event types. Early data collection efforts may not have recorded every field, for instance recording only fatalities but not economic damage or injuries, and this will have had a significant impact. It may well be worth restricting the investigation to events that have occurred since 1996, because that data includes a wider range of event types; however, there is then no real consideration of severe events that might happen only once every 50 years.
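A sketch of that restriction, assuming the BGN_DATE column in the raw data parses as month/day/year, might look like:
# Sketch: keep only events from 1996 onwards, when a fuller range of event
# types began to be recorded; assumes BGN_DATE is in month/day/year format
recent <- raw %>%
    mutate(year = as.numeric(format(as.Date(BGN_DATE, "%m/%d/%Y"), "%Y"))) %>%
    filter(year >= 1996)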