Using the ‘Storm Data’ dataset from the United States’ National Oceanic and Atmospheric Administration (NOAA), a few simple metrics are calculated to address two questions. The first question concerns which types of weather events are most harmful to public health. Fatality statistics are analyzed to provide a ranking of the top 20 most fatal event types. Injury statistics are provided for these same most fatal event types, and it is shown that addressing top fatal events would also largely address top injury events. In the second and related question, property and crop damage statics are analyzed to determine which weather event types are most costly economically.
library(dplyr,warn.conflicts=FALSE)
library(ggplot2,warn.conflicts=FALSE)
library(data.table,warn.conflicts=FALSE)
library(knitr,warn.conflicts=FALSE)
The original dataset is supplied as a bzip2 compressed CSV file. fread() from the data.table package is used to decompress and read the rather large file in one step. The file location is expected to be in the current working directory.
dplyr package functions are used to summarize target metrics according to weather event type. Tables of results are made using knitr::kabble, and plots are made with ggplot2.
Data transformations: 1. Group by event type and summarize by either fatality or injury - these are the main statistics used to rank weather event types by personal harm measures 2. Bind top-50 fatal event type summaries and top-50 injury event type summaries into a dataframe used to plot cumulative percentages of totals 3. Group by event type and summarize by either property damage or crop damage - these are the main statistics used to rank weather event types by economic measures. The monetary cost is in the original data by mixed units of thousands and millions of dollars, indicated by either a K or M modifier variable. This is taken into account when calculating the overall costing results.
# fread from data.table is faster than read.csv on large files.
# --> see https://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/
# there is some latency to extract the zip file.
# honestly, I think one could check and then unzip only if the csv file wasn't present,
# but the instrctors would probably frown on it...
din <- as_tibble(fread("./repdata_data_StormData.csv.bz2"))
Question 1: Across the United States, which types of events (as indicated in the EVTYPE) are most harmful with respect to population health?
We have data on 2 categories of harmful: fatalities and injuries. The data will be grouped on event type and then summed separately over the harm categories. The fatality and injury tables will be merged on event type. Under the assumption that injury is a better outcome than death, the table will be sorted by decreasing fatalities. The top 10 most fatal event types will be noted, along with the corresponfing number of injuries. The ranking of each event by injury count will also be noted as a means of identifying skew between the fatality and injury measures.
rank_fatal <- group_by(din, EVTYPE) %>% summarize(fatalities=sum(FATALITIES,na.rm=TRUE)) %>% arrange(-fatalities)
rank_injury <- group_by(din, EVTYPE) %>% summarize(injuries=sum(INJURIES,na.rm=TRUE)) %>% arrange(-injuries)
tbl_harm <- full_join(rank_fatal,rank_injury,by="EVTYPE",all=TRUE) %>% arrange(-fatalities)
tbl_harm$fatality.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_fatal$EVTYPE==x)})
tbl_harm$injury.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_injury$EVTYPE==x)})
tbl_harm <- rename(tbl_harm,event.type=EVTYPE)
t10_harm <- head(tbl_harm,10)
kable(t10_harm,caption="Top 10 Most Fatal Weather Event Types",
col.names = c("Event Type", "Fatalities", "Injuries", "Rank by Fatalities", "Rank by Injuries"))
| Event Type | Fatalities | Injuries | Rank by Fatalities | Rank by Injuries |
|---|---|---|---|---|
| TORNADO | 5633 | 91346 | 1 | 1 |
| EXCESSIVE HEAT | 1903 | 6525 | 2 | 4 |
| FLASH FLOOD | 978 | 1777 | 3 | 8 |
| HEAT | 937 | 2100 | 4 | 6 |
| LIGHTNING | 816 | 5230 | 5 | 5 |
| TSTM WIND | 504 | 6957 | 6 | 2 |
| FLOOD | 470 | 6789 | 7 | 3 |
| RIP CURRENT | 368 | 232 | 8 | 29 |
| HIGH WIND | 248 | 1137 | 9 | 13 |
| AVALANCHE | 224 | 170 | 10 | 32 |
To help assess overall effectiveness of reducing personal harm by addressing a subset of weather event types, the plot below shows cumulative percentage of addressing the top N event types, separately for fatalities and injuries. The point of dimishing return appears to be around addressing the top 10 events for each type of personal harm. Based on the table above, if efforts are focused on addressing events leading to death, a large percentage of injury cases will automatically be addressed.
Ntop <- 50
top_fatal <- select(rank_fatal, "fatalities") %>% head(Ntop)
top_fatal <- rename(top_fatal,count=fatalities)
top_fatal$rank <- seq(1,Ntop)
top_fatal$harm.type <- as.factor("fatality")
sum_fatal <- sum(rank_fatal$fatalities,na.rm=TRUE)
top_fatal$cum_pct <- cumsum(top_fatal$count)/sum_fatal*100
top_injury <- select(rank_injury, "injuries") %>% head(Ntop)
top_injury <- rename(top_injury,count=injuries)
top_injury$rank <- seq(1,Ntop)
top_injury$harm.type <- as.factor("injury")
sum_injury <- sum(rank_injury$injuries,na.rm=TRUE)
top_injury$cum_pct <- cumsum(top_injury$count)/sum_injury*100
top <- rbind(top_fatal, top_injury)
plt <- ggplot(data=top, aes(x=rank, y=cum_pct, color=top$harm.type)) +
geom_line() + geom_point() +
theme_bw() +
labs(x="N = Rank Within Harm Type",
y="Cumulative Percent of All Observations",
title="Effectiveness of Mitigating Harm by Addressing Top N Weather Event Types",
color="")
print(plt)
Question 2: Across the United States, which types of events have the greatest economic consequences?
dmg <- select(din, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, STATE)
dmg$unipropdmg <- dmg$PROPDMG * ifelse (dmg$PROPDMGEXP == "K", 1000, 1000000)
dmg$unicropdmg <- dmg$CROPDMG * ifelse (dmg$CROPDMGEXP == "K", 1000, 1000000)
rank_prop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unipropdmg,na.rm=TRUE)) %>% arrange(-cost)
rank_crop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unicropdmg,na.rm=TRUE)) %>% arrange(-cost)
all_cost <- sum(rank_prop$cost) + sum(rank_crop$cost)
rank_prop$pct.of.total <- rank_prop$cost / all_cost * 100
rank_crop$pct.of.total <- rank_crop$cost / all_cost * 100
top10_prop <- head(rank_prop,10)
top10_crop <- head(rank_crop,10)
top10_prop$type <- as.factor("Property")
top10_crop$type <- as.factor("Crop")
top_dmg <- rbind(top10_prop, top10_crop) %>% arrange(-cost)
top10_dmg <- head(top_dmg,10)
kable(top10_dmg,caption="Top 10 Most Expensive Weather Event Types",
col.names = c("Event Type", "Cost (USD)", "Percent of Cumulative All-time Costs (USD)", "Damage Type"))
| Event Type | Cost (USD) | Percent of Cumulative All-time Costs (USD) | Damage Type |
|---|---|---|---|
| TORNADO | 51941160480 | 26.479305 | Property |
| FLOOD | 22287209800 | 11.361891 | Property |
| FLASH FLOOD | 15698911510 | 8.003215 | Property |
| HAIL | 14261766720 | 7.270567 | Property |
| DROUGHT | 12474066000 | 6.359207 | Crop |
| THUNDERSTORM WINDS | 7909152850 | 4.032041 | Property |
| HURRICANE | 6174019010 | 3.147479 | Property |
| FLOOD | 5661968450 | 2.886439 | Crop |
| TSTM WIND | 4539928440 | 2.314429 | Property |
| HIGH WIND | 4006346260 | 2.042412 | Property |