Synopsys

Using the ‘Storm Data’ dataset from the United States’ National Oceanic and Atmospheric Administration (NOAA), a few simple metrics are calculated to address two questions. The first question concerns which types of weather events are most harmful to public health. Fatality statistics are analyzed to provide a ranking of the top 20 most fatal event types. Injury statistics are provided for these same most fatal event types, and it is shown that addressing top fatal events would also largely address top injury events. In the second and related question, property and crop damage statics are analyzed to determine which weather event types are most costly economically.

library(dplyr,warn.conflicts=FALSE)
library(ggplot2,warn.conflicts=FALSE)
library(data.table,warn.conflicts=FALSE)
library(knitr,warn.conflicts=FALSE)

Data Processing

The original dataset is supplied as a bzip2 compressed CSV file. fread() from the data.table package is used to decompress and read the rather large file in one step. The file location is expected to be in the current working directory.

dplyr package functions are used to summarize target metrics according to weather event type. Tables of results are made using knitr::kabble, and plots are made with ggplot2.

Data transformations: 1. Group by event type and summarize by either fatality or injury - these are the main statistics used to rank weather event types by personal harm measures 2. Bind top-50 fatal event type summaries and top-50 injury event type summaries into a dataframe used to plot cumulative percentages of totals 3. Group by event type and summarize by either property damage or crop damage - these are the main statistics used to rank weather event types by economic measures. The monetary cost is in the original data by mixed units of thousands and millions of dollars, indicated by either a K or M modifier variable. This is taken into account when calculating the overall costing results.

# fread from data.table is faster than read.csv on large files.
#  --> see https://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/
# there is some latency to extract the zip file. 
# honestly, I think one could check and then unzip only if the csv file wasn't present, 
# but the instrctors would probably frown on it...

din <- as_tibble(fread("./repdata_data_StormData.csv.bz2"))

Results

Question 1: Across the United States, which types of events (as indicated in the EVTYPE) are most harmful with respect to population health?

We have data on 2 categories of harmful: fatalities and injuries. The data will be grouped on event type and then summed separately over the harm categories. The fatality and injury tables will be merged on event type. Under the assumption that injury is a better outcome than death, the table will be sorted by decreasing fatalities. The top 10 most fatal event types will be noted, along with the corresponfing number of injuries. The ranking of each event by injury count will also be noted as a means of identifying skew between the fatality and injury measures.

rank_fatal <- group_by(din, EVTYPE) %>% summarize(fatalities=sum(FATALITIES,na.rm=TRUE)) %>% arrange(-fatalities)
rank_injury <- group_by(din, EVTYPE) %>% summarize(injuries=sum(INJURIES,na.rm=TRUE)) %>% arrange(-injuries)
tbl_harm <- full_join(rank_fatal,rank_injury,by="EVTYPE",all=TRUE) %>% arrange(-fatalities)
tbl_harm$fatality.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_fatal$EVTYPE==x)})
tbl_harm$injury.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_injury$EVTYPE==x)})
tbl_harm <- rename(tbl_harm,event.type=EVTYPE)
t10_harm <- head(tbl_harm,10)

kable(t10_harm,caption="Top 10 Most Fatal Weather Event Types",
      col.names = c("Event Type", "Fatalities", "Injuries", "Rank by Fatalities", "Rank by Injuries"))
Top 10 Most Fatal Weather Event Types
Event Type Fatalities Injuries Rank by Fatalities Rank by Injuries
TORNADO 5633 91346 1 1
EXCESSIVE HEAT 1903 6525 2 4
FLASH FLOOD 978 1777 3 8
HEAT 937 2100 4 6
LIGHTNING 816 5230 5 5
TSTM WIND 504 6957 6 2
FLOOD 470 6789 7 3
RIP CURRENT 368 232 8 29
HIGH WIND 248 1137 9 13
AVALANCHE 224 170 10 32

To help assess overall effectiveness of reducing personal harm by addressing a subset of weather event types, the plot below shows cumulative percentage of addressing the top N event types, separately for fatalities and injuries. The point of dimishing return appears to be around addressing the top 10 events for each type of personal harm. Based on the table above, if efforts are focused on addressing events leading to death, a large percentage of injury cases will automatically be addressed.

Ntop <- 50
top_fatal <- select(rank_fatal, "fatalities") %>% head(Ntop)
top_fatal <- rename(top_fatal,count=fatalities)
top_fatal$rank <- seq(1,Ntop)
top_fatal$harm.type <- as.factor("fatality")
sum_fatal <- sum(rank_fatal$fatalities,na.rm=TRUE)
top_fatal$cum_pct <- cumsum(top_fatal$count)/sum_fatal*100

top_injury <- select(rank_injury, "injuries") %>% head(Ntop)
top_injury <- rename(top_injury,count=injuries)
top_injury$rank <- seq(1,Ntop)
top_injury$harm.type <- as.factor("injury")
sum_injury <- sum(rank_injury$injuries,na.rm=TRUE)
top_injury$cum_pct <- cumsum(top_injury$count)/sum_injury*100

top <- rbind(top_fatal, top_injury)

plt <- ggplot(data=top, aes(x=rank, y=cum_pct, color=top$harm.type)) +
              geom_line() + geom_point() +
              theme_bw() +
              labs(x="N = Rank Within Harm Type",
                   y="Cumulative Percent of All Observations",
                   title="Effectiveness of Mitigating Harm by Addressing Top N Weather Event Types",
                   color="")
print(plt)



Question 2: Across the United States, which types of events have the greatest economic consequences?

dmg <- select(din, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, STATE)
dmg$unipropdmg <- dmg$PROPDMG * ifelse (dmg$PROPDMGEXP == "K", 1000, 1000000)
dmg$unicropdmg <- dmg$CROPDMG * ifelse (dmg$CROPDMGEXP == "K", 1000, 1000000)
rank_prop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unipropdmg,na.rm=TRUE)) %>% arrange(-cost)
rank_crop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unicropdmg,na.rm=TRUE)) %>% arrange(-cost)
all_cost <- sum(rank_prop$cost) + sum(rank_crop$cost)
rank_prop$pct.of.total <- rank_prop$cost / all_cost * 100
rank_crop$pct.of.total <- rank_crop$cost / all_cost * 100
top10_prop <- head(rank_prop,10)
top10_crop <- head(rank_crop,10)
top10_prop$type <- as.factor("Property")
top10_crop$type <- as.factor("Crop")
top_dmg <- rbind(top10_prop, top10_crop) %>% arrange(-cost)
top10_dmg <- head(top_dmg,10)

kable(top10_dmg,caption="Top 10 Most Expensive Weather Event Types",
      col.names = c("Event Type", "Cost (USD)", "Percent of Cumulative All-time Costs (USD)", "Damage Type"))
Top 10 Most Expensive Weather Event Types
Event Type Cost (USD) Percent of Cumulative All-time Costs (USD) Damage Type
TORNADO 51941160480 26.479305 Property
FLOOD 22287209800 11.361891 Property
FLASH FLOOD 15698911510 8.003215 Property
HAIL 14261766720 7.270567 Property
DROUGHT 12474066000 6.359207 Crop
THUNDERSTORM WINDS 7909152850 4.032041 Property
HURRICANE 6174019010 3.147479 Property
FLOOD 5661968450 2.886439 Crop
TSTM WIND 4539928440 2.314429 Property
HIGH WIND 4006346260 2.042412 Property