A Simple Analysis of Personal Harm and Property Damage Caused by Weather Events

Synopsys

Using the ‘Storm Data’ dataset from the United States’ National Oceanic and Atmospheric Administration (NOAA), a few simple metrics are calculated to address two questions. The first question concerns which types of weather events are most harmful to public health. Fatality statistics are analyzed to provide a ranking of the top 20 most fatal event types. Injury statistics are provided for these same most fatal event types, and it is shown that addressing top fatal events would also largely address top injury events. In the second and related question, property and crop damage statics are analyzed to determine which weather event types are most costly economically.

library(dplyr,warn.conflicts=FALSE)
library(ggplot2,warn.conflicts=FALSE)
library(data.table,warn.conflicts=FALSE)
library(knitr,warn.conflicts=FALSE)

Data Processing

The original dataset is supplied as a bzip2 compressed CSV file. fread() from the data.table package is used to decompress and read the rather large file in one step. The file location is expected to be in the current working directory.

dplyr package functions are used to summarize target metrics according to weather event type. Tables of results are made using knitr::kabble, and plots are made with ggplot2.

Data transformations: 1. Group by event type and summarize by either fatality or injury - these are the main statistics used to rank weather event types by personal harm measures 2. Bind top-50 fatal event type summaries and top-50 injury event type summaries into a dataframe used to plot cumulative percentages of totals 3. Group by event type and summarize by either property damage or crop damage - these are the main statistics used to rank weather event types by economic measures. The monetary cost is in the original data by mixed units of thousands and millions of dollars, indicated by either a K or M modifier variable. This is taken into account when calculating the overall costing results.

# fread from data.table is faster than read.csv on large files.
#  --> see https://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/
# there is some latency to extract the zip file. 
# honestly, I think one could check and then unzip only if the csv file wasn't present, 
# but the instrctors would probably frown on it...

din <- as_tibble(fread("./repdata_data_StormData.csv.bz2"))

Results

Question 1: Across the United States, which types of events (as indicated in the EVTYPE) are most harmful with respect to population health?

We have data on 2 categories of harmful: fatalities and injuries. The data will be grouped on event type and then summed separately over the harm categories. The fatality and injury tables will be merged on event type. Under the assumption that injury is a better outcome than death, the table will be sorted by decreasing fatalities. The top 10 most fatal event types will be noted, along with the corresponfing number of injuries. The ranking of each event by injury count will also be noted as a means of identifying skew between the fatality and injury measures.

rank_fatal <- group_by(din, EVTYPE) %>% summarize(fatalities=sum(FATALITIES,na.rm=TRUE)) %>% arrange(-fatalities)
rank_injury <- group_by(din, EVTYPE) %>% summarize(injuries=sum(INJURIES,na.rm=TRUE)) %>% arrange(-injuries)
tbl_harm <- full_join(rank_fatal,rank_injury,by="EVTYPE",all=TRUE) %>% arrange(-fatalities)
tbl_harm$fatality.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_fatal$EVTYPE==x)})
tbl_harm$injury.rank <- sapply(tbl_harm$EVTYPE, function(x){which(rank_injury$EVTYPE==x)})
tbl_harm <- rename(tbl_harm,event.type=EVTYPE)
t10_harm <- head(tbl_harm,10)

kable(t10_harm,caption="Top 10 Most Fatal Weather Event Types",
      col.names = c("Event Type", "Fatalities", "Injuries", "Rank by Fatalities", "Rank by Injuries"))

Top 10 Most Fatal Weather Event Types
Event Type	Fatalities	Injuries	Rank by Fatalities	Rank by Injuries
TORNADO	5633	91346	1	1
EXCESSIVE HEAT	1903	6525	2	4
FLASH FLOOD	978	1777	3	8
HEAT	937	2100	4	6
LIGHTNING	816	5230	5	5
TSTM WIND	504	6957	6	2
FLOOD	470	6789	7	3
RIP CURRENT	368	232	8	29
HIGH WIND	248	1137	9	13
AVALANCHE	224	170	10	32

To help assess overall effectiveness of reducing personal harm by addressing a subset of weather event types, the plot below shows cumulative percentage of addressing the top N event types, separately for fatalities and injuries. The point of dimishing return appears to be around addressing the top 10 events for each type of personal harm. Based on the table above, if efforts are focused on addressing events leading to death, a large percentage of injury cases will automatically be addressed.

Ntop <- 50
top_fatal <- select(rank_fatal, "fatalities") %>% head(Ntop)
top_fatal <- rename(top_fatal,count=fatalities)
top_fatal$rank <- seq(1,Ntop)
top_fatal$harm.type <- as.factor("fatality")
sum_fatal <- sum(rank_fatal$fatalities,na.rm=TRUE)
top_fatal$cum_pct <- cumsum(top_fatal$count)/sum_fatal*100

top_injury <- select(rank_injury, "injuries") %>% head(Ntop)
top_injury <- rename(top_injury,count=injuries)
top_injury$rank <- seq(1,Ntop)
top_injury$harm.type <- as.factor("injury")
sum_injury <- sum(rank_injury$injuries,na.rm=TRUE)
top_injury$cum_pct <- cumsum(top_injury$count)/sum_injury*100

top <- rbind(top_fatal, top_injury)

plt <- ggplot(data=top, aes(x=rank, y=cum_pct, color=top$harm.type)) +
              geom_line() + geom_point() +
              theme_bw() +
              labs(x="N = Rank Within Harm Type",
                   y="Cumulative Percent of All Observations",
                   title="Effectiveness of Mitigating Harm by Addressing Top N Weather Event Types",
                   color="")
print(plt)

Question 2: Across the United States, which types of events have the greatest economic consequences?

dmg <- select(din, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, STATE)
dmg$unipropdmg <- dmg$PROPDMG * ifelse (dmg$PROPDMGEXP == "K", 1000, 1000000)
dmg$unicropdmg <- dmg$CROPDMG * ifelse (dmg$CROPDMGEXP == "K", 1000, 1000000)
rank_prop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unipropdmg,na.rm=TRUE)) %>% arrange(-cost)
rank_crop <- group_by(dmg, EVTYPE) %>% summarize(cost=sum(unicropdmg,na.rm=TRUE)) %>% arrange(-cost)
all_cost <- sum(rank_prop$cost) + sum(rank_crop$cost)
rank_prop$pct.of.total <- rank_prop$cost / all_cost * 100
rank_crop$pct.of.total <- rank_crop$cost / all_cost * 100
top10_prop <- head(rank_prop,10)
top10_crop <- head(rank_crop,10)
top10_prop$type <- as.factor("Property")
top10_crop$type <- as.factor("Crop")
top_dmg <- rbind(top10_prop, top10_crop) %>% arrange(-cost)
top10_dmg <- head(top_dmg,10)

kable(top10_dmg,caption="Top 10 Most Expensive Weather Event Types",
      col.names = c("Event Type", "Cost (USD)", "Percent of Cumulative All-time Costs (USD)", "Damage Type"))

Top 10 Most Expensive Weather Event Types
Event Type	Cost (USD)	Percent of Cumulative All-time Costs (USD)	Damage Type
TORNADO	51941160480	26.479305	Property
FLOOD	22287209800	11.361891	Property
FLASH FLOOD	15698911510	8.003215	Property
HAIL	14261766720	7.270567	Property
DROUGHT	12474066000	6.359207	Crop
THUNDERSTORM WINDS	7909152850	4.032041	Property
HURRICANE	6174019010	3.147479	Property
FLOOD	5661968450	2.886439	Crop
TSTM WIND	4539928440	2.314429	Property
HIGH WIND	4006346260	2.042412	Property