This report explores NOAA’s storm database to identify severe weather events. The events in the database start in the year 1950 and end in November 2011. The earlier years contain fewer recorded events, most likely because of a lack of good records; more recent years should be considered more complete. This report addresses the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The analysis uses the dplyr package for data processing and the ggplot2 package for graphics.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
stormdata <- read.csv(file = "../data/repdata-data-StormData.csv.bz2", sep=",", header = TRUE, na.strings = "")
stormdata.col <- names(stormdata)
names(stormdata) <- tolower(make.names(stormdata.col))
stormdata.df <- tbl_df(stormdata)
stormdata.df$propdmgexp <- toupper(stormdata.df$propdmgexp)
stormdata.df$cropdmgexp <- toupper(stormdata.df$cropdmgexp)
# change evtype to uppercase
stormdata.df$evtype <- toupper(stormdata.df$evtype)
# remove rows whose evtype is a "summary" entry, because these are not event types.
# a logical mask is used so that nothing is dropped if no row matches.
stormdata.df <- stormdata.df[!grepl("summary", stormdata.df$evtype, ignore.case=TRUE),]
Harm to population health is measured by fatalities and injuries, so the two are summed for each event type. I use the totals rather than the mean per event, because I think the mean is not meaningful in this case. The event types are then sorted by their combined total in descending order.
most_harmful <- stormdata.df %>% filter(fatalities > 0 | injuries > 0) %>%
group_by(evtype) %>%
summarize(sum_fatalities = sum(fatalities), sum_injuries = sum(injuries), total_injuries = sum(fatalities) + sum(injuries)) %>%
arrange(desc(total_injuries))
Following the Pareto principle, the next step identifies the event types that account for 80% of the total fatalities and injuries.
most_harmful <- most_harmful %>%
mutate(cum_total = cumsum(total_injuries)) %>%
mutate(per_cum_total = round(cum_total / max(cum_total), 2))
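The cumulative-share calculation can be illustrated on a small example. This is only a sketch in base R: the data frame `toy`, its values, and the names `cum_share`, `n_keep`, and `pareto_set` are hypothetical and are not taken from the NOAA database.

```r
# Hypothetical toy data, not taken from the NOAA database.
toy <- data.frame(
  evtype = c("A", "B", "C", "D"),
  total  = c(60, 25, 10, 5),
  stringsAsFactors = FALSE
)

# Sort in descending order, then compute the running share of the grand total.
toy <- toy[order(-toy$total), ]
toy$cum_share <- cumsum(toy$total) / sum(toy$total)

# Keep the smallest set of event types whose cumulative share reaches 80%:
# every row up to and including the first one that crosses the threshold.
n_keep <- which(toy$cum_share >= 0.8)[1]
pareto_set <- toy$evtype[seq_len(n_keep)]
```

In this sketch the event type that crosses the 80% line is included in the set. A filter of the form `per_cum_total < 0.8`, as used later in this report, keeps only the event types before the crossing, so the selected set sums to slightly less than 80%.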
greatest_event <- stormdata.df %>%
filter(propdmg > 0 | cropdmg > 0) %>%
# %in% never returns NA, so a missing exponent gets multiplier 0 instead of turning the damage into NA
mutate(propdmg2 = propdmg * ifelse(propdmgexp %in% 'K', 1000, ifelse(propdmgexp %in% 'M', 1000000, ifelse(propdmgexp %in% 'B', 1000000000, 0)))) %>%
mutate(cropdmg2 = cropdmg * ifelse(cropdmgexp %in% 'K', 1000, ifelse(cropdmgexp %in% 'M', 1000000, ifelse(cropdmgexp %in% 'B', 1000000000, 0)))) %>%
mutate(totaldmg = propdmg2 + cropdmg2) %>%
group_by(evtype) %>%
summarize(totaldmg2 = sum(totaldmg), avg = mean(totaldmg)) %>%
arrange(desc(totaldmg2))
greatest_event_20 <- greatest_event[1:20,]
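The exponent handling in the pipeline above (K for thousands, M for millions, B for billions, anything else counted as 0) can also be written as a lookup table. The following is a minimal sketch in base R; `exp_multiplier` and `scale_damage` are hypothetical names introduced only for illustration and are not part of the analysis.

```r
# Multipliers for the exponent codes handled in this report (K, M, B).
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a damage value by its exponent code.
# Codes outside K/M/B, including missing codes, get a multiplier of 0,
# which mirrors the nested ifelse() calls in the pipeline.
scale_damage <- function(value, exp_code) {
  mult <- unname(exp_multiplier[toupper(as.character(exp_code))])
  mult[is.na(mult)] <- 0
  value * mult
}

# Example call with made-up values: 25K, 2.5M, 1B, and a missing exponent.
scaled <- scale_damage(c(25, 2.5, 1, 10), c("K", "m", "B", NA))
```

A named vector keeps the code-to-multiplier mapping in one place, so supporting another exponent code would mean adding one entry rather than nesting another `ifelse()`.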
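The first plot shows the cumulative share of fatalities and injuries by event type, ordered from the most harmful event type onward.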
ggplot(most_harmful, aes(y = per_cum_total, x = reorder(evtype, per_cum_total))) +
geom_point() + xlab("Event Type") +
ylab("Cumulative Share of Fatalities and Injuries") +
theme(axis.text.x = element_blank()) +
ylim(0, 1) +
annotate("rect", xmin=4, xmax=7, ymin=0, ymax=1, alpha=.1, fill="blue")
print(most_harmful[most_harmful$per_cum_total < 0.8, c(1,4)])
## Source: local data frame [4 x 2]
##
## evtype total_injuries
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
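These four event types together account for roughly 80% of all fatalities and injuries, so they are the most harmful to population health.
Finally, the following plot shows the 20 event types with the greatest economic consequences, measured as combined property and crop damage.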
ggplot(greatest_event_20, aes(x=totaldmg2, y=reorder(evtype, totaldmg2))) +
geom_point(size=3) + theme_bw() +
theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.major.y = element_line(colour="grey60", linetype="dashed")) +
xlab("Total Damage") + ylab("Event Type")