Synopsis

This report explores NOAA’s storm database to identify the most severe weather events. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. This report addresses the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The analysis uses the dplyr package for data processing and ggplot2 for graphics.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Step 1: Read the file StormData.csv.bz2 and store it in the stormdata variable.

stormdata <- read.csv(file = "../data/repdata-data-StormData.csv.bz2", sep=",", header = TRUE, na.strings = "")

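Step 2: Convert the column names to lowercase for easier access, normalize the exponent and event type columns to uppercase, and remove summary rows.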

stormdata.col <- names(stormdata)
names(stormdata) <- tolower(make.names(stormdata.col))

# tbl_df() is deprecated in current dplyr; as_tibble() is its replacement
stormdata.df <- as_tibble(stormdata)

stormdata.df$propdmgexp <- toupper(stormdata.df$propdmgexp)
stormdata.df$cropdmgexp <- toupper(stormdata.df$cropdmgexp)

# change evtype to uppercase
stormdata.df$evtype <- toupper(stormdata.df$evtype)

# remove summary rows, because they are not event types.
# a logical mask is used so that nothing is dropped if there is no match.
is_summary <- grepl("summary", stormdata.df$evtype, ignore.case=TRUE)
stormdata.df <- stormdata.df[!is_summary,]
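
Two common ways to drop matching rows are a negative index from grep() and a logical mask from grepl(). The sketch below uses made-up event names (not taken from the NOAA data) to show that the two agree when there is a match and differ when there is none.

```r
# Toy values, assumed for illustration only
evtype <- c("TORNADO", "SUMMARY OF MAY 22", "FLOOD")

# Index-based removal
idx <- grep("summary", evtype, ignore.case = TRUE)
by_index <- evtype[-idx]

# Mask-based removal gives the same result here
keep <- !grepl("summary", evtype, ignore.case = TRUE)
by_mask <- evtype[keep]

# With no match, grep() returns integer(0), and a negative
# empty index selects nothing instead of keeping everything
no_match <- grep("nomatch", evtype)
dropped_all <- evtype[-no_match]
length(dropped_all)
## [1] 0
```

The mask-based form keeps all rows when nothing matches, which is why it is the safer choice for a cleaning step like this one.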

Step 3: Identify the events most harmful to population health and those with the greatest economic consequences.

Step 3-1: Identify the events most harmful to population health.

Harm to population health is measured by fatalities and injuries, so both are summed for each event type. The mean per event is not meaningful for this question, so only totals are used. The totals are sorted in descending order.

most_harmful <- stormdata.df %>% filter(fatalities > 0 | injuries > 0) %>% 
    group_by(evtype) %>% 
    summarize(sum_fatalities = sum(fatalities), sum_injuries = sum(injuries), total_injuries = sum(fatalities) + sum(injuries)) %>%
    arrange(desc(total_injuries))

Following the Pareto principle, the next step computes the cumulative share of casualties so that the event types accounting for roughly 80% of the total can be identified.

most_harmful <- most_harmful %>% 
    mutate(cum_total = cumsum(total_injuries)) %>% 
    mutate(per_cum_total = round(cum_total / max(cum_total), 2))
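
As a small worked example of the cumulative share calculation, the sketch below applies the same cumsum() and round() steps to a made-up vector of totals. The numbers are assumed for illustration and are not taken from the storm data.

```r
# Toy totals (hypothetical values, already sorted in descending order)
totals <- c(60, 20, 10, 6, 4)

# Running total, then its share of the grand total
cum_total <- cumsum(totals)
per_cum_total <- round(cum_total / max(cum_total), 2)

per_cum_total
## [1] 0.60 0.80 0.90 0.96 1.00
```

Because the input is sorted in descending order, the last running total equals the grand total, so dividing by max(cum_total) gives the share of the whole.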

Step 3-2: Identify the events which have the greatest economic consequences.

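  • First, convert each damage figure and its exponent code into an actual amount.
  • Only the exponent codes ‘K’ (thousands), ‘M’ (millions) and ‘B’ (billions) are converted. The meaning of the other codes is unclear, so records with those codes are counted as zero.
  • propdmg2 and cropdmg2 hold the converted property and crop damage.
  • totaldmg is the sum of propdmg2 and cropdmg2.
  • Group by evtype and aggregate; totaldmg2 is the total damage per event type.
  • Sort by totaldmg2 in descending order.
  • Finally, extract the top 20 events.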
# multiplier for a damage exponent code; any other code (or NA) counts as 0.
# %in% is used instead of == so that missing codes give 0 rather than NA.
exp_multiplier <- function(code) {
    ifelse(code %in% 'K', 1000, ifelse(code %in% 'M', 1000000, ifelse(code %in% 'B', 1000000000, 0)))
}

greatest_event <- stormdata.df %>% 
    filter(propdmg > 0 | cropdmg > 0) %>% 
    mutate(propdmg2 = propdmg * exp_multiplier(propdmgexp)) %>%
    mutate(cropdmg2 = cropdmg * exp_multiplier(cropdmgexp)) %>%
    mutate(totaldmg = propdmg2 + cropdmg2) %>%
    group_by(evtype) %>%
    summarize(totaldmg2 = sum(totaldmg), avg = mean(totaldmg)) %>%
    arrange(desc(totaldmg2))

greatest_event_20 <- greatest_event[1:20,]
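
The exponent conversion described in the list above can be checked on a few toy rows. This is a minimal sketch in base R: the helper name code_to_multiplier and the sample values are assumptions made for illustration, and the lookup table is an alternative way of writing the same K/M/B rule, with every other code (including missing values) mapped to 0.

```r
# Hypothetical helper: map an exponent code to its multiplier.
# Codes other than K, M, B (and missing values) map to 0.
code_to_multiplier <- function(code) {
    multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
    out <- unname(multipliers[toupper(code)])
    out[is.na(out)] <- 0
    out
}

# Toy rows (made-up values, not taken from the NOAA data)
dmg <- c(2.5, 10, 1.2, 5)
exp_code <- c("K", "m", "B", NA)

# 2.5 thousand, 10 million, 1.2 billion, and 0 for the missing code
converted <- dmg * code_to_multiplier(exp_code)
converted
```

Mapping unknown and missing codes to 0 keeps the per-event sums free of NA values, at the cost of undercounting damage for those records.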

Results

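The first plot shows the cumulative share of casualties by event type, ordered from the most harmful event type onward. The shaded band highlights the event types near the 80% threshold.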

ggplot(most_harmful, aes(y = per_cum_total, x = reorder(evtype, per_cum_total))) + 
    geom_point() + xlab("Event Type") + 
    ylab("Cumulative Share of Casualties") + 
    theme(axis.text.x = element_blank()) + 
    ylim(0, 1) + 
    annotate("rect", xmin=4, xmax=7, ymin=0, ymax=1, alpha=.1, fill="blue")

print(most_harmful[most_harmful$per_cum_total < 0.8, c(1,4)])
## Source: local data frame [4 x 2]
## 
##           evtype total_injuries
## 1        TORNADO          96979
## 2 EXCESSIVE HEAT           8428
## 3      TSTM WIND           7461
## 4          FLOOD           7259

These four event types together account for just under 80% of all fatalities and injuries, so they are the most harmful to population health.

Finally, the plot below shows the 20 event types with the greatest economic consequences, ranked by total property and crop damage.

ggplot(greatest_event_20, aes(x=totaldmg2, y=reorder(evtype, totaldmg2))) + 
    geom_point(size=3) + theme_bw() + 
    theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.major.y = element_line(colour="grey60", linetype="dashed")) + 
    xlab("Total Damage") + ylab("Event Type")