Synopsis: this is a quick top-down analysis on the types of events with the most human and material costs in the US. The analysis was delivered on year and state granularity. Tornados top both types of harm – thunderstorm winds add to the injuries –, (excessive) heat and flash floods also contribute heavily to the toll. The top states affected by the fatalities and injuries are Illinois, Texas, Pennsylvania, Alabama, Missouri and Florida – states in the South and in the middle of the US. Fatalities and injuries caused by these events heavily increased throughout the years. Hurricane damages crops like nothing else, while storms and flood destroys the most property. The top states affected by the damage on crops are Florida, North Carolina and Puerto Rico while Mississippi, California and Alabama have the most property damage. Damage caused by mega-events is a relatively new phenomenon and it seems to be getting more frequent – even as the 2006 Columbia hail storm and the whole of Katrina is missing from the dataset.

Data Processing

I loaded the dataset and considering the total size subsetted to records where there any injuries or fatalities or the damage to properties or crops could have been measured in billions of USD - aka the mega-events.

Caveat: as the Storm Events database points out the 2006 Columbia hail storm and the whole damage of Katrina is missing from the dataset.

I subsetted the data further so I can aggregate on years and states where the events happened and calculated a total damage column adding property and crop damages. Also removed the original datafile to save memory and loaded the libraries needed for analysis.

setwd("/Users/soobrosa/Desktop/coursera/repdata_3")
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors=F)

harmful <- subset(data, FATALITIES != 0 | INJURIES != 0, select=c(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES))

harmful$BGN_DATE <- as.numeric(format(as.Date(as.character(harmful$BGN_DATE),"%m/%d/%Y %H:%M:%S"), "%Y"))

costly <- subset(data, PROPDMGEXP == "B" | CROPDMGEXP == "B", select=c(BGN_DATE, STATE, EVTYPE, PROPDMG, CROPDMG))

costly$BGN_DATE <- as.numeric(format(as.Date(as.character(costly$BGN_DATE),"%m/%d/%Y %H:%M:%S"), "%Y"))

costly$DMG <- costly$PROPDMG + costly$CROPDMG

rm(data)
library(dplyr)
library(ggplot2)
library(tidyr)

Results

1. Events harmful to health

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Let’s explore the top 10 event types that are fatal and caused injuries.

top <- 
  harmful %>% 
  group_by(EVTYPE) %>% 
  summarize(no_of_fatalities = sum(FATALITIES),
            no_of_injuries = sum(INJURIES)) %>%
  arrange(desc(no_of_fatalities))
top[1:10, ]
## Source: local data frame [10 x 3]
## 
##            EVTYPE no_of_fatalities no_of_injuries
## 1         TORNADO             5633          91346
## 2  EXCESSIVE HEAT             1903           6525
## 3     FLASH FLOOD              978           1777
## 4            HEAT              937           2100
## 5       LIGHTNING              816           5230
## 6       TSTM WIND              504           6957
## 7           FLOOD              470           6789
## 8     RIP CURRENT              368            232
## 9       HIGH WIND              248           1137
## 10      AVALANCHE              224            170

Seven of the EVTYPEs are common, and as far as the data is pretty much inconsistent using EVTYPE names let’s try to collapse also similar EVTYPE names – except HEAT and FLOOD as they’re too short to fuzzy match them.

normalized = c("TORNADO", "FLASH FLOOD", "EXCESSIVE HEAT", "LIGHTNING", "TSTM WIND")

for (item in normalized){
  harmful[agrep(item, harmful$EVTYPE), 3] <- item
}

harmful[agrep("THUNDERSTORM WIND", harmful$EVTYPE), 3] <- "TSTM WIND"

Let’s see the fixed list of the top 10 event types that are fatal or caused injuries. Tornados top both types of harm – thunderstorm winds add to the injuries –, (excessive) heat and flash floods also contribute heavily to the toll.

top <- 
  harmful %>% 
  group_by(EVTYPE) %>% 
  summarize(no_of_fatalities = sum(FATALITIES),
            no_of_injuries = sum(INJURIES)) %>%
  arrange(desc(no_of_fatalities))
top[1:10, ]
## Source: local data frame [10 x 3]
## 
##            EVTYPE no_of_fatalities no_of_injuries
## 1         TORNADO             5661          91407
## 2  EXCESSIVE HEAT             1922           6525
## 3     FLASH FLOOD             1035           1802
## 4            HEAT              937           2100
## 5       LIGHTNING              817           5232
## 6       TSTM WIND              729           9504
## 7           FLOOD              470           6789
## 8     RIP CURRENT              368            232
## 9       HIGH WIND              248           1137
## 10      AVALANCHE              224            170

The top states affected by the fatalities and injuries are Illinois, Texas, Pennsylvania, Alabama, Missouri and Florida – states in the South and in the middle of the US.

sums <- harmful %>%
  group_by(STATE) %>%
  summarize(no_of_fatalities = sum(FATALITIES),
            no_of_injuries = sum(INJURIES)) %>%
  arrange(desc(no_of_fatalities))
sums[1:10, ]
## Source: local data frame [10 x 3]
## 
##    STATE no_of_fatalities no_of_injuries
## 1     IL             1421           5563
## 2     TX             1366          17667
## 3     PA              846           3223
## 4     AL              784           8742
## 5     MO              754           8998
## 6     FL              746           5918
## 7     MS              555           6675
## 8     CA              550           3278
## 9     AR              530           5550
## 10    TN              521           5202

Fatalities and injuries caused by these events heavily increased throughout the years.

sums <- harmful %>%
  group_by(BGN_DATE) %>%
  summarise(fatalities = sum(FATALITIES, na.rm = TRUE),
            injuries = sum(INJURIES, na.rm = TRUE))

sums_to_plot <- sums %>% gather(variable, value, -BGN_DATE)
ggplot(sums_to_plot, aes(BGN_DATE, value, colour=variable)) + 
  geom_line() +
  scale_colour_manual(values=c("red", "blue")) +
  xlab('Years') +
  ylab('Number of people affected') +
  labs(title = "Fatalities and injuries over the years", color="Type of effect")

2. Events with damage

Across the United States, which types of events have the greatest economic consequences?

Let’s explore the top 10 event types by damage.

top <- costly %>%
  group_by(EVTYPE) %>%
  summarize(total_damage = sum(DMG),
            property_damage = sum(PROPDMG),
            crop_damage = sum(CROPDMG)) %>%
  arrange(desc(total_damage))
top[1:10, ]
## Source: local data frame [10 x 4]
## 
##                       EVTYPE total_damage property_damage crop_damage
## 1                  HURRICANE       806.70            5.70      801.00
## 2          HURRICANE/TYPHOON       795.21           65.50      729.71
## 3                  ICE STORM       505.00          500.00        5.00
## 4                      FLOOD       156.00          122.50       33.50
## 5                STORM SURGE        42.56           42.56        0.00
## 6  HURRICANE OPAL/HIGH WINDS        10.10            0.10       10.00
## 7                RIVER FLOOD        10.00            5.00        5.00
## 8             HURRICANE OPAL         8.10            3.10        5.00
## 9                   WILDFIRE         7.54            1.04        6.50
## 10                   TORNADO         5.30            5.30        0.00

Collapse EVTYPE names.

normalized = c("HURRICANE", "FLOOD")

for (item in normalized){
  costly[agrep(item, costly$EVTYPE), 3] <- item
}

Let’s see the fixed list of the top 10 event types by damage. Hurricane damages crops like nothing else, while storms and flood destroys the most property.

top <- costly %>%
  group_by(EVTYPE) %>%
  summarize(total_damage = sum(DMG),
            property_damage = sum(PROPDMG),
            crop_damage = sum(CROPDMG)) %>%
  arrange(desc(total_damage))
top[1:10, ]
## Source: local data frame [10 x 4]
## 
##                        EVTYPE total_damage property_damage crop_damage
## 1                   HURRICANE      1620.11           74.40     1545.71
## 2                   ICE STORM       505.00          500.00        5.00
## 3                       FLOOD       167.00          128.50       38.50
## 4                 STORM SURGE        42.56           42.56        0.00
## 5                    WILDFIRE         7.54            1.04        6.50
## 6                     TORNADO         5.30            5.30        0.00
## 7              TROPICAL STORM         5.15            5.15        0.00
## 8                WINTER STORM         5.00            5.00        0.00
## 9  TORNADOES, TSTM WIND, HAIL         4.10            1.60        2.50
## 10           STORM SURGE/TIDE         4.00            4.00        0.00

The top states affected by the damage on crops are Florida, North Carolina and Puerto Rico while Mississippi, California and Alabama have the most property damage.

sums <- costly %>%
  group_by(STATE) %>%
  summarize(total_damage = sum(DMG),
            property_damage = sum(PROPDMG),
            crop_damage = sum(CROPDMG)) %>%
  arrange(desc(total_damage))
sums[1:10, ]
## Source: local data frame [10 x 4]
## 
##    STATE total_damage property_damage crop_damage
## 1     FL       742.45           31.75      710.70
## 2     MS       532.00          525.49        6.51
## 3     NC       503.00            3.00      500.00
## 4     PR       302.70            1.70      301.00
## 5     CA       155.04          116.04       39.00
## 6     LA        54.73           54.73        0.00
## 7     AL        47.50           12.10       35.40
## 8     TX        14.44           13.44        1.00
## 9     IL        10.00            5.00        5.00
## 10    TN         4.50            3.50        1.00

Damage caused by mega-events is a relatively new phenomenon and it seems to be getting more frequent – even as the 2006 Columbia hail storm and the whole of Katrina is missing from the dataset.

sums <- costly %>%
  group_by(BGN_DATE) %>%
  summarise(property_damage = sum(PROPDMG),
  crop_damage = sum(CROPDMG))

sums_to_plot <- sums %>% gather(variable, value, -BGN_DATE)
ggplot(sums_to_plot, aes(BGN_DATE, value, colour=variable)) + 
  geom_line() +
  scale_colour_manual(values=c("red", "blue")) +
  xlab('Years') +
  ylab('Damage in billion USD') +
  labs(title = "Damage caused over the years", color="Type of damage")