Synopsis: this is a quick top-down analysis on the types of events with the most human and material costs in the US. The analysis was delivered on year and state granularity. Tornados top both types of harm – thunderstorm winds add to the injuries –, (excessive) heat and flash floods also contribute heavily to the toll. The top states affected by the fatalities and injuries are Illinois, Texas, Pennsylvania, Alabama, Missouri and Florida – states in the South and in the middle of the US. Fatalities and injuries caused by these events heavily increased throughout the years. Hurricane damages crops like nothing else, while storms and flood destroys the most property. The top states affected by the damage on crops are Florida, North Carolina and Puerto Rico while Mississippi, California and Alabama have the most property damage. Damage caused by mega-events is a relatively new phenomenon and it seems to be getting more frequent – even as the 2006 Columbia hail storm and the whole of Katrina is missing from the dataset.
I loaded the dataset and considering the total size subsetted to records where there any injuries or fatalities or the damage to properties or crops could have been measured in billions of USD - aka the mega-events.
Caveat: as the Storm Events database points out the 2006 Columbia hail storm and the whole damage of Katrina is missing from the dataset.
I subsetted the data further so I can aggregate on years and states where the events happened and calculated a total damage column adding property and crop damages. Also removed the original datafile to save memory and loaded the libraries needed for analysis.
setwd("/Users/soobrosa/Desktop/coursera/repdata_3")
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors=F)
harmful <- subset(data, FATALITIES != 0 | INJURIES != 0, select=c(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES))
harmful$BGN_DATE <- as.numeric(format(as.Date(as.character(harmful$BGN_DATE),"%m/%d/%Y %H:%M:%S"), "%Y"))
costly <- subset(data, PROPDMGEXP == "B" | CROPDMGEXP == "B", select=c(BGN_DATE, STATE, EVTYPE, PROPDMG, CROPDMG))
costly$BGN_DATE <- as.numeric(format(as.Date(as.character(costly$BGN_DATE),"%m/%d/%Y %H:%M:%S"), "%Y"))
costly$DMG <- costly$PROPDMG + costly$CROPDMG
rm(data)
library(dplyr)
library(ggplot2)
library(tidyr)
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Let’s explore the top 10 event types that are fatal and caused injuries.
top <-
harmful %>%
group_by(EVTYPE) %>%
summarize(no_of_fatalities = sum(FATALITIES),
no_of_injuries = sum(INJURIES)) %>%
arrange(desc(no_of_fatalities))
top[1:10, ]
## Source: local data frame [10 x 3]
##
## EVTYPE no_of_fatalities no_of_injuries
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
Seven of the EVTYPEs are common, and as far as the data is pretty much inconsistent using EVTYPE names let’s try to collapse also similar EVTYPE names – except HEAT and FLOOD as they’re too short to fuzzy match them.
normalized = c("TORNADO", "FLASH FLOOD", "EXCESSIVE HEAT", "LIGHTNING", "TSTM WIND")
for (item in normalized){
harmful[agrep(item, harmful$EVTYPE), 3] <- item
}
harmful[agrep("THUNDERSTORM WIND", harmful$EVTYPE), 3] <- "TSTM WIND"
Let’s see the fixed list of the top 10 event types that are fatal or caused injuries. Tornados top both types of harm – thunderstorm winds add to the injuries –, (excessive) heat and flash floods also contribute heavily to the toll.
top <-
harmful %>%
group_by(EVTYPE) %>%
summarize(no_of_fatalities = sum(FATALITIES),
no_of_injuries = sum(INJURIES)) %>%
arrange(desc(no_of_fatalities))
top[1:10, ]
## Source: local data frame [10 x 3]
##
## EVTYPE no_of_fatalities no_of_injuries
## 1 TORNADO 5661 91407
## 2 EXCESSIVE HEAT 1922 6525
## 3 FLASH FLOOD 1035 1802
## 4 HEAT 937 2100
## 5 LIGHTNING 817 5232
## 6 TSTM WIND 729 9504
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
The top states affected by the fatalities and injuries are Illinois, Texas, Pennsylvania, Alabama, Missouri and Florida – states in the South and in the middle of the US.
sums <- harmful %>%
group_by(STATE) %>%
summarize(no_of_fatalities = sum(FATALITIES),
no_of_injuries = sum(INJURIES)) %>%
arrange(desc(no_of_fatalities))
sums[1:10, ]
## Source: local data frame [10 x 3]
##
## STATE no_of_fatalities no_of_injuries
## 1 IL 1421 5563
## 2 TX 1366 17667
## 3 PA 846 3223
## 4 AL 784 8742
## 5 MO 754 8998
## 6 FL 746 5918
## 7 MS 555 6675
## 8 CA 550 3278
## 9 AR 530 5550
## 10 TN 521 5202
Fatalities and injuries caused by these events heavily increased throughout the years.
sums <- harmful %>%
group_by(BGN_DATE) %>%
summarise(fatalities = sum(FATALITIES, na.rm = TRUE),
injuries = sum(INJURIES, na.rm = TRUE))
sums_to_plot <- sums %>% gather(variable, value, -BGN_DATE)
ggplot(sums_to_plot, aes(BGN_DATE, value, colour=variable)) +
geom_line() +
scale_colour_manual(values=c("red", "blue")) +
xlab('Years') +
ylab('Number of people affected') +
labs(title = "Fatalities and injuries over the years", color="Type of effect")
Across the United States, which types of events have the greatest economic consequences?
Let’s explore the top 10 event types by damage.
top <- costly %>%
group_by(EVTYPE) %>%
summarize(total_damage = sum(DMG),
property_damage = sum(PROPDMG),
crop_damage = sum(CROPDMG)) %>%
arrange(desc(total_damage))
top[1:10, ]
## Source: local data frame [10 x 4]
##
## EVTYPE total_damage property_damage crop_damage
## 1 HURRICANE 806.70 5.70 801.00
## 2 HURRICANE/TYPHOON 795.21 65.50 729.71
## 3 ICE STORM 505.00 500.00 5.00
## 4 FLOOD 156.00 122.50 33.50
## 5 STORM SURGE 42.56 42.56 0.00
## 6 HURRICANE OPAL/HIGH WINDS 10.10 0.10 10.00
## 7 RIVER FLOOD 10.00 5.00 5.00
## 8 HURRICANE OPAL 8.10 3.10 5.00
## 9 WILDFIRE 7.54 1.04 6.50
## 10 TORNADO 5.30 5.30 0.00
Collapse EVTYPE names.
normalized = c("HURRICANE", "FLOOD")
for (item in normalized){
costly[agrep(item, costly$EVTYPE), 3] <- item
}
Let’s see the fixed list of the top 10 event types by damage. Hurricane damages crops like nothing else, while storms and flood destroys the most property.
top <- costly %>%
group_by(EVTYPE) %>%
summarize(total_damage = sum(DMG),
property_damage = sum(PROPDMG),
crop_damage = sum(CROPDMG)) %>%
arrange(desc(total_damage))
top[1:10, ]
## Source: local data frame [10 x 4]
##
## EVTYPE total_damage property_damage crop_damage
## 1 HURRICANE 1620.11 74.40 1545.71
## 2 ICE STORM 505.00 500.00 5.00
## 3 FLOOD 167.00 128.50 38.50
## 4 STORM SURGE 42.56 42.56 0.00
## 5 WILDFIRE 7.54 1.04 6.50
## 6 TORNADO 5.30 5.30 0.00
## 7 TROPICAL STORM 5.15 5.15 0.00
## 8 WINTER STORM 5.00 5.00 0.00
## 9 TORNADOES, TSTM WIND, HAIL 4.10 1.60 2.50
## 10 STORM SURGE/TIDE 4.00 4.00 0.00
The top states affected by the damage on crops are Florida, North Carolina and Puerto Rico while Mississippi, California and Alabama have the most property damage.
sums <- costly %>%
group_by(STATE) %>%
summarize(total_damage = sum(DMG),
property_damage = sum(PROPDMG),
crop_damage = sum(CROPDMG)) %>%
arrange(desc(total_damage))
sums[1:10, ]
## Source: local data frame [10 x 4]
##
## STATE total_damage property_damage crop_damage
## 1 FL 742.45 31.75 710.70
## 2 MS 532.00 525.49 6.51
## 3 NC 503.00 3.00 500.00
## 4 PR 302.70 1.70 301.00
## 5 CA 155.04 116.04 39.00
## 6 LA 54.73 54.73 0.00
## 7 AL 47.50 12.10 35.40
## 8 TX 14.44 13.44 1.00
## 9 IL 10.00 5.00 5.00
## 10 TN 4.50 3.50 1.00
Damage caused by mega-events is a relatively new phenomenon and it seems to be getting more frequent – even as the 2006 Columbia hail storm and the whole of Katrina is missing from the dataset.
sums <- costly %>%
group_by(BGN_DATE) %>%
summarise(property_damage = sum(PROPDMG),
crop_damage = sum(CROPDMG))
sums_to_plot <- sums %>% gather(variable, value, -BGN_DATE)
ggplot(sums_to_plot, aes(BGN_DATE, value, colour=variable)) +
geom_line() +
scale_colour_manual(values=c("red", "blue")) +
xlab('Years') +
ylab('Damage in billion USD') +
labs(title = "Damage caused over the years", color="Type of damage")