We analysed the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, restricting the data to the period from 2007 to 2011 to focus on recent findings. Across the United States, judging by the resulting fatalities and injuries, the most harmful event type with respect to population health is the tornado. Floods, hail, storm surges and tornadoes are among the events with the greatest economic consequences, measured in property and crop damage.
After reading the data, we convert the BGN_DATE column, which serves as the reference date, to the Date class.
storm <- read.csv('repdata-data-StormData.csv.bz2')
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(lattice)
## Loading required package: lattice
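The date conversion can be checked on a couple of values before it is applied to the whole column. The following is a minimal sketch; the sample strings are hypothetical and assume that BGN_DATE holds text in a "month/day/year time" layout, which is not shown in this report.

```r
# Hypothetical sample values in the layout assumed for BGN_DATE
raw_dates <- c("4/18/1950 0:00:00", "12/5/2007 0:00:00")

# as.Date() parses the leading date part; trailing characters are ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the year as an integer so it can be compared numerically
years <- as.integer(format(parsed, "%Y"))

print(parsed)
print(years)
```

Converting the year to an integer avoids relying on string comparison when filtering by year later on.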
We then analyse how many records and how many distinct EVTYPE values occur in each year.
evtype.analysis <- as.data.frame(table(storm$EVTYPE, format(storm$BGN_DATE, "%Y")))
names(evtype.analysis) <- c('EVTYPE', 'BGN_DATE_year', 'Frequency')
evtype.analysis <- evtype.analysis[evtype.analysis$Frequency > 0,]
evtype.analysis <- evtype.analysis %>%
group_by(BGN_DATE_year) %>%
summarise(total_freq = sum(Frequency), nr_diff_evtype = n_distinct(EVTYPE))
tail(evtype.analysis, n = 10)
## Source: local data frame [10 x 3]
##
## BGN_DATE_year total_freq nr_diff_evtype
## (fctr) (int) (int)
## 1 2002 36293 99
## 2 2003 39752 51
## 3 2004 39363 38
## 4 2005 39184 46
## 5 2006 44034 50
## 6 2007 43289 46
## 7 2008 55663 46
## 8 2009 45817 46
## 9 2010 48161 46
## 10 2011 62174 46
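The same per-year summary can be illustrated on toy data, counting records and distinct labels directly with dplyr. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from the NOAA file.

```r
library(dplyr)

# Toy data: hypothetical event records, not taken from the NOAA file
toy <- data.frame(
  EVTYPE = c("TORNADO", "Tornado", "HAIL", "TORNADO", "HAIL", "HAIL"),
  year   = c(2002, 2002, 2002, 2007, 2007, 2007)
)

# n() counts records per year; n_distinct() counts distinct labels per year
toy_summary <- toy %>%
  group_by(year) %>%
  summarise(total_freq = n(), nr_diff_evtype = n_distinct(EVTYPE))

print(toy_summary)
```

Note that n_distinct() treats labels that differ only in letter case as different values, so in this toy example 2002 has three distinct labels and 2007 has two.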
Only from 2007 onwards does the number of distinct EVTYPE values become uniform, so we restrict the analysis to 2007 to 2011, where the event types are recorded more consistently.
storm.recent <- storm[as.integer(format(storm$BGN_DATE, "%Y")) >= 2007, ]
Still as part of data preparation, we convert the exponent codes in PROPDMGEXP and CROPDMGEXP into the numeric multipliers they represent, so that damage can be measured precisely.
table(storm.recent$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
##      0      0      0      0      1      0      0      0      0      0
##      6      7      8      B      h      H      K      m      M
##      0      0      0      9      0      0 252258      0   2836
storm.recent$PROPDMG_multiplier <- ifelse(storm.recent$PROPDMGEXP == '0', 1,
                                   ifelse(storm.recent$PROPDMGEXP == 'B', 1000000000,
                                   ifelse(storm.recent$PROPDMGEXP == 'K', 1000,
                                   ifelse(storm.recent$PROPDMGEXP == 'M', 1000000, 0))))
table(storm.recent$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
##      0      0      0      0      2      0 254503      0    599
storm.recent$CROPDMG_multiplier <- ifelse(storm.recent$CROPDMGEXP == 'B', 1000000000,
                                   ifelse(storm.recent$CROPDMGEXP == 'K', 1000,
                                   ifelse(storm.recent$CROPDMGEXP == 'M', 1000000, 0)))
storm.recent$PROPDMG_total <- storm.recent$PROPDMG*storm.recent$PROPDMG_multiplier
storm.recent$CROPDMG_total <- storm.recent$CROPDMG*storm.recent$CROPDMG_multiplier
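As an alternative to nested ifelse() calls, the code-to-multiplier mapping could be written as a named lookup vector. The sketch below uses the same codes and multipliers as the analysis; the vector name `exp_multiplier` and the sample values are hypothetical.

```r
# Named lookup vector: exponent code -> multiplier (same mapping as in the analysis)
exp_multiplier <- c("0" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Hypothetical sample rows; the last one has an empty exponent code
dmg     <- c(2.5, 10, 1.2, 7)
dmg_exp <- c("K", "M", "B", "")

# as.character() guards against factor columns; codes without an entry yield NA
multiplier <- unname(exp_multiplier[as.character(dmg_exp)])

# Treat unknown or empty codes as a multiplier of 0, as the ifelse() chain does
multiplier[is.na(multiplier)] <- 0

dmg_total <- dmg * multiplier
print(dmg_total)
```

With this layout, supporting another code only requires adding an entry to the vector, and the same vector can serve both the property and the crop columns.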
This data analysis addresses the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To measure how harmful events are to population health, we total the fatalities and injuries per event type and keep the five event types with the highest counts.
storm.evtype.analysis <- storm.recent %>%
group_by(EVTYPE) %>%
summarise(nr_fatalities = sum(FATALITIES), nr_injuries = sum(INJURIES)) %>%
arrange(desc(nr_fatalities), desc(nr_injuries))
storm.evtype.analysis.5 <- storm.evtype.analysis[1:5, ]
storm.evtype.analysis.5$EVTYPE <- factor(storm.evtype.analysis.5$EVTYPE)
This results in the following plot:
qplot(nr_fatalities, nr_injuries, data = storm.evtype.analysis.5, facets = .~ EVTYPE,
xlab = 'Total number of fatalities from 2007 to 2011',
ylab = 'Total number of injuries from 2007 to 2011',
main = 'Fatalities and Injuries per Event Type from 2007 to 2011')
Therefore, across the United States, tornadoes are the most harmful events with respect to population health.
To see which types of events have the greatest economic consequences across the United States, we sum property and crop damage per event type and examine the five largest totals.
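A ranking by fatalities with injuries as a tie-breaker can be sketched on toy data. In dplyr, each descending sort key gets its own desc() call, since desc() accepts a single vector. The data frame `toy_health` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical per-event totals, with a tie on fatalities between A and B
toy_health <- data.frame(
  EVTYPE        = c("A", "B", "C"),
  nr_fatalities = c(10, 10, 25),
  nr_injuries   = c(5, 40, 1)
)

# Sort by fatalities first; injuries decide the order among ties
ranked <- toy_health %>%
  arrange(desc(nr_fatalities), desc(nr_injuries))

print(ranked)
```

Here C comes first because it has the most fatalities, and B precedes A because it has more injuries for the same number of fatalities.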
storm.damage.analysis <- storm.recent %>%
group_by(EVTYPE) %>%
summarise(PROPDMG_sum = sum(PROPDMG_total), CROPDMG_sum = sum(CROPDMG_total)) %>%
mutate(TOTALDMG = PROPDMG_sum + CROPDMG_sum) %>%
arrange(desc(TOTALDMG), desc(PROPDMG_sum), desc(CROPDMG_sum))
storm.damage.analysis.5 <- storm.damage.analysis[1:5,]
This results in the following plots:
qplot(CROPDMG_sum/1000000000, PROPDMG_sum/1000000000, data = storm.damage.analysis.5, facets = .~ EVTYPE,
xlab = 'Total crop damage from 2007 to 2011 (Billions)',
ylab = 'Total property damage from 2007 to 2011 (Billions)',
main = 'Total damage (Billions) per Event from 2007 to 2011')
xyplot(TOTALDMG/1000000000 ~ EVTYPE, data = storm.damage.analysis.5,
xlab = 'Event type',
ylab = 'Total (Property plus Crop) damage from 2007 to 2011 (Billions)',
main = 'Total damage (Billions) per Event from 2007 to 2011',
pch = 19)
Therefore, we can see the five events that have the greatest economic consequences:
FLOOD
TORNADO
HAIL
FLASH FLOOD
STORM SURGE/TIDE
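A ranking like this one could also be shown as a bar chart built with ggplot() and geom_col(), which may be preferable because qplot() is, to our knowledge, deprecated in recent ggplot2 releases. The sketch below uses hypothetical placeholder values, not results from the storm data.

```r
library(ggplot2)

# Hypothetical totals in billions; placeholders, not results from the data
toy_damage <- data.frame(
  EVTYPE   = c("EVENT A", "EVENT B", "EVENT C"),
  TOTALDMG = c(3.2, 1.5, 0.7)
)

# reorder() sorts the bars by decreasing total damage
p <- ggplot(toy_damage, aes(x = reorder(EVTYPE, -TOTALDMG), y = TOTALDMG)) +
  geom_col() +
  labs(x = "Event type", y = "Total damage (billions)")

print(p)
```

Sorting the bars by total damage makes the order of the event types readable directly from the plot.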