This report is issued to fulfill Peer Assessment 2 of the Reproducible Research course (Coursera repdata-0322).
The data analysis addresses the following questions:
* Across the United States, which types of events (as indicated in the EVTYPE variable) are most
harmful with respect to population health?
* Across the United States, which types of events have the greatest economic consequences?
The NOAA database used contains storm data and its consequences from 1950 to 2011. Because the reporting criteria and level of detail have changed over the years, more work is needed to obtain consistent time series. That is outside the scope of this report.
More information at:
https://www.ncdc.noaa.gov/stormevents/
The bz2 file is downloaded from the course website into the R working directory. From there it is read directly into the data frame sdf, using the file header as column names.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repdata-data-StormData.csv.bz2")
sdf<-read.table("repdata-data-StormData.csv.bz2", sep = ",", header = TRUE)
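Because the compressed file is large, re-running the report does not have to download it again. The following is a minimal sketch, assuming a helper named fetch_once (hypothetical, not part of the original analysis) that downloads only when the file is missing:

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is not already present.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # binary mode keeps the bz2 archive intact
  }
  invisible(dest)
}

# Usage with the course file (commented out to avoid an accidental download):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "repdata-data-StormData.csv.bz2")
```

The helper returns the destination path invisibly, so it can be passed straight to the read step.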
The R packages used for analysis and plotting are loaded if required. The number of rows and columns is shown.
require(dplyr, quietly = TRUE)
require(ggplot2, quietly = TRUE)
dim(sdf)
## [1] 902297 37
Population health is measured by the variables FATALITIES and INJURIES. The two cannot simply be added, and giving more weight to fatalities makes injuries irrelevant, so only FATALITIES is considered.
Fatalities and number of events are grouped by Event Type using dplyr commands. The ten most deadly events are shown:
gr<-group_by(sdf,EVTYPE)
FatbyEv<-arrange(summarize(gr, Fat=sum(FATALITIES), Count=n()), desc(Fat))
FatbyEv
## Source: local data frame [985 x 3]
##
## EVTYPE Fat Count
## (fctr) (dbl) (int)
## 1 TORNADO 5633 60652
## 2 EXCESSIVE HEAT 1903 1678
## 3 FLASH FLOOD 978 54277
## 4 HEAT 937 767
## 5 LIGHTNING 816 15754
## 6 TSTM WIND 504 219940
## 7 FLOOD 470 25326
## 8 RIP CURRENT 368 470
## 9 HIGH WIND 248 20212
## 10 AVALANCHE 224 386
## .. ... ... ...
Tornadoes, high temperatures and floods are the most serious events.
A new variable named Intensity = Fatalities / number of events has been generated. The ten most intense events in terms of fatalities, among those with more than five occurrences, are shown:
FatbyEvI<-mutate(FatbyEv, Intensity= Fat / Count)
filter(arrange(FatbyEvI, desc(Intensity)), Count >5)
## Source: local data frame [225 x 4]
##
## EVTYPE Fat Count Intensity
## (fctr) (dbl) (int) (dbl)
## 1 EXTREME HEAT 96 22 4.3636364
## 2 HEAT WAVE 172 74 2.3243243
## 3 UNSEASONABLY WARM AND DRY 29 13 2.2307692
## 4 TSUNAMI 33 20 1.6500000
## 5 HEAT 937 767 1.2216428
## 6 EXCESSIVE HEAT 1903 1678 1.1340882
## 7 LOW TEMPERATURE 7 7 1.0000000
## 8 HURRICANE ERIN 6 7 0.8571429
## 9 RIP CURRENT 368 470 0.7829787
## 10 HURRICANE/TYPHOON 64 88 0.7272727
## .. ... ... ... ...
Per event, high temperatures are clearly the most dangerous: heat-related types take five of the top six places.
To see how the most deadly events evolved over the years, the yearly fatalities and the yearly number of events have been plotted, with one panel per event type.
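The difference between ranking by total fatalities and ranking by intensity can be seen on a small example. This is only a sketch in base R on made-up numbers (the toy data frame is an assumption, not taken from the NOAA data):

```r
# Made-up toy data: one row per recorded event.
toy <- data.frame(
  EVTYPE     = c("HEAT", "HEAT", "TORNADO", "TORNADO", "TORNADO", "TORNADO"),
  FATALITIES = c(3, 1, 5, 0, 0, 0)
)

# Total fatalities and number of events per event type.
fat   <- tapply(toy$FATALITIES, toy$EVTYPE, sum)
count <- tapply(toy$FATALITIES, toy$EVTYPE, length)

# Intensity = fatalities per recorded event.
intensity <- fat / count
sort(intensity, decreasing = TRUE)
```

In this toy data TORNADO has the larger total (5 against 4), but HEAT has the higher intensity (2 against 1.25) because its fatalities are spread over fewer events.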
sdf<-mutate(sdf, Year= as.POSIXlt(as.Date(sdf$BGN_DATE, "%m/%d/%Y %H:%M:%S"))$year+1900)
Evfat<-as.vector(FatbyEv$EVTYPE[1:10])
sdff<-filter(sdf, EVTYPE %in% Evfat)
grf<-group_by(sdff, EVTYPE, Year)
FatbyEvY<-arrange(summarize(grf, Fat=sum(FATALITIES), Count=n()), desc(Fat))
p<-ggplot(FatbyEvY, aes(x=Year, y=Fat))+geom_line()+facet_wrap(~ EVTYPE, ncol=5)+scale_y_log10()
p<-p + ylab("Number of fatalities (log scale)")+ggtitle("NOAA Storm Data \nFatalities")
p<-p + annotation_logticks(base = 10)
print(p)
p<-ggplot(FatbyEvY, aes(x=Year, y=Count))+geom_line()+facet_wrap(~ EVTYPE, ncol=5)+scale_y_log10()
p<-p + ylab("Number of events (log scale)")+ggtitle("NOAA Storm Data \nNumber of events")
p<-p + annotation_logticks(base = 10)
print(p)
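The Year column computed at the top of the code above goes through as.POSIXlt and adds 1900. A possibly simpler equivalent, shown here as a sketch on two illustrative strings in the same layout as BGN_DATE (the sample values are assumptions), formats the parsed date as a four-digit year:

```r
# Illustrative values in the same "%m/%d/%Y %H:%M:%S" layout as BGN_DATE.
bgn <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# as.Date() parses the date part; format(..., "%Y") returns the four-digit year.
year <- as.integer(format(as.Date(bgn, "%m/%d/%Y %H:%M:%S"), "%Y"))
year
```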
The economic consequences of storm events are recorded in the variables PROPDMG and CROPDMG of the sdf data frame. As crops seem more difficult to protect from natural events, the study focuses only on property damage (PROPDMG).
The variable PROPDMGEXP acts as a multiplier of PROPDMG, but its values are not coded consistently. Only K for thousand, M for million and B for billion are used to generate a new variable PROPDMGc = PROPDMG * Multiplier; any other value is treated as a multiplier of 1.
sdf<-mutate(sdf, PROPDMGc = PROPDMG * ifelse(PROPDMGEXP == "K", 1E3, ifelse(PROPDMGEXP == "M",
1E6, ifelse(PROPDMGEXP == "B", 1E9, 1))))
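The nested ifelse calls can also be written as a lookup in a named vector, which is easier to extend if more codes are to be interpreted later. This is a sketch on made-up sample values; the names mult, dmg and code are hypothetical and not part of the analysis:

```r
# Hypothetical lookup table for the three exponent codes used in this report.
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Made-up sample values, including codes the report does not interpret.
dmg  <- c(25, 2.5, 1, 10, 5)
code <- c("K", "M", "B", "", "?")

# Look up each code; unmatched codes give NA, which is replaced by 1.
m <- unname(mult[match(code, names(mult))])
m[is.na(m)] <- 1
dmg * m
```

As in the report, any code other than K, M or B leaves the PROPDMG value unchanged.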
Property damage and number of events are grouped by Event Type using dplyr commands. The ten most costly events are shown:
gr<-group_by(sdf,EVTYPE)
DmgbyEv<-arrange(summarize(gr, Damage=sum(PROPDMGc), Count=n()), desc(Damage))
head(format(DmgbyEv, digits=5),10)
## EVTYPE Damage Count
## 1 FLOOD 1.4466e+11 25326
## 2 HURRICANE/TYPHOON 6.9306e+10 88
## 3 TORNADO 5.6926e+10 60652
## 4 STORM SURGE 4.3324e+10 261
## 5 FLASH FLOOD 1.6141e+10 54277
## 6 HAIL 1.5727e+10 288661
## 7 HURRICANE 1.1868e+10 174
## 8 TROPICAL STORM 7.7039e+09 690
## 9 WINTER STORM 6.6885e+09 11433
## 10 HIGH WIND 5.2700e+09 20212
Floods, hurricanes and tornadoes are the most costly.
A new variable named Intensity = Damage / number of events has been generated. The ten most intense events in terms of damage, among those with more than five occurrences, are shown:
DmgbyEvI<-mutate(DmgbyEv, Intensity= Damage / Count)
head(format(filter(arrange(DmgbyEvI, desc(Intensity)), Count >5), digits=5), 10)
## EVTYPE Damage Count Intensity
## 1 HURRICANE/TYPHOON 6.9306e+10 88 7.8757e+08
## 2 HURRICANE OPAL 3.1528e+09 9 3.5032e+08
## 3 STORM SURGE 4.3324e+10 261 1.6599e+08
## 4 SEVERE THUNDERSTORM 1.2054e+09 13 9.2720e+07
## 5 HURRICANE 1.1868e+10 174 6.8209e+07
## 6 TYPHOON 6.0023e+08 11 5.4566e+07
## 7 HURRICANE ERIN 2.5810e+08 7 3.6871e+07
## 8 STORM SURGE/TIDE 4.6412e+09 148 3.1359e+07
## 9 RIVER FLOOD 5.1189e+09 173 2.9589e+07
## 10 WILDFIRES 1.0050e+08 8 1.2562e+07
Per event, hurricanes are clearly the most costly: hurricane and typhoon types take five of the top seven places.
To see how the most costly events evolved over the years, the yearly property damage has been plotted, with one panel per event type.
Evdmg<-as.vector(DmgbyEv$EVTYPE[1:10])
sdfd<-filter(sdf, EVTYPE %in% Evdmg)
grd<-group_by(sdfd, EVTYPE, Year)
DmgbyEvY<-arrange(summarize(grd, Damage=sum(PROPDMGc)), desc(Damage))
p<-ggplot(DmgbyEvY, aes(x=Year, y=Damage))+geom_line()+facet_wrap(~ EVTYPE, ncol=5)+scale_y_log10()
p<-p + ylab("Damage $ (log scale)")+ggtitle("NOAA Storm Data \nProperty Damage")
p<-p + annotation_logticks(base = 10)
print(p)