In this report we describe which types of severe weather events were most harmful to population health and had the greatest economic consequences in the United States between 1950 and November 2011. There is no overall hypothesis; this is a descriptive analysis. We use all available storm data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database.
The data for this research are publicly available.
Some documentation of the database is also available. It describes how some of the variables are constructed and defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
I use the “curl” download method, as required on OS X. The session information is:
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] formatR_1.2 tools_3.2.0 htmltools_0.2.6 yaml_2.1.13
## [5] rmarkdown_0.5.1 knitr_1.10 stringr_0.6.2 digest_0.6.8
## [9] evaluate_0.7
Download file and read raw data:
fileLocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileDest <- "stormData.csv.bz2"
download.file(fileLocation, fileDest, method = "curl")
stormsData <- read.csv(bzfile(fileDest))
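Because the archive is large, it can help to skip the download when the file is already on disk. The following is a minimal sketch under that assumption; the helper name `downloadIfMissing` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original report): download the file
# only when the destination does not exist yet.
downloadIfMissing <- function(url, dest, method = "auto") {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed archive intact on all platforms
    download.file(url, dest, method = method, mode = "wb")
  }
  invisible(dest)
}

# Usage with the names defined above (commented out to avoid a network call):
# downloadIfMissing(fileLocation, fileDest, method = "curl")
# stormsData <- read.csv(bzfile(fileDest))
```

With this guard, re-running the report reuses the local copy instead of fetching the archive again.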
The dataset is large:
format(object.size(stormsData), units = "Mb")
## [1] "409.4 Mb"
head(stormsData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
To calculate damage amounts we need to scale the recorded values by their exponent codes (K, M, B):
normalizeValue <- function(power, value) {
    power <- toupper(power)
    if (power == "K") {return(value*1000)}
    if (power == "M") {return(value*1000000)}
    if (power == "B") {return(value*1000000000)}
    return(value)
}
stormsData$PROPDMG_VALUE <- normalizeValue(stormsData$PROPDMGEXP, stormsData$PROPDMG)
stormsData$CROPDMG_VALUE <- normalizeValue(stormsData$CROPDMGEXP, stormsData$CROPDMG)
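One caveat about the calls above: R's `if` tests a single condition, so when `normalizeValue` receives a whole column, only the first element of `power` decides the multiplier for every row (older R versions warn about this, and R 4.2 and later raise an error). A vectorized variant could look like the sketch below. The function name `normalizeValueVec` is hypothetical, and the assumption that only K, M and B scale the value while every other code leaves it unchanged is carried over from the function above. Totals computed this way would generally differ from the figures reported in this document.

```r
# Vectorized multiplier lookup (a sketch, not the report's original code).
# Assumption: only K, M and B (any case) scale the value; every other code,
# including the empty string, leaves the value unchanged.
normalizeValueVec <- function(power, value) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  # Look up each row's multiplier by name; unmatched codes give NA
  mult <- unname(multipliers[toupper(as.character(power))])
  mult[is.na(mult)] <- 1
  value * mult
}

# Each row gets its own multiplier
normalizeValueVec(c("K", "m", "", "B"), c(25, 2.5, 10, 1))
```

The lookup by name avoids any per-row `if` and also works when the exponent column is stored as a factor.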
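In this section I try to answer two questions: which types of events are most harmful to population health, and which have the greatest economic consequences.
Calculate the total number of fatalities by weather event type: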
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fatalitiesByEvent <- aggregate(FATALITIES ~ EVTYPE, stormsData, sum)
fatalitiesByEvent <- filter(fatalitiesByEvent, FATALITIES > 0)
fatalitiesByEvent <- arrange(fatalitiesByEvent, -FATALITIES)
Only the first 20 event types have more than 100 fatalities over the whole period:
fatalitiesByEvent[1:20,]
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
## 16 HEAVY SNOW 127
## 17 EXTREME COLD/WIND CHILL 125
## 18 STRONG WIND 103
## 19 BLIZZARD 101
## 20 HIGH SURF 101
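The table above contains several labels that appear to describe the same phenomenon under different names (for example TSTM WIND and THUNDERSTORM WIND, or RIP CURRENT and RIP CURRENTS), so their totals are split across rows. A minimal sketch of merging such variants before aggregating follows. The helper `canonicalEvent` and its mapping rules are hypothetical and not part of the original analysis; the toy data frame reuses four counts from the table.

```r
# Toy data standing in for fatalitiesByEvent (counts taken from the table above)
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "RIP CURRENT", "RIP CURRENTS"),
  FATALITIES = c(504, 133, 368, 204),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: which labels to merge is an assumption, not from the report
canonicalEvent <- function(evtype) {
  x <- toupper(trimws(as.character(evtype)))
  x <- sub("^TSTM", "THUNDERSTORM", x)          # expand the abbreviation
  x <- sub("^RIP CURRENTS$", "RIP CURRENT", x)  # fold the plural into the singular
  x
}

toy$EVTYPE <- canonicalEvent(toy$EVTYPE)
merged <- aggregate(FATALITIES ~ EVTYPE, toy, sum)
merged[order(-merged$FATALITIES), ]
```

Whether two labels really denote the same event type is a judgment call that should be checked against the database documentation.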
par(mar=c(5.1,6.1,4.1,2.1))
barplot(fatalitiesByEvent[10:1, "FATALITIES"],
        horiz = TRUE,
        las = 1,
        names.arg = fatalitiesByEvent[10:1, "EVTYPE"],
        cex.names = 0.5,
        main = "Top 10 fatality-causing events",
        xlab = "Number of fatalities")
Calculate the total number of injuries by weather event type:
injuriesByEvent <- aggregate(INJURIES ~ EVTYPE, stormsData, sum)
injuriesByEvent <- filter(injuriesByEvent, INJURIES > 0)
injuriesByEvent <- arrange(injuriesByEvent, -INJURIES)
Only the first 14 event types have more than 1000 injuries over the whole period:
injuriesByEvent[1:14,]
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
par(mar=c(5.1,6.1,4.1,2.1))
barplot(injuriesByEvent[10:1, "INJURIES"],
        horiz = TRUE,
        las = 1,
        names.arg = injuriesByEvent[10:1, "EVTYPE"],
        cex.names = 0.6,
        main = "Top 10 injury-causing events",
        xlab = "Number of injuries")
As the tables and plots show, tornadoes cause by far the most fatalities and injuries.
There is a very strong correlation between the fatality and injury totals per event type:
comparison <- merge(fatalitiesByEvent, injuriesByEvent)
cor(comparison$FATALITIES, comparison$INJURIES)
## [1] 0.9453787
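A Pearson correlation over per-event totals can be dominated by a single extreme event type such as TORNADO. As a rough cross-check, one could compare it with the rank-based Spearman correlation and with the Pearson value after leaving the largest event out. The sketch below does this on a seven-row subset copied from the two tables above, so its values are illustrative only and are not the result for the full merged data; the object names are hypothetical.

```r
# Subset of event types that appear in both tables above (illustrative only)
subsetTotals <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT",
             "LIGHTNING", "TSTM WIND", "FLOOD"),
  FATALITIES = c(5633, 1903, 978, 937, 816, 504, 470),
  INJURIES = c(91346, 6525, 1777, 2100, 5230, 6957, 6789),
  stringsAsFactors = FALSE
)

# Pearson uses the raw totals, Spearman only their ranks
pearson <- cor(subsetTotals$FATALITIES, subsetTotals$INJURIES)
spearman <- cor(subsetTotals$FATALITIES, subsetTotals$INJURIES,
                method = "spearman")

# Pearson again without the dominant event type
noTornado <- subsetTotals[subsetTotals$EVTYPE != "TORNADO", ]
pearsonNoTornado <- cor(noTornado$FATALITIES, noTornado$INJURIES)

c(pearson = pearson, spearman = spearman, pearsonNoTornado = pearsonNoTornado)
```

If the three values differ a lot, the high coefficient mostly reflects the single outlier rather than a relationship that holds across event types.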
For the economic impact of events I use the same type of analysis as in the previous section.
stormsData <- mutate(stormsData, CONSEQUENCES = PROPDMG_VALUE + CROPDMG_VALUE)
consequenceByEvent <- aggregate(CONSEQUENCES ~ EVTYPE, stormsData, sum)
consequenceByEvent <- filter(consequenceByEvent, CONSEQUENCES > 0)
consequenceByEvent <- arrange(consequenceByEvent, -CONSEQUENCES)
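The same aggregate, filter and sort steps are repeated for fatalities, injuries and economic consequences. They could be wrapped in one function, as in the base R sketch below; the name `totalsByEvent` is hypothetical and the toy data frame is made up for illustration.

```r
# Hypothetical helper: sum `column` per EVTYPE, drop zero totals, sort descending
totalsByEvent <- function(data, column) {
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- column
  totals <- totals[totals[[column]] > 0, ]
  totals <- totals[order(-totals[[column]]), ]
  rownames(totals) <- NULL
  totals
}

# Made-up toy data, only to show the call
toyEvents <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(3, 1, 2, 0)
)
toyTotals <- totalsByEvent(toyEvents, "FATALITIES")
toyTotals
```

With the report's data, a call such as `totalsByEvent(stormsData, "CONSEQUENCES")` would be expected to play the role of the three statements above.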
The top 20 event types by economic consequences:
head(consequenceByEvent, n = 20)
## EVTYPE CONSEQUENCES
## 1 TORNADO 3212358179
## 2 FLASH FLOOD 1420303790
## 3 TSTM WIND 1336074813
## 4 FLOOD 900106518
## 5 THUNDERSTORM WIND 876910961
## 6 HAIL 689272976
## 7 LIGHTNING 603355361
## 8 THUNDERSTORM WINDS 446311865
## 9 HIGH WIND 324748843
## 10 WINTER STORM 132722569
## 11 HEAVY SNOW 122254156
## 12 WILDFIRE 84463704
## 13 ICE STORM 66002359
## 14 STRONG WIND 62995427
## 15 HIGH WINDS 55626760
## 16 HEAVY RAIN 50853263
## 17 TROPICAL STORM 48429579
## 18 WILD/FOREST FIRE 39349140
## 19 FLASH FLOODING 28502276
## 20 URBAN/SML STREAM FLD 26054734
par(mar=c(5.1,6.1,4.1,2.1))
barplot(consequenceByEvent[10:1, "CONSEQUENCES"] / 1000000,
        horiz = TRUE,
        names.arg = consequenceByEvent[10:1, "EVTYPE"],
        cex.names = 0.5,
        las = 1,
        xlab = "Total damage (in millions $)",
        main = "Top 10 economic damage-causing events")
Again, tornadoes are in first place.