In this report we analyze the impact of specific types of weather events on public health (injuries and fatalities) and on property and crops, based on the storm database collected by the U.S. National Oceanic and Atmospheric Administration (NOAA) from 1950 to 2011. We use the estimates of fatalities, injuries, property damage and crop damage to decide which types of event are the most significant in each of these four areas.
#options to show all output and turn off scientific notation
knitr::opts_chunk$set(echo = TRUE)
options(scipen = 1)
library(ggplot2)
Now we need to download and read the file (if it's not already available).
#download the file containing the data (if necessary)
if (!"StormData.csv.bz2" %in% dir()) {
  download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
}
#read the file (if necessary)
if (!"stormData" %in% ls()) {
  #use the same file name as the download above (file names are case-sensitive on some systems)
  stormData <- read.csv("StormData.csv.bz2")
}
#Read the begin date for each record to get the year, creating a new column in the data frame
stormData$year <- as.numeric(format(as.Date(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
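The year extraction above can be checked on a couple of toy strings. This is only a sketch: the two dates below are made up for illustration and merely assume the same "month/day/year hour:minute:second" layout that the format string implies for BGN_DATE.

```r
# Toy values in the layout the format string expects (illustrative, not taken from the data set)
bgn <- c("4/18/1950 0:00:00", "8/29/2005 0:00:00")

# Parse to Date, then keep only the four-digit year as a number
yr <- as.numeric(format(as.Date(bgn, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
yr
```

Strings that do not match the format come back as NA, so counting NA values in the year column is a quick sanity check.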
Now we look at the top event types sorted by injuries and fatalities.
#we want to see the events with the most injuries/fatalities, so we add them up (by event type) and show the top 10
topInjuryEvents <- as.matrix(head(sort(tapply(stormData$INJURIES, stormData$EVTYPE, sum), decreasing=TRUE),10))
topFatalityEvents <- as.matrix(head(sort(tapply(stormData$FATALITIES, stormData$EVTYPE, sum), decreasing=TRUE),10))
topInjuryEvents <- as.data.frame(topInjuryEvents)
names(topInjuryEvents) <- "Injuries"
topFatalityEvents <- as.data.frame(topFatalityEvents)
names(topFatalityEvents) <- "Fatalities"
topInjuryEvents
##                   Injuries
## TORNADO              91346
## TSTM WIND             6957
## FLOOD                 6789
## EXCESSIVE HEAT        6525
## LIGHTNING             5230
## HEAT                  2100
## ICE STORM             1975
## FLASH FLOOD           1777
## THUNDERSTORM WIND     1488
## HAIL                  1361
topFatalityEvents
##                Fatalities
## TORNADO              5633
## EXCESSIVE HEAT       1903
## FLASH FLOOD           978
## HEAT                  937
## LIGHTNING             816
## TSTM WIND             504
## FLOOD                 470
## RIP CURRENT           368
## HIGH WIND             248
## AVALANCHE             224
Tornadoes cause far more injuries than any other event type: 91,346, roughly 13 times the next category (TSTM WIND, 6,957). Tornadoes also cause the most fatalities (5,633), followed by excessive heat (1,903) and flash floods (978).
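To put the two tables in proportion, the sketch below computes each event type's share of the top-10 total. The numbers are copied from the output above; `top10_share` is a hypothetical helper name, and the shares are relative to the ten listed events only, not to all events in the database.

```r
# Totals copied from the two top-10 tables above
injuries <- c("TORNADO" = 91346, "TSTM WIND" = 6957, "FLOOD" = 6789,
              "EXCESSIVE HEAT" = 6525, "LIGHTNING" = 5230, "HEAT" = 2100,
              "ICE STORM" = 1975, "FLASH FLOOD" = 1777,
              "THUNDERSTORM WIND" = 1488, "HAIL" = 1361)
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978,
                "HEAT" = 937, "LIGHTNING" = 816, "TSTM WIND" = 504,
                "FLOOD" = 470, "RIP CURRENT" = 368, "HIGH WIND" = 248,
                "AVALANCHE" = 224)

# Hypothetical helper: percentage of the top-10 total, rounded to one decimal
top10_share <- function(x) round(100 * x / sum(x), 1)

top10_share(injuries)["TORNADO"]
top10_share(fatalities)[1:3]
```

Within these lists, tornadoes account for a much larger share of injuries than of fatalities, which matches the reading of the tables above.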
ggplot(topFatalityEvents, aes(x = rownames(topFatalityEvents), y = Fatalities)) +
  geom_bar(stat = "identity") +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(topInjuryEvents, aes(x = rownames(topInjuryEvents), y = Injuries)) +
  geom_bar(stat = "identity") +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
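Because the x axis in the plots above is built from row names, ggplot2 orders the bars alphabetically rather than by size. One possible variation, sketched here on a toy data frame with three rows copied from the fatalities table, is to order the bars with `reorder()`. The `toy` data frame and the `EventType` column name are assumptions made for this illustration.

```r
library(ggplot2)

# Toy data frame with three rows copied from the fatalities table above
toy <- data.frame(EventType = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
                  Fatalities = c(5633, 1903, 978))

# reorder() sorts the factor levels by Fatalities; the minus sign makes it descending
p <- ggplot(toy, aes(x = reorder(EventType, -Fatalities), y = Fatalities)) +
  geom_col() +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

`geom_col()` is equivalent to `geom_bar(stat = "identity")`, so the rest of the plot is unchanged.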
Property and crop damage estimates are each stored in two separate fields: one containing a number (e.g. 2.5) and the other containing a multiplier (e.g. hundreds, thousands, millions, or billions of dollars). This is explained, albeit inadequately, in the codebook, and it is open to some interpretation because certain values are ambiguous or strange. The codebook can be found at http://ire.org/nicar/database-library/databases/storm-events/ (click on “Record Layout” and read the entry for PROPDMGEXP).
unique(stormData$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
That is to say, “M” or “m” probably indicates millions, “K” or “k” thousands, “B” billions, and “h” or “H” hundreds. There is also a blank multiplier, which we take to mean dollars. Symbols such as “?”, “-” and “4” are also used, and it is not clear whether, for instance, “8” indicates 10^8 (100 million) or something else, so we treat every other multiplier as 1. Luckily the vast majority of the records use the hundreds, thousands and millions multipliers, which we believe we have interpreted correctly. The crop damage data is similarly strange and we deal with it in the same way.
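The rule described above can also be written as a vectorised lookup, shown here on a small toy data frame. This is a sketch under the same interpretation of the codes, not a replacement for the analysis code: `mult_lookup`, `exp_to_mult` and the `toy` rows are hypothetical names and values introduced only for this example.

```r
# Named lookup table for the recognised exponent codes
mult_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a vector of exponent codes to numeric multipliers
exp_to_mult <- function(code) {
  m <- unname(mult_lookup[toupper(as.character(code))])
  m[is.na(m)] <- 1   # blank, "?", "-", "+", digits and anything else fall back to 1
  m
}

# Toy rows (made up): 2.5 million, 10 thousand, a blank code, and an ambiguous "8"
toy <- data.frame(PROPDMG = c(2.5, 10, 3, 7),
                  PROPDMGEXP = c("M", "k", "", "8"))
toy$PropDmgAmount <- toy$PROPDMG * exp_to_mult(toy$PROPDMGEXP)
toy$PropDmgAmount
```

Indexing a named vector by code handles the whole column in one step, which may be noticeably faster than calling a function once per row on a data set of this size.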
#We have to deal with both M and m, K and k, etc. forcing uppercase will save a few lines of code
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
#function we use to convert "B" to "1000000000" and "K" to "1000" as needed
mult <- function(t) {
  if (t == "B") 1e9
  else if (t == "M") 1e6
  else if (t == "K") 1e3
  else if (t == "H") 100
  else 1
}
#we apply the mult function to the entire data set and store the result in a new column
stormData$PropDmgMult <- sapply(stormData$PROPDMGEXP, mult)
stormData$CropDmgMult <- sapply(stormData$CROPDMGEXP, mult)
#now we multiply the storm damage by the multiplier to get the actual dollar amounts
stormData$PropDmgAmount <- stormData$PropDmgMult * stormData$PROPDMG
stormData$CropDmgAmount <- stormData$CropDmgMult * stormData$CROPDMG
#we add up the dollar amounts by event type and look at the top 10
topPropDmg <- as.matrix(head(sort(tapply(stormData$PropDmgAmount, stormData$EVTYPE, sum), decreasing = TRUE), 10))
topCropDmg <- as.matrix(head(sort(tapply(stormData$CropDmgAmount, stormData$EVTYPE, sum), decreasing = TRUE), 10))
topPropDmg
##                           [,1]
## FLOOD             144657709807
## HURRICANE/TYPHOON  69305840000
## TORNADO            56937160779
## STORM SURGE        43323536000
## FLASH FLOOD        16140812067
## HAIL               15732267543
## HURRICANE          11868319010
## TROPICAL STORM      7703890550
## WINTER STORM        6688497251
## HIGH WIND           5270046295
topCropDmg
##                          [,1]
## DROUGHT           13972566000
## FLOOD              5661968450
## RIVER FLOOD        5029459000
## ICE STORM          5022113500
## HAIL               3025954473
## HURRICANE          2741910000
## HURRICANE/TYPHOON  2607872800
## FLASH FLOOD        1421317100
## EXTREME COLD       1292973000
## FROST/FREEZE       1094086000
Floods, hurricanes/typhoons and tornadoes caused the most property damage. Several of the remaining entries are variations of these (e.g. Flash Flood alongside Flood, and a separate Hurricane category and Tropical Storm alongside Hurricane/Typhoon).
Drought is the largest single cause of crop damage (about $14.0 billion), roughly 2.5 times the next category, flood (about $5.7 billion).
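Since several event labels overlap, one possible refinement is to collapse related labels before summing. The sketch below does this for four property damage totals copied from the table above. The `group` mapping is a hypothetical choice made for illustration; which labels belong together is a judgment call that the data set itself does not settle.

```r
# Property damage totals copied from the top-10 table above
propDmg <- c("HURRICANE/TYPHOON" = 69305840000, "HURRICANE" = 11868319010,
             "FLOOD" = 144657709807, "FLASH FLOOD" = 16140812067)

# Hypothetical grouping: each original label maps to a merged label
group <- c("HURRICANE/TYPHOON" = "HURRICANE", "HURRICANE" = "HURRICANE",
           "FLOOD" = "FLOOD", "FLASH FLOOD" = "FLOOD")

# Sum the damage within each merged label and sort from largest to smallest
merged <- tapply(propDmg, group[names(propDmg)], sum)
sort(merged, decreasing = TRUE)
```

Under this particular grouping the ranking of the two merged categories does not change, but their totals grow, so the choice of grouping affects how large the gaps between categories appear.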