In this report we analuse storm damage data in order to determine what types of event are most damaging in the United States. We found that tornadoes had the greatest effect on population health, whereas flooding caused the greatest economic damage.
We use these libraries for this analysis:
library(knitr)
library(dplyr)
library(ggplot2)
The storm data is available here.
First we download the file:
if(!file.exists("StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
}
Now read the data into the raw_stormdata data frame:
raw_stormdata <- read.csv("StormData.csv.bz2")
We will only need certain fields for our analysis, so we’ll create a new stormdata data frame by discarding the unwanted fields. We will also add a year column which we will use to illustrate the quality of the data over time:
stormdata = subset(raw_stormdata, select=c(BGN_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
stormdata$YEAR <- as.numeric(format(as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2017c.1.0/zoneinfo/Europe/London'
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. So let’s find a good cut-off point.
Plot the number of events per year:
hist(stormdata$YEAR,main="Number of events per year",breaks = 50, xlab="Year", ylab="Total Number Of Events")
There’s a jump from 1993 to 1994, so let’s only keep the data from 1994 onwards:
stormdata <- stormdata[stormdata$YEAR >= 1994,]
Property and crop damage is given by the PROPDMG and CROPDMG columns respectively. Associated with each of these is an EXP column (PROPDMGEXP & CROPDMGEXP) which indicates the units for the damage value column.
According to the Storm Data Documentation
Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
However the situation is more complicated than that. Here is the range of PROPDMGEXP:
levels(stormdata$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
Here is the range of CROPDMGEXP:
levels(stormdata$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
which is a larger set of units.
According to this analysis the units have the following meanings:
H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character
- H,h = hundreds = 100
- K,k = kilos = thousands = 1,000
- M,m = millions = 1,000,000
- B,b = billions = 1,000,000,000
- (+) = 1
- (-) = 0
- (?) = 0
- black/empty character = 0
numeric 0..8 = 10
It would be useful to normalise all economic damage to the same unit. Here we apply multiplication factors for each unit as per the findings above:
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("H","h")] <- 100
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("K","k")] <- 1000
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("M","m")] <- 1000000
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("B","b")] <- 1000000000
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP == "+"] <- 1
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("-","?","")] <- 0
stormdata$PROPDMG_MULTIPLIER[stormdata$PROPDMGEXP %in% c("0","1","2","3","4","5","6","7","8")] <- 10
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("H","h")] <- 100
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("K","k")] <- 1000
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("M","m")] <- 1000000
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("B","b")] <- 1000000000
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP == "+"] <- 1
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("-","?","")] <- 0
stormdata$CROPDMG_MULTIPLIER[stormdata$CROPDMGEXP %in% c("0","1","2","3","4","5","6","7","8")] <- 10
Here the multipliers to property and crop damage thus calculated are applied and scaled to millions of USD.
stormdata$PROP_DAMAGE <- stormdata$PROPDMG * stormdata$PROPDMG_MULTIPLIER/10^6
stormdata$CROP_DAMAGE <- stormdata$CROPDMG * stormdata$CROPDMG_MULTIPLIER/10^6
We can use the sum of fatalities and injuries as an indicator of the effect on population health. Here we create a new column called POPULATION_DAMAGE which is the sum of FATALITIES and INJURIES:
stormdata$POPULATION_DAMAGE <- stormdata$FATALITIES + stormdata$INJURIES
stormdataByPopDamage <- aggregate(POPULATION_DAMAGE ~ EVTYPE, data=stormdata, FUN="sum")
Here we extract the top 10 most damaging event types:
stormdataByPopDamage <- arrange(stormdataByPopDamage, desc(POPULATION_DAMAGE))
stormdataByPopDamageTop10 <- stormdataByPopDamage[1:10,]
Plot the total population damage by event type:
qplot(stormdataByPopDamageTop10$EVTYPE, data=stormdataByPopDamageTop10, fill=stormdataByPopDamageTop10$EVTYPE, xlab="Event Type", ylab="Total Affected Population",
geom = "bar", weight=POPULATION_DAMAGE) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_discrete(guide=FALSE) + ggtitle("Top 10 Most Damaging Events - Population Health") + theme(plot.title = element_text(hjust = 0.5))
Calculate total economic damage as sum of propery and crop damage:
stormdata$ECONOMIC_DAMAGE <- stormdata$PROP_DAMAGE + stormdata$CROP_DAMAGE
stormdataByEcoDamage <- aggregate(ECONOMIC_DAMAGE ~ EVTYPE, data=stormdata, FUN="sum")
Extract the top 10 most economically damaging event types:
stormdataByEcoDamage <- arrange(stormdataByEcoDamage, desc(ECONOMIC_DAMAGE))
stormdataByEcoDamageTop10 <- stormdataByEcoDamage[1:10,]
Plot total economic damage by event type:
qplot(stormdataByEcoDamageTop10$EVTYPE, data=stormdataByEcoDamageTop10, fill=stormdataByEcoDamageTop10$EVTYPE, xlab="Event Type", ylab="Total Economic Damage (Million USD)",
geom = "bar", weight=ECONOMIC_DAMAGE) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_discrete(guide=FALSE) + ggtitle("Top 10 Most Damaging Events - Economic Damage") + theme(plot.title = element_text(hjust = 0.5))