suppressMessages(library(dplyr))
suppressMessages(library(reshape))
suppressMessages(library(lubridate))
library(ggplot2)
In this project we will examine what type of events are most harmful to population health. Particularly, we will focus on injury data. It is natural to include data on fatalities too – which we will lump together with injuries – since injuries can be life threatening. In fact, the data does not show how many injuries were life-threatening and which were not so I will measure population health as the combined effects of injuries and fatalities as one variable called casualties.
The analysis will then determine which types of events cause the most economic damage. This will likely be best shown using bar charts comparing the event to the total cost over the 62 year period. Damage from hurricanes at the outset appears to dwarf the economic damages caused by other events but it turns out that floods cause the most damage.
The first step is to clean the data, so you will see many different transformations of the data. Data that cannot be determined exactly (ambiguous data) will left out in the anlysis for the reasons explained below.
stormData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", stringsAsFactors = FALSE)
Keep only the relevant variables for the analysis.
stormData <- select(stormData, EVTYPE, BGN_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
stormData$BGN_DATE <- as.Date(strptime(stormData$BGN_DATE, "%m/%d/%Y"))
Some of the units in the columns PROPDMGEXP and CROPDMGEXP are wrong. The units should be K/k for thousands, B/b for billions, M/m for millions, H/h for hundreds or nothing. Instead, there are the symbols ‘?’, +‘, and the numbers 0 - 8 for values. Since the number of observations with theses values total 340, I will simply delete these observations from the analysis. It’s very difficult to go through the ’remarks’ and try to determine whether the damage was in hundreds, thousands, millions, or billions.
This is reasonable to do for two reasons. 1) there are nearly 1 million observations, so the ommission of 340 observations is insignificant in both the size and in the amount of damage incurred. 2) If the damage was excessive, say millions or billions, then more care would have been taken to ensure that the correct dollar amount had been entered. It’s safe to assume that for these observations (the ones with erroneous values), the damage is probably in the thousands and in some cases, maybe millions.
v1 <- which(stormData[, "PROPDMGEXP"] %in% c("+", "?", "0", "1", "2", "3", "4", "5", "6",
"7", "8", "-"))
v2 <- which(stormData[, "CROPDMGEXP"] %in% c("0", "?", "2"))
v <- c(v1, v2)
l <- 1:dim(stormData)[1]
stormData <- stormData[!(l %in% v), ]
Change lower case units h/k/m/b to H/K/M/B respectively and convert the PROPDMG variable to the correct dollar amount
stormData[stormData$PROPDMGEXP == "h", "PROPDMGEXP"] <- "H"
stormData[stormData$PROPDMGEXP == "m", "PROPDMGEXP"] <- "M"
stormData[stormData$CROPDMGEXP == "m", "CROPDMGEXP"] <- "M"
stormData[stormData$CROPDMGEXP == "k", "CROPDMGEXP"] <- "K"
stormData[stormData$PROPDMGEXP == "B", "PROPDMG"] <-
1e9*stormData[stormData$PROPDMGEXP == "B", "PROPDMG"]
stormData[stormData$PROPDMGEXP == "M", "PROPDMG"] <-
1e6*stormData[stormData$PROPDMGEXP == "M", "PROPDMG"]
stormData[stormData$PROPDMGEXP == "K", "PROPDMG"] <-
1e3*stormData[stormData$PROPDMGEXP == "K", "PROPDMG"]
stormData[stormData$PROPDMGEXP == "H", "PROPDMG"] <-
1e2*stormData[stormData$PROPDMGEXP == "H", "PROPDMG"]
stormData[stormData$CROPDMGEXP == "B", "CROPDMG"] <-
1e9*stormData[stormData$CROPDMGEXP == "B", "CROPDMG"]
stormData[stormData$CROPDMGEXP == "M", "CROPDMG"] <-
1e6*stormData[stormData$CROPDMGEXP == "M", "CROPDMG"]
stormData[stormData$CROPDMGEXP == "K", "CROPDMG"] <-
1e3*stormData[stormData$CROPDMGEXP == "K", "CROPDMG"]
Change event names in EVTYPE to contain lower case letters i.e. TORNADO -> Tornado. I found the SimpleCap function in the help menu.
stormData$EVTYPE <- tolower(stormData$EVTYPE)
simpleCap <- function(x){
s <- strsplit(x, " ")[[1]]
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse =" ")
}
stormData$EVTYPE <- sapply(stormData$EVTYPE, simpleCap)
First determine which events have the most casualties (injuries + fatalities). Only data from the most impactful events will be analyzed.
stormData$casualties <- stormData$INJURIES + stormData$FATALITIES
casualtyTally <- sort(tapply(stormData$casualties, stormData$EVTYPE, sum), decreasing = TRUE)
mostCasualties <- names(casualtyTally)[1:10]
injuryTally <- sort(tapply(stormData$INJURIES, stormData$EVTYPE, sum), decreasing = TRUE)
injuryTotal <- sum(injuryTally)
otherInjury <- injuryTally[!(names(injuryTally) %in% mostCasualties)]
otherInjuryTotal <- sum(otherInjury)
injuryRatio <- otherInjuryTotal/injuryTotal
injuryTally <- injuryTally[mostCasualties]
fatalityTally <- sort(tapply(stormData$FATALITIES, stormData$EVTYPE, sum), decreasing = TRUE)
fatalityTotal <- sum(fatalityTally)
otherFatality <- fatalityTally[!(names(fatalityTally) %in% mostCasualties)]
otherFatalityTotal <- sum(otherFatality)
fatalityRatio <- otherFatalityTotal/fatalityTotal
fatalityTally <- fatalityTally[mostCasualties]
casualty <- data.frame(injury = injuryTally, fatality = fatalityTally)
casualty$idv <- rownames(casualty)
casualty <- melt(casualty, id.vars = "idv")
casualty$idv <- with(casualty, reorder(idv, value, sum))
casualty$idv <- with(casualty, factor(idv, levels = rev(levels(idv))))
tempV1 <- data.frame(idv = "Other", variable = "injury", value = otherInjuryTotal)
tempV2 <- data.frame(idv = "Other", variable = "fatality", value = otherFatalityTotal)
casualty <- rbind(casualty, tempV1, tempV2)
g <- ggplot(casualty, aes(idv, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("Event") +
ylab("Count") +
labs(title = "Number of Injuries/Fatalities for the 10 Worst Events over 1950 -- 2011")
print(g)
The injury/fatality amount that corresponds to ‘Other’ variable accounts for 10.67% and 22.92% of the total injury data and fatality data respectively. The graph suggests that tornados are the most dangerous overall.
Only the most economically impactful events will be analyzed in what follows.
stormData$cropPropDamage <- stormData$PROPDMG + stormData$CROPDMG
economicTally <- sort(tapply(stormData$cropPropDamage, stormData$EVTYPE, sum), decreasing = TRUE)
worstEconomicEvents <- names(economicTally)[1:10]
cropTally <- sort(tapply(stormData$CROPDMG, stormData$EVTYPE, sum), decreasing = TRUE)
totalCropDamage <- sum(cropTally)
otherCropDamage <- cropTally[!(names(cropTally) %in% worstEconomicEvents)]
otherCropDamageTotal <- sum(otherCropDamage)
cropRatio <- otherCropDamageTotal/totalCropDamage
cropTally <- cropTally[worstEconomicEvents]
propTally <- sort(tapply(stormData$PROPDMG, stormData$EVTYPE, sum), decreasing = TRUE)
totalPropDamage <- sum(propTally)
otherPropDamage <- propTally[!(names(propTally) %in% worstEconomicEvents)]
otherPropDamageTotal <- sum(otherPropDamage)
propRatio <- otherPropDamageTotal/totalPropDamage
propTally <- propTally[worstEconomicEvents]
damage <- data.frame(crop = cropTally, property = propTally)
damage$rn <- rownames(damage)
damage <- melt(damage, id.vars = "rn")
damage$rn <- with(damage, reorder(rn, value, sum))
damage$rn <- with(damage, factor(rn, levels = rev(levels(rn))))
tempV1X <- data.frame(rn = "Other", variable = "crop", value = otherCropDamageTotal)
tempV2X <- data.frame(rn = "Other", variable = "property", value = otherPropDamageTotal)
damage <- rbind(damage, tempV1X, tempV2X)
gg <- ggplot(damage, aes(rn, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("Event") +
ylab("Dollars") +
labs(title = "Total Damage Incurred by the 10 Worst Events over 1950 -- 2011")
print(gg)
The crop and property amount that corresponds to ‘Other’ variable accounts for 18.78% and 13.86% of the total crop damage and property damage respectively. This data suggests that a considerable amount of resources need to be allocated for areas that get flooded.
This code chunk takes the 5 most economically disastrous events overall (1950 - 2011) and computes the total amount of damage in dollars (crop + property) every year for those events for the years 2000 - 2011.
yearVec <- c(2000:2011)
storm <- stormData[stormData$EVTYPE %in% worstEconomicEvents, ]
storm <- filter(storm, year(BGN_DATE) %in% yearVec)
d <- data.frame(year = yearVec)
j <- 2
for(i in worstEconomicEvents[1:5]){
df <- filter(storm, EVTYPE == i)
if(!dim(df)[1]){
d[, j] <- 0
} else{
vt <- with(df, aggregate(cropPropDamage, list(year(BGN_DATE)), sum))
d[which(d$year %in% vt$Group.1), j] <- vt$x
}
j <- j + 1
}
names(d) <- c("year", worstEconomicEvents[1:5])
d <- melt(d, id.vars = "year")
names(d)[2] <- "Event"
d[is.na(d$value), "value"] <- 0
ggg <- ggplot(d, aes(year, value)) + geom_line(aes(color = Event)) +
xlab("year") +
ylab("dollars") +
labs(title = "Total Cost in Dollars per Year For the 5 Worst Economic Events from 2000 - 2011")
print(ggg)