This study addresses the following questions:
Across the United States, which types of severe weather events are most harmful with respect to population health?
Across the United States, which types of severe weather events have the greatest economic consequences?
To answer these questions, data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database are used. The data may be accessed here. A detailed description of them can be found here.
Read the data from the given file:
StormData <- read.csv("repdata-data-StormData.csv", header = TRUE, stringsAsFactors = FALSE)
The data are loaded as a data.frame with 902297 observations of 37 variables.
For the purposes of this study, however, the variables of interest are:
EVTYPE: A description of the severe weather phenomenon.
FATALITIES: The number of fatalities caused by the event.
INJURIES: The number of injuries caused by the event.
PROPDMG/PROPDMGEXP: The estimated dollar amount (and exponent) of property damage caused by the event.
CROPDMG/CROPDMGEXP: The estimated dollar amount (and exponent) of crop damage caused by the event.
According to the Storm Events Database webpage, all 48 severe weather event types have been recorded
only since 1996. Therefore, we keep only the subset of the data frame whose begin dates fall on or
after 1 January 1996. We also remove rows with a missing (NA) event type.
StormData$BGN_DATE <- strptime(StormData$BGN_DATE, format = "%m/%d/%Y 0:00:00")
StormData2 <- StormData[StormData$BGN_DATE >= "1996-01-01",c('EVTYPE','FATALITIES','INJURIES','PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
StormData2 <- StormData2[!is.na(StormData2$EVTYPE) ,]
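The filtering step can be illustrated on a small made-up data frame. This is only a sketch: the rows and the names `toy` and `toy_recent` are hypothetical, and `as.Date()` is swapped in for `strptime()` as one possible way to parse the date part of the same `BGN_DATE` strings.

```r
# Toy rows standing in for the storm data (values are made up for illustration).
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "7/4/2005 0:00:00"),
  EVTYPE   = c("TORNADO", NA, "HAIL"),
  stringsAsFactors = FALSE
)

# Parse only the date part; characters after the matched format are ignored.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep rows dated 1996-01-01 or later, then drop rows with a missing event type.
toy_recent <- toy[toy$BGN_DATE >= as.Date("1996-01-01"), ]
toy_recent <- toy_recent[!is.na(toy_recent$EVTYPE), ]
```

Only the 2005 hail row survives both filters: the 1950 row is too early and the 1996 row has no event type.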
Another peculiarity of this dataset is that event types were entered in the EVTYPE variable in many
inconsistent ways, so the same phenomenon appears under several different descriptions.
length(unique(StormData2$EVTYPE))
## [1] 952
To reduce this fuzziness, we must map all 952 distinct event descriptions to the 48 valid event
types. This is done by assigning each EVTYPE entry to the valid event type at the minimum
“string distance” (edit distance) from it.
valid_event_types <- c("astronomical low tide","avalanche","blizzard","coastal flood","cold wind chill",
"debris flow","dense fog","dense smoke","drought","dust devil","dust storm",
"excessive heat","extreme cold wind chill","flash flood","flood","frost freeze",
"funnel cloud","freezing fog","hail","heat","heavy rain","heavy snow","high surf",
"high wind","hurricane typhoon","ice storm","lake effect snow","lakeshore flood",
"lightning","marine hail","marine high wind","marine strong wind",
"marine thunderstorm wind","rip current","seiche","sleet","storm surge tide",
"strong wind","thunderstorm wind","tornado","tropical depression","tropical storm",
"tsunami","volcanic ash","waterspout","wildfire","winter storm","winter weather")
StormData2$EVTYPE <- tolower(StormData2$EVTYPE)
StormData2$EVTYPE[StormData2$EVTYPE == "tstm wind"] <- "thunderstorm wind"
StormData2$validevent <- valid_event_types[apply(adist(StormData2$EVTYPE, valid_event_types), 1, which.min)]
Now let’s check the number of new (valid) event types:
length(unique(StormData2$validevent))
## [1] 48
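To make the minimum string distance idea concrete, here is a minimal sketch on a handful of strings. The vectors `raw_types` and `valid_types` are hypothetical and much shorter than the real inputs; the matching logic is the same `adist()` plus `which.min` combination used above.

```r
# Hypothetical raw descriptions and a short list of valid types, for illustration only.
raw_types   <- c("tornados", "flash floods", "heavy snows")
valid_types <- c("tornado", "flash flood", "flood", "heavy snow", "hail")

# adist() returns a matrix of edit distances:
# one row per raw string, one column per valid type.
d <- adist(raw_types, valid_types)

# For each row, pick the column with the smallest distance
# (which.min returns the first minimum when there is a tie).
matched <- valid_types[apply(d, 1, which.min)]
matched
```

The nearest match is only a heuristic. An abbreviation such as "tstm wind" is closer in edit distance to "high wind" than to "thunderstorm wind", which is presumably why it is recoded explicitly before the matching step.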
Next we aggregate the numbers of injuries and fatalities by valid event type:
populationhealth <- aggregate(cbind(INJURIES,FATALITIES)~validevent,data=StormData2,FUN=sum)
top20fatalities <- populationhealth[order(populationhealth[,3], decreasing = TRUE)[1:20] ,]
top20injuries <- populationhealth[order(populationhealth[,2], decreasing = TRUE)[1:20] ,]
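The aggregation and ranking can be sketched on a toy data frame. All values below are made up, and the names `toy`, `health` and `top_fatal` are hypothetical. The sketch also uses `head()` instead of indexing with `[1:20]`, an assumption on our part that avoids NA rows if fewer event types exist than the number requested.

```r
# Made-up casualty counts for three event types.
toy <- data.frame(
  validevent = c("tornado", "tornado", "heat", "flood"),
  INJURIES   = c(10, 5, 2, 0),
  FATALITIES = c(1, 2, 4, 0),
  stringsAsFactors = FALSE
)

# Sum both outcome columns within each event type.
health <- aggregate(cbind(INJURIES, FATALITIES) ~ validevent, data = toy, FUN = sum)

# Rank by fatalities, largest first, and keep the top two rows.
top_fatal <- head(health[order(health$FATALITIES, decreasing = TRUE), ], 2)
top_fatal
```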
We turn our attention to the economic aspect of this assignment. First we must convert the exponents PROPDMGEXP and CROPDMGEXP to a usable format:
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "" | StormData2$PROPDMGEXP == "0" | StormData2$PROPDMGEXP == "?"]<- 1
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "-" | StormData2$PROPDMGEXP == "+"] <- 0
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "h" | StormData2$PROPDMGEXP == "H"] <- 10^2
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "k" | StormData2$PROPDMGEXP == "K"] <- 10^3
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "m" | StormData2$PROPDMGEXP == "M"] <- 10^6
StormData2$PROPDMGEXP[StormData2$PROPDMGEXP == "b" | StormData2$PROPDMGEXP == "B"] <- 10^9
StormData2$PROPDMGEXP <- as.numeric(StormData2$PROPDMGEXP)
StormData2$CROPDMGEXP[StormData2$CROPDMGEXP == "" | StormData2$CROPDMGEXP == "?"] <- 1
StormData2$CROPDMGEXP[StormData2$CROPDMGEXP == "h" | StormData2$CROPDMGEXP == "H"] <- 10^2
StormData2$CROPDMGEXP[StormData2$CROPDMGEXP == "k" | StormData2$CROPDMGEXP == "K"] <- 10^3
StormData2$CROPDMGEXP[StormData2$CROPDMGEXP == "m" | StormData2$CROPDMGEXP == "M"] <- 10^6
StormData2$CROPDMGEXP[StormData2$CROPDMGEXP == "b" | StormData2$CROPDMGEXP == "B"] <- 10^9
StormData2$CROPDMGEXP <- as.numeric(StormData2$CROPDMGEXP)
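The same conversion can also be written as a lookup table, which keeps the rules in one place. The following is a sketch, not the method used above: the names `exp_multiplier` and `to_multiplier` are hypothetical, and the table applies the property-damage rules from the text (the crop rules listed above cover only a subset of these codes).

```r
# Multipliers follow the conversion rules in the text; the names are illustrative.
exp_multiplier <- c("0" = 1, "?" = 1, "-" = 0, "+" = 0,
                    "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

to_multiplier <- function(code) {
  code <- toupper(code)                 # treat "k" and "K" alike
  out  <- unname(exp_multiplier[code])  # NA for empty or unknown codes
  out[code %in% ""] <- 1                # an empty code means a multiplier of 1
  out
}

to_multiplier(c("k", "M", "", "B", "+"))
```

Codes that are not in the table come back as NA instead of being coerced silently, which makes unexpected entries easy to spot.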
After computing the dollar cost of each event (damage amount times its multiplier), we aggregate the damage costs by valid event type:
StormData2$propdmg <- StormData2$PROPDMG*StormData2$PROPDMGEXP
StormData2$cropdmg <- StormData2$CROPDMG*StormData2$CROPDMGEXP
damage <- aggregate(cbind(propdmg,cropdmg)~validevent,data=StormData2,FUN=sum)
top20propdmg <- damage[order(damage[,2], decreasing = TRUE)[1:20] ,]
top20cropdmg <- damage[order(damage[,3], decreasing = TRUE)[1:20] ,]
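As a worked example of the cost arithmetic, the sketch below multiplies made-up damage amounts by multipliers that are already numeric, sums them by event type, and expresses the totals in billions of USD. The data frame `toy` and the column `billions` are hypothetical.

```r
# Made-up damage records; multipliers are assumed to be numeric already.
toy <- data.frame(
  validevent = c("flood", "flood", "hail"),
  PROPDMG    = c(2.5, 500, 10),
  PROPDMGEXP = c(1e9, 1e3, 1e6),
  stringsAsFactors = FALSE
)

# Dollar cost per record: 2.5 * 1e9, 500 * 1e3 and 10 * 1e6.
toy$propdmg <- toy$PROPDMG * toy$PROPDMGEXP

# Total cost per event type, then rescaled to billions of USD.
totals <- aggregate(propdmg ~ validevent, data = toy, FUN = sum)
totals$billions <- totals$propdmg / 1e9
totals
```

Here the two flood records add up to 2,500,500,000 USD (about 2.5 billion) and the single hail record to 10,000,000 USD (0.01 billion).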
par(mfrow=c(1,2), mar = c(2, 6, 1, 0.5), cex.axis = 0.7, las = 1, oma = c(1,1,1,1))
barplot(top20fatalities$FATALITIES, names.arg = top20fatalities$validevent, col = "black",
main = "Total Number of Fatalities", horiz = TRUE)
barplot(top20injuries$INJURIES, names.arg = top20injuries$validevent, col = "red",
main = "Total Number of Injuries", horiz = TRUE)
From this analysis we can conclude that tornadoes cause the most fatalities and the most injuries,
while excessive heat is the second most fatal severe weather phenomenon.
par(mfrow=c(1,2), mar = c(4, 8, 2, 1), cex.axis = 0.7, cex.lab = 0.85, cex.main = 0.9, las = 1)
barplot(top20propdmg$propdmg/10^9, names.arg = top20propdmg$validevent, col = "magenta",
main = "Total Cost of Property Damage", xlab = "Damage in Billions USD", horiz = TRUE)
barplot(top20cropdmg$cropdmg/10^9, names.arg = top20cropdmg$validevent, col = "cyan",
main = "Total Cost of Crop Damage", xlab = "Damage in Billions USD", horiz = TRUE)
These plots show that tornadoes cause the costliest property damage and droughts the costliest crop damage.