Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data for this assignment come from the National Weather Service Storm Data. The events in the database start in the year 1950 and end in November 2011.
Load the R packages needed for the analysis.
library(dplyr)
library(ggplot2)
library(knitr)
library(reshape2)
Load the CSV file and keep only the variables needed for the analysis. We also keep only the subset of records with at least one fatality or injury.
stormData <- read.csv("repdata-data-StormData.csv.bz2")
damage <- subset(stormData,FATALITIES > 0 | INJURIES > 0, select = c(BGN_DATE,STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,CROPDMG,CROPDMGEXP))
Note that the event categories are not perfectly clean: there are misspelled and duplicated categories. In further research, it would be worthwhile to clean the data in more detail. However, due to time constraints, we performed only the basic processing steps: converting all characters to upper case and trimming leading and trailing spaces from the category names.
damage$EVTYPE <- toupper(damage$EVTYPE)
damage$EVTYPE <- trimws(damage$EVTYPE)
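As a sketch of what more detailed cleaning might look like, the snippet below merges a few spelling variants into one label using a small lookup vector. The helper `normalize_evtype` and the particular mappings are illustrative assumptions, not part of the original analysis, and applying them would change the totals reported in the results.

```r
# Hypothetical helper (not part of the original analysis): upper-case, trim,
# then map known spelling variants onto a single canonical label.
normalize_evtype <- function(x) {
  x <- trimws(toupper(x))
  # Illustrative mapping only; a full cleanup would need many more entries.
  variants <- c("TSTM WIND"          = "THUNDERSTORM WIND",
                "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
                "RIP CURRENTS"       = "RIP CURRENT")
  hit <- x %in% names(variants)
  x[hit] <- unname(variants[x[hit]])
  x
}

# Toy input, invented for illustration
example_types <- c(" tstm wind", "Thunderstorm Winds", "TORNADO ", "Rip Currents")
normalize_evtype(example_types)
```

Keeping the mapping in one named vector makes it easy to review and extend as further variants are found.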
The public health data need to be summarized to show the number of injuries and fatalities by event type. Then the top 10 events by injuries and the top 10 events by fatalities are selected.
# sums of injuries and fatalities
sumhealth <- damage %>% group_by(EVTYPE) %>% summarise(fatalities = sum(FATALITIES),injuries = sum(INJURIES))
## Select ten most harmful events
topfatalities <- head(sumhealth[order(sumhealth$fatalities, decreasing = TRUE), ], n = 10)[, c(1, 2)]
topinjuries <- head(sumhealth[order(sumhealth$injuries, decreasing = TRUE), ], n = 10)[, c(1, 3)]
## Prepare data for plotting
topfatalitiesPlot <- melt(topfatalities)
topinjuriesPlot <- melt(topinjuries)
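To make the aggregation step concrete, here is a small base R sketch on invented toy data (the numbers are made up, not taken from the NOAA database). It mirrors the group-and-sum followed by ordering, using `aggregate()` as a stand-in for the dplyr pipeline.

```r
# Toy data, invented for illustration
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 2, 1, 0, 4),
  INJURIES   = c(10, 0, 5, 2, 1)
)

# Sum fatalities and injuries per event type
# (base R analogue of group_by() + summarise())
toy_sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Order by fatalities, largest first, and keep the top two rows
toy_top <- head(toy_sum[order(toy_sum$FATALITIES, decreasing = TRUE), ], n = 2)
toy_top
```

In this toy case HEAT ranks first by fatalities even though TORNADO has far more injuries, which is why the analysis ranks the two measures separately.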
The economic damage data are stored as a base value and a multiplier code, so we multiply each base value by its multiplier. We add a new column, Prop_mult, to hold the property damage multiplier.
# Property damage multiplier
damage$Prop_mult <- 0
damage$Prop_mult[damage$PROPDMGEXP == ""] <- 1
damage$Prop_mult[grep("[-+?]", damage$PROPDMGEXP)] <- 1
damage$Prop_mult[grep("[Hh]", damage$PROPDMGEXP)] <- 100
damage$Prop_mult[grep("[Kk]", damage$PROPDMGEXP)] <- 1000
damage$Prop_mult[grep("[Mm]", damage$PROPDMGEXP)] <- 1e+06
damage$Prop_mult[grep("[Bb]", damage$PROPDMGEXP)] <- 1e+09
damage$Prop_mult <- as.numeric(damage$Prop_mult)
damage$PROPDMG <- damage$Prop_mult * damage$PROPDMG
# Crop damage multiplier: prepare and use to multiply the damage
damage$Crop_mult <- 0
damage$Crop_mult[damage$CROPDMGEXP == ""] <- 1
damage$Crop_mult[grep("[-+?]", damage$CROPDMGEXP)] <- 1
damage$Crop_mult[grep("[Hh]", damage$CROPDMGEXP)] <- 100
damage$Crop_mult[grep("[Kk]", damage$CROPDMGEXP)] <- 1000
damage$Crop_mult[grep("[Mm]", damage$CROPDMGEXP)] <- 1e+06
damage$Crop_mult[grep("[Bb]", damage$CROPDMGEXP)] <- 1e+09
damage$Crop_mult <- as.numeric(damage$Crop_mult)
damage$CROPDMG <- damage$Crop_mult * damage$CROPDMG
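The same mapping could be written once as a lookup vector and reused for both exponent columns. The sketch below is one possible refactoring under that assumption, with a hypothetical helper `exp_to_mult`. Like the code above, it leaves any unlisted code (for example a digit) at 0.

```r
# Hypothetical helper (not part of the original analysis): translate
# exponent codes such as "K" or "m" into numeric multipliers.
exp_to_mult <- function(exp_code) {
  lookup <- c("-" = 1, "+" = 1, "?" = 1,
              "H" = 100, "K" = 1e3, "M" = 1e6, "B" = 1e9)
  code <- toupper(as.character(exp_code))
  mult <- unname(lookup[code])   # unlisted codes (and "") come back as NA
  mult[code %in% ""] <- 1        # blank exponent: use the base value as is
  mult[is.na(mult)] <- 0         # any other code: multiplier 0, as above
  mult
}

# Toy codes and base values, invented for illustration
codes <- c("", "K", "m", "B", "h", "+", "5")
base  <- c(25, 2.5, 1, 0.5, 3, 10, 7)
data.frame(code = codes, multiplier = exp_to_mult(codes),
           damage = base * exp_to_mult(codes))
```

With such a helper, the property and crop columns could each be converted in a single call, for example `damage$PROPDMG * exp_to_mult(damage$PROPDMGEXP)`.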
Similar to the health data processing, the economic damage figures are first summarized by event type. Subsequently, the top 10 events with the highest economic impact (defined as damage to crops plus damage to property) are selected.
# Sum crop and property damage by event type
sumecon <- damage %>% group_by(EVTYPE) %>% summarise(cropdmg = sum(CROPDMG), propdmg = sum(PROPDMG))
sumecon$totaldamage <- sumecon$cropdmg + sumecon$propdmg
## Select top 10
topecon <- head(sumecon[order(sumecon$totaldamage, decreasing = TRUE), ], n = 10)
## Prepare data for plotting
topeconPlot <- topecon %>% select(-totaldamage) %>% melt
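For readers unfamiliar with `melt`, the following sketch builds the same long layout by hand in base R on invented numbers: one row per event type and damage kind, which is the shape needed for stacked bars. The toy values and the names `wide` and `long` are assumptions for illustration.

```r
# Toy wide table, invented for illustration
wide <- data.frame(
  EVTYPE  = c("TORNADO", "FLOOD"),
  cropdmg = c(10, 40),
  propdmg = c(500, 300)
)

# One row per (event type, damage kind), mirroring the
# EVTYPE / variable / value columns that melt() returns
long <- data.frame(
  EVTYPE   = rep(wide$EVTYPE, times = 2),
  variable = factor(rep(c("cropdmg", "propdmg"), each = nrow(wide)),
                    levels = c("cropdmg", "propdmg")),
  value    = c(wide$cropdmg, wide$propdmg)
)
long

# The height of each stacked bar is the sum of its segments
tapply(long$value, as.character(long$EVTYPE), sum)
```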
Question 1: Public health
The following table and chart present the 10 most damaging events from the perspective of fatalities.
topfatalities
## Source: local data frame [10 x 2]
##
## EVTYPE fatalities
## (chr) (dbl)
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## Make the barchart
ggplot(topfatalitiesPlot, aes(x = factor(EVTYPE), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = -270),
plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Number of fatalities") +
theme(legend.position = "none") + ggtitle("Fatalities")
Tornadoes caused the most fatalities between 1950 and 2011 (5,633), followed by excessive heat (1,903).
The following table and chart present the 10 most damaging events from the perspective of injuries.
topinjuries
## Source: local data frame [10 x 2]
##
## EVTYPE injuries
## (chr) (dbl)
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## Make the barchart
ggplot(topinjuriesPlot, aes(x = factor(EVTYPE), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = -270),
plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Number of injuries") +
theme(legend.position = "none") + ggtitle("Injuries")
Tornadoes also caused by far the most injuries between 1950 and 2011 (91,346), more than ten times the count for the next category, TSTM WIND (6,957). By both measures, tornadoes are the most harmful event type for public health in this analysis.
Question 2: Economic damage
The following table and chart present the 10 most damaging events from the perspective of economic damage.
topecon
## Source: local data frame [10 x 4]
##
## EVTYPE cropdmg propdmg totaldamage
## (chr) (dbl) (dbl) (dbl)
## 1 TORNADO 141477000 41885103080 42026580080
## 2 HURRICANE/TYPHOON 2273120800 32747770000 35020890800
## 3 FLOOD 127602500 6632851640 6760454140
## 4 TROPICAL STORM 157265000 6560156000 6717421000
## 5 WINTER STORM 10283000 5221162500 5231445500
## 6 HURRICANE 1406720000 2812660000 4219380000
## 7 STORM SURGE/TIDE 0 4004000000 4004000000
## 8 FLASH FLOOD 47966750 3901610800 3949577550
## 9 HAIL 166300000 3505785700 3672085700
## 10 WILDFIRE 182087000 3484359200 3666446200
## Make the barchart
ggplot(topeconPlot, aes(x = factor(EVTYPE), y = value, fill = variable)) +
geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = -270),
plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Economic damage") +
scale_fill_discrete(name = "Type of damage", labels = c("Crop", "Property")) +
theme(legend.position = "top") + ggtitle("Economic impact")
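According to this analysis of data from 1950 to 2011, tornadoes had the highest economic impact (about 42.0 billion US dollars in combined property and crop damage), followed by the HURRICANE/TYPHOON category (about 35.0 billion). Because the data were restricted earlier to events with at least one fatality or injury, these totals exclude damage from events that caused no casualties.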