In this report I am describing the hydro/meteorological events occurring in the United States between 1950 and 2011 that are a) most harmful with respect to population health and b) have the greatest economic consequences. The data source used in this analysis is from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data indicated that the most harmful event by far with respect to fatalities and injuries is a tornado. In terms of property and crop damage, flooding has the highest economic impact.
The data used in this analysis can be downloaded in bz2 format from here: Storm Data
It should be noted that there are numerous inconsistencies and overlap in some of the Event categories. For example, we have FLOOD FLASH, FLOOD FLOOD/FLASH, FLOOD/FLASH, Flood/Flash Flood, FLOOD/FLASH FLOOD, FLOOD/FLASH FLOODING, FLOOD/FLASH/FLOOD, FLOOD/FLASHFLOOD, FLOODING and finally, FLOODS. These are all listed as separate events yet clearly there is overlap. It would be beyond the scope of this report to determine whether any of these overlapping categories could be combined into a more clear delineation of separate events. These inconsistencies however do not significantly change the results presented below as many of the duplicate event categories do not contain large numbers that would otherwise skew the data as it is presented here.
I am making the assumption that you have the bz2 zip file in the working directory.
if (!exists("stormData")) {
stormData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", na.strings = "")
}
We check the dimensions of the dataset. There are 902297 rows and 37 columns.
dim(stormData)
## [1] 902297 37
I’m defining “population health” to be fatalities or injuries directly attributable to the hydro/meteorological event itself or impact by debris due to these events as described in section 2.6.1 of the Storm Data Documentation. This data is recorded in the FATALITIES and INJURIES columns in the data set.
library(dplyr)
library(ggplot2)
I’m going to take a subset of the data that includes the event, fatalities and injuries and calculate the sum of fatalities and injuries by event as well as calculating the total number of both fatalities and injuries for each event.
healthDataByEvent <- select(stormData,
EVTYPE,
FATALITIES,
INJURIES) %>%
group_by(EVTYPE) %>%
summarise( Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), total.carnage = sum(FATALITIES) + sum(INJURIES))
We’ll sort that new dataset from most to least damage with regards to the sum total of injuries and fatalities.
healthDataSorted <- healthDataByEvent[order(healthDataByEvent$total.carnage, decreasing = TRUE),]
This is what we have.
head(healthDataSorted, 10)
## # A tibble: 10 x 4
## EVTYPE Fatalities Injuries total.carnage
## <fct> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89.0 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
library(reshape2)
We’re going to drop the total carnage column as we were only using it to get top health damage events. We then melt the data in order to be able to plot both injuries and fatalities in a single bar in the bar plot. We have to resort after the melt.
healthDataSorted <- healthDataSorted[, -4]
healthMelted <- melt(head(healthDataSorted, 10), id.vars = "EVTYPE")
healthMelted <- healthMelted[order(-healthMelted$value),]
We’ll make a bar plot of the top 10 damaging weather events with regard to both injuries and fatalities.
healthPlot <- ggplot(data=healthMelted, aes(x=reorder(EVTYPE, value, sum), value, fill=variable)) +
geom_bar(stat="identity") +
coord_flip() +
labs(title="Injuries and Fatalities by Event",
subtitle="USA - 1950-2011",
y ="Total",
x = "Event Type") +
theme(legend.title=element_blank())
print(healthPlot)
The plot above clearly shows that tornados are by far the most damaging with regards to both injuries and fatalities with 96,979 in total.
I’m going to be using the variables EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP so let’s get a subset of the data with these variables.
economicCols <- c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
economicDmg <- stormData[economicCols]
It is necessary to use the values in the exponent (EXP) variables to translate the values in the DMG variables into actual dollar amounts.
The possible values in the exponent columns are as follows:
exponentValues <- c(as.character(economicDmg$PROPDMGEXP), as.character(economicDmg$CROPDMGEXP))
unique(exponentValues)
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8" "k"
I’m calculating the monetary value of damage by multiplying damage by 10 to the power of the exponent. Eg. PROPDMG * 10 ^ PROPDMGEXP
I’m making the assumptions that:
NA, +, - and ? will be changed to 0 in order that the value in the DMG column will be taken as is.
So here I am replacing the letters with their exponential equivalents and non-numeric characters with 0.
library(car)
economicDmg$PROPDMGEXP <- recode(economicDmg$PROPDMGEXP, "'K'=3;'k'=3;'H'=2;'h'=2; 'M'=6;'m'=6;'B'=9;c('+','-', '?',NA) = 0")
economicDmg$CROPDMGEXP <- recode(economicDmg$CROPDMGEXP, "'K'=3;'k'=3;'H'=2;'h'=2; 'M'=6;'m'=6;'B'=9;c('+','-', '?',NA) = 0")
Now I’ll calculate the dollar values and append these values to the dataset along with a column representing the total of both crop and property damage for each event.
economicDmg$property.damage.dollars <- economicDmg$PROPDMG * 10 ^ as.numeric(as.character(economicDmg$PROPDMGEXP))
economicDmg$crop.damage.dollars <- economicDmg$CROPDMG * 10 ^ as.numeric(as.character(economicDmg$CROPDMGEXP))
economicDmg$total.damage.amount <- economicDmg$property.damage.dollars + economicDmg$crop.damage.dollars
ecoDmgSorted <- economicDmg[order(-economicDmg$total.damage.amount),]
Now we’ll group the damage costs by event.
ecoDmgByEvent <- select(ecoDmgSorted,
EVTYPE,
property.damage.dollars,
crop.damage.dollars,
total.damage.amount) %>%
group_by(EVTYPE) %>%
summarise( Property = sum(property.damage.dollars), Crop = sum(crop.damage.dollars), Total = sum(property.damage.dollars) + sum(crop.damage.dollars))
ecoDmgSorted <- ecoDmgByEvent[order(-ecoDmgByEvent$Total),]
top10eco <- head(ecoDmgSorted, 10)
These are the top 10 most damaging events by cost in dollars (USD).
top10eco
## # A tibble: 10 x 4
## EVTYPE Property Crop Total
## <fct> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56947380676 414953270 57362333946
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15735267513 3025954473 18761221986
## 6 FLASH FLOOD 16822673978 1421317100 18243991078
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
I’m removing the Total column in order to melt the data for the plot.
top10eco <- top10eco[, -4]
ecoMelted <- melt(top10eco, id.vars = "EVTYPE")
ecoMelted <- ecoMelted[order(-ecoMelted$value),]
ecoMelted$value <- ecoMelted$value / 10^9
ecoPlot <- ggplot(data=ecoMelted, aes(x=reorder(EVTYPE, value / 10 ^ 9, sum), value, fill=variable)) +
geom_bar(stat="identity") +
coord_flip() +
labs(title="Economic Damage by Event",
subtitle="USA - 1950-2011",
y ="USD Cost (Billions)",
x = "Event Type") +
theme(legend.title=element_blank())
print(ecoPlot)
The plot above shows that flooding causes the most economic damage when both property and crop damage are considered. Approximately 150 Billion USD of damage was caused between the years 1950 and 2011. The data also indicates that drought causes higher damage to crops than the other most damaging weather events.