Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The goal of this analysis is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and to analyze the impact of different types of severe weather events in the U.S. with respect to population health and economic consequences.
By far the most harmful severe weather events for public health across the U.S. are tornadoes, with 5,633 fatalities and 91,346 injuries. Floods, droughts, and severe wind events account for most of the economic damage: floods alone caused roughly 145 billion U.S. dollars in property damage, and droughts roughly 14 billion U.S. dollars in crop damage. The analysis considers data from 1950 to 2011 across the U.S. More details can be found in the results section.
The data for this analysis come from the U.S. storm database and can be downloaded as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
There is also some documentation of the database available, which describes how some of the variables are constructed and defined.
After downloading the raw data, if it has not been stored locally already, we load it into the variable stormDataRaw. Since the data requires a lot of memory when read into the data frame (ca. 500 MB), this may take a while.
local_file <- "~/coursera/reproducible_research/repdata-data-StormData.csv.bz2"
source_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download the compressed file only if it is not available locally
if (!file.exists(local_file)) {
    download.file(source_url, local_file, method = "curl")
}
stormDataRaw <- read.csv(bzfile(local_file))
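Since only a handful of columns are used later on, the memory footprint could in principle be reduced while reading. The following is a minimal sketch of that idea, not the approach used in this report: it writes a tiny made-up CSV file (the file contents and the names `tmp`, `wanted` and `small` are purely illustrative) and uses the `colClasses` argument of `read.csv`, where the class `"NULL"` skips a column.

```r
# Hypothetical miniature CSV standing in for the storm data (not the real file)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
    "EVTYPE,FATALITIES,REMARKS",
    "TORNADO,3,long free-text remark",
    "HAIL,0,another remark"
), tmp)

# One class per column, in file order; "NULL" drops the column while reading
wanted <- c(EVTYPE = "character", FATALITIES = "numeric", REMARKS = "NULL")
small <- read.csv(tmp, colClasses = wanted)

str(small)
unlink(tmp)
```

For the real file, the same argument would be passed alongside `bzfile(local_file)`, with a `"NULL"` entry for every column that should be skipped.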
Considering the dimensions of the raw data (902,297 observations of 37 variables), it makes sense to restrict our analysis to the columns most important for understanding the economic and health consequences of severe weather.
dim(stormDataRaw)
## [1] 902297 37
After exploring the contents of the variables, we decided to keep the following 10 columns for the analysis:
| Columns | Description |
|---|---|
| BGN_DATE | Start date of event |
| END_DATE | End date of event |
| STATE | State where event occurred |
| EVTYPE | Event type |
| FATALITIES | Total number of fatalities |
| INJURIES | Total number of injuries |
| PROPDMG | Estimated property damage with unspecified units |
| PROPDMGEXP | Exponential multiplier for PROPDMG to obtain correct number in US dollars |
| CROPDMG | Estimated agricultural damage with unspecified units |
| CROPDMGEXP | Exponential multiplier for CROPDMG to obtain correct number in US dollars |
attach(stormDataRaw)
Furthermore, to understand the economic and health consequences, we restrict our data to rows which have values larger than zero in the columns FATALITIES, INJURIES, PROPDMG or CROPDMG.
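# Keep only rows with a health or economic impact and the 10 selected columns
keepCols <- c("BGN_DATE", "END_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
hasImpact <- with(stormDataRaw, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
stormDataSub <- stormDataRaw[hasImpact, keepCols]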
As a side effect of our subsetting, we have cleaned the dataset such that there are no NAs.
sum(is.na(stormDataSub))
## [1] 0
However, we still have to merge PROPDMG with PROPDMGEXP and CROPDMG with CROPDMGEXP to obtain an actual clean dataset. According to the storm database documentation (page 12), alphabetical characters used to signify the magnitude of damage include “K” for thousands, “M” for millions, and “B” for billions. For example, 1.55B would mean $1,550,000,000 with this convention. Inspecting the levels of PROPDMGEXP shows that this convention was not kept throughout the dataset.
table(stormDataSub$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
##  11585      1      0      5    210      0      1      1      4     18
##      6      7      8      B      h      H      K      m      M
##      3      3      0     40      1      6 231428      7  11320
Since we have to deal with the extra characters, we decided to apply the following steps:
In a first step we define the characters and multipliers and introduce two new columns PropDamage and CropDamage for storing the results.
# Multipliers for damage calculation: "+", "-" and "?" count as 0,
# an empty exponent leaves the value unchanged
characters <- c("+", "-", "?", "", "H", "K", "M", "B")
multipliers <- c(0, 0, 0, 1, 100, 10^3, 10^6, 10^9)
detach(stormDataRaw)
attach(stormDataSub)
stormDataSub$PropDamage <- PROPDMG
stormDataSub$CropDamage <- CROPDMG
Apply multipliers for characters
# Apply multipliers for characters
for (i in seq_along(characters)) {
    rowFilter <- which(toupper(PROPDMGEXP) == characters[i])
    stormDataSub$PropDamage[rowFilter] <- PROPDMG[rowFilter] * multipliers[i]
    rowFilter <- which(toupper(CROPDMGEXP) == characters[i])
    stormDataSub$CropDamage[rowFilter] <- CROPDMG[rowFilter] * multipliers[i]
}
Apply multipliers for numbers
# Apply multipliers for numbers
for (i in 0:9) {
    rowFilter <- which(PROPDMGEXP == i)
    stormDataSub$PropDamage[rowFilter] <- PROPDMG[rowFilter] * 10^i
    rowFilter <- which(CROPDMGEXP == i)
    stormDataSub$CropDamage[rowFilter] <- CROPDMG[rowFilter] * 10^i
}
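The same conversion rules can also be written as a single vectorized lookup, which makes them easy to test on a few values. This is only a sketch under the assumptions made above ("+", "-" and "?" count as 0, digits as powers of ten, empty or unknown codes leave the value unchanged); the helper name `exp_to_multiplier` and the sample values are hypothetical and not part of the analysis.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers
exp_to_multiplier <- function(code) {
    code <- toupper(as.character(code))
    lookup <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
                "+" = 0, "-" = 0, "?" = 0)
    mult <- unname(lookup[code])               # NA for codes not in the lookup
    isDigit <- code %in% as.character(0:9)
    mult[isDigit] <- 10^as.numeric(code[isDigit])
    mult[is.na(mult)] <- 1                     # empty or unknown code: keep value
    mult
}

# Made-up sample values, including the 1.55B example from the documentation
damage <- c(1.55, 25, 2.5, 3, 10, 7)
code   <- c("B", "K", "m", "", "2", "?")
result <- damage * exp_to_multiplier(code)
result
```

On the sample, 1.55 with code "B" becomes 1,550,000,000 and 7 with code "?" becomes 0, matching the convention described above.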
Now we can exclude the columns that we used for calculating PropDamage and CropDamage.
stormData <- stormDataSub[c(1:6,11,12)]
The storm database documentation considers only 48 types of severe weather events from 1996 on, but this dataset contains 985 levels for the variable EVTYPE. Exploring the dataset, it becomes obvious that most of the additional levels appear due to inconsistent naming conventions and typos.
str(stormData$EVTYPE)
## Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
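Rather than spending too much time on converting every event name consistently, we apply a few simple steps to reduce the number of distinct event types: we convert all names to upper case, replace the abbreviation TSTM with THUNDERSTORM, collapse all thunderstorm variants into a single THUNDERSTORM type, and strip a trailing S to merge plural forms.
These steps do not take care of all discrepancies but clean up the majority of events. Additionally, we drop all factor levels that are unused due to subsetting.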
# Clean unused levels
stormData2 <- droplevels(stormData)
# Clean up thunderstorm events
stormData2$EVTYPE <- toupper(stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("TSTM", "THUNDERSTORM", stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("THUNDERSTORM.*", "THUNDERSTORM", stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("S$", "", stormData2$EVTYPE)
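To see what these substitutions do, they can be applied to a few hand-picked names. The sample vector below is purely illustrative and not an extract of the dataset.

```r
# Hand-picked sample names (illustrative only)
sampleTypes <- c("TSTM WIND", "Thunderstorm Winds", "THUNDERSTORM WIND/HAIL",
                 "FLOODS", "HIGH SEAS")

# Same sequence of steps as in the cleaning code above
cleaned <- toupper(sampleTypes)
cleaned <- gsub("TSTM", "THUNDERSTORM", cleaned)
cleaned <- gsub("THUNDERSTORM.*", "THUNDERSTORM", cleaned)
cleaned <- gsub("S$", "", cleaned)

data.frame(before = sampleTypes, after = cleaned)
```

The three thunderstorm variants collapse into THUNDERSTORM and FLOODS becomes FLOOD. The last entry shows a side effect of the trailing-S rule: it also shortens names that merely end in S.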
The important variables for population health are INJURIES and FATALITIES. Aggregating the clean dataset with respect to the weather event types and summing over INJURIES and FATALITIES, we obtain a table with health impacts. Sorted by fatalities and/or injuries, we can plot the results for the top 20 most harmful severe weather events. Note that we applied a logarithmic scaling for the y-axis.
healthData <- aggregate(cbind(FATALITIES,INJURIES) ~ EVTYPE, data=stormData2, sum)
fatalities <- head(healthData[order(healthData$FATALITIES, decreasing = TRUE),], n=20)
injuries <- head(healthData[order(healthData$INJURIES, decreasing = TRUE),], n=20)
par(mfrow = c(1,2), mar = c(12, 4, 6, 2), cex.axis = 0.7)
barplot(fatalities$FATALITIES, names.arg = fatalities$EVTYPE, col = "blue", las = 2, ylab = "Number of total victims", log = "y")
barplot(injuries$INJURIES, names.arg = injuries$EVTYPE, col = "lightblue", las = 2, ylab = "", log = "y")
mtext("20 most harmful weather events for public health", side = 3, line = -2, outer = TRUE)
legend("topright", legend = c("Fatalities", "Injuries"), fill = c("blue", "lightblue"))
head(fatalities)
## EVTYPE FATALITIES INJURIES
## 313 TORNADO 5633 91346
## 50 EXCESSIVE HEAT 1903 6525
## 62 FLASH FLOOD 980 1777
## 126 HEAT 937 2100
## 221 LIGHTNING 816 5230
## 307 THUNDERSTORM 710 9508
head(injuries)
## EVTYPE FATALITIES INJURIES
## 313 TORNADO 5633 91346
## 307 THUNDERSTORM 710 9508
## 74 FLOOD 470 6789
## 50 EXCESSIVE HEAT 1903 6525
## 221 LIGHTNING 816 5230
## 126 HEAT 937 2100
By far the most harmful severe weather events for public health are tornadoes, with 5,633 fatalities and 91,346 injuries. In terms of fatalities, they are followed by excessive heat (1,903) and flash floods (980). Thunderstorms cause the second most injuries (9,508), followed by floods (6,789).
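The aggregation and ranking pattern used for the health data can be tried out on a toy data frame. The values below are made up for illustration and the names `toy`, `toyHealth` and `toyRanked` are hypothetical.

```r
# Made-up toy data with the same column names as the cleaned dataset
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
    FATALITIES = c(2, 1, 3, 4, 0),
    INJURIES   = c(10, 0, 5, 1, 7)
)

# Sum both health variables per event type, then rank by fatalities
toyHealth <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
toyRanked <- toyHealth[order(toyHealth$FATALITIES, decreasing = TRUE), ]
toyRanked
```

Ordering by `INJURIES` instead gives the second ranking; `head(..., n = 20)` then limits either table to the top entries.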
The important variables for economic consequences are PropDamage and CropDamage. Aggregating the clean dataset with respect to the weather event types and summing over PropDamage and CropDamage, we obtain a table with economic impacts. Sorted by property and agricultural damage, we can plot the results for the top 20 most harmful severe weather events. Note that we applied a logarithmic scaling for the y-axis and divided all values by 10^9 to provide the damage in billions of U.S. dollars.
economicData <- aggregate(cbind(PropDamage, CropDamage) ~ EVTYPE, data=stormData2, sum)
property <- head(economicData[order(economicData$PropDamage, decreasing = TRUE),], n=20)
agricultural <- head(economicData[order(economicData$CropDamage, decreasing = TRUE),], n=20)
par(mfrow = c(1,2), mar = c(12, 4, 6, 2), cex.axis = 0.7)
barplot(property$PropDamage/10^9, names.arg = property$EVTYPE, col = "red", las = 2, ylab = "Total damage in billion U.S. dollars", log = "y")
barplot(agricultural$CropDamage/10^9, names.arg = agricultural$EVTYPE, col = "green", las = 2, ylab = "", log = "y")
mtext("20 most harmful weather events for economy", side = 3, line = -2, outer = TRUE)
legend("topright", legend = c("Property", "Agricultural"), fill = c("red", "green"))
head(property)
## EVTYPE PropDamage CropDamage
## 74 FLOOD 1.447e+11 5.662e+09
## 194 HURRICANE/TYPHOON 6.931e+10 2.608e+09
## 313 TORNADO 5.695e+10 4.150e+08
## 300 STORM SURGE 4.332e+10 5.000e+03
## 62 FLASH FLOOD 1.683e+10 1.421e+09
## 110 HAIL 1.574e+10 3.026e+09
head(agricultural)
## EVTYPE PropDamage CropDamage
## 39 DROUGHT 1.046e+09 1.397e+10
## 74 FLOOD 1.447e+11 5.662e+09
## 266 RIVER FLOOD 5.119e+09 5.029e+09
## 207 ICE STORM 3.945e+09 5.022e+09
## 110 HAIL 1.574e+10 3.026e+09
## 185 HURRICANE 1.187e+10 2.742e+09
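Different events turn out to be most harmful for property and for agricultural damage. Floods account for by far the most property damage (about 145 billion U.S. dollars), followed by hurricanes/typhoons (about 69 billion), tornadoes (about 57 billion), and storm surges (about 43 billion). Agricultural damage is driven by either an abundance of water or a lack of it: droughts (about 14 billion U.S. dollars), floods (about 5.7 billion), and river floods (about 5.0 billion) together account for approximately 25 billion U.S. dollars.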
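The orders of magnitude can be checked directly against the printed tables. The short sketch below re-enters the rounded values from the output of head(property) and head(agricultural) by hand and converts them to billions of U.S. dollars; the vector names are hypothetical.

```r
# Rounded values copied by hand from the printed tables (U.S. dollars)
propertyTop <- c(FLOOD = 1.447e11, "HURRICANE/TYPHOON" = 6.931e10,
                 TORNADO = 5.695e10, "STORM SURGE" = 4.332e10)
cropWater   <- c(DROUGHT = 1.397e10, FLOOD = 5.662e9, "RIVER FLOOD" = 5.029e9)

# Property damage per event type, in billions
propertyBillions <- round(propertyTop / 1e9, 1)
propertyBillions

# Crop damage from drought and the two flood categories combined, in billions
cropWaterBillions <- sum(cropWater) / 1e9
cropWaterBillions
```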