Kristin Abkemeier, October 2017
In this study, we queried the storm database compiled by the U.S. National Oceanic and Atmospheric Administration (NOAA) to determine which types of severe weather events had the greatest effect on population health and which had the greatest economic consequences in the United States (U.S.) between 1950 and 2011. Tornados caused by far the most injuries over the 61-year period of study, with a total of 91,468. Because they also caused the most fatalities (5,667 lives lost), tornados are clearly the most harmful weather event for population health, followed by thunderstorms, wind, heat, and floods. Tornados also rate highly in economic consequences, but floods caused the most property damage, an estimated total of more than US$200 billion (in 2017 dollars) from 1950 to 2011. Hurricanes, thunderstorms, and marine hazards (weather events at sea) rounded out the top five causes of property damage. Drought caused the most crop damage, but its cost over the 61-year period, US$19.4 billion, was only about one-tenth of the cost of property damage from floods. Floods, thunderstorms, hurricanes, and ice rounded out the top five weather hazards to crops. Overall, floods had the greatest economic consequences.
For this report, we read in the weather event data contained in the file repdata%2Fdata%2FStormData.csv.bz2 downloaded from the Coursera website for this assignment, which we can do with read.csv() without a separate unzipping step. Then, we immediately discard the rows of data that show no fatalities, injuries, or damage to property or crops, because these rows will not contribute to identifying the events that contribute the most to health effects (fatalities and injuries) or economic consequences (property damage and crop damage).
rawData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
rawDataDamage <- rawData[which((!is.na(rawData$FATALITIES) & rawData$FATALITIES > 0) |
(!is.na(rawData$INJURIES) & rawData$INJURIES > 0) |
(!is.na(rawData$PROPDMG) & rawData$PROPDMG > 0) |
(!is.na(rawData$CROPDMG) & rawData$CROPDMG > 0) ),]
Note that applying this filter removes over 70% of the data from our analysis!
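The filter and the fraction of rows it removes can be checked on a small example. The following is a minimal sketch on an invented four-row data frame (the name toy and all of its values are assumptions for illustration, not rows from the NOAA file):

```r
# Toy stand-in for rawData; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("HAIL", "TORNADO", "FLOOD", "HIGH WIND"),
  FATALITIES = c(0, 2, 0, NA),
  INJURIES   = c(0, 5, 0, 0),
  PROPDMG    = c(0, 25, 1.5, 0),
  CROPDMG    = c(0, 0, 0, 0)
)

# A row is kept if any of the four measures is recorded and positive.
# The is.na() guards turn a missing value into FALSE rather than NA.
hasImpact <- with(toy,
  (!is.na(FATALITIES) & FATALITIES > 0) |
  (!is.na(INJURIES)   & INJURIES   > 0) |
  (!is.na(PROPDMG)    & PROPDMG    > 0) |
  (!is.na(CROPDMG)    & CROPDMG    > 0))
toyDamage <- toy[hasImpact, ]

# Fraction of rows removed by the filter.
fractionRemoved <- 1 - nrow(toyDamage) / nrow(toy)
```

The same expression, 1 - nrow(rawDataDamage) / nrow(rawData), gives the fraction removed from the full data set.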
When we first read the data into the data frame rawData, we note that the column with the weather event types, EVTYPE, contains factors rather than character strings (this was the default behavior of read.csv() in versions of R before 4.0). This is convenient for scanning the entries using the levels() function in order to begin identifying categories into which we can group the individual observations. The following R command lists the 20 most frequent event types in rawData$EVTYPE, which suggests some categories to start with, such as HAIL, THUNDERSTORM, and WIND.
topEvents <- sort(table(rawData$EVTYPE),decreasing=TRUE)[1:20]
topEvents
##
##                     HAIL                TSTM WIND        THUNDERSTORM WIND
##                   288661                   219940                    82563
##                  TORNADO              FLASH FLOOD                    FLOOD
##                    60652                    54277                    25326
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING
##                    20843                    20212                    15754
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM
##                    15708                    11723                    11433
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND
##                     7026                     6839                     6175
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND
##                     5812                     3796                     3566
##     URBAN/SML STREAM FLD                 WILDFIRE
##                     3392                     2761
We then convert the factors in the EVTYPE column into strings because working with strings enables us to use grepl() to sort our data.
library(dplyr)
rawDataDamage %>% mutate_if(is.factor, as.character) -> charData
One thing that a full inspection of levels(rawData$EVTYPE) shows us is that there are many misspellings and abbreviations in the weather event type names. So, when we group the events into categories, we need to account for all of these variant spellings.
Also, let’s convert EVTYPE values to uppercase, which will make our string matching easier, because we will only need to worry about matching capital letters.
charData <- charData %>% mutate(EVTYPE = toupper(EVTYPE))
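One way to surface variant spellings is to list the distinct labels that contain a loose fragment of the event name. This is only a sketch on a handful of invented labels (the vector evtype and the fragment pattern are assumptions for illustration):

```r
# Invented sample of raw labels, including an abbreviation and a misspelling.
evtype <- toupper(c("Thunderstorm Wind", "TSTM WIND", "THUNERSTORM WINDS",
                    "Hail", "Tornado"))

# Distinct labels containing either fragment; value = TRUE returns the
# matching strings themselves rather than their positions.
variants <- sort(unique(grep("THU|TSTM", evtype, value = TRUE)))
variants
```

Scanning such a list by eye is how alternatives like THUNER or TSTM can be found and added to a category's pattern.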
We record the category memberships of each weather event in a data frame called categories, with one column per category. As the list of top event types showed, some events belong to multiple categories, e.g., THUNDERSTORM WIND. We also use sapply() with as.numeric to convert the logical values returned by grepl() into 1’s and 0’s, which will facilitate the multiplication and summing that we need later to calculate the health and economic consequences. The first column of categories must be initialized with a numeric vector of the appropriate length, so we use charData$COUNTY as a placeholder that is immediately overwritten.
categories <- data.frame(TORNADO = charData$COUNTY)
categories$TORNADO <- sapply(grepl("TORNADO|TORNDAO|SPOUT|FUNNEL|MICROBURST|MIRCOBURST|MICOBURST|ROTATING WALL|WALL CLOUD", charData$EVTYPE), as.numeric)
categories$HURRICANE <- sapply(grepl("HURRICANE|TYPHOON|TROPICAL", charData$EVTYPE), as.numeric)
categories$FLOOD <- sapply(grepl("FLOOD|FLOOOD|FLDG|FLD|RISING WATER|HIGH TIDE|HIGH WATER", charData$EVTYPE), as.numeric)
categories$DROUGHT <- sapply(grepl("DROUGHT", charData$EVTYPE), as.numeric)
categories$THUNDERSTORM <- sapply(grepl("TSTM|THUNDER|THUNER|THUNDES|THUDER|STORM", charData$EVTYPE), as.numeric)
categories$MARINE_HAZARD <- sapply(grepl("TSUNAMI|WAVE|SWELL|SURF|RIP CURRENT|STORM SURGE|SEICHE|HEAVY SEAS|HIGH SEAS|ROUGH SEAS|MARINE", charData$EVTYPE), as.numeric)
categories$LIGHTNING <- sapply(grepl("LIGHTNING|LIGHTING|LIGNTNING", charData$EVTYPE), as.numeric)
categories$RAIN <- sapply(grepl("RAIN|PRECIP|SHOWER", charData$EVTYPE), as.numeric)
categories$HAIL <- sapply(grepl("HAIL", charData$EVTYPE), as.numeric)
categories$WIND <- sapply(grepl("WIND|WND|W IND|BLOW|STORMW|WINS", charData$EVTYPE), as.numeric)
categories$HEAT <- sapply(grepl("WARM|RECORD TEMPERATURE|HEAT|HOT", charData$EVTYPE), as.numeric)
categories$BLIZZARD <- sapply(grepl("BLIZZARD", charData$EVTYPE), as.numeric)
categories$WINTER <- sapply(grepl("WINTER|WINTRY", charData$EVTYPE), as.numeric)
categories$FIRE <- sapply(grepl("FIRE", charData$EVTYPE), as.numeric)
categories$VOLCANO <- sapply(grepl("VOLCAN", charData$EVTYPE), as.numeric)
categories$SNOW <- sapply(grepl("SNOW", charData$EVTYPE), as.numeric)
categories$SLIDE <- sapply(grepl("SLIDE|SLUMP", charData$EVTYPE), as.numeric)
categories$ICE <- sapply(grepl("ICE|ICY|GLAZE|SLEET", charData$EVTYPE), as.numeric)
categories$COLD <- sapply(grepl("COLD|CHILL|HYPOTHERMIA|EXPOSURE|FREEZ|FROST|LOW TEMPERATURE", charData$EVTYPE), as.numeric)
categories$GUSTNADO <- sapply(grepl("GUSTNADO", charData$EVTYPE), as.numeric)
categories$DUST <- sapply(grepl("DUST", charData$EVTYPE), as.numeric)
categories$DAM <- sapply(grepl("DAM", charData$EVTYPE), as.numeric)
categories$AVALANCHE <- sapply(grepl("AVALANC", charData$EVTYPE), as.numeric)
categories$EROSION <- sapply(grepl("EROS", charData$EVTYPE), as.numeric)
categories$SMALLSTREAM <- sapply(grepl("SMALL STREAM", charData$EVTYPE), as.numeric)
categories$FOG <- sapply(grepl("FOG", charData$EVTYPE), as.numeric)
categories$WET <- sapply(grepl("WET|WETNESS|OTHER", charData$EVTYPE), as.numeric)
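The same indicator columns could also be built from a named vector of patterns, which keeps each pattern in one place. The following is a minimal sketch on four invented labels with shortened patterns (the names evtype, patterns, and toyCategories are hypothetical and not part of the analysis above):

```r
# Invented labels; the second one is a misspelling that appears in the patterns.
evtype <- c("THUNDERSTORM WIND", "TORNDAO", "FLASH FLOOD", "HAIL")

# One regular expression per category (shortened for illustration).
patterns <- c(
  TORNADO      = "TORNADO|TORNDAO",
  FLOOD        = "FLOOD|FLD",
  THUNDERSTORM = "TSTM|THUNDER",
  WIND         = "WIND|WND",
  HAIL         = "HAIL"
)

# sapply() returns one 0/1 column per named pattern.
toyCategories <- as.data.frame(
  sapply(patterns, function(p) as.numeric(grepl(p, evtype)))
)

# Number of categories each event belongs to.
rowSums(toyCategories)
```

The first label belongs to both THUNDERSTORM and WIND, so its row sums to 2.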
Most of the data fall into the categories above, but after several iterations of sorting most of it, we are left with some straggler rows that don’t fit elsewhere, and just to make sure that we’re not overlooking some hitherto-unidentified killer effect, we create an EVERYTHING_ELSE column in categories:
otherIndices <- grep("WIND|WND|W IND|BLOW|STORMW|WINS|TSTM|THUNDER|THUNER|THUNDES|THUDER|STORM|LIGHTNING|LIGHTING|LIGNTNING|RAIN|PRECIP|SHOWER|HAIL|TORNADO|TORNDAO|SPOUT|FUNNEL|MICROBURST|MIRCOBURST|MICOBURST|ROTATING WALL|WALL CLOUD|BLIZZARD|FLOOD|FLOOOD|FLDG|FLD|RISING WATER|HIGH TIDE|HIGH WATER|WINTER|WINTRY|FIRE|HURRICANE|TYPHOON|TROPICAL|VOLCAN|TSUNAMI|WAVE|SWELL|SURF|RIP CURRENT|SEICHE|MARINE|STORM SURGE|SNOW|SLEET|WARM|RECORD TEMPERATURE|HEAT|HOT|SLIDE|SLUMP|ICE|ICY|GLAZE|SLEET|COLD|CHILL|HYPOTHERMIA|EXPOSURE|FREEZ|FROST|LOW TEMPERATURE|GUSTNADO|DUST|DAM|AVALANC|EROS|SMALL STREAM|DROUGHT|FOG|SEICHE|HEAVY SEAS|HIGH SEAS|ROUGH SEAS|OTHER|WET|WETNESS",
charData$EVTYPE, invert=TRUE)
categories$EVERYTHING_ELSE <- rep(0,nrow(categories))
categories$EVERYTHING_ELSE[otherIndices] = 1
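Provided that every pattern in the long expression above also appears in one of the category columns, the leftover rows are simply those whose indicators are all zero. A sketch of that alternative on invented data (toyCategories is a hypothetical stand-in for categories):

```r
# Toy indicator columns: the third row matched no category.
toyCategories <- data.frame(
  TORNADO = c(1, 0, 0),
  WIND    = c(0, 1, 0)
)

# rowSums() is evaluated on the existing columns before the new one is added.
toyCategories$EVERYTHING_ELSE <- as.numeric(rowSums(toyCategories) == 0)
toyCategories
```

This avoids maintaining a second copy of all the patterns, at the cost of depending on the category columns being complete when the line runs.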
Fun fact: in the course of grouping the data and reading the charData$REMARKS column for some of the as-yet ungrouped observations, I learned that apparently someone reported a dead goldfish as a result of a storm.
First, we look at the health effects by summing the fatalities and injuries that happen for each category of severe weather event. Then we set up the data summed by category for both fatalities and injuries so that we can display it as a stacked bar plot.
totalFatalities <- sapply(charData$FATALITIES * categories, sum)
totalInjuries <- sapply(charData$INJURIES * categories, sum)
fatalitiesSorted <- sort(totalFatalities,decreasing=TRUE)[1:10]
library(knitr)
kable(fatalitiesSorted, caption = "Top severe weather events in terms of fatalities")
| TORNADO | 5667 |
| HEAT | 3178 |
| FLOOD | 1557 |
| WIND | 1453 |
| THUNDERSTORM | 1179 |
| MARINE_HAZARD | 1039 |
| LIGHTNING | 817 |
| COLD | 497 |
| WINTER | 279 |
| AVALANCHE | 225 |
injuriesSorted <- sort(totalInjuries,decreasing=TRUE)[1:10]
kable(injuriesSorted, caption = "Top severe weather events in terms of injuries")
| TORNADO | 91468 |
| THUNDERSTORM | 13758 |
| WIND | 11539 |
| HEAT | 9243 |
| FLOOD | 8683 |
| LIGHTNING | 5232 |
| ICE | 2413 |
| WINTER | 1968 |
| HURRICANE | 1716 |
| FIRE | 1608 |
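The per-category totals rely on multiplying a vector by a data frame: a vector with one entry per row is applied to every column in turn. A minimal sketch with invented numbers (fatalities and toyCategories are hypothetical stand-ins for charData$FATALITIES and categories):

```r
# Three invented events and their membership in two categories.
fatalities <- c(2, 0, 3)
toyCategories <- data.frame(
  THUNDERSTORM = c(1, 0, 1),
  WIND         = c(1, 1, 0)
)

# Each column is multiplied elementwise by the fatalities vector,
# then summed to give one total per category.
totals <- sapply(fatalities * toyCategories, sum)

# Equivalent matrix form.
totalsAlt <- colSums(fatalities * as.matrix(toyCategories))
totals
```

Because the first event belongs to both categories, its two fatalities are counted in each total; the category totals here add up to 7 even though only 5 fatalities occurred. Totals for overlapping categories therefore should not be added together.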
As we can see above, tornados have caused both the most fatalities (at 5,667) and injuries (at 91,468) over the 61-year stretch from 1950 to 2011. Excessive heat, flooding, wind, and thunderstorms follow in the numbers of fatalities, while the next-greatest causes of injuries are thunderstorms, wind, heat, and flooding. Based on these figures, and the plot below, we can see how these five types of severe weather events create the most harm for population health.
totalHealth <- data.frame(names(totalFatalities))
totalHealth$Fatalities <- totalFatalities
totalHealth$Injuries <- totalInjuries
totalHealth %>% mutate_if(is.factor, as.character) -> totalHealth
totalHealth <- rename(totalHealth, Events = names.totalFatalities.)
library(reshape2)
totHealth.m <- melt(totalHealth, id.vars = "Events")
library(ggplot2)
p <- ggplot(totHealth.m, aes(x = Events, y = value, fill=variable)) +
xlab("Types of weather events") +
ylab("Total number of injuries/fatalities") +
ggtitle("Effect of Weather Events on Population Health, from 1950 to 2011") +
geom_bar(stat='identity') + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
Total number of fatalities and injuries from various types of severe weather events in the U.S. from 1950 to 2011
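For readers unfamiliar with melt(), the stacked bar plot needs the data in long format: one row per combination of event type and measure. The sketch below builds that shape by hand in base R on invented numbers (totalHealthToy and longToy are hypothetical; the analysis itself uses reshape2::melt):

```r
# Wide format: one row per event type, one column per measure.
totalHealthToy <- data.frame(
  Events     = c("TORNADO", "HEAT"),
  Fatalities = c(10, 4),
  Injuries   = c(100, 7),
  stringsAsFactors = FALSE
)

# Long format: the measure name moves into a "variable" column and
# its number into a "value" column, as melt(id.vars = "Events") would do.
longToy <- data.frame(
  Events   = rep(totalHealthToy$Events, times = 2),
  variable = rep(c("Fatalities", "Injuries"), each = nrow(totalHealthToy)),
  value    = c(totalHealthToy$Fatalities, totalHealthToy$Injuries),
  stringsAsFactors = FALSE
)
longToy
```

In the plot, fill = variable then colors the two measures separately within each bar.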
Next, calculating the economic consequences of severe weather events is more complicated. While one human life in 1950 is the same as one human life in 2011, the purchasing power of a U.S. dollar fell roughly ten-fold over the same period; that is, prices rose by about a factor of ten. Thus, the dollar value of damage from a tornado in 1950 must be scaled up by an appropriate multiplier relative to damage from a weather event in 2011. Also, the property damage (PROPDMG) and crop damage (CROPDMG) columns contain only part of the data about the dollar amount of the damage. Each of these columns has a corresponding column (PROPDMGEXP and CROPDMGEXP, respectively) that contains a letter identifying the order of magnitude (“K” for 1,000, “M” for 1,000,000, and “B” for 1,000,000,000) by which the initial damage number must be multiplied. A blank value in these “exponent” columns means that the multiplier is one.
Visual inspection of the entries for other values that appeared in the PROPDMGEXP and CROPDMGEXP columns, including “?”, “+”, “-”, “H”, and individual numerical digits, suggested that an alternative data entry scheme was in place for part of 1994 and 1995, as well as that these events were small-scale. For the purpose of this analysis, we chose to ignore these couple of dozen events, because they did not look to have the magnitude to change the main result of our analysis.
To determine the economic consequences of property damage, we need to multiply PROPDMG by PROPDMGEXP (converted into a numeric value), and CROPDMG by CROPDMGEXP. Then we need to look at the dates of the severe weather events and multiply by an inflation factor to convert all values into 2017 U.S. dollar amounts.
charData <- charData %>% mutate(PROPDMGEXP = toupper(PROPDMGEXP))
charData <- charData %>% mutate(CROPDMGEXP = toupper(CROPDMGEXP))
charData$PROPDMG_MULT <- rep(1,nrow(charData))
charData$PROPDMG_MULT[TRUE == grepl("K", charData$PROPDMGEXP)] <- 1000
charData$PROPDMG_MULT[TRUE == grepl("M", charData$PROPDMGEXP)] <- 1000000
charData$PROPDMG_MULT[TRUE == grepl("B", charData$PROPDMGEXP)] <- 1000000000
charData$CROPDMG_MULT <- rep(1,nrow(charData))
charData$CROPDMG_MULT[TRUE == grepl("K", charData$CROPDMGEXP)] <- 1000
charData$CROPDMG_MULT[TRUE == grepl("M", charData$CROPDMGEXP)] <- 1000000
charData$CROPDMG_MULT[TRUE == grepl("B", charData$CROPDMGEXP)] <- 1000000000
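The same mapping from exponent letter to multiplier can be written as a lookup in a named vector. This is a sketch on invented codes (exp_codes and mult_lookup are hypothetical names); unlike the grepl() calls above it matches the whole code rather than a substring, which makes no difference for single-letter codes:

```r
# Invented sample of exponent codes, including blank and unrecognized ones.
exp_codes <- toupper(c("K", "m", "B", "", "?", "5"))

# Multiplier for each recognized letter.
mult_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Indexing by name yields NA for codes that are not in the lookup;
# those fall back to a multiplier of 1.
mult <- unname(mult_lookup[exp_codes])
mult[is.na(mult)] <- 1

# Worked example: a PROPDMG of 25 with code "K" is 25 * 1000 = 25,000 dollars.
25 * mult[1]
```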
Now we need the inflation multiplier based on year. A convenient package called blscrapeR handles this and is available from CRAN. We use the inflation_adjust() function with a start date of 1950 to calculate a series of inflation multipliers. We narrow down the data frame that it returns to just a pair of columns of year matched with multiplier.
library(blscrapeR)
df <- inflation_adjust(1950)
df <- df[,c(1,3)]
We need to get the year for each severe weather event so that we can get a correct inflation multiplier. So, we need to strip out everything from the BGN_DATE column in charData that is not the year, which we do here using regular expressions to identify the month and day text, and then the time text. Then, when we have our list of years only in charData$year, we can do a merge with the inflation adjustment data frame df that we just obtained above.
charData$PARTIALDATE <- sub("^\\d{1,2}/\\d{1,2}/", "", charData$BGN_DATE)
charData$year <- sub("[ ]\\d{1,2}:\\d{2}:\\d{2}", "", charData$PARTIALDATE)
inf_mult <- merge(charData, df)
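The two substitutions can be checked on sample strings. The sketch below assumes BGN_DATE values of the form month/day/year followed by a time, which is what the regular expressions above expect; the sample strings are invented. It also shows date parsing as an alternative way to obtain the year:

```r
# Invented sample values in the assumed BGN_DATE format.
bgn_date <- c("4/18/1950 0:00:00", "11/5/2011 0:00:00")

# Step 1: drop the leading month and day.
partial <- sub("^\\d{1,2}/\\d{1,2}/", "", bgn_date)
# Step 2: drop the trailing time, leaving only the year.
year <- sub("[ ]\\d{1,2}:\\d{2}:\\d{2}", "", partial)

# Alternative: parse the date (trailing characters beyond the format
# are ignored by as.Date) and format the year.
yearAlt <- format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y")
year
```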
To scale each dollar amount to 2017 values, we divide the 2017 adjustment value by the adjustment value for the year of the event. These are elementwise operations on vectors, so we build two vectors with one entry per row of charData, one filled with 1.0 and one filled with the 2017 adjustment value; we can then take the inverse of each event's adjustment value and multiply by the 2017 value:
inf_multiplier_2017 <- df[which(df$year == "2017"),2]
im2017 <- rep(inf_multiplier_2017$adj_value, nrow(charData))
one_vector <- rep(1.0,nrow(charData))
charData$INFLATION_MULTIPLIER <- one_vector / inf_mult$adj_value * im2017
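One point to watch: merge() sorts its result on the key column by default, so the rows of a merged data frame are not guaranteed to be in the same order as the rows of charData. A minimal sketch of an order-preserving lookup with match(), using invented adjustment values (df_toy, event_year, and the numbers are assumptions, not output of blscrapeR):

```r
# Invented year-to-adjustment table standing in for df.
df_toy <- data.frame(
  year      = c("1950", "2011", "2017"),
  adj_value = c(1.0, 9.3, 10.2),
  stringsAsFactors = FALSE
)

# Invented event years, deliberately not in sorted order.
event_year <- c("2011", "1950", "2011")

# match() returns, for each event, the row of df_toy with the same year,
# so the result stays aligned with the events.
adj_event <- df_toy$adj_value[match(event_year, df_toy$year)]
adj_2017  <- df_toy$adj_value[df_toy$year == "2017"]

# Multiplier that converts each event's dollars into 2017 dollars.
inflation_multiplier <- adj_2017 / adj_event
inflation_multiplier
```

With this form the multiplier for the second event (1950) is the full 2017 adjustment value, and the two 2011 events share the same smaller multiplier, regardless of how the events are ordered.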
Finally, we calculate the total property damage and total crop damage for each type of severe weather event between 1950 and 2011 that recorded damages. We perform manipulations similar to those for the health effects above so that we obtain a stacked bar plot for these results as well:
totalPropertyDmg <- sapply(charData$PROPDMG * charData$PROPDMG_MULT *
charData$INFLATION_MULTIPLIER * categories, sum)
totalCropDmg <- sapply(charData$CROPDMG * charData$CROPDMG_MULT *
charData$INFLATION_MULTIPLIER * categories, sum)
totalPropertyDmgSorted <- sort(totalPropertyDmg,decreasing=TRUE)[1:10]
kable(totalPropertyDmgSorted, caption="Top severe weather events in terms of property damage (USD)")
| FLOOD | 209947034901 |
| TORNADO | 198266556406 |
| HURRICANE | 121915797309 |
| THUNDERSTORM | 104604070300 |
| MARINE_HAZARD | 59945944101 |
| WIND | 24476990911 |
| HAIL | 23619030230 |
| FIRE | 11431359277 |
| WINTER | 10732521704 |
| ICE | 5475810232 |
totalCropDmgSorted <- sort(totalCropDmg,decreasing=TRUE)[1:10]
kable(totalCropDmgSorted, caption="Top severe weather events in terms of crop damage (USD)")
| DROUGHT | 19363776842 |
| FLOOD | 18194998666 |
| THUNDERSTORM | 11032959049 |
| HURRICANE | 8474038960 |
| ICE | 8287671449 |
| COLD | 4874020760 |
| HAIL | 4293925625 |
| WIND | 2993779919 |
| RAIN | 1310038366 |
| HEAT | 1293926633 |
Notice that the top causes of property and crop damage vary from the events that had the worst population health impacts. Tornados cause the second-highest amount of property damage in dollar amounts after flooding, and thunderstorms are in fourth here, so two of the top five severe weather events that affect population health also cause considerable property damage. But flooding causes the greatest amount of property damage, and hurricanes and marine hazards (severe weather that happens on the seas) also do a lot of damage, even though the harm to people is relatively less (or has been mitigated).
What is also significant is how the top events for crop damage differ from those for property damage. Both flooding and hurricanes show significant impact on crops, but the other three of the top five severe weather events that harm crops are drought, thunderstorms, and ice. Also, the total financial effect of crop damage is considerably less than that of property damage, which is illustrated very clearly in the stacked bar plot below. Property damage from hurricanes alone (about US$122 billion) is more than three times the combined crop damage from drought and floods (about US$38 billion).
totalEconDmg <- data.frame(names(totalPropertyDmg))
totalEconDmg$Crop <- totalCropDmg / 1000000000
totalEconDmg$Property <- totalPropertyDmg / 1000000000
totalEconDmg %>% mutate_if(is.factor, as.character) -> totalEconDmg
totalEconDmg <- rename(totalEconDmg, Events = names.totalPropertyDmg.)
totEcon.m <- melt(totalEconDmg, id.vars = "Events")
p <- ggplot(totEcon.m, aes(x = Events, y = value, fill=variable)) +
xlab("Types of weather events") +
ylab("Total damage cost, in billion USD (2017 value)") +
ggtitle("Economic Consequences of Weather Events, from 1950 to 2011") +
geom_bar(stat='identity') + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
Total costs of damage to property and to crops from various types of severe weather events in the U.S. from 1950 to 2011. All amounts have been converted into U.S. dollar values as of 2017.