Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The goal of this analysis is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and to analyze the impact of different types of severe weather events in the U.S. with respect to population health and economic consequences.

By far the most harmful severe weather events for public health across the U.S. are tornadoes, with 5,633 fatalities and 91,346 injuries. Flood, drought and severe wind events account for most of the economic damage: floods alone caused about 145 billion U.S. dollars in property damage, and droughts about 14 billion U.S. dollars in crop damage. The analysis covers data from 1950 to 2011 across the U.S. More details can be found in the Results section.

Data

The data for this analysis come from the U.S. storm database and can be downloaded as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The download URL is given in the code under Data Processing.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
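The uneven coverage over time can be checked by counting events per year. The following is a minimal sketch on a handful of made-up dates; the vector bgn_date is a hypothetical stand-in for the BGN_DATE column, and the "month/day/year time" layout of the strings is an assumption about how the dates are stored.

```r
# Toy stand-in for the BGN_DATE column (assumed "m/d/Y H:M:S" layout).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "5/1/1996 0:00:00", "7/4/1996 0:00:00", "11/28/2011 0:00:00")

# Parse only the date part; the trailing time is ignored by the format.
event_year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year.
events_per_year <- table(event_year)
print(events_per_year)
```

Applied to the full dataset, a table like this (or a bar plot of it) would show how the number of recorded events changes over the years.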

References

There is also some documentation of the database available. Details of how some of the variables are constructed/defined can be found here:

Data Processing

Downloading and reading the raw data

After downloading the raw data (unless it is already stored locally), we load it into the variable stormDataRaw. Because the resulting data frame occupies roughly 500 MB of memory, this may take a while.

local_file <- "~/coursera/reproducible_research/repdata-data-StormData.csv.bz2"
source_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists(local_file)) {
      download.file(source_url,local_file, method = "curl")
}

stormDataRaw <- read.csv(bzfile(local_file))

Considering the dimensions of the raw data (902297 observations of 37 variables), it makes sense to restrict our analysis to the columns that matter most for understanding the economic and health consequences of severe weather.

dim(stormDataRaw)      
## [1] 902297     37

Selecting columns and observations

After exploring the contents of the variables, we decided to keep the following 10 columns for the analysis:

Column      Description
BGN_DATE    Start date of event
END_DATE    End date of event
STATE       State where event occurred
EVTYPE      Event type
FATALITIES  Total number of fatalities
INJURIES    Total number of injuries
PROPDMG     Estimated property damage with unspecified units
PROPDMGEXP  Exponential multiplier for PROPDMG to obtain correct number in US dollars
CROPDMG     Estimated agricultural damage with unspecified units
CROPDMGEXP  Exponential multiplier for CROPDMG to obtain correct number in US dollars

attach(stormDataRaw)

Furthermore, to understand the economic and health consequences, we restrict our data to rows which have values larger than zero in the columns FATALITIES, INJURIES, PROPDMG or CROPDMG.

stormDataSub <- stormDataRaw[FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0, c("BGN_DATE", "END_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

As a side effect of our subsetting, we have cleaned the dataset such that there are no NAs.

sum(is.na(stormDataSub))
## [1] 0
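The row and column selection above can also be written without attaching the data frame to the search path. The following is only a sketch on a hypothetical miniature data frame (toy_raw and its OTHER column are invented for illustration); it uses with() so that the column names are resolved inside the data frame.

```r
# Hypothetical miniature of stormDataRaw with a few of the same column names.
toy_raw <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(3, 0, 0, 0),
  PROPDMG    = c(25, 0, 2.5, 0),
  CROPDMG    = c(0, 0, 1, 0),
  OTHER      = c("a", "b", "c", "d")  # stands in for the columns we drop
)

keep_cols <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows with any health or economic impact.
has_impact <- with(toy_raw,
                   FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)

# Keep only impactful rows and the columns of interest.
toy_sub <- toy_raw[has_impact, keep_cols]
print(toy_sub)
```

Avoiding attach() means later changes to the data frame cannot silently diverge from an attached copy, at the cost of slightly longer expressions.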

Calculating the estimated damage

However, we still have to combine PROPDMG with PROPDMGEXP and CROPDMG with CROPDMGEXP to obtain actual dollar amounts. According to the storm database documentation (page 12), alphabetical characters used to signify the magnitude of damage include “K” for thousands, “M” for millions, and “B” for billions. For example, 1.55B means $1,550,000,000 under this convention. Inspecting the levels of PROPDMGEXP shows that this convention was not followed throughout the dataset.

table(stormDataSub$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
##  11585      1      0      5    210      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      3      0     40      1      6 231428      7  11320

Since we have to deal with the extra characters, we decided to apply the following steps:

  • observations with either “+”, “-” or “?” are zero multipliers
  • treat lower and upper case letters the same
  • no entry is treated as a 1 multiplier
  • include numbers 0 to 9 as powers of 10 multipliers

In a first step we define the characters and multipliers and introduce two new columns PropDamage and CropDamage for storing the results.

# Multiplier for damage calculation
# ("" stands for an empty exponent field, which is treated as a 1 multiplier)

characters <- c("+", "-", "?", "", "H", "K", "M", "B")
multipliers <- c(0, 0, 0, 1, 100, 10^3, 10^6, 10^9)

detach(stormDataRaw)
attach(stormDataSub)

stormDataSub$PropDamage <- PROPDMG
stormDataSub$CropDamage <- CROPDMG

Apply multipliers for characters

# Apply multipliers for characters

for (i in 1:length(characters)) {
      rowFilter <- which(toupper(PROPDMGEXP) == characters[i])
      stormDataSub$PropDamage[rowFilter] <- PROPDMG[rowFilter] * multipliers[i]
      rowFilter <- which(toupper(CROPDMGEXP) == characters[i])
      stormDataSub$CropDamage[rowFilter] <- CROPDMG[rowFilter] * multipliers[i]
}

Apply multipliers for numbers

# Apply multipliers for numbers

for (i in 0:9) {
      rowFilter <- which(PROPDMGEXP == i)
      stormDataSub$PropDamage[rowFilter] <- PROPDMG[rowFilter] * 10^i
      rowFilter <- which(CROPDMGEXP == i)
      stormDataSub$CropDamage[rowFilter] <- CROPDMG[rowFilter] * 10^i
}
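The two loops above can also be expressed as a single vectorized lookup. The following is a minimal sketch under the rules listed earlier; exp_to_multiplier is a hypothetical helper name, and the sample values are made up. Codes that are empty or not covered by the rules get a multiplier of 1, which mirrors the loops leaving such rows at their raw value.

```r
# Hypothetical helper: map an exponent code to its numeric multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))   # treat lower and upper case alike
  lookup <- c("+" = 0, "-" = 0, "?" = 0,
              "H" = 100, "K" = 1e3, "M" = 1e6, "B" = 1e9)

  mult <- rep(1, length(code))          # default: empty or unknown code
  known <- code %in% names(lookup)
  mult[known] <- lookup[code[known]]    # letters and special characters

  is_digit <- code %in% as.character(0:9)
  mult[is_digit] <- 10^as.numeric(code[is_digit])  # digits are powers of 10
  mult
}

# Made-up sample values and exponent codes.
damage <- c(1.55, 25, 2, 10, 3, 7)
code   <- c("B", "K", "m", "", "5", "?")

estimated <- damage * exp_to_multiplier(code)
print(estimated)
```

The first element reproduces the documentation example: 1.55 with code “B” becomes 1,550,000,000.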

Now we can exclude the columns that we used for calculating PropDamage and CropDamage.

stormData <- stormDataSub[c(1:6,11,12)]

Cleaning the severe weather event types

The storm database documentation considers only 48 types of severe weather events from 1996 on, but this dataset contains 985 levels for the variable EVTYPE. Exploring the dataset, it becomes obvious that most of the additional levels appear due to inconsistent naming conventions and typos.

str(stormData$EVTYPE)
##  Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...

Rather than normalizing every event name by hand, we apply the following steps to reduce the number of distinct severe weather event types.

  • convert all names to upper case
  • exchange TSTM with THUNDERSTORM
  • merge thunderstorm events
  • strip a trailing “S” so that e.g. WINDS and WIND become one event

These steps do not take care of all discrepancies but clean up the majority of events. Additionally, we drop all factor levels that are no longer used after subsetting.

# Clean unused levels
stormData2 <- droplevels(stormData)  

# Clean up thunderstorm events
stormData2$EVTYPE <- toupper(stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("TSTM", "THUNDERSTORM", stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("THUNDERSTORM.*", "THUNDERSTORM", stormData2$EVTYPE)
stormData2$EVTYPE <- gsub("S$", "", stormData2$EVTYPE)
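To see what these substitutions do, the same four steps can be run on a short vector of event names. This is an illustrative sketch only; the names in raw_types are chosen to resemble typical spelling variants and are not a sample drawn from the dataset.

```r
# Illustrative event names with inconsistent spelling.
raw_types <- c("TSTM WIND", "Thunderstorm Winds", "HIGH WINDS",
               "High Wind", "FLOODS")

clean_types <- toupper(raw_types)                                   # upper case
clean_types <- gsub("TSTM", "THUNDERSTORM", clean_types)            # expand abbreviation
clean_types <- gsub("THUNDERSTORM.*", "THUNDERSTORM", clean_types)  # merge thunderstorm events
clean_types <- gsub("S$", "", clean_types)                          # strip trailing "S"

print(unique(clean_types))
```

The five input names collapse to three event types. Note that the third substitution also discards any qualifier after THUNDERSTORM, so thunderstorm wind and other thunderstorm-related entries are deliberately merged into one category.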

Results

1. Most harmful types of severe weather events with respect to population health across the U.S.

The important variables for population health are INJURIES and FATALITIES. Aggregating the clean dataset with respect to the weather event types and summing over INJURIES and FATALITIES, we obtain a table with health impacts. Sorted by fatalities and/or injuries, we can plot the results for the top 20 most harmful severe weather events. Note that we applied a logarithmic scaling for the y-axis.

healthData <- aggregate(cbind(FATALITIES,INJURIES) ~ EVTYPE, data=stormData2, sum)
fatalities <- head(healthData[order(healthData$FATALITIES, decreasing = TRUE),], n=20)
injuries <- head(healthData[order(healthData$INJURIES, decreasing = TRUE),], n=20)
par(mfrow = c(1,2),  mar = c(12, 4, 6, 2), cex.axis = 0.7)
barplot(fatalities$FATALITIES, names.arg = fatalities$EVTYPE, col = "blue", las = 2, ylab = "Number of total victims", log = "y")
barplot(injuries$INJURIES, names.arg = injuries$EVTYPE, col = "lightblue", las = 2, ylab = "", log = "y")
mtext("20 most harmful weather events for public health", side = 3, line = -2, outer =TRUE)
legend("topright", legend = c("Fatalities", "Injuries"), fill = c("blue", "lightblue"))

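[Figure: bar plots of the 20 most harmful weather events for public health, fatalities (left) and injuries (right), logarithmic y-axis]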

head(fatalities)
##             EVTYPE FATALITIES INJURIES
## 313        TORNADO       5633    91346
## 50  EXCESSIVE HEAT       1903     6525
## 62     FLASH FLOOD        980     1777
## 126           HEAT        937     2100
## 221      LIGHTNING        816     5230
## 307   THUNDERSTORM        710     9508
head(injuries)
##             EVTYPE FATALITIES INJURIES
## 313        TORNADO       5633    91346
## 307   THUNDERSTORM        710     9508
## 74           FLOOD        470     6789
## 50  EXCESSIVE HEAT       1903     6525
## 221      LIGHTNING        816     5230
## 126           HEAT        937     2100
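The aggregate-then-rank pattern used above can be tried on a small example. The sketch below uses a hypothetical toy data frame with invented counts, so the numbers have nothing to do with the real results.

```r
# Hypothetical toy data: several records per event type.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 5, 1, 0, 1),
  INJURIES   = c(10, 0, 30, 4, 2)
)

# Sum both health variables per event type.
toy_health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

# Rank by fatalities and keep the top two event types.
top_fatal <- head(toy_health[order(toy_health$FATALITIES, decreasing = TRUE), ],
                  n = 2)
print(top_fatal)
```

Because the ranking is done separately per variable, an event type can appear at different positions in the fatalities and injuries tables, as THUNDERSTORM does in the output above.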

Summary of most harmful events for public health

By far the most harmful severe weather events for public health are tornadoes, with 5,633 fatalities and 91,346 injuries. For fatalities, they are followed by excessive heat (1,903) and flash floods (980). Thunderstorms cause the second most injuries (9,508), followed by floods (6,789).

2. Most harmful types of severe weather events with respect to economic consequences across the U.S.

The important variables for economic consequences are PropDamage and CropDamage. Aggregating the clean dataset with respect to the weather event types and summing over PropDamage and CropDamage, we obtain a table with economic impacts. Sorted by property and agricultural damage, we can plot the results for the top 20 most harmful severe weather events. Note that we applied a logarithmic scaling for the y-axis and divided all values by 10^9 to provide the damage in billions of U.S. dollars.

economicData <- aggregate(cbind(PropDamage, CropDamage) ~ EVTYPE, data=stormData2, sum)
property <- head(economicData[order(economicData$PropDamage, decreasing = TRUE),], n=20)
agricultural <- head(economicData[order(economicData$CropDamage, decreasing = TRUE),], n=20)
par(mfrow = c(1,2),  mar = c(12, 4, 6, 2), cex.axis = 0.7)
barplot(property$PropDamage/10^9, names.arg = property$EVTYPE, col = "red", las = 2, ylab = "Total damage in billion U.S. dollars", log = "y")
barplot(agricultural$CropDamage/10^9, names.arg = agricultural$EVTYPE, col = "green", las = 2, ylab = "", log = "y")
mtext("20 most harmful weather events for economy", side = 3, line = -2, outer =TRUE)
legend("topright", legend = c("Property", "Agricultural"), fill = c("red", "green"))

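[Figure: bar plots of the 20 most harmful weather events for the economy, property damage (left) and agricultural damage (right), in billion U.S. dollars, logarithmic y-axis]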

head(property)
##                EVTYPE PropDamage CropDamage
## 74              FLOOD  1.447e+11  5.662e+09
## 194 HURRICANE/TYPHOON  6.931e+10  2.608e+09
## 313           TORNADO  5.695e+10  4.150e+08
## 300       STORM SURGE  4.332e+10  5.000e+03
## 62        FLASH FLOOD  1.683e+10  1.421e+09
## 110              HAIL  1.574e+10  3.026e+09
head(agricultural)
##          EVTYPE PropDamage CropDamage
## 39      DROUGHT  1.046e+09  1.397e+10
## 74        FLOOD  1.447e+11  5.662e+09
## 266 RIVER FLOOD  5.119e+09  5.029e+09
## 207   ICE STORM  3.945e+09  5.022e+09
## 110        HAIL  1.574e+10  3.026e+09
## 185   HURRICANE  1.187e+10  2.742e+09

Summary of most harmful events for economy

Different events turn out to be most harmful for property and for agricultural damage. Floods (about 145 billion U.S. dollars) and severe wind events such as hurricanes/typhoons (about 69 billion) and tornadoes (about 57 billion) account for most of the property damage. For agricultural damage, either an abundance of water or the lack of it dominates: droughts (about 14 billion U.S. dollars) and floods (about 5.7 billion) together account for approximately 20 billion U.S. dollars.
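As a quick sanity check, the rounded values printed in the tables above can be converted to billions of U.S. dollars and summed. The sketch below copies a few of those values by hand; the variable names are chosen for illustration, and because the inputs are rounded to four significant digits the sums are approximate.

```r
# Rounded values copied from the head(property) and head(agricultural) output.
prop_damage <- c("FLOOD" = 1.447e11, "HURRICANE/TYPHOON" = 6.931e10,
                 "TORNADO" = 5.695e10)
crop_damage <- c("DROUGHT" = 1.397e10, "FLOOD" = 5.662e9,
                 "RIVER FLOOD" = 5.029e9)

# Convert to billions of U.S. dollars.
prop_billion <- prop_damage / 1e9
crop_billion <- crop_damage / 1e9

# Combined damage of the two leading event types in each category.
top_two_property <- sum(prop_billion[c("FLOOD", "HURRICANE/TYPHOON")])
top_two_crop     <- sum(crop_billion[c("DROUGHT", "FLOOD")])

print(round(prop_billion, 1))
print(round(crop_billion, 1))
print(c(property = top_two_property, crop = top_two_crop))
```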