Effect of major weather events on the population and economic health of the U.S.

Synopsis

This project explores the NOAA storm database in order to assess the effect of severe weather events on the economy and population of the U.S. Of particular interest are the following two questions:

  • Across the United States, which types of events are most harmful with respect to population health?
  • Across the United States, which types of events have the greatest economic consequences?

We found that tornadoes are the most harmful events to population health, causing the most injuries and deaths. As far as economic health is concerned, floods caused the most property damage and droughts caused the most crop damage.

Data preprocessing

The data for this assignment is available here. It contains characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and crop and property damage. The events in the database start in the year 1950 and end in November 2011.

Load the data unless it already is in the workspace. We assume that the compressed file resides in the current directory.

library(readr)
if (!"stormdata" %in% ls())
  stormdata <- read_csv("repdata_data_StormData.csv.bz2")

Examine the structure of the dataset.

str(stormdata)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: chr  NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Since we are most concerned with weather effects on population and economic health across the whole country as well as across time we ignore all time and location related variables and concentrate on the following variables:

  • EVTYPE - the event type, e.g. “Tornado”, etc.
  • FATALITIES - the number of deaths caused by the event
  • INJURIES - the number of injuries caused by the event
  • PROPDMG - together with PROPDMGEXP specifies the amount of property damage caused by the event
  • PROPDMGEXP - the exponent with which to scale the value of PROPDMG
  • CROPDMG - together with CROPDMGEXP specifies the amount of crop damage caused by the event
  • CROPDMGEXP - the exponent with which to scale the value of CROPDMG
library(dplyr)
stormdata <- select(stormdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Note: while the data in the earlier years is less complete than that of more recent years we choose to consider the full span of years nonetheless.

There are many event types, 977 to be precise:

length(unique(stormdata$EVTYPE))
## [1] 977

Many of them really refer to the same type, e.g. TORNADO and TORNADOS:

unique(stormdata[grep("^.*TORNADO.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE)
##  [1] "TORNADO"                    "TORNADO F0"                
##  [3] "TORNADOS"                   "WATERSPOUT/TORNADO"        
##  [5] "WATERSPOUT TORNADO"         "WATERSPOUT-TORNADO"        
##  [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"          
##  [9] "WATERSPOUT/ TORNADO"        "TORNADO F3"                
## [11] "TORNADO F1"                 "TORNADO/WATERSPOUT"        
## [13] "TORNADO F2"                 "TORNADOES"                 
## [15] "TORNADO DEBRIS"

We re-label all the tornado-like events as TORNADO and similarly for all the hurricane-like and flood-like events.

stormdata[grep("^.*TORNADO.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "TORNADO"
stormdata[grep("^.*HURRICANE.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "HURRICANE"
stormdata[grep("^.*FLOOD.*$", stormdata$EVTYPE, ignore.case = TRUE),]$EVTYPE = "FLOOD"

A thorough analysis would take into consideration all event types but this is hardly the goal of this particular exercise.

Effect on population - injuries and fatalities

We compute the total numbers of injuries and fatalities by event type and sort the rows in descending order.

totInjuries.by.eventtype <- group_by(stormdata, EVTYPE) %>%
  summarise(totInjuries = sum(INJURIES)) %>%
  arrange(desc(totInjuries))
totFatalities.by.eventtype <- group_by(stormdata, EVTYPE) %>%
  summarise(totFatalities = sum(FATALITIES)) %>%
  arrange(desc(totFatalities))
totInjuries.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
## 
##               EVTYPE totInjuries
##                (chr)       (dbl)
## 1            TORNADO       91407
## 2              FLOOD        8604
## 3          TSTM WIND        6957
## 4     EXCESSIVE HEAT        6525
## 5          LIGHTNING        5230
## 6               HEAT        2100
## 7          ICE STORM        1975
## 8  THUNDERSTORM WIND        1488
## 9               HAIL        1361
## 10         HURRICANE        1328
totFatalities.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
## 
##            EVTYPE totFatalities
##             (chr)         (dbl)
## 1         TORNADO          5661
## 2  EXCESSIVE HEAT          1903
## 3           FLOOD          1525
## 4            HEAT           937
## 5       LIGHTNING           816
## 6       TSTM WIND           504
## 7     RIP CURRENT           368
## 8       HIGH WIND           248
## 9       AVALANCHE           224
## 10   WINTER STORM           206

We want to display bar graphs of the number of injuries and fatalities for the most severe weather events, say the “top” 10 event types. Thus we need to reorder the levels of the factors totInjuries.by.eventtype$EVTYPE and totFatalities.by.eventtype$EVTYPE in decreasing order of totInjuries and totFatalities. However, specifying decreasing = TRUE as the argument to sort has no effect so we simply reverse the order of the levels.

totInjuries.by.eventtype$EVTYPE <- reorder(totInjuries.by.eventtype$EVTYPE,
                                           totInjuries.by.eventtype$totInjuries,
                                           FUN = sort)
totInjuries.by.eventtype$EVTYPE =
  factor(totInjuries.by.eventtype$EVTYPE, 
         levels = rev(levels(totInjuries.by.eventtype$EVTYPE)))

totFatalities.by.eventtype$EVTYPE <- reorder(totFatalities.by.eventtype$EVTYPE,
                                             totFatalities.by.eventtype$totFatalities,
                                             FUN = sort)
totFatalities.by.eventtype$EVTYPE =
  factor(totFatalities.by.eventtype$EVTYPE, 
         levels = rev(levels(totFatalities.by.eventtype$EVTYPE)))

We now plot the worst 10 weather events for the number of injuries and fatalities.

library(ggplot2)
totInjuries.plot <- ggplot(totInjuries.by.eventtype[1:10,], aes(x=EVTYPE, y=totInjuries)) +
  geom_bar(stat = "identity", fill="lightblue", colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Event Type") +
  ylab("Total number of injuries") +
  ggtitle("Top 10 weather events in terms of injuries")

totFatalities.plot <- ggplot(totFatalities.by.eventtype[1:10,], aes(x=EVTYPE, y=totFatalities)) +
  geom_bar(stat = "identity", fill="lightblue", colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.margin = unit(c(0.2,0,1,0.1), "cm")) +
  xlab("Event Type") +
  ylab("Total number of fatalities") +
  ggtitle("Top 10 weather events in terms of fatalities")
library(grid)
library(gridExtra)
grid.arrange(totInjuries.plot, totFatalities.plot, ncol=2)

Effect on the economy - property and crop damage

We compute the damage (in dollars) caused by the various event types to property and crops. We use the variables PROPDMG and PROPDMGEXP for property damage and CROPDMG and CROPDMGEXP for crop damage. The variables PROPDMG and PROPDMGEXP encode the actual amount as \(x * 10^y\) where \(x\) is the value stored in PROPDMG and \(y\) the value stored in PROPDMGEXP. Similarly for CROPDMG and CROPDMGEXP. However, the variables PROPDMGEXP and CROPDMGEXP hold many NA values together with several invalid quantities. Furthermore, valid values can be either a number or a letter code such as “K” for \(10^3\) or “M” for \(10^6\).

unique(stormdata$PROPDMGEXP)
##  [1] "K" "M" NA  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(stormdata$CROPDMGEXP)
## [1] NA  "M" "K" "m" "B" "?" "0" "k" "2"

We replace invalid values with 0 and the letter codes with the corresponding exponent.

stormdata$PROPDMGEXP[which(is.na(stormdata$PROPDMGEXP))] <- 0
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("+","?","-"))] <- 0
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("H","h"))] <- 2
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP == "K")] <- 3
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP %in% c("M","m"))] <- 6
stormdata$PROPDMGEXP[which(stormdata$PROPDMGEXP == "B")] <- 9

stormdata$CROPDMGEXP[which(is.na(stormdata$CROPDMGEXP))] <- 0
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP == "?")] <- 0
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP %in% c("K","k"))] <- 3
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP %in% c("M","m"))] <- 6
stormdata$CROPDMGEXP[which(stormdata$CROPDMGEXP == "B")] <- 9

We now create two new variables PropDmgAmount and CropDmgAmount to hold the actual damage amounts.

stormdata <- mutate(stormdata,
                    PropDmgAmount = PROPDMG * 10^as.numeric(PROPDMGEXP),
                    CropDmgAmount = CROPDMG * 10^as.numeric(CROPDMGEXP))

We tally the property and crop damage amounts by event type. This is entirely similar to what was done above for the number of injuries and fatalities.

totPropDmg.by.eventtype <- group_by(stormdata, EVTYPE) %>%
  summarise(totPropDmg = sum(PropDmgAmount)) %>%
  arrange(desc(totPropDmg))
totPropDmg.by.eventtype$EVTYPE <- reorder(totPropDmg.by.eventtype$EVTYPE,
                                          totPropDmg.by.eventtype$totPropDmg,
                                          FUN = sort)
totPropDmg.by.eventtype$EVTYPE =
  factor(totPropDmg.by.eventtype$EVTYPE, levels = rev(levels(totPropDmg.by.eventtype$EVTYPE)))

totCropDmg.by.eventtype <- group_by(stormdata, EVTYPE) %>%
  summarise(totCropDmg = sum(CropDmgAmount)) %>%
  arrange(desc(totCropDmg))
totCropDmg.by.eventtype$EVTYPE <- reorder(totCropDmg.by.eventtype$EVTYPE,
                                          totCropDmg.by.eventtype$totCropDmg,
                                          FUN = sort)
totCropDmg.by.eventtype$EVTYPE =
  factor(totCropDmg.by.eventtype$EVTYPE, levels = rev(levels(totCropDmg.by.eventtype$EVTYPE)))

Here are the top 10 worst weather events for property damage:

totPropDmg.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
## 
##              EVTYPE   totPropDmg
##              (fctr)        (dbl)
## 1             FLOOD 168212215835
## 2         HURRICANE  84756180010
## 3           TORNADO  58603317927
## 4       STORM SURGE  43323536000
## 5              HAIL  15735267513
## 6    TROPICAL STORM   7703890550
## 7      WINTER STORM   6688497251
## 8         HIGH WIND   5270046295
## 9          WILDFIRE   4765114000
## 10 STORM SURGE/TIDE   4641188000

and the top 10 worst for crop damage:

totCropDmg.by.eventtype[1:10,]
## Source: local data frame [10 x 2]
## 
##            EVTYPE  totCropDmg
##            (fctr)       (dbl)
## 1         DROUGHT 13972566000
## 2           FLOOD 12380109100
## 3       HURRICANE  5515292800
## 4       ICE STORM  5022113500
## 5            HAIL  3025954473
## 6    EXTREME COLD  1292973000
## 7    FROST/FREEZE  1094086000
## 8      HEAVY RAIN   733399800
## 9  TROPICAL STORM   678346000
## 10      HIGH WIND   638571300

Finally, we display bar graphs of the 10 worst events:

totPropDmg.plot <- ggplot(totPropDmg.by.eventtype[1:10,], aes(x=EVTYPE, y=totPropDmg)) +
  geom_bar(stat = "identity", fill="lightblue", colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Event Type") +
  ylab("Total amount of property damage ($)") +
  ggtitle("Top 10 weather events\n in terms of property damage")

totCropDmg.plot <- ggplot(totCropDmg.by.eventtype[1:10,], aes(x=EVTYPE, y=totCropDmg)) +
  geom_bar(stat = "identity", fill="lightblue", colour="black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.margin = unit(c(0.2,0,0.5,0.2), "cm")) +
  xlab("Event Type") +
  ylab("Total amount of crop damage ($)") +
  ggtitle("Top 10 weather events\n in terms of crop damage")
grid.arrange(totPropDmg.plot, totCropDmg.plot, ncol=2)

Results

We see that tornadoes caused the most injuries and deaths while flood caused the most property damage and drought caused the most crop damage.