Synopsis

In this analysis, we sought to answer the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
For population health, tornados resulted in the greatest impact on event fatality, injury, and composite casualty totals. As for economic consequences, flooding appeared to have the greatest impact.

About the NOAA Storm Database

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

To start, data from the NOAA database are read in from their online repository, and we can inspect the general organization of these data.

# reading in data
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./FStormData.csv.bz2", method = "curl")
stormdata <- read.csv("FStormData.csv.bz2")

# show first few fines of dataset, and show variable characteristics
head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
dim(stormdata)
## [1] 902297     37
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

This is a relatively large data set, with many variables that are not immediately pertinent to the objectives of this analysis, so they will be excluded henceforth for simplicity and speed of processing.

# selecting high-yield columns
stormdata <- stormdata %>%
        select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS)

# convert PROPDMGEXP to actual numbers
value <- function(x) {
  x <- tolower(x)
  if (x == "k") res <- 1000
  if (x == "m") res <- 1e+06
  if (x == "b") res <- 1e+09
  else res <- 1
  res
}

stormdata$PROP_DMG <- stormdata$PROPDMG * sapply(stormdata$PROPDMGEXP, value) /1000000
stormdata$CROP_DMG <- stormdata$CROPDMG * sapply(stormdata$CROPDMGEXP, value) /1000000
stormdata$TOTAL_DMG <- stormdata$PROP_DMG + stormdata$CROP_DMG

# casualties = composite measure of fatalities + injuries
stormdata$CASUALTIES <- stormdata$FATALITIES + stormdata$INJURIES

# again, show first few fines of dataset, and show variable characteristics
head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0           
##   REMARKS  PROP_DMG CROP_DMG TOTAL_DMG CASUALTIES
## 1         0.0000250        0 0.0000250         15
## 2         0.0000025        0 0.0000025          0
## 3         0.0000250        0 0.0000250          2
## 4         0.0000025        0 0.0000025          2
## 5         0.0000025        0 0.0000025          2
## 6         0.0000025        0 0.0000025          6
dim(stormdata)
## [1] 902297     12
str(stormdata)
## 'data.frame':    902297 obs. of  12 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PROP_DMG  : num  0.000025 0.0000025 0.000025 0.0000025 0.0000025 0.0000025 0.0000025 0.0000025 0.000025 0.000025 ...
##  $ CROP_DMG  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTAL_DMG : num  0.000025 0.0000025 0.000025 0.0000025 0.0000025 0.0000025 0.0000025 0.0000025 0.000025 0.000025 ...
##  $ CASUALTIES: num  15 0 2 2 2 6 1 0 15 0 ...

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Fatalities

First, we will look at the number of fatalities with each event type.

# calculate total number of fatalities
totalfatalities <- stormdata %>%
        summarise(total = sum(FATALITIES))
# calculate total fatalities for each event type (over entire dataset), and display the top event types
fatalitydata <- stormdata %>%
        group_by(EVTYPE, ) %>%
        summarise(events = n(), total = sum(FATALITIES)) %>%
        arrange(desc(total))
topfatality <- fatalitydata[1:5,]
topfatality
## # A tibble: 5 x 3
##   EVTYPE         events total
##   <fct>           <int> <dbl>
## 1 TORNADO         60652  5633
## 2 EXCESSIVE HEAT   1678  1903
## 3 FLASH FLOOD     54277   978
## 4 HEAT              767   937
## 5 LIGHTNING       15754   816

We will break this down by showing the total number of fatalities for each weather event type, starting from most deadly event type to least. This shows that tornadoes resulted in the greatest number of fatalities, followed by excessive heat.

Injuries

It is also worth looking at the total number of non-fatal injuries for each weather event type. We will also show the top event types leading to the greatest number of injuries.

# calculate total number of injuries
totalinjuries <- stormdata %>%
        summarise(total = sum(INJURIES))

# calculate total injuries for each event type (over entire dataset), and display the top event types
injurydata <- stormdata %>%
        group_by(EVTYPE) %>%
        summarise(n = n(), total = sum(INJURIES)) %>%
        arrange(desc(total))
topinjury <- injurydata[1:5,]
topinjury 
## # A tibble: 5 x 3
##   EVTYPE              n total
##   <fct>           <int> <dbl>
## 1 TORNADO         60652 91346
## 2 TSTM WIND      219940  6957
## 3 FLOOD           25326  6789
## 4 EXCESSIVE HEAT   1678  6525
## 5 LIGHTNING       15754  5230

Again, tornadoes are at the top of the list, and are responsible for the greatest number of injuries.

Casualties

Finally, we will demonstrate which events account for the most casualties, defined as the sum of fatalities and injuries.

# calculate total casualties for each event type (over entire dataset), and display the top event types
casualtydata <- stormdata %>%
        group_by(EVTYPE) %>%
        summarise(n = n(), total = sum(CASUALTIES)) %>%
        arrange(desc(total))
topcasualty <- casualtydata[1:5,]
topcasualty
## # A tibble: 5 x 3
##   EVTYPE              n total
##   <fct>           <int> <dbl>
## 1 TORNADO         60652 96979
## 2 EXCESSIVE HEAT   1678  8428
## 3 TSTM WIND      219940  7461
## 4 FLOOD           25326  7259
## 5 LIGHTNING       15754  6046

Unsurprisingly, tornadoes account for the greatest number of casualties, with excessive heat as a runner-up. The data for fatalities, injuries, and casualties are displayed graphically below (Fig 1).

# graph of fatality/injury/casualty data
fig1a <- ggplot(topfatality, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
        geom_bar(stat = "identity") + 
        xlab("Top 5 events") + 
        ylab("Fatalties") + 
        ggtitle("Figure 1a. Severe weather event fatalities in USA from 1950-2011") +
        theme(axis.text.x = element_text(angle = 45, vjust=0.5), legend.position = "none")
fig1b <- ggplot(topinjury, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
        geom_bar(stat = "identity") +
        xlab("Top 5 events") + 
        ylab("Injuries") + 
        ggtitle("Figure 1b. Severe weather event injuries in USA from 1950-2011") +
        theme(axis.text.x = element_text(angle = 45, vjust=0.5), legend.position = "none")
fig1c <- ggplot(topcasualty, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
        geom_bar(stat = "identity") +
        xlab("Top 5 events") + 
        ylab("Casualties") + 
        ggtitle("Figure 1c. Severe weather event casualties in USA from 1950-2011") +
        theme(axis.text.x = element_text(angle = 45, vjust=0.5), legend.position = "none")
grid.arrange(fig1a, fig1b, fig1c, nrow = 3)

### Across the United States, which types of events have the greatest economic consequences? We will similarly examine the impact of severe weather events on damages to property, crops, and both (total damage).

Property

# calculate total property damage for each event type (over entire dataset), and display the top event types
propertydata <- stormdata %>%
        group_by(EVTYPE, ) %>%
        summarise(events = n(), total = sum(PROP_DMG)) %>%
        arrange(desc(total))
topproperty <- propertydata[1:5,]
topproperty
## # A tibble: 5 x 3
##   EVTYPE            events   total
##   <fct>              <int>   <dbl>
## 1 FLOOD              25326 122501.
## 2 HURRICANE/TYPHOON     88  65500.
## 3 STORM SURGE          261  42560.
## 4 HURRICANE            174   5700.
## 5 TORNADO            60652   5303.

Flooding appears to have caused the greatest property damage.

Crops

# calculate total property damage for each event type (over entire dataset), and display the top event types
cropdata <- stormdata %>%
        group_by(EVTYPE, ) %>%
        summarise(events = n(), total = sum(CROP_DMG)) %>%
        arrange(desc(total))
topcrop <- cropdata[1:5,]
topcrop
## # A tibble: 5 x 3
##   EVTYPE            events total
##   <fct>              <int> <dbl>
## 1 RIVER FLOOD          173 5000.
## 2 ICE STORM           2006 5000.
## 3 HURRICANE/TYPHOON     88 1510.
## 4 DROUGHT             2488 1500.
## 5 HEAT                 767  400.

Flooding also appears to have caused the greatest damage to crops.

Total Damage (Property + Crops)

# calculate total property damage for each event type (over entire dataset), and display the top event types
totaldata <- stormdata %>%
        group_by(EVTYPE, ) %>%
        summarise(events = n(), total = sum(TOTAL_DMG)) %>%
        arrange(desc(total))
toptotal <- totaldata[1:5,]
toptotal
## # A tibble: 5 x 3
##   EVTYPE            events   total
##   <fct>              <int>   <dbl>
## 1 FLOOD              25326 122501.
## 2 HURRICANE/TYPHOON     88  67010.
## 3 STORM SURGE          261  42560.
## 4 RIVER FLOOD          173  10000.
## 5 HURRICANE            174   5700.

Flooding appears to have caused the greatest economic damage, overall.

ggplot(toptotal, aes(x = EVTYPE, y = total, fill = EVTYPE)) +
        geom_bar(stat = "identity") +
        xlab("Top 5 events") + 
        ylab("Property and crop damage ($)") + 
        ggtitle("Figure 2. Severe weather event economic impact in USA from 1950-2011") +
        theme(axis.text.x = element_text(angle = 45, vjust=0.5), legend.position = "none")

Discussion

In summary, tornadoes had the greatest impact on population health (i.e., event fatalities, injuries, and composite casualties). As for economic consequences, flooding appeared to have the greatest impact.

There are some important limitations to this research. For instance, there is significant overlap among the various event types that severe weather events were categorized into. As such, the totals for each outcome examined may have been affected. In a future analysis, it would likely be useful to recategorize similar events together for a more complete understanding of impact on outcomes of question.