Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Using insights of the impact by severe weather events on population health and economic consequence, government or municipal managers who are responsible for preparing for severe weather events are able to prioritize resources for different types of events.

This report provides some insights of the impact across the United States by identifying (1) the types of events which are most harmful with respect to population health, and (2) the types of events which have the greatest economic consequences. This report presents the findings based on the data in the U.s. National Oceanic and Atmospheric Administration’s (NOAA) storm database for events between the year 1950 and Nov 2011.

During the period, Tornados posted the greatest impact to human health (91,346 injuries and 5,633 fatalities), followed by Excessive Heat (1,903 fatalities and 6525 injuries), TSTM Wind (504 fatalities and 6,957 injuries).

On the economic impact, Flood accounted for the highest damage of over US$150 billions, followed by Hurricane/Typhoon (over US$71 billions) and Tornado (over US$57 billions).

In conclusion, the analysis showed that Tornadoes and Flood were the two most costly threats to the human health (measured by fatalities and injuries) and economy health (measured by damages to property and crops). Threats by Excessive Heat and Hurricane/Typhoon shall not be overlooked as they also posted significant undesirable impact to both the human health and economy health.

Data Processing

1. Download and Read the NOAA Data File

About NOAA storm database: It tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

# Create a data directory (if not created yet).
if (!file.exists("./data")) {
dir.create("./data")
}

# Download the data file (if not downloaded yet).
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists("./data/stormdata.csv.bz2")) {
  download.file(fileUrl, destfile="./data/stormdata.csv.bz2", cacheOK = TRUE)
}

stormdata <- read.csv(bzfile("./data/StormData.csv.bz2"))
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The database contained 902,297 observations of 37 variables.

2. Subset with Relevant Data to Support Impact Studies

To analyze the impact of weather events to population health and economy, we will focus on the following measurements across the U.S.: * Measurements related to population health: FATALITIES, INJURIES * * Measurements related to economy consequences: PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP * * Events: Tidy the EV

data_subset <- subset(stormdata, 
                      (stormdata$INJURIES > 0 | stormdata$FATALITIES > 0 |
                       stormdata$PROPDMG > 0 | stormdata$CROPDMG > 0),
                      select = c("STATE","COUNTY","BGN_DATE","END_DATE",
                                 "EVTYPE","FATALITIES","INJURIES","PROPDMG",
                                 "PROPDMGEXP","CROPDMG","CROPDMGEXP"))
str(data_subset)
## 'data.frame':    254633 obs. of  11 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

The final dataset used for the weather events studies contains 254,633 observations of 11 variables.

3. Tidy the Data - Event Type

The variable EVTYPE records the events required for our studies. However, we observed inconsistency in the data due to typos, different abbreviations, letter capitalization standards, etc. For example: BEACH EROSIN, Beach Erosion, BEACH EROSION refer to the same event. This step performs some basic data cleansing on EVTYPE to resolve such inconsistency issue.

A new variable, ENTYPE_CLEAN, is created to store the cleansed event type.

data_subset$EVTYPE_CLEAN <- data_subset$EVTYPE
data_subset$EVTYPE_CLEAN <- sub(".*Avalanche.*", "Avalanche", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Blizzard.*", "Blizzard", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Cold.*", "Cold", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Debris.*", "Debris", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Drought.*", "Drought", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Dust.*", "Dust", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Depression.*", "Depression", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Fog.*", "Fog", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Flood.*", "Flood", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Freeze.*", "Freeze", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Funnel.*", "Funnel", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Hurricane.*", "Hurricane/Typhoon", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Hail.*", "Hail", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Heat.*", "Heat", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Lightning.*", "Lightning", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Rain.*", "Rain", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Rip.*", "Rip", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Snow.*", "Snow", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Smoke.*", "Smoke", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Surf.*", "Surf", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Seiche.*", "Seiche", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Sleet.*", "Sleet", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Thunderstorm.*", "Thunderstorm", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*tstm.*", "Thunderstorm", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tornado.*", "Tornado", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Typhoon.*", "Hurricane/Typhoon", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tsunami.*", "Tsunami", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Tide.*", "Tide", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Volcanic.*", "Volcanic", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Waterspout.*", "Waterspout", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*Wildfire|fire.*", "Wildfire", data_subset$EVTYPE, ignore.case=TRUE)
data_subset$EVTYPE_CLEAN <- sub(".*microburst.*", "Microburst", data_subset$EVTYPE, ignore.case=TRUE)

4. Tidy the Data - Damage Costs

This step computes the property damage costs (PROPDMG) and crop damage costs (CROPDMG). To compute the cost, we multiply the value in PROPDMG and CROPDMG with the unit (K, M, B) indicated in PROPDMGEXP and CROPDMGEXP respectively, where “K” stands for thousands, “M” for millions, and “B” for billions.

New variables created:

  1. PROPDMG_COST - property damage costs in million dollars

  2. CROPDMG_COST - crop damage costs in million dollars

compute_cost <- function(value, unit){
  # Cost is computed as value * unit, and then standardized to million dollars.
  if (is.na(unit)) {
    factor = 1
  } else if (unit == 'k' || unit == 'K') {
    factor = 1000
  } else if (unit == 'm' || unit == 'M'){
    factor = 1000000
  } else if (unit == 'b' || unit == 'B'){
    factor = 1000000000
  } else {
    factor = 1
  }
  # return computed value
  return((value * factor)/1000000) # standardize to million dollars
}

data_subset$PROPDMG_COST <- mapply(compute_cost, data_subset$PROPDMG, unit=data_subset$PROPDMGEXP)
data_subset$CROPDMG_COST <- mapply(compute_cost, data_subset$CROPDMG, unit=data_subset$CROPDMGEXP)

Now, the data is ready for us to condct a data analysis to address the following questions:

1.Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2.Across the United States, which types of events have the greatest economic consequences?

5. Data Analysis - Events with most harmful impact to Population Health

For damage related to population health, we summarize to get the total FATALITIES and total INJURIES per event type using the tidy dataset, and then find the top 10 events (EVTYPE_CLEAN) with the highest combined total fatalities and injuries.

New variables created:

  1. fatalities - total fatalities per event type

  2. injuries - total injuries per event type

  3. total_health - total fatalities + injuries per event type

  4. top10_health - top 10 events

# Aggregate fatalities by event type
fatalities <- aggregate(data_subset$FATALITIES, by=list(data_subset$EVTYPE_CLEAN), sum)
names(fatalities) <- c("event","fatalities")

# Aggregate injuries by event type
injuries <- aggregate(data_subset$INJURIES, by=list(data_subset$EVTYPE_CLEAN), sum)
names(injuries) <- c("event","injuries")

# Aggregate fatalities+injuries by event type
total_health <- aggregate(data_subset$FATALITIES+data_subset$INJURIES, by=list(data_subset$EVTYPE), sum)
names(total_health) <- c("event","total_health")

# Merge results
health_merged  <- merge(fatalities, injuries)
health_merged  <- merge(health_merged, total_health)

# Get top 10
top10_health <- head(health_merged[order(health_merged$total_health, decreasing = TRUE),],10)

6. Data Analysis - Events with greatest economic consequences

For damage related to economy, we summarize to get the total property damage (PROPDMG_COST) and total crop damage (CROPDMG_COST) using the tidy dataset, and then find the top 10 events (EVTYPE_CLEAN) with the highest combined total damage costs.

New variables created:

  1. property_damage - total property damage per event type

  2. crop_damage - total crop damage per event type

  3. total_damage - total property damage + crop damage per event type

  4. top10_damage - top 10 events

# Aggregate property damage by event type
property_damage <- aggregate(data_subset$PROPDMG_COST, by=list(data_subset$EVTYPE_CLEAN), sum)
names(property_damage) <- c("event","property_damage")

# Aggregate crop damage by event type
crop_damage <- aggregate(data_subset$CROPDMG_COST, by=list(data_subset$EVTYPE), sum)
names(crop_damage) <- c("event","crop_damage")

# Aggregate property damage and crop damage by event type
total_damage <- aggregate(data_subset$PROPDMG_COST+data_subset$CROPDMG_COST, 
                          by=list(data_subset$EVTYPE_CLEAN), sum)
names(total_damage) <- c("event","total_damage")

# Merge results
damage_merge <- merge(property_damage, crop_damage)
damage_merge <- merge(damage_merge, total_damage)

# Get top 10 events
top10_damage <- head(damage_merge[order(damage_merge$total_damage, decreasing = TRUE),],10)

Results

1. Events with the Most Harmful Impact to Population Health

This section lists the top 10 storm events with the most harmful impact to population health in terms of fatalities and injuries (observations made between 1950 to 2011).

From List 1, we observed that Tornado accounted for the maximum number of health impact (5,633 fatalities and 91,346 injuries), followed by Excessive Heat (1,903 fatalities and 6525 injuries), TSTM Wind (504 fatalities and 6,957 injuries) and so forth.

List 1 - Top 10 storm events with the most harmful impact to population health
# List the top 10 events
top10_health
##                 event fatalities injuries total_health
## 404           TORNADO       5633    91346        96979
## 60     EXCESSIVE HEAT       1903     6525         8428
## 420         TSTM WIND        504     6957         7461
## 85              FLOOD        470     6789         7259
## 257         LIGHTNING        816     5230         6046
## 150              HEAT        937     2100         3037
## 72        FLASH FLOOD        978     1777         2755
## 237         ICE STORM         89     1975         2064
## 361 THUNDERSTORM WIND        133     1488         1621
## 477      WINTER STORM        206     1321         1527
Figure 1 - Top 10 storm events with the most harmful impact to population health
# Melt the data to plot the breakdown by fatalities, injuries and total.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.1.2
top10_health_melted <- melt(top10_health, id=c("event"), 
                            measure.var=c("fatalities", "injuries", "total_health"))

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
# generate the plot
ggplot(top10_health_melted, aes(event, value)) + 
       geom_bar(aes(fill=variable), stat="identity", position="dodge") +
       xlab("Events") + 
       ylab("Populations") +
       ggtitle("Top 10 Events Most Harmful to Population Health") + 
       scale_fill_manual(values=c("#F0E442", "#0072B2", "#D55E00"), 
                         labels=c("Fatalities", "Injuries", "Total")) + 
theme(axis.text.x = element_text(angle=90))

From the trends displayed in Figure 1, we observed that Tornado posted the greatest threat to population health when comparing the magnitude of injuries against that caused by the other events. Though the magnitude of fatalities caused by Toranto was relatively lower than injuries, it was still the highest amongst all other events.

2. Events with greatest economic consequences

This section lists the top 10 storm events with the greatest economic consequences measured by the cost of property damage and crop damage for the observations made between 1950 to 2011.

From List 2, we observed that flood accounted for the highest damage of US$150,319 millions (or over US$150 billion), followed by Hurricane/Typhoon (total damage of US$71,913 millions or over US$71 billions) and Tornado (total damage of US$57,352 million or over US$57 billions).

List 2 - Top 10 storm events with greatest economic consequences
# List the top 10 events
top10_damage
##                 event property_damage crop_damage total_damage
## 85              FLOOD      144657.710   5661.9685   150319.678
## 223 HURRICANE/TYPHOON       69305.840   2607.8728    71913.713
## 404           TORNADO       56937.161    414.9533    57352.114
## 347       STORM SURGE       43323.536      0.0050    43323.541
## 133              HAIL       15732.267   3025.9545    18758.222
## 72        FLASH FLOOD       16140.812   1421.3171    17562.129
## 49            DROUGHT        1046.106  13972.5660    15018.672
## 214         HURRICANE       11868.319   2741.9100    14610.229
## 307       RIVER FLOOD        5118.945   5029.4590    10148.405
## 237         ICE STORM        3944.928   5022.1135     8967.041
Figure 2 - Top 10 storm events with greatest economic consequences
# Melt the data to plot the breakdown by property damage, crop damage, and total.
library(reshape2)
top10_damage_melted <- melt(top10_damage, id=c("event"), 
                            measure.var=c("property_damage", "crop_damage", "total_damage"))

library(ggplot2)
# generate the plot
ggplot(top10_damage_melted, aes(event, value)) + 
       geom_bar(aes(fill=variable), stat="identity", position="dodge") +
       xlab("Events") + 
       ylab("Damages (US$ in Million)") +
       ggtitle("Top 10 Events with greatest economic consequences") + 
       scale_fill_manual(values=c("#F0E442", "#0072B2", "#D55E00"), 
                         labels=c("Property", "Crop", "Total")) + 
theme(axis.text.x = element_text(angle=90))

From the trends displayed in Figure 2, we observed that Flood caused the greatest loss in terms of property damage and crop damage, followed by Hurricane/Typhoon and Tornado. In comparison of damages between property and crop, flooding was the most damaging to property, with over $144 billions in reported damages (vs total damage of over $150 billions).