Synopsis

In this report we examine storm data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which storm conditions are most deleterious to human life and to economic prosperity.

Data Processing

Data Munging

The NOAA storm database contains data collected from 1950 to 2011; the records for later years are more complete. For the health analysis we extracted all the records with fatalities and all the records with injuries. The only fields of interest for this assignment are those describing the type of weather, the numbers of fatalities and injuries, and the property and crop damage. Computing the actual property and crop damage required combining two columns, because one column holds a number while another holds the suffix (such as K or M) needed to turn that number into a dollar amount.

Printing the unique weather event types (EVTYPE) made it clear that different naming conventions were sometimes used for the same type of weather. Although time limits precluded a comprehensive standardization of names, we made some effort to reduce these discrepancies. This attempt, though incomplete, reduced the number of weather conditions from 890 to 45. Note that the drowning category no doubt confounds, to some extent, the number of deaths due to flooding, tsunamis, maritime accidents, etc.
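The suffix handling described above can also be expressed as a lookup table. The following is a minimal sketch, not the code used in this report: the helper name `to_dollars` and the sample values are hypothetical, and it assumes (as the report's code does) that only the K, M, and B suffixes, in either case, are scaled.

```r
# Hypothetical helper: convert a damage figure plus its magnitude suffix to dollars.
# Suffixes other than K, M, or B (either case) leave the number unchanged.
to_dollars <- function(amount, suffix) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multiplier[toupper(suffix)]  # NA where the suffix is not K, M, or B
  m[is.na(m)] <- 1                  # unknown or empty suffix: no scaling
  amount * unname(m)
}

# Toy values, not taken from the NOAA file.
dollars <- to_dollars(c(25, 2.5, 1, 10), c("K", "m", "B", ""))
print(dollars)
```

A lookup vector keeps the suffix-to-multiplier mapping in one place, so supporting another suffix would mean adding one entry rather than another pair of `ifelse()` calls.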

# Read in data.
df <- read.csv('repdata-data-StormData.csv.bz2', stringsAsFactors = FALSE)
# Only keep these columns.
keep <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')
df <- df[, keep]

# Convert to upper case.
df$EVTYPE <- toupper(df$EVTYPE)

# Remove white space from beginning and end of entries.
library(stringr)
df$EVTYPE <- str_trim(df$EVTYPE, side = "both")

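# Compute dollar values for PROPDMG from the values in PROPDMG and PROPDMGEXP,
# and compute dollar values for CROPDMG from the values in CROPDMG and CROPDMGEXP.
# Only the K, M, and B suffixes (either case) are scaled; any other suffix
# leaves the value unchanged.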
df$PROPDMG <- ifelse((df$PROPDMGEXP == "k") | (df$PROPDMGEXP == "K"), df$PROPDMG * 1e3, df$PROPDMG)
df$PROPDMG <- ifelse((df$PROPDMGEXP == "m") | (df$PROPDMGEXP == "M"), df$PROPDMG * 1e6, df$PROPDMG)
df$PROPDMG <- ifelse((df$PROPDMGEXP == "b") | (df$PROPDMGEXP == "B"), df$PROPDMG * 1e9, df$PROPDMG)
df$CROPDMG <- ifelse((df$CROPDMGEXP == "k") | (df$CROPDMGEXP == "K"), df$CROPDMG * 1e3, df$CROPDMG)
df$CROPDMG <- ifelse((df$CROPDMGEXP == "m") | (df$CROPDMGEXP == "M"), df$CROPDMG * 1e6, df$CROPDMG)
df$CROPDMG <- ifelse((df$CROPDMGEXP == "b") | (df$CROPDMGEXP == "B"), df$CROPDMG * 1e9, df$CROPDMG)
df$PROPDMGEXP <- NULL;  df$CROPDMGEXP <- NULL    # no longer needed

sorted_storm_types <- sort(unique(df$EVTYPE))
#print(sorted_storm_types)

# Reduce storm naming discrepancies, and agglomerate similar weather conditions.
df[grep("AVALANCE", df$EVTYPE), "EVTYPE"] <- "AVALANCHE"
df[grep("BEACH EROSIN", df$EVTYPE), "EVTYPE"] <- "BEACH EROSION"
df[grep("BITTER WIND CHILL|EXTREME WIND CHILL", df$EVTYPE), "EVTYPE"] <- "BITTER WIND CHILL"
df[grep("BLIZZARD", df$EVTYPE), "EVTYPE"] <- "BLIZZARD"
df[grep("COASTAL|CSTL|BEACH|SURF|SWELLS|SEAS|WAVES|SURGE|TIDE", df$EVTYPE), "EVTYPE"] <- "COASTAL FLOODING"
df[grep("TORNADO|TORNDAO|LANDSPOUT|WATERSPOUT|WAYTERSPOUT|WATER SPOUT|FUNNEL", 
        df$EVTYPE), "EVTYPE"] <- "TORNADO"
df[grep("COLD|COOL|FREEZ|ICE|LOW TEMPERATURE|HYPOTHERMIA|RECORD LOW", df$EVTYPE), "EVTYPE"] <- "COLD"
df[grep("DRY MICROBURST", df$EVTYPE), "EVTYPE"] <- "D_MICROBURST"
df[grep("DRY", df$EVTYPE), "EVTYPE"] <- "DRY"
df[grep("HEAT|HOT|WARM|HIGH TEMPERATURE|RECORD HIGH", df$EVTYPE), "EVTYPE"] <- "HOT"
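# Caution: the bare "HIGH" alternative below also matches any label that still
# contains it at this point (for example "HIGH WIND"), so those records are
# counted as "HOT".
df[grep("RECORD TEMPERATURE|HYPERTHERMIA|TEMPERATURE RECORD|HIGH", df$EVTYPE), "EVTYPE"] <- "HOT"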
df[grep("FIRE|SMOKE|DENSE SMOKE", df$EVTYPE), "EVTYPE"] <- "FIRE"
df[grep("SNOW|BLIZZARD", df$EVTYPE), "EVTYPE"] <- "SNOW"
df[grep("FLOOD|FLOOOD|FLD|RAPIDLY RISING WATER|HIGH WATER|DAM", df$EVTYPE), "EVTYPE"] <- "FLOOD"
df[grep("FOG", df$EVTYPE), "EVTYPE"] <- "FOG"
df[grep("RAIN|PRECIP|SHOWER|WET WEATHER|DOWNBURST", df$EVTYPE), "EVTYPE"] <- "RAIN"
df[grep("WIND|GUST|WND|SEVERE TURBULENCE", df$EVTYPE), "EVTYPE"] <- "WIND"
df[grep("HAIL", df$EVTYPE), "EVTYPE"] <- "HAIL"
df[grep("HURRICANE|FLOYD", df$EVTYPE), "EVTYPE"] <- "HURRICANE"
df[grep("LIGNTNING|LIGHTNING", df$EVTYPE), "EVTYPE"] <- "LIGHTNING"
df[grep("LANDSL|MUD|SLIDE", df$EVTYPE), "EVTYPE"] <- "LANDSLIDE"
df[grep("THUNDERSTORM", df$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
df[grep("TROPICAL", df$EVTYPE), "EVTYPE"] <- "TROPICAL"
df[grep("VOLCANIC|VOG", df$EVTYPE), "EVTYPE"] <- "VOLCANIC"
df[grep("WINTER", df$EVTYPE), "EVTYPE"] <- "WINTER STORM"
df[grep("URBAN", df$EVTYPE), "EVTYPE"] <- "URBAN/SMALL STREAM"
df[grep("SMALL STREEM", df$EVTYPE), "EVTYPE"] <- "SMALL STREEM"
df[grep("SLEET", df$EVTYPE), "EVTYPE"] <- "SLEET"
df[grep("MARINE", df$EVTYPE), "EVTYPE"] <- "MARINE ACCIDENT"
df[grep("RIP CURRENT", df$EVTYPE), "EVTYPE"] <- "RIP CURRENT"
df[grep("WALL CLOUD", df$EVTYPE), "EVTYPE"] <- "WALL CLOUD"
df[grep("DROUGHT|DRY|DRIEST", df$EVTYPE), "EVTYPE"] <- "DRY"  
df[grep("DUST", df$EVTYPE), "EVTYPE"] <- "DUST"
df[grep("WINTER|WINTRY", df$EVTYPE), "EVTYPE"] <- "WINTER STORM"
df[grep("WET MICROBURST|WET MICOBURST", df$EVTYPE), "EVTYPE"] <- "WET MICROBURST"
df[grep("D_MICROBURST", df$EVTYPE), "EVTYPE"] <- "DRY MICROBURST"
df[grep("FROST", df$EVTYPE), "EVTYPE"] <- "FROST"
df[grep("TSTM", df$EVTYPE), "EVTYPE"] <- "TSTMW"
df[grep("WET MONTH|WET YEAR|EXTREMELY WET|ABNORMALLY WET", df$EVTYPE), "EVTYPE"] <- "WET"
df[grep("SUMMARY|NONE|OTHER|METRO STORM|NO SEVERE|SMALL STREAM AND", 
        df$EVTYPE), "EVTYPE"] <- "?"
df[grep("EXCESSIVE|RED FLAG CRITERIA|APACHE COUNTY|MONTHLY TEMPERATURE", 
        df$EVTYPE), "EVTYPE"] <- "?"
#
sorted_storm_types <- sort(unique(df$EVTYPE))
#print( sorted_storm_types )
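The long run of `grep()` assignments above can be viewed as an ordered list of pattern-to-label rules. A minimal sketch of that idea follows, assuming a hypothetical `rules` vector and `standardize()` helper (neither is part of this report's code) and using only three of the rules for brevity.

```r
# Hypothetical rule table: names are regular expressions, values are the
# standardized labels. Rules are applied in order, as in the script above.
rules <- c(
  "TORNADO|WATERSPOUT|FUNNEL" = "TORNADO",
  "FLOOD|FLD"                 = "FLOOD",
  "LIGNTNING|LIGHTNING"       = "LIGHTNING"
)

# Apply each rule in turn; a label rewritten by an earlier rule is what
# later rules see, so the order of the rules matters.
standardize <- function(evtype, rules) {
  for (pattern in names(rules)) {
    evtype[grepl(pattern, evtype)] <- rules[[pattern]]
  }
  evtype
}

# Toy labels in the style of the raw EVTYPE column.
raw <- c("TORNADO F2", "FLASH FLOOD", "LIGNTNING", "HAIL")
cleaned <- standardize(raw, rules)
print(cleaned)
```

Keeping the rules in one table makes it easier to spot duplicated or overly broad patterns, since every pattern sits next to the label it produces.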

Data Analysis

This analysis does not account for the fact that certain data was not collected during the early years. That should not be a problem here, since we do not combine categories (for example, the amount of property damage with the amount of crop damage) but deal with each one separately.

Harm to Human Health

We now use this data to discover which weather conditions cause the most injuries and which result in the largest number of deaths. For the analysis of weather-related injuries we use only the records with at least one injury, while for the fatality analysis we use only the records with at least one fatality.

## Injuries & Fatalities

# Only use data with at least one injury.
df.injury <- df[df$INJURIES > 0, ]
counts.injury <- table(df.injury$INJURIES)
# Simple Bar Plot
#barplot(counts.injury, main = "Counts", 
#    xlab = "Number of Injuries", col = 'red')

# Only use data with at least one fatality.
df.fatal <- df[df$FATALITIES > 0, ]
counts.fatal <- table(df.fatal$FATALITIES)
# Simple Bar Plot 
#barplot(counts.fatal, main = "Counts", 
#    xlab = "Number of Fatalities", col = 'red')

# For each of our 45 weather conditions, we form the aggregate sums.
conditions <- unique(df$EVTYPE)

# Create new dataframes with one of each type of weather condition along with 
# the sum of all injuries/fatalities for that weather condition.
sum_injuries <- data.frame(EVTYPE = character(), Sum = numeric(), stringsAsFactors = FALSE)
sum_fatalities <- data.frame(EVTYPE = character(), Sum = numeric(), stringsAsFactors = FALSE)
for (w in conditions) {
    # Injuries
    condition <- subset(df.injury, EVTYPE == w )
    new_row <- data.frame(EVTYPE = w, Sum = sum(condition$INJURIES), stringsAsFactors = FALSE)
    sum_injuries = rbind(sum_injuries, new_row)
    # Fatalities
    condition <- subset(df.fatal, EVTYPE == w )
    new_row <- data.frame(EVTYPE = w, Sum = sum(condition$FATALITIES), stringsAsFactors = FALSE)
    sum_fatalities = rbind(sum_fatalities, new_row)
}
# Save the top 10 in terms of numbers of injuries caused by the various weather conditions.
sum_injuries.top10 <- sum_injuries[with(sum_injuries, order(-Sum)), ][1:10, ]
# The top 6 weather-related causes of injuries:
print( head(sum_injuries.top10) )
##       EVTYPE   Sum
## 1    TORNADO 91439
## 7        HOT 10726
## 2       WIND  9931
## 14     FLOOD  9031
## 10 LIGHTNING  5232
## 4       COLD  2524
# Save the top 10 in terms of numbers of fatalities caused by the various weather conditions.
sum_fatalities.top10 <- sum_fatalities[with(sum_fatalities, order(-Sum)), ][1:10, ]
# The top 6 weather-related causes of fatalities:
print( head(sum_fatalities.top10) )
##       EVTYPE  Sum
## 1    TORNADO 5664
## 7        HOT 3429
## 14     FLOOD 1820
## 2       WIND  902
## 10 LIGHTNING  817
## 4       COLD  578
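The per-condition sums built in the loop above could also be obtained in a single call to base R's `aggregate()`. This is only a sketch of an alternative, not the method used for the figures in this report; the `toy` data frame and its values are invented for illustration and reuse the report's column names.

```r
# Toy data frame with the same column names as the report's df (values invented).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HOT", "TORNADO", "FLOOD", "HOT"),
  INJURIES   = c(10, 0, 5, 2, 3),
  FATALITIES = c(1, 2, 0, 0, 4),
  stringsAsFactors = FALSE
)

# Sum both outcome columns per weather condition in one call.
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank the conditions by total injuries, largest first.
ranked <- totals[order(-totals$INJURIES), ]
print(ranked)
```

Because rows with zero injuries or fatalities add nothing to a sum, this form does not need the separate filtered data frames.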

Damage to Property and Crops

We now use this data to discover which weather conditions cause the most damage to property and to crops. For each of the two analyses we use only the records that report some damage of that kind.

## Property and Crop damage

# Only use data with reported property damage.
df.prop <- df[df$PROPDMG > 0, ]
counts.prop <- table(df.prop$PROPDMG)
# Simple Bar Plot
#barplot(counts.prop, main = "Counts", 
#    xlab = "Amount of Property Damage", col = 'red')

# Only use data with reported crop damage.
df.crop <- df[df$CROPDMG > 0, ]
counts.crop <- table(df.crop$CROPDMG)
# Simple Bar Plot 
#barplot(counts.crop, main = "Counts", 
#    xlab = "Amount of Crop Damage", col = 'red')

# Create new dataframes with one of each type of weather condition along with 
# the sum of all property/crop damage for that weather condition.
sum_prop <- data.frame(EVTYPE = character(), Sum = numeric(), stringsAsFactors = FALSE)
sum_crop <- data.frame(EVTYPE = character(), Sum = numeric(), stringsAsFactors = FALSE)
for (w in conditions) {
    # Property damage
    condition <- subset(df.prop, EVTYPE == w )
    new_row <- data.frame(EVTYPE = w, Sum = sum(condition$PROPDMG), stringsAsFactors = FALSE)
    sum_prop = rbind(sum_prop, new_row)
    # Crop damage
    # NOTE: this subsets df.fatal rather than df.crop, so the crop totals
    # reported below cover only events with at least one fatality.
    condition <- subset(df.fatal, EVTYPE == w )
    new_row <- data.frame(EVTYPE = w, Sum = sum(condition$CROPDMG), stringsAsFactors = FALSE)
    sum_crop = rbind(sum_crop, new_row)
}
# Save the top 10 in terms of the property damage caused by the various weather conditions.
sum_prop.top10 <- sum_prop[with(sum_prop, order(-Sum)), ][1:10, ]
# The top 6 weather-related causes of property damage:
print( head(sum_prop.top10) )
##       EVTYPE          Sum
## 14     FLOOD 215668685232
## 8  HURRICANE  84656105010
## 1    TORNADO  58602869829
## 3       HAIL  15973898048
## 2       WIND   9960411048
## 23      FIRE   8501728500
# Save the top 10 in terms of the crop damage caused by the various weather conditions.
sum_crop.top10 <- sum_crop[with(sum_crop, order(-Sum)), ][1:10, ]
# The top 6 weather-related causes of crop damage:
print( head(sum_crop.top10) )
##       EVTYPE        Sum
## 8  HURRICANE 3291230800
## 7        HOT  752605100
## 14     FLOOD  188520750
## 30  TROPICAL  135435000
## 2       WIND  126148500
## 5       SNOW  112040000
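Restricting the data to records with positive damage, as described above, does not change the per-condition sums, because zero-damage rows contribute nothing. The sketch below illustrates this with invented values and base R's `tapply()`. The `toy` data frame is hypothetical, and the check assumes the damage column has no missing values.

```r
# Toy property-damage records (values invented, same column names as df).
toy <- data.frame(
  EVTYPE  = c("FLOOD", "HAIL", "FLOOD", "HAIL", "WIND"),
  PROPDMG = c(2e6, 0, 5e5, 1e4, 0),
  stringsAsFactors = FALSE
)

# Totals over all rows, and over rows with positive damage only.
all_rows <- tapply(toy$PROPDMG, toy$EVTYPE, sum)
pos      <- toy[toy$PROPDMG > 0, ]
pos_rows <- tapply(pos$PROPDMG, pos$EVTYPE, sum)

# The sums agree wherever a condition has any damage; a condition with none
# (WIND here) simply drops out of the filtered result.
same <- all_rows[names(pos_rows)] == pos_rows
print(same)
```

The filter therefore matters only for which conditions appear in the result, not for the totals themselves.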

Results

### Plots for Weather Caused Harm to Human Health

library(ggplot2);  library(gridExtra);  library(grid)  # grid provides grid.text() and gpar()

# We want this to display in the order the records appear in sum_injuries.top10 
# rather than alphabetical order (and the same for sum_fatalities.top10) ...
# 'EVTYPE' column --> ordered factor
sum_injuries.top10$EVTYPE <- factor(sum_injuries.top10$EVTYPE, 
                                       levels = rev(sum_injuries.top10$EVTYPE))
sum_fatalities.top10$EVTYPE <- factor(sum_fatalities.top10$EVTYPE, 
                                       levels = rev(sum_fatalities.top10$EVTYPE))

# Injuries plot
p1 <- ggplot(data = sum_injuries.top10, aes(x = EVTYPE, y = Sum)) +
  geom_bar(stat="identity", col = 'black', fill = "#E69F00") +
  coord_flip() +
  guides(fill = "none") +
  xlab("Weather Condition") + 
  ylab("# Injuries") +
  ggtitle("# Injuries by Weather") +
  theme(title = element_text(face = "bold"),
        axis.text = element_text(face = "bold"))

# Fatalities plot
p2 <- ggplot(data = sum_fatalities.top10, aes(x = EVTYPE, y = Sum)) +
  geom_bar(stat="identity", col = 'black', fill = "#E69F00") +
  coord_flip() +
  guides(fill = "none") +
  xlab("Weather Condition") + 
  ylab("# Fatalities") +
  ggtitle("# Fatalities by Weather") +
  theme(title = element_text(face ="bold"),
        axis.text = element_text(face="bold"))

grid.arrange(p1, p2, ncol = 2)
grid.text("Top 10", x=0.57, y=.98, gp = gpar(fontsize = 20, fontface = 2))

Figure 1: This figure depicts the top 10 most dangerous weather conditions. The bar chart on the left shows the number of injuries caused by the 10 most injurious weather conditions, while the one on the right shows the number of fatalities caused by the 10 most fatal weather conditions.
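The bar order in these charts comes from the factor levels rather than from the row order of the data frame. A small base R sketch of that mechanism follows, with an invented `top` data frame standing in for the top-10 tables.

```r
# Toy totals already sorted from largest to smallest (values invented).
top <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HOT"),
                  Sum    = c(50, 30, 20),
                  stringsAsFactors = FALSE)

# Default factor levels are alphabetical, which would scramble the bars.
default_levels <- levels(factor(top$EVTYPE))
print(default_levels)

# Use the reversed data order instead: after coord_flip() the first level is
# drawn at the bottom, so the largest total ends up at the top of the chart.
top$EVTYPE <- factor(top$EVTYPE, levels = rev(top$EVTYPE))
plot_levels <- levels(top$EVTYPE)
print(plot_levels)
```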

# Calculate the percent of injuries caused by the 5 most damaging weather conditions.
total_injuries <- sum(sum_injuries$Sum)
top5_injuries <- sum(sum_injuries.top10$Sum[1: 5])
print(top5_injuries / total_injuries * 100)
## [1] 89.91731
# Calculate the percent of fatalities caused by the 5 most fatal weather conditions.
total_fatalities <- sum(sum_fatalities$Sum)
top5_fatalities <- sum(sum_fatalities.top10$Sum[1: 5])
print(top5_fatalities / total_fatalities * 100)
## [1] 83.40707

Harm to Human Health

The two most dangerous weather conditions, both for injuries and for fatalities, are tornadoes and hot weather, in that order. In third and fourth place, the most injurious are windy conditions and flooding, while the most fatal are flooding and windy conditions. Apparently one is more likely to be injured by wind than by a flood, but more likely to be killed by a flood than by wind. The fifth most dangerous weather condition, for both injuries and fatalities, is lightning, with cold weather in sixth place. The five most injurious weather conditions together account for about 90% of weather-caused injuries, while the five most fatal account for about 83% of weather-caused fatalities (note that the top 10 account for about 96% of the weather fatalities).
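Shares such as "top 5" and "top 10" can be read from a single cumulative-percentage vector. The sketch below uses invented totals, not the figures from this report, and the names `totals` and `cum_share` are hypothetical.

```r
# Toy per-condition totals (invented values, not the report's figures).
totals <- c(TORNADO = 600, HOT = 200, FLOOD = 100, WIND = 60, COLD = 40)

# Sort from largest to smallest, then accumulate the percentage share.
sorted    <- sort(totals, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted) * 100
print(round(cum_share, 1))
```

The n-th entry of `cum_share` is the percentage accounted for by the n largest conditions, so one vector answers the question for every cut-off.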

### Plots for Weather Caused Damage to Property and Crops

# We want this to display in the order the records appear in sum_prop.top10 
# rather than alphabetical order (and the same for sum_crop.top10) ...
# 'EVTYPE' column --> ordered factor
sum_prop.top10$EVTYPE <- factor(sum_prop.top10$EVTYPE, 
                                levels = rev(sum_prop.top10$EVTYPE))
sum_crop.top10$EVTYPE <- factor(sum_crop.top10$EVTYPE, 
                                levels = rev(sum_crop.top10$EVTYPE))

# Property damage plot
p1 <- ggplot(data = sum_prop.top10, aes(x = EVTYPE, y = Sum)) +
  geom_bar(stat="identity", col = 'black', fill = "#E69F00") +
  coord_flip() +
  guides(fill = "none") +
  xlab("Weather Condition") + 
  ylab("Property Damage") +
  ggtitle("Property Damage by Weather") +
  theme(title = element_text(face="bold"),
        axis.text = element_text(face="bold"))

# Crop damage plot
p2 <- ggplot(data = sum_crop.top10, aes(x = EVTYPE, y = Sum)) +
  geom_bar(stat="identity", col = 'black', fill = "#E69F00") +
  coord_flip() +
  guides(fill = "none") +
  xlab("Weather Condition") + 
  ylab("Crop Damage") +
  ggtitle("Crop Damage by Weather") +
  theme(title = element_text(face="bold"),
        axis.text = element_text(face="bold"))

grid.arrange(p1, p2, ncol = 2)
grid.text("Top 10", x = 0.57, y = .98, gp = gpar(fontsize = 20, fontface = 2))

Figure 2: This figure depicts the top 10 most damaging weather conditions. The bar chart on the left shows the dollar amount of property damage caused by the 10 weather conditions most destructive to property, while the one on the right shows the dollar amount of crop damage caused by the 10 weather conditions most destructive to crops.

# Calculate the percent of property damage caused by the 5 most damaging weather conditions.
total_prop_damage <- sum(sum_prop$Sum)
top5_prop_damage <- sum(sum_prop.top10$Sum[1: 5])
print(top5_prop_damage / total_prop_damage * 100)
## [1] 90.0644
# Calculate the percent of crop damage caused by the 5 most damaging weather conditions.
total_crop_damage <- sum(sum_crop$Sum)
top5_crop_damage <- sum(sum_crop.top10$Sum[1: 5])
print(top5_crop_damage / total_crop_damage * 100)
## [1] 91.61891

Damage to Property and Crops

The highest amount of property damage is caused by floods, followed by hurricanes and then tornadoes. Hail and wind come next. These five most damaging weather conditions account for over 90% of the property damage caused by weather. As to crops, the totals computed above are drawn from the fatal-event subset (df.fatal), so they cover only crop damage from events with at least one fatality. Within those records, the most damage is caused by hurricanes, followed by hot weather, then flooding, tropical storms, and finally wind. These five conditions account for over 91% of the crop damage in that subset.