Synopsis

This document outlines a simple analysis of the US National Weather Service’s Storm Data Documentation. Of the event types covered, tornados caused the greatest number of deaths in total, while unseasonably warm and dry weather had the highest number of deaths on average per event. Tornados and hurricanes caused the most economic damage, both in total and on average per event.

This study was limited in scope. A more in depth analysis could be made of property and crop damage.

Note:

I use in-text calculations throughout. This means that sometimes, I create an object that stores a summary statistic, but don’t print it to the console. Instead I call it in the text. I think this is neater. I hope you agree.

Data Processing

Installing Packages

I use bunzip2 to unzip the data once downloaded. I also like to use dplyr to manipulate data. You’re not interested in the output from this chunk, so I’ve hidden it with the message = FALSE option.

# install.packages(c("R.utils", "dplyr"))
library(R.utils)
library(dplyr)

Loading and preprocessing the data

The data is a bit large, so I’ve cached this code chunk.

# Download if not already present
if(!file.exists("repdata-data-StormData.csv")){
    file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(file_url, "repdata-data-StormData.csv.bz2", method = "curl")
    bunzip2("repdata-data-StormData.csv.bz2")
} 

# Load the data
storm <- read.csv("repdata-data-StormData.csv", header = T, stringsAsFactors = T)
storm <- as.tbl(storm)

First, I check the variables associated with impacts on population health. There are two variables in the dataset that relate to health impacts: INJURIES and FATALITIES. There is no advice on how to weight deaths versus injuries, so I will treat them separately here.

# Check for missing data
na_death <- sum(is.na(storm$FATALITIES))
na_injury <- sum(is.na(storm$INJURIES))

There are no missing data in the injuries or fatalities. However, it is worth noting that there are 985 different levels to the EVTYPE variable. This raises some doubts about how the events were classified. A quick inspection reveals that there are multiple misspellings and often the same description in upper and lower case. A full clean-up of the data is beyond the scope of this assignment, but the capitalisation one is easy to fix:

# Fix capitalisation
storm$EVTYPE <- as.factor(toupper(as.character(storm$EVTYPE)))

Next, I check the variables associated with economic damage. There are four variables that relaate to economic damage in the data set: two refer to property damage and two refer to crop damage. In both cases, one variable gives a value, and the second variable gives the exponent (in powers of 10) for that value. A quick glance at the data reveals that the exponent variable data is very messy:

summary(storm$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
summary(storm$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

The levels of these variables are all over the place. According to the documentation for the data set: “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.” On this basis, I correct character cases, then convert any value that is not “”, K“,”M“,”B" to NA (I assume that “” means that the value of the exponent it 0). Then I create two new variables that have the actual property and crop damage values (to three signficant figures).

# Convert all values to upper case
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)

# Replace values not equal to "K", "M" or "B" with NA
valid <- c("", "K", "M", "B")
storm$PROPDMGEXP[!(storm$PROPDMGEXP %in% valid)]  <- NA
storm$CROPDMGEXP[!(storm$CROPDMGEXP %in% valid)]  <- NA

# Convert to character
storm$PROPDMGEXP <- as.character(storm$PROPDMGEXP)
storm$CROPDMGEXP <- as.character(storm$CROPDMGEXP)

# Replace alphabetical character with numeric powers of 10
for(k in 1:length(valid)){
    storm$PROPDMGEXP[storm$PROPDMGEXP == valid[k]] <- 10^(3*(k-1))
    storm$CROPDMGEXP[storm$CROPDMGEXP == valid[k]] <- 10^(3*(k-1))
}

# Convert to numeric
storm$PROPDMGEXP <- as.numeric(storm$PROPDMGEXP)
storm$CROPDMGEXP <- as.numeric(storm$CROPDMGEXP)

# Create new variables with all property and crop damage
storm$PROPDMG_full <- storm$PROPDMG * storm$CROPDMGEXP
storm$CROPDMG_full <- storm$CROPDMG * storm$CROPDMGEXP

Results

Which event types are most harmful to population health?

Preliminary analysis

First, it is helpful to calculate some simple summary statistics:

total_death <- sum(storm$FATALITIES)
total_injury <- sum(storm$INJURIES)
perc_death <- sum(storm$FATALITIES > 0) * 100 / length(storm$FATALITIES)
perc_injury <- sum(storm$INJURIES > 0) * 100 / length(storm$INJURIES)
summary(storm$INJURIES)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.2     0.0  1700.0
summary(storm$FATALITIES)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0     583

The analysis above shows that the data is highly left-skewed. In particular, the vast majority of events cause neither death nor injury. Only 1.951% of events resulted in injury and 0.7729% resulted in a death. However the distribution has a long tail, as can be seen from the summaries. One event caused 1700 injuries, and another caused 583 deaths. Because of this, I will work with only those events where there was at least one injury or death.

# Filter out events with neither injuries nor fatalities, group by event type
storm_health <- filter(storm, INJURIES > 0 | FATALITIES > 0) %>% group_by(EVTYPE)

I now look at fatalities and injuries separately.

Which event types cause the most deaths?

There are two questions that we could ask: which types of event accounted for the greatest number of total deaths? and which types of event caused the greates number of deaths on average? These aren’t necessarily the same: if one type of event happens often, but causes few fatalities, and another happens rarely, but causes many, the first could cause more total deaths, but the second could cause more on average. This is relevant because it will be important to be prepared for both kinds of event.

# Find which individual events caused most deaths
total_deaths <- storm_health %>%
                     summarise(deaths = sum(FATALITIES)) %>% 
                     arrange(desc(deaths))

head(total_deaths, n = 10)
## Source: local data frame [10 x 2]
## 
##            EVTYPE deaths
## 1         TORNADO   5633
## 2  EXCESSIVE HEAT   1903
## 3     FLASH FLOOD    978
## 4            HEAT    937
## 5       LIGHTNING    816
## 6       TSTM WIND    504
## 7           FLOOD    470
## 8     RIP CURRENT    368
## 9       HIGH WIND    248
## 10      AVALANCHE    224

From this we can see that tornados account for by far the most deaths. They account for 5633 out of a total of 1.5145 × 104 deaths, or 37.19%. Excessive heat is another major cause of death, accounting for 1903 deaths, or around 12.57% of the total.

# Calculate mean deaths for each event type
avg_deaths <- storm_health %>%
                   summarise(mean_deaths = mean(FATALITIES)) %>% 
                   arrange(desc(mean_deaths))

head(avg_deaths, n = 10)
## Source: local data frame [10 x 2]
## 
##                        EVTYPE mean_deaths
## 1   UNSEASONABLY WARM AND DRY      29.000
## 2  TORNADOES, TSTM WIND, HAIL      25.000
## 3       RECORD/EXCESSIVE HEAT      17.000
## 4                     TSUNAMI      16.500
## 5               COLD AND SNOW      14.000
## 6               WINTER STORMS      10.000
## 7       TROPICAL STORM GORDON       8.000
## 8                EXTREME HEAT       6.857
## 9                   HEAT WAVE       5.548
## 10           STORM SURGE/TIDE       5.500

This shows that ‘unseasonably warm and dry’ weather events cause the most deaths on average. Tornados also cause a high average number of deaths per event. Notably, a further three of the top ten events for fatalities are forms of hot weather. This confirms the concerns raised above about the coding of the weather events: are the factor levels meaningfully different?

Which event types cause the most injuries?

As with the numbers of deaths, I also look at the types of event that account for the greatest number of injuries in total and the types of event that cause the most injuries on average.

# Calculate total injuries for each event type
total_injuries <- storm_health %>%
                       summarise(injuries = sum(INJURIES)) %>% 
                       arrange(desc(injuries))

head(total_injuries, n = 10)
## Source: local data frame [10 x 2]
## 
##               EVTYPE injuries
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361

Here again, we see that tornadoes top the list, accounting for 9.1346 × 104 out of a total of 1.4053 × 105 injuries, or 65%. Excessive heat features again, along with ‘tstm wind’, flood and lightning, all of which account for over 5,000 injuries.

# Calculate mean deaths for each event type
avg_injuries <- storm_health %>%
                   summarise(mean_injuries = mean(INJURIES)) %>% 
                   arrange(desc(mean_injuries))

head(avg_injuries, n = 10)
## Source: local data frame [10 x 2]
## 
##                   EVTYPE mean_injuries
## 1             WILD FIRES        150.00
## 2                TSUNAMI         64.50
## 3      HURRICANE/TYPHOON         49.04
## 4  TROPICAL STORM GORDON         43.00
## 5     WINTER WEATHER MIX         34.00
## 6          THUNDERSTORMW         27.00
## 7             WINTRY MIX         25.67
## 8            RECORD HEAT         25.00
## 9              BLACK ICE         24.00
## 10    EXCESSIVE RAINFALL         21.00

In terms of average numbers of injury, wild fires cause more than twice as many injuries on average than the next on the list. Hot weather events also cause a large number of injuries on average in the top ten.

par(mfrow = c(2,2), mar = c(10, 5, 4, 1), las = 2)

barplot(total_deaths$deaths[1:10],
      ylab = "Total Fatalities",
      names.arg = total_deaths$EVTYPE[1:10],
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Total Fatalities by Event Type"
      )

barplot(avg_deaths$mean_deaths[1:10], 
      ylab = "Mean Fatalities per Event",
      names.arg = avg_deaths$EVTYPE[1:10], 
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Average Fatalities by Event Type"
      )

barplot(total_injuries$injuries[1:10], 
      ylab = "Total Injuries per Event",
      names.arg = total_injuries$EVTYPE[1:10], 
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Total Injuries by Event Type"
      )

barplot(avg_injuries$mean_injuries[1:10], 
      ylab = "Mean Injuries per Event",
      names.arg = avg_injuries$EVTYPE[1:10], 
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Average Injuries by Event Type"
      )

plot of chunk unnamed-chunk-11 Caption: The figure above contains four plots showing (in order) the top ten event types for total deaths, average deaths per event, total injuries and average injuries per event. From this we can see that total deaths and injuries fall away more quickly than average deaths and injuries. This suggests that differences in frequency contribute to which events cause the most harm to population health in total.

Which event types have the greatest economic consequences?

To calculate the total economic impact of each event, I sum the crop damage and property damage variables to get total economic damage. Note that where an event has NA for either property or crop damage, the total will also be NA. I thought this more appropriate than, for example assuming NA meant zero, or using any other method of imputing missing values. Then I calculate the total and mean economic damage for each event type.

# Calculate total economic damage (property + crop)
storm$DMG_full <- storm$PROPDMGEXP + storm$CROPDMGEXP

# Group by event type and calculate total and mean economic damage
econ <- group_by(storm, EVTYPE)

total_damage <- econ %>% 
            summarise(total_economic_damage = sum(DMG_full, na.rm = T)) %>%
            arrange(desc(total_economic_damage))

avg_damage <- econ %>%
          summarise(mean_economic_damage = mean(DMG_full, na.rm = T)) %>%
          arrange(desc(mean_economic_damage))

We can see which events cause the most damage in total and on average below.

head(avg_damage, n = 10)
## Source: local data frame [10 x 2]
## 
##                        EVTYPE mean_economic_damage
## 1   HURRICANE OPAL/HIGH WINDS            1.001e+09
## 2  TORNADOES, TSTM WIND, HAIL            1.001e+09
## 3   HEAVY RAIN/SEVERE WEATHER            5.000e+08
## 4              HURRICANE OPAL            2.229e+08
## 5           HURRICANE/TYPHOON            1.485e+08
## 6         SEVERE THUNDERSTORM            7.708e+07
## 7                   HURRICANE            1.790e+07
## 8                      FREEZE            1.334e+07
## 9                 RIVER FLOOD            1.167e+07
## 10                STORM SURGE            7.824e+06
head(total_damage, n = 10)
## Source: local data frame [10 x 2]
## 
##               EVTYPE total_economic_damage
## 1  HURRICANE/TYPHOON             1.307e+10
## 2            TORNADO             7.617e+09
## 3              FLOOD             6.905e+09
## 4            DROUGHT             4.174e+09
## 5          HURRICANE             3.115e+09
## 6        FLASH FLOOD             2.746e+09
## 7               HAIL             2.604e+09
## 8        STORM SURGE             2.042e+09
## 9        RIVER FLOOD             2.019e+09
## 10    HURRICANE OPAL             2.006e+09

From this we can see that tornados cause the most damage in total, and joint first most economic damage on average per event. Hurricane Opal also causes a lot of economic damage, indicating a rare event with a large impact.

par(mfrow = c(1,2), mar = c(10, 5, 4, 1), las = 2)

barplot(total_damage$total_economic_damage[1:10],
      ylab = "Total Economic Damage ($)",
      names.arg = total_damage$EVTYPE[1:10],
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Total Economic Damage by Event Type"
      )

barplot(avg_damage$mean_economic_damage[1:10], 
      ylab = "Mean Economic Damage ($)",
      names.arg = avg_damage$EVTYPE[1:10], 
      cex.names = 0.6,
      cex.axis = 0.6,
      main = "Average Economic Damage by Event Type"
      )

plot of chunk unnamed-chunk-14 Caption: The plot above shows the top ten events for total economic damage and average economic damage per event.