Synopsis

This report examines storm data from the National Oceanic and Atmospheric Administration (NOAA), a federal agency in the United States. The data set has nearly a million rows and 37 variables. Each row describes a weather event in a specific geographic location. The analysis identifies which types of events have the greatest effect on human health and economic activity.

Data Processing

We’ll need to clean our data in order to do some analysis. To begin we’ll read the compressed storm data set into R and convert the variables which define the date of the beginning and end of weather events into Date objects.

library(readr)
library(dplyr)
library(ggplot2)
# Storm data must be in working directory. Unzip and read into R with a readr function which is faster than base read.csv
storm <- read_csv(bzfile('repdata-data-StormData.csv.bz2'))

# convert date vars to Date objects
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")
storm$END_DATE <- as.Date(storm$END_DATE, "%m/%d/%Y")
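
As a small check of the date conversion, the sketch below assumes the raw date strings look like `4/18/1950 0:00:00` (an assumption about the file, not something stated in this report). With the format `"%m/%d/%Y"`, `as.Date()` parses the leading date and ignores any trailing characters.

```r
# Hypothetical sample strings; the real BGN_DATE values may differ in form.
raw <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")
parsed <- as.Date(raw, format = "%m/%d/%Y")
format(parsed)  # dates rendered as ISO text
class(parsed)   # "Date"
```

If a string does not match the format at all, `as.Date()` returns `NA` for it, so counting `NA` values after conversion is a cheap sanity check.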

We’ll need to measure the extent of damage to crops and property for this analysis. According to the documentation, the PROPDMGEXP variable is a multiplier signifying the scale of the units in the PROPDMG column, and CROPDMGEXP plays the same role for CROPDMG. The codes stand for hundreds, thousands, millions and billions. We’ll use them to calculate new variables for the nominal cash damage to crops and property.

# PROPDMGEXP and CROPDMGEXP are multipliers which give the units for the PROPDMG and CROPDMG variables. Convert codes into numeric values
storm$PROPDMGEXP[storm$PROPDMGEXP == 'K'] <- 1000
storm$CROPDMGEXP[storm$CROPDMGEXP == 'K'] <- 1000
storm$PROPDMGEXP[storm$PROPDMGEXP == 'k'] <- 1000
storm$CROPDMGEXP[storm$CROPDMGEXP == 'k'] <- 1000
storm$PROPDMGEXP[storm$PROPDMGEXP == 'M'] <- 1e6
storm$CROPDMGEXP[storm$CROPDMGEXP == 'M'] <- 1e6
storm$PROPDMGEXP[storm$PROPDMGEXP == 'm'] <- 1e6
storm$CROPDMGEXP[storm$CROPDMGEXP == 'm'] <- 1e6
storm$PROPDMGEXP[storm$PROPDMGEXP == 'B'] <- 1e9
storm$CROPDMGEXP[storm$CROPDMGEXP == 'B'] <- 1e9
# odd characters and small numbers will be converted to hundreds
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('-', '?', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', 'h', "H")] <- '100'
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('-', '?', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', 'h', "H")] <- '100'
# create new variable for total cash damage to property and crops
storm$PROPCASH <- storm$PROPDMG * as.numeric(storm$PROPDMGEXP)
storm$CROPCASH <- storm$CROPDMG * as.numeric(storm$CROPDMGEXP)
# Flood in Napa causing over $100 billion in damage is a typo. Online sources verify $115 million is correct
napa <- which(storm$PROPCASH > 99e9)
storm$PROPCASH[napa] <- storm$PROPCASH[napa] / 1000
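
The recoding can also be written as a single lookup. This is only a sketch on toy data, using a hypothetical helper `exp_to_multiplier()` that is not part of the analysis. It follows the same convention as the code in this section: K, M and B in either case map to thousands, millions and billions, any other recorded code is treated as hundreds, and missing codes stay `NA`.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  # Recorded codes other than K, M or B are treated as hundreds,
  # mirroring the convention used in this report; missing codes stay NA.
  mult[is.na(mult) & !is.na(code)] <- 100
  mult
}

# Toy data, not the NOAA file.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 10),
                  PROPDMGEXP = c("K", "m", "B", "?", NA),
                  stringsAsFactors = FALSE)
toy$PROPCASH <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy$PROPCASH
```

A lookup keeps the exponent column untouched and leaves a single place to change if the treatment of odd codes is revisited.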

Finally, the variable ‘EVTYPE’ attempts to categorize the type of event which occurred. It is very messy. We will use pattern matching and replacement to standardize and unify the types of events to some degree.

# EVTYPE is very raw. Use grepl to standardize events a little
pattern <- c('torn', 'thunder', 'tstm', 'heat', 'snow', 'wind', 'sleet', 'flood', 'fld', 'hail', 'hurr', 'cold', 'fire', 'rip')
replace.with <- c('TORNADO', 'THUNDERSTORM', 'THUNDERSTORM', 'HEAT WAVE', 'SNOW', 'HIGH WIND', 'SLEET', 'FLOOD', 'FLOOD', 'HAIL', 'HURRICANE', 'COLD', 'WILD FIRE', 'RIP CURRENT')
for (i in seq_along(pattern)) {
      storm$EVTYPE[grepl(pattern[i], storm$EVTYPE, ignore.case=TRUE)] <- replace.with[i]
}
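
To see how the loop behaves, here is a sketch on a few made-up labels (illustrative only, not taken from the data). Replacements are applied in order, so a label such as "TSTM WIND" is rewritten to THUNDERSTORM by the 'tstm' pattern and no longer matches 'wind' when that pattern is tried.

```r
# Illustrative labels only; the real EVTYPE values are far more varied.
ev <- c("TSTM WIND", "Flash Flood", "URBAN/SML STREAM FLD",
        "Extreme Cold", "HIGH WINDS", "LIGHTNING")
pattern <- c("tstm", "wind", "flood", "fld", "cold")
replace.with <- c("THUNDERSTORM", "HIGH WIND", "FLOOD", "FLOOD", "COLD")
for (i in seq_along(pattern)) {
  # overwrite every label containing the pattern, ignoring case
  ev[grepl(pattern[i], ev, ignore.case = TRUE)] <- replace.with[i]
}
ev
```

Labels that match none of the patterns, such as "LIGHTNING" here, are left unchanged.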

Across the United States, which types of events are most harmful with respect to population health?

To answer this question we’ll create two tidy data sets which summarise the events which caused a fatality or an injury. Then we can plot total deaths and injuries for each event type.

# subset with only rows where someone died
deaths.sum <- filter(storm, FATALITIES > 0) %>%
      group_by(EVTYPE) %>% # group by event type
      # calculate sum and mean for each event type
      summarise(sum(FATALITIES), mean(FATALITIES), n()) 
names(deaths.sum)[2:4] <- c('sum', 'mean', 'count')
# top seven causes of death by weather event
most.deaths <- head(deaths.sum[order(deaths.sum$sum, decreasing = TRUE),], 7)

# summarise events involving injuries
injury.sum <- filter(storm, INJURIES > 0) %>%
      group_by(EVTYPE) %>%
      summarise(sum(INJURIES), mean(INJURIES), n())
names(injury.sum)[2:4] <- c('sum', 'mean', 'count')
most.injuries <- head(injury.sum[order(injury.sum$sum, decreasing = TRUE),], 7)

# plot weather events with most fatalities and injuries
par(mfrow=c(2,1))
par(las=1)
par(mar=c(3,6.8,3,2) + 0.1)
barplot(sort(most.deaths$sum), names.arg=rev(most.deaths$EVTYPE), cex.names=.8, horiz=TRUE,
        main="Total Fatalities for Major Weather Events", col='darkblue')
barplot(sort(most.injuries$sum), names.arg=rev(most.injuries$EVTYPE), cex.names=.8, horiz=TRUE,
        main="Total Injuries for Major Weather Events", col='darkblue')
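
As an aside, the summary columns can be named inside `summarise()` itself, which removes the separate renaming step. A minimal sketch on toy data (the data frame `toy` and its values are invented for illustration):

```r
library(dplyr)

# Toy data, not the NOAA file.
toy <- data.frame(EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "FLOOD", "HAIL"),
                  FATALITIES = c(3, 1, 2, 0, 0),
                  stringsAsFactors = FALSE)

toy.sum <- toy %>%
  filter(FATALITIES > 0) %>%   # keep only events with at least one death
  group_by(EVTYPE) %>%
  # naming the summaries here avoids renaming columns afterwards
  summarise(sum = sum(FATALITIES), mean = mean(FATALITIES), count = n()) %>%
  arrange(desc(sum))           # largest total first
toy.sum
```

With named columns, `arrange(desc(sum))` followed by `head()` gives the same ranking as the `order()` indexing used in this report.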

Tornadoes have caused the most fatalities and injuries. The other major events which have caused a large number of casualties are thunderstorms, floods, and heat waves. Thunderstorms and heat waves have very similar totals for injuries, but more than four times as many people have died from heat waves as from thunderstorms. The expected (mean) injury and death counts from heat waves are also very high.

injury.sum[injury.sum$EVTYPE=='HEAT WAVE' |injury.sum$EVTYPE == 'THUNDERSTORM',]
## Source: local data frame [2 x 4]
## 
##         EVTYPE   sum      mean count
##          (chr) (dbl)     (dbl) (int)
## 1    HEAT WAVE  9224 40.456140   228
## 2 THUNDERSTORM  9545  2.608636  3659
deaths.sum[deaths.sum$EVTYPE == 'HEAT WAVE' | deaths.sum$EVTYPE == 'THUNDERSTORM',]
## Source: local data frame [2 x 4]
## 
##         EVTYPE   sum     mean count
##          (chr) (dbl)    (dbl) (int)
## 1    HEAT WAVE  3138 3.947170   795
## 2 THUNDERSTORM   731 1.238983   590

Among events that caused at least one casualty, each heat wave has resulted on average in over 40 injuries and nearly 4 deaths. In comparison, thunderstorms average about 2.6 injuries and 1.2 fatalities. Note that these means are derived only from events which resulted in death or injury; not every thunderstorm and heat wave was taken into account.
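
The difference between the two kinds of mean is easy to see on a toy vector (the numbers below are invented for illustration):

```r
# Toy data: five hypothetical events of one type, two of which caused injuries.
injuries <- c(0, 0, 0, 12, 8)

cond.mean <- mean(injuries[injuries > 0])  # events with at least one injury: 10
all.mean <- mean(injuries)                 # every recorded event: 4
c(conditional = cond.mean, overall = all.mean)
```

The conditional mean answers "how bad is an event once it hurts someone", while the overall mean answers "how bad is a typical event of this type".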

Let’s look at the mean of fatalities and injuries when taking every single event into account.

# summarise data by event type to find mean fatalities and injuries for each unique event
storm.sum <- group_by(storm, EVTYPE) %>%
      summarize(round(mean(FATALITIES), 3), round(mean(INJURIES), 3), n()) %>%
      filter(`n()` > 61) # ignore events which don't occur often
names(storm.sum)[2:4] <- c("MEAN_FATALITIES", "MEAN_INJURIES", "COUNT")
head(storm.sum[order(storm.sum$MEAN_FATALITIES, decreasing = TRUE),-3], 5)
## Source: local data frame [5 x 3]
## 
##        EVTYPE MEAN_FATALITIES COUNT
##         (chr)           (dbl) (int)
## 1   HEAT WAVE           1.185  2648
## 2 RIP CURRENT           0.743   777
## 3   AVALANCHE           0.580   386
## 4   HURRICANE           0.463   287
## 5        COLD           0.239   899
head(storm.sum[order(storm.sum$MEAN_INJURIES, decreasing = TRUE), -2], 5)
## Source: local data frame [5 x 3]
## 
##       EVTYPE MEAN_INJURIES COUNT
##        (chr)         (dbl) (int)
## 1  HURRICANE         4.627   287
## 2  HEAT WAVE         3.483  2648
## 3    TORNADO         1.506 60701
## 4        FOG         1.364   538
## 5 DUST STORM         1.030   427

Heat waves remain very deadly, but rip currents, avalanches and hurricanes now appear near the top of the list. The events with the highest mean injuries now include fog and dust storms.

Across the United States, which types of events have the greatest economic consequences?

We’ve pre-processed our data to create vectors with nominal cash damage to crops and property for each event. We’ll summarise this crop and property damage for each unique type of weather event.

# summarise events involving property damage with cash values above 0
property.sum <- filter(storm, PROPCASH > 0) %>%
      group_by(EVTYPE) %>%
      summarise(sum(PROPCASH), mean(PROPCASH), n())
# rename variables so they don't look like functions
names(property.sum)[2:4] <- c('sum', 'mean', 'count')
majorprop <- head(property.sum[order(property.sum$sum, decreasing = TRUE),1:2], 8)

# summarise events involving crop damage with cash values above 0
crop.sum <- filter(storm, CROPCASH > 0) %>%
      group_by(EVTYPE) %>%
      summarise(sum(CROPCASH), mean(CROPCASH), n())
# rename variables
names(crop.sum)[2:4] <- c('sum_crop', 'mean_crop', 'count_crop')
majorcrop <- head(crop.sum[order(crop.sum$sum_crop, decreasing = TRUE),1:2], 8)

par(mfrow=c(2,1))
par(las=1)
par(mar=c(4.5,6.8,2,3.5) + 0.1)
barplot(sort(majorprop$sum)/1e9, names.arg=rev(majorprop$EVTYPE), cex.names=.8, horiz=TRUE,
        main="Total Property Damage for Major Weather Events", col='darkgreen')
barplot(sort(majorcrop$sum_crop)/1e9, names.arg=rev(majorcrop$EVTYPE), cex.names=.8, horiz=TRUE,
        main="Total Crop Damage for Major Weather Events", col='darkgreen', xlim=c(0,80),xlab='Total Damage (Billions of USD)' )

From this two-paneled figure we can see that damage to property is much costlier than crop damage. The types of events have some overlap, but crops are clearly more affected by cold weather and lack of water. This makes sense, as most property is not living and is therefore not affected by thirst or extreme cold.

Let’s now look at the expected cash damage for each weather event and sort the results by size.

# summarise data by event type to find mean crop and property damage for each unique event
storm.sum2 <- group_by(storm, EVTYPE) %>%
      summarize(mean(CROPCASH, na.rm=TRUE), mean(PROPCASH, na.rm=TRUE), n()) %>%
      filter(`n()` > 30) # ignore events which don't occur often
names(storm.sum2)[2:4] <- c("MEAN_CROP_DMG", "MEAN_PROP_DMG", "COUNT")
head(storm.sum2[order(storm.sum2$MEAN_CROP_DMG, decreasing = TRUE),-3], 5)
## Source: local data frame [5 x 3]
## 
##      EVTYPE MEAN_CROP_DMG COUNT
##       (chr)         (dbl) (int)
## 1 HURRICANE      49597232   287
## 2    FREEZE      29748333    74
## 3     FROST      11000000    53
## 4      COLD       9993727   899
## 5   DROUGHT       9192478  2488
head(storm.sum2[order(storm.sum2$MEAN_PROP_DMG, decreasing = TRUE),-2], 5)
## Source: local data frame [5 x 3]
## 
##             EVTYPE MEAN_PROP_DMG COUNT
##              (chr)         (dbl) (int)
## 1        HURRICANE     395589626   287
## 2      STORM SURGE     247563063   261
## 3 STORM SURGE/TIDE      32008193   148
## 4   TROPICAL STORM      13635205   690
## 5        WILD FIRE       3693149  4240

Hurricanes are clearly the most costly event in both total and average damage values. The significance of excessive cold and dryness to crop damage is confirmed in the first table as well.
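
One detail worth keeping in mind when reading these means: `mean(..., na.rm = TRUE)` drops events whose cash value is `NA`, while `n()` counts every row. The toy sketch below assumes that events without a usable exponent code end up as `NA`; this has not been checked against the file here.

```r
# Toy cash values for four hypothetical events of one type; two are missing.
cash <- c(5e6, NA, 1e6, NA)

mean.cash <- mean(cash, na.rm = TRUE)  # averaged over the 2 non-missing values
n.all <- length(cash)                  # what n() would report: 4
n.used <- sum(!is.na(cash))            # events that entered the mean: 2
c(mean = mean.cash, n_all = n.all, n_used = n.used)
```

Under that assumption, the means describe events with a recorded damage figure, and the COUNT column can be larger than the number of events that contributed to the mean.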

Results

In terms of total deaths and injuries caused by a single type of weather event in the period covered (1950-2011), tornadoes are clearly the most dangerous. They have caused nearly 10 times as many total injuries as the next most injurious event, thunderstorms. However, when danger is measured by the expected outcome of a single event, heat waves are more dangerous. The average number of deaths per heat wave is about 1.2, while the average per tornado is much smaller at 0.09. So expected deaths per heat wave are more than 10 times larger than per tornado, if we assume this data set is representative of the population of all weather events. With regard to crop and property damage, hurricanes are clearly the most costly in economic terms. Any policy preparation should take both total and expected deaths and injuries into account.
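
The heat wave to tornado comparison can be checked with a line of arithmetic. The heat wave figure is the MEAN_FATALITIES value from the table of mean fatalities; the tornado figure is the 0.09 quoted in this section, which does not appear in any table shown.

```r
# Mean fatalities per event, as quoted in this report.
heat.mean <- 1.185     # HEAT WAVE, from the MEAN_FATALITIES table
tornado.mean <- 0.09   # TORNADO, as stated in the Results text
ratio <- heat.mean / tornado.mean
round(ratio, 1)
```

The ratio depends heavily on how events are counted, so it is best read as an order of magnitude.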