This will be a report on storm data from the National Oceanic and Atmospheric Administration (NOAA), a federal agency in the United States. The data set has nearly a million rows and 37 variables. Each row contains information about a weather event in a specific geographic location. For this analysis we will describe which types of events have the greatest effect on human health and economic activity.
We’ll need to clean our data in order to do some analysis. To begin we’ll read the compressed storm data set into R and convert the variables which define the date of the beginning and end of weather events into Date objects.
library(readr)
library(dplyr)
library(ggplot2)
# Storm data must be in working directory. Unzip and read into R with a readr function which is faster than base read.csv
storm <- read_csv(bzfile('repdata-data-StormData.csv.bz2'))
# convert date vars to Date objects
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")
storm$END_DATE <- as.Date(storm$END_DATE, "%m/%d/%Y")
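As a quick check on the date conversion, here is a minimal sketch that parses a single made-up string. It assumes the raw dates are written as month/day/year followed by a time stamp, which is an assumption about the file rather than something shown above. as.Date reads the part that matches the format and ignores any trailing characters.

```r
# Hypothetical raw value; the actual column contents are an assumption here
raw.date <- "4/18/1950 0:00:00"
parsed <- as.Date(raw.date, "%m/%d/%Y")
class(parsed)   # "Date"
format(parsed)  # "1950-04-18"
```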
We’ll need to measure the extent of damage to crops and property for this analysis. According to the documentation, the PROPDMGEXP variable is a multiplier signifying the scale of the units in the PROPDMG column, and CROPDMGEXP plays the same role for CROPDMG. The codes stand for hundreds, thousands, millions and billions. We’ll use them to calculate new variables for the nominal cash damage to crops and property.
# PROPDMGEXP and CROPDMGEXP are multipliers which signify the correct units for the PROPDMG and CROPDMG variables. Convert codes into correct numeric values
storm$PROPDMGEXP[storm$PROPDMGEXP == 'K'] <- 1000
storm$CROPDMGEXP[storm$CROPDMGEXP == 'K'] <- 1000
storm$PROPDMGEXP[storm$PROPDMGEXP == 'k'] <- 1000
storm$CROPDMGEXP[storm$CROPDMGEXP == 'k'] <- 1000
storm$PROPDMGEXP[storm$PROPDMGEXP == 'M'] <- 1e6
storm$CROPDMGEXP[storm$CROPDMGEXP == 'M'] <- 1e6
storm$PROPDMGEXP[storm$PROPDMGEXP == 'm'] <- 1e6
storm$CROPDMGEXP[storm$CROPDMGEXP == 'm'] <- 1e6
storm$PROPDMGEXP[storm$PROPDMGEXP == 'B'] <- 1e9
storm$CROPDMGEXP[storm$CROPDMGEXP == 'B'] <- 1e9
# odd characters and small numbers will be converted to hundreds
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('-', '?', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', 'h', "H")] <- '100'
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('-', '?', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', 'h', "H")] <- '100'
# create new variable for total cash damage to property and crops
storm$PROPCASH <- storm$PROPDMG * as.numeric(storm$PROPDMGEXP)
storm$CROPCASH <- storm$CROPDMG * as.numeric(storm$CROPDMGEXP)
# Flood in Napa causing over $100 billion in damage is a typo. Online sources verify $115 million is correct
napa <- which(storm$PROPCASH > 99e9)
storm$PROPCASH[napa] <- storm$PROPCASH[napa] / 1000
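The same code-to-multiplier rules can also be sketched as a single lookup table. This is only an illustration on made-up values, not the method used in this report. The names exp.lookup, odd.codes, codes, dmg and cash are hypothetical, and any code missing from the table (including NA) simply yields NA.

```r
# Named vector mapping exponent codes to multipliers (same rules as above)
exp.lookup <- c(K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9)

# Odd characters and small numbers fall back to hundreds
odd.codes <- c('-', '?', '+', as.character(0:8), 'h', 'H')
exp.lookup[odd.codes] <- 100

codes <- c('K', 'm', 'B', '5', '?', NA)   # made-up sample codes
dmg   <- c(25, 2.5, 1, 10, 3, 0)          # made-up damage figures

# Look up each multiplier by name; unknown or missing codes give NA
cash <- dmg * unname(exp.lookup[codes])
cash
```

A lookup table keeps all the rules in one place, so adding or correcting a code means editing one line instead of two assignments.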
Finally, the variable ‘EVTYPE’ attempts to categorize the type of event which occurred. It is very messy. We will use pattern matching and replacement to standardize and unify the event types to some degree.
# EVTYPE is very raw. Use grepl to standardize events a little
pattern <- c('torn', 'thunder', 'tstm', 'heat', 'snow', 'wind', 'sleet', 'flood', 'fld', 'hail', 'hurr', 'cold', 'fire', 'rip')
replace.with <- c('TORNADO', 'THUNDERSTORM', 'THUNDERSTORM', 'HEAT WAVE', 'SNOW', 'HIGH WIND', 'SLEET', 'FLOOD', 'FLOOD', 'HAIL', 'HURRICANE', 'COLD', 'WILD FIRE', 'RIP CURRENT')
for (i in seq_along(pattern)) {
  storm$EVTYPE[grepl(pattern[i], storm$EVTYPE, ignore.case = TRUE)] <- replace.with[i]
}
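To see what this loop does, here is a small sketch on a handful of illustrative labels. The events vector is made up for the example and uses a shortened pattern list. Each pattern is applied in turn, so the order matters: because 'tstm' comes before 'wind', a label containing both ends up as THUNDERSTORM rather than HIGH WIND.

```r
# Illustrative labels only; not a sample drawn from the data set
events <- c('TSTM WIND', 'Flash Flood', 'URBAN/SML STREAM FLD',
            'Wild/Forest Fire', 'HIGH WINDS', 'DROUGHT')

# Shortened versions of the pattern and replacement vectors
pattern <- c('tstm', 'wind', 'flood', 'fld', 'fire')
replace.with <- c('THUNDERSTORM', 'HIGH WIND', 'FLOOD', 'FLOOD', 'WILD FIRE')

# Overwrite every label that matches the current pattern
for (i in seq_along(pattern)) {
  events[grepl(pattern[i], events, ignore.case = TRUE)] <- replace.with[i]
}
events
# "THUNDERSTORM" "FLOOD" "FLOOD" "WILD FIRE" "HIGH WIND" "DROUGHT"
```

Labels that match no pattern, such as DROUGHT here, are left unchanged.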
To find which types of events are most harmful to human health, we’ll create two tidy data sets which summarise the events that caused a fatality or an injury. Then we can plot total deaths and injuries for each event type.
# subset with only rows where someone died
deaths.sum <- filter(storm, FATALITIES > 0) %>%
group_by(EVTYPE) %>% # group by event type
# calculate sum and mean for each event type
summarise(sum(FATALITIES), mean(FATALITIES), n())
names(deaths.sum)[2:4] <- c('sum', 'mean', 'count')
# top seven causes of death by weather event
most.deaths <- head(deaths.sum[order(deaths.sum$sum, decreasing = TRUE),], 7)
#summarise events involving injuries
injury.sum <- filter(storm, INJURIES > 0) %>%
group_by( EVTYPE) %>%
summarise(sum(INJURIES), mean(INJURIES), n())
names(injury.sum)[2:4] <- c('sum', 'mean', 'count')
most.injuries <- head(injury.sum[order(injury.sum$sum, decreasing = TRUE),], 7)
# plot weather events with most fatalities and injuries
par(mfrow=c(2,1))
par(las=1)
par(mar=c(3,6.8,3,2) + 0.1)
barplot(sort(most.deaths$sum), names.arg=rev(most.deaths$EVTYPE), cex.names=.8, horiz=TRUE,
main="Total Fatalities for Major Weather Events", col='darkblue')
barplot(sort(most.injuries$sum), names.arg=rev(most.injuries$EVTYPE), cex.names=.8, horiz=TRUE,
main="Total Injuries for Major Weather Events", col='darkblue')
Tornadoes have caused the most fatalities and injuries. The other major events which have caused a large number of casualties are thunderstorms, floods, and heat waves. Thunderstorms and heat waves have very similar totals for injuries, but more than four times as many people have died from heat waves as from thunderstorms. The mean injury and death counts per heat wave event are also very high.
injury.sum[injury.sum$EVTYPE=='HEAT WAVE' |injury.sum$EVTYPE == 'THUNDERSTORM',]
## Source: local data frame [2 x 4]
##
## EVTYPE sum mean count
## (chr) (dbl) (dbl) (int)
## 1 HEAT WAVE 9224 40.456140 228
## 2 THUNDERSTORM 9545 2.608636 3659
deaths.sum[deaths.sum$EVTYPE == 'HEAT WAVE' | deaths.sum$EVTYPE == 'THUNDERSTORM',]
## Source: local data frame [2 x 4]
##
## EVTYPE sum mean count
## (chr) (dbl) (dbl) (int)
## 1 HEAT WAVE 3138 3.947170 795
## 2 THUNDERSTORM 731 1.238983 590
Among events that caused casualties, each heat wave has resulted in over 40 injuries and nearly 4 deaths on average. In comparison, thunderstorms average 2.6 injuries and 1.2 fatalities. These means are derived only from events which resulted in at least one injury or death, respectively; thunderstorms and heat waves that harmed no one are not taken into account.
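The gap between a mean over harmful events only and a mean over all events can be seen in a tiny sketch. The numbers are made up for illustration and are not taken from the storm data.

```r
# Made-up fatality counts for six events of one type
fatalities <- c(0, 0, 0, 4, 0, 2)

mean(fatalities[fatalities > 0])  # 3: mean over events with at least one death
mean(fatalities)                  # 1: mean over all six events
```

The more often an event type passes without casualties, the further the second figure falls below the first.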
Let’s look at the mean of fatalities and injuries when taking every single event into account.
# summarise data by event type to find mean fatalities and injuries for each unique event
storm.sum <- group_by(storm, EVTYPE) %>%
summarize(round(mean(FATALITIES), 3), round(mean(INJURIES), 3), n()) %>%
filter(`n()` > 61) # ignore events which don't occur often
names(storm.sum)[2:4] <- c("MEAN_FATALITIES", "MEAN_INJURIES", "COUNT")
head(storm.sum[order(storm.sum$MEAN_FATALITIES, decreasing = TRUE),-3], 5)
## Source: local data frame [5 x 3]
##
## EVTYPE MEAN_FATALITIES COUNT
## (chr) (dbl) (int)
## 1 HEAT WAVE 1.185 2648
## 2 RIP CURRENT 0.743 777
## 3 AVALANCHE 0.580 386
## 4 HURRICANE 0.463 287
## 5 COLD 0.239 899
head(storm.sum[order(storm.sum$MEAN_INJURIES, decreasing = TRUE), -2], 5)
## Source: local data frame [5 x 3]
##
## EVTYPE MEAN_INJURIES COUNT
## (chr) (dbl) (int)
## 1 HURRICANE 4.627 287
## 2 HEAT WAVE 3.483 2648
## 3 TORNADO 1.506 60701
## 4 FOG 1.364 538
## 5 DUST STORM 1.030 427
Heat waves remain very deadly, but rip currents, avalanches and hurricanes now join them near the top of the list. The event types with the highest mean injuries now include fog and dust storms.
We’ve pre-processed our data to create vectors with nominal cash damage to crops and property for each event. We’ll summarise this crop and property damage for each unique type of weather event.
# summarise events involving property damage with cash values above 0
property.sum <- filter(storm, PROPCASH > 0) %>%
group_by(EVTYPE) %>%
summarise(sum(PROPCASH), mean(PROPCASH), n())
# rename variables so they don't look like functions
names(property.sum)[2:4] <- c('sum', 'mean', 'count')
majorprop <- head(property.sum[order(property.sum$sum, decreasing = TRUE),1:2], 8)
# summarise events involving crop damage with cash values above 0
crop.sum <- filter(storm, CROPCASH > 0) %>%
group_by(EVTYPE) %>%
summarise(sum(CROPCASH), mean(CROPCASH), n())
# rename variables
names(crop.sum)[2:4] <- c('sum_crop', 'mean_crop', 'count_crop')
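# order by the renamed sum_crop column (crop.sum$sum relies on partial matching, which tibbles do not support)
majorcrop <- head(crop.sum[order(crop.sum$sum_crop, decreasing = TRUE),1:2], 8)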
par(mfrow=c(2,1))
par(las=1)
par(mar=c(4.5,6.8,2,3.5) + 0.1)
barplot(sort(majorprop$sum)/1e9, names.arg=rev(majorprop$EVTYPE), cex.names=.8, horiz=TRUE,
main="Total Property Damage for Major Weather Events", col='darkgreen')
barplot(sort(majorcrop$sum_crop)/1e9, names.arg=rev(majorcrop$EVTYPE), cex.names=.8, horiz=TRUE,
main="Total Crop Damage for Major Weather Events", col='darkgreen', xlim=c(0,80),xlab='Total Damage (Billions of USD)' )
From this two-paneled figure we can see that damage to property is much costlier than crop damage. The types of events have some overlap, but crops are clearly more affected by cold weather and lack of water. This makes sense, as most property is not living and is therefore not affected by thirst or extreme cold.
Let’s now look at the expected cash damage for each weather event and sort the results by size.
# summarise data by event type to find mean crop and property damage for each unique event
storm.sum2 <- group_by(storm, EVTYPE) %>%
summarize(mean(CROPCASH, na.rm=TRUE), mean(PROPCASH, na.rm=TRUE), n()) %>%
filter(`n()` > 30) # ignore events which don't occur often
names(storm.sum2)[2:4] <- c("MEAN_CROP_DMG", "MEAN_PROP_DMG", "COUNT")
head(storm.sum2[order(storm.sum2$MEAN_CROP_DMG, decreasing = TRUE),-3], 5)
## Source: local data frame [5 x 3]
##
## EVTYPE MEAN_CROP_DMG COUNT
## (chr) (dbl) (int)
## 1 HURRICANE 49597232 287
## 2 FREEZE 29748333 74
## 3 FROST 11000000 53
## 4 COLD 9993727 899
## 5 DROUGHT 9192478 2488
head(storm.sum2[order(storm.sum2$MEAN_PROP_DMG, decreasing = TRUE),-2], 5)
## Source: local data frame [5 x 3]
##
## EVTYPE MEAN_PROP_DMG COUNT
## (chr) (dbl) (int)
## 1 HURRICANE 395589626 287
## 2 STORM SURGE 247563063 261
## 3 STORM SURGE/TIDE 32008193 148
## 4 TROPICAL STORM 13635205 690
## 5 WILD FIRE 3693149 4240
Hurricanes are clearly the most costly event in both total and average damage values. The significance of excessive cold and dryness to crop damage is confirmed in the first table as well.
In terms of total deaths and injuries caused by a single type of weather event in the period covered [1950-2011], tornadoes are clearly the most dangerous. They have caused nearly 10 times as many total injuries as the next most injurious event, thunderstorms. However, when viewing the danger through expected outcomes per event, we find that heat waves are more dangerous in a sense. The average number of deaths per heat wave is about 1.2, while the average per tornado is much smaller at 0.09. So expected deaths per heat wave are more than 10 times larger than per tornado, if we assume this data set is representative of the population of all weather events. With regard to crop and property damage, it is clear that hurricanes are the most costly in economic terms. Any policy preparation should take both total and expected deaths and injuries into account.
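The ratios quoted in this summary can be rechecked with a few lines of arithmetic. This sketch uses only figures printed in the tables and text above. The variable names are hypothetical, and the results are approximate because the tabulated means are rounded.

```r
# Values copied from the tables and text above
heat.mean.deaths    <- 1.185   # mean fatalities per heat wave (all events)
tornado.mean.deaths <- 0.09    # mean fatalities per tornado, as quoted in the text
death.ratio <- heat.mean.deaths / tornado.mean.deaths
death.ratio                    # roughly 13

# Total tornado injuries recovered as mean injuries times event count
tornado.injuries <- 1.506 * 60701
storm.injuries   <- 9545       # total thunderstorm injuries
injury.ratio <- tornado.injuries / storm.injuries
injury.ratio                   # roughly 9.6
```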