Storms and other weather events in the US: Public health and economic impacts

Rafael Marino

February 22nd, 2015.

Synopsis

This project analyzes the National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database in order to explore 2 main questions: 1) which weather events are most harmful to population health and 2) which weather events have the greatest economic consequences. To address the first question, total fatalities and injuries for each event type were added together, from 1950 to 2011. Tornados were the most harmful event, with 96,979 combined fatalities and injuries. A “Deadly Ratio”, defined as fatalities divided by injuries, was also calculated for the 15 most harmful events; the deadliest of these was Flash Flood with 55%. To answer the second question, the cost of property damage and crop damage was summed into a total damage cost. Floods were the event with the greatest economic consequences, totaling 150 Bn USD from 1950 to 2011.

Data Processing

The original data file was compressed using the bzip2 algorithm, so it can’t be unzipped using the regular unzip() function. The following code first checks whether the .bz2 file is in the current directory; if it is not, the file is downloaded. The data is then read into R directly, without unzipping it, by passing a bzfile() connection to read.csv(). Finally, 2 R packages are loaded: ggplot2 and dplyr.

#downloads the data and reads it into R
if (!"repdata-data-StormData.csv.bz2" %in% dir()) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url, "repdata-data-StormData.csv.bz2", mode = "wb") #binary mode keeps the archive intact on Windows
}

if (!"data" %in% ls()) {
  data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
}

library(ggplot2)
library(dplyr)
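
As a minimal sketch of the reading step, the round trip below writes a small bzip2-compressed CSV and reads it back through a bzfile() connection. The data frame `toy` and the temporary file are hypothetical stand-ins for the real storm file; only base R functions are used.

```r
# Hypothetical toy data; the real file is repdata-data-StormData.csv.bz2
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write a bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection, with no separate decompression step
roundTrip <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(roundTrip)

# Remove the temporary file
unlink(tmp)
```

The connection passed to read.csv() is opened and closed by read.csv() itself, so no explicit close() is needed on the reading side.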

Data Analysis

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question, fatalities and injuries are summed by event type over the whole data set, that is, from 1950 to 2011. For instance, the event type “TORNADO” has either killed or injured 96,979 people from 1950 to 2011, according to this dataset.

The code computes the new variable fatalities+injuries for every event type, and then selects the 15 events with the highest value. This is shown in Figure 1. A second approach, presented after the figure, differentiates between injuries and fatalities.

#Group by event type, then sum fatalities+injuries. Sort decreasingly and take top 15.
eventTypes <- data %>% group_by(EVTYPE) %>% summarise(fatalInjuries= sum(FATALITIES,INJURIES),
                                                      fatalities=sum(FATALITIES), injuries=sum(INJURIES))
orderedEvents <- eventTypes %>% arrange(desc(fatalInjuries))
top15Events <- orderedEvents[1:15,] #Only give me the top 15 events by (fatalities+injuries)
top15Events$EVTYPE <- factor(top15Events$EVTYPE, levels=top15Events$EVTYPE) #Order the levels decreasingly
head(top15Events)

## Source: local data frame [6 x 4]
## 
##           EVTYPE fatalInjuries fatalities injuries
## 1        TORNADO         96979       5633    91346
## 2 EXCESSIVE HEAT          8428       1903     6525
## 3      TSTM WIND          7461        504     6957
## 4          FLOOD          7259        470     6789
## 5      LIGHTNING          6046        816     5230
## 6           HEAT          3037        937     2100
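
The grouping step can be cross-checked with base R alone. The sketch below is not the code used for this report; it applies aggregate() to a hypothetical toy data frame (the rows are invented for illustration) to show the same sum-by-event-type logic.

```r
# Hypothetical toy rows, not taken from the NOAA file
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 3, 4),
  INJURIES   = c(10, 0, 5, 6),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries separately for each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined harm, then sort from most to least harmful
totals$fatalInjuries <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(-totals$fatalInjuries), ]
print(totals)
```

In this toy data TORNADO comes first with 20 combined fatalities and injuries, followed by HEAT and FLOOD.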

Figure 1.

#Plot1. Top 15 most harmful events (fatalities+injuries)
plot1 <- ggplot(top15Events, aes(EVTYPE, fatalInjuries))
  plot1 + geom_bar(stat="identity", colour="black") +ylab("Fatalities + Injuries")+ 
  xlab("") + ggtitle("Top 15 most harmful weather events to population health
  in USA (1950-2011)") + 
  theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))
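
The factor() call on EVTYPE before the plot matters because ggplot2 orders a discrete axis by factor levels, and default levels are alphabetical. A small sketch with hypothetical event names shows the difference:

```r
# Hypothetical event names, already sorted by some total in decreasing order
events <- c("TORNADO", "EXCESSIVE HEAT", "FLOOD")

# Default factor levels are sorted alphabetically
alphabetical <- factor(events)
print(levels(alphabetical))

# Passing levels explicitly keeps the ranked order for the plot axis
ranked <- factor(events, levels = events)
print(levels(ranked))
```

Without the explicit levels, the bars would appear in alphabetical order rather than from most to least harmful.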

Another dimension computed to analyze this question is the ratio of fatalities to injuries, which expresses how many people a weather event killed as a percentage of how many people it injured. The higher this percentage, the deadlier the event.

#Creates a new variable, fatalities/Injuries, by each Event Type. 
top15Events$deadly <- top15Events$fatalities/top15Events$injuries
summary(top15Events$deadly) #the max is 55% deadly rate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01102 0.06545 0.08938 0.16160 0.18710 0.55040

The weather event with the highest deadly rate is Flash Flood.

top15Events[top15Events$deadly==max(top15Events$deadly),]

## Source: local data frame [1 x 5]
## 
##        EVTYPE fatalInjuries fatalities injuries    deadly
## 1 FLASH FLOOD          2755        978     1777 0.5503658
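
The ratio can be checked by hand from the figures in the tables above. The sketch below recomputes it for three event types and, purely as an assumption for comparison, also shows fatalities as a share of all casualties; that alternative is not used anywhere in this report.

```r
# Figures copied from the tables above
fatalities <- c("FLASH FLOOD" = 978, "HEAT" = 937, "TORNADO" = 5633)
injuries   <- c("FLASH FLOOD" = 1777, "HEAT" = 2100, "TORNADO" = 91346)

# Ratio used in this report: fatalities divided by injuries
deadly <- fatalities / injuries

# Alternative (an assumption, not used in the report):
# fatalities as a share of fatalities plus injuries
fatalShare <- fatalities / (fatalities + injuries)

print(round(cbind(deadly, fatalShare), 3))
```

Because the report's ratio divides by injuries only, it can exceed 100% for an event that kills more people than it injures, whereas the share of all casualties always stays between 0 and 1.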

Question 2: Across the United States, which types of events have the greatest economic consequences?

Two features are used to answer this question: Property Damage (“PROPDMG”) and Crop Damage (“CROPDMG”). They measure the costs incurred by the weather event, in US dollars. There is one problem with these variables:

Problem: These variables by themselves do not express the full USD amount of the cost. Two other, auxiliary features are required in order to compute the full cost. These are “PROPDMGEXP” and “CROPDMGEXP”; they indicate whether the cost was in the order of thousands (“K”), millions (“M”) or billions (“B”). The data also uses “H” for hundreds. However, there is one big issue with these features: they not only have “K”, “M” or “B” entries, but also numeric entries (0-8) and symbols such as “?”, “+”, “-”.

After some research, including NOAA’s website and this thread: https://class.coursera.org/repdata-011/forum/thread?thread_id=39 , the following conclusions were reached:

1. Our data set is an old version ending in 2011 with inconsistencies such as numbers (0-8) and symbols (?, +, -) in the CROPDMGEXP and PROPDMGEXP features.
2. Using the Storm Events Database search engine (http://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=-999%2CALL) it is demonstrated that numbers represent a 10 multiplier, “+” represents a 1 multiplier, “-” , “?” and “blank” a 0 multiplier. (Soesilo Wijono)
3. The current database hosted in the NOAA’s FTP (http://www.ncdc.noaa.gov/stormevents/ftp.jsp) professes to be updated as of 8/24/14, yet presents a result different from that of the search engine: PROPDMG and PROPDMGEXP are concatenated into a new feature. Numbers and symbols are not “treated” into a valid multiplier, but simply pasted together. (Eddie Song)

Conclusion: given the negligible amounts involved, the proportionally small number of numeric and symbolic entries in the PROPDMGEXP and CROPDMGEXP features, and the different interpretations provided by NOAA’s own datasets, these inconsistent observations will be removed.
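
Before dropping rows, it helps to count how many are affected. The sketch below flags inconsistent codes on hypothetical toy vectors standing in for PROPDMGEXP and CROPDMGEXP; the values are invented for illustration.

```r
# Hypothetical exponent codes standing in for PROPDMGEXP and CROPDMGEXP
propExp <- c("K", "M", "5", "", "B", "?")
cropExp <- c("",  "K", "",  "+", "",  "")

# Codes treated as inconsistent in this report
leaveOut <- c(as.character(0:8), "?", "+", "-")

# A row is flagged if either column holds an inconsistent code
flagged <- propExp %in% leaveOut | cropExp %in% leaveOut

print(sum(flagged))             # number of rows that would be dropped
print(mean(flagged))            # share of rows that would be dropped
print(table(propExp[flagged]))  # which property codes appear in flagged rows
```

On the real data the same count is what justifies calling the removed observations a small proportion of the total.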

#Removing observations with numbers and symbols in PROPDMGEXP & CROPDMGEXP 
leaveOut <- c(0:8, "?","+","-")
numbersSymbolsPROP <- data$PROPDMGEXP %in% leaveOut | data$CROPDMGEXP %in% leaveOut
#sum(numbersSymbolsPROP) 341 rows to be deleted
#Logical negation is used instead of -which(), which would drop every row if nothing matched
datav2 <- data[!numbersSymbolsPROP,]

#Converting H to 100, K to 1000, M to 1 000 000 and B to 1 000 000 000 in PROPERTY DAMAGE
datav2$unitPROPDMG[grepl("H|h",datav2$PROPDMGEXP)] <- 100
datav2$unitPROPDMG[grepl("K|k",datav2$PROPDMGEXP)] <- 1000
datav2$unitPROPDMG[grepl("M|m",datav2$PROPDMGEXP)] <- 1000000
datav2$unitPROPDMG[grepl("B|b",datav2$PROPDMGEXP)] <- 1000000000
datav2$unitPROPDMG[is.na(datav2$unitPROPDMG)] <- 0 # Turn NAs to 0 so that I can add up 
#Converting H to 100, K to 1000, M to 1 000 000 and B to 1 000 000 000 in CROP DAMAGE
datav2$unitCROPDMG[grepl("H|h",datav2$CROPDMGEXP)] <- 100
datav2$unitCROPDMG[grepl("K|k",datav2$CROPDMGEXP)] <- 1000
datav2$unitCROPDMG[grepl("M|m",datav2$CROPDMGEXP)] <- 1000000
datav2$unitCROPDMG[grepl("B|b",datav2$CROPDMGEXP)] <- 1000000000  
datav2$unitCROPDMG[is.na(datav2$unitCROPDMG)] <- 0 #Turn NAs to 0 so that I can add up

#Multiplying the Cost by the unit of measurement
datav2$PROPCOST <- datav2$PROPDMG*datav2$unitPROPDMG
datav2$CROPCOST <- datav2$CROPDMG*datav2$unitCROPDMG
#Adding up both costs
datav2$PROPCROPCOST <- datav2$PROPCOST + datav2$CROPCOST
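
The unit conversion above can also be sketched with a named lookup vector instead of repeated grepl() calls. This is an alternative formulation, not the code used for this report; `multiplier`, `expToUnit`, and the toy values are hypothetical names introduced for illustration. Unlike grepl(), the lookup matches the whole code rather than any string containing the letter.

```r
# Multipliers for the letter codes; anything else maps to 0, as in the code above
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a vector of exponent codes to numeric multipliers
expToUnit <- function(codes) {
  unit <- unname(multiplier[toupper(as.character(codes))])
  unit[is.na(unit)] <- 0  # blank or unknown codes contribute no cost
  unit
}

# Hypothetical toy values standing in for PROPDMG and PROPDMGEXP
dmg   <- c(25, 2.5, 10, 1.2, 3)
codes <- c("K", "m", "", "B", "h")

# Full cost in USD = reported amount times its unit
cost <- dmg * expToUnit(codes)
print(cost)
```

The same helper could be applied to both the property and the crop columns, which avoids writing the mapping twice.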

Now that the Property Damage Cost and the Crop Damage Cost have been added up together, we can group by event to see which ones have the greatest economic impact.

#Group by type and add up cost and divides by 1000000000 to convert to billions, then takes top 15
costEvents <- datav2 %>% group_by(EVTYPE) %>% summarise(prop_cropCost= sum(PROPCROPCOST)/1000000000)
orderedCostEvents <- arrange(costEvents, desc(prop_cropCost))
top15Cost <- orderedCostEvents[1:15,]
top15Cost$EVTYPE <- factor(top15Cost$EVTYPE, levels=top15Cost$EVTYPE) #Relevel factors in order
head(top15Cost)

## Source: local data frame [6 x 2]
## 
##              EVTYPE prop_cropCost
## 1             FLOOD     150.31968
## 2 HURRICANE/TYPHOON      71.91371
## 3           TORNADO      57.30194
## 4       STORM SURGE      43.32354
## 5              HAIL      18.73322
## 6       FLASH FLOOD      17.56154

Figure 2.

#Plot2. Top 15 weather events by total property + crop damage
plot2 <- ggplot(top15Cost, aes(EVTYPE, prop_cropCost))
plot2 + geom_bar(stat="identity", color="black") +
    labs(x="", y="Total Property + Crop Damage (Bn USD)",
         title="Cumulative Economic Impact of weather events in the US.\n1950-2011 (Bn USD)") +
    theme(legend.position="none") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))

Results

As shown in Figure 1, Tornados have been by far the most harmful weather event to population health in the US in the period from 1950 to 2011, as measured by the 96,979 instances of either fatalities or injuries. The 2nd most harmful event is Excessive Heat, and it does not even come close to being as harmful as Tornados, with a comparable figure of “only” 8,428.
Another metric computed to see the effect on population health was the “Deadly Rate”, or fatalities as a percentage of injuries. The highest-ranking weather event in this category was Flash Flood with 55%, followed in 2nd place by Heat with about 45%.
Floods are the weather event with the highest economic impact, totaling 150 Bn USD in both property and crop damage from 1950 to 2011, with Hurricanes/Typhoons ranking 2nd at 72 Bn USD. Surprisingly, the most harmful weather event to population health, Tornados, only ranked 3rd in terms of economic impact, with damages totaling 57 Bn USD.