Synopsis

In this report, we analyze a dataset from the U.S. National Oceanic and Atmospheric Administration (NOAA) on severe weather events. We focus on the impact of each type of event on public health, in terms of fatalities and injuries, and on the economy, in terms of property and crop damage. Our analysis shows that, in accrued terms, tornados are the most damaging events in terms of fatalities and injuries. In economic terms, floods, hurricanes and tornados are the most critical. In contrast, events with a wide-area impact and a long time span have the highest average impact per individual occurrence: heat waves and hurricanes in the health domain, and hurricanes and high tides in the cost domain.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The following questions are addressed in this report:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The data for this assignment come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The file is downloaded from the Internet and read directly into R, which decompresses it on the fly:

## Get data from the Internet (if not already done)
if (!file.exists("./data")) {dir.create("./data")}

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata_data_StormData.csv.bz2"
bz2file <- "./data/repdata_data_StormData.csv.bz2"
csvfile <- "./data/repdata_data_StormData.csv"

if(!file.exists(bz2file)){
    print("Downloading zip")
    download.file(fileUrl,bz2file,method="curl")
}
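## Load the data (if not already in the workspace); read.csv() reads the
## bzip2-compressed file directly, so no uncompressed copy is written
if(!exists("csvdata")){
    print("Reading data...")
    csvdata<-read.csv(bz2file)
}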
## [1] "Reading data..."
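
No explicit decompression step is needed above because read.csv() opens the path through a file connection that detects bzip2 compression when reading. The following minimal sketch illustrates this with a small round trip; the toy data frame and the temporary file are assumptions made for illustration and are not part of the NOAA data.

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write it as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() is given the compressed path directly, as in the chunk above
back <- read.csv(tmp)
unlink(tmp)
```

If the sketch behaves as expected, back holds the same two rows as toy, which is why the report never writes an uncompressed copy of the file.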

The type of event is coded in the variable EVTYPE. This variable is inconsistently coded in the dataset, so some transformations are needed. Only the 48 types defined in the Storm Data Documentation are used; every other value is mapped to the most similar allowed event type.

if(!require(stringdist)){install.packages("stringdist")}
library(stringdist)
library(dplyr)
## Allowed types
allowed <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood",
             "Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought",
             "Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
             "Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog",
             "Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind",
             "Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood",
             "Lightning","Marine Hail","Marine High Wind","Marine Strong Wind",
             "Marine Thunderstorm Wind","Rip Current","Seiche","Sleet",
             "Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado",
             "Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash",
             "Waterspout","Wildfire","Winter Storm","Winter Weather")

types <- csvdata %>% count(EVTYPE) %>% arrange(desc(n))
types$NEWTYPE <- tolower(gsub("FLD","FLOOD",types$EVTYPE,fixed = T))
types$NEWTYPE <- gsub("tstm","thunderstorm",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("light ","",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("hurricane","hurricane (typhoon)",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("record","excessive",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("fog","dense fog",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("ice storm","freezing storm",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("ice","winter weather",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("freezing storm","ice storm",types$NEWTYPE,fixed = T)
types$NEWTYPE <- gsub("urban","",types$NEWTYPE,fixed = T)

# Distance with the longest common substring metric
distmtx <- sapply(tolower(allowed),function(x)
    stringdist(types$NEWTYPE,x,method="lcs"))
whichmin <- apply(distmtx,1,which.min)

types$NEWTYPE <- allowed[whichmin]

# Rare event types whose name differs considerably in length from the matched
# type are assigned to a new type, "Other"
# (& is the element-wise AND; && would only test a single value)
types$NEWTYPE[(types$n<20) & (abs(nchar(types$EVTYPE)-nchar(types$NEWTYPE))>9)]<-"Other"

# Finally, some changes are made by hand
types$NEWTYPE[types$EVTYPE=="LANDSLIDE"]<-"Debris Flow"
types$NEWTYPE[types$EVTYPE=="FREEZING RAIN"]<-"Winter Storm"
types$NEWTYPE[types$EVTYPE=="DRY MICROBURST"]<-"Thunderstorm Wind"
types$NEWTYPE[types$EVTYPE=="UNSEASONABLY WARM"]<-"Heat"
types$NEWTYPE[types$EVTYPE=="ASTRONOMICAL HIGH TIDE"]<-"High Surf"
types$NEWTYPE[types$EVTYPE=="HEAVY SURF"]<-"High Surf"
types$NEWTYPE[types$EVTYPE=="COLD"]<-"Cold/Wind Chill"
types$NEWTYPE[types$EVTYPE=="RECORD COLD"]<-"Cold/Wind Chill"
types$NEWTYPE[types$EVTYPE=="FROST"]<-"Frost/Freeze"

csvdata<-merge(csvdata,types)
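
To make the matching step above more concrete, the following is a minimal base-R sketch of the distance used. To our understanding of the stringdist documentation, the "lcs" distance counts the characters left unpaired after aligning the longest common subsequence of two strings. The function lcs_dist and the vector allowed_toy are hypothetical names introduced here for illustration; this is not the package's implementation.

```r
# LCS distance: number of characters left unpaired once the longest
# common subsequence of a and b has been aligned
lcs_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # m[i + 1, j + 1] holds the LCS length of the first i characters of a
  # and the first j characters of b
  m <- matrix(0L, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        m[i + 1, j + 1] <- m[i, j] + 1L
      } else {
        m[i + 1, j + 1] <- max(m[i, j + 1], m[i + 1, j])
      }
    }
  }
  length(x) + length(y) - 2L * m[length(x) + 1, length(y) + 1]
}

# Hypothetical subset of the allowed types, already lower-cased
allowed_toy <- c("flood", "flash flood", "thunderstorm wind")

# Distance from a miscoded label to every allowed type; keep the closest
d <- sapply(allowed_toy, function(x) lcs_dist("thunderstorm winds", x))
best <- allowed_toy[which.min(d)]
```

Here best is "thunderstorm wind", at distance 1, because only the trailing "s" remains unpaired. The same reasoning shows why the manual corrections are still needed: a label such as "LANDSLIDE" shares few characters with "Debris Flow", so a purely character-based distance cannot recover that mapping.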

To address the harm to the population and the cost of the weather events, the following additional variables will be considered: FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.

The damage figures need to be transformed into a single variable per dimension, expressed on a homogeneous scale. For health damage, we weight an injury as 0.25 of a fatality (fatal injury), in line with the statistical value (close to the ‘severe’ injury type) that the FAA sets out. For economic damage, we express all figures in billions of dollars.

## Mutate the cost variables into only one, in billions of dollars
## (exponent codes 0..9, -, ? are ignored)
## Select only the appropriate columns
data <- csvdata %>% mutate(DAMAGE=PROPDMG*((PROPDMGEXP=='B')*1 +
                                           (PROPDMGEXP=='M')*0.001 +
                                           (PROPDMGEXP=='K')*0.000001 +
                                           (PROPDMGEXP %in% c('H','h'))*0.0000001)+
                                  CROPDMG*((CROPDMGEXP=='B')*1 +
                                           (CROPDMGEXP=='M')*0.001 +
                                           (CROPDMGEXP=='K')*0.000001 +
                                           (CROPDMGEXP %in% c('H','h'))*0.0000001),
                           ## Weighted casualties: a fatality counts 1, an injury 0.25
                           INJURIES=FATALITIES+INJURIES*0.25) %>%
                    select(NEWTYPE,INJURIES,DAMAGE)
remove(csvdata)
summary <- data %>% group_by(NEWTYPE) %>%
    summarise(AVGINJURIES=mean(INJURIES),
              AVGDAMAGE=mean(DAMAGE),
              TOTINJURIES=sum(INJURIES),
              TOTDAMAGE=sum(DAMAGE))
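
As a worked example of the two scales, the sketch below applies the same conversions to a few made-up records. The lookup vector to_billion, the column HEALTH and all the values are assumptions for illustration only. Unlike the chunk above, the sketch upper-cases the exponent code before the lookup, so lower-case codes are also converted.

```r
# Multipliers that convert each exponent code to billions of dollars
to_billion <- c(H = 1e-7, K = 1e-6, M = 1e-3, B = 1)

# Hypothetical records, not taken from the NOAA data
toy <- data.frame(PROPDMG    = c(2.5, 500, 10, 7),
                  PROPDMGEXP = c("B", "M", "K", "?"),
                  FATALITIES = c(4, 0, 1, 0),
                  INJURIES   = c(8, 2, 0, 4))

# Codes missing from the lookup give NA; treat them as zero (ignored)
mult <- unname(to_billion[toupper(as.character(toy$PROPDMGEXP))])
mult[is.na(mult)] <- 0

# Damage in billions of dollars: 2.5, 0.5, 0.00001 and 0
toy$DAMAGE <- toy$PROPDMG * mult

# Weighted casualties (fatality = 1, injury = 0.25): 6, 0.5, 1 and 1
toy$HEALTH <- toy$FATALITIES + 0.25 * toy$INJURIES
```

For instance, 500 with code "M" is 500 million dollars, that is, 0.5 billion, and 8 injuries add 2 weighted casualties to the 4 fatalities of the first record.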

Results

The analysis consists of sorting the event types in descending order, first by health cost (fatalities and injuries) and then by aggregate economic cost, and selecting the 10 most damaging event types in each case.
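
The sort-and-select step can be sketched in base R on a toy table. The helper top_n_rows and the figures below are hypothetical; the report itself relies on slice_max from dplyr, which by default also keeps rows tied with the last selected one, whereas this sketch simply truncates.

```r
# Hypothetical totals per event type
toy <- data.frame(NEWTYPE     = c("Hail", "Tornado", "Flood", "Heat"),
                  TOTINJURIES = c(15, 900, 120, 300))

# Sort in descending order of the chosen column and keep the first n rows
top_n_rows <- function(df, column, n) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top2 <- top_n_rows(toy, "TOTINJURIES", 2)
```

With these figures, top2 contains the rows for Tornado and Heat, in that order.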

1. Across the United States, which types of events are most harmful with respect to population health?

library(ggplot2)
## Sort according to effect in health

result <- summary %>% slice_max(TOTINJURIES,n=10)
result$NEWTYPE=factor(result$NEWTYPE,levels=result$NEWTYPE)

ggplot(result, aes(NEWTYPE,TOTINJURIES))+ geom_col() +
    ylab("Number of casualties") + xlab("Event type") + theme_bw() +
    ggtitle("Total casualties per type of event (1950-2011)") +
    theme(axis.text.x=element_text(colour = "black",angle=45,hjust=1,vjust=1))

The figure shows that the type of event that has caused the most casualties in the period under analysis is the tornado. Excessive heat events follow far behind, and then thunderstorms, floods and lightning.

2. Across the United States, which types of events have the greatest economic consequences?

The cost elements are summarized in an analogous manner:

## Sort and select according to economic effect
result <- summary %>% slice_max(TOTDAMAGE,n=10)
result$NEWTYPE=factor(result$NEWTYPE,levels=result$NEWTYPE)

## Plot most harmful events according to economic effect
ggplot(result, aes(NEWTYPE,TOTDAMAGE))+ geom_col() +
    ylab("Cost (B$)") + xlab("Event type") + theme_bw() +
    ggtitle("Total cost per type of event (1950-2011)") +
    theme(axis.text.x=element_text(colour = "black",angle=45,hjust=1,vjust=1))

The previous figure shows that the events with the greatest economic impact are floods, hurricanes and tornados.

Analysis of the average impact of the events

Although the outcome of the previous analysis, in accrued terms, is clear, it is worth analysing the impact of an individual event in both the health and economic dimensions.

result<- summary %>% slice_max(AVGINJURIES,n=10)

par(mfrow=c(1,2))
par(mar=c(10,5,5,2))
## Plot most harmful events according to effect in health
barplot(result$AVGINJURIES,names.arg=result$NEWTYPE,las=2,main="Average casualties (1950-2011)",ylab="Average casualties")

result <- summary %>% slice_max(AVGDAMAGE,n=10)

## Plot most harmful events according to economic effect
barplot(result$AVGDAMAGE,names.arg=result$NEWTYPE,las=2,main="Average cost (1950-2011)",ylab="Average cost (B$)")

The previous figure shows that heat episodes and hurricanes (wide-area, long-duration events) have, on average, the greatest individual impact in terms of health. The impact of a single hurricane or tropical storm in the cost dimension is even more evident.

Events affecting a large area over several days, such as a heat wave or a hurricane, can have a high individual impact. Even so, the impact of a large number of violent local events, such as tornados, can be substantial and can even outweigh that of wide-area events in the long run. The contrast between the two approaches (total and average) is interesting, and is explained by the different nature of each event. Ultimately, this analysis provides additional guidance to the authorities about their potential investments.
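
The contrast between totals and averages can be reproduced with a small, entirely hypothetical example: two severe wide-area events against fifty milder local ones. The counts and damage values are invented for illustration and do not come from the NOAA data.

```r
# Hypothetical records: a rare, severe event type and a frequent, milder one
toy <- data.frame(
  NEWTYPE = c(rep("Hurricane (Typhoon)", 2), rep("Tornado", 50)),
  DAMAGE  = c(rep(5, 2), rep(0.4, 50))
)

# Accrued and average damage per event type (in billions of dollars)
tot <- tapply(toy$DAMAGE, toy$NEWTYPE, sum)
avg <- tapply(toy$DAMAGE, toy$NEWTYPE, mean)

worst_total   <- names(which.max(tot))
worst_average <- names(which.max(avg))
```

In this toy case the tornados accumulate 20 against 10 for the hurricanes, while a single hurricane averages 5 against 0.4 for a tornado, so the two rankings point to different event types.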