Synopsis

While tornadoes clearly lead in human costs (fatalities and injuries), floods have the greatest financial impact through property and crop damage.

Introduction

From the Coursera assignment:

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

From the Coursera assignment:

The data for this analysis comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. We downloaded the file from the course web site:

Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Initialize required packages

library(data.table)
library(dplyr)
library(lubridate)
library(ggplot2)
library(reshape2)

To avoid storing the source data in our analysis repository, we check for a local copy and download the file only if it is missing. The repository is configured to ignore the data file.

DATA_URL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
DATA_FILE = "StormData.csv.bz2"

# check if data file exists locally
if(!file.exists(DATA_FILE)) {

    message("Downloading data from url.")
    download.file(DATA_URL, destfile=DATA_FILE)

} else {
    message("Downloaded data found locally, not repeating.")
}
## Downloaded data found locally, not repeating.

Data Processing

data = read.csv(DATA_FILE)

For our analysis, we are only concerned with select fields from the data set, in particular:

* BGN_DATE
* EVTYPE
* FATALITIES
* INJURIES
* PROPDMG
* PROPDMGEXP
* CROPDMG
* CROPDMGEXP

The following allows us to extract the fields listed above:

#select only the columns necessary for the analysis
cols <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
sub <- subset(data, select = cols)

#extract the year from the BGN_DATE column
sub$year <- year(mdy_hms(sub$BGN_DATE))

Data is sparse in the early years, and impacts recorded then are harder to compare with recent ones because of population growth and inflation.

# This is a count of events by year
hist(sub$year, breaks = 60)

# We can see a clear increase in the data points starting in 1994, so we restrict the analysis to this year and later

sub <- filter(sub, year >= 1994)
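
The histogram shows the overall shape; exact per-year counts can make the cutoff easier to justify. The sketch below uses a handful of made-up years (an assumption for illustration; the real input would be sub$year) to show how table() gives the counts and how the year filter trims the early records.

```r
# Made-up years for illustration only; not values from the NOAA data.
years <- c(1950, 1950, 1993, 1994, 1994, 1994, 1995, 1995, 1995, 1995)

# Number of events recorded in each year
counts <- table(years)
print(counts)

# Same restriction as in the analysis: keep 1994 and later
kept <- years[years >= 1994]
length(kept)
```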

Here is a summary of the features we are interested in:

summary(sub)
##               BGN_DATE                    EVTYPE         FATALITIES     
##  5/25/2011 0:00:00:  1202   HAIL             :222616   Min.   :  0.000  
##  4/27/2011 0:00:00:  1193   TSTM WIND        :128970   1st Qu.:  0.000  
##  6/9/2011 0:00:00 :  1030   THUNDERSTORM WIND: 82482   Median :  0.000  
##  5/30/2004 0:00:00:  1016   FLASH FLOOD      : 53396   Mean   :  0.015  
##  4/4/2011 0:00:00 :  1009   TORNADO          : 25274   3rd Qu.:  0.000  
##  4/2/2006 0:00:00 :   981   FLOOD            : 24906   Max.   :583.000  
##  (Other)          :695700   (Other)          :164487                    
##     INJURIES            PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00   K      :387417   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00          :306087   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   M      :  8266   Median :  0.000  
##  Mean   :   0.0949   Mean   :  12.22   0      :   215   Mean   :  1.889  
##  3rd Qu.:   0.0000   3rd Qu.:   1.50   B      :    37   3rd Qu.:  0.000  
##  Max.   :1568.0000   Max.   :5000.00   5      :    28   Max.   :990.000  
##                                        (Other):    81                    
##    CROPDMGEXP          year     
##         :419089   Min.   :1994  
##  K      :281034   1st Qu.:1999  
##  M      :  1955   Median :2004  
##  k      :    21   Mean   :2004  
##  0      :    17   3rd Qu.:2008  
##  B      :     8   Max.   :2011  
##  (Other):     7

We need to clean up the fields that indicate property damage and crop damage so we can quantify them more easily. Currently, the PROPDMGEXP field indicates the magnitude of PROPDMG; for example, K means the amount is in thousands of dollars. The function below translates the PROPDMGEXP and CROPDMGEXP fields into dollar multipliers.

Convert Damage Exponents

    # convert the PROPDMGEXP and CROPDMGEXP multipliers into numbers
    dmg_mult = function(exp) {
        if(exp == '' || exp == '-' || exp == '?' || exp == '+' || exp == 0) {
            value = 1
        }  else if(exp == 1) {
            value = 10          
         } else if(exp == 'H' || exp == 'h' || exp == 2) {
            value = 100
         } else if(exp == 'K' || exp == 'k' || exp == 3) {
            value = 1000
         } else if (exp == 4) {
            value = 10000         
         } else if (exp == 5) {
            value = 100000         
         } else if (exp == 'M' || exp == 'm' || exp == 6) {
            value = 1000000
         } else if (exp == 7) {
            value = 10000000
         } else if (exp == 8) {
            value = 100000000
         } else if (exp == 'B' || exp == 'b') {
            value = 1000000000
         } else {
            value = NA   # unrecognized code: return NA rather than fail on an undefined value
         }
            
            return(value)
    }

    #add new multiplier and $ damage variables

sub$propmult <- mapply(dmg_mult, sub$PROPDMGEXP)
sub$cropmult <- mapply(dmg_mult, sub$CROPDMGEXP)
sub$propdmg. <- sub$PROPDMG * sub$propmult
sub$cropdmg. <- sub$CROPDMG * sub$cropmult
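
The same mapping can also be written as a lookup table, which avoids calling a function once per row. This is only a sketch of an alternative, not the code used for the results in this report: the names exp_lookup and dmg_mult_vec are hypothetical, and returning NA for unrecognized codes is an assumption.

```r
# Hypothetical vectorized equivalent of dmg_mult(); names are illustrative only.
exp_lookup <- c(
  "-" = 1,   "?" = 1,   "+" = 1,   "0" = 1,
  "1" = 1e1,
  "2" = 1e2, "H" = 1e2,
  "3" = 1e3, "K" = 1e3,
  "4" = 1e4,
  "5" = 1e5,
  "6" = 1e6, "M" = 1e6,
  "7" = 1e7,
  "8" = 1e8,
  "B" = 1e9
)

dmg_mult_vec <- function(exp) {
  code <- toupper(as.character(exp))   # factor -> character; fold h/k/m/b to upper case
  mult <- unname(exp_lookup[code])     # codes missing from the table come back as NA
  mult[code %in% ""] <- 1              # blank exponent: amount is taken as-is
  mult
}

# A PROPDMG of 2.5 with exponent "M" is 2.5 million dollars
2.5 * dmg_mult_vec("M")

dmg_mult_vec(c("K", "m", "", "5", "B"))
```

With this version, the multiplier columns could be filled with a single call per column instead of mapply.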

We only care about events that had a cost in terms of human injury, human death, property damage, or crop damage. After filtering out the weather events without any such cost, we sum the remaining records by event type to get an overall cost per event type.

We then sum the fatalities and injuries to arrive at the total human cost, and sum the property and crop damage to arrive at the total economic cost.

nonzero = subset(
    sub, 
    FATALITIES > 0 | INJURIES > 0 | propdmg. > 0 | cropdmg. > 0
)

hc_by_event <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = nonzero, FUN = "sum")
ec_by_event <- aggregate(cbind(propdmg., cropdmg.) ~ EVTYPE, data = nonzero, FUN = "sum")
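
To illustrate what the filter and aggregate() steps do, here is a toy example. The rows are made up for illustration and are not taken from the NOAA data; the names toy, toy_nonzero, and toy_totals are hypothetical.

```r
# Made-up rows for illustration only; not values from the NOAA data.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  INJURIES   = c(2, 10, 1, 0),
  FATALITIES = c(0, 3, 1, 0),
  stringsAsFactors = FALSE
)

# Drop rows with no human cost (the HAIL row here)
toy_nonzero <- subset(toy, INJURIES > 0 | FATALITIES > 0)

# Total each column within each event type: one output row per EVTYPE
toy_totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy_nonzero, FUN = sum)
print(toy_totals)
```

The two FLOOD rows collapse into one row with 3 injuries and 1 fatality, and HAIL disappears because it had no cost.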

Results

Human Costs due to Weather Events

First let us examine the most costly weather events in terms of human fatalities and injuries.

Here we sort the data by total fatalities and injuries and keep only the top 10 event types.

top_hum_cost <- arrange(hc_by_event, desc(FATALITIES + INJURIES))[1:10,]

top_hum_cost
##               EVTYPE INJURIES FATALITIES
## 1            TORNADO    22571       1593
## 2     EXCESSIVE HEAT     6525       1903
## 3              FLOOD     6778        450
## 4          LIGHTNING     5116        794
## 5          TSTM WIND     3631        241
## 6               HEAT     2095        930
## 7        FLASH FLOOD     1754        951
## 8          ICE STORM     1971         86
## 9  THUNDERSTORM WIND     1476        133
## 10      WINTER STORM     1298        195

Visualizing the results:

## convert wide to long format  
top_hum_cost. <- melt(top_hum_cost, id.vars="EVTYPE")  

xaxis <- reorder(top_hum_cost.$EVTYPE, -(top_hum_cost.$value))    

gp <- ggplot(aes(x=xaxis, y=value, fill=variable), data=top_hum_cost.) +  
  geom_bar(stat="identity") +  
  labs(x="Event Type", y="Number of People Affected",  
  title="Human Costs by Event Type (since 1994) \n(Total, Injuries and Fatalities)") +  
  theme(axis.text.x=element_text(angle=40, hjust=1))  
print(gp)        

The graph shows that tornadoes account for by far the largest share of human costs since 1994: about 40% of the top-ten total and nearly three times the next category, excessive heat. Injuries make up most of those costs, with fatalities a smaller proportion.
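
As a rough check on how dominant tornadoes are, the tornado share can be computed directly from the table above. The sketch re-enters the ten rows by hand (the data frame name hum is hypothetical), so the share is relative to the top ten event types only, not to all events in the database.

```r
# Values copied from the top_hum_cost table above.
hum <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING", "TSTM WIND",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND", "WINTER STORM"),
  INJURIES   = c(22571, 6525, 6778, 5116, 3631, 2095, 1754, 1971, 1476, 1298),
  FATALITIES = c(1593, 1903, 450, 794, 241, 930, 951, 86, 133, 195),
  stringsAsFactors = FALSE
)
hum$TOTAL <- hum$INJURIES + hum$FATALITIES

# Tornado share of the top-ten total, in percent
tornado_share <- 100 * hum$TOTAL[hum$EVTYPE == "TORNADO"] / sum(hum$TOTAL)
round(tornado_share, 1)

# Event type with the most fatalities, considered on their own
fatal_leader <- hum$EVTYPE[which.max(hum$FATALITIES)]
fatal_leader
```

Looking at fatalities alone, the ranking differs: excessive heat has the highest count in the table.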

Financial Costs

We now turn to the economic costs of weather events. Here we will define the economic cost as the sum of the property and crop damage for the given event type.

We sort the data by total financial cost and keep only the top 10 event types.

top_eco_cost <- arrange(ec_by_event, desc(propdmg. + cropdmg.))[1:10,]

colnames(top_eco_cost) <- c("EVTYPE", "PROPERTY_DAMAGE", "CROP_DAMAGE")

top_eco_cost
##               EVTYPE PROPERTY_DAMAGE CROP_DAMAGE
## 1              FLOOD    144179608807  5506942450
## 2  HURRICANE/TYPHOON     69305840000  2607872800
## 3        STORM SURGE     43193536000        5000
## 4            TORNADO     25630588401   361824470
## 5               HAIL     15338044461  2982699123
## 6        FLASH FLOOD     16398255929  1402661500
## 7            DROUGHT      1046106000 13922066000
## 8          HURRICANE     11862819010  2741410000
## 9          ICE STORM      3832377860  5022113500
## 10    TROPICAL STORM      7703385550   677841000

Visualizing the results:

## convert wide to long format  
top_eco_cost. <- melt(top_eco_cost, id.vars="EVTYPE")   

xaxis <- reorder(top_eco_cost.$EVTYPE, -(top_eco_cost.$value))    

gp <- ggplot(aes(x=xaxis, y=value/1e9, fill=variable), data=top_eco_cost.) +  
  geom_bar(stat="identity") +  
  labs(x="Event Type", y="Economic Damage ($billions)",  
  title="Economic Costs by Event Type (since 1994) \n(Total, Property & Crop Damage)") +  
  theme(axis.text.x=element_text(angle=40, hjust=1))  
print(gp)        

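The graph shows that floods account for the largest share of economic costs since 1994: about 40% of the top-ten total and roughly twice the cost of the next category, hurricanes/typhoons. Property damage makes up most of the cost for most event types, with crop damage a smaller proportion; drought and ice storms are the exceptions, where crop damage dominates.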
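
The split between property and crop damage can be checked from the table above. The sketch re-enters the ten rows by hand (the data frame name eco is hypothetical) and lists the event types where crop damage is more than half of the total; shares are relative to the top ten only.

```r
# Values copied from the top_eco_cost table above (dollars).
eco <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "STORM SURGE", "TORNADO", "HAIL",
             "FLASH FLOOD", "DROUGHT", "HURRICANE", "ICE STORM", "TROPICAL STORM"),
  PROPERTY_DAMAGE = c(144179608807, 69305840000, 43193536000, 25630588401, 15338044461,
                      16398255929, 1046106000, 11862819010, 3832377860, 7703385550),
  CROP_DAMAGE = c(5506942450, 2607872800, 5000, 361824470, 2982699123,
                  1402661500, 13922066000, 2741410000, 5022113500, 677841000),
  stringsAsFactors = FALSE
)
eco$TOTAL      <- eco$PROPERTY_DAMAGE + eco$CROP_DAMAGE
eco$CROP_SHARE <- eco$CROP_DAMAGE / eco$TOTAL

# Event types where crop damage exceeds property damage
crop_led <- eco$EVTYPE[eco$CROP_SHARE > 0.5]
crop_led

# Flood share of the top-ten total, in percent
flood_share <- 100 * eco$TOTAL[eco$EVTYPE == "FLOOD"] / sum(eco$TOTAL)
round(flood_share, 1)
```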