SYNOPSIS

This report details the analysis made to understand the public health and economic impact of major weather events in the United States. Impact of events is considered as fatalities and injuries for public health, and property and crop damage for economics.

The data analyzed is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains events from 1950 to November 2011. For this study, only data from 1991 to November 2011 was used.

This analysis could be useful to prioritize resources needed for different types of events, and to make disaster contingency plans to be prepared in case of these events happens.

DATA PROCESSING

Setting knitr options and libraries needed.

require(lubridate)
require(plyr)
require(stringr)
require(ggplot2)
require(gridExtra)
require(knitr)

1. Data loading

The raw data is downloaded from Coursera provided link of NOAA’s storm data.

filename <- "repdata-data-StormData.csv.bz2"
if (!file.exists(filename)) {
        url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(url=url,destfile=filename)
}
rawdata <- read.csv(filename, strip.white=TRUE)
head(rawdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

2. Data subsetting and cleaning

For the intended analysis of impact, the variables pertinent are the following:

  • EVTYPE. Type of event classified by NOAA.
  • BGN_DATE. Date when event began.
  • FATALITIES. Estimation of fatalities caused by the event.
  • INJURIES. Estimation of injuries caused by the event.
  • PROPDMG. Estimation of property damages caused by the event.
  • PROPDMGEXP. Multiplication factor related to property damages caused by the event.
  • CROPDMG. Estimation of crop damages caused by the event.
  • CROPDMGEXP. Multiplication factor related to crop damages caused by the event.

First, I subset the variables which are used as-is, and give them more descriptive names.

data <- rawdata[,c("EVTYPE","FATALITIES","INJURIES")]
names(data) <- c("event.type","fatalities","injuries")

Next, get year of the event.

data$year <- year(mdy_hms(rawdata$BGN_DATE))

For the damage estimations it’s necessary to normalize the values that will be used for analysis, so I will adjust values of PROPDMG and CROPDMG variables according to PROPDMGEXP and CROPDMEXP code units. From Storm Data Documentation (page 12): “Alphabetical characters used to signify magnitude include K for thousands, M for millions, and B for billions”. Any other character different from the established I’ll assume that there is no multiplication factor, so value in the corresponding damage column would not be affected (multiply by 1).

For this task , I define a function to normalize damage amounts.

## Function to normalize damage amounts
norm_factor <- function(code) {
        factor <- 1
        code <- toupper(code)
        if (code == "B")
                factor <- 1000000000
        else if (code == "M")
                factor <- 1000000
        else if (code == "K")
                factor <- 1000
        return(factor)
}
# Normalize property and crop damage and recalculate in millions base, also
# calculate total damage.
data$property.damage <- rawdata$PROPDMG * sapply(rawdata$PROPDMGEXP, 
                                                 FUN=norm_factor)
data$crop.damage <- rawdata$CROPDMG * sapply(rawdata$CROPDMGEXP, 
                                             FUN=norm_factor)
data$total.damage <- data$property.damage + data$crop.damage
head(data)
##   event.type fatalities injuries year property.damage crop.damage
## 1    TORNADO          0       15 1950           25000           0
## 2    TORNADO          0        0 1950            2500           0
## 3    TORNADO          0        2 1951           25000           0
## 4    TORNADO          0        2 1951            2500           0
## 5    TORNADO          0        2 1951            2500           0
## 6    TORNADO          0        6 1951            2500           0
##   total.damage
## 1        25000
## 2         2500
## 3        25000
## 4         2500
## 5         2500
## 6         2500

The full range of data compiled by NOAA is 62 years, but considering that early years there was fewer events recorded, most likely due to lack of good records, I use the last 20 years that represent 82.1% of the observations.

## Percentage of last 20 years records
sum(data$year>1990) / length(data$year)
## [1] 0.821009
# Subsetting for the last 20 years (year > 1990)
data <- data[data$year>1990,]

The event types defined by NOAA are 48 types, and data analyzed has 985 types, mainly due to typographics errors or names slightly different. As basic normalization steps, I lowercase types and remove extra blanks, and as a result, there are only 883 types.

length(unique(data$event.type))
## [1] 985
head(unique(data$event.type))
## [1] TSTM WIND             HAIL                  TORNADO              
## [4] FREEZING RAIN         SNOW                  ICE STORM/FLASH FLOOD
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND
data$event.type <- str_trim(gsub("[[:blank:]]{2,}"," ",data$event.type))
data$event.type <- tolower(data$event.type)
length(unique(data$event.type))
## [1] 883
head(unique(data$event.type))
## [1] "tstm wind"             "hail"                  "tornado"              
## [4] "freezing rain"         "snow"                  "ice storm/flash flood"

Important. The event types should be cleaned more accurately, to combine those events that are the same but are classified with an older type or combined types in one record, but this goal requires a very deep understanding of data and I consider it is outside of the goal of this project.

3. Tidy data

Now, I make the tidy data data frame summarizing total amount fatalities, injuries and property, crop and total damage per event type. After that, we eliminate those event type that has no contribution at all (fatalities, injuries, property damage and crop damage are all equal to zero for a specific event type).

tidydata <- ddply(data, .(event.type), summarize, fatalities=sum(fatalities), 
                  injuries=sum(injuries), property.damage=sum(property.damage), 
                  crop.damage=sum(crop.damage), total.damage=sum(total.damage))
filter <- (tidydata$fatalities + tidydata$injuries + 
        tidydata$property.damage + tidydata$crop.damage) > 0
tidydata <- tidydata[filter,]

Finally, we have the tidy data required for the analysis.

dim(tidydata)
## [1] 440   6
str(tidydata)
## 'data.frame':    440 obs. of  6 variables:
##  $ event.type     : chr  "?" "agricultural freeze" "apache county" "astronomical high tide" ...
##  $ fatalities     : num  0 0 0 0 0 1 224 0 1 101 ...
##  $ injuries       : num  0 0 0 0 0 0 170 0 24 805 ...
##  $ property.damage: num  5000 0 5000 9425000 320000 ...
##  $ crop.damage    : num  0 28820000 0 0 0 ...
##  $ total.damage   : num  5000 28820000 5000 9425000 320000 ...
head(tidydata)
##                event.type fatalities injuries property.damage crop.damage
## 1                       ?          0        0            5000           0
## 6     agricultural freeze          0        0               0    28820000
## 7           apache county          0        0            5000           0
## 8  astronomical high tide          0        0         9425000           0
## 9   astronomical low tide          0        0          320000           0
## 10               avalance          1        0               0           0
##    total.damage
## 1          5000
## 6      28820000
## 7          5000
## 8       9425000
## 9        320000
## 10            0

RESULTS

1. Public Health

Fatalities and injuries cannot be combined for analysis, because they have different implications;for example, injuries may require hospitalization and medical treatment, whereas fatalities don’t. So, I filter the top 5 events by fatalities and the 5 top events by injuries.

## Get the top five (fatalities)
top.fatalities <- arrange(tidydata,desc(fatalities))[,c(1,2)]
top.fatalities$percentage <- top.fatalities$fatalities / 
        sum(top.fatalities$fatalities) * 100
top.fatalities <- head(top.fatalities,5)

kable(top5.fatalities, format="markdown", col.names=c("Event Type","Fatalities","%"))

Table PH-1. Top 5 events causing fatalities

Event Type Fatalities %
excessive heat 1903 17.312591
tornado 1699 15.456696
flash flood 978 8.897380
heat 937 8.524381
lightning 816 7.423581
## Get the top five (injuries)
top.injuries <- arrange(tidydata,desc(injuries))[,c(1,3)]
top.injuries$percentage <- top.injuries$injuries / 
        sum(top.injuries$injuries) * 100
top.injuries <- head(top.injuries,5)

kable(top5.injuries, format="markdown", col.names=c("Event Type","Injuries","%"))

Table PH-2. Top 5 events causing injuries

Event Type Injuries %
tornado 25497 35.474581
flood 6789 9.445697
excessive heat 6525 9.078387
lightning 5230 7.276623
tstm wind 4441 6.178869
## Plot these events
p1 <- ggplot(data=top.fatalities, aes(x=reorder(event.type,percentage), 
                                        y=percentage, fill=fatalities)) +
        geom_bar(stat="identity") +
        coord_flip() + theme(legend.position="none") +
        xlab("") + ylab("% of fatalities caused by event")
p2 <- ggplot(data=top.injuries, aes(x=reorder(event.type,percentage), 
                                        y=percentage, fill=injuries)) +
        geom_bar(stat="identity") +
        coord_flip() + theme(legend.position="none") +
        xlab("") + ylab("% of injuries caused by event")
grid.arrange(p1, p2, main="Fig. 1 - Public Health Top Weather Events (1991-2011)")

From Table PH-1 and Figure 1, Excessive Heat, Tornado, Flash Flood, Heat and Lighting causes 56.19% of fatalities in US. Similarly, from table PH-2 and Figure 1, Tornado, Flood, Excessive Heat, Lightning and Tstm Wind causes 57.04% of injuries in US.

The recomendation is work on prevention for Tornado, Excessive Heat and Lightning for Public Health disaster politics, because those three are common on each top 5.

2.Economic

Economic impact may be totalized from both type of damages, to identify the most dangerous events, so I filter the top 5 events by total damage caused.

## Get the top five
top.damage <- arrange(tidydata,desc(total.damage))[,c(1,4,5,6)]
top.damage$percentage <- top.damage$total.damage / 
        sum(top.damage$total.damage) * 100
top.damage <- head(top.damage,5)

kable(top5.fatalities, format="markdown", col.names=c("Event Type","Fatalities","%"))

Table EC-1. Top 5 events by economic damage caused

Event Type Property Damage Crop Damage Total Damage % Total
flood 144657709807 5661968450 150319678257 33.524792
hurricane/typhoon 69305840000 2607872800 71913712800 16.038434
storm surge 43323536000 5000 43323541000 9.662159
tornado 28897947539 414953270 29312900809 6.537460
hail 15732267048 3025954473 18758221521 4.183521
## Plot these events
p3 <- ggplot(data=top.damage, aes(x=reorder(event.type,percentage), 
                                        y=percentage, fill=total.damage)) +
        geom_bar(stat="identity") +
        coord_flip() + theme(legend.position="none") +
        xlab("") + ylab("% of total damage caused by event")
grid.arrange(p3, main="Fig. 2 - Economic Damage Top Weather Events (1991-2011)")

From Table EC-1 and Figure 2, Flood, Hurricane/Typhon causes 49.56% of total damage in US. The recomendation is work on prevention of those types of event to avoid great economic damage.