Synopsis

This report analyzes data on severe weather events in the United States over the period 1950-2011. The analysis below shows that tornadoes account for the largest number of reported fatalities and injuries, while floods cause the most property damage and droughts the most crop damage. It also shows that deaths from tornadoes spiked in four separate years, three of which were before 1980. Additionally, the greatest property damage from floods occurred in 2006.

Data Processing

First, prepare the environment by loading libraries necessary for the analysis.

## prepare environment
library("dplyr")
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Next, read in the bzipped data file.

## read in data
s_d <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

Since most of the data file deals with extreme weather events that did not cause injury or damage, only those records with a nonzero value in at least one of the four variables of interest (property damage, crop damage, fatalities, or injuries) are kept. The original, much larger data frame is then removed from memory to increase efficiency.

## keep only rows with injuries, fatalities, property damage or crop damage
keeprows <- which(s_d$PROPDMG!=0 | s_d$CROPDMG!=0 | s_d$FATALITIES!=0 | s_d$INJURIES!=0)

## subset only rows and columns of interest
storms<-s_d[keeprows,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES",
                       "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
## clear large data frame out of memory
remove(s_d)

Data Cleaning

Several data cleaning steps are necessary before analyzing the data. First, the dollar values for property damage and crop damage are recorded in several different units. To aggregate these numbers meaningfully, all observations must be in the same units. For this analysis, all damage values are converted to millions of dollars.

## Property Damage and Crop Damage are not in $1 units -- the _EXP variable
##      contains the units (hundreds, thousands, etc)
##      this code chunk creates a factor to multiply the damage number by
##      to get all the damage in the same units -- will convert all to millions

##      Note that there are a few hundred observations with the _EXP variable
##      equal to a single digit or +/-. These are assumed to have a factor of 1,
##      i.e. the damage number is taken to be in dollars (0.000001 million).

## Property Damage
storms$PROPDMGEXP <- toupper(storms$PROPDMGEXP)
storms <- mutate(storms,propdmgfac = 0.000001)
storms[which(storms$PROPDMGEXP=="H"),c("propdmgfac")]<-0.0001
storms[which(storms$PROPDMGEXP=="K"),c("propdmgfac")]<-0.001
storms[which(storms$PROPDMGEXP=="M"),c("propdmgfac")]<-1
storms[which(storms$PROPDMGEXP=="B"),c("propdmgfac")]<-1000
storms <- mutate(storms,P_damage =storms$propdmgfac*storms$PROPDMG)

## Crop Damage
storms$CROPDMGEXP <- toupper(storms$CROPDMGEXP)
storms <- mutate(storms,cropdmgfac = 0.000001)
storms[which(storms$CROPDMGEXP=="H"),c("cropdmgfac")]<-0.0001
storms[which(storms$CROPDMGEXP=="K"),c("cropdmgfac")]<-0.001
storms[which(storms$CROPDMGEXP=="M"),c("cropdmgfac")]<-1
storms[which(storms$CROPDMGEXP=="B"),c("cropdmgfac")]<-1000
storms <- mutate(storms,C_damage =storms$cropdmgfac*storms$CROPDMG)

## compute total damage
storms <- mutate(storms,damage=P_damage+C_damage)

## keep only variables of interest
storms <- storms[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","P_damage",
                    "C_damage","damage")]

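The unit conversion above can also be expressed as a single lookup. The following is a minimal sketch in base R on invented toy values, not the code used to produce the results; the helper name `exp_to_millions` is hypothetical. It applies the same assumption as the code above: any exponent code other than H, K, M or B is treated as plain dollars.

```r
## Hypothetical helper (not part of the original analysis): map a damage
## exponent code to a multiplier expressed in millions of dollars.
exp_to_millions <- function(exp_code) {
  lookup <- c(H = 1e-4, K = 1e-3, M = 1, B = 1e3)
  fac <- unname(lookup[toupper(as.character(exp_code))])
  ## any other code (blank, digit, +, -) does not match a name and yields NA;
  ## these are treated as plain dollars, as in the code above
  fac[is.na(fac)] <- 1e-6
  fac
}

## toy data, invented for illustration only
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 10, 5),
                  PROPDMGEXP = c("K", "m", "B", "", "5"),
                  stringsAsFactors = FALSE)
toy$P_damage <- toy$PROPDMG * exp_to_millions(toy$PROPDMGEXP)
print(toy)
```

Keeping the multipliers in one named vector means property and crop damage can share the same conversion rather than repeating four assignments each.
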
The EVTYPE field has free-form text describing the event. In order to produce meaningful aggregation, similar events must be grouped. Note that, given that this is only a preliminary analysis, the data has not been extensively cleaned. The first step in this process is to trim leading and trailing whitespace and to convert all event descriptions to upper case.

## Convert all event types to upper case and trim leading/trailing white space
storms$EVTYPE <- toupper(storms$EVTYPE)
storms$EVTYPE <- trimws(storms$EVTYPE)

The next chunk of code looks at the beginning of the EVTYPE field to combine similar events. It is assumed that the first named event is primary when several types of events are listed. In addition, common abbreviations have been included.

## cleaning event names:
## grouping all hurricanes together
storms[which(substr(storms$EVTYPE,1,9)=="HURRICANE"),c("EVTYPE")]<-"HURRICANE"

## grouping thunderstorm wind with various abbreviations
storms[which(substr(storms$EVTYPE,1,17)=="THUNDERSTORM WIND" | substr(storms$EVTYPE,
        1,9)=="TSTM WIND"),c("EVTYPE")]<-"THUNDERSTORM WIND"

## group all flash floods 
storms[which(substr(storms$EVTYPE,1,11)=="FLASH FLOOD"),c("EVTYPE")]<-"FLASH FLOOD"

## group all other floods
storms[which(substr(storms$EVTYPE,1,5)=="FLOOD"),c("EVTYPE")]<-"FLOOD"

## group all types of hail
storms[which(substr(storms$EVTYPE,1,4)=="HAIL"),c("EVTYPE")]<-"HAIL"

## group all heavy rain
storms[which(substr(storms$EVTYPE,1,10)=="HEAVY RAIN"),c("EVTYPE")]<-"HEAVY RAIN"

## group all heavy snow
storms[which(substr(storms$EVTYPE,1,10)=="HEAVY SNOW"),c("EVTYPE")]<-"HEAVY SNOW"

## group all high wind
storms[which(substr(storms$EVTYPE,1,9)=="HIGH WIND"),c("EVTYPE")]<-"HIGH WIND"

## group all tornadoes
storms[which(substr(storms$EVTYPE,1,7)=="TORNADO"),c("EVTYPE")]<-"TORNADO"

## group all winter storms
storms[which(substr(storms$EVTYPE,1,12)=="WINTER STORM"),c("EVTYPE")]<-"WINTER STORM"
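
The prefix rules above follow one pattern, so they could be driven from a single table of rules. This is a rough sketch in base R, assuming R 3.3.0 or later for `startsWith`; the helper name `relabel_by_prefix` and the sample event strings are invented for illustration. The first matching prefix wins, which mirrors the assumption that the first named event is primary.

```r
## Hypothetical helper: relabel every event whose name starts with one of the
## given prefixes. 'rules' is a named character vector: names are prefixes,
## values are the labels to assign. Earlier rules take precedence.
relabel_by_prefix <- function(evtype, rules) {
  evtype <- trimws(toupper(evtype))
  matched <- rep(FALSE, length(evtype))
  for (prefix in names(rules)) {
    hit <- !matched & startsWith(evtype, prefix)
    evtype[hit] <- rules[[prefix]]
    matched <- matched | hit
  }
  evtype
}

rules <- c("HURRICANE"         = "HURRICANE",
           "THUNDERSTORM WIND" = "THUNDERSTORM WIND",
           "TSTM WIND"         = "THUNDERSTORM WIND",
           "FLASH FLOOD"       = "FLASH FLOOD",
           "FLOOD"             = "FLOOD",
           "TORNADO"           = "TORNADO")

## invented sample strings, not taken from the data file
sample_events <- c("Hurricane Opal", "TSTM WIND/HAIL", " thunderstorm winds",
                   "FLASH FLOODING", "FLOOD/FLASH FLOOD", "Tornado F2",
                   "LIGHTNING")
cleaned <- relabel_by_prefix(sample_events, rules)
print(cleaned)
```

Events that match no prefix, such as LIGHTNING here, pass through unchanged apart from the upper-casing and trimming.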

Once the EVTYPE field has been cleaned, it is used to aggregate the data on fatalities, injuries, and damage.

Sum_Data<-summarise(group_by(storms,EVTYPE),Deaths=sum(FATALITIES),
                    Injuries=sum(INJURIES), Prop_Damage=sum(P_damage), 
                    Crop_Damage=sum(C_damage), Damage=sum(damage))

Results

As a first step in the analysis, the ten event types with the highest aggregate number of fatalities, injuries, property damage, crop damage and total damage over the time period are listed here.

fatal<-arrange(Sum_Data,desc(Deaths))
fatal<-fatal[,c("EVTYPE","Deaths")]
head(fatal, n=10L)
## Source: local data frame [10 x 2]
## 
##               EVTYPE Deaths
## 1            TORNADO   5658
## 2     EXCESSIVE HEAT   1903
## 3        FLASH FLOOD   1018
## 4               HEAT    937
## 5          LIGHTNING    816
## 6  THUNDERSTORM WIND    709
## 7              FLOOD    495
## 8        RIP CURRENT    368
## 9          HIGH WIND    293
## 10         AVALANCHE    224
injury<-arrange(Sum_Data,desc(Injuries))
injury<-injury[,c("EVTYPE","Injuries")]
head(injury, n=10L)
## Source: local data frame [10 x 2]
## 
##               EVTYPE Injuries
## 1            TORNADO    91364
## 2  THUNDERSTORM WIND     9458
## 3              FLOOD     6806
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1785
## 9          HIGH WIND     1471
## 10              HAIL     1361
pdam<-arrange(Sum_Data,desc(Prop_Damage))
pdam<-pdam[,c("EVTYPE","Prop_Damage")]
head(pdam, n=10L)
## Source: local data frame [10 x 2]
## 
##               EVTYPE Prop_Damage
## 1              FLOOD  144957.524
## 2          HURRICANE   84756.180
## 3            TORNADO   58541.932
## 4        STORM SURGE   43323.536
## 5        FLASH FLOOD   16732.869
## 6               HAIL   15974.470
## 7  THUNDERSTORM WIND    9760.518
## 8     TROPICAL STORM    7703.891
## 9       WINTER STORM    6748.997
## 10         HIGH WIND    6003.353
cdam<-arrange(Sum_Data,desc(Crop_Damage))
cdam<-cdam[,c("EVTYPE","Crop_Damage")]
head(cdam, n=10L)
## Source: local data frame [10 x 2]
## 
##               EVTYPE Crop_Damage
## 1            DROUGHT   13972.566
## 2              FLOOD    5878.708
## 3          HURRICANE    5515.293
## 4        RIVER FLOOD    5029.459
## 5          ICE STORM    5022.114
## 6               HAIL    3026.095
## 7        FLASH FLOOD    1437.163
## 8       EXTREME COLD    1312.973
## 9  THUNDERSTORM WIND    1224.398
## 10      FROST/FREEZE    1094.186
tdam<-arrange(Sum_Data,desc(Damage))
tdam<-tdam[,c("EVTYPE","Damage")]
head(tdam,n=10L)
## Source: local data frame [10 x 2]
## 
##               EVTYPE     Damage
## 1              FLOOD 150836.232
## 2          HURRICANE  90271.473
## 3            TORNADO  58959.393
## 4        STORM SURGE  43323.541
## 5               HAIL  19000.565
## 6        FLASH FLOOD  18170.032
## 7            DROUGHT  15018.672
## 8  THUNDERSTORM WIND  10984.916
## 9        RIVER FLOOD  10148.405
## 10         ICE STORM   8967.041

From these lists, it is clear that the most injuries and fatalities occur when the event type is TORNADO, the most property damage and total damage occur when the event type is FLOOD, and the most crop damage occurs when the event type is DROUGHT.
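
As a quick consistency check on the tables above, total damage should equal property damage plus crop damage for each event type. The sketch below re-enters the printed values (in $ millions) for the five event types that appear in all three damage tables; the variable names are illustrative only.

```r
## values copied from the printed tables above, in $ millions
chk <- data.frame(
  EVTYPE      = c("FLOOD", "HURRICANE", "HAIL", "FLASH FLOOD", "THUNDERSTORM WIND"),
  Prop_Damage = c(144957.524, 84756.180, 15974.470, 16732.869, 9760.518),
  Crop_Damage = c(5878.708, 5515.293, 3026.095, 1437.163, 1224.398),
  Damage      = c(150836.232, 90271.473, 19000.565, 18170.032, 10984.916),
  stringsAsFactors = FALSE
)

## difference between the reported total and the sum of its two components;
## allow for rounding in the printed figures
chk$gap <- chk$Damage - (chk$Prop_Damage + chk$Crop_Damage)
consistent <- all(abs(chk$gap) < 0.002)
print(chk)
print(consistent)
```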

Next, focusing on the most fatal event type, TORNADO, deaths from this event are plotted over time.

The data must be re-formatted to get a time series of deaths per year.

## Add Year variable to storms data frame
storms<-mutate(storms,storm_year=as.numeric(substr(as.character(BGN_DATE),
                nchar(as.character(BGN_DATE))-11,
                nchar(as.character(BGN_DATE))-8)))

## summarise data by year and event type
Yrly_Data<-summarise(group_by(storms,storm_year,EVTYPE),Deaths=sum(FATALITIES), 
                     Injuries=sum(INJURIES), Prop_Damage=sum(P_damage), 
                     Crop_Damage=sum(C_damage), Damage=sum(damage))

## data for graph 1: yearly deaths from tornadoes
g1data<-Yrly_Data[which(Yrly_Data$EVTYPE==as.character(fatal[1,1])),]
g1title<-paste0("Annual Deaths from Event Type = ",as.character(fatal[1,1]))
plot(g1data$storm_year,g1data$Deaths,type="l", xlab="Year",
     ylab="Deaths",main=g1title)

From the plot it is evident that there are four years with extreme values. Three of these are before 1980, while the largest number of deaths on record occurred in 2011. This does not appear to be solely an artifact of better data collection in more recent years, as the number of deaths from tornadoes is not rising over the period 1980-2010.
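
The year variable behind this time series is extracted by character position, which relies on every BGN_DATE string ending in an 8-character time part. A hedged alternative is to parse the date explicitly. The sketch below assumes the strings look like "4/18/1950 0:00:00", an assumption inferred from the offsets in the code above rather than shown in this report, and the three sample strings are invented.

```r
## invented sample strings in the assumed "m/d/yyyy h:mm:ss" layout
bgn <- c("4/18/1950 0:00:00", "12/5/1999 0:00:00", "4/27/2011 0:00:00")

## parse the date part; as.Date() ignores trailing characters beyond the format
yr_parsed <- as.numeric(format(as.Date(bgn, format = "%m/%d/%Y"), "%Y"))

## the position-based extraction used above, for comparison
yr_substr <- as.numeric(substr(bgn, nchar(bgn) - 11, nchar(bgn) - 8))

print(yr_parsed)
print(identical(yr_parsed, yr_substr))
```

The parsed version does not depend on the length of the time part, and an unparseable string yields NA, which is easy to check for.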

Next, data is prepared to plot the annual property damage from floods.

## data for graph 2: yearly property damage from floods
g2data<-Yrly_Data[which(Yrly_Data$EVTYPE==as.character(pdam[1,1])),]
g2title<-paste0("Annual Property Damage from Event Type = ",
                as.character(pdam[1,1]))
plot(g2data$storm_year,g2data$Prop_Damage,type="l", xlab="Year",
     ylab="Property Damage ($millions)",main=g2title)

From this graph it is evident that the data contain no property damage from floods prior to 1993. The total property damage from floods is dominated by a single year, 2006, when property damage from floods was $116.52 billion.
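
To put the 2006 figure in context, it can be compared with the total flood property damage reported in the table above. This small calculation uses only the two numbers already quoted in this report and suggests that 2006 accounts for roughly 80% of the total.

```r
## total flood property damage over the whole period, $ millions (table above)
flood_total_m <- 144957.524
## 2006 flood property damage quoted above, converted from $ billions to $ millions
flood_2006_m <- 116.52 * 1000

## share of the period total that falls in 2006
share_2006 <- flood_2006_m / flood_total_m
print(round(100 * share_2006, 1))
```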

Limitations and Directions for Further Analysis

As noted above, the EVTYPE field has not been extensively cleaned. Further work could be done in classifying events.

This analysis focused on the events where there was damage or injury. Therefore, it does not tell us how likely damage or injury is for a given event type. An analysis of the full data set could answer questions about, for example, the average damage incurred in a tornado. The current analysis looks only at those tornado events where damage, injury, or death occurred.
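
One way such a follow-up might start is by computing, for each event type, the share of all recorded events that caused any harm. The sketch below uses a small invented data frame with the same column names as the raw file; it illustrates the idea only and is not a result from the storm data.

```r
## toy data, invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "HAIL", "HAIL", "HAIL", "HAIL"),
  FATALITIES = c(0, 2, 0, 0, 0, 0, 0),
  INJURIES   = c(0, 5, 1, 0, 0, 0, 0),
  PROPDMG    = c(0, 10, 0, 0, 3, 0, 0),
  CROPDMG    = c(0, 0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

## same condition as the row filter used earlier, kept as a flag instead
toy$harmful <- with(toy, PROPDMG != 0 | CROPDMG != 0 |
                         FATALITIES != 0 | INJURIES != 0)

## proportion of events of each type with any damage, injury or death
share_harmful <- tapply(toy$harmful, toy$EVTYPE, mean)
print(share_harmful)
```

Keeping the filter condition as a flag, rather than dropping rows, preserves the denominator needed for this kind of rate.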

Additionally, it would be interesting to look at which event types had the highest annual average over the time period studied.
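
A possible starting point for that comparison is sketched below on invented numbers laid out like Yrly_Data. It shows one design choice that would need to be made: whether to average over every year in the period or only over years in which the event type was recorded.

```r
## toy yearly totals, invented for illustration only
yearly <- data.frame(
  storm_year = c(2009, 2010, 2011, 2009, 2011),
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  Deaths     = c(3, 6, 9, 4, 8),
  stringsAsFactors = FALSE
)

## number of distinct years in the period studied
n_years <- length(unique(yearly$storm_year))

## average over all years in the period (years with no record count as zero)
avg_all_years <- tapply(yearly$Deaths, yearly$EVTYPE, sum) / n_years

## average over only the years in which the event type appears
avg_recorded <- tapply(yearly$Deaths, yearly$EVTYPE, mean)

print(avg_all_years)
print(avg_recorded)
```

The two averages differ whenever an event type is missing from some years, which matters here because some event types only appear in the data from the 1990s onward.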