This report analyzes data on severe weather events in the United States over the period 1950-2011. As the analysis below shows, tornadoes account for the largest number of reported fatalities and injuries, floods cause the most property damage, and droughts cause the most crop damage. The analysis also shows that deaths from tornadoes spiked in four separate years, three of which were before 1980. Additionally, the greatest property damage from floods occurred in 2006.
First, prepare the environment by loading libraries necessary for the analysis.
## prepare environment
library("dplyr")
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, read in the bzipped data file.
## read in data
s_d <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
Since most of the data file deals with extreme weather events that caused neither injury nor damage, only records with a nonzero value in at least one of the four variables of interest (property damage, crop damage, fatalities, or injuries) are kept. The original, much larger data frame is then removed to free memory.
## keep only rows with injuries, fatalities, property damage or crop damage
keeprows <- which(s_d$PROPDMG!=0 | s_d$CROPDMG!=0 | s_d$FATALITIES!=0 | s_d$INJURIES!=0)
## subset only rows and columns of interest
storms<-s_d[keeprows,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
## clear large data frame out of memory
remove(s_d)
Several data cleaning steps are necessary before analysing the data. First, the dollar values for property damage and crop damage are recorded in several different units. To aggregate these numbers meaningfully, all observations must be in the same units. For this analysis all damage values are converted to millions of dollars.
## Property Damage and Crop Damage are not in $1 units -- the _EXP variable
## contains the units (hundreds, thousands, etc)
## this code chunk creates a factor to multiply the damage number by
## to get all the damage in the same units -- will convert all to millions
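## Note that there are a few hundred observations with the _EXP variable
## equal to a single digit or +/-. These are assumed to be in single dollars
## (multiplier 1), i.e. a factor of 0.000001 when converting to millions.
## Any other value, including blank, gets the same default.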
## Property Damage
storms$PROPDMGEXP <- toupper(storms$PROPDMGEXP)
storms <- mutate(storms,propdmgfac = 0.000001)
storms[which(storms$PROPDMGEXP=="H"),c("propdmgfac")]<-0.0001
storms[which(storms$PROPDMGEXP=="K"),c("propdmgfac")]<-0.001
storms[which(storms$PROPDMGEXP=="M"),c("propdmgfac")]<-1
storms[which(storms$PROPDMGEXP=="B"),c("propdmgfac")]<-1000
storms <- mutate(storms,P_damage=propdmgfac*PROPDMG)
## Crop Damage
storms$CROPDMGEXP <- toupper(storms$CROPDMGEXP)
storms <- mutate(storms,cropdmgfac = 0.000001)
storms[which(storms$CROPDMGEXP=="H"),c("cropdmgfac")]<-0.0001
storms[which(storms$CROPDMGEXP=="K"),c("cropdmgfac")]<-0.001
storms[which(storms$CROPDMGEXP=="M"),c("cropdmgfac")]<-1
storms[which(storms$CROPDMGEXP=="B"),c("cropdmgfac")]<-1000
storms <- mutate(storms,C_damage=cropdmgfac*CROPDMG)
## compute total damage
storms <- mutate(storms,damage=P_damage+C_damage)
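The unit conversion above can also be written as a single lookup. The following is a minimal sketch on made-up values, not the code used to produce this report; the names `exp_to_millions` and `to_millions` are hypothetical. It keeps the same assumption as above: any unit code other than H, K, M or B is treated as single dollars.

```r
## multipliers that convert a damage amount to millions of dollars
exp_to_millions <- c(H = 1e-4, K = 1e-3, M = 1, B = 1e3)

## hypothetical helper: amount and unit code in, millions of dollars out
to_millions <- function(amount, code) {
  fac <- unname(exp_to_millions[toupper(as.character(code))])
  ## codes not in the lookup (blank, digits, +/-) are treated as dollars
  fac[is.na(fac)] <- 1e-6
  amount * fac
}

## toy values, not taken from the storm data
to_millions(c(25, 2.5, 1, 500), c("K", "m", "B", ""))
```

Keeping the multipliers in one named vector means a new unit code only needs one extra entry, and the property and crop columns can share the same helper.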
## keep only variables of interest
storms <- storms[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","P_damage",
"C_damage","damage")]
The EVTYPE field has free-form text describing the event. In order to produce meaningful aggregation, similar events must be grouped. Note that, given that this is only a preliminary analysis, the data has not been extensively cleaned. The first step in this process is to trim leading and trailing whitespace and to convert all event descriptions to upper case.
## Convert all event types to upper case and trim leading/trailing white space
storms$EVTYPE <- toupper(storms$EVTYPE)
storms$EVTYPE <- trimws(storms$EVTYPE)
The next chunk of code looks at the beginning of the EVTYPE field to combine similar events. It is assumed that the first named event is primary when several types of events are listed. Common abbreviations are also folded in; for example, TSTM WIND is treated as THUNDERSTORM WIND.
## cleaning event names:
## grouping all hurricanes together
storms[which(substr(storms$EVTYPE,1,9)=="HURRICANE"),c("EVTYPE")]<-"HURRICANE"
## grouping thunderstorm wind with various abbreviations
storms[which(substr(storms$EVTYPE,1,17)=="THUNDERSTORM WIND" | substr(storms$EVTYPE,
1,9)=="TSTM WIND"),c("EVTYPE")]<-"THUNDERSTORM WIND"
## group all flash floods
storms[which(substr(storms$EVTYPE,1,11)=="FLASH FLOOD"),c("EVTYPE")]<-"FLASH FLOOD"
## group all other floods
storms[which(substr(storms$EVTYPE,1,5)=="FLOOD"),c("EVTYPE")]<-"FLOOD"
## group all types of hail
storms[which(substr(storms$EVTYPE,1,4)=="HAIL"),c("EVTYPE")]<-"HAIL"
## group all heavy rain
storms[which(substr(storms$EVTYPE,1,10)=="HEAVY RAIN"),c("EVTYPE")]<-"HEAVY RAIN"
## group all heavy snow
storms[which(substr(storms$EVTYPE,1,10)=="HEAVY SNOW"),c("EVTYPE")]<-"HEAVY SNOW"
## group all high wind
storms[which(substr(storms$EVTYPE,1,9)=="HIGH WIND"),c("EVTYPE")]<-"HIGH WIND"
## group all tornadoes
storms[which(substr(storms$EVTYPE,1,7)=="TORNADO"),c("EVTYPE")]<-"TORNADO"
## group all winter storms
storms[which(substr(storms$EVTYPE,1,12)=="WINTER STORM"),c("EVTYPE")]<-"WINTER STORM"
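The prefix rules above can be expressed as one loop over a vector of canonical names. This is a sketch under the same "first named event is primary" assumption, shown on made-up event strings; `prefixes` and `group_events` are hypothetical names and are not part of the analysis code.

```r
## canonical event names; each is also the prefix that identifies it
prefixes <- c("HURRICANE", "THUNDERSTORM WIND", "FLASH FLOOD", "FLOOD",
              "HAIL", "HEAVY RAIN", "HEAVY SNOW", "HIGH WIND", "TORNADO",
              "WINTER STORM")

## hypothetical helper: collapse event types that start with a known prefix
group_events <- function(evtype) {
  ## fold the TSTM abbreviation into the full name first
  evtype <- sub("^TSTM WIND", "THUNDERSTORM WIND", evtype)
  for (p in prefixes) {
    evtype[startsWith(evtype, p)] <- p
  }
  evtype
}

## toy values, not taken from the storm data
group_events(c("TSTM WIND/HAIL", "FLOOD/FLASH FLOOD", "HURRICANE OPAL", "HEAT"))
```

Event types that match no prefix, such as HEAT here, pass through unchanged, which mirrors the behaviour of the individual assignments above.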
Once the EVTYPE field has been cleaned, it is used to aggregate the data on fatalities, injuries, and damage.
Sum_Data<-summarise(group_by(storms,EVTYPE),Deaths=sum(FATALITIES),
Injuries=sum(INJURIES), Prop_Damage=sum(P_damage),
Crop_Damage=sum(C_damage), Damage=sum(damage))
As a first step in the analysis, the ten event types with the highest aggregate fatalities, injuries, property damage, crop damage and total damage over the time period are listed here. All damage figures are in millions of dollars.
fatal<-arrange(Sum_Data,desc(Deaths))
fatal<-fatal[,c("EVTYPE","Deaths")]
head(fatal, n=10L)
## Source: local data frame [10 x 2]
##
##               EVTYPE Deaths
## 1            TORNADO   5658
## 2     EXCESSIVE HEAT   1903
## 3        FLASH FLOOD   1018
## 4               HEAT    937
## 5          LIGHTNING    816
## 6  THUNDERSTORM WIND    709
## 7              FLOOD    495
## 8        RIP CURRENT    368
## 9          HIGH WIND    293
## 10         AVALANCHE    224
injury<-arrange(Sum_Data,desc(Injuries))
injury<-injury[,c("EVTYPE","Injuries")]
head(injury, n=10L)
## Source: local data frame [10 x 2]
##
##               EVTYPE Injuries
## 1            TORNADO    91364
## 2  THUNDERSTORM WIND     9458
## 3              FLOOD     6806
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1785
## 9          HIGH WIND     1471
## 10              HAIL     1361
pdam<-arrange(Sum_Data,desc(Prop_Damage))
pdam<-pdam[,c("EVTYPE","Prop_Damage")]
head(pdam, n=10L)
## Source: local data frame [10 x 2]
##
##               EVTYPE Prop_Damage
## 1              FLOOD  144957.524
## 2          HURRICANE   84756.180
## 3            TORNADO   58541.932
## 4        STORM SURGE   43323.536
## 5        FLASH FLOOD   16732.869
## 6               HAIL   15974.470
## 7  THUNDERSTORM WIND    9760.518
## 8     TROPICAL STORM    7703.891
## 9       WINTER STORM    6748.997
## 10         HIGH WIND    6003.353
cdam<-arrange(Sum_Data,desc(Crop_Damage))
cdam<-cdam[,c("EVTYPE","Crop_Damage")]
head(cdam, n=10L)
## Source: local data frame [10 x 2]
##
##               EVTYPE Crop_Damage
## 1            DROUGHT   13972.566
## 2              FLOOD    5878.708
## 3          HURRICANE    5515.293
## 4        RIVER FLOOD    5029.459
## 5          ICE STORM    5022.114
## 6               HAIL    3026.095
## 7        FLASH FLOOD    1437.163
## 8       EXTREME COLD    1312.973
## 9  THUNDERSTORM WIND    1224.398
## 10      FROST/FREEZE    1094.186
tdam<-arrange(Sum_Data,desc(Damage))
tdam<-tdam[,c("EVTYPE","Damage")]
head(tdam,n=10L)
## Source: local data frame [10 x 2]
##
##               EVTYPE     Damage
## 1              FLOOD 150836.232
## 2          HURRICANE  90271.473
## 3            TORNADO  58959.393
## 4        STORM SURGE  43323.541
## 5               HAIL  19000.565
## 6        FLASH FLOOD  18170.032
## 7            DROUGHT  15018.672
## 8  THUNDERSTORM WIND  10984.916
## 9        RIVER FLOOD  10148.405
## 10         ICE STORM   8967.041
From these lists, it is clear that the most injuries and fatalities occur when the event type is TORNADO, the most property damage occurs when the event type is FLOOD, and the most crop damage occurs when the event type is DROUGHT.
Next, focusing on the most fatal event type, TORNADO, deaths from this event are plotted over time.
The data must be re-formatted to get a time series of deaths per year.
## Add Year variable to storms data frame
storms<-mutate(storms,storm_year=as.numeric(substr(as.character(BGN_DATE),
nchar(as.character(BGN_DATE))-11,
nchar(as.character(BGN_DATE))-8)))
## summarise data by year and event type
Yrly_Data<-summarise(group_by(storms,storm_year,EVTYPE),Deaths=sum(FATALITIES),
Injuries=sum(INJURIES), Prop_Damage=sum(P_damage),
Crop_Damage=sum(C_damage), Damage=sum(damage))
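The year extraction above counts characters from the end of BGN_DATE. An alternative is to parse the date itself, which does not depend on the length of the time part. This is a minimal sketch, assuming BGN_DATE strings look like month/day/year followed by a time (an assumption inferred from the offsets used above); `year_of` is a hypothetical helper name.

```r
## hypothetical helper: parse the date part of BGN_DATE and return the year
year_of <- function(bgn_date) {
  ## the format consumes month/day/year; the trailing time is ignored
  d <- as.Date(as.character(bgn_date), format = "%m/%d/%Y")
  as.numeric(format(d, "%Y"))
}

## example strings in the month/day/year layout assumed above
year_of(c("4/18/1950 0:00:00", "11/28/2011 0:00:00"))
```

A string that does not match the assumed layout yields NA rather than a wrong year, which makes malformed dates easier to spot.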
## data for graph 1: yearly deaths from tornadoes
g1data<-Yrly_Data[which(Yrly_Data$EVTYPE==as.character(fatal[1,1])),]
g1title<-paste0("Annual Deaths from Event Type = ",as.character(fatal[1,1]))
plot(g1data$storm_year,g1data$Deaths,type="l", xlab="Year",
ylab="Deaths",main=g1title)
From the plot it is evident that there are four years with extreme values. Three of these are before 1980, while the largest number of deaths on record occurred in 2011. This does not appear to be solely an artifact of better data collection in more recent years, as the number of deaths from tornadoes is not rising over the period 1980-2010.
Next, data is prepared to plot the annual property damage from floods.
## data for graph 2: yearly property damage from floods
g2data<-Yrly_Data[which(Yrly_Data$EVTYPE==as.character(pdam[1,1])),]
g2title<-paste0("Annual Property Damage from Event Type = ",
as.character(pdam[1,1]))
plot(g2data$storm_year,g2data$Prop_Damage,type="l", xlab="Year",
ylab="Property Damage ($millions)",main=g2title)
From this graph it is evident that no property damage from floods is recorded prior to 1993. The total property damage from floods is driven by damage in a single year, 2006, when property damage from floods was $116.52 billion, roughly 80% of the $144.96 billion total for the whole period.
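A claim that a total is driven by a single year can be checked by computing the peak year's share of the total. The sketch below uses a made-up series; `peak_year` is a hypothetical helper, and in this report its inputs would be `g2data$storm_year` and `g2data$Prop_Damage`.

```r
## hypothetical helper: year with the largest value and its share of the total
peak_year <- function(year, value) {
  i <- which.max(value)
  list(year = year[i], share = value[i] / sum(value))
}

## toy series, not the flood data
peak_year(c(2004, 2005, 2006, 2007), c(10, 30, 150, 10))
```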
As noted above, the EVTYPE field has not been extensively cleaned. Further work could be done in classifying events.
This analysis focused on the events where there was damage or injury. Therefore, it does not tell us how likely damage or injury is for a given event type. An analysis of the full data set could answer questions about the average damage incurred in, for example, a tornado. The current analysis looks only at those tornado events where damage or injury occurred.
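One way to approach that question is to keep the events with no damage and compute, for each event type, the share of events that caused any. This is a sketch on a made-up data frame; `full`, `any_damage` and `damage_share` are hypothetical names, and the rows are not taken from the storm data.

```r
## toy stand-in for the full data set, including events with no damage
full <- data.frame(
  EVTYPE  = c("TORNADO", "TORNADO", "TORNADO", "HAIL", "HAIL"),
  PROPDMG = c(25, 0, 10, 0, 0),
  CROPDMG = c(0, 0, 0, 5, 0)
)

## flag events that caused any property or crop damage
full$any_damage <- full$PROPDMG != 0 | full$CROPDMG != 0

## share of events of each type with damage
damage_share <- tapply(full$any_damage, full$EVTYPE, mean)
damage_share
```

The mean of a logical vector is the proportion of TRUE values, so each entry is the fraction of events of that type with nonzero damage.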
Additionally, it would be interesting to look at which event types had the highest annual average over the time period studied.
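Such an annual average could be computed by dividing each event type's total by the number of years in the study window. This is a minimal sketch with made-up yearly totals; `yearly`, `n_years` and `annual_avg` are hypothetical names.

```r
## toy yearly totals, not the storm data
yearly <- data.frame(
  storm_year = c(2009, 2010, 2011, 2009, 2011),
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  Deaths     = c(20, 40, 60, 9, 3)
)

## number of years in the study window; here 2009-2011
n_years <- 3

## total per event type divided by the number of years, largest first
avg <- tapply(yearly$Deaths, yearly$EVTYPE, sum) / n_years
annual_avg <- avg[order(avg, decreasing = TRUE)]
annual_avg
```

Dividing by the full window treats years with no records as zeros. Dividing by the number of years in which an event type appears would give a different answer, which matters for event types that are only recorded in part of the period.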