This report identifies the natural disasters that most warrant the attention of local, state, and federal government agencies and emergency managers. Natural disasters exact tolls in both human terms (injuries and deaths) and economic terms (property and crop losses). The report analyzes data from the National Weather Service to identify the most consequential types of natural disaster faced by the United States. Because the data span a long period and are less accurate in the earlier years, I focused only on the years 2000-2011. A full description of the data on which the analysis is performed is available here:
First we load the raw data (see link above) into a data frame, and (for convenience) change the column names to lower case.
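# read.csv's defaults (double-quote-only quoting, no comment character) are safer for this file's free-text fields
data<-read.csv('repdata_data_StormData.csv.bz2',header = TRUE)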
colnames(data)<-tolower(colnames(data))
Because I chose to focus on the 21st-century data available (2000-2011), I first convert the “bgn_date” column into a date-time using the lubridate package https://cran.r-project.org/web/packages/lubridate/index.html (the dplyr library https://cran.r-project.org/web/packages/dplyr/index.html supplies the %>% pipe) and add a column to the data called “year” containing just the year from the date. Then I remove the rows dated before 2000.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data$year<-mdy_hms(data$bgn_date) %>% year
data<-data[which(data$year>1999),]
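As a small illustration of the date step above, here is a sketch on a hypothetical three-row data frame. The name `toy` and its values are made up for the example; only the shape of the strings (month/day/year followed by a time) is meant to mimic the “bgn_date” column.

```r
library(lubridate)

# Hypothetical toy rows; the string format mimics bgn_date
toy <- data.frame(bgn_date = c("4/18/1950 0:00:00",
                               "12/31/1999 0:00:00",
                               "1/1/2000 0:00:00"),
                  stringsAsFactors = FALSE)

# Parse "month/day/year hour:minute:second" strings, then extract the year
toy$year <- year(mdy_hms(toy$bgn_date))

# Keep only rows from 2000 onward
toy <- toy[toy$year > 1999, ]
```

Only the last row survives the filter, which is the behaviour the report relies on when it discards the pre-2000 records.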
The data.table package https://cran.r-project.org/web/packages/data.table/index.html makes many of the following transformations much easier and more efficient. The data set records economic losses across two columns, such as propdmg and propdmgexp, where the second column gives the multiplier for the first: K for 1,000, M for 1,000,000, and B for 1,000,000,000. The first set of transformations converts each pair into a single numeric column in dollars. Rows whose exponent is anything other than K, M, or B are left unchanged.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
data<-as.data.table(data)
data[propdmgexp=='K',propdmg:=propdmg*1000]
data[propdmgexp=='M',propdmg:=propdmg*1000000]
data[propdmgexp=='B',propdmg:=propdmg*1000000000]
data[cropdmgexp=='K',cropdmg:=cropdmg*1000]
data[cropdmgexp=='M',cropdmg:=cropdmg*1000000]
data[cropdmgexp=='B',cropdmg:=cropdmg*1000000000]
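The same conversion can also be written with a lookup vector. The sketch below is an alternative under stated assumptions, not the code used for this report: `toy`, `multiplier`, `mult`, and `propdmg_usd` are hypothetical names, and the only exponent codes assumed are K, M, B, and blank.

```r
library(data.table)

# Hypothetical toy rows; column names follow the report's lower-cased names
toy <- data.table(propdmg    = c(25, 2.5, 1.2, 10),
                  propdmgexp = c("K", "M", "B", ""))

# Named lookup vector: exponent code -> multiplier (assumed codes only)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Codes missing from the lookup (here the blank) yield NA, replaced by 1
toy[, mult := unname(multiplier[propdmgexp])]
toy[is.na(mult), mult := 1]

# Write the dollar amount to a new column so the raw value is preserved
toy[, propdmg_usd := propdmg * mult]
```

Writing to a new column means the step can be re-run safely, whereas multiplying propdmg in place would scale the values a second time if the chunk were executed twice.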
Next, I created a new column, “totaldmg”, holding the total financial loss for each event (crop losses plus property losses).
data[,totaldmg:=cropdmg+propdmg]
To understand the effect per year of each type of disaster, I created the columns “fatalities_year_evtype”, “injuries_year_evtype”, and “totaldmg_year_evtype”, each of which can be read as the variable (fatalities, injuries, total damage) summed per year per event type.
data[,fatalities_year_evtype:=sum(fatalities), by=list(year,evtype)]
data[,injuries_year_evtype:=sum(injuries), by=list(year,evtype)]
data[,totaldmg_year_evtype:=sum(totaldmg), by=list(year,evtype)]
I also derived columns to examine the total impact by event type.
data[,fatalities_evtype:=sum(fatalities), by=evtype]
data[,injuries_evtype:=sum(injuries), by=evtype]
data[,propdmg_evtype:=sum(propdmg), by=evtype]
data[,cropdmg_evtype:=sum(cropdmg), by=evtype]
data[,totaldmg_evtype:=sum(totaldmg), by=evtype]
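To make the behaviour of the grouped assignments above concrete, the following sketch contrasts `:=` with `by` against a plain grouped aggregation. The table `toy` and the result `per_group` are hypothetical and their values are invented.

```r
library(data.table)

# Hypothetical toy events
toy <- data.table(year       = c(2000, 2000, 2000, 2001),
                  evtype     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
                  fatalities = c(1, 2, 5, 4))

# `:=` with `by` keeps every row and repeats the group total on each one
toy[, fatalities_year_evtype := sum(fatalities), by = list(year, evtype)]

# Aggregating instead returns one row per (year, evtype) group
per_group <- toy[, .(fatalities = sum(fatalities)), by = .(year, evtype)]
```

The report uses the first form, which is why the later plotting code has to call unique to collapse the repeated totals; the second form would yield the de-duplicated table directly.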
With the derived columns in place, I can find the most impactful event type for each of the years examined, by type of casualty and by type of damage. Fatalities are examined first:
data[,evtype[which.max(fatalities_year_evtype)], by=year]
## year V1
## 1: 2000 EXCESSIVE HEAT
## 2: 2001 EXCESSIVE HEAT
## 3: 2002 EXCESSIVE HEAT
## 4: 2003 FLASH FLOOD
## 5: 2004 FLASH FLOOD
## 6: 2005 EXCESSIVE HEAT
## 7: 2006 EXCESSIVE HEAT
## 8: 2007 TORNADO
## 9: 2008 TORNADO
## 10: 2009 RIP CURRENT
## 11: 2010 FLASH FLOOD
## 12: 2011 TORNADO
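Excessive heat is the leading cause of death in five of the twelve years, more often than any other event type; flash floods and tornados each lead in three years, and rip currents in one. And now injuries: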
data[,evtype[which.max(injuries_year_evtype)], by=year]
## year V1
## 1: 2000 TORNADO
## 2: 2001 TORNADO
## 3: 2002 TORNADO
## 4: 2003 TORNADO
## 5: 2004 HURRICANE/TYPHOON
## 6: 2005 TORNADO
## 7: 2006 EXCESSIVE HEAT
## 8: 2007 TORNADO
## 9: 2008 TORNADO
## 10: 2009 TORNADO
## 11: 2010 TORNADO
## 12: 2011 TORNADO
Clearly, tornados are the most common source of injury by disaster during the period reviewed, leading in ten of the twelve years. Next, I look at economic damage, as total crop and property damage by year:
data[,evtype[which.max(totaldmg_year_evtype)], by=year]
## year V1
## 1: 2000 DROUGHT
## 2: 2001 TROPICAL STORM
## 3: 2002 HURRICANE
## 4: 2003 WILDFIRE
## 5: 2004 HURRICANE/TYPHOON
## 6: 2005 HURRICANE/TYPHOON
## 7: 2006 FLOOD
## 8: 2007 TORNADO
## 9: 2008 STORM SURGE/TIDE
## 10: 2009 HAIL
## 11: 2010 FLOOD
## 12: 2011 TORNADO
Notice tornados have appeared on all three lists - they are clearly a significant concern for much of the US.
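The three per-year lookups above share one idiom: within each year, index the event type by the position of the largest group total. A minimal sketch on a hypothetical table (`toy` and `top` are made-up names with invented values):

```r
library(data.table)

# Hypothetical per-year, per-event-type totals
toy <- data.table(year   = c(2000, 2000, 2001, 2001),
                  evtype = c("TORNADO", "FLOOD", "TORNADO", "FLOOD"),
                  fatalities_year_evtype = c(3, 5, 7, 4))

# Within each year, pick the event type on the row with the largest total
top <- toy[, .(evtype = evtype[which.max(fatalities_year_evtype)]), by = year]
```

Because which.max returns the position of the first maximum, a tie between two event types in the same year would be resolved by row order rather than reported as a tie.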
For the final bit of analysis, I examined the five most impactful disasters overall, from both a public health and an economic perspective. Using gather from the tidyr library https://cran.r-project.org/web/packages/tidyr/index.html I reshaped the data to enable a plot of the most harmful event types from a public health perspective.
library(tidyr)
# Select the previously processed 'fatalities_evtype','injuries_evtype', and 'evtype' columns and remove duplicates
casualties<-data %>% select('fatalities_evtype','injuries_evtype','evtype') %>% unique
# create a sum of casualties (fatalities + injuries)
casualties[,total:=fatalities_evtype+injuries_evtype]
# rename the columns for cleaner labeling later
colnames(casualties)[1:2]<-c('fatalities','injuries')
# order the frame by the total casualty count, take the top five entries, and gather the data into a narrow frame with keys for type of casualty
casualties<-casualties[order(total, decreasing = TRUE)]%>%head(5) %>% gather(casualty_type,casualties,fatalities:injuries)
library(ggplot2)
# use ggplot2 to plot the data
p<-ggplot(casualties, aes(evtype,casualties, fill=casualty_type))+geom_bar(stat='identity')+ theme(axis.text.x = element_text(angle = 90)) + labs(title ="Fig. 1 - Top Five Causes of Casualties by Disaster Type 2000-2011", x = "Event Type", y = "Number of Casualties")
print(p)
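The reshaping step in the chunk above can be seen in isolation on a tiny example. This is a sketch with a hypothetical data frame `wide` whose values are invented; it shows how gather turns the two count columns into key/value pairs that ggplot2 can map to the fill colour.

```r
library(tidyr)

# Hypothetical totals per event type (values are made up)
wide <- data.frame(evtype     = c("TORNADO", "FLOOD"),
                   fatalities = c(10, 4),
                   injuries   = c(100, 20),
                   stringsAsFactors = FALSE)

# gather() stacks the two count columns: the old column name goes into
# casualty_type and the count into casualties
long <- gather(wide, casualty_type, casualties, fatalities:injuries)
```

The result has one row per event type and casualty type. In current tidyr releases gather is marked as superseded by pivot_longer, so a new analysis might prefer pivot_longer(wide, fatalities:injuries, names_to = "casualty_type", values_to = "casualties").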
Now we look at the same plot for economic impact:
# Select the previously processed 'propdmg_evtype','cropdmg_evtype','totaldmg_evtype','evtype' columns and remove duplicates
losses<-data %>% select('propdmg_evtype','cropdmg_evtype','totaldmg_evtype','evtype') %>% unique
# rename the columns for cleaner labeling later
colnames(losses)[1:2]<-c('property','crop')
# order the frame by the total loss, take the top five entries, and gather the data into a narrow frame with keys for type of loss
losses<-losses[order(totaldmg_evtype, decreasing = TRUE)]%>%head(5) %>% gather(loss_type,losses,property:crop)
p<-ggplot(losses, aes(evtype,losses, fill=loss_type))+geom_bar(stat='identity')+ theme(axis.text.x = element_text(angle = 90)) + labs(title ="Fig. 2 - Top Five Causes of Loss by Disaster Type 2000-2011", x = "Event Type", y = "Amount of Loss")
print(p)
As the graphs show, tornados are by far the most costly event type in terms of public health. Floods, hurricanes, and storm surges cause more property damage, but tornados are also significant contributors to economic damage from natural disasters.