Synopsis
While tornadoes clearly lead in human costs (fatalities, injuries), floods have the greatest financial impact from property and crop damage.
From the Coursera assignment:
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
From the Coursera assignment:
The data for this analysis comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. We downloaded the file from the course web site:
Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Initialize required packages
library(data.table)
library(dplyr)
library(lubridate)
library(ggplot2)
library(reshape2)
To avoid duplicating the source data in our analysis repository, we check for the file locally and download it only if it is missing. The analysis repository is configured to ignore the source data file.
DATA_URL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
DATA_FILE = "StormData.csv.bz2"
# check if data file exists locally
if(!file.exists(DATA_FILE)) {
  message("Downloading data from url.")
  download.file(DATA_URL, destfile=DATA_FILE)
} else {
  message("Downloaded data found locally, not repeating.")
}
## Downloaded data found locally, not repeating.
data = read.csv(DATA_FILE)
For our analysis, we are only concerned with select fields from the data set, in particular:
* BGN_DATE
* EVTYPE
* FATALITIES
* INJURIES
* PROPDMG
* PROPDMGEXP
* CROPDMG
* CROPDMGEXP
The following allows us to extract the fields listed above:
#select only the columns necessary for the analysis
cols <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
sub <- subset(data, select = cols)
#extract the year from the BGN_DATE column
sub$year <- year(mdy_hms(sub$BGN_DATE))
Data are sparse in the early years, and impacts recorded then are less comparable to recent ones because of population growth and inflation.
# This is a count of events by year
hist(sub$year, breaks = 60)
# We can see a clear increase in the data points starting in 1994. We will restrict the analysis to this year and later
sub <- filter(sub, year >= 1994)
Here is a summary of the features we are interested in:
summary(sub)
## BGN_DATE EVTYPE FATALITIES
## 5/25/2011 0:00:00: 1202 HAIL :222616 Min. : 0.000
## 4/27/2011 0:00:00: 1193 TSTM WIND :128970 1st Qu.: 0.000
## 6/9/2011 0:00:00 : 1030 THUNDERSTORM WIND: 82482 Median : 0.000
## 5/30/2004 0:00:00: 1016 FLASH FLOOD : 53396 Mean : 0.015
## 4/4/2011 0:00:00 : 1009 TORNADO : 25274 3rd Qu.: 0.000
## 4/2/2006 0:00:00 : 981 FLOOD : 24906 Max. :583.000
## (Other) :695700 (Other) :164487
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. : 0.0000 Min. : 0.00 K :387417 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 :306087 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 M : 8266 Median : 0.000
## Mean : 0.0949 Mean : 12.22 0 : 215 Mean : 1.889
## 3rd Qu.: 0.0000 3rd Qu.: 1.50 B : 37 3rd Qu.: 0.000
## Max. :1568.0000 Max. :5000.00 5 : 28 Max. :990.000
## (Other): 81
## CROPDMGEXP year
## :419089 Min. :1994
## K :281034 1st Qu.:1999
## M : 1955 Median :2004
## k : 21 Mean :2004
## 0 : 17 3rd Qu.:2008
## B : 8 Max. :2011
## (Other): 7
We need to clean up the fields that indicate property damage and crop damage so we can quantify them more easily. Currently, the PROPDMGEXP field indicates the magnitude of PROPDMG; for example, K means the value is in thousands of dollars ($1,000s). The function below translates the PROPDMGEXP and CROPDMGEXP fields into dollar multipliers.
# convert the PROPDMGEXP and CROPDMGEXP multipliers into numbers
dmg_mult = function(exp) {
  if(exp == '' || exp == '-' || exp == '?' || exp == '+' || exp == 0) {
    value = 1
  } else if(exp == 1) {
    value = 10
  } else if(exp == 'H' || exp == 'h' || exp == 2) {
    value = 100
  } else if(exp == 'K' || exp == 'k' || exp == 3) {
    value = 1000
  } else if (exp == 4) {
    value = 10000
  } else if (exp == 5) {
    value = 100000
  } else if (exp == 'M' || exp == 'm' || exp == 6) {
    value = 1000000
  } else if (exp == 7) {
    value = 10000000
  } else if (exp == 8) {
    value = 100000000
  } else if (exp == 'B' || exp == 'b') {
    value = 1000000000
  } else {
    # unrecognized code: fail with a clear message instead of returning an undefined value
    stop("Unrecognized damage exponent: ", as.character(exp))
  }
  return(value)
}
#add new multiplier and $ damage variables
sub$propmult <- mapply(dmg_mult, sub$PROPDMGEXP)
sub$cropmult <- mapply(dmg_mult, sub$CROPDMGEXP)
sub$propdmg. <- sub$PROPDMG * sub$propmult
sub$cropdmg. <- sub$CROPDMG * sub$cropmult
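As a worked example of the conversion, a record with PROPDMG = 25 and PROPDMGEXP = K represents 25 * 1,000 = $25,000. The same mapping can also be sketched as a vectorized lookup, which avoids calling a function once per row. The helper name exp_to_mult is hypothetical and not part of the original analysis, and unlike dmg_mult it quietly maps any unrecognized code to 1, which is an assumption.

```r
# Hypothetical vectorized alternative to dmg_mult (not part of the original analysis).
exp_to_mult <- function(exp) {
  exp <- toupper(as.character(exp))        # treat 'k' and 'K' alike; also handles factors
  letter_mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- rep(1, length(exp))              # '', '-', '?', '+' and unknown codes -> 1 (assumption)
  is_digit <- exp %in% as.character(0:8)   # digit codes mean a power of ten
  mult[is_digit] <- 10 ^ as.numeric(exp[is_digit])
  is_letter <- exp %in% names(letter_mult)
  mult[is_letter] <- unname(letter_mult[exp[is_letter]])
  mult
}

# Worked example: 25 with exponent "K" is 25 * 1000 dollars.
25 * exp_to_mult("K")
```

With a helper like this, the multiplier columns could be computed in one call each, for example exp_to_mult(sub$PROPDMGEXP), instead of through mapply.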
Also, we only care about events that had a cost in terms of human injury, human death, property damage, or crop damage. After filtering out the weather events without any such cost, we sum the remaining records by event type to get an overall cost per event type.
We then sum the totals of fatalities and injuries to arrive at the total human cost. The property and crop damage are summed to arrive at the total economic cost.
nonzero = subset(
  sub,
  FATALITIES > 0 | INJURIES > 0 | propdmg. > 0 | cropdmg. > 0
)
hc_by_event <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = nonzero, FUN = "sum")
ec_by_event <- aggregate(cbind(propdmg., cropdmg.) ~ EVTYPE, data = nonzero, FUN = "sum")
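The filter-then-aggregate step can be illustrated on a small made-up data frame. The toy data and its values are hypothetical; the sketch only shows what an aggregate() call with cbind() on the left-hand side returns, namely one row per event type with a summed column for each measure.

```r
# Hypothetical toy records, not taken from the storm database.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "TORNADO"),
  INJURIES   = c(2, 3, 0, 5),
  FATALITIES = c(1, 0, 0, 3),
  stringsAsFactors = FALSE
)

# Drop records with no human cost (the HAIL row here).
toy_nonzero <- subset(toy, FATALITIES > 0 | INJURIES > 0)

# Sum both measures within each event type.
toy_totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE,
                        data = toy_nonzero, FUN = sum)
toy_totals
```

The two FLOOD rows collapse into a single row, and HAIL disappears because it had no recorded cost.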
First let us examine the most costly weather events in terms of human fatalities and injuries.
Here we sort the data by total fatalities and injuries and keep only the top 10 event types.
top_hum_cost <- arrange(hc_by_event, desc(FATALITIES + INJURIES))[1:10,]
top_hum_cost
## EVTYPE INJURIES FATALITIES
## 1 TORNADO 22571 1593
## 2 EXCESSIVE HEAT 6525 1903
## 3 FLOOD 6778 450
## 4 LIGHTNING 5116 794
## 5 TSTM WIND 3631 241
## 6 HEAT 2095 930
## 7 FLASH FLOOD 1754 951
## 8 ICE STORM 1971 86
## 9 THUNDERSTORM WIND 1476 133
## 10 WINTER STORM 1298 195
Visualizing the results:
## convert wide to long format
top_hum_cost. <- melt(top_hum_cost, id.vars="EVTYPE")
xaxis <- reorder(top_hum_cost.$EVTYPE, -(top_hum_cost.$value))
gp <- ggplot(aes(x=xaxis, y=value, fill=variable), data=top_hum_cost.) +
  geom_bar(stat="identity") +
  labs(x="Event Type", y="Number of People Affected",
       title="Human Costs by Event Type (since 1994) \n(Total, Injuries and Fatalities)") +
  theme(axis.text.x=element_text(angle=40, hjust=1))
print(gp)
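The graph shows that tornadoes are by far the largest single source of human costs since 1994: about 24,000 combined injuries and fatalities, nearly three times the total for excessive heat, the next event type. Injuries make up most of those costs, with fatalities a much smaller proportion, although excessive heat caused more deaths than tornadoes (1,903 versus 1,593).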
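To put the tornado figure in context, each event type's share of the combined injuries and fatalities can be computed directly from the table printed above. This is a small sketch that re-enters those ten rows by hand; the data frame name hum and the TOTAL and SHARE_PCT columns are introduced here for illustration, and the shares are relative to the top ten only, not to all event types.

```r
# Totals copied from the top_hum_cost table printed above.
hum <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING", "TSTM WIND",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND", "WINTER STORM"),
  INJURIES   = c(22571, 6525, 6778, 5116, 3631, 2095, 1754, 1971, 1476, 1298),
  FATALITIES = c(1593, 1903, 450, 794, 241, 930, 951, 86, 133, 195)
)

# Combined human cost and its share of the top-ten total, in percent.
hum$TOTAL     <- hum$INJURIES + hum$FATALITIES
hum$SHARE_PCT <- round(100 * hum$TOTAL / sum(hum$TOTAL), 1)
hum[, c("EVTYPE", "TOTAL", "SHARE_PCT")]
```

Tornadoes come out at roughly 40% of the top-ten total, which makes them the largest single category without being a majority on their own.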
We now turn to the economic costs of weather events. Here we will define the economic cost as the sum of the property and crop damage for the given event type.
We sort the data by total financial cost and keep only the top 10 event types.
top_eco_cost <- arrange(ec_by_event, desc(propdmg. + cropdmg.))[1:10,]
colnames(top_eco_cost) <- c("EVTYPE", "PROPERTY_DAMAGE", "CROP_DAMAGE")
top_eco_cost
## EVTYPE PROPERTY_DAMAGE CROP_DAMAGE
## 1 FLOOD 144179608807 5506942450
## 2 HURRICANE/TYPHOON 69305840000 2607872800
## 3 STORM SURGE 43193536000 5000
## 4 TORNADO 25630588401 361824470
## 5 HAIL 15338044461 2982699123
## 6 FLASH FLOOD 16398255929 1402661500
## 7 DROUGHT 1046106000 13922066000
## 8 HURRICANE 11862819010 2741410000
## 9 ICE STORM 3832377860 5022113500
## 10 TROPICAL STORM 7703385550 677841000
Visualizing the results:
## convert wide to long format
top_eco_cost. <- melt(top_eco_cost, id.vars="EVTYPE")
xaxis <- reorder(top_eco_cost.$EVTYPE, -(top_eco_cost.$value))
gp <- ggplot(aes(x=xaxis, y=value/1e9, fill=variable), data=top_eco_cost.) +
  geom_bar(stat="identity") +
  labs(x="Event Type", y="Economic Damage ($billions)",
       title="Economic Costs by Event Type (since 1994) \n(Total, Property & Crop Damage)") +
  theme(axis.text.x=element_text(angle=40, hjust=1))
print(gp)
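The graph shows that floods are the largest single source of economic costs since 1994, at roughly $150 billion, about twice the total for hurricanes/typhoons, the next event type. Property damage makes up most of the cost for nearly every event type; drought and ice storms are the exceptions, where crop damage dominates.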
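The split between property and crop damage for each event type can be checked from the table printed above. The sketch below re-enters those ten rows by hand; the data frame name eco and the TOTAL_BN and CROP_PCT columns are introduced here for illustration only.

```r
# Dollar totals copied from the top_eco_cost table printed above.
eco <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "STORM SURGE", "TORNADO", "HAIL",
             "FLASH FLOOD", "DROUGHT", "HURRICANE", "ICE STORM", "TROPICAL STORM"),
  PROPERTY_DAMAGE = c(144179608807, 69305840000, 43193536000, 25630588401,
                      15338044461, 16398255929, 1046106000, 11862819010,
                      3832377860, 7703385550),
  CROP_DAMAGE = c(5506942450, 2607872800, 5000, 361824470, 2982699123,
                  1402661500, 13922066000, 2741410000, 5022113500, 677841000)
)

# Total cost in billions of dollars and the crop share of that total, in percent.
total         <- eco$PROPERTY_DAMAGE + eco$CROP_DAMAGE
eco$TOTAL_BN  <- round(total / 1e9, 1)
eco$CROP_PCT  <- round(100 * eco$CROP_DAMAGE / total, 1)
eco[, c("EVTYPE", "TOTAL_BN", "CROP_PCT")]
```

Only the rows with CROP_PCT above 50 are driven mainly by crop losses; for the rest, property damage is the larger component.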