This report provides an analysis of the most damaging and most injurious/fatal weather events in the U.S. from 1950 to 2011 based upon the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It contains information on many characteristics of major weather events throughout the United States; most important for this analysis are the estimates of injuries/fatalities and property/crop damage costs.
The results show that, in general, tornadoes are the weather event that causes the greatest damage and the most injuries/fatalities across the U.S. However, the record-setting hurricane activity of 2005 greatly skews the damage results when those events are included in the analysis. When 2005 and 2006 are included, floods and hurricanes appear to be the most damaging types of events, while tornadoes remain the most injurious. An interesting follow-up study would test the hypothesis that tornadoes are the leading cause of human harm because warnings ahead of the event come too late or are not effective enough.
The data for this analysis was provided by the requestor. All the data needed for this analysis is pulled from this database.
The following libraries are required for this analysis.
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr))
The entire database is initially read from the raw file named repdata-data-StormData.csv.bz2, a bzip2-compressed CSV export of the NOAA StormData database. The first row is read as header data. This file must be located in the same working directory as the processing script.
NOTE: it may take several minutes to read the database file. The results are cached for this report to avoid the delay for casual reading.
rdata <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=FALSE)
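The report does not say how the cached results are produced. As a minimal sketch of one possible approach outside of any report-level caching, the parsed data frame could be serialized once and reloaded on later runs. The helper name read_with_cache and the cache file name in the commented usage line are hypothetical.

```r
# Hypothetical helper: parse the CSV once, then reuse a serialized copy.
read_with_cache <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    # A cached copy exists, so skip the slow CSV parse.
    return(readRDS(cache_path))
  }
  data <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(data, cache_path)
  data
}

# Possible usage (the .rds file name is an assumption):
# rdata <- read_with_cache("repdata-data-StormData.csv.bz2", "StormData.rds")
```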
The following shows the characteristics of the raw data.
dim(rdata)
## [1] 902297 37
head(rdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
This report uses the following variables from the dataset: BGN_DATE, EVTYPE, PROPDMG, FATALITIES, INJURIES, PROPDMGEXP, CROPDMG, and CROPDMGEXP. These are separated out.
pdata <- rdata %>%
  select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG:CROPDMGEXP) %>%
  filter(PROPDMGEXP %in% c("B", "b", "M", "m", "K", "k", "") &
         CROPDMGEXP %in% c("B", "b", "M", "m", "K", "k", ""))
Note that some processing of the raw data has already begun in the code above. One thing discovered about the NOAA database is that there is little adherence to the published data guidelines for some variables, PROPDMGEXP and CROPDMGEXP in particular. The processing of these variables is described in more detail below, but suffice it to say that some of the raw records need to be filtered out because their exponent values cannot be interpreted. Fortunately, less than 0.04% of the data is discarded.
The event date information (BGN_DATE) is originally read as date/time character vectors. These are converted to true Date class. A new column is added which contains just the year portion of the date (as a numeric value).
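The share of discarded records can be checked by comparing the filter condition against the full row count. The sketch below uses a small made-up data frame rather than the NOAA data, so the percentage it produces is illustrative only; the names toy, keep, and pct_discarded are hypothetical.

```r
# Exponent codes the analysis treats as interpretable.
valid_exp <- c("B", "b", "M", "m", "K", "k", "")

# Made-up rows; "5", "?" and "0" stand in for uninterpretable codes.
toy <- data.frame(
  PROPDMGEXP = c("K", "M", "", "5", "k", "B", "?", "K"),
  CROPDMGEXP = c("", "K", "", "", "m", "", "", "0"),
  stringsAsFactors = FALSE
)

# A row is kept only when both exponent codes are interpretable.
keep <- toy$PROPDMGEXP %in% valid_exp & toy$CROPDMGEXP %in% valid_exp

# Percentage of rows that the filter would drop.
pct_discarded <- 100 * mean(!keep)
print(pct_discarded)
```

Applied to the real data, the same expression with rdata in place of toy would give the discard rate quoted in the text.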
pdata$BGN_DATE <- as.Date(pdata$BGN_DATE, format="%m/%d/%Y")
pdata$BGN_DATE_YR <- year(pdata$BGN_DATE)
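As a small illustration of the conversion, the sketch below applies the same format string to two values copied from the head() output. The format covers only the date portion; as.Date() ignores the trailing time text. To keep the example free of package dependencies, the year is extracted with base R's format() instead of lubridate's year(); the variable names are hypothetical.

```r
# Two BGN_DATE values as they appear in the raw file.
raw_dates <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# Parse month/day/year; the trailing " 0:00:00" is ignored.
dates <- as.Date(raw_dates, format = "%m/%d/%Y")

# Base R equivalent of extracting the year as a number.
years <- as.numeric(format(dates, "%Y"))

print(dates)
print(years)
```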
The property damage and crop damage figures are stored in the database in an unusual form. Property damage is supposed to be stored in numeric form rounded to three significant digits in the PROPDMG variable. This value is then multiplied by an “exponent” provided by PROPDMGEXP to get the true value. The expected values for PROPDMGEXP are “B” for billions, “M” for millions, “K” for thousands, and nothing (“”) for anything less than 1,000. Lower case letters are also acceptable. For example, if the PROPDMG value is 155 and the PROPDMGEXP value is “K” then the true property damage value is $155 * 1,000 = $155,000. The crop damage variables (CROPDMG and CROPDMGEXP) work the same way.
The code below computes true damage values. To do so, it creates two additional variables, PROPX and CROPX, as numeric multipliers based on the character values in PROPDMGEXP and CROPDMGEXP, respectively. The PROPDMG and CROPDMG values are then multiplied by PROPX and CROPX, respectively, and the results placed back into PROPDMG and CROPDMG.
This code may seem awkward, but of the several alternatives tested, this approach was by far the fastest.
pdata$PROPX <- 1
pdata[pdata$PROPDMGEXP=="B","PROPX"] <- 1.0e9
pdata[pdata$PROPDMGEXP=="b","PROPX"] <- 1.0e9
pdata[pdata$PROPDMGEXP=="M","PROPX"] <- 1.0e6
pdata[pdata$PROPDMGEXP=="m","PROPX"] <- 1.0e6
pdata[pdata$PROPDMGEXP=="K","PROPX"] <- 1000
pdata[pdata$PROPDMGEXP=="k","PROPX"] <- 1000
pdata$PROPDMG <- pdata$PROPDMG * pdata$PROPX
pdata$CROPX <- 1
pdata[pdata$CROPDMGEXP=="B","CROPX"] <- 1.0e9
pdata[pdata$CROPDMGEXP=="b","CROPX"] <- 1.0e9
pdata[pdata$CROPDMGEXP=="M","CROPX"] <- 1.0e6
pdata[pdata$CROPDMGEXP=="m","CROPX"] <- 1.0e6
pdata[pdata$CROPDMGEXP=="K","CROPX"] <- 1000
pdata[pdata$CROPDMGEXP=="k","CROPX"] <- 1000
pdata$CROPDMG <- pdata$CROPDMG * pdata$CROPX
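The same mapping can also be expressed as a named lookup vector. This is a minimal sketch, not the report's code: the helper name exp_multiplier is hypothetical, and its speed has not been compared with the assignments above. Codes outside the expected set are returned as NA rather than silently treated as 1.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
exp_multiplier <- function(exp) {
  lookup <- c(B = 1e9, M = 1e6, K = 1e3)
  # toupper() folds "b", "m", "k" onto the upper-case names.
  mult <- unname(lookup[toupper(exp)])
  # An empty code means the amount is already in dollars.
  mult[!is.na(exp) & exp == ""] <- 1
  mult
}

# Worked example from the text: 155 with exponent "K" is $155,000.
print(155 * exp_multiplier("K"))
print(exp_multiplier(c("K", "m", "", "B", "?")))
```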
The data is now in a form ready for analysis.
This report first addresses the types of events that have caused the most damage to property and to crops. The reason for starting here will become apparent in the results below.
To begin the analysis, the processed storm data is grouped by year of event occurrence (BGN_DATE_YR). The property damage and crop damage values are totaled for each year and the results plotted.
# Create the plot data for all years.
gdata <- pdata %>%
  group_by(BGN_DATE_YR) %>%
  summarise(PROPDMGTOT=sum(PROPDMG) / 1.0e9, CROPDMGTOT=sum(CROPDMG) / 1.0e9) %>%
  gather("TYPE", "COST", PROPDMGTOT:CROPDMGTOT)
gdata$DCAT <- "All Years" # For plot facet grouping & labeling
# Create the same plot data, but exclude years 2005 & 2006.
g2 <- gdata[!(gdata$BGN_DATE_YR %in% c(2005, 2006)),]
g2$DCAT <- "Excluding 2005-2006" # For plot facet grouping & labeling
gdata <- bind_rows(gdata, g2) # Append both data frames for plotting
a <- ggplot(data=gdata, aes(x=BGN_DATE_YR, y=COST, fill=TYPE, group=1)) +
  facet_grid(DCAT ~ ., scales = "free") +
  geom_bar(stat="identity") +
  scale_fill_manual(values=c("red", "blue"), name="Damage Type",
                    labels = c("Crop", "Property")) +
  theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
        legend.position="bottom") +
  xlab("Year") +
  ylab("Total Damage (x $1B)") +
  ggtitle("Annual Property & Crop Damage\n1950-2011")
print(a)
Even a casual glance at the “All Years” chart above reveals a major spike in the results for the years 2005 and 2006. An initial reaction may be to think that this must be bad data. However, it is not. Recall that 2005 was the worst season on record for tropical storm activity.
“The 2005 season was the most active season on record, shattering records on repeated occasions. A record 28 tropical and subtropical storms formed, of which a record fifteen became hurricanes. Of these, seven strengthened into major hurricanes, a record-tying five became Category 4 hurricanes, a record four of which reached Category 5 strength, the highest categorization for North Atlantic tropical cyclones. Among these Category 5 storms was Hurricane Wilma, the most intense hurricane ever recorded in the Atlantic.”
“The most notable storms of the season were the five Category 4 and Category 5 hurricanes: Dennis, Emily, Katrina, Rita, and Wilma, along with the Category 1 Hurricane Stan. These storms made a combined twelve landfalls as major hurricanes (Category 3 strength or higher) throughout Cuba, Mexico, and the Gulf Coast of the United States, causing over $100 billion (2005 USD) in damages and at least 2,048 deaths.”
Source: Wikipedia
The bottom, “Excluding 2005-2006” chart shows what the damage totals look like without years 2005 and 2006. This difference between 2005/2006 and the other years will carry forward in the next analysis of which types of events are most damaging.
A final note regarding this plot is the apparent difference in reported damage starting in 1993. This is shown most dramatically in the lower chart. It is unknown what happened in 1993 to cause this change. Perhaps better reporting mechanisms were implemented. Or, maybe the number of reporting sites was increased. For the remainder of the analysis, the pre-1993 data is included in the results.
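One way to start probing the 1993 change would be to count, for each year, how many records and how many distinct event types were reported. The sketch below shows the idea on a made-up data frame; it is not a result from the NOAA data, and the names toy, records_per_year, and types_per_year are hypothetical.

```r
# Made-up records; only the column names match the processed data.
toy <- data.frame(
  BGN_DATE_YR = c(1991, 1991, 1992, 1993, 1993, 1993, 1993),
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  stringsAsFactors = FALSE
)

# Number of records reported in each year.
records_per_year <- table(toy$BGN_DATE_YR)

# Number of distinct event types reported in each year.
types_per_year <- tapply(toy$EVTYPE, toy$BGN_DATE_YR,
                         function(x) length(unique(x)))

print(records_per_year)
print(types_per_year)
```

A jump in either count at 1993 in the real data would point toward a change in reporting rather than a change in the weather.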
This report now considers the types of events that produce the most damage. The data is grouped by event type and the property damage and crop damage values combined into a single total damage value for each event type. The totals are then sorted in descending order and the top four separated out for plotting.
In consideration of the 2005/2006 data, the information is, again, presented as two charts: one covering all years and the other with 2005/2006 excluded.
# Create the plot data for all years.
gdata <- pdata %>%
  group_by(EVTYPE) %>%
  summarise(DMGTOT=(sum(PROPDMG) + sum(CROPDMG)) / 1.0e9) %>%
  arrange(desc(DMGTOT)) %>%
  slice(1:4)
gdata$DCAT <- "All Years" # For plot facet grouping & labeling
# Create the same plot data, but exclude years 2005 & 2006.
g2 <- pdata[!(pdata$BGN_DATE_YR %in% c(2005, 2006)),] %>%
  group_by(EVTYPE) %>%
  summarise(DMGTOT=(sum(PROPDMG) + sum(CROPDMG)) / 1.0e9) %>%
  arrange(desc(DMGTOT)) %>%
  slice(1:4)
g2$DCAT <- "Excluding 2005-2006" # For plot facet grouping & labeling
gdata <- bind_rows(gdata, g2) # Append both data frames for plotting
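# Keep first-appearance order for X axis sorting; unique() avoids duplicated
# factor levels when an event type appears in both top-four lists.
gdata$EVTYPE <- factor(gdata$EVTYPE, levels=unique(gdata$EVTYPE))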
a <- ggplot(data=gdata, aes(x=EVTYPE, y=DMGTOT, fill=EVTYPE, group=1)) +
  facet_grid(DCAT ~ .) +
  geom_bar(stat="identity") +
  theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
        legend.position="bottom", legend.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  xlab("Event Type") +
  ylab("Total Damage (x $1B)") +
  ggtitle("Top Damage Producing Event Types\n1950-2011")
print(a)
It is clear that floods, hurricanes/typhoons, and storm surges easily dominate when 2005 and 2006 are included. The results are much different when those years are excluded: tornadoes then become the dominant damage-causing event. As hail frequently occurs with tornadoes, some portion of the hail damage is likely related to tornado events. The total damage produced across all four top events is also significantly less when 2005/2006 are not included.
The analysis now turns to which events produce the highest number of injuries and fatalities. The data is grouped by event type and the number of injuries and fatalities totaled for each event type. A total of both injuries and fatalities is also computed. This total is then sorted in descending order and the top four separated out for plotting.
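One detail worth checking when the two top-four tables are stacked: the same event type can appear in both, and recent versions of R reject duplicated factor levels. The sketch below uses a short made-up vector of event types to show the issue and a way around it with unique(); the variable names are hypothetical.

```r
# Made-up stacked event types; "FLOOD" appears in both halves.
evtypes <- c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "FLOOD")

# Passing the raw vector as levels fails (or warns in older R versions)
# because "FLOOD" is listed twice.
dup_result <- tryCatch(
  factor(evtypes, levels = evtypes),
  error = function(e) "error",
  warning = function(w) "warning"
)

# unique() keeps first-appearance order and removes the duplicate.
ev_factor <- factor(evtypes, levels = unique(evtypes))
print(levels(ev_factor))
```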
In consideration of the 2005/2006 data, the information is, again, presented as two charts: one covering all years and the other with 2005/2006 excluded.
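The grouping and ranking described above can be sketched in base R on a handful of made-up records. This is an illustration under toy data, not the report's pipeline; the names toy, totals, and top are hypothetical, and the top two are taken instead of the top four because the toy data has only four event types.

```r
# Made-up records; only the column names match the processed data.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  INJURIES = c(15, 2, 6, 3, 0, 1),
  FATALITIES = c(1, 0, 2, 4, 1, 0),
  stringsAsFactors = FALSE
)

# Sum injuries and fatalities within each event type.
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined total used for ranking.
totals$IFTOT <- totals$INJURIES + totals$FATALITIES

# Sort in descending order of the combined total and keep the leaders.
top <- head(totals[order(-totals$IFTOT), ], 2)
print(top)
```

In the dplyr version, gather() produces two rows per event type (one for injuries, one for fatalities), which is why slice(1:8) corresponds to the top four event types.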
# Create the plot data for all years.
gdata <- pdata %>%
  group_by(EVTYPE) %>%
  summarise(INJTOT=sum(INJURIES) / 1000, FATTOT=sum(FATALITIES) / 1000,
            IFTOT=INJTOT + FATTOT) %>%
  gather("TYPE", "TOTAL", INJTOT:FATTOT) %>%
  arrange(desc(IFTOT)) %>%
  slice(1:8)
gdata$DCAT <- "All Years" # For plot facet grouping & labeling
# Create the same plot data, but exclude years 2005 & 2006.
g2 <- pdata[!(pdata$BGN_DATE_YR %in% c(2005, 2006)),] %>%
  group_by(EVTYPE) %>%
  summarise(INJTOT=sum(INJURIES) / 1000, FATTOT=sum(FATALITIES) / 1000,
            IFTOT=INJTOT + FATTOT) %>%
  gather("TYPE", "TOTAL", INJTOT:FATTOT) %>%
  arrange(desc(IFTOT)) %>%
  slice(1:8)
g2$DCAT <- "Excluding 2005-2006" # For plot facet grouping & labeling
gdata <- bind_rows(gdata, g2) # Append both data frames for plotting
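# Keep first-appearance order for X axis sorting; unique() is needed because
# each event type occurs more than once (injury and fatality rows, both facets).
gdata$EVTYPE <- factor(gdata$EVTYPE, levels=unique(gdata$EVTYPE))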
# Create the plot.
a <- ggplot(data=gdata, aes(x=EVTYPE, y=TOTAL, fill=TYPE, group=1)) +
  facet_grid(DCAT ~ .) +
  geom_bar(stat="identity") +
  scale_fill_manual(values=c("red", "gold2"), name="",
                    labels = c("Fatalities", "Injuries")) +
  theme(plot.title=element_text(face="bold"), axis.title=element_text(face="bold"),
        legend.position="bottom") +
  xlab("Event Type") +
  ylab("Total Injuries/Fatalities (x 1,000)") +
  ggtitle("Most Injurious/Fatal Event Types\n1950-2011")
print(a)
It is clear that tornadoes are the cause of the vast majority of human injuries and fatalities. It appears that, whereas floods and hurricanes produce the most economic damage, they are not the direct causes of the greatest storm-related personal harm. This is further evident in that there is very little noticeable difference between the upper and lower charts (i.e., with and without years 2005 and 2006).
It would be interesting to study why hurricanes and floods are not more closely associated with human harm. It is the author’s speculation that perhaps the reason has to do with pre-event warnings. Hurricanes form in the ocean and are typically tracked for a number of days before they make landfall. Weather science is able to predict storm strength, and where/when landfall will occur as well as the potential path of the storm over land. This gives people a long lead time to prepare for the event.
Tornadoes, on the other hand, are not as predictable. Weather forecasts provide warnings of storms with the potential to spawn tornadoes, but cannot yet predict the actual time, location, and magnitude of a real tornado. Thus, people tend to be caught by surprise and unprepared. Would better warning systems and/or better preparation result in fewer injuries/fatalities?